Somewhat Awesome Video (SAV Image/Video Labelling API)

Introduction

One of the biggest challenges in computer vision is image classification: identifying the type of object in an image. This is an important vision task that can, for example, help visually impaired people perceive their surroundings through technology. To a computer, an image is just a grid of red, green and blue pixels, where each color channel is typically represented by a number between 0 and 255. Relying on raw pixel values or classical computer vision techniques does not solve the problem accurately. That is why this system is proposed: it classifies both images and videos (creating labels that best describe the content) and, for videos, reports the exact time and frame where each label appeared.

Open source contribution: 

Somewhat Awesome Video (SAV) is an open source API offered as Software as a Service (SaaS). It is used for visual content understanding. The API has the following features:

  • It labels images and videos of various lengths.
  • It lets the user search within a video for a specific label; the output is the frames where the label occurred and the times of those frames.
  • It can be hosted on ordinary PCs and lets the user schedule multiple videos for labelling.
The code is available for anyone to edit, use or deploy in their own applications on my GitHub account: https://github.com/emanhamed/Somewhat-Awesome-Video-SAV-API

Here is an explanation of the system architecture:

System architecture: 

The system is divided into two main modules, as shown in the figure below: the API layer and the worker layer. The API handles communication between the system and the user, while the worker is responsible for all the processing, including validating the data, extracting the key frames and running the classification phase.


Each module's function:

Validation Phase: The worker first validates the incoming data, for example making sure it is a valid video or image file. In this step the worker uses the Linux “file” command to check that the data is valid.
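The check with the “file” command can be sketched roughly as follows; the function name and return convention here are illustrative, not the actual worker code:

```python
import subprocess

def media_kind(path):
    """Ask the Linux `file` command for the MIME type and map it to
    "video", "image", or None (rejected)."""
    # -b suppresses the filename; --mime-type prints e.g. "video/mp4"
    out = subprocess.run(["file", "--mime-type", "-b", path],
                         capture_output=True, text=True).stdout.strip()
    if out.startswith("video/"):
        return "video"
    if out.startswith("image/"):
        return "image"
    return None  # not a media file: the worker rejects the job
```

Checking the MIME type reported by “file” is more robust than trusting the file extension, since the command inspects the file's magic bytes rather than its name.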

Key Frame Extraction Phase: This phase is bypassed if the input is an image; it only accepts video input, from which it extracts the most important frames. The FFMPEG library is used to detect the scenes and the key frames within each scene, along with the time stamp and frame number that indicate where each key frame occurs. This step is crucial because it avoids processing the whole video frame by frame.
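As a sketch, scene-based key frame extraction with FFmpeg can be driven by the select filter's scene-change score, with showinfo logging the frame number and time stamp of every kept frame. The threshold value, output pattern and helper names below are illustrative, not taken from the SAV code:

```python
import re

def scene_detect_cmd(video_path, out_pattern="keyframe_%04d.jpg", threshold=0.4):
    """Build an ffmpeg command that writes one JPEG per detected scene change.
    select='gt(scene,T)' keeps a frame only when its scene-change score
    exceeds T; showinfo logs each kept frame's number and pts_time to stderr."""
    return ["ffmpeg", "-i", video_path,
            "-vf", f"select='gt(scene,{threshold})',showinfo",
            "-vsync", "vfr", out_pattern]

# showinfo lines look like: "... n:   3 pts:  46080 pts_time:5.12 ..."
_SHOWINFO = re.compile(r"n:\s*(\d+).*?pts_time:\s*([\d.]+)")

def parse_showinfo(stderr_text):
    """Extract (frame_number, seconds) pairs from ffmpeg's showinfo output."""
    return [(int(n), float(t)) for n, t in _SHOWINFO.findall(stderr_text)]
```

The parsed (frame number, time stamp) pairs are exactly what lets the system later report when each label appeared in the video.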

Classification Phase: The classification module is implemented using deep convolutional neural networks. The AlexNet model was used, trained on the ImageNet 2012 dataset, which consists of 1.2 million images across 1000 different classes. The expected top-1 accuracy for this model is 56.61% and the top-5 accuracy is 80.1%. The Caffe library was used for training and testing, and the code was then integrated with the API module.
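At inference time the network outputs one raw score per ImageNet class, and the top-5 labels are read off the softmax of those scores. A minimal, library-free sketch of that last step (the real system does this through Caffe's Python bindings):

```python
import math

def top_k_labels(scores, labels, k=5):
    """Convert raw class scores to softmax probabilities and return the
    k most probable (label, probability) pairs, best first."""
    m = max(scores)                    # subtract the max for numeric stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    ranked = sorted(zip(labels, (e / total for e in exps)),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```

The top-5 accuracy quoted above means the correct class lands somewhere in this k=5 list about 80% of the time.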


The API: The API is the “middle man” between the user and the worker, and it is the only interface through which the user can communicate with the system. A regular process starts with the user submitting a request containing a video/photo file along with authentication details; the API checks the authentication data and, if it is valid, proceeds to the next step. The API sends the job to the worker and receives a job id and an estimated time remaining, which it passes on to the user. After the processing is finished, the user can request the resulting data from the API, which fetches it from the database and passes it back. In the worker, Celery was used to schedule the jobs submitted by the API; it integrates easily with any backend system and simplifies development. Python was used with the Caffe library for the classifier.
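The request lifecycle described above can be sketched as follows. The class and method names are illustrative stand-ins, and in the real system worker.enqueue would be a Celery task call rather than an in-memory stub:

```python
import uuid

class SavApi:
    """Illustrative sketch of the API layer's job flow."""

    def __init__(self, worker, db):
        self.worker = worker   # schedules jobs (Celery in the real system)
        self.db = db           # holds results written by the worker

    def submit(self, token, media_path):
        """Authenticate, hand the job to the worker, return the id and ETA."""
        if not self._authenticate(token):
            raise PermissionError("invalid credentials")
        job_id = str(uuid.uuid4())
        eta_seconds = self.worker.enqueue(job_id, media_path)
        return {"job_id": job_id, "eta_seconds": eta_seconds}

    def result(self, job_id):
        """Fetch the finished labels from the database for the user."""
        return self.db.get(job_id)

    def _authenticate(self, token):
        return bool(token)  # placeholder for the real credential check
```

Keeping scheduling behind a small interface like this is what lets Celery (or any other queue) be swapped in without touching the user-facing API.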

Here are some screenshots demonstrating the results of the key frame extraction and classification modules on this YouTube video:

https://www.youtube.com/watch?v=AJ0L7ZYsSNE





[Screenshots: frames after key frame extraction (only some frames are shown)]






The result of the classification module for each individual key frame is:

And the final result, choosing the labels that best describe the objects in the video along with their time stamps, is:
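That final aggregation step, turning per-frame predictions into video-level labels with time stamps, can be sketched like this (the probability threshold and data layout are illustrative assumptions, not the actual SAV logic):

```python
from collections import defaultdict

def best_video_labels(frame_predictions, min_prob=0.3):
    """frame_predictions maps a key frame's timestamp (seconds) to its
    (label, probability) list. Return, for each label that ever clears
    min_prob, the timestamps where it appeared and its best probability."""
    summary = defaultdict(lambda: {"times": [], "best_prob": 0.0})
    for t in sorted(frame_predictions):
        for label, prob in frame_predictions[t]:
            if prob >= min_prob:
                entry = summary[label]
                entry["times"].append(t)
                entry["best_prob"] = max(entry["best_prob"], prob)
    return dict(summary)
```

The recorded timestamps are also what makes the label-based video search feature possible: looking up a label returns the frames and times where it occurred.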

Importance of this project to me!

I'm really fascinated by how the machine learning and computer vision fields can make computers smarter, which will lead to countless innovative breakthroughs. Working on this project gave me hands-on experience in both fields and got me thinking about extending it with other features, like descriptions in full sentences instead of just words. Moreover, I'm seeking a machine learning and computer vision research internship at the National University of Singapore as part of the A*Star program to enhance my skills and see what type of problems scientists are eager to solve these days.


