Thursday, May 30, 2013

Basic concepts in Computer Vision and Machine Learning


Attribute: a semantic way to describe objects, usually human-defined, such as color, type, etc.


Feature: a piece of information which is relevant for solving the computational task related to a certain application. More specifically, features can refer to:
1. the result of a general neighborhood operation (feature extractor or feature detector) applied to the image
2. specific structures in the image itself, ranging from simple structures such as points or edges to more complex structures such as objects.
Other examples of features are related to motion in image sequences, to shapes defined in terms of curves or boundaries between different image regions, or to properties of such a region.
The feature concept is very general and the choice of features in a particular computer vision system may be highly dependent on the specific problem at hand.


Bag-of-words (BoW model): can be applied to image classification, by treating image features as words. In computer vision, a bag of visual words is a sparse vector of occurrence counts of a vocabulary of local image features.
The first step is to extract features, the second one is to represent features.
The final step for the BoW model is to convert vector represented patches to "codewords" (analogy to words in text documents), which also produces a "codebook" (analogy to a word dictionary). A codeword can be considered as a representative of several similar patches.
One of the notorious disadvantages of BoW is that it ignores the spatial relationships among the patches, which are very important in image representation.
Furthermore, the BoW model has not been extensively tested for viewpoint invariance and scale invariance, so its performance there is unclear. The BoW model is also not well understood for object segmentation and localization.
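
A minimal sketch of this pipeline in Python, assuming local descriptors (e.g. 128-dim SIFT vectors) have already been extracted per image; scikit-learn's KMeans builds the codebook, and the random descriptors below are placeholders, not real image data:

```python
# Minimal bag-of-visual-words sketch (descriptors are placeholders for real
# SIFT-like local features extracted from each image).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# One array of local descriptors per image (n_descriptors x 128).
descriptors_per_image = [rng.normal(size=(200, 128)) for _ in range(10)]

# 1. Build the codebook by clustering all descriptors into k "visual words".
k = 50
codebook = KMeans(n_clusters=k, n_init=10, random_state=0)
codebook.fit(np.vstack(descriptors_per_image))

# 2. Represent each image as a histogram of codeword occurrences.
def bow_histogram(descriptors):
    words = codebook.predict(descriptors)          # nearest codeword per patch
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / hist.sum()                       # normalize by patch count

X = np.array([bow_histogram(d) for d in descriptors_per_image])
print(X.shape)  # (10, 50) -- one codeword histogram per image
```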


Scale-Invariant Feature Transform (SIFT): a feature detector and descriptor algorithm notable for making feature descriptions robust to changes in scale and rotation.
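
A hedged OpenCV example (assumes OpenCV >= 4.4, where SIFT lives in the main module, and a local image file `car.jpg`; both are my assumptions, not part of the original notes):

```python
# Extract SIFT keypoints and descriptors from a grayscale image.
import cv2

img = cv2.imread("car.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
# Each keypoint has a location, scale and orientation; each descriptor is a
# 128-dimensional vector that is approximately scale- and rotation-invariant.
print(len(keypoints), descriptors.shape)  # e.g. N keypoints, (N, 128)
```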


Histogram of Oriented Gradients (HOG): a feature descriptor that counts occurrences of gradient orientations in localized cells of an image.


k-means clustering: a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. This results in a partitioning of the data space into Voronoi cells.
k-nearest neighbor (k-NN): a non-parametric method for classifying objects based on closest training examples in the feature space.
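
A quick sketch contrasting the two using scikit-learn on toy 2-D data (the library choice and data are my own, just for illustration):

```python
# k-means is unsupervised clustering; k-NN is supervised classification.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# k-means: no labels, each point goes to the cluster with the nearest mean.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# k-NN: labels required, a query point takes the majority label of its
# k closest training examples in feature space.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(clusters[:5], knn.predict([[4.8, 5.2]]))
```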


regression: a statistical technique for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed.
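
A minimal example of the simplest case, ordinary least-squares linear regression on synthetic data (the data is made up for illustration):

```python
# Fit y = a*x + b by ordinary least squares.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 100)
y = 3.0 * x + 1.5 + rng.normal(0, 1, size=x.shape)   # dependent variable

A = np.column_stack([x, np.ones_like(x)])            # design matrix [x, 1]
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
a, b = coeffs
print(a, b)  # slope and intercept close to 3.0 and 1.5
```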


AdaBoost: combines several weak classifiers to construct a strong classifier.


Support Vector Machine (SVM): a classifier that separates classes with a maximum-margin hyperplane (possibly in a kernel-induced feature space).
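
A small scikit-learn sketch of both classifiers on toy data (my own illustration, not tied to any of the papers below):

```python
# AdaBoost combines many weak decision stumps into a strong classifier;
# an SVM finds a maximum-margin separating boundary.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)  # stumps by default
svm = SVC(kernel="rbf", C=1.0).fit(X, y)

print(ada.score(X, y), svm.score(X, y))
```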


Reduce dimensionality
Principal Component Analysis (PCA): align data along the directions of greatest variance (retain directions of large variance, but large variance is not always best for classification)
Linear Discriminant Analysis (LDA): project onto a subspace of best discrimination by maximizing the separation between classes relative to separation within classes  (take into account the actual classes, ratio about the distance between classes and the one within classes)
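
A short sketch of both projections using scikit-learn and the digits dataset (my choice of library and data, just to make the contrast concrete):

```python
# PCA keeps directions of largest variance (ignores labels);
# LDA keeps directions that best separate the classes (uses labels).
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_digits(return_X_y=True)          # 64-dimensional digit images

X_pca = PCA(n_components=2).fit_transform(X)                            # unsupervised
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # supervised

print(X_pca.shape, X_lda.shape)  # both (n_samples, 2)
```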

Deformable Part Model (DPM): represents an object category with a coarse root filter plus several higher-resolution part filters that are allowed to shift relative to the root.

Wednesday, May 29, 2013

Methods about vehicle detection and recognition

A summary of questions, methods, and papers on Stack Overflow: link

Several papers:

1. Car-Rec: A Real Time Car Recognition System

This paper describes a system for checking whether a query car is in a database (e.g., the employee cars in a certain parking lot). There are four steps in the framework:
1. feature extraction
2. word quantization
3. image database search
4. structural matching.

If I understood it correctly, the method in this paper cannot recognize vehicle types such as sedan, truck, van, SUV, and so on.

2. Robust Classification and Tracking of Vehicles in Traffic Video Streams


Actually, the method in this paper depends too heavily on bounding-box information ("long" bounding boxes for Semi, "short" ones for Sedan and TSV), so the system sometimes recognizes a TSV as a Sedan. Moreover, it requires background subtraction, a completely static background, and moving objects; it would not work on static objects such as cars in parking lots.


This paper integrates vehicle tracking and classification (three types: Sedan, Semi, and TSV = Truck+SUV+Van) on low-resolution traffic video; the technique is also general enough to be applied to surveillance scenes other than traffic.

It mentions that model-based trackers are robust to illumination and occlusion but require models for all vehicles, limiting their scalability.

The text accompanying Fig. 4 shows the track ID, a class identifier number (c0 for TSV, c1 for Sedan, c2 for Semi), and an estimate of the speed and direction of travel.

The system has one classifier trained for the entire scene that is invariant to the camera pose selected by a remote operator.

The vehicle classifier was built by comparing different classification schemes using either image-based (IB) or image-measurement-based (IM) features. PCA or LDA was applied to reduce the dimensionality of the data, remove redundant information, and project the data into a space better suited for classification. Classification was performed with a weighted k-nearest neighbor classifier.

Image-Based (IB) features: the image of the tracked object is used as a feature vector. For a proper comparison, each object was resized to 64x32 pixels, generating a feature vector with 2048 components.
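
A tiny sketch of building that 64x32 = 2048-component vector with OpenCV; the resize call and the `blob.png` file are my assumptions, not the paper's code:

```python
# Resize the tracked object's grayscale image to 64x32 and flatten it.
import cv2

blob = cv2.imread("blob.png", cv2.IMREAD_GRAYSCALE)     # assumed input image
ib_feature = cv2.resize(blob, (32, 64)).flatten()       # dsize is (width, height): 64 rows x 32 cols
print(ib_feature.shape)                                 # (2048,)
```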

Image Measurements Based (IM) features: cheaper computationally and storage-wise to maintain a database of features rather than images. The aim is to obtain as many simple measurements as possible and allow a classifier to decide which are best for classification.
The feature vector consists of the following measurements (a short sketch of computing them follows the list):
~ area
~ bounding box [width, height]
~ convex area
~ ellipse [eccentricity, major axis, minor axis]
~ extent - ratio of object pixels to bounding-box pixels
~ solidity - ratio of region pixels to convex-hull pixels
~ perimeter
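
A hedged sketch of computing these measurements with scikit-image's regionprops on a binary blob mask (not the paper's implementation; the synthetic mask is just for illustration):

```python
# Compute IM-style region measurements from a binary mask.
import numpy as np
from skimage.measure import label, regionprops

mask = np.zeros((100, 100), dtype=np.uint8)
mask[30:70, 20:80] = 1                      # toy "vehicle" blob

region = regionprops(label(mask))[0]
im_feature = np.array([
    region.area,
    region.bbox[3] - region.bbox[1],        # bounding-box width
    region.bbox[2] - region.bbox[0],        # bounding-box height
    region.convex_area,
    region.eccentricity,
    region.major_axis_length,
    region.minor_axis_length,
    region.extent,                          # area / bounding-box area
    region.solidity,                        # area / convex-hull area
    region.perimeter,
])
print(im_feature)
```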

weighted k-Nearest Neighbor (wkNN): each sample receives a weight for every class, whereas plain NN only gives a binary indication of class membership. The wkNN weight for each class indicates the strength of the match, and the label assigned corresponds to the class with the highest weight.

The L2 norm was used as the distance metric to determine the similarity between vectors.
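
A small sketch of a weighted k-NN vote with L2 distance, using 1/distance weights (the weighting scheme and data are my assumptions, not necessarily the paper's exact formulation):

```python
# Soft class membership from a weighted k-NN vote.
import numpy as np

def wknn_class_weights(x, X_train, y_train, k=5, n_classes=3):
    d = np.linalg.norm(X_train - x, axis=1)          # L2 distances to all training samples
    nn = np.argsort(d)[:k]                           # k nearest neighbors
    weights = np.zeros(n_classes)
    for i in nn:
        weights[y_train[i]] += 1.0 / (d[i] + 1e-8)   # closer neighbors count more
    return weights / weights.sum()                   # normalized soft class membership

rng = np.random.default_rng(3)
X_train = rng.normal(size=(60, 10))
y_train = rng.integers(0, 3, size=60)
w = wknn_class_weights(rng.normal(size=10), X_train, y_train)
print(w, w.argmax())   # class weights and the predicted label
```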

The LDA-IM classifier was chosen for integration into the tracking software because it is simple, has low computational complexity, and generalizes well due to its independence from scene objects.

An adaptive background subtraction scheme was used to detect potential vehicles in the Object Detection module. Taking the difference between the current video frame and the estimated background produces regions of moving objects. These regions are processed to produce vehicle blob detections.
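
A hedged OpenCV sketch of this detection step; MOG2 is one standard adaptive background subtractor, not necessarily the paper's scheme, and `traffic.avi` is an assumed input file:

```python
# Detect moving-vehicle blobs by adaptive background subtraction.
import cv2

cap = cv2.VideoCapture("traffic.avi")
bg = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = bg.apply(frame)                      # foreground (moving) pixels
    fg_mask = cv2.morphologyEx(fg_mask, cv2.MORPH_OPEN,
                               cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5)))
    contours, _ = cv2.findContours(fg_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    blobs = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 200]
    # `blobs` now holds candidate vehicle detections (x, y, w, h) for this frame.
cap.release()
```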

Tracking is accomplished by using a Kalman filter on the center of mass of the detected object region. The Kalman module outputs a state vector, [x, v]^T, containing the position and velocity of the region. The Kalman filter is a state estimation tool that predicts the position of a vehicle in the next frame.
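
A minimal constant-velocity Kalman filter for a blob's center of mass; the state here is [x, y, vx, vy] and the noise values are illustrative, not the paper's tuning:

```python
# Predict-then-correct Kalman filtering of a detected region's center of mass.
import numpy as np

dt = 1.0
F = np.array([[1, 0, dt, 0],      # state transition: position += velocity*dt
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],       # we only measure the (x, y) center of mass
              [0, 1, 0, 0]], dtype=float)
Q = 0.01 * np.eye(4)              # process noise
R = 1.0 * np.eye(2)               # measurement noise

x = np.zeros(4)                   # state estimate [x, y, vx, vy]
P = np.eye(4)

def kalman_step(z):
    """Predict the next state, then correct it with the measured center z."""
    global x, P
    x = F @ x                                   # predict
    P = F @ P @ F.T + Q
    S = H @ P @ H.T + R                         # update
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x                                    # [x, y, vx, vy]

for z in [np.array([10.0, 5.0]), np.array([12.0, 5.5]), np.array([14.1, 6.0])]:
    print(kalman_step(z))
```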

The track's vehicle label is determined by building a histogram of class weights over each frame in a track T and selecting the label with the highest membership.

By binning the soft class memberships into a track histogram, the Track Builder is able to recover from misclassified examples, assigning the final label as the most likely class along the entire track.
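
A toy sketch of that track-level voting (the per-frame weights below are made up for illustration):

```python
# Accumulate per-frame soft class weights into a track histogram and vote.
import numpy as np

frame_class_weights = np.array([
    [0.5, 0.4, 0.1],   # frame 1: wkNN weights for (TSV, Sedan, Semi)
    [0.3, 0.6, 0.1],   # frame 2: a misclassified frame leaning toward Sedan
    [0.6, 0.3, 0.1],   # frame 3
    [0.7, 0.2, 0.1],   # frame 4
])

track_histogram = frame_class_weights.sum(axis=0)
labels = ["TSV", "Sedan", "Semi"]
print(track_histogram, "->", labels[int(track_histogram.argmax())])  # final label: TSV
```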

TSV and Sedan are most often confused because of their proximity in the LDA feature space.

Tracking based classification uses the entire track of a vehicle for classification rather than just an individual frame image. Each frame generates an individual example of a vehicle which can be classified more accurately when all occurrences in a track are combined.

The track based classification results are promising and indicate the value of doing classification over spatio-temporal detections.

3. Fine-Grained Entity Recognition 
This paper proposes a way to define a fine-grained set of 112 tags, formulates the tagging problem as multi-class, multi-label classification, describes an unsupervised method for collecting training data, and presents the FIGER implementation.

In the overview section (2.1), the input is a sentence of plain text; the system then segments the sentence and finds candidates for tagging. However, it is an NLP paper and cannot be applied to vehicle recognition.

4. Inducing Fine-Grained Semantic Classes via Hierarchical and Collective Classification
This is also an NLP paper and does not help with recognizing vehicle types.

5. A Codebook-Free and Annotation-Free Approach for Fine-Grained Image Categorization 

The codebook method often loses subtle image information that is critical for fine-grained classification.

The annotation approach requires a tedious process that is also difficult to generalize to new tasks.


Although the paper uses birds as its running example, the method might be applied to vehicle recognition and classification.

Thursday, May 23, 2013

Key points in Towards Scalable Representations of Object Categories: Learning a Hierarchy of Parts

This paper proposes a novel approach to constructing a hierarchical representation of visual input that aims to enable recognition and detection of a large number of object categories.

indexing (bottom-up)
robust matching (top-down)

Category-independent lower layers

Category-specific higher layers


Lower layers are learned in a category-independent way to obtain complex, yet sharable visual building blocks, which is a crucial step towards a scalable representation. 
Higher layers of the hierarchy, on the other hand, are constructed by using specific categories, achieving a category representation with a small number of highly generalizable parts that gained their structural flexibility through composition within the hierarchy.



This paper proposes a much simpler and more efficient learning algorithm and introduces additional steps that enable a higher-level representation of object categories. Additionally, the proposed method is inherently incremental: new categories can be efficiently and continuously added to the system by adding a small number of parts only in the higher hierarchical layers.

Each unit in each hierarchical layer is envisioned as a composition defined in terms of spatially flexible local arrangements of units from the previous layers.


Since the learning process is incremental, categories can be efficiently added to the representation by adding a small number of parts only in the higher hierarchical layers.