Extract-Classify

1 Problem
When we have a corpus of data that has been partitioned into a set of distinct classes, how can we make decisions about similar data that has not been labeled?
2 Context
In many applications, there exists a large body of pre-analyzed data; each datum
has been given a label according to some disjoint partitioning of the corpus.
Frequently, one would like to use these labeled examples to make guesses as to
which labels should be applied to previously unseen data. For example, given a
set of emails which have been marked as “Spam” or “Not Spam”, we would like
to design an email filter which will identify future emails as “Spam” or “Not Spam”. Another example, more consistent with the examples we will use in the rest of this pattern: given an image, determine whether it is of some particular object. Many computer vision benchmarks have exactly this flavor. The well-known Caltech-101, Caltech-256, and Pascal VOC datasets are designed for precisely this object recognition task.
However, in most cases the data is not inherently structured. That is, there
is not a trivial or obvious representation of the data which lends itself to the
classification task. In the example of spam detection, each email is an arbitrary-length sequence of characters: this representation itself carries no information directly useful to the task of spam detection. Likewise, in the object recognition task, the image is an arbitrary-sized 2-D array of 8-bit pixels. This representation carries no useful inherent structure. In either case, however, the information
necessary to make the classification decision is clearly present: a human can easily determine whether an email is spam, or distinguish an image of a car from
an image of an elephant.
Many problems can be successfully understood through this application pattern. For readers who are new to these fields, we hope to provide some insight into the problems, some basic solutions, and guidance on how to structure programs using the Pattern Language and design patterns. In the following sections, we introduce feature extraction and classification in the field of computer vision as motivation for this application pattern.
3 Forces
Extract vs Classify: Since both feature extraction and classification need to be performed for all the data we have, one can trade off the complexity of one against the other. We can use complex features coupled with a simple classifier, or simple features coupled with complex classifiers. This tradeoff depends on the task at hand and is usually resolved through experimentation. For example, it might be very hard to remove noise from the extracted features, but using a classifier with a higher tolerance to noise might achieve similar results without increasing the feature extraction complexity.
Training time vs classification time: When choosing a classification algorithm for a problem, it is possible to trade off between training time (offline) and classification time (online). This tradeoff is very clear when comparing the nearest neighbor classifier (no training time, high classification time) with support vector machines (high training time, constant classification time). If feature selection is done carefully, both classifiers are comparable in terms of accuracy.
Global vs local features: This force is not a tradeoff, but rather a choice about the kind of features needed. Local features capture localized information along some dimension (spatial dimensions for images, time for speech, etc.). They are usually easy to compute, but are easily corrupted by noise. Global features are more representative of the entire data and usually more tolerant to noise, but they can be time consuming to compute and may be of limited use. For example, a global feature like a histogram of the straight lines in an image can differentiate urban/indoor scenes from natural scenes, but is not very helpful in discriminating small details. Algorithms also exist that use global information to derive robust local features at specific points (e.g., the SIFT features for images).
Labeled vs unlabeled data: Using labeled data (data for which we know the correct answer) usually makes classifier training less complicated, but obtaining large amounts of such data may be very time consuming. If a lot of unlabeled data for the problem is also available, it might be possible to use semi-supervised or even unsupervised training to exploit that data and achieve better accuracy. However, unsupervised or semi-supervised learning is harder than supervised learning. In other words, it might be possible to trade off training data acquisition time against classification time and complexity.
4 Solution
The Extract-Classify pattern is used in a variety of contexts and domains. In general, the problem centers on understanding data. An overview of the pattern is presented in Figure 1.
In the Extract-Classify pattern, we start with some labeled training data,
which has been annotated with results known to be correct. We extract features
from this data, and then train a classifier based on these features and our known
labels. This training process produces a model. After the model is derived, we
can apply it to new, unknown data. First, the data is processed to obtain features, using the same methods as with the training data. Then, the data is classified according to the model to produce the final classification results.

[Figure 1: The Extract-Classify pattern. Labeled training data passes through feature extraction and classifier training to produce a classifier model; unknown data passes through the same feature extraction and is classified against that model to produce the results.]
This pattern is used extensively in any field that deals with computer-aided
understanding of data, for example: machine learning, statistics, computer vision, control, biostatistics, computational finance, etc. It is a general pattern
that has been found useful in many domains because it breaks the problem of data interpretation into well-defined steps. The domain expert who invents new
algorithms for understanding data can then focus on filling in this pattern with
computations that are best suited for her particular problem.
For example, feature extraction can be optimized to provide better classification performance. More concretely, in image object recognition we hope to find features that capture the characteristics of objects such that one object can be distinguished from others; in activity detection, we investigate features that differentiate between activities. The domain expert can thus focus on how to extract differentiating features from the data, or on strategies for creating discriminative and accurate classifiers.
4.1 Feature Extraction
The goal of feature extraction is to deal with the lack of structure in the data. Arbitrarily-sized arrays of pixels are not immediately useful for the task of object recognition, as classification algorithms are usually formulated as functions on high-dimensional Euclidean spaces, or possibly infinite-dimensional Hilbert spaces. Thus, given some datum of our subject type, we need some way of producing elements of the desired Euclidean or Hilbert space.
Feature extraction can be viewed as a black box whose input is some set of data (structured or otherwise) and whose output is a set of vectors in some high-dimensional space. The dimensionality of the space can vary widely depending on the application, but for computer vision applications it is typically several hundred, or potentially several thousand. This space can be a continuous space (i.e., $\mathbb{R}^n$), a discrete space (e.g., $\{0, 1\}^n$), or some combination of the two. Feature extraction is essentially the process of turning unstructured data into a representation with some mathematical structure, so that it is possible to utilize the many algorithms which can achieve the task at hand.
The later sections of this paper will discuss in more detail how these vectors
are used, but let us first consider a trivial example from the domain of computer vision. Suppose that we have a set of labeled, full-frame pictures of cars
and of elephants (i.e. we know which of these images are cars, and which are
elephants), and we wish to decide whether some previously unseen image is of
a car or of an elephant. This is a supervised classification problem: we need
to use the labeled examples to compute parameters to some binary classification model. One potentially distinguishing characteristic of our two classes is
color: cars are typically painted in bright, unnatural colors, while elephants are typically some shade of brown or gray. Thus a potentially useful feature is a histogram of the colors that appear in our images. Most importantly, once we have
extracted these features from our images, we are now able to use any number
of general-purpose classification algorithms to solve our problem: e.g. Support
Vector Machines, Logistic Regression, or Gaussian Processes. The best classifier to choose depends on the characteristics of our data and our computational
constraints.
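As a concrete illustration, here is a minimal sketch of this whole pipeline, assuming NumPy, Pillow, and scikit-learn are available; the image file names and labels are hypothetical.

```python
import numpy as np
from PIL import Image
from sklearn.svm import SVC

def color_histogram(path, bins=8):
    """Map an arbitrary-sized RGB image to a fixed-length feature vector."""
    pixels = np.asarray(Image.open(path).convert("RGB")).reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=((0, 256),) * 3)
    return hist.ravel() / pixels.shape[0]   # normalize away image size

# Feature extraction: unstructured images become points in R^512.
train_files = ["car1.jpg", "car2.jpg", "elephant1.jpg", "elephant2.jpg"]
train_labels = [1, 1, -1, -1]               # +1 = car, -1 = elephant
X = np.array([color_histogram(f) for f in train_files])

# Classification: any general-purpose classifier now applies unchanged.
model = SVC(kernel="rbf").fit(X, train_labels)
print(model.predict([color_histogram("unknown.jpg")]))
```

The value of the separation is visible in the last two lines: swapping SVC for, say, logistic regression changes nothing upstream of the classifier.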
This separation is powerful. Assuming that our features capture the most
important information for our desired classification task, we now have extraordinary flexibility in choosing our classifier. However, there is no general or theoretical model with which to gauge the effectiveness of an approach to feature extraction. Successful feature extraction is as much art as science. As illustrated
by our simple example, designing a feature requires specific knowledge about the
problem at hand: we would not have chosen color as a feature if we were trying
to distinguish cars from motorcycles, for example. The only requirement is that
the output be vectors suitable for input to more general purpose classification
algorithms. We shall now present a few examples of features used in Computer
Vision applications.
4.1.1 Point Features: SIFT
A popular class of image features identifies particular small regions of the image
as “interesting”, and provide a vector for each such region. It is beyond the
scope of this paper to discuss much detail about the definition of “interesting”
used to identify these regions, as there is a large body of literature from the
Vision community discussing the topic. However, all definitions usually try to
achieve invariance to particular imaging conditions, so that the same regions
are identified in multiple images of the same object. For example, most feature
schemes achieve scale invariance: that is, they attempt to detect an interest
point regardless of the size of the object in the image, since the object’s distance
from the camera will be different in different images. Typically an entire object is detected by detecting a set of local features: for example, a car might
be represented by a feature for each wheel, features for the headlights, etc.
The most popular such feature is the Scale-Invariant Feature Transform (SIFT) of David Lowe [4]. SIFT is designed for invariance to scale, position, lighting
conditions, and rotation. It describes an interesting region via a spatially-binned
histogram of the edge directions surrounding the interest point. Essentially, this
describes the shape of the region. The particular interest points detected by
SIFT are usually edges, corners, and “blobs” – that is, they do not necessarily
correspond to the features that a human would identify as interesting. However,
SIFT has proven itself useful for a wide array of classification tasks.
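For illustration, here is a minimal sketch of extracting SIFT keypoints and descriptors, assuming the OpenCV Python binding (cv2.SIFT_create is available in OpenCV 4.4 and later); the image path is hypothetical.

```python
import cv2

image = cv2.imread("car.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()

# Each keypoint marks an "interesting" region (corner, blob, ...) chosen to
# be repeatable across scale, rotation, and lighting changes; each descriptor
# is a 128-dimensional spatially-binned histogram of local edge directions.
keypoints, descriptors = sift.detectAndCompute(image, None)
print(len(keypoints), descriptors.shape)    # e.g. 500 keypoints x 128 dims
```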
4.1.2 Global Features: HOG
Frequently, we wish to describe entire regions of images, rather than particular
points. In the case of elephant-car classification, we were trying to describe the
entire image (which was assumed to contain mostly the object of interest). It
is usually impossible to assume a priori that the location of interesting objects
is known. Thus, we have two choices: we can either require that the input to
our system is a full-frame picture of the objects of interest (as we assumed in
the elephant-car example), or use a “sliding-window” scheme. That is, we can
guess that the object will be within a bounding box of a certain size, and then
perform our feature extraction on bounding boxes at every possible position in
the image. Since we don’t know a priori what the size of the object will be, we
can repeat this for bounding boxes of several sizes. Obviously, we need to be
told the locations of objects in the training set, but the sliding window approach
is frequently a very effective, if simple, approach to locating objects in images.
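A minimal sketch of such a sliding-window enumeration follows; extract_features and model are hypothetical placeholders for whatever feature extractor and trained classifier are used.

```python
import numpy as np

def sliding_windows(image, window_sizes=((128, 64), (192, 96)), stride=8):
    """Yield (row, col, patch) for bounding boxes of several sizes at
    every stride-spaced position in the image."""
    for h, w in window_sizes:
        for r in range(0, image.shape[0] - h + 1, stride):
            for c in range(0, image.shape[1] - w + 1, stride):
                yield r, c, image[r:r + h, c:c + w]

# Hypothetical usage: score every candidate box with a trained classifier.
# detections = [(r, c) for r, c, patch in sliding_windows(img)
#               if model.predict([extract_features(patch)]) == 1]
```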
A well-known application of this principle is the pedestrian detector of Dalal and Triggs [3]. This detector uses a descriptor much like the SIFT descriptor to describe regions of an image which contain people standing upright, and trains linear Support Vector Machines to distinguish regions containing people from regions not containing people. Essentially, describing regions in
terms of edge directions causes the SVMs to detect the silhouettes of people. In
the case of the Dalal and Triggs Pedestrian detector, the resulting “Histograms
of Oriented Gradients”, or HOG, descriptors tended to capture the silhouette
of a standing human.
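A minimal sketch of this pipeline, assuming scikit-image's hog and scikit-learn's LinearSVC; the window images and labels are random stand-ins for real training data.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_descriptor(window):
    # Spatially binned histograms of gradient orientations, as in HOG.
    return hog(window, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

windows = np.random.rand(20, 128, 64)       # hypothetical 128x64 windows
labels = np.random.randint(0, 2, size=20)   # 1 = person, 0 = background
X = np.array([hog_descriptor(w) for w in windows])
model = LinearSVC().fit(X, labels)          # a linear SVM on HOG features
```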
4.1.3 Other Features
In addition to the SIFT and HOG features, there are some other features that
we can gather from images.
• Color features capture the color information of the images. Representations of color features include the conventional color histogram [14], the fuzzy color histogram [15], and the color correlogram [16].
• Texture features capture the spatial patterns in the images. Common approaches for extracting textures include the steerable pyramid [17],
contourlet transform [18], Gabor wavelet transform [19], and complex directional filter bank [20].
• Contour shape features capture the image contours that agree most closely with human perception. The most well-known method is the gPb algorithm
[21].
• Edge shape features capture the local changes of image gradients. The
most well-known methods are the Canny edge detection algorithm [22] and
the Sobel edge detector [23].
4.2 Classification
The classification problem is to find the corresponding category of a given entity.
It is a widely used high-level pattern in various areas. For example, in object recognition, we hope to identify the objects in an image. In medical diagnosis, we want to determine a patient's disease from their symptoms. In virus detection, we need to investigate whether an executable is infected. To solve this problem, a set of data and their corresponding categories are given, so that we can learn classification criteria from the given categorized exemplars and apply the learned criteria to classify other queries. A formal definition of the problem is as follows:
Definition 1. Let $C$ be the set of possible categories. Let $D = \{(\hat{x}_1, \hat{y}_1), (\hat{x}_2, \hat{y}_2), \ldots, (\hat{x}_n, \hat{y}_n)\}$ with $\hat{y}_i \in C$ for all $1 \le i \le n$ be the training set of $n$ (data, category) pairs. Given a set $Q = \{x_1, x_2, \ldots, x_m\}$ of $m$ data, find their corresponding categories $y_1, y_2, \ldots, y_m$ with $y_i \in C$ for all $1 \le i \le m$.
A wide variety of algorithms have been proposed to solve the classification problem.
The most common approaches include instance-based methods, probabilistic
models, linear models, and decision models.
The instance-based approach is to compare the query data with the training
data, find similar instances from the training data, and do the classification
based on these similar instances and their category tags. The most well-known
algorithm for this method is the K-Nearest Neighbors algorithm [6].
Probabilistic methods model the search space by probability distributions. Given a feature vector $X$, the probability that it belongs to category $c$ is modeled by a probability distribution, and the training set provides the observed data. We want to find parameters for the probability distributions that maximize the probability of the observed data. The trained parameters, together with the probability model, are then applied to the query data. The final result is the probability distribution over the category set $C$ conditioned on the query vector $X$, $P(c = C_i \mid X)$. The most well-known models
are the Naive Bayes method [7] and the Logistic Regression method [8].
Linear models find a plane that separates the data points. The separating plane is chosen based on the training set; that is, we find the plane that best separates the training data points. If the query data is
on one side of the plane, it will be classified as one category. If it is on the other
side of the plane, it will be classified as the other category. The most well-known
models are the Perceptron method [9] and the Support Vector Machine method
[10].
Decision models decide the category of a query vector by a sequence of consecutive decisions. The training set is used to find the type of decisions we can make, and the parameters of those decisions. Then we classify the query data based on the decision model built by the training stage. The most well-known models are the Decision Tree method [11], the Boosted Decision Trees method
[12], and the Random Forest method [13].
In the following subsections, we discuss three widely used classification algorithms.
4.3 K-Nearest Neighbor
The k-nearest neighbors algorithm (k-NN) classifies objects based on the closest k training exemplars. It is an instance-based method, in which the classification function is only approximated locally. The basic idea of the k-NN method is very simple: given a query, find the k nearest neighbors in the training data set, and let those k nearest neighbors vote on the category of the query.
In the k-NN algorithm, k is a hyperparameter that influences the performance of the algorithm. In order to choose a proper k, some training is unavoidable. A common approach is cross-validation: the labeled data is divided into a training set and a validation set, each candidate k is fit on the training set, and its performance is measured on the validation set. Finally, the k value with the best performance on the validation set is chosen. Moreover, k is usually chosen to be an odd number to avoid ties. A minimal sketch of this procedure follows.
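The sketch assumes scikit-learn's KNeighborsClassifier and train_test_split; the feature matrix X and labels y are synthetic stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))              # hypothetical feature vectors
y = (X[:, 0] > 0).astype(int)               # hypothetical binary labels

# Hold out a validation set from the labeled data.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)

best_k, best_acc = None, 0.0
for k in (1, 3, 5, 7, 9):                   # odd values help avoid ties
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = model.score(X_val, y_val)         # accuracy on the validation set
    if acc > best_acc:
        best_k, best_acc = k, acc
```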
There are two ways of realizing the k-NN algorithm: exact k-NN and approximate k-NN. As the names suggest, exact k-NN finds the exact k nearest neighbors, while approximate k-NN finds approximately the k nearest neighbors. The exact method is achieved by an explicit sort on distances. The approximate method uses specialized data structures so that it only evaluates part of the search space when finding the k nearest neighbors.
There are three major computations related to k-NN. We now analyze the parallel patterns for each of these computations; a vectorized sketch follows the list.
• Distance computation for each pair of data. Computing the distances of different pairs is task parallel. The computation of the distance for one pair depends on the definition of the distance; for Euclidean distance, the computation is a map-reduce operation.

• Finding the k nearest neighbors. For the exact k-NN algorithm, this is a parallel sorting problem. Different parallel sorting algorithms may be realized by different patterns: for example, insertion sort can be implemented with data parallelism, while quicksort can be implemented by recursive splitting. For the approximate k-NN algorithm, the data is usually organized in a tree for quick access, which can be traversed with task parallelism.

• Voting among the k nearest neighbors. This is a map-reduce procedure.
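A vectorized NumPy sketch of all three computations for exact k-NN, assuming the training set, queries, and small non-negative integer labels fit in memory:

```python
import numpy as np

def knn_classify(train, labels, queries, k=5):
    """Exact k-NN: distances, k-selection, and voting."""
    # 1. Distance computation for every (query, training) pair, using
    #    ||q - x||^2 = q.q + x.x - 2 q.x (a map-reduce per pair).
    d2 = (np.sum(queries ** 2, axis=1)[:, None]
          + np.sum(train ** 2, axis=1)[None, :]
          - 2.0 * queries @ train.T)
    # 2. Find the k nearest neighbors: a partial sort along each row.
    nearest = np.argpartition(d2, k, axis=1)[:, :k]
    # 3. Voting among the k neighbors: a reduction over their labels.
    return np.array([np.bincount(row).argmax() for row in labels[nearest]])
```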
Garcia et al. proposed a GPU implementation of the exact k-NN algorithm [1]. For the distance computation, it exploits the data parallelism across distances of different pairs of data. For finding the k nearest neighbors, it exploits data parallelism within insertion sort. Finally, they compared the computation time of the GPU exact k-NN implementation with three different serial implementations. The results are summarized in Figure 2. In this table, n refers to the number of training and query data points, d refers to the dimensionality of the data, BF-Matlab is the exact k-NN implemented in MATLAB, BF-C is the exact k-NN implemented in C++, ANN-C++ is the approximate k-NN, and BF-CUDA is the parallel exact k-NN implemented on the GPU. According to the results, the efficiency gained by exploiting data parallelism on the GPU is very significant when the number of points and the dimensionality are large.
d      Method      n=1200   n=2400   n=4800   n=9600   n=19200   n=38400
d=8    BF-Matlab     0.51     1.69     7.84    35.08    148.01    629.90
       BF-C          0.13     0.49     1.90     7.53     29.21    127.16
       ANN-C++       0.13     0.33     0.81     2.43      6.82     18.38
       BF-CUDA       0.01     0.02     0.04     0.13      0.43      1.89
d=16   BF-Matlab     0.74     2.98    12.60    51.64    210.90    893.61
       BF-C          0.22     0.87     3.45    13.82     56.29    233.88
       ANN-C++       0.26     1.06     5.04    23.97     91.33    319.01
       BF-CUDA       0.01     0.02     0.06     0.17      0.60      2.51
d=32   BF-Matlab     1.03     5.00    21.00    84.33    323.47   1400.61
       BF-C          0.45     1.79     7.51    30.23    116.35    568.53
       ANN-C++       0.39     1.78     9.21    39.37    166.98    688.55
       BF-CUDA       0.01     0.03     0.08     0.24      0.94      3.89
d=64   BF-Matlab     2.24     9.37    38.16   149.76    606.71   2353.40
       BF-C          1.71     7.28    26.11   111.91    455.49   1680.37
       ANN-C++       0.78     3.56    14.66    59.28    242.98   1008.84
       BF-CUDA       0.02     0.04     0.11     0.40      1.57      6.65
d=80   BF-Matlab     2.35    11.53    47.11   188.10    729.52   2852.68
       BF-C          2.13     8.43    33.40   145.07    530.44   2127.08
       ANN-C++       0.98     4.29    17.22    73.22    302.44   1176.39
       BF-CUDA       0.02     0.04     0.13     0.48      1.98      8.17
d=96   BF-Matlab     3.30    13.89    55.77   231.69    901.38   3390.45
       BF-C          2.54    10.56    39.26   168.58    674.88   2649.24
       ANN-C++       1.20     4.96    19.68    82.45    339.81   1334.35
       BF-CUDA       0.02     0.05     0.15     0.57      2.29      9.61

Figure 2: Computation time, in seconds, of the methods BF-Matlab, BF-C, ANN-C++, and BF-CUDA, reproduced from [1]. BF-CUDA is up to 407 times faster than BF-Matlab, 295 times faster than BF-C, and 148 times faster than ANN-C++.
[The original paper [1] also plots computation time as a function of the number of points n (with d = 32 and k = 20): the time grows polynomially with n for BF-Matlab, BF-C, and ANN-C++, while BF-CUDA is comparatively insensitive to n over the tested range. Profiling with the CUDA profiler shows that the fraction of time spent computing distances grows with n and d, while the fraction spent sorting distances grows with k; for n = 4800, d = 32, and k = 20, roughly 66% of the total time goes to distance computation and 32% to sorting.]
4.4 Logistic Regression
Logistic regression is a probabilistic model for the classification problem. Given the training data $D$, it finds a set of parameters $\theta$ that maximizes the log likelihood $l(\theta \mid D)$. Then it uses the parameter set $\theta$, together with the probability model, to classify each query datum.
In logistic regression, the conditional probability $p(y \mid x)$ is modeled by a function $\phi(\theta^T x)$, where $\phi$ is the logistic function. Therefore, the conditional probability $p(Y = k \mid x, \theta)$ is modeled by the softmax-linear model:

$$P(Y = k \mid x, \theta) = \frac{e^{\theta_k^T x}}{\sum_l e^{\theta_l^T x}}, \qquad (1)$$

where the summation in the denominator runs over all possible category labels $l$. Let

$$\mu_i^k = \frac{e^{\theta_k^T x_i}}{\sum_l e^{\theta_l^T x_i}}; \qquad (2)$$

then we can calculate the conditional probability of the category $y_i$ given data $x_i$ and parameter set $\theta$ by

$$p(y_i \mid x_i, \theta) = \prod_k (\mu_i^k)^{y_i^k}, \qquad (3)$$

where $k$ ranges over the set of possible categories, and $y_i^k = 1$ if and only if the category of data $x_i$ is $k$. Therefore, the log likelihood can be computed as

$$l(\theta \mid D) = \sum_{i=1}^{n} \sum_k y_i^k \log \mu_i^k, \qquad (4)$$

where $n$ is the number of training data.
In order to find the parameter set $\theta$ that maximizes the log likelihood, we can apply iterative optimization algorithms: for example, nonlinear conjugate gradient, steepest descent, or the Newton-Raphson algorithm.
Kernel logistic regression is a nonlinear form of logistic regression. By applying the kernel trick to the data set, each datum $x$ is mapped to a high-dimensional Hilbert space, and logistic regression is then applied in that space.
Now we analyze the computation involved in the logistic regression method. The map-reduce pattern can be applied to the dot product operations, and dense linear algebra applies to the iterative optimization methods that fit the parameter set. A minimal sketch of such an iteration follows.
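This sketch trains by batch steepest ascent on the log likelihood (4); the data is synthetic and a fixed step size stands in for the fancier optimizers named above.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)    # for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def train_logreg(X, y, num_classes, steps=500, lr=0.1):
    n, d = X.shape
    Y = np.eye(num_classes)[y]              # one-hot: Y[i, k] = y_i^k
    theta = np.zeros((d, num_classes))
    for _ in range(steps):
        mu = softmax(X @ theta)             # mu_i^k, as in eq. (2)
        theta += lr / n * X.T @ (Y - mu)    # gradient of log likelihood (4)
    return theta

def classify(theta, X):
    return np.argmax(X @ theta, axis=1)     # most probable category
```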
4.5 Support Vector Machine
Support Vector Machines are a class of classifiers based on the maximum-margin
principle. In general terms, out of all possible planes that separate the classes,
SVM picks the classifier with the largest margin between the classes. This gives
classifiers that fit the training data well, and also perform well on other data by
avoiding overfitting.
We consider the standard two-class soft-margin SVM classification problem (C-SVM), which classifies a given data point $x \in \mathbb{R}^n$ by assigning a label $y \in \{-1, 1\}$.
4.5.1 SVM Training
Given a labeled training set consisting of a set of data points $x_i$, $i \in \{1, \ldots, l\}$, with their accompanying labels $y_i$, $i \in \{1, \ldots, l\}$, the SVM training problem can be written as the following Quadratic Program:

$$\max_{\alpha}\ F(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\,\alpha^T Q \alpha$$
$$\text{subject to}\ 0 \le \alpha_i \le C,\ \forall i \in 1 \ldots l \qquad (5)$$
$$y^T \alpha = 0$$
where $x_i \in \mathbb{R}^n$ is training data point $i$, $y_i \in \{-1, 1\}$ is the label attached to point $x_i$, and the $\alpha_i$ are a set of weights, one for each training point, which are being optimized to determine the SVM classifier. $C$ is a parameter which trades classifier generality for accuracy on the training set, and $Q_{ij} = y_i y_j \Phi(x_i, x_j)$, where $\Phi(x_i, x_j)$ is a kernel function. Some popular kernel functions are:

$$\text{Linear:}\ \Phi(x_i, x_j) = x_i^T x_j$$
$$\text{Polynomial:}\ \Phi(x_i, x_j; a, r, d) = (a\, x_i \cdot x_j + r)^d$$
$$\text{Gaussian:}\ \Phi(x_i, x_j; \gamma) = \exp\left(-\gamma \|x_i - x_j\|^2\right)$$
$$\text{Sigmoid:}\ \Phi(x_i, x_j; a, r) = \tanh(a\, x_i \cdot x_j + r)$$
In the optimal solution of the QP, the points which have non-zero $\alpha_i$ are called Support Vectors.
This problem can be solved by different techniques. One of the most popular
ways to solve it is through Sequential Minimal Optimization (SMO) [5]. We refer
the reader to [5] for details of the algorithm. The important patterns in the
computation are
• Computing the gradient of the objective function
This involves a parallel map computation. Each map computes the kernel
function of a point with a point or a set of points.
• Finding the next set of points to update
This computation involves a reduce function (usually min or max). This
step is usually combined with the previous stage to produce a map-reduce
pattern.
• Convergence
Update the points until the objective stabilizes or the duality gap goes to
zero. This is done using a simple while loop.
Other methods for solving the training problem involve linear algebra, map-reduce, and data parallelism. A sketch of the map and reduce steps above follows.
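This is a minimal sketch assuming a precomputed matrix Q with Q_ij = y_i y_j Φ(x_i, x_j) that fits in memory; the working-set selection is simplified relative to production SMO solvers, and the analytic pair update is omitted.

```python
import numpy as np

def gradient(Q, alpha):
    # Map(-reduce): each entry of grad F(alpha) = 1 - Q @ alpha is an
    # independent dot product of one row of Q (kernel values) with alpha.
    return 1.0 - Q @ alpha

def select_pair(grad, alpha, y, C, tol=1e-3):
    # Reduce: pick a maximally violating pair with one max and one min,
    # restricted to the alphas still free to move in each direction.
    up = ((y > 0) & (alpha < C)) | ((y < 0) & (alpha > 0))
    down = ((y < 0) & (alpha < C)) | ((y > 0) & (alpha > 0))
    i = int(np.argmax(np.where(up, y * grad, -np.inf)))
    j = int(np.argmin(np.where(down, y * grad, np.inf)))
    converged = y[i] * grad[i] - y[j] * grad[j] < tol   # small violation
    return i, j, converged
```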
4.5.2 SVM Classification
The SVM classification problem is as follows: for each data point z which should
be classified, compute
$$\hat{z} = \operatorname{sgn}\left( b + \sum_{i=1}^{l} y_i \alpha_i \Phi(x_i, z) \right) \qquad (6)$$
where z is a point which needs to be classified, and all other variables remain as
previously defined. The exact computational needs depend on the kernel used.
For linear SVMs, the computation involves a single dot product between $z$ and $\sum_{i=1}^{l} y_i \alpha_i x_i$.
For other kernels, the computation involves a matrix-vector-product-like operation for each classification. If more than one point is being classified, it is better to combine all the points, leading to a matrix-matrix product. Matrix-matrix product (BLAS3) operations are highly optimized for compute throughput and are processed very efficiently. For example, the norm operation in a Gaussian kernel can be written as $\|x - y\|^2 = x \cdot x + y \cdot y - 2\,x \cdot y$ to recast the computation into vector norm computations and a matrix-matrix multiplication, as sketched below.
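A minimal sketch of this batch formulation for the Gaussian kernel; SV, y_alpha (the products y_i * alpha_i), b, and gamma are assumed outputs of training.

```python
import numpy as np

def svm_classify_batch(Z, SV, y_alpha, b, gamma):
    """Evaluate eq. (6) for all rows of Z at once with a Gaussian kernel.
    SV holds the support vectors; y_alpha holds the products y_i * alpha_i."""
    # ||x - z||^2 = x.x + z.z - 2 x.z : one BLAS3 product SV @ Z.T
    d2 = (np.sum(SV ** 2, axis=1)[:, None]
          + np.sum(Z ** 2, axis=1)[None, :]
          - 2.0 * SV @ Z.T)
    K = np.exp(-gamma * d2)            # kernel matrix, one column per query
    return np.sign(b + y_alpha @ K)    # predicted labels for every query
```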
5 Related Patterns
1. Dense Linear Algebra
Dense matrix operations are needed for most of these techniques, especially distance computations in k-NN and kernel operations in SVMs.
2. Map-Reduce
Many computations can be written as map-reduce, e.g., SVM training using SMO.
3. Data Parallelism
Data parallelism is prevalent in all these techniques: e.g., each point can be classified separately, and distance computations can be done for each point in parallel.
References
[1] Garcia, V., Debreuve, E., and Barlaud, M., “Fast k nearest neighbor search using GPU,” Computer Vision and Pattern Recognition, pp. 1-6, June 2008.

[2] Jordan, M. I., “An Introduction to Probabilistic Graphical Models.”

[3] Dalal, N. and Triggs, B., “Histograms of Oriented Gradients for Human Detection,” Computer Vision and Pattern Recognition, pp. 886-893, 2005.

[4] Lowe, D. G., “Distinctive Image Features from Scale-Invariant Keypoints,” International Journal of Computer Vision, pp. 91-110, 2004.

[5] Platt, J. C., “Fast training of support vector machines using sequential minimal optimization,” In Advances in Kernel Methods: Support Vector Learning, pp. 185-208, 1999.

[6] Shakhnarovich, G., Darrell, T., and Indyk, P., “Nearest-Neighbor Methods in Learning and Vision,” The MIT Press, 2006.

[7] Rish, I., “An empirical study of the naive Bayes classifier,” IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, 2001.

[8] Hilbe, J. M., “Logistic Regression Models,” Chapman and Hall/CRC Press, 2009.

[9] Freund, Y. and Schapire, R. E., “Large margin classification using the perceptron algorithm,” In Proceedings of the 11th Annual Conference on Computational Learning Theory, 1998.

[10] Cristianini, N. and Shawe-Taylor, J., “An Introduction to Support Vector Machines and Other Kernel-based Learning Methods,” Cambridge University Press, 2000.

[11] Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J., “Classification and Regression Trees,” Wadsworth and Brooks/Cole Advanced Books and Software, 1984.

[12] Freund, Y. and Schapire, R. E., “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, 55(1), pp. 119-139, August 1997.

[13] Ho, T. K., “Random Decision Forests,” 3rd Int'l Conf. on Document Analysis and Recognition, pp. 278-282, 1995.

[14] Smith, J. R. and Chang, S.-F., “Automated image retrieval using color and texture,” Columbia University, Tech. Report CU/CTR 408-95-14, 1995.

[15] Han, J. and Ma, K., “Fuzzy color histogram and its use in image retrieval,” IEEE Trans. on Image Processing, vol. 11, pp. 944-952, 2002.

[16] Huang, J., Kumar, S. R., Mitra, M., Zhu, W. J., and Zabih, R., “Image indexing using color correlograms,” Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 762-768, 1997.

[17] Portilla, J. and Simoncelli, E. P., “A Parametric Texture Model based on Joint Statistics of Complex Wavelet Coefficients,” Int'l Journal of Computer Vision, October 2000.

[18] Do, M. N. and Vetterli, M., “The Contourlet Transform: An Efficient Directional Multiresolution Image Representation,” IEEE Transactions on Image Processing, 2005.

[19] Daugman, J. G., “Two-dimensional spectral analysis of cortical receptive field profiles,” Vision Research, 20(10), pp. 847-856, 1980.

[20] Oraintara, S. and Nguyen, T. T., “Using Phase and Magnitude Information of the Complex Directional Filter Bank for Texture Image Retrieval,” Proc. IEEE Int. Conf. on Image Processing, vol. 4, pp. 61-64, 2007.

[21] Maire, M., Arbeláez, P., Fowlkes, C., and Malik, J., “Using contours to detect and localize junctions in natural images,” In CVPR, 2008.

[22] Canny, J., “A Computational Approach to Edge Detection,” IEEE Trans. Pattern Analysis and Machine Intelligence, 8:679-714, 1986.

[23] Gonzalez, R. and Woods, R., “Digital Image Processing,” Addison Wesley, 1992, pp. 414-428.
6 Authors
Narayanan Sundaram, Bor-Yiing Su, Mark Murphy, Bryan Catanzaro