Extract-Classify

1 Problem

When we have available a corpus of data which has been partitioned into a set of distinct classes, how can we make decisions about similar data which has not been labeled?

2 Context

In many applications, there exists a large body of pre-analyzed data; each datum has been given a label according to some disjoint partitioning of the corpus. Frequently, one would like to use these labeled examples to guess which labels should be applied to previously unseen data. For example, given a set of emails which have been marked as “Spam” or “Not Spam”, we would like to design an email filter which will identify future emails as “Spam” or “Not Spam”. Another example, more consistent with the examples we will use in the rest of this pattern: given an image, determine whether it is of some particular object. Many computer vision benchmarks have exactly this flavor; the well-known Caltech-101, Caltech-256, and Pascal VOC datasets are designed for precisely this object recognition task.

However, in most cases the data is not inherently structured. That is, there is no trivial or obvious representation of the data which lends itself to the classification task. In the spam-detection example, each email is an arbitrary-length sequence of characters: this representation itself carries no information directly useful to the task of spam detection. Likewise, in the object recognition task, the image is an arbitrarily-sized 2-D array of 8-bit pixels, a representation that carries no useful inherent structure. In either case, however, the information necessary to make the classification decision is clearly present: a human can easily determine whether an email is spam, or distinguish an image of a car from an image of an elephant.

Many problems can be successfully understood through this application pattern. For readers new to these fields, we hope to provide some insight into the problems, some basic solutions, and guidance on developing programs with the help of the Pattern Language and Design Patterns. In the following sections, we introduce feature extraction and classification in the field of computer vision as motivation for this application pattern.

3 Forces

Extract vs classify: Since both feature extraction and classification must be performed for all the data we have, one can trade off the complexity of one against the other. We can use complex features coupled with a simple classifier, or simple features coupled with a complex classifier. This trade-off depends on the task at hand and is usually resolved through experimentation. For example, it might be very hard to remove noise from the extracted features, but using a classifier with a higher tolerance to noise might achieve similar results without increasing the feature extraction complexity.

Training time vs classification time: When choosing a classification algorithm or tool for a problem, it is possible to trade off training time (offline) against classification time (online). This trade-off is very clear when comparing the nearest neighbor classifier (no training time, high classification time) with support vector machines (high training time, constant classification time). If feature selection is done carefully, both classifiers are comparable in terms of accuracy.

Global vs local features: This force is not a trade-off, but rather a choice about the kind of features needed. Local features capture localized information along some dimension (spatial dimensions for images, time for speech, etc.).
They are usually easy to compute, but are easily corrupted by noise. Global features are more representative of the entire data and usually more tolerant to noise, but can be time-consuming to compute and may be of limited use. For example, a global feature such as a histogram counting the straight lines in an image can differentiate urban or indoor scenes from natural scenes, but is not very helpful in discriminating small details. Algorithms also exist that use global information to derive robust local features at specific points (e.g., the SIFT feature for images).

Labeled vs unlabeled data: Using labeled data (data for which we know the correct answer) is usually less complicated in terms of classifier training, but acquiring large amounts of such data may be very time-consuming. If a lot of unlabeled data for the problem is also available, it might be possible to use semi-supervised or even unsupervised training to exploit such data and achieve better accuracy. However, unsupervised or semi-supervised learning is harder than supervised learning. In other words, it might be possible to trade off training data acquisition time against classification time and complexity.

4 Solution

The Extract Classify pattern is used in a variety of contexts and domains. In general, the problem centers around understanding data. An overview of the problem is presented in figure 1. In the Extract Classify pattern, we start with some labeled training data, which has been annotated with results known to be correct. We extract features from this data, and then train a classifier based on these features and our known labels. This training process produces a model. After the model is derived, we can apply it to new, unknown data. First, the data is processed to obtain features, using the same methods as with the training data. Then, the data is classified according to the model to produce the final classification results.

[Figure 1: The Extract Classify Pattern. Labeled training data passes through feature extraction and classifier training to produce a model; unknown data passes through the same feature extraction and is classified with the model to yield the classification results.]

This pattern is used extensively in any field that deals with computer-aided understanding of data, for example: machine learning, statistics, computer vision, control, biostatistics, computational finance, etc. It is a general pattern that has been found useful in many domains because it breaks the problem of data interpretation into well-defined steps. For example, feature extraction can be optimized to provide better classification performance. The domain expert who invents new algorithms for understanding data can then focus on filling in this pattern with computations best suited to her particular problem: how to extract differentiating features from images, or strategies for creating discriminative and accurate classifiers. More concretely, in image object recognition we hope to find features that capture the characteristics of objects so that one object can be distinguished from others; in activity detection, we investigate features that differentiate activities.
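To make the two phases of figure 1 concrete, here is a minimal sketch in Python. It is an illustration of the pattern, not a prescribed implementation: extract_features is a hypothetical placeholder for any of the extractors discussed in section 4.1, and scikit-learn's SVC stands in for whichever classifier suits the problem.

```python
import numpy as np
from sklearn.svm import SVC

def extract_features(datum):
    # Hypothetical placeholder: in a real system this would compute, e.g.,
    # a color histogram or HOG descriptor. Here we simply flatten the datum.
    return np.asarray(datum, dtype=float).ravel()

def train(labeled_data, labels):
    # Training phase: extract features, then fit a classifier to get a model.
    X = np.stack([extract_features(d) for d in labeled_data])
    model = SVC(kernel="rbf")
    return model.fit(X, labels)

def classify(model, unknown_data):
    # Classification phase: extract features with the SAME method as in
    # training, then apply the trained model.
    X = np.stack([extract_features(d) for d in unknown_data])
    return model.predict(X)
```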
4.1 Feature Extraction

The goal of feature extraction is to deal with the unstructuredness of the data. Arbitrarily-sized arrays of pixels are not immediately useful to the task of object recognition, as classification algorithms are usually formulated as functions on high-dimensional Euclidean spaces, or possibly infinite-dimensional Hilbert spaces. Thus, given some datum of our subject type, we need some way of producing elements of the desired Euclidean or Hilbert space.

Feature extraction can be viewed as a black box whose input is some set of data (structured or otherwise) and whose output is a set of vectors in some high-dimensional space. The dimensionality of the space can vary widely depending on the application, but for computer vision applications it is typically several hundred, or potentially several thousand. This space can be a continuous space (i.e., $\mathbb{R}^n$), a discrete space (e.g., $\{0,1\}^n$), or some combination of the two. Feature extraction is essentially the process of turning unstructured data into a representation with some mathematical structure, so that it is possible to utilize the many algorithms which can achieve the task at hand.

The later sections of this paper will discuss in more detail how these vectors are used, but let us first consider a trivial example from the domain of computer vision. Suppose that we have a set of labeled, full-frame pictures of cars and of elephants (i.e., we know which of these images are cars, and which are elephants), and we wish to decide whether some previously unseen image is of a car or of an elephant. This is a supervised classification problem: we need to use the labeled examples to compute parameters for some binary classification model. One potentially distinguishing characteristic of our two classes is color: cars are typically painted in bright, unnatural colors, while elephants are typically some shade of brown or gray. Thus a potentially useful feature is a histogram of the colors that appear in our images. Most importantly, once we have extracted these features from our images, we are able to use any number of general-purpose classification algorithms to solve our problem: e.g., Support Vector Machines, Logistic Regression, or Gaussian Processes. The best classifier to choose depends on the characteristics of our data and our computational constraints.

This separation is powerful. Assuming that our features capture the information most important to our desired classification task, we now have extraordinary flexibility in choosing our classifier. However, there is no general or theoretical model with which to gauge the effectiveness of an approach to feature extraction. Successful feature extraction is as much art as science. As illustrated by our simple example, designing a feature requires specific knowledge about the problem at hand: we would not have chosen color as a feature if we were trying to distinguish cars from motorcycles, for example. The only requirement is that the output be vectors suitable for input to more general-purpose classification algorithms.
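As a minimal sketch of the color-histogram feature from the car/elephant example (the choice of 8 bins per channel and the normalization are illustrative assumptions, not part of the pattern):

```python
import numpy as np

def color_histogram(image, bins_per_channel=8):
    # image: H x W x 3 array of 8-bit RGB pixels (any H, W).
    # Quantize each channel into coarse bins and count joint occurrences,
    # producing a fixed-length vector (8^3 = 512 dimensions) regardless of
    # image size: exactly the structured representation a classifier needs.
    width = 256 // bins_per_channel
    binned = np.asarray(image).reshape(-1, 3) // width
    hist = np.zeros((bins_per_channel,) * 3)
    np.add.at(hist, (binned[:, 0], binned[:, 1], binned[:, 2]), 1)
    return hist.ravel() / binned.shape[0]  # normalize by pixel count
```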
We shall now present a few examples of features used in computer vision applications.

4.1.1 Point Features: SIFT

A popular class of image features identifies particular small regions of the image as “interesting”, and provides a vector for each such region. It is beyond the scope of this paper to discuss in much detail the definitions of “interesting” used to identify these regions, as there is a large body of literature from the vision community discussing the topic. However, all definitions usually try to achieve invariance to particular imaging conditions, so that the same regions are identified in multiple images of the same object. For example, most feature schemes achieve scale invariance: that is, they attempt to detect an interest point regardless of the size of the object in the image, since the object's distance from the camera will be different in different images. Typically an entire object is detected by detecting a set of local features: for example, a car might be represented by a feature for each wheel, features for the headlights, etc.

The most popular such feature is the Scale Invariant Feature Transform (SIFT) of David Lowe [4]. SIFT is designed for invariance to scale, position, lighting conditions, and rotation. It describes an interesting region via a spatially-binned histogram of the edge directions surrounding the interest point; essentially, this describes the shape of the region. The particular interest points detected by SIFT are usually edges, corners, and “blobs”; that is, they do not necessarily correspond to the features that a human would identify as interesting. However, SIFT has proven itself useful for a wide array of classification tasks.

4.1.2 Global Features: HOG

Frequently, we wish to describe entire regions of images, rather than particular points. In the case of elephant-car classification, we were trying to describe the entire image (which was assumed to contain mostly the object of interest). It is usually impossible to assume a priori that the location of interesting objects is known. Thus, we have two choices: we can either require that the input to our system is a full-frame picture of the object of interest (as we assumed in the elephant-car example), or use a “sliding-window” scheme. That is, we can guess that the object will fit within a bounding box of a certain size, and then perform our feature extraction on bounding boxes at every possible position in the image. Since we don't know a priori what the size of the object will be, we can repeat this for bounding boxes of several sizes. Obviously, we need to be told the locations of objects in the training set, but the sliding-window approach is frequently a very effective, if simple, approach to locating objects in images (a sketch follows at the end of this section).

A well-known application of this principle is the pedestrian detector of Dalal and Triggs [3]. This detector uses a descriptor much like the SIFT descriptor to describe regions of an image which contain people standing upright, and trains linear Support Vector Machines to distinguish regions containing people from regions not containing people. Essentially, describing regions in terms of edge directions causes the SVMs to detect the silhouettes of people: the resulting “Histograms of Oriented Gradients” (HOG) descriptors tend to capture the silhouette of a standing human.
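The sliding-window scheme can be sketched in a few lines. This is a single-scale illustration under stated assumptions: model is any trained classifier exposing a scikit-learn-style decision_function, and extract is a region descriptor (such as a HOG extractor) returning a 1-D feature vector; a full detector would also rescale the image to handle multiple object sizes.

```python
import numpy as np

def sliding_window_detect(image, model, extract,
                          box=(128, 64), stride=16, thresh=0.0):
    # Slide a fixed-size box over the image, score each window with the
    # trained classifier, and keep the windows scoring above a threshold.
    H, W = image.shape[:2]
    bh, bw = box
    hits = []
    for y in range(0, H - bh + 1, stride):
        for x in range(0, W - bw + 1, stride):
            feat = extract(image[y:y + bh, x:x + bw])
            score = model.decision_function(feat[None, :])[0]
            if score > thresh:
                hits.append((x, y, score))  # top-left corner and score
    return hits
```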
4.1.3 Other Features

In addition to the SIFT and HOG features, there are other features that we can gather from images.

• Color features capture the color information of the images. Representations include the conventional color histogram [14], the fuzzy color histogram [15], and the color correlogram [16].

• Texture features capture the spatial patterns on the images. Common approaches for extracting textures include the steerable pyramid [17], the contourlet transform [18], the Gabor wavelet transform [19], and the complex directional filter bank [20].

• Contour shape features capture the image contours that humans find most agreeable. The most well-known method is the gPb algorithm [21].

• Edge shape features capture the local changes of image gradients. The most well-known methods are the Canny edge detection algorithm [22] and the Sobel edge detector [23].

4.2 Classification

The classification problem is to find the corresponding category of a given entity. It is a widely used high-level pattern in various areas. For example, in object recognition, we hope to identify the objects in an image. In medical diagnosis, we want to determine a patient's disease from his symptoms. In virus detection, we need to investigate whether an executable is infected. In order to solve this problem, a set of data and their corresponding categories are given, such that we can learn classification criteria from the given categorized exemplars, and apply the learned criteria to classify other queries. A formal definition of the problem is as follows:

Definition 1. Let $C$ be the set of possible categories. Let $D = \{(\hat{x}_1, \hat{y}_1), (\hat{x}_2, \hat{y}_2), \ldots, (\hat{x}_n, \hat{y}_n)\}$ with $\hat{y}_i \in C$ for all $1 \le i \le n$ be the training set of $n$ (data, category) pairs. Given a set $Q = \{x_1, x_2, \ldots, x_m\}$ of $m$ data points, find their corresponding categories $y_1, y_2, \ldots, y_m$ with $y_i \in C$ for all $1 \le i \le m$.

A wide variety of algorithms have been proposed to solve the classification problem. The most common approaches include instance-based methods, probabilistic models, linear models, and decision models.

The instance-based approach compares the query data with the training data, finds similar instances in the training data, and classifies based on these similar instances and their category tags. The most well-known algorithm of this kind is the K-Nearest Neighbors algorithm [6].

Probabilistic methods model the search space by probability distributions. Given a feature vector $X$, the probability that it belongs to category $c$ is modeled by a probability distribution, and the training set is the observed data. We want to find parameters for the probability distributions that maximize the probability of the appearance of the observed data. The trained parameters and the probability model are then applied to the query data; the final result is the probability distribution on the category set $C$ conditioned on the query vector $X$, $P(c = C_i \mid X)$. The most well-known models are the Naive Bayes method [7] and the Logistic Regression method [8].

Linear models find a plane that separates the data points. The separating plane is chosen based on the training set; that is, we find the plane that best separates the training data points. If a query point lies on one side of the plane, it is classified as one category; if it lies on the other side, it is classified as the other category. The most well-known models are the Perceptron method [9] and the Support Vector Machine method [10].

Decision models decide the category of a query vector by a sequence of consecutive decisions. The training set is used to find the type of decisions we can make, and the parameters of those decisions. We then classify the query data with the decision model built in the training stage. The most well-known models are the Decision Tree method [11], the Boosted Decision Trees method [12], and the Random Forest method [13].

In the following sections, we discuss three widely used classification algorithms.

4.3 K-Nearest Neighbor

The k-nearest neighbors algorithm (k-NN) classifies objects based on the closest k training exemplars. It is an instance-based method, in which the classification function is only approximated locally. The basic idea of the k-NN method is very simple: given a query, find the k nearest neighbors in the training data set, and let these k nearest neighbors vote on the category of the query (typically by majority vote).
In the k-NN algorithm, k is a hyperparameter that influences the performance of the algorithm. In order to choose a proper k, training is inevitable. A common approach is cross-validation: the training data is divided into a training set and a validation set, the hyperparameter k is tuned on the training set, and performance is measured on the validation set. Finally, the k value with the best performance on the validation set is chosen. Moreover, k is usually chosen to be an odd number to avoid possible ties.

There are two ways to realize the k-NN algorithm: exact k-NN and approximate k-NN. As the names suggest, exact k-NN finds the exact k nearest neighbors, while approximate k-NN finds approximately the k nearest neighbors. The exact method relies on an explicit sorting of distances. The approximate method uses specific data structures so that it only evaluates part of the search space when finding the k nearest neighbors.

There are three major computations in k-NN; we now analyze the pattern for each (a sketch of all three follows this list).

• Distance computation for each pair of data. The computation of distances for different pairs of data is task parallel. The computation of the distance for one pair depends on the definition of distance; for Euclidean distance, the computation is a map-reduce operation.

• Finding the k nearest neighbors. For the exact k-NN algorithm, this is a parallel sorting problem. Different parallel sorting algorithms may be realized by different patterns: for example, insertion sort can be implemented with data parallelism, while quicksort can be implemented with recursive splitting. For the approximate k-NN algorithm, the data structure is usually a tree allowing quick access to the data, so it can be parallelized by task parallelism.

• Voting among the k nearest neighbors. This is a map-reduce procedure.
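The sketch below illustrates all three computations in plain numpy. The batched distance computation uses the expansion $\|a - b\|^2 = a \cdot a + b \cdot b - 2\, a \cdot b$ so that the bulk of the work becomes a single matrix product, the form that parallelizes well; labels are assumed to be small non-negative integers.

```python
import numpy as np

def knn_classify(train_X, train_y, query_X, k=5):
    # 1) Distance computation: all pairwise squared Euclidean distances
    #    at once; the cross term is a single matrix product.
    d2 = (np.sum(query_X ** 2, axis=1)[:, None]
          + np.sum(train_X ** 2, axis=1)[None, :]
          - 2.0 * query_X @ train_X.T)
    # 2) Finding the k nearest neighbors: a partial sort per query row.
    nn = np.argpartition(d2, k, axis=1)[:, :k]
    # 3) Voting: majority label among the k neighbors (a reduction).
    return np.array([np.bincount(train_y[row]).argmax() for row in nn])
```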
V. Garcia and E. Debreuve proposed a GPU implementation of the exact k-NN algorithm [1]. For distance computation, it exploits the data parallelism across distances of different pairs of data; for finding the k nearest neighbors, it exploits data parallelism in insertion sort. Finally, they compared the computation time of the GPU exact k-NN implementation with three serial implementations. The results are summarized in Figure 2, where n refers to the number of training and query data points, d refers to the dimensionality of the data, BF-Matlab is the exact k-NN implemented in MATLAB, BF-C is the exact k-NN implemented in C++, ANN-C++ is the approximate k-NN, and BF-CUDA is the parallel exact k-NN implemented on the GPU. According to the results, the efficiency gained by exploiting data parallelism on the GPU is very significant when the number of points and the dimensionality are large.

Figure 2: Computation time in seconds of the methods BF-Matlab, BF-C, ANN-C++, and BF-CUDA, reproduced from [1]. BF-CUDA is up to 407 times faster than BF-Matlab, 295 times faster than BF-C, and 148 times faster than ANN-C++.

        Method      n=1200   n=2400   n=4800    n=9600   n=19200   n=38400
  d=8   BF-Matlab     0.51     1.69     7.84     35.08    148.01    629.90
        BF-C          0.13     0.49     1.90      7.53     29.21    127.16
        ANN-C++       0.13     0.33     0.81      2.43      6.82     18.38
        BF-CUDA       0.01     0.02     0.04      0.13      0.43      1.89
  d=16  BF-Matlab     0.74     2.98    12.60     51.64    210.90    893.61
        BF-C          0.22     0.87     3.45     13.82     56.29    233.88
        ANN-C++       0.26     1.06     5.04     23.97     91.33    319.01
        BF-CUDA       0.01     0.02     0.06      0.17      0.60      2.51
  d=32  BF-Matlab     1.03     5.00    21.00     84.33    323.47   1400.61
        BF-C          0.45     1.79     7.51     30.23    116.35    568.53
        ANN-C++       0.39     1.78     9.21     39.37    166.98    688.55
        BF-CUDA       0.01     0.03     0.08      0.24      0.94      3.89
  d=64  BF-Matlab     2.24     9.37    38.16    149.76    606.71   2353.40
        BF-C          1.71     7.28    26.11    111.91    455.49   1680.37
        ANN-C++       0.78     3.56    14.66     59.28    242.98   1008.84
        BF-CUDA       0.02     0.04     0.11      0.40      1.57      6.65
  d=80  BF-Matlab     2.35    11.53    47.11    188.10    729.52   2852.68
        BF-C          2.13     8.43    33.40    145.07    530.44   2127.08
        ANN-C++       0.98     4.29    17.22     73.22    302.44   1176.39
        BF-CUDA       0.02     0.04     0.13      0.48      1.98      8.17
  d=96  BF-Matlab     3.30    13.89    55.77    231.69    901.38   3390.45
        BF-C          2.54    10.56    39.26    168.58    674.88   2649.24
        ANN-C++       1.20     4.96    19.68     82.45    339.81   1334.35
        BF-CUDA       0.02     0.05     0.15      0.57      2.29      9.61

The authors of [1] observe that computation time grows polynomially with n for every method except BF-CUDA, which is relatively insensitive to n over the tested range. They also profile how the total time decomposes between the steps (recall that n² distances are computed and n sortings are performed): the proportion of time spent on distance computation increases with n and d, while the proportion spent sorting distances increases with k. In the case where n = 4800, d = 32, and k = 20, the total time decomposes into 66% for distance computation and 32% for sorting.

4.4 Logistic Regression

Logistic regression is a probabilistic model for the classification problem. Given the training data $D$, it finds the set of parameters $\theta$ that maximizes the log likelihood of $p(\theta \mid D)$. It then uses the parameter set $\theta$, together with the probability model, to classify each query datum. In logistic regression, the conditional probability $p(y \mid x)$ is modeled by a function $\phi(\theta^T x)$, where $\phi$ is the logistic function. The conditional probability $p(Y = k \mid x, \theta)$ is therefore modeled by the softmax-linear model:

$$p(Y = k \mid x, \theta) = \frac{e^{\theta_k^T x}}{\sum_l e^{\theta_l^T x}}, \qquad (1)$$

where the summation in the denominator runs over all possible category labels $l$. Letting

$$\mu_i^k = \frac{e^{\theta_k^T x_i}}{\sum_l e^{\theta_l^T x_i}}, \qquad (2)$$

we can express the conditional probability of the category $y_i$ given data $x_i$ and parameter set $\theta$ as

$$p(y_i \mid x_i, \theta) = \prod_k (\mu_i^k)^{y_i^k}, \qquad (3)$$

where $k$ ranges over the possible categories, and $y_i^k = 1$ if and only if the category of datum $x_i$ is $k$. Therefore, the log likelihood can be computed as

$$\ell(\theta \mid D) = \sum_{i=1}^{n} \sum_k y_i^k \log \mu_i^k, \qquad (4)$$

where $n$ is the number of training data.
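A minimal numpy sketch of equations (1)-(4), assuming the categories are encoded as integers $0, \ldots, K-1$ (so $y_i^k$ is represented implicitly by the integer label):

```python
import numpy as np

def softmax_probs(theta, X):
    # theta: K x d matrix, one parameter row per category; X: n x d data.
    # Returns the n x K matrix of mu_i^k from equation (2); subtracting
    # the row maximum is a standard numerical-stability step.
    scores = X @ theta.T
    scores -= scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def log_likelihood(theta, X, y):
    # Equation (4): since y_i^k = 1 only for the true category, the inner
    # sum reduces to picking out log mu_i^{y_i} for each example.
    mu = softmax_probs(theta, X)
    return np.log(mu[np.arange(len(y)), y]).sum()
```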
In order to find the parameter set $\theta$ that maximizes the log likelihood, we can apply optimization algorithms: for example, nonlinear conjugate gradient, steepest descent, or the Newton-Raphson algorithm. Kernel logistic regression is a nonlinear form of logistic regression: by applying the kernel trick, the data $x$ is mapped into a high-dimensional Hilbert space, and logistic regression is then applied there.

Now we analyze the computation involved in the logistic regression method. The map-reduce pattern can be applied to the dot product operations, and dense linear algebra can be applied in the iterative optimization methods used for fitting the parameter set.

4.5 Support Vector Machine

Support Vector Machines are a class of classifiers based on the maximum-margin principle. In general terms, out of all possible planes that separate the classes, the SVM picks the classifier with the largest margin between the classes. This yields classifiers that fit the training data well, and also perform well on other data by avoiding overfitting. We consider the standard two-class soft-margin SVM classification problem (C-SVM), which classifies a given data point $x \in \mathbb{R}^n$ by assigning a label $y \in \{-1, 1\}$.

4.5.1 SVM Training

Given a labeled training set consisting of data points $x_i$, $i \in \{1, \ldots, l\}$, with their accompanying labels $y_i$, $i \in \{1, \ldots, l\}$, the SVM training problem can be written as the following Quadratic Program:

$$\max_\alpha F(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \alpha^T Q \alpha$$
$$\text{subject to } 0 \le \alpha_i \le C, \; \forall i \in 1 \ldots l, \quad y^T \alpha = 0 \qquad (5)$$

where $x_i \in \mathbb{R}^n$ is training data point $i$, $y_i \in \{-1, 1\}$ is the label attached to point $x_i$, and the $\alpha_i$ are a set of weights, one for each training point, which are optimized to determine the SVM classifier. $C$ is a parameter which trades classifier generality for accuracy on the training set, and $Q_{ij} = y_i y_j \Phi(x_i, x_j)$, where $\Phi(x_i, x_j)$ is a kernel function. Some popular kernel functions are:

Linear: $\Phi(x_i, x_j) = x_i^T x_j$
Polynomial: $\Phi(x_i, x_j; a, r, d) = (a\, x_i \cdot x_j + r)^d$
Gaussian: $\Phi(x_i, x_j; \gamma) = \exp(-\gamma \|x_i - x_j\|^2)$
Sigmoid: $\Phi(x_i, x_j; a, r) = \tanh(a\, x_i \cdot x_j + r)$

In the optimal solution of the QP, the points with non-zero $\alpha$'s are called Support Vectors. This problem can be solved by different techniques; one of the most popular is Sequential Minimal Optimization (SMO) [5], and we refer the reader to [5] for the details of the algorithm. The important patterns in the computation are:

• Computing the gradient of the objective function. This involves a parallel map computation: each map computes the kernel function of a point with a point or a set of points.

• Finding the next set of points to update. This computation involves a reduce function (usually min or max). This step is usually combined with the previous stage to produce a map-reduce pattern.

• Convergence. The points are updated until the objective stabilizes or the duality gap goes to zero. This is done using a simple while loop.

Other methods for solving the training problem involve linear algebra, map-reduce, and data parallelism.

4.5.2 SVM Classification

The SVM classification problem is as follows: for each data point $z$ which should be classified, compute

$$\hat{z} = \operatorname{sgn}\left( b + \sum_{i=1}^{l} y_i \alpha_i \Phi(x_i, z) \right) \qquad (6)$$

where $z$ is a point which needs to be classified, and all other variables remain as previously defined. The exact computational needs depend on the kernel used. For linear SVMs, the computation involves a single dot product between $z$ and $\sum_{i=1}^{l} y_i \alpha_i x_i$.
For other kernels, the computation involves a matrix-vector-product-like operation for each classification. If more than one point is being classified, it is better to combine all the points, leading to a matrix-matrix product. Matrix-matrix product (BLAS3) operations are highly optimized for compute throughput and are processed very efficiently. For example, the norm in the Gaussian kernel can be expanded as $\|x - y\|^2 = x \cdot x + y \cdot y - 2\, x \cdot y$ to recast the computation into vector norm computations and a matrix-matrix multiplication.
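A minimal numpy sketch of this batched evaluation for the Gaussian kernel, using the norm expansion above; support_X, alpha, y, and b are assumed to come from a completed training run (e.g., via SMO):

```python
import numpy as np

def gaussian_kernel_matrix(X, Z, gamma):
    # ||x - z||^2 = x.x + z.z - 2 x.z, recast so that the cross term is
    # one matrix-matrix (BLAS3) product.
    d2 = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Z ** 2, axis=1)[None, :]
          - 2.0 * X @ Z.T)
    return np.exp(-gamma * d2)

def svm_classify(support_X, y, alpha, b, Z, gamma):
    # Equation (6) for all m query points in Z at once: build the l x m
    # kernel matrix, then one matrix-vector product and a sign.
    K = gaussian_kernel_matrix(support_X, Z, gamma)
    return np.sign(b + (y * alpha) @ K)
```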
5 Related Patterns

1. Dense Linear Algebra. Dense matrix operations are needed for most of these techniques, especially the distance computations in k-NN and the kernel operations in SVMs.

2. Map Reduce. Many of the computations can be written as map-reduce, e.g., SVM training using SMO.

3. Data Parallelism. Data parallelism is prevalent in all of these techniques: e.g., each point can be classified separately, and distance computations can be done for each point in parallel.

References

[1] Garcia, V., Debreuve, E., and Barlaud, M., “Fast k nearest neighbor search using GPU,” Computer Vision and Pattern Recognition, pp. 1-6, June 2008.

[2] Jordan, M. I., “An Introduction to Probabilistic Graphical Models.”

[3] Dalal, N., and Triggs, B., “Histograms of Oriented Gradients for Human Detection,” Computer Vision and Pattern Recognition, pp. 886-893, 2005.

[4] Lowe, David G., “Distinctive Image Features from Scale-Invariant Keypoints,” International Journal of Computer Vision, pp. 91-110, 2004.

[5] Platt, J. C., “Fast training of support vector machines using sequential minimal optimization,” in Advances in Kernel Methods: Support Vector Learning, pp. 185-208, 1999.

[6] Shakhnarovich, G., Darrell, T., and Indyk, P., “Nearest-Neighbor Methods in Learning and Vision,” The MIT Press, 2006.

[7] Rish, Irina, “An empirical study of the naive Bayes classifier,” IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, 2001.

[8] Hilbe, Joseph M., “Logistic Regression Models,” Chapman and Hall/CRC Press, 2009.

[9] Freund, Y., and Schapire, R. E., “Large margin classification using the perceptron algorithm,” in Proceedings of the 11th Annual Conference on Computational Learning Theory, 1998.

[10] Cristianini, N., and Shawe-Taylor, J., “An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods,” Cambridge University Press, 2000.

[11] Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J., “Classification and Regression Trees,” Wadsworth and Brooks/Cole Advanced Books and Software, 1984.

[12] Freund, Y., and Schapire, R. E., “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, 55(1), pp. 119-139, August 1997.

[13] Ho, Tin, “Random Decision Forests,” 3rd Int'l Conf. on Document Analysis and Recognition, pp. 278-282, 1995.

[14] Smith, J. R., and Chang, S.-F., “Automated image retrieval using color and texture,” Columbia University, Tech. Report CU/CTR 408-95-14, 1995.

[15] Han, J., and Ma, K., “Fuzzy color histogram and its use in image retrieval,” IEEE Trans. on Image Processing, vol. 11, pp. 944-952, 2002.

[16] Huang, J., Kumar, S. R., Mitra, M., Zhu, W. J., and Zabih, R., “Image indexing using color correlograms,” Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 762-768, 1997.

[17] Portilla, J., and Simoncelli, E. P., “A Parametric Texture Model Based on Joint Statistics of Complex Wavelet Coefficients,” Int'l Journal of Computer Vision, October 2000.

[18] Do, Minh N., and Vetterli, Martin, “The Contourlet Transform: An Efficient Directional Multiresolution Image Representation,” IEEE Transactions on Image Processing, 2005.

[19] Daugman, J. G., “Two-dimensional spectral analysis of cortical receptive field profiles,” Vision Research, 20(10), pp. 847-856, 1980.

[20] Oraintara, S., and Nguyen, T. T., “Using Phase and Magnitude Information of the Complex Directional Filter Bank for Texture Image Retrieval,” Proc. IEEE Int. Conf. on Image Processing, vol. 4, pp. 61-64, 2007.

[21] Maire, M., Arbeláez, P., Fowlkes, C., and Malik, J., “Using contours to detect and localize junctions in natural images,” in CVPR, 2008.

[22] Canny, J., “A Computational Approach to Edge Detection,” IEEE Trans. Pattern Analysis and Machine Intelligence, 8, pp. 679-714, 1986.

[23] Gonzalez, R., and Woods, R., “Digital Image Processing,” Addison Wesley, pp. 414-428, 1992.

6 Authors

Narayanan Sundaram, Bor-Yiing Su, Mark Murphy, Bryan Catanzaro