Category Archives: machine learning

[FWD] Two stories from a research paper: Content Without Context is Meaningles

Two stories from a research paper: Content Without Context is Meaningless.

1.1 Machine Learning Hammer

Mark Twain once said: “To a man with a hammer, everything looks like a nail.” His observation is definitely very relevant to current trends in content analysis. We have a Machine Learning Hammer (ML Hammer) that we want to use for solving any problem that needs to be solved. The problem is neither with learning nor with the hammer; the problem is with people who fail to learn that not every problem is a new learning problem [1]. … If we can identify such a feature set, then we can easily model each object by its appropriate feature values. The challenges are read more

Tikz example – Kernel trick

In Support Vector Machines, the learning algorithms can only solve linearly separable problems. However, this isn’t strictly true. Since all feature vectors only occurred in dot-products k(xi,xj)= xi·xj, the “kernel trick” can be applied, by replacing dot-products by another kernel (Boser et al., 1992). A more formal statement of kernel trick is that

Given an algorithm which is formulated in terms of a positive definite kernel k, one can construct an alternative algorithm by replacing k by another positive definite kernel k∗ (Schlkopf and Smola, 2002).

The best known application of the kernel trick is in the case where k is the dot-product, but the trick is not limited to that case: both k and k can be nonlinear kernels. More general, given any feature map φ from observations into a inner product space, we obtain a kernel k(xi,xj)=φ(xi)·φ(xj).

This figure was drawn for “kernel trick” with samples from two classes.

Tikz example – SVM trained with samples from two classes

In machine learning, Support Vector Machines are supervised learning models used for classification and regression analysis. The basic SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output, making it a non-probabilistic binary linear classifier. To classify examples, we choose the hyperplane so that the distance from it to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane and the linear classifier it defines is known as a maximum margin classifier.

Scale, Standardize, and Normalize Data

Note: Contents and examples in this article are partially from Scikit-learn-Preprocessing data and I normalize/standardize/rescale the data


Scaling a vector means to add/substract a constant, then multiply/divide by another constant, so the features can lie between given minimum and maximum values. The motivation to use this scaling include robustness to very small standard deviations of features and preserving zero entries in sparse data. Normally, the given range is [0,1]

For example, if we have a dataset like below, read more