Effective machine learning algorithms fundamentally rely on choosing a good data representation. A good data representation extracts explanatory factors of variation in the data. However, significant effort goes into creating ETL pipelines that construct these representations. This overhead, which accounts for much of a data scientist's labor, highlights a weakness of current algorithms: “their inability to extract and organize the discriminative information from the data.”
In other words, it would be highly desirable to make learning algorithms less dependent on feature engineering. This is what representation learning is trying to do, and it does so by automating the extraction of useful information. The information is considered useful if the learned representation is helpful as input to supervised prediction (e.g. to a classifier or other predictor).
In the longer run, this emerging field hopes to create intelligent systems that can “identify and disentangle the underlying explanatory factors hidden in the observed milieu of low-level sensory data.”
Multi-task training: An interesting indirect consequence of Representation Learning
Multi-task training “is the ability of a learning algorithm to exploit commonalities between different learning tasks in order to share statistical strength, and transfer knowledge across tasks. Representation learning has an advantage for such tasks because they learn representations that capture underlying factors, a subset of which may be relevant for each particular task.”
To understand multi-task training, let's start with a concrete application: word embeddings. Word embeddings show how better ways of representing data can fall out of optimizing layered models. They are therefore one of the best places to gain intuition about why deep learning is so effective.
A word embedding is a parameterized function mapping words in some language to high-dimensional vectors (perhaps 200 to 500 dimensions).
W("cat") = (0.2, -0.4, 0.7, …)
W("mat") = (0.0, 0.6, -0.1, …)
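In code, W is nothing more than a lookup table: one learned row of numbers per word. A minimal sketch in NumPy (toy vocabulary, toy dimensions, and random values standing in for a trained embedding):

```python
import numpy as np

# Toy vocabulary; a real embedding covers thousands of words.
vocab = ["cat", "sat", "on", "the", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}

dim = 4  # real embeddings use ~200-500 dimensions
rng = np.random.default_rng(0)

# W is a lookup table: one row per word, randomly initialized
# and later adjusted by training.
W = rng.normal(size=(len(vocab), dim))

def embed(word):
    """The parameterized function W: word -> vector."""
    return W[word_to_id[word]]

print(embed("cat"))  # a 4-dimensional vector
```

Training never touches the table structure itself; it only nudges the numbers in each row.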
So let's take a neural net whose task is to decide whether a sequence of 5 words (a 5-gram) is 'valid'. Our matrix W is randomly initialized, so the "cat" vector, the "mat" vector, and all the other word vectors start out meaningless. W therefore has to learn meaningful representations to fulfill the task of validating 5-grams. This meaningful representation is then tested with a 'validator module' called R, which classifies 5-gram validity:
R(W("cat"), W("sat"), W("on"), W("the"), W("mat")) = 1
R(W("cat"), W("sat"), W("song"), W("the"), W("mat")) = 0
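One simple way to build such an R: concatenate the five word vectors and score them with a single linear layer plus a sigmoid. This is a toy sketch (random, untrained weights, so the score itself is meaningless; a real validator would be deeper and trained jointly with W):

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 4
vocab = ["cat", "sat", "on", "the", "mat", "song"]
word_to_id = {w: i for i, w in enumerate(vocab)}
W = rng.normal(size=(len(vocab), dim))

# R: one linear layer over the concatenated word vectors,
# squashed to a validity score in (0, 1).
weights = rng.normal(size=5 * dim)
bias = 0.0

def R(*words):
    x = np.concatenate([W[word_to_id[w]] for w in words])
    return 1.0 / (1.0 + np.exp(-(weights @ x + bias)))

score = R("cat", "sat", "on", "the", "mat")
```

During training, the gradient of the classification loss flows through R back into the rows of W, which is exactly how W ends up learning something useful.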
However, building a 5-gram validator module is not that interesting. What is interesting is training W, because it ends up clustering similar words together, even though it was never told to. It's important to appreciate that this property is a side-effect of training W, even though it seems natural for similar meanings to have similar vector representations.
The effective data representation trained into W makes sense for the task of R: if you swap similar words, the 'validity score' shouldn't be affected much. Being able to swap similar words means that you can generalize the validity of one sentence (e.g. "the house is red") to an entire class of similar sentences ("the house is blue", "the building is green"). This generalizable trait of W even seems necessary to complete the task, given that the number of possible 5-grams is massive compared to any reasonable training set. The generalization works because each new word can swap in for the other words in its 'class', so a single training sentence covers an exponentially growing set of sentence combinations.
Word embeddings have an even more remarkable property of being able to represent analogies by taking the difference between vectors:
W("woman") − W("man") ≃ W("aunt") − W("uncle")
W("woman") − W("man") ≃ W("queen") − W("king")
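This analogy arithmetic is easy to play with. Below is a toy demonstration with hand-built vectors whose first coordinate explicitly encodes gender, mimicking the structure that training discovers on its own (real embeddings encode it in a less tidy, distributed way):

```python
import numpy as np

# Hand-built toy vectors: first coordinate ~ gender,
# remaining coordinates ~ other meaning.
E = {
    "man":   np.array([ 1.0, 0.9, 0.1]),
    "woman": np.array([-1.0, 0.9, 0.1]),
    "king":  np.array([ 1.0, 0.2, 0.8]),
    "queen": np.array([-1.0, 0.2, 0.8]),
    "uncle": np.array([ 1.0, 0.5, 0.4]),
    "aunt":  np.array([-1.0, 0.5, 0.4]),
}

def nearest(v):
    """Word whose vector is closest to v by cosine similarity."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(E, key=lambda w: cos(E[w], v))

# king - man + woman lands on queen
print(nearest(E["king"] - E["man"] + E["woman"]))  # -> queen
```

The subtraction cancels out the shared meaning and leaves the gender offset, which is why the same difference vector works across many word pairs.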
For the above example, this means that there is probably a dimension for gender, or at least a consistent way of encoding gender. This makes sense for training 5-gram validity, because "she" is not your "uncle", and "he" cannot be your "queen".
Again, it's important to notice that all these properties of W, such as encoding analogies with difference vectors, merely popped out of the optimization process of classifying valid 5-grams. And the choice of this particular task is almost irrelevant. We could instead have built a predictor for the next word in a sentence, and it would have trained a similar representation W to the one we saw above. This means our trained matrix W can now be applied to a bunch of tasks: word similarity, word analogy, 5-gram validation, and next-word prediction. Training a neural net for one trick, and then being able to use it for a bunch of other tricks, is sometimes called multi-task training, or transfer learning.
Shared embeddings push this further: we can embed words from two different languages into a single shared space. For this to work, we need to optimize for an additional property: pairs of words that we know are translations of each other should be close together. This co-trains a joint word embedding that can represent both languages.
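That extra property can be written down as an alignment loss over known translation pairs. A toy NumPy sketch (made-up vectors, a hand-rolled gradient step, and hypothetical word pairs; a real system would fold this loss into the full training objective of both languages):

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 3

# Separate toy embedding tables for two languages (random stand-ins).
en = {"house": rng.normal(size=dim), "cat": rng.normal(size=dim)}
fr = {"maison": rng.normal(size=dim), "chat": rng.normal(size=dim)}

# Known translation pairs: the extra training signal.
pairs = [("house", "maison"), ("cat", "chat")]

def alignment_loss(en, fr, pairs):
    """Mean squared distance between known translation pairs;
    minimizing it pulls translated words together."""
    return np.mean([np.sum((en[e] - fr[f]) ** 2) for e, f in pairs])

loss_before = alignment_loss(en, fr, pairs)

# One gradient-descent step on the second language's vectors.
lr = 0.1
for e, f in pairs:
    fr[f] -= lr * 2 * (fr[f] - en[e])

loss_after = alignment_loss(en, fr, pairs)
```

Each step shrinks the gap between translated pairs, while the per-language tasks keep the rest of each embedding's structure intact.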
But this joint embedding makes intuitive sense because both languages share a similar ‘shape’, and all we had to do was line them up at different points. Deep learning has recently taken this a step further by finding a joint embedding between words and images. For this, we create a vector representation of an image through a ConvNet:
Image("cat") = (-0.6, 0.05, 0.1, …)
We then place these image representations near words that describe the image.
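Mechanically, this usually means learning a projection from the ConvNet's feature space into the word-embedding space, then reading off the nearest word. A toy sketch (the word vectors, feature vector, and projection matrix `M` are all random stand-ins; in practice `M` is trained so that images land near the words that describe them):

```python
import numpy as np

rng = np.random.default_rng(3)
dim = 4

# Pretrained word vectors (random stand-ins for a trained W).
words = {"cat": rng.normal(size=dim), "dog": rng.normal(size=dim)}

# Stand-in for a ConvNet's feature vector for some image.
conv_features = rng.normal(size=8)

# A learned linear map from image-feature space into word space.
M = rng.normal(size=(dim, 8))

def image_embedding(features):
    """Project ConvNet features into the shared word space."""
    return M @ features

def nearest_word(v):
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(words, key=lambda w: cos(words[w], v))

# After training M, a cat image would land near words["cat"].
label = nearest_word(image_embedding(conv_features))
```

Because the target space is a word embedding rather than a fixed label set, an image of an unseen category can still land near a sensible word.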
Multi-task training (one data representation to solve many tasks) and shared embeddings (joining multiple data types into a single representation) are very exciting areas of research, and they show why representation learning is so compelling.
Actionable Next Steps: Build a word embedding in TensorFlow!