Thursday, May 14, 2015

Word vectors used in Deep Learning are generally not One-hot Representations


To turn a natural language understanding problem into a machine learning problem, the first step is to find a way to represent the symbols mathematically. The most intuitive, and by far the most commonly used, word representation in NLP is the One-hot Representation, which represents each word as a very long vector. The dimension of this vector is the vocabulary size; almost all entries are 0, and only one dimension has the value 1, and that dimension identifies the current word. For example, "microphone" might be represented as [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 ...] and "Mike" as [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ...]. Every word is a single 1 in a vast sea of 0s. If this One-hot Representation is stored sparsely, it becomes very compact: simply assign each word a numeric ID. In the example above, "microphone" would be recorded as 3 and "Mike" as 8 (assuming we count from 0). If you are implementing this in code, a hash table that assigns each word a number is enough. This concise representation, combined with algorithms such as maximum entropy, SVM and CRF, has already handled the mainstream tasks of NLP quite well. Of course, the representation has one important problem, the "vocabulary gap" phenomenon: any two words are completely isolated from each other. From the two vectors alone you cannot tell whether the words are related at all; even synonyms like "microphone" and "Mike" are not spared.
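To make the idea concrete, here is a minimal sketch (my own toy example, not code from any particular library) of how a One-hot Representation reduces to integer IDs assigned by a hash table; the vocabulary and the words in it are made up:

```python
# A minimal sketch: a hash table (dict) assigns each word a numeric ID, and the
# one-hot vector is all zeros except for a 1 at that ID. Vocabulary is made up.
vocab = ["weather", "hello", "sun", "microphone", "rain", "cloud", "cat", "dog", "Mike"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)
    vec[word_to_id[word]] = 1
    return vec

print(word_to_id["microphone"], word_to_id["Mike"])  # 3 8, matching the example above
print(one_hot("microphone"))                         # [0, 0, 0, 1, 0, 0, 0, 0, 0]
# Any two distinct one-hot vectors are orthogonal, so "microphone" and "Mike"
# look just as unrelated as "microphone" and "weather": the "vocabulary gap".
```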
The word vectors generally used in Deep Learning are not the very long vectors of the One-hot Representation just mentioned, but low-dimensional real-valued vectors in a Distributed Representation (I am not sure how this should be translated, because there is also a representation called "Distributional Representation", which is a different concept). Such a vector typically looks like this: [0.792, -0.177, -0.107, 0.109, -0.542, ...]. Dimensions of 50 and 100 are the most common. This representation is not unique; the mainstream methods currently used to compute such vectors will be discussed below. (In my opinion) the biggest contribution of the Distributed representation is that it brings related or similar words closer together in distance. The distance between vectors can be measured with the most traditional Euclidean distance, or with the cosine of the angle between them. With vectors represented this way, the distance between "Mike" and "microphone" will be far smaller than the distance between "Mike" and "weather". Ideally, perhaps "Mike" and "microphone" should have exactly the same representation, but because some people also write the English name "Mike" the same way, the word "Mike" picks up some name-like semantics and so will not be exactly identical to "microphone".
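As a toy illustration (the numbers below are made up, not real trained vectors), this is how the two distance measures just mentioned could be computed for such low-dimensional vectors:

```python
import math

# Made-up 5-dimensional "word vectors": related words end up close to each
# other, unrelated words far apart, under either cosine or Euclidean distance.
mike       = [0.79, -0.18, -0.11, 0.11, -0.54]
microphone = [0.75, -0.20, -0.05, 0.13, -0.50]
weather    = [-0.42, 0.61, 0.30, -0.25, 0.08]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(cosine(mike, microphone), euclidean(mike, microphone))  # cos near 1, small distance
print(cosine(mike, weather), euclidean(mike, weather))        # cos much lower, large distance
```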
1. The origin of word vectors

The Distributed representation was first proposed by Hinton in his 1986 paper "Learning distributed representations of concepts". Although that paper did not say anything about giving words a Distributed representation (I even have the silly suspicion that the paper was written to advertise the BP network he had just proposed), at least this advanced idea planted a spark in people's minds at the time, and it began to receive real attention after 2000. A Distributed representation used to represent words is usually called a "Word Representation" or "Word Embedding", commonly known in Chinese as a "word vector". It really can only be called a "common name", not a proper translation. Half a year ago I wanted to translate it, but I simply could not figure out how "Embedding" should be rendered, and later just got used to the common name -_-||| Better translations are welcome. (Update: Zhou Zhihua of Nanjing University gave a proper translation in a Weibo post: "word embedding".) For the meaning of the term "embedding", see the corresponding Wikipedia page (link). From here on, every mention of "word vector" means a word vector in the Distributed Representation.
If the traditional sparse representation is used for words, it causes the curse of dimensionality when solving certain tasks (such as building a language model) [Bengio 2003]. Low-dimensional word vectors do not have this problem. From a practical point of view, applying Deep Learning to high-dimensional features also has an almost unacceptable complexity, so low-dimensional word vectors are highly sought after here as well. And, as mentioned in the previous section, similar words get similar vectors, which gives models designed on top of word vectors a built-in smoothing capability and makes those models look very elegant.

2. How word vectors are trained

To explain how word vectors are trained, one has to mention language models. In all the training methods I have learned about so far, the word vectors are obtained as a by-product of training a language model. This is fairly easy to understand: to learn something from unlabeled natural text, one can really only extract information such as word frequencies, word co-occurrences and word collocations. Among the tasks built on statistics gathered from natural text, building a language model is undoubtedly the one with the most demanding requirements for accuracy (which does not rule out someone inventing a better and more useful approach later). Since building a language model is such a demanding task, it necessarily requires finer statistics and analysis of the language, as well as better models and more data to support it. It is therefore not hard to understand that the best word vectors currently come from this task. The work presented here all learns word vectors from unlabeled plain text in an unsupervised way (the language model was originally based on this idea); one can guess that if labeled corpora were used, there would certainly be more ways to train word vectors. However, given the size of corpora currently available, methods that use unlabeled corpora are still the more reliable choice. The most classic work on training word vectors consists of three papers: C&W 2008, M&H 2008 and Mikolov 2010. Before discussing them, though, I have to introduce Bengio's classic work in this line of research.

2.0 A brief introduction to language models

A small advertisement break: a brief introduction to language models; those who already know them can skip this section. A language model essentially checks whether a sentence is something a normal person would say. This is useful: for instance, after machine translation or speech recognition produces several candidates, a language model can be used to pick the most plausible result. It can also be used in other NLP tasks. Formally, a language model takes a string and gives the probability that it is natural language, $P(w_1, w_2, \dots, w_t)$, where $w_1$ through $w_t$ are the words of the sentence in order. A simple corollary is:

$P(w_1, w_2, \dots, w_t) = P(w_1) \times P(w_2 | w_1) \times P(w_3 | w_1, w_2) \times \dots \times P(w_t | w_1, w_2, \dots, w_{t-1})$

Commonly used language models all approximate $P(w_t | w_1, w_2, \dots, w_{t-1})$ in some way. The n-gram model, for example, approximates it with $P(w_t | w_{t-n+1}, \dots, w_{t-1})$. Incidentally, since the papers introduced below differ too much in notation, this post tries to use Bengio 2003's notation throughout (slightly simplified), to make comparison and analysis between the methods easier.
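As a small sketch of the decomposition above (my own illustration, not code from any of the papers), the following shows the chain-rule product with the n-gram truncation of the context; `cond_prob` is a hypothetical estimator left as a stub:

```python
# Chain-rule decomposition of P(w_1,...,w_t) with an n-gram approximation:
# each conditional probability only looks at the previous n-1 words.
def sentence_prob(words, cond_prob, n=3):
    """P(w_1,...,w_t) ~ product over t of P(w_t | w_{t-n+1},...,w_{t-1})."""
    prob = 1.0
    for t, w in enumerate(words):
        context = tuple(words[max(0, t - n + 1):t])  # at most the n-1 previous words
        prob *= cond_prob(w, context)
    return prob

def uniform_cond_prob(word, context, vocab_size=10000):
    # placeholder estimator; a real model would use smoothed counts or a neural net
    return 1.0 / vocab_size

print(sentence_prob(["the", "cat", "sat"], uniform_cond_prob))  # 1e-12
```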
2.1 Bengio's classic work

In the classic figure from Bengio's paper, the bottom-most $w_{t-n+1}, \dots, w_{t-2}, w_{t-1}$ are the previous $n-1$ words. The task is to predict the next word $w_t$ from these $n-1$ known words. $C(w)$ denotes the word vector corresponding to word $w$; the whole model uses a single set of word vectors, stored in a matrix $C$ (a $|V| \times m$ matrix), where $|V|$ is the vocabulary size (the total number of distinct words in the corpus) and $m$ is the dimension of the word vectors. Converting $w$ to $C(w)$ simply means taking a row out of the matrix. The first layer of the network (the input layer) concatenates the $n-1$ vectors $C(w_{t-n+1}), \dots, C(w_{t-2}), C(w_{t-1})$ end to end into an $(n-1)m$-dimensional vector, written $x$ below. The second layer (the hidden layer) is like an ordinary neural network layer, computed directly as $d + Hx$, where $d$ is a bias term; $\tanh$ is then used as the activation function. The third layer (the output layer) has $|V|$ nodes, and each node $y_i$ represents the unnormalized log probability that the next word is $i$. Finally, a softmax activation is used to normalize the output $y$ into probabilities. The final formula for $y$ is:

$y = b + Wx + U \tanh(d + Hx)$
In the formula, $U$ (a $|V| \times h$ matrix) holds the parameters from the hidden layer to the output layer, and most of the computation in the whole model is concentrated in the matrix multiplication between $U$ and the hidden layer. The three later works all simplify this part in some way to speed up the computation. The formula also contains a matrix $W$ ($|V| \times (n-1)m$), which holds the direct connections from the input layer to the output layer. A direct connection is simply a linear transformation straight from the input layer to the output layer, which seems to be a common technique in neural networks (I have not examined it carefully). If direct connections are not wanted, setting $W$ to 0 is enough. In his final experiments, Bengio found that although the direct connections do not improve the model, they cut the number of iterations roughly in half. He also suspected that without the direct connections, the model might generate better word vectors.
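Putting the pieces together, here is a rough sketch of the forward pass of this model in NumPy; the sizes, the initialization and the function name are my own placeholders, not Bengio's actual implementation:

```python
import numpy as np

# Rough sketch of the forward pass y = b + Wx + U*tanh(d + Hx), with placeholder sizes.
V, m, n, h = 10000, 50, 4, 100              # |V|, word vector dim, n-gram order, hidden size
C = np.random.randn(V, m) * 0.01            # word vector matrix, one row per word
H = np.random.randn(h, (n - 1) * m) * 0.01  # input -> hidden
d = np.zeros(h)                             # hidden bias
U = np.random.randn(V, h) * 0.01            # hidden -> output
W = np.zeros((V, (n - 1) * m))              # optional direct input -> output connections
b = np.zeros(V)                             # output bias

def forward(context_ids):
    """context_ids: IDs of the previous n-1 words; returns P(w_t | context)."""
    x = np.concatenate([C[i] for i in context_ids])  # input layer: (n-1)m dims
    hidden = np.tanh(d + H @ x)                      # hidden layer
    y = b + W @ x + U @ hidden                       # unnormalized log probabilities
    p = np.exp(y - y.max())
    return p / p.sum()                               # softmax

probs = forward([17, 42, 256])   # any 3 word IDs when n = 4
```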
Now everything is ready; the model just has to be optimized with stochastic gradient descent. Note that in an ordinary neural network the input layer is just an input value, whereas here the input layer $x$ is itself a parameter (stored in $C$) and also needs to be optimized. Once the optimization is done, we have both the word vectors and the language model. A language model obtained this way comes with smoothing built in and needs none of the complicated smoothing algorithms of traditional n-gram models. Bengio's comparison experiments on the AP News dataset also show that his model performs 10% to 20% better than an ordinary n-gram model with a carefully designed smoothing algorithm.
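Continuing the sketch above, one hand-derived SGD step might look like the following; the point to notice is the final loop, where the rows of $C$ for the words in the current window are updated just like any other parameter. This is only an illustration under the same placeholder setup, not the paper's code:

```python
# One SGD step on a single (context, target) pair, gradients written out by hand.
def sgd_step(context_ids, target_id, lr=0.01):
    global C, H, d, U, b, W
    x = np.concatenate([C[i] for i in context_ids])
    hid = np.tanh(d + H @ x)
    y = b + W @ x + U @ hid
    p = np.exp(y - y.max()); p /= p.sum()

    dy = p.copy(); dy[target_id] -= 1.0      # gradient of -log p[target] w.r.t. y
    da = (U.T @ dy) * (1.0 - hid ** 2)       # backprop through tanh
    dx = H.T @ da + W.T @ dy                 # gradient reaching the input layer

    U -= lr * np.outer(dy, hid); b -= lr * dy
    W -= lr * np.outer(dy, x)
    H -= lr * np.outer(da, x);   d -= lr * da
    # the word vectors are parameters too: only the rows of C for the words
    # in the current window receive an update
    for pos, wid in enumerate(context_ids):
        C[wid] -= lr * dx[pos * m:(pos + 1) * m]

sgd_step([17, 42, 256], target_id=7)
```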
Before ending the introduction to the great Bengio's classic work, a bit more gossip. In the future work section of his JMLR paper, he proposed an energy function that considers the input vectors and the output vectors together, and takes minimizing this energy function as the optimization objective; M&H's later work was built on this basis. He mentioned that polysemy remains to be resolved; nine years later, Huang proposed a solution. He also casually mentioned in the paper (not in the Future Work section) that several methods could be used to reduce the number of parameters, for example using a recurrent neural network; Mikolov later published a whole series of papers along this direction, right up to finishing his PhD. A master is indeed a master.

2.2 C&W's SENNA
Ronan Collobert and Jason Weston first introduced their method for computing word vectors in "A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning", published at ICML 2008. As with the previous classic, if you want to read it, you should go to the version they later consolidated and published in JMLR in 2011: "Natural Language Processing (Almost) from Scratch". That paper brings together a great deal of their work and summarizes it very systematically. The title of this JMLR paper is also quite domineering: doing NLP (almost) from scratch. They also open-sourced the system described in the paper, called SENNA (home page link); its more than 3500 lines of pure C code are written very clearly. It was only by relying on this code that I slowly worked through the paper. Unfortunately, the code contains only the testing part, not the training part.
The main purpose of the C&W paper is actually not to produce a good set of word vectors, and they do not even want to train a language model; rather, they want to use the word vectors to carry out the various tasks within NLP, such as part-of-speech tagging, named entity recognition, phrase recognition (chunking), semantic role labeling, and so on. Because of this different goal, C&W's method for training word vectors is, in my opinion, the most unusual. They do not try to approximate $P(w_t | w_1, w_2, \dots, w_{t-1})$, but instead directly try to approximate $P(w_1, w_2, \dots, w_t)$. In practice, they do not compute the probability of a string; instead they compute a score $f(w_{t-n+1}, \dots, w_{t-1}, w_t)$ for a window of $n$ consecutive words. The higher the score $f$, the more the window reads like a normal sentence; a low score means the window is not very reasonable; and if a few words are just piled together at random, the score will be very low.
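As a rough idea of what such a scoring function might look like (the architecture and sizes here are my own placeholders, not C&W's actual SENNA network), one could concatenate the word vectors of a window and map them through a small network to a single real-valued score:

```python
import numpy as np

# Placeholder window-scoring function f: higher score = more natural-looking text.
V, m, n, h = 10000, 50, 5, 100
C  = np.random.randn(V, m) * 0.01       # word vector matrix
W1 = np.random.randn(h, n * m) * 0.01   # window -> hidden
b1 = np.zeros(h)
w2 = np.random.randn(h) * 0.01          # hidden -> scalar score

def score(window_ids):
    """f(w_{t-n+1}, ..., w_t): a real-valued plausibility score for the window."""
    x = np.concatenate([C[i] for i in window_ids])   # n word vectors, end to end
    return float(w2 @ np.tanh(b1 + W1 @ x))

print(score([12, 7, 301, 44, 9]))       # any 5 word IDs when n = 5
```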
