Ultimately, the purpose of choosing a particular loss function is to get
1. A smooth partial derivative with respect to the weights
2. A well-behaved convex curve, so that the global minimum can be reached. However, many other factors also come into play when searching for a global minimum (learning rate, shape of the function, etc.). A minimal example of how the choice of loss affects the gradient follows this list.
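As a minimal sketch of how the loss choice affects the gradient (assuming a single sigmoid neuron and NumPy; the toy values of x, y, w, and b are illustrative only), the snippet below compares the partial derivative with respect to a weight under mean squared error and under cross-entropy. With cross-entropy the sigmoid's derivative cancels, so the gradient stays usable even when the neuron saturates.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One neuron: a = sigmoid(w * x + b), with target y (toy values)
x, y = 1.0, 1.0
w, b = -4.0, 0.0                       # weight chosen so the neuron starts saturated
a = sigmoid(w * x + b)

# MSE: E = 0.5 * (a - y)^2, so dE/dw = (a - y) * a * (1 - a) * x
grad_mse = (a - y) * a * (1 - a) * x   # carries the sigmoid derivative a*(1-a)

# Cross-entropy: E = -(y*log(a) + (1-y)*log(1-a)), so dE/dw = (a - y) * x
grad_xent = (a - y) * x                # the sigmoid derivative cancels out

print(grad_mse, grad_xent)             # the MSE gradient is tiny when the neuron saturates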
For backpropagation to work, two basic assumptions are made about the error function:
1. The total error can be written as a sum of the individual errors of the training samples in the minibatch,
E = sum(Ex)
2. The error can be written as a function of the outputs of the network (see the sketch after this list).
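A minimal sketch of these two assumptions (assuming NumPy and a squared-error loss; the array shapes and values are illustrative): each per-sample error is a function of the network outputs alone, and the minibatch error is simply their sum.

import numpy as np

outputs = np.array([[0.2, 0.8], [0.6, 0.4], [0.9, 0.1]])  # network outputs for 3 samples
targets = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]])  # corresponding targets

# Assumption 2: each per-sample error Ex depends only on the network outputs
per_sample_error = 0.5 * np.sum((outputs - targets) ** 2, axis=1)

# Assumption 1: the total error is the sum of the per-sample errors, E = sum(Ex)
E = np.sum(per_sample_error)
print(per_sample_error, E)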
Backpropagation consists of two parts:
1. The forward pass, in which the inputs are propagated through the (initialized) network and all the intermediate values are stored
2. The backward pass, in which the stored values are used to compute the gradients and update the weights
Partial derivatives, the chain rule, and linear algebra are the main tools required to work through backpropagation.
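To make the two passes concrete, here is a minimal sketch (assuming NumPy, one hidden layer with sigmoid activations, squared-error loss, and plain gradient descent; all sizes and names are illustrative, not a reference implementation). The forward pass stores the intermediate activations; the backward pass applies the chain rule to those stored values and updates the weights.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))           # 4 samples, 3 input features (toy data)
Y = rng.random(size=(4, 2))           # 4 samples, 2 targets in [0, 1]
W1 = rng.normal(size=(3, 5)) * 0.1    # input -> hidden weights
W2 = rng.normal(size=(5, 2)) * 0.1    # hidden -> output weights
lr = 0.1

for step in range(100):
    # ---- forward pass: propagate the inputs and store the intermediate values ----
    z1 = X @ W1                        # hidden pre-activation
    a1 = sigmoid(z1)                   # hidden activation (stored for the backward pass)
    z2 = a1 @ W2                       # output pre-activation
    a2 = sigmoid(z2)                   # network output
    E = 0.5 * np.sum((a2 - Y) ** 2)    # total error, a sum of per-sample errors

    # ---- backward pass: chain rule applied to the stored values ----
    delta2 = (a2 - Y) * a2 * (1 - a2)         # dE/dz2
    grad_W2 = a1.T @ delta2                   # dE/dW2
    delta1 = (delta2 @ W2.T) * a1 * (1 - a1)  # dE/dz1, chained back through W2
    grad_W1 = X.T @ delta1                    # dE/dW1

    # gradient-descent weight update
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1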
Note: Using CBOW on smaller datasets smooths the distributional information, because the model treats the entire context as a single observation.
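As a rough sketch of what "a single observation" means here (assuming NumPy; the vocabulary size, embedding dimension, and context indices are made up for illustration), CBOW averages the embeddings of all context words into one vector before predicting the centre word, so the individual context words never act as separate training examples.

import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 4
embeddings = rng.normal(size=(vocab_size, embed_dim))  # input embedding matrix

context_ids = [1, 3, 5, 7]            # indices of the words around the centre word
# the whole context collapses into one averaged vector: a single observation
hidden = embeddings[context_ids].mean(axis=0)
print(hidden.shape)                   # (embed_dim,) regardless of the context size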
Subsampling Frequent Words
The sampling rate is the key factor in deciding whether a frequent word is kept. The survival (keep) probability of a word with relative frequency x is P(x) = (sqrt(x/0.001) + 1) * (0.001/x), where 0.001 is the constant chosen as the sampling rate (credit: http://www.mccormickml.com).
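A minimal sketch of this rule (assuming NumPy and made-up word counts): each word's relative frequency x is plugged into P(x), and an occurrence of the word is kept only if a uniform random draw falls below that probability.

import numpy as np

rng = np.random.default_rng(0)
sample_rate = 0.001

word_counts = {"the": 50000, "learning": 300, "gradient": 120}  # toy corpus counts
total = sum(word_counts.values())

def keep_probability(word):
    x = word_counts[word] / total     # relative frequency of the word in the corpus
    p = (np.sqrt(x / sample_rate) + 1) * (sample_rate / x)
    return min(1.0, p)                # rare words get P(x) >= 1, i.e. they are always kept

def keep(word):
    # survive subsampling with probability P(x); frequent words are dropped more often
    return rng.random() < keep_probability(word)

for w in word_counts:
    print(w, round(keep_probability(w), 3))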
Negative Sampling
Negative sampling is a simplified form of the noise contrastive estimation (NCE) approach: it makes simplifying assumptions about the number of noise (negative) samples and the distribution they are drawn from.
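As a minimal sketch of how the negative (noise) samples can be drawn (assuming NumPy and the unigram distribution raised to the 3/4 power, which is the convention used in the original word2vec implementation; the counts and k below are illustrative):

import numpy as np

rng = np.random.default_rng(0)

word_counts = np.array([5000, 1200, 300, 80, 20])  # toy unigram counts for 5 words
noise_dist = word_counts ** 0.75                   # unigram distribution to the 3/4 power
noise_dist = noise_dist / noise_dist.sum()         # normalize into a probability distribution

k = 5             # number of negative samples per (centre, context) pair
positive_id = 2   # id of the true context word (illustrative)

# draw k noise word ids; in practice, draws that collide with the positive word are redrawn
negatives = rng.choice(len(word_counts), size=k, p=noise_dist)
print(negatives)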










