Generates skipgram word pairs.
skipgrams(sequence, vocabulary_size, window_size = 4, negative_samples = 1, shuffle = TRUE, categorical = FALSE, sampling_table = NULL, seed = NULL)
sequence: A word sequence (sentence), encoded as a list of word indices (integers). If using a sampling_table, word indices are expected to match the rank of the words in a reference dataset (e.g. 10 would encode the 10th most frequently occurring token). Note that index 0 is expected to be a non-word and will be skipped.
vocabulary_size: Int, maximum possible word index + 1.
window_size: Int, size of sampling windows (technically half-window). The window of a word w_i extends up to window_size positions on either side of position i.
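The half-window idea can be illustrated with a small base-R sketch. `window_positions` is a hypothetical helper, not part of the package; it assumes 1-based positions, clips the window at the ends of the sequence, and assumes a word is never paired with itself.

```r
# Hypothetical helper (not part of keras): positions that fall inside the
# window of the word at 1-based position i, in a sequence of length n.
window_positions <- function(i, n, window_size = 4) {
  lo <- max(1, i - window_size)   # clip at the start of the sequence
  hi <- min(n, i + window_size)   # clip at the end of the sequence
  setdiff(lo:hi, i)               # the centre word itself is excluded
}

window_positions(3, n = 6, window_size = 2)  # positions 1, 2, 4, 5
window_positions(1, n = 6, window_size = 2)  # positions 2, 3
```

With window_size = 2 a word in the middle of the sentence sees up to four neighbours, while a word at the edge sees only two.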
negative_samples: Float >= 0. Use 0 for no negative (i.e. random) samples, or 1 for the same number of negative samples as positive samples.
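As a rough illustration, negative_samples acts as a ratio applied to the number of positive couples. The sketch below assumes that fractional results are truncated; the variable names are illustrative only.

```r
# Assumed relation: number of negatives = floor(positives * negative_samples).
n_positive <- 10
negative_samples <- c(0, 0.5, 1, 2)
n_negative <- floor(n_positive * negative_samples)
n_negative
```

For ten positive couples this gives 0, 5, 10 and 20 negative couples respectively.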
shuffle: Whether to shuffle the word couples before returning them.
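Shuffling only makes sense if couples and labels stay aligned. A minimal base-R sketch of that idea, using made-up couples and a single shared permutation (an assumption about how one might do it, not a description of the package internals):

```r
set.seed(123)                               # fixed seed so the order is reproducible
couples <- list(c(1, 2), c(2, 1), c(2, 7))  # made-up example couples
labels  <- c(1L, 1L, 0L)                    # c(2, 7) is the negative sample

perm    <- sample.int(length(couples))      # one permutation for both objects
couples <- couples[perm]
labels  <- labels[perm]
```

Because the same permutation indexes both objects, the couple c(2, 7) still carries label 0 after shuffling.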
categorical: If FALSE, labels will be integers (e.g. [0, 1, 1, ...]); if TRUE, labels will be categorical (e.g. [[1,0],[0,1],[0,1], ...]).
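The correspondence between the two label formats can be read off the examples above: 0 maps to [1,0] and 1 maps to [0,1]. A small base-R sketch of that conversion, where `to_categorical_pair` is a hypothetical helper name:

```r
# Hypothetical helper: turn an integer label (0 or 1) into a two-element
# one-hot vector, following the mapping 0 -> [1, 0] and 1 -> [0, 1].
to_categorical_pair <- function(label) c(1L - label, label)

labels <- c(0L, 1L, 1L)
categorical_labels <- lapply(labels, to_categorical_pair)
```

Here `categorical_labels` is a list of three integer vectors, one per label.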
sampling_table: 1D array of size vocabulary_size where the entry i encodes the probability to sample a word of rank i.
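One way to picture the sampling table is as a per-word keep probability. The base-R sketch below is an assumption about how such a table could be applied, not the package's implementation. Because R vectors are 1-based, the sketch stores the entry for word index i at position i + 1. The probabilities are set to 0 or 1 so that the outcome does not depend on the random draws.

```r
set.seed(1)
sequence <- c(1, 2, 3, 2, 1)

# One entry per word index 0..3 (vocabulary_size = 4).
# Index 0 is the non-word; indices 1 and 3 are always kept, index 2 never.
sampling_table <- c(0, 1, 0, 1)

# Keep each word with the probability stored for its index.
keep <- runif(length(sequence)) < sampling_table[sequence + 1]
kept_words <- sequence[keep]
kept_words
```

With real tables the entries lie strictly between 0 and 1, so frequent words are dropped only some of the time.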
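A list of couples and labels, where:
couples is a list of 2-element integer vectors, each pairing a word_index with the index of a second word.
labels is an integer vector of 0 and 1, where 1 indicates that the second word was found in the same window as word_index, and 0 indicates that the second word was drawn at random.
If categorical is set to TRUE, the labels are categorical, i.e. 1 becomes [0,1] and 0 becomes [1,0].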
This function transforms a sequence of word indices (a list of integers) into couples of words of the form:
(word, word in the same window), with label 1 (positive samples).
(word, random word from the vocabulary), with label 0 (negative samples).
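The two kinds of couples can be sketched in base R. This is an illustrative approximation under stated assumptions, not the package's implementation: `skipgram_sketch` is a hypothetical name, negatives reuse centre words drawn at random from the positive couples, random words are drawn from indices 1 to vocabulary_size - 1, and shuffling, categorical labels and the sampling table are left out.

```r
# Hypothetical base-R sketch of skipgram couple generation.
skipgram_sketch <- function(sequence, vocabulary_size, window_size = 4,
                            negative_samples = 1) {
  n <- length(sequence)
  word <- integer(0)    # first element of each couple
  other <- integer(0)   # second element of each couple
  for (i in seq_len(n)) {
    if (sequence[i] == 0) next                 # index 0 is a non-word
    for (j in max(1, i - window_size):min(n, i + window_size)) {
      if (j == i || sequence[j] == 0) next     # skip self and non-words
      word <- c(word, sequence[i])
      other <- c(other, sequence[j])
    }
  }
  labels <- rep(1L, length(word))              # positive samples
  n_negative <- floor(length(word) * negative_samples)
  if (n_negative > 0) {
    picks <- sample.int(length(word), n_negative, replace = TRUE)
    random_words <- sample.int(vocabulary_size - 1, n_negative, replace = TRUE)
    word <- c(word, word[picks])
    other <- c(other, random_words)
    labels <- c(labels, rep(0L, n_negative))   # negative samples
  }
  couples <- unname(Map(function(w, o) c(w, o), word, other))
  list(couples = couples, labels = labels)
}

res <- skipgram_sketch(c(1, 2, 3), vocabulary_size = 4,
                       window_size = 1, negative_samples = 0)
res$couples  # (1, 2), (2, 1), (2, 3), (3, 2)
```

With negative_samples = 1 the same call would return eight couples: the four positive ones above followed by four couples whose second word is random.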
Read more about Skipgram in this gnomic paper by Mikolov et al.: Efficient Estimation of Word Representations in Vector Space