How is WordPiece tokenization helpful in effectively dealing with the rare-words problem in NLP?

WordPiece and BPE are two similar and commonly used techniques for segmenting words into subword units in NLP tasks. In both cases, the vocabulary is initialized with all the individual characters in the language, and then the most frequent (BPE) or most likely (WordPiece) combinations of symbols in the vocabulary are iteratively added to the vocabulary. Consider the WordPiece algorithm …

Read more
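
To make the merge loop concrete, here is a minimal sketch of BPE-style vocabulary construction in plain Python (WordPiece is analogous, but scores candidate merges by the likelihood gain of a language model rather than by raw frequency; the toy corpus and helper names below are illustrative, not from the original answer):

    from collections import Counter

    # Toy corpus: word (pre-split into characters) -> frequency.
    corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
              ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}

    def most_frequent_pair(corpus):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        return max(pairs, key=pairs.get)

    def merge_pair(corpus, pair):
        # Rewrite each word with every occurrence of `pair` fused into one symbol.
        merged = {}
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = freq
        return merged

    vocab = {ch for word in corpus for ch in word}  # start from single characters
    for _ in range(10):                             # 10 merges, for illustration
        pair = most_frequent_pair(corpus)
        vocab.add(pair[0] + pair[1])                # add the merged symbol
        corpus = merge_pair(corpus, pair)
    print(sorted(vocab))  # characters plus learned subwords such as "est" and "low"

Rare and unseen words then decompose into these learned subwords instead of mapping to a single out-of-vocabulary token.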

Embedding in PyTorch

nn.Embedding holds a Tensor of dimension (vocab_size, vector_size), i.e. the size of the vocabulary × the dimension of each embedding vector, and a method that does the lookup. When you create an embedding layer, the Tensor is initialised randomly. It is only when you train it that similarity between similar words should appear. …

Read more
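
As a small sketch of the above (the sizes here are arbitrary, chosen only for illustration): the layer is just a learnable (vocab_size, vector_size) weight matrix, and calling it with integer indices performs the row lookup.

    import torch
    import torch.nn as nn

    vocab_size, vector_size = 10, 4      # illustrative sizes
    emb = nn.Embedding(vocab_size, vector_size)

    print(emb.weight.shape)              # torch.Size([10, 4]), randomly initialised
    ids = torch.tensor([1, 3, 3])        # token indices
    vectors = emb(ids)                   # looks up rows 1, 3 and 3 of the weight matrix
    print(vectors.shape)                 # torch.Size([3, 4])
    # Before training the rows are random; similarity between related words
    # only emerges once the weights are updated by backpropagation.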

What does the tf.nn.embedding_lookup function do?

Yes, this function is hard to understand until you get the point. In its simplest form, it is similar to tf.gather: it returns the elements of params at the indices specified by ids. For example (assuming you are inside a tf.InteractiveSession()):

    params = tf.constant([10, 20, 30, 40])
    ids = tf.constant([0, 1, 2, 3])
    print(tf.nn.embedding_lookup(params, ids).eval())

would return [10 20 30 40], …

Read more
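
In the more typical case, params is a 2-D embedding matrix and the lookup returns whole rows. A brief sketch of that case and of the tf.gather equivalence mentioned above (TensorFlow 1.x, to match the answer's tf.InteractiveSession usage; the sizes are illustrative):

    import numpy as np
    import tensorflow as tf  # TensorFlow 1.x

    sess = tf.InteractiveSession()
    params = tf.constant(np.arange(12.0).reshape(4, 3))  # 4-word vocab, 3-dim vectors
    ids = tf.constant([3, 0, 3])

    rows = tf.nn.embedding_lookup(params, ids)   # rows 3, 0 and 3 of params
    same = tf.gather(params, ids)                # equivalent in this simple case
    print(rows.eval())
    print(np.array_equal(rows.eval(), same.eval()))  # True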