Updated 2019 embeddings with better handling of apostrophe tokenization

Merged Deyan Ginev requested to merge embeddings-2019-updated into master

In the 2019 version of the embeddings I made an effort to preserve text-mode punctuation in the plain text model, so that we at least have the option of using the full context, as is now customary when pre-training language models on very large corpora.

However, some of this is subtler than it first appears. In particular, the best practice for apostrophes is to keep a fixed set of English abbreviated forms as individual word entries. So I updated the llamapun generation to do so, keeping ten hand-selected entries and emitting a standalone "apostrophe token" in all other cases, which are mostly single 'quote' quotations.
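As a rough illustration of the rule described above, here is a minimal Python sketch. It is not the llamapun implementation (llamapun is written in Rust, and its actual logic may differ). The ten kept entries are copied from the vocabulary table in this description; the function names, the whitespace pre-splitting, and the lowercasing of the kept suffix are assumptions made for the example.

```python
import re
from typing import List

# The ten kept entries, copied from the vocabulary table in this description.
KEPT_SUFFIXES = {"'s", "'t", "'un", "'th", "'ll", "'il", "'d", "'ve", "'re", "'m"}


def _isolate_apostrophes(text: str) -> List[str]:
    """Split around every apostrophe, keeping each one as its own token."""
    return [piece for piece in re.split(r"(')", text) if piece]


def split_apostrophes(word: str) -> List[str]:
    """Tokenize one whitespace-delimited word (hypothetical helper).

    If the word ends in one of the kept suffixes, that suffix stays a single
    token; every other apostrophe becomes a standalone apostrophe token.
    """
    head, sep, tail = word.rpartition("'")
    suffix = "'" + tail.lower()  # assumption: kept suffixes are lowercased
    if sep and head and suffix in KEPT_SUFFIXES:
        return _isolate_apostrophes(head) + [suffix]
    return _isolate_apostrophes(word)


def tokenize(text: str) -> List[str]:
    """Whitespace-split the text, then apply the apostrophe rule per word."""
    return [tok for word in text.split() for tok in split_apostrophes(word)]
```

With this sketch, `tokenize("it's a 'quote' we've seen")` keeps `'s` and `'ve` as single entries and turns the two quotation marks into standalone `'` tokens.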

The resulting entries in the GloVe vocabulary are:

vocabulary rank   word   corpus frequency
            240   's              6623738
            267   '               5697131
           2545   't               327186
           8069   'un               40822
           9238   'th               30858
           9768   'll               27310
          11821   'il               18036
          11946   'd                17683
          12283   've               16714
          15093   're               10872
          20422   'm                 5803
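To check a listing like the one above against a freshly built vocabulary, a small script along these lines could help. It is a sketch that assumes the vocabulary is stored the way GloVe's `vocab_count` tool writes it (one `word count` pair per line, sorted by descending frequency) and that ranks are counted from 1; the function name and the toy input are hypothetical, and the counts in the toy input are not real data.

```python
from typing import Iterable, Iterator, Tuple


def apostrophe_entries(vocab_lines: Iterable[str]) -> Iterator[Tuple[int, str, int]]:
    """Yield (rank, word, frequency) for entries that start with an apostrophe."""
    for rank, line in enumerate(vocab_lines, start=1):
        parts = line.split()
        if len(parts) != 2:
            continue  # skip lines that are not a "word count" pair
        word, freq = parts
        if word.startswith("'"):
            yield rank, word, int(freq)


# Toy input with made-up counts, only to show the expected line format.
toy_vocab = ["the 100", "'s 40", "of 30", "' 20"]
for rank, word, freq in apostrophe_entries(toy_vocab):
    print(rank, word, freq)
```

For a real run, the toy list would be replaced by an open file handle on the generated vocabulary file.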

This leads to a further denoised vocabulary, now counting 989,136 distinct words (frequency of 5 or higher). I'll upload the data to GitLab, and merge this request when the data looks good.
