updated 2019 embeddings: better apostrophe tokenization
In the 2019 version of the embeddings I have made an effort to carry the text-mode punctuation over into the plain-text model, so that we at least have the option of using the full context, as is now customary when pre-training language models on very large corpora.
However, some of that is more subtle than it seems at first glance. In particular, the best practice for apostrophes is to preserve a fixed set of English contracted forms as individual word entries. So I updated the llamapun generation to do just that, keeping ten hand-selected words and emitting a standalone "apostrophe token" in all other cases, which are mostly single-quote quotations.
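For illustration, here is a minimal Rust sketch of that rule. It is not the actual llamapun implementation; the whitelist contents, function name, and recursion strategy are my own assumptions. A fixed list of contraction suffixes survives as standalone tokens, and every other apostrophe is emitted as a bare `'` token.

```rust
/// Sketch only: contraction suffixes kept as standalone word entries,
/// mirroring the hand-selected set listed in the table below.
const KEPT_SUFFIXES: &[&str] = &["'s", "'t", "'ll", "'d", "'ve", "'re", "'m", "'un", "'th", "'il"];

/// Split a word on its apostrophe: whitelisted suffixes become their own
/// tokens, all other apostrophes become a bare "'" token.
fn split_apostrophe(word: &str) -> Vec<String> {
    match word.find('\'') {
        None => vec![word.to_string()],
        Some(idx) => {
            let (stem, tail) = word.split_at(idx);
            let mut tokens = Vec::new();
            if !stem.is_empty() {
                tokens.push(stem.to_string());
            }
            if KEPT_SUFFIXES.contains(&tail) {
                // e.g. "don't" -> ["don", "'t"], "we'll" -> ["we", "'ll"]
                tokens.push(tail.to_string());
            } else {
                // any other use (mostly single-quote quotations) -> bare "'"
                tokens.push("'".to_string());
                let rest = &tail[1..];
                if !rest.is_empty() {
                    // recurse in case the remainder contains another apostrophe
                    tokens.extend(split_apostrophe(rest));
                }
            }
            tokens
        }
    }
}

fn main() {
    for w in ["don't", "we'll", "'quoted'"] {
        println!("{:?} -> {:?}", w, split_apostrophe(w));
    }
}
```

With this rule, `'quoted'` tokenizes as `["'", "quoted", "'"]`, so the quotation marks end up mapped to the standalone apostrophe entry rather than polluting the word vocabulary.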
The resulting entries in the GloVe vocabulary are:
| vocabulary rank | word | corpus frequency |
|---|---|---|
| 240 | 's | 6,623,738 |
| 267 | ' | 5,697,131 |
| 2545 | 't | 327,186 |
| 8069 | 'un | 40,822 |
| 9238 | 'th | 30,858 |
| 9768 | 'll | 27,310 |
| 11821 | 'il | 18,036 |
| 11946 | 'd | 17,683 |
| 12283 | 've | 16,714 |
| 15093 | 're | 10,872 |
| 20422 | 'm | 5,803 |
This leads to a further denoised vocabulary, now counting 989,136 distinct words (frequency 5+). I'll upload to GitLab and merge the pull request when the data looks good.