updated 2019 embeddings handling apostrophe tokenization better
In the 2019 version of the embeddings I have made an effort to preserve text-mode punctuation in the plain-text model, so that we at least have the option of using the full context, as is now customary when pre-training language models on very large corpora.
However, some of this is subtler than it appears at first glance. In particular, the best practice for apostrophes is to preserve a fixed set of English contracted uses as individual word entries. So I updated the llamapun generation to do just that, keeping ten hand-selected words and emitting a standalone "apostrophe token" in all other cases, which are mostly single-'quote' quotations.
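For concreteness, the rule amounts to something like the sketch below. The ten keeper suffixes match the vocabulary table that follows; the function name and the exact splitting behavior are illustrative assumptions, not llamapun's actual API.

```rust
/// Contraction suffixes preserved as standalone vocabulary entries.
const KEPT_SUFFIXES: [&str; 10] = [
    "'s", "'t", "'un", "'th", "'ll", "'il", "'d", "'ve", "'re", "'m",
];

/// Split one whitespace-delimited word on its apostrophes
/// (hypothetical helper, not llamapun's real function).
fn tokenize_apostrophe(word: &str) -> Vec<String> {
    match word.find('\'') {
        // No apostrophe: the word passes through unchanged.
        None => vec![word.to_string()],
        // A whitelisted contraction keeps its suffix as one token,
        // e.g. "don't" -> ["don", "'t"].
        Some(idx) if KEPT_SUFFIXES.contains(&&word[idx..]) => {
            let mut tokens = Vec::new();
            if idx > 0 {
                tokens.push(word[..idx].to_string());
            }
            tokens.push(word[idx..].to_string());
            tokens
        }
        // Everything else (mostly single-quote quotations) gets a
        // standalone apostrophe token,
        // e.g. "'sixties'" -> ["'", "sixties", "'"].
        Some(_) => word
            .replace('\'', " ' ")
            .split_whitespace()
            .map(str::to_string)
            .collect(),
    }
}
```

The design choice is that the whitelist keeps frequent contractions recoverable as single embedding entries, while everything outside it degrades gracefully to a generic quote marker.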
The resulting entries in the GloVe vocabulary are:
vocabulary rank | word | corpus frequency |
---|---|---|
240 | 's | 6623738 |
267 | ' | 5697131 |
2545 | 't | 327186 |
8069 | 'un | 40822 |
9238 | 'th | 30858 |
9768 | 'll | 27310 |
11821 | 'il | 18036 |
11946 | 'd | 17683 |
12283 | 've | 16714 |
15093 | 're | 10872 |
20422 | 'm | 5803 |
This leads to a further denoised vocabulary, now counting 989,136 distinct words (frequency 5+). I'll upload the data to GitLab and merge the pull request when it looks good.
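As a sanity check, the cutoff can be recomputed from GloVe's vocabulary file. This is a minimal sketch assuming the standard vocab_count output of one "word count" pair per line; the file name is a placeholder.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};

/// Count vocabulary entries whose corpus frequency meets the cutoff.
fn count_kept(path: &str, min_freq: u64) -> io::Result<usize> {
    let reader = BufReader::new(File::open(path)?);
    let mut kept = 0;
    for line in reader.lines() {
        let line = line?;
        // Each line is "word count"; the count follows the last space.
        if let Some((_word, count)) = line.rsplit_once(' ') {
            if count.trim().parse::<u64>().map_or(false, |c| c >= min_freq) {
                kept += 1;
            }
        }
    }
    Ok(kept)
}

fn main() -> io::Result<()> {
    // Expected to print 989136 for the 2019 data at frequency 5+.
    println!("{}", count_kept("vocab.txt", 5)?);
    Ok(())
}
```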
Just to record awareness: `'un` and `'il` in particular come from French paragraphs, which manage to sneak through the language-check test at times. The check may need to be hardened to drop even more documents if the noise becomes unbearable. For the statement task this isn't an immediate issue, as these word entries tend to be mostly parallel to the English ones in the vocabulary.
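One way to harden it would be a stricter per-paragraph gate, e.g. via the whatlang crate; this is a sketch under that assumption, and the 0.9 confidence threshold is illustrative, not a tuned value.

```rust
use whatlang::{detect, Lang};

/// Stricter gate: accept a paragraph only when the detector returns
/// a reliable, high-confidence English result. The 0.9 threshold is
/// an illustrative assumption, not a tuned value.
fn is_confidently_english(paragraph: &str) -> bool {
    detect(paragraph).map_or(false, |info| {
        info.lang() == Lang::Eng && info.is_reliable() && info.confidence() > 0.9
    })
}
```

Raising the bar this way trades a few dropped English paragraphs for fewer French intrusions like the two noted above.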
added 4 commits

- aaeb3476...5fad79dc - 3 commits from branch master
- 175ee721 - updated 2019 embeddings handling apostrophe tokenization better

mentioned in commit 52191699