I have already shared, on my statistics page, the Glossika statistics that wibr compiled and posted on Chinese-Forums.com (thank you, wibr!). Out of curiosity, and because I know one of my three readers wants more Glossika statistics, I decided to do a brief analysis of Glossika’s Fluency 1 – 3 program using imron’s Chinese Text Analyzer. Let’s try to answer the question: “How many words are there in Glossika?”
Here is how I did it over the course of 15 tedious minutes. I copied the text of each PDF, pasted each into Notepad++, and then copied the word-list index of each PDF into a new file (so I had Fluency 1 Traditional, Fluency 1 Simplified, Fluency 2 Traditional, etc. wordlists). Then, I copied each of those files into the Chinese Text Analyzer (CTA) for the individual results. Finally, I combined them all together and pasted the combined list into CTA to get the cumulative totals.
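For anyone who wants to repeat the counting step without CTA, here is a rough Python sketch of the idea. It is not how Chinese Text Analyzer works internally (CTA segments running text). It simply assumes each index entry is already one word, and the sample entries are made up rather than taken from Glossika.

```python
def unique_counts(wordlists):
    """Return (unique words, unique characters) across several word lists."""
    words = set()
    for wordlist in wordlists:
        for entry in wordlist:
            entry = entry.strip()
            if entry:  # skip blank lines left over from copy-pasting
                words.add(entry)
    # Every distinct character that appears in any unique word.
    chars = {ch for word in words for ch in word}
    return len(words), len(chars)


# Hypothetical sample entries, NOT taken from Glossika's indices.
fluency1 = ["我", "你好", "到"]
fluency2 = ["你好", "到", "朋友"]

per_level = [unique_counts([wl]) for wl in (fluency1, fluency2)]
cumulative = unique_counts([fluency1, fluency2])
print(per_level, cumulative)
```

Counting the combined lists with a set is what keeps a word that appears in both Fluency 1 and Fluency 2 from being counted twice in the cumulative total.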
Here are the results:
|Unique Simplified Words||Unique Simplified Characters||Unique Traditional Words||Unique Traditional Characters|
Since wibr’s analysis only compared Glossika’s text to a wordlist from TOP/TOFL, I guessed that it might have underestimated the actual number of words in the Glossika series. Indeed, wibr’s data implied around 2000 words total for Fluency 1 – 3 and 3000 for the series as a whole. If my results are accurate, Fluency 1 – 3 has nearly 1000 more words than my estimate using wibr’s data.
There are a few pitfalls, of course. The Chinese Text Analyzer appears to count characters on their own as words. I also did not look closely at each wordlist index to see if names were included (there are many Western transliterations that appear in Glossika, one of the series’ faults), which could inflate the total number of words. Although I haven’t checked to see if the latter is indeed an issue, I did a crude test of the former.
Fluency 3’s Mandarin word index spans about 73 pages, and each page has around 40 words on it. That works out to about 3000 words (73 × 40 = 2920) in Fluency 3’s index. Its index does not appear to be comprehensive. That is, it does not include page numbers for Fluency 1 and 2 words, so I assume those words do not appear in Fluency 3’s index unless they also appear in a Fluency 3 lesson. Therefore, if anything, I have undercounted the number of words in Glossika.
Update: I took another look at the indices. Many words are repeated several times on a page (for instance, dao4 appears in a bunch of lessons, so it occupies a couple of positions). Mystery solved: there are 2200 unique words in Fluency 3, not 3000.
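The gap between the page-based estimate and the unique count comes down to counting entries versus counting distinct words. A minimal sketch of both calculations follows. The page and per-page figures are the rough ones quoted above, and the sample entries are invented for illustration.

```python
# Rough figures from the post: about 73 index pages, about 40 entries per page.
PAGES = 73
ENTRIES_PER_PAGE = 40

estimated_entries = PAGES * ENTRIES_PER_PAGE  # counts repeats too


def count_unique(entries):
    """Count distinct index entries, ignoring repeats of the same word."""
    return len(set(entries))


# Hypothetical index fragment: dao4 listed once per lesson it appears in.
sample_index = ["dao4", "dao4", "dao4", "peng2you3"]

print(estimated_entries)           # total entries, repeats included
print(len(sample_index))           # 4 entries on this made-up page
print(count_unique(sample_index))  # but only 2 distinct words
```

An estimate built from pages × entries per page is an upper bound on unique words, since every repeated entry is counted again.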
Update 2: Transliterated names appear in the index as well. This also explains why each index has so many words in it. If Glossika counts these names as Chinese words, that would explain the claim that the series has 4000+ words.
To conclude, I’m not sure I’ve learned, or contributed, anything useful. wibr’s stats showed 30% of TOP’s 7339 words appear in Glossika 1 – 3, so about 2200 words. My method showed there were around 3000 words. A cruder look at the indices suggested about 3000 entries in Fluency 3 alone, or about 2200 unique words once the repeats noted in the update above are removed. The Glossika website advertises 4000+ words, although it’s not clear if that 4000+ comes from combining the Simplified and Traditional word counts. Most Glossika levels advertise having 1000+ words, so it doesn’t seem unreasonable for the Simplified and Traditional versions of Fluency 1 – 3 to have between 3000 and 4000 words in total.
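For reference, the competing figures can be lined up in a few lines of Python. Everything here is taken from the numbers quoted in this post. The dictionary keys are just labels chosen for this sketch.

```python
TOP_WORDLIST_SIZE = 7339  # size of the TOP wordlist wibr compared against
OVERLAP_PERCENT = 30      # share of that list found in Fluency 1 - 3

# 30% of 7339, rounded to the nearest whole word.
top_overlap = round(TOP_WORDLIST_SIZE * OVERLAP_PERCENT / 100)

estimates = {
    "wibr (TOP overlap, Fluency 1-3)": top_overlap,
    "CTA on the indices (Fluency 1-3)": 3000,
    "Fluency 3 index, entries with repeats": 73 * 40,
    "Fluency 3 index, unique words": 2200,
    "Glossika website claim": 4000,
}

for label, count in estimates.items():
    print(f"{label}: about {count}")
```

Only the first figure is computed. The rest are the rounded values reported above, so the listing is a summary rather than a new measurement.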
I JUST stopped myself from posting the text files. Thank goodness my law talking guy reminded me…
I would love to share them openly so others could benefit/someone smarter than me could do an analysis. Oh well!