Un plan para leer noticias en español
TL;DR: I wrote a Python script that crawls Spanish news articles and creates vocabulary decks for Anki, a spaced-repetition flashcard app.
My goal is to memorize the 5,000 most common Spanish words within a year. In English, the 1,000 most common words can cover 75% of daily conversations [source]. 5,000 sounds like a reasonable number for being somewhat ‘fluent’ in Spanish.
I have also been trying to read the news in Spanish. It is a fun way to increase my exposure to Spanish while indulging my habit of reading the news. My vocabulary is not yet strong enough to read everyday news articles in Spanish, so I frequently stop to look up meanings. I realized I needed a supplementary vocabulary list on top of the 5,000 words I have been memorizing.
Instead of manually compiling ‘interesting words’ from news articles, I decided to write a Python script. Here is how I did it.
Here is an overview of what the script does.
- Aggregating news: a crawler visits news websites and collects articles automatically.
- Processing the texts: the aggregated articles are parsed and tokenized.
- Finding interesting words: each article typically contains 400 to 600 words. The script keeps the interesting ones by filtering out the uninteresting ones.
- Exporting to Anki: the results are packaged into an Anki deck.
The full implementation is available at this GitHub repository. In this post, I left out minor details for readability.
Aggregating news
I used the following packages to aggregate news.
- requests: a Python package that handles HTTP requests
- Beautiful Soup: a Python package for parsing HTML and XML data
The script visits multiple websites and collects news data by parsing their HTML.
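To make the aggregation step concrete, here is a minimal sketch using requests and Beautiful Soup. The function names and the choice to collect every `<p>` tag are my assumptions for illustration; the actual script and the selectors each news site needs may differ.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def extract_article_text(html: str) -> str:
    """Join the text of every <p> tag (assumption: a real site needs a narrower selector)."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    # Skip paragraphs that are empty after stripping whitespace.
    return "\n".join(p for p in paragraphs if p)
```

Calling `extract_article_text(fetch_html(url))` for each article URL would then give the raw text for the next step.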
Processing the texts
A computer cannot ‘understand’ text the way humans do. It does not know what a sentence (or a word) is. Tokenization is the process of splitting text into a list of tokens (sentences or words), often while assigning metadata such as Part-of-Speech information.
Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. [wiki]
Part-of-Speech: a category to which a word is assigned in accordance with its syntactic functions. In English, the main parts of speech are noun, pronoun, adjective, determiner, verb, adverb, preposition, conjunction, and interjection. [Oxford Languages]
I chose spaCy as the Spanish tokenizer. Symbols, numbers, and whitespace are removed during tokenization.
Note: lemmatization is another common processing technique that finds a base form of a token (went -> go; hablo -> hablar). I intentionally left out lemmatization because I want to memorize Spanish words in their conjugated forms. Spanish conjugation was actually created by aliens.
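Here is a rough sketch of the tokenization step with spaCy. It assumes a blank Spanish pipeline, which only applies tokenizer rules and needs no model download; Part-of-Speech tags would require a trained pipeline instead. The `tokenize` helper and the lowercasing are my own additions for illustration. No lemmatization is applied, so conjugated forms stay as they are.

```python
import spacy

# Blank Spanish pipeline: tokenizer rules only, no trained components.
nlp = spacy.blank("es")


def tokenize(text: str) -> list[str]:
    """Return lowercased alphabetic tokens, dropping symbols, numbers, and whitespace."""
    # token.is_alpha is False for punctuation, digits, and whitespace tokens.
    return [token.text.lower() for token in nlp(text) if token.is_alpha]
```

For example, `tokenize("El gobierno anunció 3 medidas nuevas.")` keeps the five words and drops the number and the period.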
Finding interesting words
To find interesting words, the tokens go through these filters.
- Cognates filter
- Stop words filter
- Term Frequency–Inverse Document Frequency (TF-IDF) filter
Cognates filter
Cognates are words in Spanish and English that share the same Latin and/or Greek root, are very similar in spelling, and have the same or similar meaning. About 90% of Spanish cognates have the same meaning in English. [NYU]
English and Spanish share many cognates (necessary, necesario), and the script filters them out as follows.
- Use the Google Translate API to get English translations of the Spanish tokens.
- Compute the normalized Jaro-Winkler similarity between each Spanish word and its English translation using the Python package ‘textdistance.’
- I settled on normalized Jaro-Winkler similarity after reading a few technical blogs. If I had more time, I would have built a custom cognate classifier.
- Accented characters (á) are different characters from their unaccented counterparts (a) and fall outside ASCII, so they need to be de-accented before comparison.
- Spanish tokens that are sufficiently similar to their English translations (95th percentile of normalized Jaro-Winkler similarity or higher) are filtered out.
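The steps above can be sketched in plain Python. This is an illustration under several assumptions: the Jaro-Winkler function is hand-written as a stand-in for ‘textdistance’, the translations are passed in as ready-made (Spanish, English) pairs instead of coming from the Google Translate API, and the 95th percentile is computed over the batch of scores with a nearest-rank rule, which is my reading of the cutoff.

```python
import math
import unicodedata


def strip_accents(word: str) -> str:
    """Remove combining marks: 'análisis' -> 'analisis' (note: also turns 'ñ' into 'n')."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order, k = 0, 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, scaling: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


def drop_likely_cognates(pairs, percentile=95):
    """Keep Spanish words whose similarity to their translation is below the cutoff."""
    scores = {es: jaro_winkler(strip_accents(es.lower()), en.lower()) for es, en in pairs}
    ranked = sorted(scores.values())
    # Nearest-rank percentile over this batch (an assumption about how the cutoff is chosen).
    cutoff = ranked[max(math.ceil(percentile / 100 * len(ranked)) - 1, 0)]
    return [es for es, _ in pairs if scores[es] < cutoff]
```

With a nearest-rank cutoff, the highest-scoring word in a batch is always dropped, so the percentile only makes sense over a reasonably large set of tokens.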
Stop words filter
Stop words are excessively common words (such as “the”, “a”, “an”, and “in” in English). I use the Python package ‘stop-words’ to filter out Spanish stop words.
Term Frequency–Inverse Document Frequency (TF-IDF) filter
At first glance, ranking words by how often they appear (term frequency, TF) seems like the most straightforward way to find ‘interesting words.’ However, term frequency alone cannot find them effectively because some words are simply too common. For instance, words like “corona”, “virus”, and “COVID-19” appeared in almost every article recently (as of July 2020). Without properly normalizing TF, the algorithm would incorrectly assign high relative importance to these words. Inverse Document Frequency (IDF) shrinks as the number of articles containing a term grows. By multiplying the two, hence TF-IDF, the algorithm can assign a more accurate relative importance to each word. Words with low importance get filtered out. I use ‘scikit-learn’ to implement the TF-IDF pipeline.
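To show why TF-IDF pushes down words like “virus”, here is a small hand-rolled version of the textbook formula (TF as a relative count, IDF as the log of total articles over articles containing the term). It is only an illustration, not the scikit-learn pipeline: scikit-learn's TfidfVectorizer smooths the IDF and normalizes each row by default, so its numbers differ. The function names are my own.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document, where each document is a list of tokens."""
    n_docs = len(documents)
    # Document frequency: in how many articles each term appears at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_terms(doc_scores, k=3):
    """Return the k highest-scoring terms of one document."""
    return sorted(doc_scores, key=doc_scores.get, reverse=True)[:k]
```

A term that appears in every article gets an IDF of log(1) = 0, so its score is zero no matter how often it is repeated.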
Exporting to Anki
‘Interesting words’ are packaged into an Anki deck file. For this, I use ‘genanki.’ I also wrote a helper function that automatically uploads the generated file to Google Drive.
Lessons learned
- Completeness > Perfection: I found myself going back and forth on my architecture because I wanted to write a ‘perfect’ script. I quickly realized I wasn’t progressing as fast as I had planned. I had to embrace imperfections and shortcomings to complete the project in time. A working script with many flaws is infinitely better than an incomplete script with a nice plan.
- It is cheaper to buy (sometimes): Being super cheap, I spent a lot of time looking for a free alternative to the Google Translate API, but I could not find anything satisfactory. It is much cheaper to pay for a good service than to waste hours looking for an alternative that does not exist.
- Engineering matters. I built a prototype script in a Jupyter Notebook in a few hours, but there was a lot more engineering work than I had anticipated: Google API integration, Google service accounts and authentication, Google credentials management, saving/loading intermediate results, DB choice and schema, and much more. As a result, I am buying $GOOG stock.
- Side projects are essential. I was joyful to learn and to create something new.
P.S. — If you are interested in getting decks, please fill out a form here. I will send you a weekly deck. Feel free to share the link with anyone interested in learning Spanish.
P.P.S. — If you are a native Spanish speaker, I could use your help. Please let me know if you can.