Un plan para leer noticias en español (A plan for reading the news in Spanish)

Aliens created Spanish conjugation… es verdadero (it’s true)

Motivation

An example Anki deck

Overview

  1. Aggregating news: a crawler visits news websites and collects news articles automatically.
  2. Processing the texts: the aggregated articles are parsed and tokenized.
  3. Finding interesting words: each article typically contains 400 to 600 words. The script picks out the interesting ones by filtering out the uninteresting ones.
  4. Exporting to Anki: the results are packaged in an Anki format.

Aggregating news

  • requests: a Python package for making HTTP requests
  • Beautiful Soup: a Python package for parsing HTML and XML data
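
A minimal sketch of how these two packages can fit together, not the crawler actually used for this project. The helper names fetch_html and extract_article are hypothetical, and the assumption that the article body sits in <p> tags will not hold for every news site.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def extract_article(html: str) -> dict:
    """Pull the page title and paragraph text out of an article page."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    # Assumption: the body text lives in <p> tags; real sites usually
    # need site-specific selectors.
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    return {"title": title, "text": "\n".join(p for p in paragraphs if p)}
```

In practice the two would be chained, for example extract_article(fetch_html(url)) for each article URL the crawler discovers.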

Processing the texts
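
One simple way to tokenize Spanish text is a regular expression over lowercase letters. This is only a sketch under that assumption: tokenize is a hypothetical helper, and a real pipeline might use a proper NLP tokenizer instead.

```python
import re

# Runs of letters, including accented Spanish vowels, ü and ñ.
WORD_RE = re.compile(r"[a-záéíóúüñ]+")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens, dropping digits and punctuation."""
    return WORD_RE.findall(text.lower())


print(tokenize("El niño comió 3 manzanas, ¿verdad?"))
```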

Finding interesting words

  1. Cognates filter
  2. Stop words filter
  3. Term Frequency–Inverse Document Frequency (TF-IDF) filter
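
The second and third filters can be sketched in plain Python. Everything here is an assumption for illustration: the stop-word list is a tiny hypothetical sample, the TF-IDF variant is the basic unsmoothed one, and keeping the top_k highest-scoring words per article is just one way to turn scores into a filter.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one is much longer.
STOP_WORDS = {"el", "la", "los", "las", "de", "que", "y", "en", "un", "una"}


def remove_stop_words(tokens):
    """Drop very common function words."""
    return [t for t in tokens if t not in STOP_WORDS]


def tf_idf(documents):
    """Return one {token: score} dict per document (each document is a token list)."""
    n_docs = len(documents)
    # Number of documents each token appears in.
    doc_freq = Counter(token for doc in documents for token in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            token: (count / len(doc)) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return scores


def interesting_words(documents, top_k=5):
    """Keep the top_k highest-scoring non-stop-words of each document."""
    cleaned = [remove_stop_words(doc) for doc in documents]
    return [
        [w for w, _ in sorted(s.items(), key=lambda kv: kv[1], reverse=True)[:top_k]]
        for s in tf_idf(cleaned)
    ]
```

A word that appears in every article gets a score of zero, so it never outranks a word that is specific to one article.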

The cognates filter works in three steps:

  1. Use the Google Translate API to get English translations of the Spanish tokens.
  2. Compute the normalized Jaro-Winkler similarity between each Spanish word and its English translation using the Python package ‘textdistance’.
    - I settled on normalized Jaro-Winkler similarity after reading a few technical blog posts. With more time, I would have built a custom cognate classifier.
    - Accented characters (á) are different characters from their unaccented counterparts (a), so a naive comparison counts them as mismatches. Accents therefore need to be stripped first.
  3. Filter out Spanish tokens that are sufficiently similar to their English translations (95th percentile of normalized Jaro-Winkler similarity or higher).
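
The cognate steps above can be sketched as follows. To keep the example dependency-free, Jaro-Winkler is written out by hand here rather than taken from textdistance, and the translations dict is a hypothetical stand-in for what the translation API would return. Reading "95th percentile" as a nearest-rank cutoff over all scored pairs is also an assumption.

```python
import math
import unicodedata


def strip_accents(word):
    """Remove combining marks, e.g. 'información' -> 'informacion' (also turns ñ into n)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1, used2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    seq1 = [c for c, u in zip(s1, used1) if u]
    seq2 = [c for c, u in zip(s2, used2) if u]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1):
    """Jaro similarity boosted by the length of the shared prefix (up to 4 characters)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)


def filter_cognates(translations, percentile=95):
    """translations maps Spanish token -> English translation; returns likely non-cognates."""
    scores = {es: jaro_winkler(strip_accents(es.lower()), strip_accents(en.lower()))
              for es, en in translations.items()}
    if not scores:
        return []
    ranked = sorted(scores.values())
    # Nearest-rank percentile: tokens at or above the cutoff count as cognates.
    cutoff = ranked[max(math.ceil(percentile / 100 * len(ranked)), 1) - 1]
    return [es for es, score in scores.items() if score < cutoff]
```

For a pair such as "información" and "information", the de-accented strings share a long prefix and score close to 1, while "perro" and "dog" share nothing and score 0.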

Exporting to Anki
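
One low-tech way to get words into Anki is a tab-separated text file, which Anki's import dialog accepts with one note per line. This sketch assumes that route and a simple front/back card layout; export_cards is a hypothetical helper, and it is not necessarily the packaging format this project used.

```python
import csv
import io


def export_cards(cards, out_file):
    """Write (front, back) pairs as tab-separated lines, one note per line."""
    writer = csv.writer(out_file, delimiter="\t", lineterminator="\n")
    for front, back in cards:
        writer.writerow([front, back])


# Usage: write to an in-memory buffer; a real run would open a .txt file instead.
buffer = io.StringIO()
export_cards([("perro", "dog"), ("mesa", "table")], buffer)
print(buffer.getvalue())
```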

Lessons Learned

  1. Completeness > Perfection: I kept going back and forth on my architecture because I wanted to write a ‘perfect’ script, and I quickly realized I wasn’t progressing as fast as I had planned. I had to embrace imperfections and shortcomings to finish the project in time. A working script with many flaws is infinitely better than an incomplete script with a nice plan.
  2. It is cheaper to buy (sometimes): Being super cheap, I spent a lot of time looking for a free alternative to the Google Translate API, but I could not find anything satisfactory. Paying for a good service is much cheaper than wasting hours looking for an alternative that does not exist.
  3. Engineering matters. I built a prototype script in a Jupyter Notebook in a few hours, but there was a lot more engineering work than I had anticipated: Google API integration, Google service account and authentication, Google credentials management, saving/loading intermediate results, DB choice and schema, and much more. As a result, I am buying $GOOG stock.
  4. Side projects are essential. It was a joy to learn and to create something new.

A curious person who would like to observe the world

A Curious Can of Warmth
