Much of my day-to-day (academic) work is writing code in one way or another.
I do data cleaning almost exclusively in R and I am a fervent supporter of tidy data and the tidyverse. I am aware that there are other R packages, e.g. `data.table`, which can be considerably faster and more appropriate if you have a lot of data. However, I like the balance that tidyverse packages strike between speed, usability, and the expressiveness of the written code.
Some potentially helpful hints from hundreds, or at this point maybe even thousands, of hours of cleaning and merging data in R with `dplyr` and `tidyr`:
- `left_join()` data frames, never ever `cbind()`!
- Build in frequent checks to ensure that the code does what you presume it does with `stopifnot()`. E.g. whenever you do a left join (aka merge) of two data frames on unique keys, check that no entries in the first data frame were duplicated in the join: `stopifnot(anyDuplicated(d_merged$unique_key) == 0)`
- Use the marvellous functions `mutate_at()` and `rename_at()` to batch-recode and batch-rename variables (see the sketch after this list).
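To make these hints concrete, here is a minimal sketch of the join-then-check pattern together with the batch helpers. The data frames and column names are made up purely for illustration:

```r
library(dplyr)

d_students <- tibble(id = 1:3, name = c("Ada", "Grace", "Edsger"))
d_grades   <- tibble(id = 1:3,
                     grade_1 = c("5.5", "4.0", "6.0"),
                     grade_2 = c("5.0", "4.5", "5.5"))

# Join on the unique key instead of relying on row order with cbind()
d_merged <- left_join(d_students, d_grades, by = "id")

# Check that the join did not duplicate any rows of the first data frame
stopifnot(anyDuplicated(d_merged$id) == 0)

d_merged <- d_merged %>%
  # Batch-recode: parse all grade columns as numeric in one go
  mutate_at(vars(starts_with("grade_")), as.numeric) %>%
  # Batch-rename: upper-case all grade column names
  rename_at(vars(starts_with("grade_")), toupper)
```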
Many of the text analyses I run for different parts of my dissertation investigate how the ubiquitous “bag-of-words” assumption in text models can be relaxed. Thus, I rely on tools which analyse text linguistically: what I call “natural language processing”. These tools are usually written by computational linguists who have a penchant for the Python programming language, mostly due to the nltk library. The introduction to Python for Computational Linguistics I attended as an undergraduate was hands-down the best applied programming course I ever attended, courtesy of Simon Clematide and other people at the University of Zurich’s Institute of Computational Linguistics!
I started using spacy, a natural language processing framework written in Python, a couple of years ago. It integrates a state-of-the-art POS tagger, dependency parser, and named entity recognizer in an easy-to-use framework. Language models are already available for a number of (European) languages, and the coverage is increasing.
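As an aside for R users: spacy’s annotations can also be reached from R via the spacyr package. This is not something this post covers, so treat it as a suggestion under the assumption that a working spacy installation sits underneath. A minimal sketch:

```r
library(spacyr)

# Assumes spacy and the small English model are already installed
# (e.g. via spacyr::spacy_install() on a fresh machine)
spacy_initialize(model = "en_core_web_sm")

parsed <- spacy_parse("Simon teaches computational linguistics in Zurich.",
                      pos = TRUE, entity = TRUE, dependency = TRUE)
parsed  # one row per token, with lemma, POS tag, dependency relation, entity

spacy_finalize()
```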
If you do not need to relax the bag-of-words assumption, and often you don’t, I can recommend the excellent quanteda package in R. I do not say this only because my primary supervisor, Ken Benoit, is the driving force behind the package and because I have a small involvement in it myself. At this point, the package is solid, stable, and still gets better with every release. It is written in heavily optimised C++ code (mostly by Haiyan Wang and Kohei Watanabe) and is blazingly fast as well as memory efficient. It also plays nicely with most other text analysis packages you might want to use in your analysis.
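To give a flavour of the package, here is a minimal sketch of the standard corpus, tokens, and document-feature-matrix workflow, with toy texts of my own rather than anything from the package documentation:

```r
library(quanteda)

txts <- c(doc1 = "We spend hundreds of hours cleaning and merging data.",
          doc2 = "Bag-of-words models are often all the analysis needs.")

corp  <- corpus(txts)                       # build a corpus from named texts
toks  <- tokens(corp, remove_punct = TRUE)  # tokenise, dropping punctuation
dfmat <- dfm(toks)                          # document-feature matrix
topfeatures(dfmat, 5)                       # the five most frequent features
```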
Some, but unfortunately not all, of the code is open source. Here is a list of some of the projects which are publicly available: