Materials for the course "'Other' Computational Social Science Skills" at the fourth MPIDR Summer Incubator Program, June 10, 2025, Rostock, Germany (and online).
Here is the folder structure and what to expect:
- Root folder
  - This ReadMe.md text file and other files, such as the license.
- Slides
  - Open the PDF of the slides and follow the steps described there to learn how to use the files in this repository for the hands-on part. The slides also introduce the concepts through practical examples.
  - The slides also include instructions on which software to install for the hands-on part.
- Hands_on
  - Includes scripts in R, Python, and SQL (using DuckDB), with toy data and a larger dataset to download from the internet.
Note and disclaimer: in this short introductory session, I will only share leads and links and show you "example" scripts in R, Python, and SQL; you will need to extend them on your own.
- Parallelization of extract, transform, load (ETL) tasks in Python and R;
- Faster input/output (I/O) with columnar file formats such as Parquet and Feather versus row-based CSV: are all columns needed? What about strings?
- Functional programming versus Object-Oriented Programming;
- Tabular versus relational databases versus graph databases;
- In-memory databases such as DuckDB for faster, parallelized ETL tasks;
- DuckDB's interfaces to R, Python, etc., to manage I/O and ETL tasks;
- Using Dask to parallelize familiar data structures such as pandas DataFrames and NumPy arrays;
- Using workflow managers, e.g., Snakemake, for reproducibility.
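
To give a flavor of the first topic, here is a minimal sketch of a parallelized ETL task in Python, using only the standard library. It is not one of the scripts in the Hands_on folder: the toy data and the names `TOY_CSVS`, `extract`, `transform`, `etl_chunk`, `load`, and `run_pipeline` are hypothetical and chosen for illustration only. It assumes the work per file is mostly I/O-bound, which is why it uses threads.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy data: three small CSV "files" held in memory.
TOY_CSVS = [
    "id,group,value\n1,a,10\n2,b,20\n",
    "id,group,value\n3,a,30\n4,b,40\n",
    "id,group,value\n5,a,50\n6,b,60\n",
]


def extract(text):
    """Extract: parse one CSV chunk into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: keep only the columns needed and cast value to int."""
    return [{"group": r["group"], "value": int(r["value"])} for r in rows]


def etl_chunk(text):
    """Run extract and transform on a single chunk."""
    return transform(extract(text))


def load(chunks):
    """Load: sum the values per group across all transformed chunks."""
    totals = {}
    for rows in chunks:
        for r in rows:
            totals[r["group"]] = totals.get(r["group"], 0) + r["value"]
    return totals


def run_pipeline(sources, max_workers=3):
    """Process all chunks concurrently, then combine the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        chunks = list(pool.map(etl_chunk, sources))
    return load(chunks)


if __name__ == "__main__":
    print(run_pipeline(TOY_CSVS))  # {'a': 90, 'b': 120}
```

Threads fit this sketch because reading files is typically I/O-bound. For CPU-heavy transforms, `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface and runs the work in separate processes.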