Skip to content

lsg551/matricula-online-scraper

Repository files navigation

Matricula Online Scraper

PyPI - Python Version GitHub License PyPI - Version Publish to PyPi

Matricula Online is a non-profit initiative that digitizes parish records in Central Europe and hosts them for free on their website. matricula-online-scraper is a command-line interface (CLI) tool that enables you to download data directly from it.

This includes the scanned parish registers (scanned books about baptism, marriage, death records etc.) as well as metadata about the parishes and their registers. Data can be downloaded in CSV or JSON, images are generally provided as JPEG files.

Installation

Make sure to meet the minimum required version of Python. You can install this tool via pip:

$ pip install -u matricula-online-scraper
Or use a container ๐Ÿณ

For every version, an OCI container image is built and published to GHCR.io (GitHub's own container registry). This is especially useful if you do not want to deal with Python environments, multiple Python versions and package managers. Or you could use matricula-online-scraper in an automated environment this way. The image can also be used as a disposable container, leaving no dependencies or build artifacts on your system.

Simply copy and paste the following command into your terminal, it will automatically pull the latest image and run it:

$ docker run --rm -it ghcr.io/lsg551/matricula-online-scraper:latest

This will print the default help message and exit โ€“ but from the container and the output will be visible in your terminal.

You can append any command of matricula-online-scraper to the end to run it directly, e.g. to list all parishes:

# docker run --rm -it <IMAGE> <SUBCOMMAND>
$ docker run --rm -it ghcr.io/lsg551/matricula-online-scraper:latest parish list --place Paderborn -h

If you want to scrape data and save it to your local filesystem, you will have to create a bind mount via the -v flag though. Otherwise, the data would be saved inside the container, but not on your own machine.

Let's say you want to scrape this parish register and save it to your current working directory.

$ docker run -v "$(pwd):/data" --rm -it ghcr.io/lsg551/matricula-online-scraper:latest \
    parish fetch https://data.matricula-online.eu/de/deutschland/muenster/anholt-st-pankratius/KB001_1/?pg=1 \
    -o /data/matricula

It will write directly from the container to a subfolder in your current working directory (pwd) called matricula, which is mounted to /data/ inside the container.

Lastly, you can also get an interactive shell in the container

$ docker run --rm -it --entrypoint /bin/ash ghcr.io/lsg551/matricula-online-scraper:latest
root@abc123:/app# matricula-online-scraper --version
0.8.0

This will keep the container running until you exit it with exit, so you can run any command inside as you like.

NOTE: You could also use podman, a drop-in replacement for docker, if you like. The commands are the same.

Or build from source

If you want to get the latest version or just build from source, you can clone the repository and install it manually, favorably via uv:

$ git clone https://github.com/lsg551/matricula-online-scraper.git
$ cd matricula-online-scraper
$ uv venv && uv sync

If you do not have uv installed, you can install it via pip:

$ pip install -r requirements.txt

Usage

Once installed, you can can append the --help flag to any command to see its usage and options.

$ matricula-online-scraper --help

 Usage: matricula-online-scraper [OPTIONS] COMMAND [ARGS]...

 Command Line Interface (CLI) for scraping Matricula Online
 https://data.matricula-online.eu.

 You can use this tool to scrape the three primary entities from Matricula:
 1. Scanned parish registers (โ†’ images of baptism, marriage, and death records)
 2. A list of all available parishes (โ†’ location metadata)
 3. A list for each parish with metadata about its registers, including dates ranges,
 type etc.

โ•ญโ”€ Options โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ --verbose,--debug     -v        Enable verbose logging (DEBUG).                        โ”‚
โ”‚ --quiet               -q        Suppress all output (CRITICAL).                        โ”‚
โ”‚ --version                       Show the CLI's version.                                โ”‚
โ”‚ --install-completion            Install completion for the current shell.              โ”‚
โ”‚ --show-completion               Show completion for the current shell, to copy it or   โ”‚
โ”‚                                 customize the installation.                            โ”‚
โ”‚ --help                          Show this message and exit.                            โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ
โ•ญโ”€ Commands โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
โ”‚ parish     Scrape parish registers (1), a list with all available parishes (2) or a    โ”‚
โ”‚            list of the available registers in a parish (3).                            โ”‚
โ”‚ newsfeed   Scrape Matricula Online's Newsfeed.                                         โ”‚
โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ

 Attach the --help flag to any subcommand for further help and to see its options. Press
 CTRL+C to exit at any time.
 See https://github.com/lsg551/matricula-online-scraper for more information.

Examples

(1) Download a scanned parish register (i.e. images)

Imagine you opened a certain parish register on Matricula and want to download the entire book or a single page. Let's say you want to download the death register of Bautzen, Germany, starting from 1661. Copy the URL of the register when you are in the image viewer, this might look like https://data.matricula-online.eu/en/deutschland/dresden/bautzen/11/?pg=1.

Then run the following command and paste the URL into the prompt:

$ matricula-online-scraper parish fetch https://data.matricula-online.eu/en/deutschland/dresden/bautzen/11/?pg=1

Run matricula-online-scraper parish fetch --help to see all available options.

(2) List all available parishes on Matricula

$ matricula-online-scraper parish list

This command will fetch all parishes from Matricula Online, effectively scraping the entire "Fonds" page. The resulting data looks like:

country    , region                          , name                 , url                                                                          , longitude         , latitude
Deutschland, "Passau, rk. Bistum"            , Arbing-bei-Neuoetting, https://data.matricula-online.eu/en/deutschland/passau/arbing-bei-neuoetting/, 12.7081934381511  , 48.32953342002908
ร–sterreich , Oberรถsterreich: Rk. Diรถzese Linz, Eberschwang          , https://data.matricula-online.eu/en/oesterreich/oberoesterreich/eberschwang/ , 13.5620985        , 48.15550995
Polen      , "Breslau/Wroclaw, Staatsarchiv" , Hermsdorf            , https://data.matricula-online.eu/en/polen/breslau/hermsdorf/                 , 15.642741683666767, 50.84699257482722

It may take a few minutes to complete and will yield a few thousand rows. Each url value leads to the main page of the parish and can bepiped into the next command (3) to fetch metadata about the parish's registers.

Run matricula-online-scraper parish list --help to see all available options.


Cache Parishes GitHub last commit (branch)

NOTE: The data only changes rarely. A GitHub workflow automatically executes this command once a week and pushes to cache/parishes. This has the advantage that you can download the data without having to run and waiting for the command yourself as well as taking some load off the Matricula servers.

Click here to download the entire CSV: ๐Ÿ‘‰ parishes.csv ๐Ÿ‘ˆ

Or with cURL:

curl -L https://github.com/lsg551/matricula-online-scraper/raw/cache/parishes/parishes.csv.gz | gunzip > parishes.csv

(3) List all registers available in a specific parish

This command will download a list of all available registers for a single parish, including certain metadata such as the type of register, the date range, and the URL to the register itself etc.

$ matricula-online-scraper parish show https://data.matricula-online.eu/en/deutschland/muenster/muenster-st-martini/

A sample from the output (here JSON Lines) looks like this:

{
    "name": "Taufen",
    "url": "https://data.matricula-online.eu/en/deutschland/muenster/muenster-st-martini/KB001/",
    "accession_number": "KB001",
    "date": "1715 - 1800",
    "register_type": "Taufen",
    "date_range_start": "Jan. 1, 1715",
    "date_range_end": "Dec. 31, 1800"
}

Run matricula-online-scraper parish show --help to see all available options.

(4) Combine and chain these commands to download all registers within a certain region.

The three examples above only highlight a single command for different data types each. However, this data is not unconnected and can be linked together. The CLI is designed with this in mind, so you can easily combine commands, pipe data around, and chain them together to achieve more complex tasks.

For example, after you have obtained a complete list of all parishes (2), you can filter that list to only include parishes within a certain region, such as "Paderborn" in Germany, and then pipe these parish URLs from that list into the next command to download a list for each parish with metadata about its registers (3). Finally, you can pipe the URLs of the registers into the next command to download the images of the registers (1).

The following command will download the cached list with all parishes (2) (faster than matricula-online-scraper parish list), filter all parishes within the region "Paderborn", and pipe the parish URLs to matricula-online-scraper parish show to get the metadata about the registers for each parish (3). Then, matricula-online-scraper parish fetch will be called for all registers of each parish and proceeds to download the images of the registers (1).

curl -sL https://github.com/lsg551/matricula-online-scraper/raw/cache/parishes/parishes.csv.gz \
    | gunzip \
    | csvgrep -c region -m "Paderborn" \
    | csvcut -c url \
    | csvformat --skip-header \
    | xargs -n 1 -P 4 matricula-online-scraper parish show -o - \
    | jq -r ".url // empty" \
    | matricula-online-scraper parish fetch

It uses csvkit for processing the CSV data. Make sure to install it via pip install csvkit or your package manager of choice if you want to replicate this example. Also make sure to have jq installed, as it is used to parse and manipulate the JSON output of some commands.

These examples are obviously not exhaustive, but they should give you an idea of how to use the CLI tool and how to combine commands to achieve more complex tasks. With the data from Matricula Online, matricula-online-scraper and 3rd party tools like csvkit and jq, you could build geolocation searches form the coordinates provided for each parish, filter the parishes within a certain data range and region, narrow down the registers to a specific type (e.g. only baptism records), regularly backup your most important parish registers, and so on.

License

This project is licensed under the MIT License - see the LICENSE file for details.

You can read more about Matricula Online's terms of use and data licenses on their page or check out their robots.txt file at data.matricula-online.eu/robots.txt regarding restrictions of the use of automated tools (as of March 2025, they have none).

Packages

 
 
 

Contributors 2

  •  
  •