KorAP/defako

Conversion tools for Deutsches Fachliteraturkorpus (DeFaKo@DNB)

Prerequisites

Saxon EE License

This project requires a Saxon EE license for XML processing. The license file (saxon-license.lic) is not included in this repository for security reasons.

For CI/CD environments, set the SAXON_LICENSE environment variable with the license content. For local development, place your saxon-license.lic file in the lib/ directory.
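As a rough sketch of how the two setups relate, the following shell helper writes the content of SAXON_LICENSE to lib/saxon-license.lic. The function name `install_saxon_license` is hypothetical and not part of this repository, and the project's own build may already handle the variable itself, so treat this only as an illustration.

```shell
# Hypothetical helper (not part of the repository): write the Saxon EE
# license held in $SAXON_LICENSE to <dir>/saxon-license.lic.
# The default directory "lib" follows the local-development instructions.
install_saxon_license() {
    target_dir="${1:-lib}"
    if [ -z "${SAXON_LICENSE:-}" ]; then
        echo "SAXON_LICENSE is not set" >&2
        return 1
    fi
    mkdir -p "$target_dir" || return 1
    # Create the file with permissions restricted to the current user.
    (umask 077 && printf '%s\n' "$SAXON_LICENSE" > "$target_dir/saxon-license.lic")
}

# Only act when the variable is present, e.g. inside a CI job.
if [ -n "${SAXON_LICENSE:-}" ]; then
    install_saxon_license lib
fi
```

Run from the repository root, this leaves the license where local development expects it. The file should stay out of version control.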

Testing

Run TEI I5 conversion tests on local test data

make -j $(nproc) test

Build test index

make -j $(nproc) test index

Run local KorAP with test index

INDEX=./target/dnf.index docker compose -p defako --profile=lite -f korap4dnb-compose.yml up -d

xdg-open http://localhost:4001/?q=Test

Alternatively, open an SSH tunnel from localhost to the DeFaKo@DNB server

ssh -L 4001:localhost:4001 korap.dnb.de
xdg-open http://localhost:4001/?q=Test

Stop local KorAP

docker compose -p defako down

Convert PDFs to TEI P5

This is actually the first step of the pipeline, but it usually does not need to be repeated: the TEI P5 files in the p5 folder are comparatively expensive to produce and are therefore not deleted by make clean.

Start GROBID server

docker run --rm --init -v ./grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml --ulimit core=0 -e JAVA_OPTS=-Xmx400g -p 8070:8070 grobid/grobid:0.8.1
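
The server can take a while to load its models, so it may help to wait until it answers before starting the client. The sketch below polls GROBID's `/api/isalive` health endpoint on the port mapped above. The function name `wait_for_grobid`, the retry count, and the delay are assumptions chosen for illustration.

```shell
# Hypothetical helper: poll GROBID's /api/isalive endpoint until it responds.
# Arguments (all optional): base URL, maximum number of tries, delay in seconds.
wait_for_grobid() {
    base_url="${1:-http://localhost:8070}"
    max_tries="${2:-60}"
    delay="${3:-5}"
    i=0
    while [ "$i" -lt "$max_tries" ]; do
        # -f makes curl return non-zero on HTTP errors; -sS silences progress.
        if curl -fsS "$base_url/api/isalive" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "GROBID did not respond at $base_url" >&2
    return 1
}
```

Calling `wait_for_grobid` before the client command below avoids starting the conversion against a server that is not ready yet.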

Run client to convert PDFs to TEI P5

java -jar lib/org.grobid.client-0.5.4-SNAPSHOT.one-jar.jar -n 100 -in /mnt/data/Diss-Sample/PDF -out p5

HTTPD configuration

Configure Apache2 to proxy requests to the local KorAP server (this requires the mod_proxy and mod_proxy_http modules to be enabled):

ProxyPass /defako http://localhost:4001
ProxyPassReverse /defako http://localhost:4001

References

Kupietz, Marc/Leinen, Peter/Diewald, Nils (2024): Towards a Very Large German Academic Corpus: Step 1: Building and Making Available a Corpus of 10,000 Doctoral Dissertations. Talk given at the Workshop on Comparable and Interoperable Corpora of Academic Texts @CLARIN2024 on 2024-10-18, Barcelona. https://corpora.ids-mannheim.de/slides/2024-10-17-Towards-a-German-Academic-Corpus/#/.
