Reference not extracted in TEI processing #8

@lfoppiano

The following dataset mention is not extracted when processing the PDF. It is extracted when processing the TEI, but the reference attached to it is missed:

{
    "rawForm" : "WHO COVID-19 dashboard",
    "type" : "dataset-name",
    "dataset-name" : {
      "rawForm" : "WHO COVID-19 dashboard",
      "normalizedForm" : "WHO COVID-19 dashboard",
      "offsetStart" : 138,
      "offsetEnd" : 160
    },
    "normalizedForm" : "WHO COVID-19 dashboard",
    "inDataAvailabilitySection" : true,
    "context" : "For the H1N1 outbreak in 2009 we used the case data provided by FluNet [62, 63] (the column AH1N12009), for the COVID-19 cases we use the WHO COVID-19 dashboard [64] accessed through ourworldindata.org, the number of sequenced samples was accessed through GISAID [65-67] using the file gisaid_variants_statistics.json.",
    "sequenceIds" : [ "_66tmVw6" ],
    "mentionContextAttributes" : {
      "used" : {
        "value" : true,
        "score" : 0.9999715089797974
      },
      "created" : {
        "value" : false,
        "score" : 1.1888024346262682E-5
      },
      "shared" : {
        "value" : false,
        "score" : 1.6568916180403903E-5
      }
    },
    "documentContextAttributes" : {
      "used" : {
        "value" : true,
        "score" : 0.9999715089797974
      },
      "created" : {
        "value" : false,
        "score" : 1.1888024346262682E-5
      },
      "shared" : {
        "value" : false,
        "score" : 1.6568916180403903E-5
      }
    }
  }

FluNet is also a data source and is extracted as a dataset mention, but its reference is missed as well:

{
    "rawForm" : "FluNet",
    "type" : "dataset-name",
    "dataset-name" : {
      "rawForm" : "FluNet",
      "normalizedForm" : "FluNet",
      "offsetStart" : 64,
      "offsetEnd" : 70
    },
    "normalizedForm" : "FluNet",
    "inDataAvailabilitySection" : true,
    "context" : "For the H1N1 outbreak in 2009 we used the case data provided by FluNet [62, 63] (the column AH1N12009), for the COVID-19 cases we use the WHO COVID-19 dashboard [64] accessed through ourworldindata.org, the number of sequenced samples was accessed through GISAID [65-67] using the file gisaid_variants_statistics.json.",
    "sequenceIds" : [ "_66tmVw6" ],
    "mentionContextAttributes" : {
      "used" : {
        "value" : true,
        "score" : 0.9999715089797974
      },
      "created" : {
        "value" : false,
        "score" : 1.1888024346262682E-5
      },
      "shared" : {
        "value" : false,
        "score" : 1.6568916180403903E-5
      }
    },
    "documentContextAttributes" : {
      "used" : {
        "value" : true,
        "score" : 0.9999715089797974
      },
      "created" : {
        "value" : false,
        "score" : 1.1888024346262682E-5
      },
      "shared" : {
        "value" : false,
        "score" : 1.6568916180403903E-5
      }
    }
  }
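
To make the expected result concrete, here is a minimal sketch that recovers the bracketed citation callout following each mention, using only the `context`, `offsetStart` and `offsetEnd` values shown above. The helper name `trailing_citation` is hypothetical and not part of DataStet, the mention dictionaries are trimmed to the fields the sketch needs, and the regex assumes numeric callouts in square brackets as in this document.

```python
import re

# Context string copied from the two mentions above.
CONTEXT = (
    "For the H1N1 outbreak in 2009 we used the case data provided by FluNet "
    "[62, 63] (the column AH1N12009), for the COVID-19 cases we use the WHO "
    "COVID-19 dashboard [64] accessed through ourworldindata.org, the number "
    "of sequenced samples was accessed through GISAID [65-67] using the file "
    "gisaid_variants_statistics.json."
)

# Mentions trimmed to the fields used here (values taken from the JSON above).
MENTIONS = [
    {"rawForm": "FluNet", "offsetStart": 64, "offsetEnd": 70},
    {"rawForm": "WHO COVID-19 dashboard", "offsetStart": 138, "offsetEnd": 160},
]

# Assumption: a callout is optional whitespace followed by "[...]".
CITATION_AFTER = re.compile(r"\s*\[([^\]]+)\]")


def trailing_citation(context, mention):
    """Return the callout text right after the mention, or None if absent."""
    start, end = mention["offsetStart"], mention["offsetEnd"]
    # Sanity check: the offsets must point at the mention's raw form.
    if context[start:end] != mention["rawForm"]:
        raise ValueError(f"offsets do not match rawForm {mention['rawForm']!r}")
    # Pattern.match with a start position anchors the search at the mention end.
    found = CITATION_AFTER.match(context, end)
    return found.group(1) if found else None


for mention in MENTIONS:
    print(mention["rawForm"], "->", trailing_citation(CONTEXT, mention))
```

Under these assumptions, FluNet resolves to the callout `62, 63` and the WHO COVID-19 dashboard to `64`, which are the references expected on the two mentions.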

We also miss another reference, the one with the following URL: https://www.oag.com/airline-schedules-data

Document: journal.pcbi.1011775.pdf
TEI: journal.pcbi.1011775.tei.xml.zip
JSON from the TEI: journal.pcbi.1011775_datastet_output.json
