ODI Crossref Deposit Pipeline

This tool automates the generation of Crossref-compliant XML metadata for Open Data Institute (ODI) publications. It scrapes publication data directly from the URLs you provide, resolves author ORCID iDs by visiting the individual profile pages linked from those publication pages, and compiles the data into a schema-validated XML file ready for batch upload.

Features

Automated Scraper: Extracts title, date, and author lists from ODI "Report" and "Insight" pages.
ORCID Resolution: Follows author profile links to extract and validate ORCID iDs, correcting common formatting errors (e.g., trailing characters).
Schema Compliance: Generates XML strictly adhering to Crossref Schema 5.3.1, ensuring correct element ordering for contributors, titles, and affiliations.
Audit Trail: Produces a companion CSV file detailing exactly what data was scraped for internal records.
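
To illustrate the ORCID Resolution step, here is a minimal sketch of how a scraped value could be cleaned and validated. It is not taken from main.py: the function names are hypothetical, and the only fixed facts used are the ORCID iD layout (four groups of four characters) and its ISO 7064 MOD 11-2 check digit.

```python
import re
from typing import Optional

# 16 characters in four groups; the final character may be a digit or X.
ORCID_PATTERN = re.compile(r"(\d{4})-?(\d{4})-?(\d{4})-?(\d{3}[\dXx])")


def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def normalise_orcid(raw: str) -> Optional[str]:
    """Return a canonical ORCID iD found in `raw`, or None if none is valid.

    Surrounding text such as a URL prefix or trailing characters is ignored.
    """
    match = ORCID_PATTERN.search(raw)
    if match is None:
        return None
    digits = "".join(match.groups()).upper()
    if orcid_check_digit(digits[:15]) != digits[15]:
        return None  # well-formed but fails the checksum
    return "-".join(digits[i:i + 4] for i in range(0, 16, 4))
```

For example, normalise_orcid("https://orcid.org/0000-0002-1825-0097/") yields "0000-0002-1825-0097", while a value with a wrong final digit yields None.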

Requirements

Python 3.8+
The following Python packages:

pip install requests beautifulsoup4 pandas

Configuration

Open main.py in a text editor.
Update Credentials: Locate the configuration block at the top and ensure the DEPOSITOR_EMAIL matches your Crossref account email.
Input Data: Populate the records_to_process list with the URLs and DOIs you wish to process:

records_to_process = [
    {
        "url": "https://theodi.org/insights/reports/example-report",
        "doi": "10.61557/EXAMPLESUFFIX"
    },
    # Add additional records here...
]
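
Before running the script, it can help to check the list for typos and accidental DOI reuse. The following is a sketch only: validate_records is a hypothetical helper, not part of main.py, and its checks are deliberately loose (an HTTP(S) URL, a DOI shaped like 10.prefix/suffix, and no DOI used twice, compared case-insensitively because DOIs are case-insensitive).

```python
from typing import Dict, List


def validate_records(records: List[Dict[str, str]]) -> List[str]:
    """Return human-readable problems found in a records_to_process list."""
    problems = []
    first_seen = {}  # lower-cased DOI -> index of the record that used it first
    for index, record in enumerate(records):
        url = record.get("url", "").strip()
        doi = record.get("doi", "").strip()
        if not url.startswith(("http://", "https://")):
            problems.append(f"record {index}: missing or non-HTTP url")
        if not doi.startswith("10.") or "/" not in doi:
            problems.append(f"record {index}: doi does not look like 10.prefix/suffix")
            continue
        key = doi.lower()
        if key in first_seen:
            problems.append(f"record {index}: doi already used by record {first_seen[key]}")
        else:
            first_seen[key] = index
    return problems
```

An empty return value means none of these checks failed; it does not confirm that the DOI is registered to you or that the URL resolves.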

User involvement

You must collect the URLs of the reports you want to upload and assign a unique DOI to each one. Unused DOIs are listed in unused_dois.txt: copy one DOI from that file for each URL in records_to_process. To keep the file accurate for the next user, delete every DOI you use from unused_dois.txt.
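
The copy-then-delete step can be done in one go so that a used DOI is never left in the file. This sketch assumes unused_dois.txt holds one DOI per line, which this README does not state; claim_dois is a hypothetical helper, not something main.py provides.

```python
from pathlib import Path
from typing import List


def claim_dois(path: str, count: int) -> List[str]:
    """Take the first `count` DOIs from the file and rewrite it without them."""
    file = Path(path)
    lines = file.read_text(encoding="utf-8").splitlines()
    dois = [line.strip() for line in lines if line.strip()]  # skip blank lines
    if count > len(dois):
        raise ValueError(f"requested {count} DOIs but only {len(dois)} remain")
    claimed, remaining = dois[:count], dois[count:]
    file.write_text("".join(doi + "\n" for doi in remaining), encoding="utf-8")
    return claimed
```

Calling claim_dois("unused_dois.txt", 3) would return three DOIs to paste into records_to_process and leave the rest in the file. Commit the updated file afterwards so other users see the change.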

Usage

Run the script from your terminal:

python main.py

The script will process records sequentially. Note that there is a configured delay (0.5s) between requests to avoid overloading the web server.
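
The sequential, throttled loop described above might look roughly like this. It is a sketch under assumptions: process_sequentially and the handle callback are hypothetical names, and the real script may place its delay differently (for example, after every request rather than between them).

```python
import time
from typing import Any, Callable, Dict, List

REQUEST_DELAY_SECONDS = 0.5  # the delay this README describes


def process_sequentially(
    records: List[Dict[str, str]],
    handle: Callable[[Dict[str, str]], Any],
    delay: float = REQUEST_DELAY_SECONDS,
    sleep: Callable[[float], Any] = time.sleep,
) -> List[Any]:
    """Apply `handle` to each record in order, pausing between records."""
    results = []
    for index, record in enumerate(records):
        if index > 0:
            sleep(delay)  # pause between requests, not before the first one
        results.append(handle(record))
    return results
```

Passing sleep as a parameter keeps the loop easy to test without real waiting.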

Outputs

The script generates two files in the working directory, stamped with the current batch ID (e.g., ODI_Deposit_20251217):

.xml file: The primary output. Upload this file directly to the Crossref Admin Tool under the "Metadata Admin" tab.
_audit.csv file: A flat-file record of titles, DOIs, and resolved authors. Use this to spot-check metadata before uploading the XML.
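
A quick way to spot-check the audit file is to list rows with empty fields. This sketch assumes the CSV has columns named title, doi, and authors; those names are a guess and should be adjusted to match the actual header row. rows_needing_review is a hypothetical helper.

```python
import csv
from typing import List, Sequence, Tuple


def rows_needing_review(
    csv_path: str,
    required: Sequence[str] = ("title", "doi", "authors"),  # assumed column names
) -> List[Tuple[int, List[str]]]:
    """Return (row number, missing columns) for rows with empty required fields.

    Row numbers count the header as row 1, so the first data row is row 2.
    """
    flagged = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row_number, row in enumerate(csv.DictReader(handle), start=2):
            missing = [name for name in required if not (row.get(name) or "").strip()]
            if missing:
                flagged.append((row_number, missing))
    return flagged
```

A column that is absent from the file altogether is reported as missing on every row, which makes a wrong column name easy to notice.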

User involvement

The .xml file must be uploaded to Crossref manually. Neil Majithia (neil.majithia@theodi.org) is currently responsible for this.
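
Before handing the file over for upload, a lightweight sanity check can confirm that it parses and declares the expected schema version. This is a sketch, not full XSD validation: check_deposit_root is a hypothetical helper, and it assumes the deposit uses the usual Crossref conventions of a doi_batch root element in the namespace http://www.crossref.org/schema/5.3.1 with a matching version attribute.

```python
import xml.etree.ElementTree as ET
from typing import List

EXPECTED_VERSION = "5.3.1"
EXPECTED_NAMESPACE = f"http://www.crossref.org/schema/{EXPECTED_VERSION}"


def check_deposit_root(xml_path: str) -> List[str]:
    """Return problems with the root element of a deposit file.

    Raises xml.etree.ElementTree.ParseError if the file is not well-formed.
    """
    root = ET.parse(xml_path).getroot()
    problems = []
    expected_tag = f"{{{EXPECTED_NAMESPACE}}}doi_batch"  # ElementTree's {namespace}name form
    if root.tag != expected_tag:
        problems.append(f"unexpected root element: {root.tag}")
    if root.get("version") != EXPECTED_VERSION:
        problems.append(f"unexpected version attribute: {root.get('version')}")
    return problems
```

An empty list only means the root element looks right; element ordering and content are still checked by Crossref on upload.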

The ODI keeps a spreadsheet to track reports and their DOIs, accessible here: https://docs.google.com/spreadsheets/d/1yHVptdEF8--hTXjisNnzGQCMnLYuJjIUm1ochzZKeHU/edit?usp=sharing. Users should use _audit.csv to update this spreadsheet after an upload.

Troubleshooting

XML Validation Errors: If the upload fails, check the Crossref error log emailed afterward. The script is tuned for the 5.3.1 schema; ensure you are not uploading to an endpoint expecting an older version.
Missing Authors: If an author appears in the CSV but has no ORCID, verify that their profile link on the ODI website works and explicitly lists their ORCID.
Connection Errors: If the script fails to load pages, ensure you are not being rate-limited by the server. Increase time.sleep() in the script if necessary.
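
As an alternative to raising a fixed delay, failed requests can be retried with an increasing pause. The sketch below is not how main.py currently behaves: fetch_with_backoff is a hypothetical wrapper around whatever function performs the request. It retries on OSError, which covers Python's built-in connection errors and, because requests.exceptions.RequestException derives from IOError, errors raised by requests as well.

```python
import time
from typing import Any, Callable, TypeVar

T = TypeVar("T")


def fetch_with_backoff(
    fetch: Callable[[str], T],
    url: str,
    attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], Any] = time.sleep,
) -> T:
    """Call fetch(url), retrying on OSError with exponentially growing pauses."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller see the last error
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise ValueError("attempts must be at least 1")
```

With requests, this could be called as fetch_with_backoff(lambda u: requests.get(u, timeout=10), url). Note that requests.get does not raise for HTTP error statuses such as 429 unless raise_for_status() is called on the response.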
