Conversation

@lfquintaz (Collaborator)

Summary

If merged, this pull request will add the Spack build cache data scraper script, the script that creates the Spack SQLite database, test files for the data scraper script, and a README.md covering the use of both spack_db.py and Create_spack_DB.py.

Proposed changes

Scripts for scraping the Spack build cache and generating the Spack SQLite database.

@@ -0,0 +1,61 @@
# Spack Build Cache Data Scraper & SQLite Database

This project aims to scrape the Spack build cache by downloading, cleaning, and indexing spec manifests and binary tarballs into a local cache, then convert the data into a Spack SQLite database.
Collaborator:

Suggested change:
- This project aims to scrape the Spack build cache by downloading, cleaning, and indexing spec manifests and binary tarballs into a local cache, then convert the data into a Spack SQLite database.
+ This project aims to scrape the Spack build cache by downloading, cleaning, and indexing spec manifests and binary tarballs into a local cache, then convert the data into a SQLite database that maps file names back to the Spack package that contains that file.

Comment on lines +30 to +31
* Retrieves binary tarballs and extracts file lists
* Creates and maintains a canonical JSON index that maps package to it's manifest and tarball information
Collaborator:

Suggested change:
- * Retrieves binary tarballs and extracts file lists
- * Creates and maintains a canonical JSON index that maps package to it's manifest and tarball information
+ * Retrieves package binary tarballs and extracts file lists
+ * Creates and maintains a canonical JSON index that maps package to its manifest and tarball information

* Retrieves binary tarballs and extracts file lists
* Creates and maintains a canonical JSON index that maps package to it's manifest and tarball information
* Contains multiple checkpoints for safe restart/resume of the program
* Records skipped/malformed manifests, missing hashes, failed tarbll downloads
Collaborator:

Suggested change:
- * Records skipped/malformed manifests, missing hashes, failed tarbll downloads
+ * Records skipped/malformed manifests, missing hashes, failed tarball downloads

The rest of the necessary modules are part of Python's standard library.

2. Provide a database file
Update the `file_name` in `main()` if needed
Collaborator:

From a UX perspective, adding support for a command line argument that lets a user provide the name of the database file would be nice.
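A minimal sketch of what that could look like using argparse from the standard library; the flag name and the default are illustrative, not what the script currently does:

```python
import argparse

def parse_args():
    # --db-file lets the user pick the SQLite file name instead of editing main()
    parser = argparse.ArgumentParser(description="Create the Spack SQLite database")
    parser.add_argument(
        "--db-file",
        default="spack.db",
        help="name of the SQLite database file to use (default: %(default)s)",
    )
    return parser.parse_args()

# main() would then read parse_args().db_file instead of a hardcoded file_name
```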

@@ -0,0 +1,143 @@
import os
Collaborator:

Suggested change:
- import os
+ # /// script
+ # dependencies = [
+ #   "dapper-python",
+ # ]
+ # ///
+ import os

Adding inline script metadata capturing the list of dependencies is useful both as documentation and because it makes it possible to use uv or pipx to run the script without having to worry about manually installing dependencies.

(https://peps.python.org/pep-0723 has more info on this inline script metadata and its format)
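For example, with the metadata above in place, something like `uv run spack_db.py` (or `pipx run spack_db.py`) should fetch dapper-python automatically before executing the script; the script name here is just illustrative for whichever file this hunk belongs to.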

Comment on lines +61 to +63
def _to_posix(p:str) -> str:
    return Path(p).as_posix()

Collaborator:

Suggested change:
- def _to_posix(p:str) -> str:
-     return Path(p).as_posix()

This function is a duplicate of the one below it.

Comment on lines +182 to +194
# This -> "compiler-wrapper/compiler-wrapper-1.0-bsavlbvtqsc7yjtvka3ko3aem4wye2u3.spec.manifest.json"
# is turned into this -> "compiler-wrapper__compiler-wrapper-1.0-bsavlbvtqsc7yjtvka3ko3aem4wye2u3.spec.manifest.json"
package_name = package.replace('/','__')

# if is_spec is true, meaning the file ends with ".spec.manifest.json",
# then the file is not saved, but the reponse is returned to remove_lines_spec_manifest() for further manipulation
# if the file ends with .tar.gz
# then the file is saved in BINARY_CACHE_DIR
cache_dir = BINARY_CACHE_DIR if not is_spec else None

# full file path then is:
# "cache/spec_manifests/compiler-wrapper/compiler-wrapper-1.0-bsavlbvtqsc7yjtvka3ko3aem4wye2u3"
#cache/manifest\\compiler-wrapper-1.0-bsavlbvtqsc7yjtvka3ko3aem4wye2u3.json
Collaborator:

In terms of indentation, I'd place the # for all of these comments at the same level of indentation as the lines of code (then just add extra spaces to the right of the #). If you ran a tool like ruff on your code to follow consistent formatting, it would do the correct indentation for you throughout this entire file. Check https://docs.astral.sh/ruff/formatter/ if you want to try it out.
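(For example, something like `ruff format path/to/script.py` rewrites the file in place with consistent formatting, and `ruff check` flags other common issues.)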


# removing the placeholder directories in the file path
def remove_placeholder_directories(i, name, package):
    # i is the counter for file enumeration
Collaborator:

I think I came across another placeholder directory name used by some Windows packages, which is not being removed by this function. (spack install/morepadding/uztmt5lglkxj3h42tuutgfcd7ypdsmgv/bin/lz4.dll is an example -- not removing the spack install/morepadding/<hash> stuff for a binary package compiled for Windows)
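A rough sketch of one way to strip that prefix, assuming the directory names and the 32-character hash seen in the example path above generalize (worth verifying against more Windows packages):

```python
import re

# Matches the Windows padding prefix seen in paths like
# "spack install/morepadding/uztmt5lglkxj3h42tuutgfcd7ypdsmgv/bin/lz4.dll";
# the hash appears to be 32 lowercase alphanumeric characters.
_WINDOWS_PADDING_RE = re.compile(r"^spack install/morepadding/[a-z0-9]{32}/")

def strip_windows_padding(path: str) -> str:
    """Drop the 'spack install/morepadding/<hash>/' prefix, leaving e.g. 'bin/lz4.dll'."""
    return _WINDOWS_PADDING_RE.sub("", path)
```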

Comment on lines +516 to +520
def main():
    #file_name = "myMedjson.json"
    # file_name = "myjson.json"
    # file_name = 'Med_w_compilerwrapper_packages_at_end.json'
    file_name = "e2a6969c742c8ee33deba2d210ce2243cd3941c6553a3ffc53780ac6463537a9"
Collaborator:

Making your script take the file name as an argument would make it more flexible. It could also fetch the latest version of the Spack index db file (e.g. the one that happened to have the name e2a6969c7 when we originally looked at how the Spack package repository stores things).
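A rough sketch combining both ideas: the index file name as a positional argument plus an optional flag to download the latest index first. The mirror URL below is a placeholder, not the real location of the index blob in the Spack build cache:

```python
import argparse
import urllib.request

# Placeholder only; the actual URL of the index blob in the Spack
# build cache mirror needs to be confirmed.
INDEX_URL = "https://example.invalid/spack/build_cache/index"

def parse_args():
    parser = argparse.ArgumentParser(description="Build the SQLite database from a Spack index db")
    parser.add_argument("index_file", help="local path to the Spack index db JSON")
    parser.add_argument(
        "--fetch-latest",
        action="store_true",
        help="download the latest index to index_file before processing",
    )
    return parser.parse_args()

def get_index_file(args) -> str:
    # Optionally refresh the local copy before main() reads it
    if args.fetch_latest:
        urllib.request.urlretrieve(INDEX_URL, args.index_file)
    return args.index_file
```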

Comment on lines +136 to +143
def readmyfile(myfile):
    try:
        with open(myfile, 'r') as file:
            # database is the spack database json, within the spack build cache
            db = json.load(file)  # 8.6 seconds to read in large json file

            # returns database
            return db
Collaborator:

I'd suggest renaming this function and the `myfile` argument to be more descriptive, something like `read_spack_index_db` and `db_file`.
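Something along these lines, keeping the body of the excerpt above (the except clause is cut off in the quoted range, so error handling is elided here):

```python
import json

def read_spack_index_db(db_file):
    """Load the Spack index db JSON (from the build cache) into a dict."""
    with open(db_file, "r") as f:
        return json.load(f)  # ~8.6 seconds for the large index file
```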

@nightlark added the enhancement label on Oct 9, 2025