Conversation

@lfquintaz (Collaborator)

Summary

If merged, this pull request will add the Spack build cache data scraper script, the script that creates the Spack SQLite database, test files for the data scraper script, and a README.md covering the use of both spack_db.py and Create_spack_DB.py.

Proposed changes

Scripts for scraping the Spack build cache and generating the Spack SQLite database.

@@ -0,0 +1,61 @@
# Spack Build Cache Data Scraper & SQLite Database

This project aims to scrape the Spack build cache by downloading, cleaning, and indexing spec manifests and binary tarballs into a local cache, then convert the data into a Spack SQLite database.
Collaborator:

Suggested change:
- This project aims to scrape the Spack build cache by downloading, cleaning, and indexing spec manifests and binary tarballs into a local cache, then convert the data into a Spack SQLite database.
+ This project aims to scrape the Spack build cache by downloading, cleaning, and indexing spec manifests and binary tarballs into a local cache, then convert the data into a SQLite database that maps file names back to the Spack package that contains that file.

Comment on lines +30 to +31
* Retrieves binary tarballs and extracts file lists
* Creates and maintains a canonical JSON index that maps package to it's manifest and tarball information
Collaborator:

Suggested change:
- * Retrieves binary tarballs and extracts file lists
- * Creates and maintains a canonical JSON index that maps package to it's manifest and tarball information
+ * Retrieves package binary tarballs and extracts file lists
+ * Creates and maintains a canonical JSON index that maps package to its manifest and tarball information

* Retrieves binary tarballs and extracts file lists
* Creates and maintains a canonical JSON index that maps package to it's manifest and tarball information
* Contains multiple checkpoints for safe restart/resume of the program
* Records skipped/malformed manifests, missing hashes, failed tarbll downloads
Collaborator:

Suggested change:
- * Records skipped/malformed manifests, missing hashes, failed tarbll downloads
+ * Records skipped/malformed manifests, missing hashes, failed tarball downloads

The rest of the necessary modules are part of Python's standard library.

2. Provide a database file
Update the `file_name` in `main()` if needed
Collaborator:

From a UX perspective, adding support for a command line argument that lets a user provide the name of the database file would be nice.
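A minimal sketch of what that could look like using argparse from the standard library; the flag name and the default are illustrative, not what the script currently does:

```python
import argparse

def parse_args():
    # --db-file lets the user pick the SQLite file name instead of editing main()
    parser = argparse.ArgumentParser(description="Create the Spack SQLite database")
    parser.add_argument(
        "--db-file",
        default="spack.db",
        help="name of the SQLite database file to use (default: %(default)s)",
    )
    return parser.parse_args()

# main() would then read parse_args().db_file instead of a hardcoded file_name
```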

@@ -0,0 +1,143 @@
import os
Collaborator:

Suggested change:
- import os
+ # /// script
+ # dependencies = [
+ #   "dapper-python",
+ # ]
+ # ///
+ import os

Adding inline script metadata capturing the list of dependencies is useful both as documentation and because it makes it possible to use uv or pipx to run the script without having to worry about manually installing dependencies.

(https://peps.python.org/pep-0723 has more info on this inline script metadata and its format)
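For example, with the metadata above in place, something like `uv run spack_db.py` (or `pipx run spack_db.py`) should fetch dapper-python automatically before executing the script; the script name here is just illustrative for whichever file this hunk belongs to.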

Comment on lines +61 to +63
def _to_posix(p:str) -> str:
    return Path(p).as_posix()

Collaborator:

Suggested change:
- def _to_posix(p:str) -> str:
-     return Path(p).as_posix()

This function is a duplicate of the one below it.

Comment on lines +182 to +194
# This -> "compiler-wrapper/compiler-wrapper-1.0-bsavlbvtqsc7yjtvka3ko3aem4wye2u3.spec.manifest.json"
# is turned into this -> "compiler-wrapper__compiler-wrapper-1.0-bsavlbvtqsc7yjtvka3ko3aem4wye2u3.spec.manifest.json"
package_name = package.replace('/','__')

# if is_spec is true, meaning the file ends with ".spec.manifest.json",
# then the file is not saved, but the reponse is returned to remove_lines_spec_manifest() for further manipulation
# if the file ends with .tar.gz
# then the file is saved in BINARY_CACHE_DIR
cache_dir = BINARY_CACHE_DIR if not is_spec else None

# full file path then is:
# "cache/spec_manifests/compiler-wrapper/compiler-wrapper-1.0-bsavlbvtqsc7yjtvka3ko3aem4wye2u3"
#cache/manifest\\compiler-wrapper-1.0-bsavlbvtqsc7yjtvka3ko3aem4wye2u3.json
Collaborator:

In terms of indentation, I'd place the # for all of these comments at the same level of indentation as the lines of code (then just add extra spaces to the right of the #). If you ran a tool like ruff on your code to follow consistent formatting, it would do the correct indentation for you throughout this entire file. Check https://docs.astral.sh/ruff/formatter/ if you want to try it out.
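(For example, something like `ruff format path/to/script.py` rewrites the file in place with consistent formatting, and `ruff check` flags other common issues.)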


# removing the placeholder directories in the file path
def remove_placeholder_directories(i, name, package):
    # i is the counter for file enumeration
Collaborator:

I think I came across another placeholder directory name used by some Windows packages, which is not being removed by this function. (spack install/morepadding/uztmt5lglkxj3h42tuutgfcd7ypdsmgv/bin/lz4.dll is an example -- not removing the spack install/morepadding/<hash> stuff for a binary package compiled for Windows)
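A rough sketch of one way to strip that prefix, assuming the directory names and the 32-character hash seen in the example path above generalize (worth verifying against more Windows packages):

```python
import re

# Matches the Windows padding prefix seen in paths like
# "spack install/morepadding/uztmt5lglkxj3h42tuutgfcd7ypdsmgv/bin/lz4.dll";
# the hash appears to be 32 lowercase alphanumeric characters.
_WINDOWS_PADDING_RE = re.compile(r"^spack install/morepadding/[a-z0-9]{32}/")

def strip_windows_padding(path: str) -> str:
    """Drop the 'spack install/morepadding/<hash>/' prefix, leaving e.g. 'bin/lz4.dll'."""
    return _WINDOWS_PADDING_RE.sub("", path)
```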

Comment on lines +516 to +520
def main():
    #file_name = "myMedjson.json"
    # file_name = "myjson.json"
    # file_name = 'Med_w_compilerwrapper_packages_at_end.json'
    file_name = "e2a6969c742c8ee33deba2d210ce2243cd3941c6553a3ffc53780ac6463537a9"
Collaborator:

Making your script take the file name as an argument would make it more flexible. It could also fetch the latest version of the Spack index db file (e.g. the one that happened to have the name e2a6969c7 when we originally looked at how the Spack package repository stores things).
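A rough sketch combining both ideas: the index file name as a positional argument plus an optional flag to download the latest index first. The mirror URL below is a placeholder, not the real location of the index blob in the Spack build cache:

```python
import argparse
import urllib.request

# Placeholder only; the actual URL of the index blob in the Spack
# build cache mirror needs to be confirmed.
INDEX_URL = "https://example.invalid/spack/build_cache/index"

def parse_args():
    parser = argparse.ArgumentParser(description="Build the SQLite database from a Spack index db")
    parser.add_argument("index_file", help="local path to the Spack index db JSON")
    parser.add_argument(
        "--fetch-latest",
        action="store_true",
        help="download the latest index to index_file before processing",
    )
    return parser.parse_args()

def get_index_file(args) -> str:
    # Optionally refresh the local copy before main() reads it
    if args.fetch_latest:
        urllib.request.urlretrieve(INDEX_URL, args.index_file)
    return args.index_file
```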

Comment on lines +136 to +143
def readmyfile(myfile):
    try:
        with open(myfile, 'r') as file:
            # database is the spack database json, within the spack build cache
            db = json.load(file)  # 8.6 seconds to read in large json file

            # returns database
            return db
Collaborator:

I'd suggest renaming this function and the `myfile` argument to be more descriptive, something like `read_spack_index_db` and `db_file`.
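Something along these lines, keeping the body of the excerpt above (the except clause is cut off in the quoted range, so error handling is elided here):

```python
import json

def read_spack_index_db(db_file):
    """Load the Spack index db JSON (from the build cache) into a dict."""
    with open(db_file, "r") as f:
        return json.load(f)  # ~8.6 seconds for the large index file
```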

@nightlark added the enhancement label on Oct 9, 2025