Spack scraping lq #154
base: main
Conversation
…ast processed package
@@ -0,0 +1,61 @@
+ # Spack Build Cache Data Scraper & SQLite Database
+
+ This project aims to scrape the Spack build cache by downloading, cleaning, and indexing spec manifests and binary tarballs into a local cache, then convert the data into a Spack SQLite database.
Suggested change:
- This project aims to scrape the Spack build cache by downloading, cleaning, and indexing spec manifests and binary tarballs into a local cache, then convert the data into a Spack SQLite database.
+ This project aims to scrape the Spack build cache by downloading, cleaning, and indexing spec manifests and binary tarballs into a local cache, then convert the data into a SQLite database that maps file names back to the Spack package that contains that file.
+ * Retrieves binary tarballs and extracts file lists
+ * Creates and maintains a canonical JSON index that maps package to it's manifest and tarball information
Suggested change:
- * Retrieves binary tarballs and extracts file lists
- * Creates and maintains a canonical JSON index that maps package to it's manifest and tarball information
+ * Retrieves package binary tarballs and extracts file lists
+ * Creates and maintains a canonical JSON index that maps package to its manifest and tarball information
+ * Retrieves binary tarballs and extracts file lists
+ * Creates and maintains a canonical JSON index that maps package to it's manifest and tarball information
+ * Contains multiple checkpoints for safe restart/resume of the program
+ * Records skipped/malformed manifests, missing hashes, failed tarbll downloads
Suggested change:
- * Records skipped/malformed manifests, missing hashes, failed tarbll downloads
+ * Records skipped/malformed manifests, missing hashes, failed tarball downloads
+ The rest of the necessary modules are part of Python's standard library.
+
+ 2. Provide a database file
+    Update the file_name in `main()` if needed
From a UX perspective, adding support for a command line argument that lets a user provide the name of the database file would be nice.
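For instance, a minimal sketch of what that could look like with argparse (the `--db-file` flag name and its default are assumptions, not part of this PR):

```python
# Hypothetical sketch: accept the database file name on the command line
# instead of hard-coding it in main().
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description="Create the Spack SQLite database")
    parser.add_argument(
        "--db-file",
        default="spack.db",  # assumed default; the PR hard-codes the name in main()
        help="Path of the SQLite database file to create",
    )
    return parser.parse_args()


def main():
    args = parse_args()
    file_name = args.db_file
    # ... rest of main() unchanged ...
```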
@@ -0,0 +1,143 @@
+ import os
Suggested change:
- import os
+ # /// script
+ # dependencies = [
+ #     "dapper-python",
+ # ]
+ # ///
+ import os
Adding inline script metadata that captures the list of dependencies is useful both as documentation and as a way to run the script with uv or pipx without having to manually install dependencies.
(https://peps.python.org/pep-0723 has more info on this inline script metadata and its format)
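For example, with the metadata block in place, the script can be run directly (assuming uv is installed; `spack_db.py` is one of the scripts in this PR):

```
uv run spack_db.py
```

uv reads the dependency block, installs `dapper-python` into a temporary environment, and then runs the script.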
+ def _to_posix(p:str) -> str:
+     return Path(p).as_posix()
Suggested change:
- def _to_posix(p:str) -> str:
-     return Path(p).as_posix()
This function is a duplicate of the one below it.
+ # This -> "compiler-wrapper/compiler-wrapper-1.0-bsavlbvtqsc7yjtvka3ko3aem4wye2u3.spec.manifest.json"
+ # is turned into this -> "compiler-wrapper__compiler-wrapper-1.0-bsavlbvtqsc7yjtvka3ko3aem4wye2u3.spec.manifest.json"
+ package_name = package.replace('/','__')
+
+ # if is_spec is true, meaning the file ends with ".spec.manifest.json",
+ # then the file is not saved, but the response is returned to remove_lines_spec_manifest() for further manipulation
+ # if the file ends with .tar.gz
+ # then the file is saved in BINARY_CACHE_DIR
+ cache_dir = BINARY_CACHE_DIR if not is_spec else None
+
+ # full file path then is:
+ # "cache/spec_manifests/compiler-wrapper/compiler-wrapper-1.0-bsavlbvtqsc7yjtvka3ko3aem4wye2u3"
+ #cache/manifest\\compiler-wrapper-1.0-bsavlbvtqsc7yjtvka3ko3aem4wye2u3.json
In terms of indentation, I'd place the # for all of these comments at the same level of indentation as the lines of code (then just add extra spaces to the right of the #). If you ran a tool like ruff on your code to follow consistent formatting, it would do the correct indentation for you throughout this entire file. Check https://docs.astral.sh/ruff/formatter/ if you want to try it out.
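For reference, formatting a file is a one-line command once ruff is installed (the file name here is the scraper script from this PR):

```
ruff format spack_db.py
```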
+ # removing the placeholder directories in the file path
+ def remove_placeholder_directories(i, name, package):
+     # i is the counter for file enumeration
I think I came across another placeholder directory name used by some Windows packages, which is not being removed by this function. (spack install/morepadding/uztmt5lglkxj3h42tuutgfcd7ypdsmgv/bin/lz4.dll is an example -- not removing the spack install/morepadding/<hash> stuff for a binary package compiled for Windows)
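A minimal sketch of how that case could be handled, assuming the prefix always has the shape shown in the example above (the helper name and the regex are hypothetical, not code from this PR):

```python
import re

# Hypothetical helper: strips the Windows-style placeholder prefix
# "spack install/morepadding/<hash>/" seen in the example above.
# The 32-character lowercase hash is an assumption based on Spack's
# usual hash format.
_WINDOWS_PLACEHOLDER = re.compile(r"^spack install/morepadding/[a-z0-9]{32}/")


def strip_windows_placeholder(path: str) -> str:
    return _WINDOWS_PLACEHOLDER.sub("", path)


# "spack install/morepadding/uztmt5lglkxj3h42tuutgfcd7ypdsmgv/bin/lz4.dll"
# -> "bin/lz4.dll"
```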
+ def main():
+     #file_name = "myMedjson.json"
+     # file_name = "myjson.json"
+     # file_name = 'Med_w_compilerwrapper_packages_at_end.json'
+     file_name = "e2a6969c742c8ee33deba2d210ce2243cd3941c6553a3ffc53780ac6463537a9"
Making your script take the file name as an argument would make it more flexible. Or making it also fetch the latest version of the spack index db file (e.g. that happened to have the name e2a6969c7 when we originally looked at how the spack package repository stores stuff).
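As a rough sketch of the first option, the hard-coded name could become a fallback default (a hypothetical variation, not code from this PR):

```python
import sys


def main():
    # Hypothetical: take the index file name from the first command-line
    # argument, falling back to the previously hard-coded default.
    default_name = "e2a6969c742c8ee33deba2d210ce2243cd3941c6553a3ffc53780ac6463537a9"
    file_name = sys.argv[1] if len(sys.argv) > 1 else default_name
    # ... rest of main() unchanged ...
```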
+ def readmyfile(myfile):
+     try:
+         with open(myfile, 'r') as file:
+             # database is the spack database json, within the spack build cache
+             db = json.load(file)  # 8.6 seconds to read in large json file
+
+             # returns database
+             return db
I'd suggest renaming this function and the `myfile` argument to be more descriptive, something like `read_spack_index_db` and `db_file`.
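A minimal sketch of the rename (the docstring and the except clause are assumptions; the original error handling is not shown in this hunk):

```python
import json


def read_spack_index_db(db_file):
    """Read the Spack index database JSON from the Spack build cache."""
    try:
        with open(db_file, 'r') as file:
            return json.load(file)  # can take several seconds for a large file
    except (OSError, json.JSONDecodeError) as err:
        # Hypothetical error handling; the original except clause is not shown.
        print(f"Failed to read {db_file}: {err}")
        return None
```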
Summary
If merged, this pull request will add the Spack build cache data scraper script, the script that creates the Spack SQLite database, test files for the data scraper script, and a README.md covering the use of both spack_db.py and Create_spack_DB.py.
Proposed changes
Scripts for scraping the Spack build cache and generating the Spack SQLite database.