feat(pm): preload tgz #2549
base: next
Conversation
Summary of Changes

Hello @elrrrrrrr, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a significant performance enhancement to the package manager by implementing a pipeline-driven approach for package installation. It allows for the concurrent fetching of package manifests and the preloading of tarball archives.
Code Review
This pull request introduces a significant performance improvement by implementing a preloading pipeline for package tarballs. The new architecture leverages concurrent manifest resolution and tarball downloading, using a `OnceMap` for deduplication and `libdeflater` for faster gzip decompression. The changes are well-structured, separating pipeline logic into a new module and integrating it effectively with existing install processes. The addition of a benchmarking script is a valuable step for validating these performance gains.
```rust
/// Estimate uncompressed size from gzip footer (last 4 bytes store original size mod 2^32)
fn estimate_uncompressed_size(gzip_data: &[u8]) -> usize {
    if gzip_data.len() < 4 {
        return gzip_data.len() * 10; // fallback estimate
    }
    let last_4 = &gzip_data[gzip_data.len() - 4..];
    let size = u32::from_le_bytes([last_4[0], last_4[1], last_4[2], last_4[3]]) as usize;
    // Sanity check: if size is 0 or too small, use a reasonable estimate
    if !(16..=512 * 1024 * 1024).contains(&size) {
        gzip_data.len() * 10
    } else {
        size
    }
}
```
The `estimate_uncompressed_size` function is duplicated in `crates/pm/src/service/pipeline.rs`. It should be extracted into a shared utility function to avoid code duplication and ensure consistent logic. Additionally, the fallback estimate `gzip_data.len() * 10` could lead to very large memory allocations if the input `gzip_data` is large and the actual uncompressed size is small or zero. Consider adding a maximum cap to this estimate or a more robust fallback strategy to prevent excessive memory usage.
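A capped version of the helper could look like the sketch below (the function name matches the PR, but the 64 MiB cap and the structure of the fallback are illustrative assumptions, not the crate's actual code):

```rust
/// Hypothetical shared helper: same gzip-footer heuristic as the PR,
/// but the fallback is capped so a large, incompressible input cannot
/// drive a multi-hundred-megabyte allocation.
fn estimate_uncompressed_size(gzip_data: &[u8]) -> usize {
    const MAX_FALLBACK: usize = 64 * 1024 * 1024; // assumed 64 MiB cap
    let fallback = gzip_data.len().saturating_mul(10).min(MAX_FALLBACK);
    if gzip_data.len() < 4 {
        return fallback;
    }
    let last_4 = &gzip_data[gzip_data.len() - 4..];
    let size = u32::from_le_bytes([last_4[0], last_4[1], last_4[2], last_4[3]]) as usize;
    if (16..=512 * 1024 * 1024).contains(&size) {
        size
    } else {
        fallback
    }
}

fn main() {
    // A plausible footer (1000 bytes uncompressed) is trusted as before.
    let mut data = vec![0u8; 8];
    data[4..].copy_from_slice(&1000u32.to_le_bytes());
    assert_eq!(estimate_uncompressed_size(&data), 1000);
    // A zeroed footer falls back to the heuristic, now bounded by the cap.
    let big = vec![0u8; 32 * 1024 * 1024];
    assert_eq!(estimate_uncompressed_size(&big), 64 * 1024 * 1024);
}
```

Extracting this into one shared module would also keep the sanity-check range in a single place.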
```toml
# tokio-tar removed - using sync tar with spawn_blocking for libdeflate batch processing
tokio-util = { version = "0.7" }
```
The comment for the `tokio-tar` removal is clear, but the removal of `features = ["io"]` from `tokio-util` is not explicitly mentioned. While it might be implicitly handled by the new approach, it would be good to add a brief note about why these features are no longer needed, or whether they are now covered by other dependencies.
```rust
fn get_download_semaphore() -> &'static Arc<Semaphore> {
    DOWNLOAD_SEMAPHORE.get_or_init(|| {
        let limit = get_manifests_concurrency_limit_sync();
        tracing::debug!("Initializing download semaphore with limit: {}", limit);
        Arc::new(Semaphore::new(limit))
    })
}
```
The `DOWNLOAD_SEMAPHORE` is initialized using `get_manifests_concurrency_limit_sync()`. This is fine, but note that `get_manifests_concurrency_limit_sync()` returns the default value if the config is not yet loaded. Ensure this behavior is intended for the semaphore's initial limit, or consider whether `get_manifests_concurrency_limit().await` is more appropriate here if it is guaranteed to be called after config loading.
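The pinning hazard described here can be shown with a small std-only sketch. A plain `usize` stands in for the `Arc<Semaphore>`, and `limit_sync`/the config plumbing are stand-ins, not the crate's real API:

```rust
use std::sync::OnceLock;

const DEFAULT_LIMIT: usize = 64;

// Stand-in for get_manifests_concurrency_limit_sync(): yields the
// configured value when the config is loaded, the default otherwise.
fn limit_sync(loaded_config: Option<usize>) -> usize {
    loaded_config.unwrap_or(DEFAULT_LIMIT)
}

static DOWNLOAD_LIMIT: OnceLock<usize> = OnceLock::new();

fn get_download_limit(loaded_config: Option<usize>) -> usize {
    // First caller wins: whatever limit is visible at first use is
    // pinned for the lifetime of the process.
    *DOWNLOAD_LIMIT.get_or_init(|| limit_sync(loaded_config))
}

fn main() {
    // If the first call happens before the config loads, the default wins...
    assert_eq!(get_download_limit(None), 64);
    // ...and a later, loaded config value (16 here) can no longer take effect.
    assert_eq!(get_download_limit(Some(16)), 64);
}
```

If the real initializer can run before config load, switching to the async getter (or deferring semaphore creation until after load) would avoid pinning the default.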
crates/pm/src/util/downloader.rs (outdated)
```rust
if let Some(parent) = entry.path.parent() {
    let parent_path = parent.to_path_buf();
    if !created_dirs.contains(&parent_path) {
        crate::fs::create_dir_all(&parent_path).await.ok();
        // ...
    }
}
```
The `crate::fs::create_dir_all(&parent_path).await.ok()` call silently ignores any errors during directory creation. While this might be acceptable if the directory is expected to exist, `create_dir_all` can fail for reasons other than the directory already existing (e.g., permissions or invalid path components). It is safer to handle these errors explicitly, perhaps by logging a warning or returning an error, to prevent later failures during file writing.
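One way to make the error handling explicit is sketched below, using `std::fs` as a stand-in for the async `crate::fs` wrapper (the helper name and `HashSet` cache are illustrative):

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Hypothetical helper: tolerate a concurrent creator racing us, but
// surface real failures (permissions, invalid paths) to the caller
// instead of swallowing them with `.ok()`.
fn ensure_parent_dir(path: &Path, created_dirs: &mut HashSet<PathBuf>) -> io::Result<()> {
    if let Some(parent) = path.parent() {
        let parent_path = parent.to_path_buf();
        if !created_dirs.contains(&parent_path) {
            match fs::create_dir_all(&parent_path) {
                Ok(()) => {}
                // Another task created it first: fine.
                Err(e) if e.kind() == io::ErrorKind::AlreadyExists => {}
                // Anything else would make the upcoming write fail anyway.
                Err(e) => return Err(e),
            }
            created_dirs.insert(parent_path);
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let mut created = HashSet::new();
    let file = std::env::temp_dir().join("pm_preload_demo").join("pkg").join("index.js");
    ensure_parent_dir(&file, &mut created)?;
    assert!(file.parent().unwrap().is_dir());
    Ok(())
}
```

Propagating the error here gives a much clearer diagnostic than the later write failure would.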
```diff
 /// Default concurrency limit for manifest fetching
-pub const DEFAULT_CONCURRENCY: usize = 20;
+pub const DEFAULT_CONCURRENCY: usize = 64;
```
The `DEFAULT_CONCURRENCY` for manifest fetching has been increased from 20 to 64. This is a significant change: while it can improve performance, it also increases resource consumption (network requests, CPU for parsing manifests). It would be beneficial to add a comment explaining the rationale behind this specific value, perhaps referencing benchmarks or expected system capabilities.
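As a sketch, the requested rationale comment might read like this; the justification text is hypothetical wording, not a measured result from this PR:

```rust
/// Default concurrency limit for manifest fetching.
///
/// (Hypothetical rationale, illustrating the kind of comment asked for:)
/// manifest requests are small and latency-bound, so a higher in-flight
/// count mostly overlaps network round-trips rather than adding CPU
/// load; 64 stays well under typical OS file-descriptor limits and
/// registry rate limits.
pub const DEFAULT_CONCURRENCY: usize = 64;

fn main() {
    assert_eq!(DEFAULT_CONCURRENCY, 64);
}
```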
```shell
    esac

    local start=$(date +%s.%N)
    eval "$cmd" >/dev/null 2>&1 || true
```
The `eval "$cmd" >/dev/null 2>&1 || true` command suppresses all output and errors from the package manager commands and forces the script to continue even if a command fails. While this keeps the benchmark output clean, it can make debugging failures very difficult. Consider logging errors to a file or conditionally showing output for failed commands to aid troubleshooting.
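A minimal version of that change could look like the sketch below, assuming `$cmd` holds the benchmarked command as in the script (the demo command and variable names are illustrative):

```shell
#!/bin/sh
# Keep stdout quiet, but capture stderr to a log file and report
# failures instead of swallowing them with `|| true`.
cmd="ls /nonexistent-path-for-demo"   # stand-in for a real pm install command
log_file="$(mktemp)"
start=$(date +%s.%N)
if ! eval "$cmd" >/dev/null 2>"$log_file"; then
    echo "warn: benchmark command failed: $cmd (stderr in $log_file)" >&2
fi
end=$(date +%s.%N)
```

The timing measurement is unchanged; only failures now leave a trace for later inspection.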