
Conversation

@elrrrrrrr
Contributor

Summary

Test Plan

@gemini-code-assist
Contributor

Summary of Changes

Hello @elrrrrrrr, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant performance enhancement to the package manager by implementing a pipeline-driven approach for package installation. It allows for the concurrent fetching of package manifests and the preloading of tarball archives, leveraging a new OnceMap utility to efficiently manage and deduplicate these operations. The change also upgrades the gzip decompression mechanism to use the highly optimized libdeflater library, further speeding up the extraction phase. These architectural improvements aim to reduce overall installation times and provide a more responsive user experience, as validated by the new benchmarking script.

Highlights

  • Pipeline Architecture for Tarball Preloading: Implemented a new pipeline system where tarball downloads begin immediately upon package resolution, running concurrently with further manifest fetching. This significantly reduces overall installation time by overlapping I/O operations.
  • Optimized Gzip Decompression: Switched from async-compression and tokio-tar to libdeflater, a highly optimized library, for faster extraction of tarball archives. Decompression now runs on a blocking task executor so it does not stall the async runtime (a minimal sketch of this pattern follows this list).
  • Concurrent Download and Extraction Management: Introduced a global OnceMap to deduplicate download requests and allow multiple parts of the system to wait for and reuse a single ongoing download. Concurrency limits are now more granular, separating download concurrency from file cloning concurrency.
  • Enhanced Progress Reporting: The progress bar now displays real-time statistics for packages currently downloading and extracting, providing better visibility into the installation process.
  • New Benchmarking Script: Added an end-to-end benchmarking script (e2e/pm-bench.sh) to compare utoo's performance against other package managers like pnpm and bun under various caching scenarios (cold, warm, hot installs).
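
The blocking-decompression pattern from the second highlight can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: decompress_tarball is a hypothetical name, anyhow is assumed for errors, and estimate_uncompressed_size is the footer-based helper quoted later in this review.

use libdeflater::{DecompressionError, Decompressor};

/// Hypothetical helper: decompress a gzip tarball off the async runtime.
async fn decompress_tarball(gz_data: Vec<u8>) -> anyhow::Result<Vec<u8>> {
    tokio::task::spawn_blocking(move || {
        // Size the output buffer from the gzip footer estimate.
        let mut out = vec![0u8; estimate_uncompressed_size(&gz_data)];
        let mut d = Decompressor::new();
        loop {
            match d.gzip_decompress(&gz_data, &mut out) {
                Ok(n) => {
                    out.truncate(n); // trim to the real decompressed length
                    return Ok(out);
                }
                // The footer stores size mod 2^32, so the estimate can be
                // short for large inputs; grow the buffer and retry.
                Err(DecompressionError::InsufficientSpace) => {
                    let len = out.len().saturating_mul(2).max(64 * 1024);
                    out.resize(len, 0);
                }
                Err(e) => return Err(anyhow::anyhow!("gzip decompress failed: {e}")),
            }
        }
    })
    .await?
}
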
Ignored Files
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/pm-bench.yml


@gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant performance improvement by implementing a preloading pipeline for package tarballs. The new architecture leverages concurrent manifest resolution and tarball downloading, using a OnceMap for deduplication and libdeflater for faster gzip decompression. The changes are well-structured, separating pipeline logic into a new module and integrating it effectively with existing install processes. The addition of a benchmarking script is a valuable step for validating these performance gains.
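
The OnceMap-based deduplication mentioned above can be approximated with the sketch below. This is a minimal illustration of the idea, not the actual OnceMap API; DedupMap and get_or_download are hypothetical names. The first caller for a key runs the download, and concurrent callers for the same key await and reuse the same result.

use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::{Mutex, OnceCell};

struct DedupMap<V> {
    cells: Mutex<HashMap<String, Arc<OnceCell<V>>>>,
}

impl<V: Clone> DedupMap<V> {
    fn new() -> Self {
        Self { cells: Mutex::new(HashMap::new()) }
    }

    async fn get_or_download<F, Fut>(&self, key: &str, download: F) -> V
    where
        F: FnOnce() -> Fut,
        Fut: std::future::Future<Output = V>,
    {
        // Grab (or create) the per-key cell, then release the map lock
        // before awaiting so other keys are not blocked.
        let cell = {
            let mut map = self.cells.lock().await;
            map.entry(key.to_string()).or_default().clone()
        };
        cell.get_or_init(download).await.clone()
    }
}
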

Comment on lines +74 to +87
/// Estimate uncompressed size from gzip footer (last 4 bytes store original size mod 2^32)
fn estimate_uncompressed_size(gzip_data: &[u8]) -> usize {
    if gzip_data.len() < 4 {
        return gzip_data.len() * 10; // fallback estimate
    }
    let last_4 = &gzip_data[gzip_data.len() - 4..];
    let size = u32::from_le_bytes([last_4[0], last_4[1], last_4[2], last_4[3]]) as usize;
    // Sanity check: if size is 0 or too small, use a reasonable estimate
    if !(16..=512 * 1024 * 1024).contains(&size) {
        gzip_data.len() * 10
    } else {
        size
    }
}
Severity: high

The estimate_uncompressed_size function is duplicated in crates/pm/src/service/pipeline.rs. It should be extracted into a shared utility function to avoid code duplication and ensure consistent logic. Additionally, the fallback estimate gzip_data.len() * 10 could lead to very large memory allocations if the input gzip_data is large and the actual uncompressed size is small or zero. Consider adding a maximum cap to this estimate or a more robust fallback strategy to prevent excessive memory usage.
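
A minimal sketch of the suggested cap; the 64 MiB ceiling is an illustrative placeholder, not a benchmarked value:

const MAX_FALLBACK_ESTIMATE: usize = 64 * 1024 * 1024; // illustrative cap

fn capped_fallback_estimate(gzip_len: usize) -> usize {
    // Keep the x10 heuristic, but never allocate more than the cap.
    gzip_len.saturating_mul(10).min(MAX_FALLBACK_ESTIMATE)
}
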

Comment on lines +48 to +49
# tokio-tar removed - using sync tar with spawn_blocking for libdeflate batch processing
tokio-util = { version = "0.7" }

Severity: medium

The comment for tokio-tar removal is clear, but the removal of features = ["io"] from tokio-util is not explicitly mentioned. While it might be implicitly handled by the new approach, it would be good to add a brief note about why these specific features are no longer needed or if they are now covered by other dependencies.

Comment on lines +75 to +81

fn get_download_semaphore() -> &'static Arc<Semaphore> {
    DOWNLOAD_SEMAPHORE.get_or_init(|| {
        let limit = get_manifests_concurrency_limit_sync();
        tracing::debug!("Initializing download semaphore with limit: {}", limit);
        Arc::new(Semaphore::new(limit))
    })
}

Severity: medium

The DOWNLOAD_SEMAPHORE is initialized using get_manifests_concurrency_limit_sync(). This is fine, but note that get_manifests_concurrency_limit_sync() returns the default value if the config is not yet loaded. Ensure this behavior is intended for the semaphore's initial limit, or consider whether get_manifests_concurrency_limit().await is more appropriate if initialization is guaranteed to happen after config loading.
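
One way to follow this suggestion, sketched with tokio's async OnceCell. download_semaphore is a hypothetical accessor, and it assumes the async get_manifests_concurrency_limit mentioned above:

use std::sync::Arc;
use tokio::sync::{OnceCell, Semaphore};

static DOWNLOAD_SEMAPHORE: OnceCell<Arc<Semaphore>> = OnceCell::const_new();

async fn download_semaphore() -> &'static Arc<Semaphore> {
    // The first caller loads the config-backed limit; later callers reuse it.
    DOWNLOAD_SEMAPHORE
        .get_or_init(|| async {
            let limit = get_manifests_concurrency_limit().await;
            Arc::new(Semaphore::new(limit))
        })
        .await
}
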

if let Some(parent) = entry.path.parent() {
    let parent_path = parent.to_path_buf();
    if !created_dirs.contains(&parent_path) {
        crate::fs::create_dir_all(&parent_path).await.ok();

Severity: medium

The crate::fs::create_dir_all(&parent_path).await.ok() call silently ignores any errors during directory creation. While this might be acceptable if the directory is expected to exist, create_dir_all can fail for reasons other than the directory already existing (e.g., permissions, invalid path components). It's safer to handle these errors explicitly, perhaps by logging a warning or returning an error, to prevent potential issues during file writing.
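
A sketch of the explicit handling suggested here, reusing the snippet's own identifiers:

if let Err(e) = crate::fs::create_dir_all(&parent_path).await {
    // Surface the failure instead of silently dropping it; the file
    // writes below would otherwise fail with a less informative error.
    tracing::warn!("failed to create {}: {}", parent_path.display(), e);
}
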


/// Default concurrency limit for manifest fetching
-pub const DEFAULT_CONCURRENCY: usize = 20;
+pub const DEFAULT_CONCURRENCY: usize = 64;

Severity: medium

The DEFAULT_CONCURRENCY for manifest fetching has been increased from 20 to 64. This is a significant change. While it can improve performance, it also increases resource consumption (network requests, CPU for parsing manifests). It would be beneficial to add a comment explaining the rationale behind this specific value, perhaps referencing benchmarks or expected system capabilities.
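
One possible shape for the requested comment; the angle-bracketed text is a placeholder for real benchmark data, not a claim about what was measured:

/// Default concurrency limit for manifest fetching.
/// Raised from 20 to 64: <cite the benchmark run and hardware profile
/// that justified this value, e.g. e2e/pm-bench.sh results>.
pub const DEFAULT_CONCURRENCY: usize = 64;
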

esac

local start=$(date +%s.%N)
eval "$cmd" >/dev/null 2>&1 || true

Severity: medium

The eval "$cmd" >/dev/null 2>&1 || true command suppresses all output and errors from the package manager commands and forces the script to continue even if the command fails. While this keeps the benchmark output clean, it can make debugging failures very difficult. Consider logging errors to a file or conditionally showing output for failed commands to aid in troubleshooting.
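
A sketch of per-run logging in the script's own style; LOG_DIR and name are illustrative variables, not ones the script necessarily defines:

local log="${LOG_DIR:-/tmp}/pm-bench-${name:-run}.log"
if ! eval "$cmd" >"$log" 2>&1; then
    echo "warn: command failed (see $log): $cmd" >&2
fi
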
