inspect_packages pipeline takes a long time to run #1398

Open
JonoYang opened this issue Oct 1, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@JonoYang
Member

JonoYang commented Oct 1, 2024

I am running the inspect_packages pipeline on a very large codebase that is a single npm package. The pipeline spends a very long time in the scan_for_application_packages step, specifically in the package assembly portion of package scanning (https://github.com/aboutcode-org/scancode.io/blob/main/scanpipe/pipes/scancode.py#L459), where we run this code (https://github.com/aboutcode-org/scancode-toolkit/blob/develop/src/packagedcode/npm.py#L82).

Since this codebase is an npm package, we consider pretty much every file in it to be part of that package. The assembly code we run was originally written for a scancode-toolkit (sctk) codebase. Walking a scancode.io (scio) codebase with sctk code is not performant, because each call to an sctk codebase traversal method becomes an individual database query. These methods are called multiple times, and .save() is called on each Resource during package assembly.
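To illustrate the cost described above, here is a minimal sketch that uses an in-memory SQLite table as a stand-in for the Resource table. The table layout, the `run` helper, and both `assign_*` functions are hypothetical and are not scancode.io code; the point is only to compare how many statements a per-resource walk issues against a single bulk update.

```python
import sqlite3

# Stand-in for the project's Resource table (hypothetical schema).
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE resource (path TEXT PRIMARY KEY, parent TEXT, for_package TEXT)"
)
paths = ["pkg", "pkg/package.json", "pkg/lib", "pkg/lib/index.js", "pkg/lib/util.js"]
conn.executemany(
    "INSERT INTO resource VALUES (?, ?, NULL)",
    [(p, p.rpartition("/")[0] or None) for p in paths],
)

stats = {"queries": 0}


def run(sql, params=()):
    """Execute one statement and count it."""
    stats["queries"] += 1
    return conn.execute(sql, params).fetchall()


def assign_per_resource(root, package_uid):
    """Walk the tree: one UPDATE plus one SELECT of children per resource."""
    stack = [root]
    while stack:
        path = stack.pop()
        run("UPDATE resource SET for_package = ? WHERE path = ?", (package_uid, path))
        rows = run("SELECT path FROM resource WHERE parent = ?", (path,))
        stack.extend(row[0] for row in rows)


def assign_in_bulk(root, package_uid):
    """Tag the root and everything below it with a single UPDATE.

    Note: LIKE treats % and _ in the root path as wildcards; fine for a sketch.
    """
    run(
        "UPDATE resource SET for_package = ? WHERE path = ? OR path LIKE ?",
        (package_uid, root, root + "/%"),
    )
```

With the per-resource walk, the statement count grows linearly with the number of resources, while the bulk variant stays at one statement regardless of codebase size.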

@JonoYang JonoYang added the bug Something isn't working label Oct 1, 2024
@JonoYang
Member Author

JonoYang commented Oct 1, 2024

One idea to speed things up would be to perform the package assembly step in memory: create a commoncode.resource.Codebase object and run assembly against it instead of the Project.
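A rough sketch of the in-memory idea, under stated assumptions: the `Node` class below is a plain-Python stand-in for an in-memory codebase, not the actual commoncode.resource.Codebase API, and `build_tree` and `assign_package` are hypothetical helpers. The rows would be loaded from the database with one query, assembly would run on the tree without touching the database, and only the changed nodes would be written back in one batch (for example with Django's `bulk_update`).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """In-memory stand-in for a codebase resource (hypothetical)."""
    path: str
    is_file: bool
    for_packages: list = field(default_factory=list)
    children: list = field(default_factory=list)


def build_tree(rows):
    """Build a tree from (path, is_file) rows fetched in a single query."""
    nodes = {path: Node(path, is_file) for path, is_file in rows}
    roots = []
    for path, node in nodes.items():
        parent = nodes.get(path.rpartition("/")[0])
        if parent is None:
            roots.append(node)
        else:
            parent.children.append(node)
    return roots, nodes


def walk(node):
    """Yield a node and all of its descendants, with no database access."""
    yield node
    for child in node.children:
        yield from walk(child)


def assign_package(root, package_uid):
    """Tag every resource under root; return only the nodes that changed."""
    changed = []
    for node in walk(root):
        if package_uid not in node.for_packages:
            node.for_packages.append(package_uid)
            changed.append(node)
    return changed
```

The list returned by `assign_package` is what would be persisted in one batch, so the number of database round trips no longer depends on how many times the tree is traversed.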

@pombredanne
Member

@JonoYang Another idea would be to have an option to skip package assembly entirely when it is not needed, say when the goal is to populate the PurlDB afterwards.

JonoYang added a commit that referenced this issue Oct 1, 2024
@JonoYang
Member Author

JonoYang commented Oct 1, 2024

The problem with not running the package assembly step is that DiscoveredPackages are not created (https://github.com/aboutcode-org/scancode.io/blob/main/scanpipe/pipes/scancode.py#L470). Some tests fail because the top-level package is not created (https://github.com/aboutcode-org/scancode.io/blob/main/scanpipe/tests/test_pipelines.py#L798).
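As a sketch of one possible middle ground, assuming a simplified data shape: the hypothetical `collect_packages` function below always creates a package record from the manifest's package data, and the `assemble` flag only controls the expensive part, which is assigning files to the package. The dict layout and all names are illustrative and do not reflect the scancode.io models.

```python
def collect_packages(resources, assemble=True):
    """Create one package record per detected manifest.

    resources: list of dicts with a 'path' key and an optional
    'package_data' list (hypothetical shape).
    When assemble is False, packages are still created, but no files
    are assigned to them.
    """
    packages = []
    for resource in resources:
        for data in resource.get("package_data", []):
            package = {
                "purl": data["purl"],
                "datafile_path": resource["path"],
                "files": [],
            }
            if assemble:
                # Treat everything under the manifest's directory as package files.
                root = resource["path"].rpartition("/")[0]
                prefix = root + "/" if root else ""
                package["files"] = [
                    r["path"] for r in resources if r["path"].startswith(prefix)
                ]
            packages.append(package)
    return packages
```

With `assemble=False`, the top-level package would still exist for downstream steps and tests, while the per-file assignment is skipped.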

2 participants