Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Track repository_url #495

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

prabhu
Copy link

@prabhu prabhu commented May 2, 2024

Supersedes #494 by also setting component evidence for 1.5 spec.

With this PR, purl for the components would include repository_url qualifier for repositories that is not https://repo.maven.apache.org/maven2 as per purl spec.

Example BOM generated for the repo https://github.com/eclipse-jkube/jkube is attached.

bom.xml.txt
bom.json.txt

Things to clarify

package-url/purl-spec#303

prabhu added 4 commits May 2, 2024 00:09
Signed-off-by: Prabhu Subramanian <[email protected]>
Signed-off-by: Prabhu Subramanian <[email protected]>
Signed-off-by: Prabhu Subramanian <[email protected]>
Signed-off-by: Prabhu Subramanian <[email protected]>
@prabhu prabhu marked this pull request as draft May 3, 2024 08:05
@prabhu
Copy link
Author

prabhu commented May 3, 2024

Making this a draft to try another approach. Please feel free to use the PR for testing purposes.

Update: Ready for review.

@prabhu prabhu marked this pull request as ready for review May 3, 2024 10:34
@prabhu
Copy link
Author

prabhu commented May 11, 2024

@hboutemy @stevespringett Would appreciate a review.

@hboutemy
Copy link
Contributor

IIUC, repository_url is an origin repository url for a dependency?

from Maven experience on such feature for dependencies reports, if a repository manager is used, it's not possible to identify the origin repository of a dependency: that's why we removed the feature a few years ago

I don't think this is feasible in a consistent way, that's a general Maven issue, shared by every plugin that tried to do that

@prabhu
Copy link
Author

prabhu commented May 15, 2024

@hboutemy Thank you for the comments. If a repository manager is used and if we get the url to the manager instance that is still a useful information for enterprise customers. Currently, there is no visibility about either the repository manager or the origin.

@hboutemy
Copy link
Contributor

I just looked more at the example SBOM given, here is an example component with the new data that I see

    {
      "publisher" : "Red Hat",
      "group" : "io.fabric8",
      "name" : "kubernetes-model-common",
      "version" : "6.12.1",
      "description" : "Java client for Kubernetes and OpenShift",
      "scope" : "required",
      "hashes" : [
... usual hashes ...
      ],
      "licenses" : [
... usual licenses ...
      ],
      "purl" : "pkg:maven/io.fabric8/[email protected]?type=jar&repository_url=https%3A%2F%2Frepo1.maven.org%2Fmaven2%2F",
      "externalReferences" : [
... usual external references ...
      ],
      "evidence" : {
        "identity" : {
          "field" : "purl",
          "confidence" : 1.0,
          "methods" : [
            {
              "technique" : "hash-comparison",
              "confidence" : 0.9,
              "value" : "7b3cb18e7a6d5c53be9f9e5ab1128409793e06fc"
            },
            {
              "technique" : "filename",
              "confidence" : 0.1,
              "value" : ".m2/repository/io/fabric8/kubernetes-model-common/6.12.1/kubernetes-model-common-6.12.1.jar"
            }
          ]
        }
      },
      "type" : "library",
      "bom-ref" : "pkg:maven/io.fabric8/[email protected]?type=jar"
    },

I see 2 parts:

  1. the new purl repository_url parameter addition &repository_url=https%3A%2F%2Frepo1.maven.org%2Fmaven2%2F for some dependencies but not all: I don't get why some dependencies have the field, and some don't have it?
  2. the evidence addition: IIUC, it is completely independent from this repository_url aspect: IMHO, every piece of data for this one requires to be discussed separately (I won't start here, as it would make the discussion too complex)

let's discuss the repository_url aspect: when is it added or not? do we have examples of cases where the value is more useful than the value of Maven Central url?

@prabhu
Copy link
Author

prabhu commented May 16, 2024

Thanks @hboutemy.

The repository_url is added when maven resolves and pulls the jar from a remote repository and not from a local cache. There is an if condition to ensure this is the case.

if (repository instanceof RemoteRepository) {

When executing this PR branch with a custom repository proxy and maven repo directory using MAVEN_OPTS, nearly all the components receive the repository_url. I say nearly, because while building the plugin some dependencies are downloaded, which are getting resolved from the local cache when testing with jkube (many gradle libraries). The new bom files are attached.

export MAVEN_OPTS="-Dmaven.repo.local=/tmp/m2"

bom.json.txt
bom.xml.txt

@prabhu
Copy link
Author

prabhu commented May 30, 2024

@hboutemy any feedback based on my last comment?

@hboutemy
Copy link
Contributor

hboutemy commented Jun 3, 2024

as you can see from your second example generated SBOM, the resulting repository_url is the url of a proxy = something unusable (I'm even surprised by the IP in http://0.0.0.0:8080/releases value)

I understand the dream of this PR, but this is a dream: Maven cannot do that reliably
I don't want to add that code that will generate unreliable noisy data, particularly in the highly visible purl field

Thinking about it, if you want to work on this aspect, see #245, I'd propose to focus on the distribution-intake external reference: from intake of a dependency, defining the distribution external reference of that dependency can make sense

and please, when sharing example, please share first a simple example before sharing complex ones: simple is to define precisely the feature, while complex is useful to see at scale

@prabhu
Copy link
Author

prabhu commented Jun 3, 2024

@hboutemy I disagree with the word unusable. It is showing all the packages that got downloaded from a private internal repository (happens to run on localhost). It gives the confidence that there was no local cache (could be malicious) that was used. From the administrative settings of the internal registry, we can find the upstream public repository that was used for caching and it is absolutely fine if the SBOM tool doesn't have this information.

The second part of the PR sets identity evidence, which IMHO, is quite important for any SBOM tool.

@VinodAnandan
Copy link

VinodAnandan commented Nov 10, 2024

Apologies for the delayed message. I was hoping the issue might resolve itself, but it seems the PR has stalled. I believe it’s important to move this forward to provide improved transparency for users.

I am not an expert in PURL, so I would greatly appreciate any feedback from others on this matter.

Based on my understanding, the ideal PURL structure should look something like this ?:

  1. Default(https://repo.maven.apache.org/maven2) download directly from Maven Central - pkg:maven/org.apache.xmlgraphics/[email protected]
  2. Local cache PURL - pkg:maven/org.apache.xmlgraphics/[email protected]?repository_url=localhost
  3. Other repository PURL - pkg:maven/org.apache.xmlgraphics/[email protected]?repository_url=internal.artifactory.com

Also, I think including evidence is particularly valuable when dependencies are sourced from local caches or other repositories, as these sources can be susceptible to manipulation by attackers. However, it would be beneficial to include evidence for all components including from maven central, providing more detailed information on how each component's identity was concluded. By incorporating evidence, we offer transparency regarding the techniques and data sources used to establish component identity, thereby enhancing the reliability and trustworthiness of the SBOM.

@stevespringett @jkowalleck @pombredanne @skhokhlov @aloubyansky I would be grateful for your input and thank you in advance for your guidance.

@jkowalleck jkowalleck requested a review from hboutemy November 11, 2024 07:22
@aloubyansky
Copy link

A little disclaimer: I'm just a community member, and don't consider myself an expert. But given you asked :)

The PURL spec says:

repository_url is an extra URL for an alternative, non-default package repository or registry. When a package does not come from the default public package repository for its type a purl may be qualified with this extra URL.

I think it'd be great to clarify it further when it comes to SBOMs for Maven dependencies. There could be a couple of perspectives.

An artifact producer deploys artifacts to specific Maven repositories, which, I guess, should be considered the official distributions of those artifacts.
Note, that the distribution repository info included in the POMs will often not match the target distribution repository, since often POMs include some staging repository as the target, from which artifacts are released to the public/user repositories. So that's another little complication when it comes to generating SBOMs with proper repository_url for artifacts that are yet to be released publicly.
Let's imagine though that a producer managed to generate a proper SBOM with correct repository_url for the deployed artifacts.

Now there is a consumer perspective. As @hboutemy mentioned, it's not always possible to track the original repository an artifact was downloaded from in case mirrors and/or proxies are involved. Sometimes it will be possible, sometimes it won't be. In many cases it won't be.
Typically, unless the artifact was built and installed locally, when an artifact is resolved from a local repository cache, Maven would expose the remote repository from which it was downloaded through API.
The issue is that, the remote repository often might not be the one the producer of that artifact chose to distribute the artifact from.

In this case, from the supply chain perspective, it could be useful to record from where a given artifact was downloaded. It seems like it would fit well the evidence metadata. The question is though whether that repository should be a part of the PURL if it's different from the official distribution repository, which might be problematic to track back to, unless the original SBOM for that artifact is pulled along that artifact all the time.

@skhokhlov
Copy link
Member

This is potentially very nice feature. However, I have some considerations about data consistency and reproducibility.

  1. What happens when remote artifact already downloaded to local maven repository? Will remote repo be resolved in this scenario?
  2. What happens when artifact is available in multiple repositories. For example if both maven central and custom repository enabled and both contain the component. How this will be resolved?

@aloubyansky
Copy link

What happens when remote artifact already downloaded to local maven repository? Will remote repo be resolved in this scenario?

Maven resolver exposes this info, so I'd expect the remote repo to be properly recorded. I haven't tested the changes in this PR though.

What happens when artifact is available in multiple repositories. For example if both maven central and custom repository enabled and both contain the component. How this will be resolved?

The resolver reports the repository that was used to resolve the artifact, i.e. the first one that successfully resolved it.

So this would basically be manifesting supply chain from the build tool perspective with possibly lacking information about the original distribution repositories of dependencies.

Copy link
Contributor

@hboutemy hboutemy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maven resolver exposes this info, so I'd expect the remote repo to be properly recorded. I haven't tested the changes in this PR though.

this lacks a unit test showing expected result: please add

this will also will permit further work to clarify edge cases: yes, the effective remote repository is properly recorded
effective means "downloaded through your Nexus proxy" when you use a mrm = something that everybody should do to lower pressure on Maven Central, and that any company also should do: I can expect lack of mrm only from personal developers working at home

and also please remove unrelated changes on DefaultModelConverter, which require a separate dioscussion

* @param component Component for which the evidence needs to be attached
* @param artifact Maven artifact
*/
private static void addComponentEvidence(final Component component, final Artifact artifact) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all that code is unrelated to repository_url tracking: please remove, unrelated

@VinodAnandan
Copy link

@hboutemy What are your thoughts on modifying the PR title to include relevant evidence as well?

@prabhu could you please help us with the unit test? It would be much appreciated.

@hboutemy
Copy link
Contributor

What are your thoughts on modifying the PR title to include relevant evidence as well?

changing title is easy, but if we do that, we start the discussion here instead of doing it on #494 or a new issue based on it: there are so many questions on all the evidence model, how it maps to build tools, how it is done in other ecosystems

are you sure you want to start here? and mix?
this is a simple separate file, it's really easy to split back

@jkowalleck
Copy link
Member

jkowalleck commented Nov 21, 2024

Just a remark from the side:

How to make reviewer's/maintainer's life easy:

  1. create a ticket discussing issue/problem and expected outcome -- define a (non-)scope!
    Result: no discussions in the pullrequest, everything is clear before the ticket is "ready for development".
  2. pull-request the changes for one(1) issue at once -- keep the scope!
    Result: less frustration, less discussion, more acceptance, more fun.

These simple rules make reviews easy. But more important: they make change-management easy.
Each pullrequest is a change, it needs to be revised, considered, merged, communicated, and published according to the package-lifecycle models.
Mixing topics makes it all time-consuming and hard.

My 2 cents from one maintainer to another maintainer:
I support @hboutemy.
I would close this (outdated) PR as "not accepted", and ask the author to open multiple individual clearly-scoped pull-requests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants