Skip to content

[rapid7_insightvm] expand documents to map each vulnerability per asset #13878

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

brijesh-elastic
Copy link
Contributor

@brijesh-elastic brijesh-elastic commented May 12, 2025

Proposed commit message

rapid7_insightvm: expand documents to map each vulnerability per asset.

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • I have verified that Kibana version constraints are current according to guidelines.
  • I have verified that any added dashboard complies with Kibana's Dashboard good practices

How to test this PR locally

  • Clone integrations repo.
  • Install elastic package locally.
  • Start elastic stack using elastic-package.
  • Move to integrations/packages/rapid7_insightvm directory.
  • Run the following command to run tests.

elastic-package test

Related issues

@brijesh-elastic brijesh-elastic self-assigned this May 12, 2025
@brijesh-elastic brijesh-elastic requested a review from a team as a code owner May 12, 2025 03:47
@brijesh-elastic brijesh-elastic added enhancement New feature or request Integration:rapid7_insightvm Rapid7 InsightVM Team:Security-Service Integrations Security Service Integrations team [elastic/security-service-integrations] Team:Sit-Crest Crest developers on the Security Integrations team [elastic/sit-crest-contractors] labels May 12, 2025
@elasticmachine
Copy link

Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)

@elastic-vault-github-plugin-prod
Copy link

elastic-vault-github-plugin-prod bot commented May 12, 2025

🚀 Benchmarks report

Package rapid7_insightvm 👍(0) 💚(0) 💔(2)

Expand to view
Data stream Previous EPS New EPS Diff (%) Result
asset 3333.33 2551.02 -782.31 (-23.47%) 💔
vulnerability 4201.68 3021.15 -1180.53 (-28.1%) 💔

To see the full report comment with /test benchmark fullreport

@@ -4,7 +4,7 @@ rules:
responses:
- status_code: 200
body: |
{"data":[{"assessed_for_policies":false,"assessed_for_vulnerabilities":true,"critical_vulnerabilities":0,"exploits":0,"id":"452534235-25a7-40a3-9321-28ce0b5cc90e-default-asset-199","ip":"10.1.0.128","last_assessed_for_vulnerabilities":"2020-03-20T19:19:42.611Z","last_scan_end":"2020-03-20T19:19:42.611Z","last_scan_start":"2020-03-20T19:18:13.611Z","malware_kits":0,"moderate_vulnerabilities":2,"os_architecture":"x86_64","os_description":"CentOS Linux 2.6.18","os_family":"Linux","os_name":"Linux","os_system_name":"CentOS Linux","os_type":"General","os_vendor":"CentOS","os_version":"2.6.18","risk_score":0,"severe_vulnerabilities":0,"tags":[{"name":"lab","type":"SITE"}],"total_vulnerabilities":2,"new":[],"remediated":[]},{"assessed_for_policies":false,"assessed_for_vulnerabilities":true,"critical_vulnerabilities":1,"exploits":9,"host_name":"host.domain.com","id":"452534235-25a7-40a3-9321-28ce0b5cc90e-default-asset-198","ip":"10.4.24.164","last_scan_end":"2020-03-20T19:12:39.766Z","last_scan_start":"2020-03-20T19:05:06.766Z","malware_kits":0,"moderate_vulnerabilities":11,"os_architecture":"","os_description":"Ubuntu Linux 12.04","os_family":"Linux","os_name":"Linux","os_system_name":"Ubuntu Linux","os_type":"","os_vendor":"Ubuntu","os_version":"12.04","risk_score":12251.76171875,"severe_vulnerabilities":16,"tags":[{"name":"all_assets2","type":"CUSTOM"},{"name":"all_assets","type":"CUSTOM"},{"name":"Linux","type":"CUSTOM"},{"name":"docker hosts","type":"SITE"},{"name":"lab","type":"SITE"}],"total_vulnerabilities":28,"new":[],"remediated":[],"unique_identifiers":{"id":"4421d73dfe04f594df731e6bcd8156a","source":"R7 Agent"}}],"metadata":{"number":0,"size":2,"totalResources":2195,"totalPages":1098,"cursor":null},"links":[{"href":"https://us.api.insight.rapid7.com:443/vm/v4/integration/assets?page=0&size=2","rel":"first"},{"href":"https://us.api.insight.rapid7.com:443/vm/v4/integration/assets?page=0&size=2","rel":"self"},{"href":"https://us.api.insight.rapid7.com:443/vm/v4/integration/assets?page=1&size=2&cursor=1542252837:::_S:::12474375-34a7-40a3-9821-28db0b5cc90e-default-asset-10","rel":"next"},{"href":"https://us.api.insight.rapid7.com:443/vm/v4/integration/assets?page=1097&size=2","rel":"last"}]}
{"data":[{"assessed_for_policies":false,"assessed_for_vulnerabilities":true,"critical_vulnerabilities":0,"exploits":0,"id":"452534235-25a7-40a3-9321-28ce0b5cc90e-default-asset-199","ip":"10.1.0.128","last_assessed_for_vulnerabilities":"2020-03-20T19:19:42.611Z","last_scan_end":"2020-03-20T19:19:42.611Z","last_scan_start":"2020-03-20T19:18:13.611Z","malware_kits":0,"moderate_vulnerabilities":2,"os_architecture":"x86_64","os_description":"CentOS Linux 2.6.18","os_family":"Linux","os_name":"Linux","os_system_name":"CentOS Linux","os_type":"General","os_vendor":"CentOS","os_version":"2.6.18","risk_score":0,"severe_vulnerabilities":0,"tags":[{"name":"lab","type":"SITE"}],"total_vulnerabilities":2,"new":[],"remediated":[]},{"assessed_for_policies":false,"assessed_for_vulnerabilities":true,"credential_assessments":[{"port":22,"protocol":"TCP","status":"NO_CREDS_SUPPLIED"}],"critical_vulnerabilities":1,"exploits":1,"id":"8bcfe121-1234-5678-9012-c4a6abcdabcde-default-asset-4","ip":"175.16.199.1","last_assessed_for_vulnerabilities":"2025-05-08T06:51:31.736Z","last_scan_end":"2025-05-08T06:51:31.736Z","last_scan_start":"2025-05-08T06:51:19.193Z","mac":"00:00:5E:00:53:00","malware_kits":0,"moderate_vulnerabilities":0,"os_architecture":"","os_description":"Ubuntu Linux","os_family":"Linux","os_name":"Linux","os_system_name":"Ubuntu Linux","os_type":"","os_vendor":"Ubuntu","risk_score":1268,"severe_vulnerabilities":1,"tags":[{"name":"test","type":"SITE"}],"total_vulnerabilities":2,"unique_identifiers":[],"new":[],"remediated":[],"same":[{"check_id":null,"first_found":"2025-05-08T06:51:31Z","key":"","last_found":"2025-05-08T06:51:31.736Z","nic":null,"port":22,"proof":"<p><ul><li>Running SSH service</li><li>Product OpenSSH exists -- OpenBSD OpenSSH 8.9p1</li><li>Vulnerable version of product OpenSSH found -- OpenBSD OpenSSH 8.9p1</li></ul><p>Vulnerable version of OpenSSH detected on Ubuntu Linux</p></p>","protocol":"TCP","solution_fix":"<p>Download and apply the upgrade from: <a href=\"https://ftp.openbsd.org/pub/OpenBSD/OpenSSH\">https://ftp.openbsd.org/pub/OpenBSD/OpenSSH</a></p>","solution_id":"openbsd-openssh-upgrade-latest","solution_summary":"Upgrade to the latest version of OpenSSH","solution_type":"rollup","status":"VULNERABLE_VERS","vulnerability_id":"openbsd-openssh-cve-2024-6387"},{"check_id":null,"first_found":"2025-05-08T06:51:31Z","key":"","last_found":"2025-05-08T06:51:31.736Z","nic":null,"port":22,"proof":"<p><ul><li>Running SSH service</li><li>Insecure MAC algorithms in use: hmac-sha1,[email protected]</li></ul></p>","protocol":"TCP","solution_fix":"<p>Consult the product documentation for instructions to disable any insecure MD5 or 96-bit HMAC algorithms within the SSH configuration.</p>","solution_id":"ssh-weak-message-authentication-code-algorithms","solution_summary":"Disable any MD5 or 96-bit HMAC algorithms within the SSH configuration","solution_type":"workaround","status":"VULNERABLE_VERS","vulnerability_id":"ssh-weak-message-authentication-code-algorithms"}]}],"metadata":{"number":0,"size":2,"totalResources":2195,"totalPages":1098,"cursor":null},"links":[{"href":"https://us.api.insight.rapid7.com:443/vm/v4/integration/assets?page=0&size=2","rel":"first"},{"href":"https://us.api.insight.rapid7.com:443/vm/v4/integration/assets?page=0&size=2","rel":"self"},{"href":"https://us.api.insight.rapid7.com:443/vm/v4/integration/assets?page=1&size=2&cursor=1542252837:::_S:::12474375-34a7-40a3-9821-28db0b5cc90e-default-asset-10","rel":"next"},{"href":"https://us.api.insight.rapid7.com:443/vm/v4/integration/assets?page=1097&size=2","rel":"last"}]}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's pretty print this and use { minify_json ….

changes:
- description: Expand documents to map each vulnerability per asset.
type: enhancement
link: https://github.com/elastic/integrations/pull/1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
link: https://github.com/elastic/integrations/pull/1
link: https://github.com/elastic/integrations/pull/13878

"proof": "\n\n\nRunning SSH service\n\nProduct OpenSSH exists -- OpenBSD OpenSSH 8.9p1\n\nVulnerable version of product OpenSSH found -- OpenBSD OpenSSH 8.9p1\n\n\nVulnerable version of OpenSSH detected on Ubuntu Linux\n\n",
"protocol": "TCP",
"solution": {
"fix": "\nDownload and apply the upgrade from: https://ftp.openbsd.org/pub/OpenBSD/OpenSSH\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want the scars from the html_strip processor here and in proof above. Suggest adding a trim left and right \n to these fields.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth trimming the other cases of html_strip?

/cc @kcreddy

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth trimming the other cases of html_strip?

Yes that would make sense 👍🏼. Thanks @efd6.

@brijesh-elastic brijesh-elastic requested a review from efd6 May 12, 2025 09:29
Comment on lines -31 to -38
- set:
target: url.params.comparisonTime
value: '[[formatDate (now (parseDuration "-{{interval}}")) "RFC3339"]]'
response.pagination:
- set:
target: url.params.comparisonTime
value: '[[.last_response.url.params.Get "comparisonTime"]]'
fail_on_template_error: true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I know why this part if removed?

Copy link
Contributor

@kcreddy kcreddy May 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@efd6 @jamiehynds, I would like to get your thoughts on this. Its related to #9354 which requires splitting current documents to make one doc per asset vulnerability.

What we currently have is an incremental ingestion using url.params.comparisonTime inside httpjson. With this, as per the assets API doc, we are going to get 3 arrays: new[], same[], and remediated[] . If we go with incremental ingestion, we might need to split on both new[] and remediated[] (because thats where the incremental updates are going to be), which httpjson is not capable of. httpjson cannot split on 2 arrays.

To get around this limitation, following are our options:

  1. Switching to full sync by removing url.params.comparisonTime, this way we always have only 1 array: same[] where split works using httpjson. This is what current PR is doing. I think if we are okay with this, we should atleast increment major version because of drastic increase in volume for users.
  2. Rewrite in CEL and continue incremental ingestion by splitting on both new[] and remediated[]. A future enhancement (for CDR work) is going to happen which joins our current asset API data onto vulnerability API data. This join is also going to be easier to implement inside CEL instead of httpjson chain calls. Because of this, I think it is better to rewrite in CEL now itself.

WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding more points to consider after discussion with @efd6 on the above 2 options:

  • Both options will lead to increase in volume, the first permanently and the second until the collection reaches where the cursor was up to.
  • Both options leave users with duplicates as we don't have fingerprint processor.
  • Both options need a bump in major version because there's going to be duplicates and volume increase.

I think CEL option (both inputs cohabiting) might be lesser evil of the two options. We could leave the httpjson input for users that still want to use it.

@jamiehynds let me know your thoughts if we can go ahead with adding new CEL input.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kcreddy If CEL is the better long-term technical fit, I think it makes sense to rewrite in CEL now. That said, we should avoid keeping both input types side-by-side long-term, as it inevitably creates user confusion — we’ve already seen this play out with the Google Workspace integration.

As part of the major version bump, could we mark the httpjson data stream within the integration as legacy (or something equivalent) and will be removed in a future update. We also need to clearly document what’s included in the update. We can surface the breaking changes to users through the new notification mechanism and required their acknowledgement before updating.

While it's not possible to hide inputs for users not currently using the integration, maybe this could be a forcing function and could bring to the Fleet team. i.e. for any user who doesn't currently use the Rapid7 integration, they'll only get presented with the CEL input.

On the topic of increased data volume — do we have a rough estimate of the scale (2x, 10x, etc.)? Even if it’s a temporary spike with CEL, we’ll need to give users a heads-up.

And regarding duplicates — are we getting a duplicate for every vulnerability, or is it certain events or records being duplicated? Just want to be clear on what exactly users will see.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamiehynds When @kcreddy and I were talking, we were thinking of the model that we've used in the mimecast integration, where the HTTPJSON input data stream is labelled legacy (without reference to the input) and the CEL input data stream is labelled v2.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main prerequisite for this PR was to analyse existing data properly for CDR work item: https://github.com/elastic/security-team/issues/12534, but it looks like we need not merge this PR, as I was able to perform the analysis based on this PR code itself (without merging). I was able to unblock myself by testing this PR change and doing the analysis.

For now, I think it makes sense to mark this PR as Draft and work on adding CDR changes to this PR itself. That way we only have 1 breaking-change i.e., all changes for split docs + rewrite in CEL (deprecate httpjson) + CDR mappings (and transform) instead of having 2 breaking-changes. It would thus make more sense now to rewrite this in CEL.
cc: @brijesh-elastic


@efd6, with mimecast both httpjson and CEL co-existing would make more sense because CEL is for v2 API while httpjson is for older v1 API and also this older v1 API is not deprecated.
In Rapid7 case, both httpjson and CEL will be relying on same API. CEL will have more enriched data (because it calls vulnerability API as well) but essentially it ingests same asset vulnerabilities. In Rapid7, I believe there could be slightly more user confusion than mimecast if we make both inputs coexist. I think we should mark httpjson as (deprecated) by clearly documenting the update and hope our Fleet team can handle presenting only CEL input to new users. WDYT?

@jamiehynds, I think it makes sense to me to add a deprecation to the title suffix such as (deprecated) for httpjson input so that users become aware when they see it. If Fleet team can base their input off of this title suffix, it will be great because we don't have a proper way to deprecate input.

Regarding the questions:

On the topic of increased data volume — do we have a rough estimate of the scale (2x, 10x, etc.)? Even if it’s a temporary spike with CEL, we’ll need to give users a heads-up.

And regarding duplicates — are we getting a duplicate for every vulnerability, or is it certain events or records being duplicated? Just want to be clear on what exactly users will see.

The estimated increase in volume is directly proportional to number vulnerabilities in user's environment. If the user's environment has N vulnerabilities in their assets, the increase is going to be ~Nx.
This is because existing code was ingesting 1 document per asset (all vulnerabilities are bundled together in this document), but after this change we are going to ingest 1 document per asset per vulnerability.

The duplication is difficult to categorise. The initial pull of CEL will ingest all existing assets along with their vulnerabilities (historical) before it reaches where the previous cursor was for httpjson. One could say that an asset information, which was earlier in 1 document, will now be in 1 (httpjson)+ N (CEL) documents . The vulnerability information, which is in 1 document, will now be in 1 (httpjson)+ 1 (CEL) documents.


If we want to avoid all the fuss, we also have another option: We could add a new data_stream called asset_vulnerabilities on top of the existing data_streams asset or vulnerabilities. There won't be any duplicates if we take this approach. The existing data_streams can still continue to work.

  1. asset data_stream.
    • Existing data_stream based on httpjson.
    • Relies on Search Assets API
    • Has details on assets belonging to user's environment along with top-level in information on all vulnerabilities per asset inside each document (proof, solution, fix, etc.)
  2. vulnerability data_stream:
    • Existing data_stream based on httpjson.
    • Relies on Search Vulnerabilities API.
    • Has detailed vulnerability information of all known Rapid7 vulnerabilities (cve, score, severity, etc.)
  3. asset_vulnerabilities data_stream:
    • New data_stream based on CEL.
    • Relies on both Search Assets API and Search Vulnerabilities API to enrich asset vulnerability data.
    • Has details on assets belonging to user's environment along with detailed information of all vulnerabilities belonging to these assets with 1 document per asset per vulnerability (proof, solution, fix, cve, score, severity etc.)

Let me know your thoughts @jamiehynds @efd6.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should mark httpjson as (deprecated) by clearly documenting the update and hope our Fleet team can handle presenting only CEL input to new users. WDYT?

This is what I was thinking.

Leaving the asset_vulnerabilities query to @jamiehynds, but it does look appealing.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving the httpjson input version of the asset_data_stream makes sense to avoid significant breaking changes for existing users. It's far from ideal, but the only solution we have right now is to mark it as deprecated. This situation is going to come up again I'm sure and reiterates the need for seamless method to move from httpjson to CEL and the ability to hide deprecated data streams for new users of an integration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the input @jamiehynds. Just to confirm it, are we going with:

  1. Create new data_stream asset_vulnerabilities in CEL and then also deprecate asset data_stream (based on httpjson) completely?
    OR
  2. Use existing asset data_stream, add CEL input, and deprecate httpjson input?

Option 1 will avoid significant breaking changes, because its a new data_stream.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just got clarification from @jamiehynds below:

Going with the new data stream seems like the best path, to avoid major breaking changes and hopefully new users to the integration will use that integration when they see the old one is deprecated.

@brijesh-elastic, please go with option 1 from above comment, i.e., creating new data_stream asset_vulnerability and deprecate existing asset data_stream.
I think we should close this PR and start on new one because the changes here are no longer valid.
Please create 1 PR adding this new data_stream (incrementing major version) that closes: #9354, #13775, and #13776

"dataset": "rapid7_insightvm.asset",
"ingested": "2023-05-23T16:17:08Z",
"ingested": "2025-05-12T09:23:16Z",
"kind": "state",
"original": "{\"assessed_for_policies\":false,\"assessed_for_vulnerabilities\":true,\"critical_vulnerabilities\":0,\"exploits\":0,\"id\":\"452534235-25a7-40a3-9321-28ce0b5cc90e-default-asset-199\",\"ip\":\"10.1.0.128\",\"last_assessed_for_vulnerabilities\":\"2020-03-20T19:19:42.611Z\",\"last_scan_end\":\"2020-03-20T19:19:42.611Z\",\"last_scan_start\":\"2020-03-20T19:18:13.611Z\",\"malware_kits\":0,\"moderate_vulnerabilities\":2,\"new\":[],\"os_architecture\":\"x86_64\",\"os_description\":\"CentOS Linux 2.6.18\",\"os_family\":\"Linux\",\"os_name\":\"Linux\",\"os_system_name\":\"CentOS Linux\",\"os_type\":\"General\",\"os_vendor\":\"CentOS\",\"os_version\":\"2.6.18\",\"remediated\":[],\"risk_score\":0,\"severe_vulnerabilities\":0,\"tags\":[{\"name\":\"lab\",\"type\":\"SITE\"}],\"total_vulnerabilities\":2}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update config.yml so that one of the documents containing same is populated instead of this one (doesn't have same)

Copy link
Contributor

@kcreddy kcreddy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brijesh-elastic don't close #9354 as the later part of the issue description indicates a requirement for preserving only 1 asset vulnerability inside the index. I think this can only be truly closed after implementing latest transform which stores latest state of each asset vulnerability inside destination index.

@elasticmachine
Copy link

💚 Build Succeeded

History

cc @brijesh-elastic

Copy link

@brijesh-elastic brijesh-elastic marked this pull request as draft May 19, 2025 08:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Integration:rapid7_insightvm Rapid7 InsightVM Team:Security-Service Integrations Security Service Integrations team [elastic/security-service-integrations] Team:Sit-Crest Crest developers on the Security Integrations team [elastic/sit-crest-contractors]
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants