GUAC PoC cloud credits #266

Open
mlieberman85 opened this issue Feb 17, 2024 · 30 comments
Labels
budget TI Funding Request Quarterly TI requests for funding. Needs 5 approvals, 7d review.

Comments

@mlieberman85
Contributor

GUAC is looking to do a PoC with both maintainers of open source projects and end users, in part due to a larger effort in the Security Toolbelt.

I spoke to @SecurityCRob; we don't currently have a mechanism for this, but I'm ready to work through whatever we decide makes sense here.

@mlieberman85
Contributor Author

Related to #257

@hythloda
Member

hythloda commented Mar 5, 2024

This was approved in the TAC meeting on Mar 5, 2024.
[Mike] suggests taking any funding requests on an ad hoc basis while the process is being set up
[Sarahj] wants to ensure there is a cap on this for the PoC
[Mike] can set the limits; would like to see $1k per month
[Sarah] is this for a more public service? A: not really; mainly “can this scale” and whether the correct data is being provided
[Sarah] would like to understand what the milestones are. A: can highlight the milestones
Only authorized PoC users can use it
VOTE: motion by Arnaud to request the funding; Zoom vote taken and approved with 6 votes

@hythloda
Member

hythloda commented Mar 5, 2024

@mlieberman85 I will consult you on slack with details.

@hythloda hythloda closed this as completed Mar 5, 2024
@hythloda hythloda moved this from Submitted to TAC Awarded - needs contract in OpenSSF TI Funding Project Board Mar 6, 2024
@hythloda hythloda added the TI Funding Request Quarterly TI requests for funding. Needs 5 approvals, 7d review. label Mar 6, 2024
@mlieberman85
Contributor Author

Here is the high level proposal:

GUAC PoC Proposal


Problem(s)

OSS consumers want a place where they can discover and consume metadata and analytics about the software they ingest.

Toolbelt needs a way to store metadata that comes from the toolbelt (e.g. scorecard, SBOMs, SLSA, etc.) and for that metadata to be consumed by end users

OpenSSF incubating project GUAC needs some real world case studies to help identify additional features, bugs or scalability issues


Solution

Build a PoC internal service for a case study.

  • This is not a public service but should be a pilot
  • This should be used by a set of maintainers and end users for the study

Other things considered

  • Why not deps.dev or OSV?
    • These are data sources into a deeper analysis that GUAC can provide.
    • OSV/deps.dev don't provide the ability to add additional metadata.
    • The public services are not run by OSSF
  • Looking at other OpenSSF/Open Source supply chain analysis tools
    • There are not many and GUAC has grown in popularity.

Proposal

  • Run GUAC for some OpenSSF case studies like work in the Security Toolbelt group. This is not a public service but would be limited to maintainer and end user participants
  • This can evolve over time and if determined to not be useful to the community and/or deemed too expensive to run at scale by OSSF GB can be shut down

Cost

  • ~$1k a month in cloud resources for the initial case study
    • GUAC requires various resources like an api server, an event stream, a database, etc. that all need to run
    • This initial case study would also allow us to understand how GUAC scales.
    • Primary cost factors are size of ingestion and how many users are running complex queries against it.

Here are the next-level-down goals/scope:

Scope

Primary Goals

  • Understand the open source security needs of end users
    • What data/metadata do they need about the open source software they use and intend to use and what questions about that do they need answered in order to feel safe
  • Understand the security needs of maintainers and any blockers in them adopting security tooling and best practices, in particular OpenSSF tools and best practices
    • e.g. why can't they adopt scorecard or generate SBOMs
  • Understand where there are gaps in open source security frameworks, tools, etc. intended to secure production, distribution, and consumption of open source software
    • What features need to be added, how do we make adoption easier, etc.
    • How do we ensure all these tools interoperate well together?
    • This includes various OpenSSF projects like GUAC, Scorecard, SLSA, etc.

Secondary Goals

  • Test scalability of some elements of GUAC. We will be throwing a reasonable amount of supply chain metadata at it.
  • Test feasibility of running the various tools as, or as part of, a public service (not necessarily funded by OpenSSF or LF)
  • Dog food OpenSSF tools
    • This might help A-O and other initiatives in helping automate and provide better tooling for some of the large scale drives on implementation efforts they are doing

Running the PoC

  • Work with a handful (3-5) of maintainers of open source projects to throw their supply chain metadata at GUAC.
  • Work with a handful (3-5) of end users of open source projects to ask questions of GUAC, so that its answers can help them feel safe using these open source projects.
  • Work with the security toolbelt to iterate on what info gets generated by the projects (SBOMs, SLSA, scorecard, etc.), what features might need to be added to GUAC, and any integrations/tools that work with GUAC to answer questions for end users.
  • Generate a report at the end about how well this worked along with challenges, intended next steps, etc.

Resource Reqs

  • To start off with, we just need GUAC running in the cloud. This could just be via docker compose on an EC2 instance. Just something to get the ball rolling.
  • Over time if things go well we can spin this up to look more representative of a toolbelt with GUAC as the data store for software producer metadata (SBOM, SLSA, etc.) along with ingestion focused tools like those in S2C2F.
    • EKS or Fargate
    • RDS
    • ALB and other supporting tools
    • IAM and federation-related services so that we can make controlling access more straightforward (e.g. just give folks underneath an OSSF-managed GitHub team access to GUAC)

@KennyPaul
Contributor

Simply documenting activities discussed in a slack thread with @mlieberman85, @sevansdell, @hythloda and myself back on March 8, 2024.

  • LF accounts receivable confirmed that they were comfortable with the request being set-up for "pass-through" billing from LF-IT.
  • A new AWS account with the name "lf-openssf-guac" was set up by the GUAC team.
  • The invitation from the LF for account linking was sent to and accepted by the GUAC team.
  • The account was confirmed as being linked on the LF side.
  • @mlieberman85 agreed to the stipulation that system thresholds of Not To Exceed $1K monthly were to be configured and in place before any actual POC usage begins.

-kenny

@omkhar
Contributor

omkhar commented Mar 28, 2024

What is the plan to avoid accidental cost overruns? Can we enforce a monthly spend quota?

@omkhar omkhar reopened this Mar 28, 2024
@mlieberman85
Contributor Author

This is where I'm not 100% sure. AWS has some mechanisms to control spend but last I checked there was no way to enforce a monthly spend and then cut it off without additional tooling. We're comfortable utilizing any tool to prevent or minimize any overruns.

@hythloda hythloda moved this from TAC Recommended - needs contract to GM - approved in OpenSSF TI Funding Project Board Mar 28, 2024
@KennyPaul
Contributor

I've gotten clarification from LF-IT that while alerts can be set based upon a threshold, there is no way in AWS to enforce a spending cap natively. Extra tooling using some sort of monitoring system is indeed required to shut down any actively running systems or prevent new resources from being instantiated.

(As a side note, there is evidently some hard limit enforcement available via GHA, BUT that only works for GitHub-hosted resources rather than a 3rd-party cloud like AWS, and even then there are certain circumstances where GHA enforcement "...is a little fuzzy.")

I asked if AWS can send usage alerts to an address different than the one on the account itself. Evidently that can be done. The workaround I'm proposing is to set up an email alias on our end that includes both the AWS account email and any appropriate staff members to receive the alerts. So while not hard enforcement it would allow us to keep multiple eyes on the situation to keep it managed.

-kenny
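For illustration, the alias-based alerting workaround described above might look roughly like the following. This is only a sketch under assumptions: the budget name, the alias address, and the 50/75/90 percent thresholds are hypothetical placeholders, not values anyone agreed on in this thread. The function only builds the request payload in the shape the AWS Budgets `CreateBudget` API expects; nothing is sent to AWS.

```python
def budget_request(account_id, alias_email, limit_usd=1000, thresholds=(50, 75, 90)):
    """Build keyword arguments for the AWS Budgets CreateBudget call.

    Every threshold notifies the same email alias, so everyone on the
    alias sees each alert. The thresholds here are placeholders.
    """
    notifications = [
        {
            "Notification": {
                "NotificationType": "ACTUAL",        # alert on actual spend, not forecast
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": float(pct),
                "ThresholdType": "PERCENTAGE",       # percent of the budget limit
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": alias_email},
            ],
        }
        for pct in thresholds
    ]
    return {
        "AccountId": account_id,
        "Budget": {
            "BudgetName": "guac-poc-monthly",        # hypothetical name
            "BudgetLimit": {"Amount": str(limit_usd), "Unit": "USD"},
            "TimeUnit": "MONTHLY",
            "BudgetType": "COST",
        },
        "NotificationsWithSubscribers": notifications,
    }


# Hypothetical account ID and alias address, for illustration only.
request = budget_request("123456789012", "guac-poc-alerts@example.org")
```

With boto3 installed and credentials configured, the payload could be submitted with `boto3.client("budgets").create_budget(**request)`. AWS limits how many notifications a single budget can carry, so the threshold list should be checked against the current service limits. As noted above, this gives visibility only and is not a hard spending cap.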

@david-a-wheeler
Contributor

@KennyPaul - that mechanism worries me. I think it's horrifying that there's no built-in "not to exceed" mechanism; it's really easy to create an infinite bill.

If it has to be external monitoring, can we automatically pause things if they get out of hand? We don't want sites down, but we also don't want to go bankrupt.

Also: We may need to quickly switch GUAC to a Series (LLC). We should talk with LF legal - I think that might create a legal barrier so that a project can go bankrupt without LF or Mike Lieberman going bankrupt. There should be an "emergency stop, no more spend" mechanism for a system that could otherwise create absurd bills.

@hythloda
Member

hythloda commented Apr 1, 2024

> Also: We may need to quickly switch GUAC to a Series (LLC). We should talk with LF legal - I think that might create a legal barrier so that a project can go bankrupt without LF or Mike Lieberman going bankrupt. There should be an "emergency stop, no more spend" mechanism for a system that could otherwise create absurd bills.

Please find the guac charter online at https://github.com/guacsec/governance/blob/main/CHARTER.MD

@mlieberman85
Contributor Author

> @KennyPaul - that mechanism worries me. I think it's horrifying that there's no "not to exceed" mechanism built-in, it's really easy to create an infinite bill.
>
> If it has to be external monitoring, can we automatically pause things if they get out of hand? We don't want sites down, but we also don't want to go bankrupt.
>
> Also: We may need to quickly switch GUAC to a Series (LLC). We should talk with LF legal - I think that might create a legal barrier so that a project can go bankrupt without LF or Mike Lieberman going bankrupt. There should be an "emergency stop, no more spend" mechanism for a system that could otherwise create absurd bills.

So there are mechanisms within all the cloud services to stop you from overrunning certain limits and quotas, but they're a bit opaque and usually focused on the number of resources rather than cost, e.g. no more than 20 EC2 instances unless you ask to have the limit increased... but you can easily turn on a VERY expensive instance. We might be able to manually set the quotas. We can give you an estimate of what we plan to use and just set the quotas to not exceed that. I think.
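To make the count-versus-cost gap concrete, here is a small worked sketch. The hourly rates are made-up placeholders and not real AWS prices; the 20-instance quota and the $1k monthly cap are the figures mentioned in this thread. It shows that two plans can both fit inside the same instance-count quota while only one fits the budget.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)


def monthly_cost(instance_count, hourly_rate_usd):
    """Estimated monthly cost of running `instance_count` instances nonstop."""
    return instance_count * hourly_rate_usd * HOURS_PER_MONTH


def within_cap(plan, cap_usd=1000):
    """plan is a list of (instance_count, hourly_rate_usd) pairs.

    Returns (estimated_total_usd, fits_under_cap).
    """
    total = sum(monthly_cost(count, rate) for count, rate in plan)
    return total, total <= cap_usd


COUNT_QUOTA = 20  # the example instance-count limit from the comment above

# Hypothetical rates, for illustration only.
many_small = [(20, 0.05)]  # 20 cheap instances: about $730 a month
one_large = [(1, 30.0)]    # 1 expensive instance: about $21,900 a month

for plan in (many_small, one_large):
    count = sum(c for c, _ in plan)
    total, ok = within_cap(plan)
    print(f"instances={count} (quota ok: {count <= COUNT_QUOTA}), "
          f"estimated ${total:,.0f}/month, under cap: {ok}")
```

Both plans pass the count quota, but only the first stays under the cap. An estimate like this could inform which quotas to set, though it does not cover charges that are not tied to instance counts.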

@david-a-wheeler
Contributor

@hythloda - great! Sounds like creating the LLC is done or on its way to getting done. That helps, thank you.

I'd still love to see some sort of "emergency stop". If an attacker's DDoS attack creates a $5million bill within one hour, "reviewing it later" is too late. We can't be the first with that need. I hope a little searching will reveal an existing mechanism for doing that.

@mlieberman85
Contributor Author

> @hythloda - great! Sounds like creating the LLC is done or on its way to getting done. That helps, thank you.
>
> I'd still love to see some sort of "emergency stop". If an attacker's DDoS attack creates a $5million bill within one hour, "reviewing it later" is too late. We can't be the first with that need. I hope a little searching will reveal an existing mechanism for doing that.

So it should be impossible to do that in an account unless quotas are set really high, which requires other approvals. However, there's a big difference between going over by a dollar a month and going over by a few hundred dollars, and I'm not sure whether AWS currently has a way to say "going over by a dollar is fine, going over by 100 is not," or something like that. Let me link some of the stuff I've used in the past.

@david-a-wheeler
Contributor

@mlieberman85 - sounds perfect.

I'm just worried that an attacker manages to massively use a resource we hadn't limited enough. I believe there are too many services/configuration knobs/etc. in AWS to be absolutely certain we "got them all". It's the "far exceeded total expected spend due to an attacker's intentional actions" case that worries me, especially if it can be done before someone flips an emergency stop switch. I just want to automatically stop a runaway train before it hits people :-).

If we far exceed expected use because it's popular, that's awesome. I hope it happens :-). As long as we monitor activity, I don't expect an automatic emergency stop switch would interfere with it.

@omkhar
Contributor

omkhar commented Apr 1, 2024

My understanding is that the cloud provider natively supports quotas / not-to-exceed limits, am I right?
Is the issue that LF IT doesn't support setting quotas when using direct billing?

I think the issue of the legal stuff, while very important, might be orthogonal to how to limit a monthly bill, unless someone can explain to me how that helps with setting a quota.

@KennyPaul
Contributor

> @KennyPaul - that mechanism worries me. I think it's horrifying that there's no "not to exceed" mechanism built-in, it's really easy to create an infinite bill.

@david-a-wheeler Yeah. It worries me too.

@mlieberman85 expense-wise, the NTE number is $1K/month. What that translates to in actual utilization I have no idea. My initial thoughts on the granularity of reporting would be notification triggers at 50%, 75%, and then in 5% increments from there on.
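As a rough sketch of that schedule: the snippet below generates the trigger points (50%, 75%, then every 5%) and reports which ones a given month-to-date spend has crossed. Stopping the schedule at 100% is an assumption here, since "from there on" could also mean continuing past the budget; the function names and the sample spend figure are illustrative only.

```python
def alert_thresholds(first=(50, 75), step=5, stop=100):
    """Percent-of-budget trigger points: the `first` values, then every
    `step` percent up to and including `stop` (the stop is an assumption)."""
    points = list(first)
    nxt = first[-1] + step
    while nxt <= stop:
        points.append(nxt)
        nxt += step
    return points


def crossed(spend_usd, budget_usd, thresholds):
    """Return the thresholds that the current spend has reached or passed."""
    pct = 100.0 * spend_usd / budget_usd
    return [t for t in thresholds if pct >= t]


MONTHLY_BUDGET_USD = 1000  # the NTE figure from this thread

thresholds = alert_thresholds()
print(thresholds)                                    # [50, 75, 80, 85, 90, 95, 100]
print(crossed(820, MONTHLY_BUDGET_USD, thresholds))  # [50, 75, 80]
```

Against a $1K budget, each 5% step is $50, so the alerts after 75% would arrive at $800, $850, $900, $950, and $1,000.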

@omkhar I agree that the legal status is not particularly relevant in this particular context. The funding entity is OpenSSF and the billing for that account is configured to be routed appropriately.

All of the information I've been provided by IT indicates this is an AWS issue and not an LF-IT passthrough billing issue:

  • DOES provide expense threshold based notifications natively
  • DOES NOT provide expense threshold based automated resource throttling or an emergency stop switch natively

To facilitate the latter in AWS requires an expense threshold based notification to be fed to some other mechanism that would throttle and/or disable resources.

-kenny

@mlieberman85
Contributor Author

https://aws.amazon.com/blogs/mt/introducing-service-quotas-view-and-manage-your-quotas-for-aws-services-from-one-central-location/

There's budget stuff we can do here. In a previous life @trmiller wrote a lambda that could turn off resources. I think in our case we can definitely turn stuff off if we see stuff about to hit limits since it's not an outward facing service.

@frenchi

frenchi commented Apr 3, 2024

Chiming in to this thread to hopefully add some clarity (and some expensive lessons learned...):

Service Quotas

While this could be used as a preventative mechanism, it requires forethought into the types of resources required (which is an inexact science and becomes fragile) and, more importantly, does not prevent unexpected classes of billing (e.g. network egress fees, stopped instances, EBS volumes, etc.).

In my opinion, Quotas aren't the most effective control to meet the NTE $1k/month budget. They may be worthwhile secondary prevention.

> ...this is an AWS issue...
> DOES provide expense threshold based notifications natively
> DOES NOT provide expense threshold based automated resource throttling or an emergency stop switch natively

Effectively correct, as there is no way to say "I don't want to spend a cent over $1k/month" and have that be enforced.

However... Budget Actions are a little known feature in AWS that may be an appropriate control here.

Depending on the underlying infrastructure (they have direct support for stopping EC2/RDS instances, and SNS support for sending a notification to a Lambda to disable other resources), and by firing the alert early (e.g. scaling down when the forecasted spend surpasses 95% of the budget), it may be possible to achieve this goal.
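A minimal sketch of what the SNS-to-Lambda path could look like, assuming the budget alert is published to an SNS topic that triggers this function. The tag key and value used to find PoC instances are hypothetical, and the EC2 client is passed in so the logic can be exercised without AWS access. It only handles EC2 instances and does not paginate results; other resource types would need their own handling.

```python
def parse_budget_alert(event):
    """Collect the message text from each SNS record in a Lambda event."""
    return [r["Sns"]["Message"] for r in event.get("Records", []) if "Sns" in r]


def stop_tagged_instances(ec2, tag_key="project", tag_value="guac-poc"):
    """Stop running EC2 instances carrying the (hypothetical) PoC tag.

    `ec2` is a boto3 EC2 client or anything with the same two methods.
    Pagination is ignored for brevity.
    """
    resp = ec2.describe_instances(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    ids = [
        inst["InstanceId"]
        for reservation in resp.get("Reservations", [])
        for inst in reservation.get("Instances", [])
    ]
    if ids:
        ec2.stop_instances(InstanceIds=ids)
    return ids


def handler(event, context, ec2=None):
    """Lambda entry point: act only when the event carries an SNS alert."""
    if not parse_budget_alert(event):
        return {"stopped": []}
    if ec2 is None:
        import boto3  # imported lazily; only needed when running in AWS
        ec2 = boto3.client("ec2")
    return {"stopped": stop_tagged_instances(ec2)}
```

Stopping instances halts compute charges only. As noted earlier in this comment, stopped instances, EBS volumes, and egress can still bill, so a switch like this reduces the overrun risk without turning the budget into a hard cap.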

@mlieberman85
Contributor Author

It would not be hard for us to write a lambda as an emergency switch if need be!

@sevansdell
Contributor

What work remains for this to be closed?

@omkhar
Contributor

omkhar commented May 14, 2024

My approval is blocked waiting on a proposal for how we would avoid an overspend of the allotted credits. Several different methods have been suggested to achieve this outcome, but I have not seen a final proposal for how to accomplish it.

@mlieberman85
Contributor Author

I spoke to @bbpursell1 about this at OSS NA and it looked like there was a way forward. If there's something else I should write up, I can do that.

@omkhar
Contributor

omkhar commented May 15, 2024

via email between @mlieberman85 and @bbpursell1 ...

  1. @bbpursell1 will document a proposed solution and processes to ensure we do not exceed the allocated budget monthly or in total.
  2. @omkhar will approve the proposed solution
  3. @bbpursell1 and @mlieberman85 will implement the proposed solution
  4. Funding will be released

@sevansdell
Contributor

> via email between @mlieberman85 and @bbpursell1 ...
>
>   1. @bbpursell1 will document a proposed solution and processes to ensure we do not exceed the allocated budget monthly or in total.
>   2. @omkhar will approve the proposed solution
>   3. @bbpursell1 and @mlieberman85 will implement the proposed solution
>   4. Funding will be released

How is this progressing please?

@presidentoor

The budget requested for this was $1K a month, however we need a solid annualized budget, can someone provide that?

@mlieberman85
Contributor Author

Is reframing it here and just saying $12k good enough? The original idea was $12k over 1 year. We can speed up timelines.

@presidentoor

presidentoor commented Jun 12, 2024 via email

@mlieberman85
Contributor Author

Well, the start is whenever we can get the money. The end date would be Dec 31.

@hythloda hythloda moved this from GM - approved to GM budget allocation approved in OpenSSF TI Funding Project Board Jun 14, 2024
@presidentoor presidentoor moved this from Budget Approved to Funding Approved in OpenSSF TI Funding Project Board Jul 29, 2024
@omkhar omkhar moved this from Funding Approved to Funding in Execution in OpenSSF TI Funding Project Board Aug 6, 2024
@sevansdell
Contributor

@mlieberman85 Has this been resolved? Should we close out this issue now if so?


Projects
Status: Funding in Execution