Manual labels with large number of hosts returning errors #25555

Closed
rebeccaui opened this issue Jan 17, 2025 · 4 comments
Labels
bug Something isn't working as documented customer-starchik #g-orchestration Orchestration product group :release Ready to write code. Scheduled in a release. See "Making changes" in handbook. ~released bug This bug was found in a stable release.

Comments

@rebeccaui
Contributor

rebeccaui commented Jan 17, 2025

Fleet version: 4.61.0

Web browser and operating system: Ubuntu Noble


💥  Actual behavior

Customer is creating manual Fleet labels based on LDAP membership. When they go to create or update a manual label with many hosts, Fleet returns an error.

This is the error for a label with 5205 hosts:

{
   component: http
   err: get hostnames by identifiers: Error 1436 (HY000): Thread stack overrun:  242319 bytes used of a 262144 byte stack, and 20000 bytes needed.  Use 'mysqld --thread_stack=#' to specify a bigger stack.
   level: error
   method: POST
   took: 17.495934ms
   ts: 2025-01-15T18:14:35.152351432Z
   uri: /api/v1/fleet/labels
   user: <user_email>
}
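The numbers in that message already explain the failure: the bytes in use plus the bytes needed slightly exceed the configured thread stack. A minimal sketch of that arithmetic, assuming the message keeps the wording shown above (the helper name `stack_shortfall` is hypothetical, not part of Fleet or MySQL):

```python
import re

# Matches the three numbers in the MySQL error 1436 text quoted above.
STACK_OVERRUN = re.compile(
    r"(\d+) bytes used of a (\d+) byte stack, and (\d+) bytes needed"
)


def stack_shortfall(message: str) -> int:
    """Return how many bytes the query exceeds the thread stack by."""
    match = STACK_OVERRUN.search(message)
    if match is None:
        raise ValueError("not a thread stack overrun message")
    used, total, needed = (int(group) for group in match.groups())
    return used + needed - total


msg = (
    "Error 1436 (HY000): Thread stack overrun:  242319 bytes used of a "
    "262144 byte stack, and 20000 bytes needed."
)
print(stack_shortfall(msg))  # 175
```

In other words, 242319 + 20000 = 262319 bytes, which is 175 bytes more than the 262144-byte stack.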

When they clicked in the UI to edit a manual label with 2078 hosts, the page failed to load and the server logged these errors:

{
   component: http
   err: get labels for host: selecting host labels: context canceled
   level: error
   method: GET
   took: 3.370685941s
   ts: 2025-01-16T20:27:20.161688196Z
   uri: /api/latest/fleet/hosts/14213
   user: <user_email>
}

{
   component: http
   err: get host: load query stats: context canceled
   level: error
   method: GET
   took: 747.269835ms
   ts: 2025-01-16T20:27:20.162532452Z
   uri: /api/latest/fleet/hosts/14261
   user: <user_email>
}

{
   component: http
   err: get host: list packs for host: 14853: listing hosts in pack: context canceled
   level: error
   method: GET
   took: 516.44296ms
   ts: 2025-01-16T20:27:20.162939431Z
   uri: /api/latest/fleet/hosts/14853
   user: <user_email>
}

{
   component: http
   err: load host software: context canceled
   level: error
   method: GET
   took: 3.534567077s
   ts: 2025-01-16T20:27:20.163917591Z
   uri: /api/latest/fleet/hosts/14116
   user: <user_email>
}

{
   component: http
   err: get host: load host users: context canceled
   level: error
   method: GET
   took: 1.001904159s
   ts: 2025-01-16T20:27:20.16470928Z
   uri: /api/latest/fleet/hosts/14927
   user: <user_email>
}

🧑‍💻  Steps to reproduce

  1. TODO
  2. TODO

🕯️ To QA

See QA steps in #25777 description

N/A

@rebeccaui rebeccaui added :incoming New issue in triage process. :reproduce Involves documenting reproduction steps in the issue bug Something isn't working as documented customer-starchik labels Jan 17, 2025
@JoStableford
Contributor

Linked to Unthread ticket:

Error when creating/updating manual label with many hosts #4167

@sharon-fdm sharon-fdm added :release Ready to write code. Scheduled in a release. See "Making changes" in handbook. #g-orchestration Orchestration product group labels Jan 20, 2025
@sharon-fdm
Collaborator

@sharon-fdm sharon-fdm added this to the 4.64.0-tentative milestone Jan 24, 2025
@sharon-fdm sharon-fdm removed the :incoming New issue in triage process. label Jan 24, 2025
sgress454 added a commit that referenced this issue Jan 31, 2025
For #25555 

# Checklist for submitter

- [X] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
See [Changes
files](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/Committing-Changes.md#changes-files)
for more information.
- [X] Input data is properly validated, `SELECT *` is avoided, SQL
injection is prevented (using placeholders for values in statements)

This PR updates the `NewLabel` service to use the
`UpdateLabelMembershipByHostIDs` method previously added by
@jacobshandling rather than using `ApplyLabels`. The latter method has
performance issues when adding large numbers of hosts at once to a
manual label (see #25555) because it does an expensive lookup of host
names before transforming those into Fleet host IDs. The new code skips
the middleman and transforms host identifiers directly to Fleet host
IDs, and does so using a batching strategy to ensure the queries don't
get too large.
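
The batching idea described above can be sketched roughly as follows. This is an illustration in Python, not the Go code in this PR: `chunked`, `resolve_host_ids`, `fake_lookup`, and the batch size of 1000 are all hypothetical names and values chosen for the example.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def resolve_host_ids(
    identifiers: Sequence[str],
    lookup_batch: Callable[[Sequence[str]], List[int]],
    batch_size: int = 1000,  # assumed value, not taken from the PR
) -> List[int]:
    """Map host identifiers to host IDs one bounded batch at a time."""
    host_ids: List[int] = []
    for batch in chunked(identifiers, batch_size):
        # Each call stands in for one query with a bounded parameter list.
        host_ids.extend(lookup_batch(batch))
    return host_ids


def fake_lookup(batch: Sequence[str]) -> List[int]:
    # Hypothetical stand-in for a database query: "host-12" -> 12.
    return [int(name.split("-")[1]) for name in batch]


ids = resolve_host_ids([f"host-{i}" for i in range(5205)], fake_lookup)
print(len(ids))  # 5205
```

With 5205 identifiers and a batch size of 1000, the lookup runs six times, and no single query grows with the total number of hosts.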

This PR does update `UpdateLabelMembershipByHostIDs` slightly to return
an updated Label object and host IDs array, as this is the expected
return value for `NewLabel`. I update the method's tests accordingly. I
don't think any new tests for `NewLabel` are needed as it should have
the same functionality and return values.

## Manual Testing

On the main branch, I launched my local MySQL with the thread stack size
set to the minimum allowed value, and used the API to try to create a new
label with 5,000 hosts attached, and received a 422 response from the
server. Server logs showed:
```
level=error ts=2025-01-28T15:08:20.465401Z component=http user=<user_email> method=POST 
uri=/api/latest/fleet/labels took=16.610292ms err="get hostnames by identifiers: Error 1436 (HY000): Thread stack 
overrun:  111136 bytes used of a 131072 byte stack, and 20000 bytes needed.  Use 'mysqld --thread_stack=#' to specify 
a bigger stack."
```

On this branch, I kept the same MySQL settings and tried my API request
again and it was successful:
<img width="776" alt="image"
src="https://github.com/user-attachments/assets/c4f0f52b-4d09-457b-8096-4dd3a747b1f4"
/>

## QA

The script I used to create a new manual label with lots of hosts is at:
https://gist.github.com/sgress454/84f12064c437da456c456e25c26d9069

To run it, first grab a bearer token from any API request by opening the
network tab, clicking a Fleet API request, and in the headers tab
scrolling down to Authorization:
<img width="892" alt="image"
src="https://github.com/user-attachments/assets/5680f3bf-8db8-469a-9f03-000b86622c04"
/>
(only take the part _after_ "Bearer")
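
If it is easier to copy the whole header value, the token can be split off programmatically. A small sketch, assuming the value has the usual `Bearer <token>` shape (the helper name `bearer_token` is hypothetical and not part of the gist):

```python
def bearer_token(authorization_header: str) -> str:
    """Return only the token from a 'Bearer <token>' header value."""
    scheme, _, token = authorization_header.strip().partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        raise ValueError("expected a value of the form 'Bearer <token>'")
    return token


print(bearer_token("Bearer abc123=="))  # abc123==
```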

Then download the script from that gist and in its folder run:
```
NODE_TLS_REJECT_UNAUTHORIZED=0 node ./add_hosts_to_label.js <the bearer token> "<a label name>"
```
e.g.
```
NODE_TLS_REJECT_UNAUTHORIZED=0 node ./add_hosts_to_label.js U3HpbdtadmJXGKYSB0U/PbwfOpHbBt7FpkWmGKKYolOO1moLNZA6XxP+QO5LVukvAotZ7d+JbNUEEhYHZtxoqg== "some test label"
```
This will invoke the API on https://localhost:8080 and try to add 5000
hosts to a new label "some test label".

If you need to change the # of hosts or the url of the server, there are
additional arguments:
```
NODE_TLS_REJECT_UNAUTHORIZED=0 node ./add_hosts_to_label.js <the bearer token> "<a label name>" <number of hosts> <url>
```
e.g.
```
NODE_TLS_REJECT_UNAUTHORIZED=0 node ./add_hosts_to_label.js U3HpbdtadmJXGKYSB0U/PbwfOpHbBt7FpkWmGKKYolOO1moLNZA6XxP+QO5LVukvAotZ7d+JbNUEEhYHZtxoqg== "some test label" 10000 https://foo.bar
```
sgress454 added a commit that referenced this issue Jan 31, 2025
For #25555 

This PR fixes a failure when attempting to go to the "Edit Label" page
in the UI for a manual label with a large # of hosts. Rather than making
one API request per host in the label, we instead use the "get hosts for
label" API to get them all at once.


https://github.com/user-attachments/assets/5144efa1-d466-4565-9c5b-5a1456fe0de1
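
To make the difference in request volume concrete, here is a rough sketch in Python rather than the frontend code. The per-host path matches the URIs in the logs on this issue; the label-hosts path and both helper names are assumptions made for illustration.

```python
from typing import Iterable, List


def per_host_urls(host_ids: Iterable[int]) -> List[str]:
    """Old behavior: one host-details request per host in the label."""
    return [f"/api/latest/fleet/hosts/{host_id}" for host_id in host_ids]


def label_hosts_url(label_id: int) -> str:
    """New behavior: a single request for the label's hosts.

    The exact path is an assumption, not confirmed by this issue.
    """
    return f"/api/latest/fleet/labels/{label_id}/hosts"


# A label with 2078 hosts, as in the original report (IDs are made up).
old_requests = per_host_urls(range(1, 2079))
new_requests = [label_hosts_url(42)]  # 42 is a made-up label ID
print(len(old_requests), len(new_requests))  # 2078 1
```

Each of the 2078 per-host requests loads labels, packs, software, and users for that host, which is consistent with the variety of "context canceled" errors in the report.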
@lukeheath
Member

@sharon-fdm This still has the :reproduce label but I see a related PR from @sgress454. Just a reminder to remove the :reproduce label after we confirm it's a bug.

@lukeheath lukeheath added ~released bug This bug was found in a stable release. and removed :reproduce Involves documenting reproduction steps in the issue labels Jan 31, 2025
@fleet-release
Contributor

Labels with many hosts fail,
Stack overrun, errors sail,
Fixed, we prevail.

6 participants