Skip to content

Make DatasetUpdateService respect isBulkLoad() for audit purposes #6791

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: release25.3-SNAPSHOT
Choose a base branch
from

Conversation

bbimber
Copy link
Collaborator

@bbimber bbimber commented Jun 24, 2025

We have a server that runs an ETL to populate datasets. I noticed the dataset audit log was enormous, despite bulkTrue=true being set. In general, the QUS uses Summary audit logging, when bulkLoad=true. This PR makes DatasetUpdateService respect this behavior for delete and update. It was already doing that for insert and truncate.

I bet it's pretty easy for ETLs to accumulate a lot of audit records in a folder without an admin noticing. Even if other ETL users have not brought this up, it would be easy for it to be happening on many servers.

An alternative strategy would be to make this code inspect configParameters for DetailedAuditLogDataIterator.AuditConfigs.AuditBehavior=NONE or something to that effect. This would give the developer direct control and change existing behavior less. I'm happy to switch to this or another option if you prefer.

@bbimber bbimber requested a review from labkey-jeckels June 24, 2025 22:37
@labkey-jeckels
Copy link
Contributor

@bbimber I hoped I'd be able to work on this before leaving for vacation but I ran out of time. From an initial glance it seems reasonable. @labkey-klum @jryurkanin @labkey-martyp might have a chance to review before I return.

@labkey-jeckels
Copy link
Contributor

@bbimber I'm back from vacation and catching up. This change looks good to me. I'd like to get some test coverage for both the bulk and non-bulk scenarios. While the test ETL XML files are in the testAutomation repo, the tests themselves are in a private repo so I don't think it would be easy to ask you to add coverage. I'll get this into the queue.

@bbimber
Copy link
Collaborator Author

bbimber commented Jul 10, 2025

OK. I'm happy to write a simple integration test using the vehicles schema. Etl isn't really needed here

@labkey-jeckels
Copy link
Contributor

OK. I'm happy to write a simple integration test using the vehicles schema. Etl isn't really needed here

I'm certainly not against adding more coverage on the vehicles schema, but these changes are all dataset-specific so we should be sure to get coverage against dataset tables and bulkLoad is an ETL concept. We have some existing ETL tests that set up a study so that's a logical place to build from.

@bbimber
Copy link
Collaborator Author

bbimber commented Jul 10, 2025

Yes, forgot that. Vehicles no then. I'm just trying to minimize the burden placed on labkey here.

It probably still would be reasonable to try to create a study in Java code in the integration test (though I haven't actually tried this recently).

Anyway, I'm willing to give something a try if it reduces the labkey burden, but if you're going to create the dataintegraiton test no matter what, I probably would not be redundant. Just let me know which route you want.

@bbimber
Copy link
Collaborator Author

bbimber commented Jul 11, 2025

hi @labkey-jeckels: i got back to my computer and looked at the code. there is already an existing integration test under DatasetUpdateService. It should be quite easy to piggyback off this, so i will try to write something simple tomorrow to exercise this.

@bbimber
Copy link
Collaborator Author

bbimber commented Jul 11, 2025

good morning @labkey-jeckels: it seems like the integration test I added passed. can you please let me know if you think that's sufficient here?

@labkey-jeckels labkey-jeckels self-assigned this Jul 11, 2025
Copy link
Contributor

@labkey-jeckels labkey-jeckels left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the tests. I pushed a couple of small changes, to use constants and just skip the audit handler completely for bulk loads (similar to insert and update). Please double-check that they seem reasonable and see my question about whether the second insert is improving the coverage.

@bbimber
Copy link
Collaborator Author

bbimber commented Jul 11, 2025

Thanks for the tests. I pushed a couple of small changes, to use constants and just skip the audit handler completely for bulk loads (similar to insert and update). Please double-check that they seem reasonable and see my question about whether the second insert is improving the coverage.

yes, i saw those come through. this all seems reasonable to me. certainly using constants is better than not.

Sorry - what question are you referring to? You mention 'second insert', so do you mean ~line 927 of the test? That code inserts 2 rows, check for 2 audit rows, and then inserts one 1, checking for one new audit row. If that's what you mean then I agree the second single-row insert doesnt add a heck of a lot, but it also doesnt seem that negative either.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants