
Slow performance with all content and product validation turned off #969

Closed
jordanpadams opened this issue Aug 8, 2024 · 10 comments · Fixed by #980
@jordanpadams (Member) commented Aug 8, 2024

Checked for duplicates

Yes - I've already checked

🐛 Describe the bug

When I ran validate on a 2M-product bundle with everything disabled except referential integrity checks, it still took 2 weeks.

Products: 2,013,873
Completed in 13.5 days (324.6 hours)

🕵️ Expected behavior

I expected the run to take less time.

📜 To Reproduce

  1. Get a large bundle.

Here's the command:

```
/path/validate-3.5.1/bin/validate -target /path/msam2/annex_ehlmann_caltech_msl_msam2 \
    --report-file bundle.valrpt \
    -R pds4.bundle \
    --skip-content-validation --skip-product-validation
```

🖥 Environment Info

Linux OS
m7l.2xlarge EC2 instance

📚 Version of Software Used

v3.5.1

🩺 Test Data / Additional context

No response

🦄 Related requirements

Tightly coupled with #931

⚙️ Engineering Details

Per initial analysis by @al-niessner, the offending code does the following:

  1. An outer loop runs over all targets.
  2. An inner loop runs over all targets again, but from a different list.
    a. An if test checks whether the element from (1) equals the element from (2).
    b. The lookup cannot use hashing because the code misuses the overly generic Collection<> where a Set<> or Map<> would be better suited to the task (see the sketch below).

This part of the job will take a while, and I have no way to estimate it for you. It does not seem to be stuck, though.
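
For illustration, here is a minimal sketch of that pattern and of the hashed-lookup alternative. The class and method names, and the String element type, are hypothetical, not taken from the validate codebase; the point is only the quadratic-versus-linear difference described above.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; names are not from the validate codebase.
class ReferenceScan {

  // O(n^2): for each target, scan the reference Collection with equals().
  static List<String> unreferencedQuadratic(List<String> targets,
      Collection<String> referenced) {
    List<String> result = new ArrayList<>();
    for (String target : targets) {       // (1) outer loop over all targets
      boolean found = false;
      for (String ref : referenced) {     // (2) inner loop over the other list
        if (target.equals(ref)) {         // (2a) equality test, no hashing
          found = true;
          break;
        }
      }
      if (!found) {
        result.add(target);
      }
    }
    return result;
  }

  // O(n): build a Set once, then each membership test is a hashed lookup.
  static List<String> unreferencedLinear(List<String> targets,
      Collection<String> referenced) {
    Set<String> refs = new HashSet<>(referenced); // one pass to build the set
    List<String> result = new ArrayList<>();
    for (String target : targets) {
      if (!refs.contains(target)) {               // O(1) expected per lookup
        result.add(target);
      }
    }
    return result;
  }
}
```

At 2 million products the quadratic form does on the order of 4×10¹² equality tests, while the Set form does about 4×10⁶ hash operations, which is the difference between days and seconds.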

🎉 Integration & Test

No response

@al-niessner (Contributor)

Using the harvest 143 data set, it took findUnreferencedIdentifiers 72953 ms to process 11139 products or 6.55 ms per product.

The entire job took 1,472,223 ms, which means only about 5% of the time went to reference checking. So what was validate doing for the other 95%? With product and content validation skipped, what was it doing for roughly 23 of the total 24.5 minutes? Point is, the reference checking may not be the problem.

Things to do to move forward:

  1. See what validate is doing that takes so much time when content and product validation are being skipped.
  2. Get a data set twice the size of the one currently being used (or even bigger) and see how the overall time changes along with the reference-checking time.
  3. Update the reference checking with an eye on performance improvement.

@al-niessner (Contributor)

Loading and reading all of the XML takes about 951 seconds, and the reference checking takes 73 seconds, for 1,024 seconds in total. That leaves roughly 450 seconds (7.5 minutes), about 30% of the overall time, unaccounted for.

Looking at it from a per-product perspective, there are some outliers that are orders of magnitude away from the norm. The outliers look like downloads of schema and Schematron (not verified, because that would take too much effort right now). Without removing the outliers, the median is 80 ms per product while the mean is 100 ms; a 10,000 ms download really skews the mean, so the median is the number to think about.

Extrapolating 80 ms per product to 2 million products gives about 4 days, give or take. Allowing for machine performance differences, that is consistent with your earlier email: the preamble part that I am estimating at 48 hours took you 30 hours. Completely believable. The 325 hours of reference checking that followed suggests a non-linear relationship in the reference checking. Moving on to improving that bit of code.

@al-niessner (Contributor) commented Aug 23, 2024

@jordanpadams

Okay, now I am way off into the weeds. This class is all messed up:
https://github.com/NASA-PDS/validate/blob/main/src/main/java/gov/nasa/pds/tools/validate/Identifier.java

Not necessarily the fault of the coder. The bad concept that a lid maps to the latest lidvid starts the problem, and the way Java uses hashCode() finishes the job. Java Maps and Sets holding Identifier will not work: the contract requires that equals() and hashCode() always agree, and these two methods are out of sync here, which breaks the foundation blocks of Java collections. If explicit lidvids were required everywhere, this could be corrected; having a bare lid mean the latest lidvid makes it impossible.

That week of work you were telling us about: we probably need to fix this class first, then go through and clean up all the dependent pieces correctly.

Make hashCode() and equals() be based on toString(), then add a nearEnough() method with the current equals() semantics. That should do the trick, after changing every current equals() call site that actually needs nearEnough().
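
A minimal sketch of that proposal follows; the field names and the exact nearEnough() rules are assumptions, not the actual gov.nasa.pds.tools.validate.Identifier class.

```java
import java.util.Objects;

// Sketch of the proposed fix; fields and versioning rules are assumed,
// not copied from gov.nasa.pds.tools.validate.Identifier.
public final class Identifier {
  private final String lid;      // logical identifier
  private final String version;  // null when only a bare lid is given

  public Identifier(String lid, String version) {
    this.lid = Objects.requireNonNull(lid);
    this.version = version;
  }

  @Override
  public String toString() {
    return version == null ? lid : lid + "::" + version;
  }

  // equals() and hashCode() derive from the same string, so they always
  // agree and Identifier is safe as a HashMap/HashSet key.
  @Override
  public boolean equals(Object o) {
    return o instanceof Identifier && toString().equals(o.toString());
  }

  @Override
  public int hashCode() {
    return toString().hashCode();
  }

  // The old equals() semantics: a bare lid is "near enough" to any lidvid
  // with the same lid. Fine for matching, unusable for hashing.
  public boolean nearEnough(Identifier other) {
    if (!lid.equals(other.lid)) {
      return false;
    }
    return version == null || other.version == null
        || version.equals(other.version);
  }
}
```

With equals() and hashCode() derived from the same string the class honors the Java contract, while nearEnough() keeps the looser lid-matches-latest-lidvid comparison for the call sites that need it.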

@al-niessner (Contributor) commented Aug 23, 2024

@jordanpadams @rgdeen

Okay, the time waster is not at all what I thought. This eats up 99% of the time (73 seconds, versus about 100 ms for everything else). Tell me what you want done with the offending code; basically, can I delete it?

```java
getListener().addProblem(new ValidationProblem(
    new ProblemDefinition(ExceptionType.INFO, ProblemType.REFERENCED_MEMBER,
        "Identifier '" + id.toString() + "' is a member of '"
            + getRegistrar().getIdentifierReferenceLocation(id) + "'"),
    locationUrl));
```

Yup, that is right: reporting the INFO message takes all of the time. I have not spent much time in the reporting code, but I have the vague notion that it uses lists to store things for collation and processing later. I am not sure why it takes so long, but if there are 12,000 messages already and it sorts or scans on each insert, I can see how the time to insert would grow as the list grows. The funniest part: I see the log.info output but never these INFO messages. So not only do they take forever, they are never seen.

I am going to keep digging because this is just bad, but let me know what you want done with this.
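
To make the suspicion concrete, here is a hypothetical illustration, not the actual validate reporting code: if every insert scans (or re-sorts) the existing list, the cost per insert grows with the list length and the total cost grows quadratically.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the suspected pattern, not the actual
// reporting code: a linear duplicate scan on every insert makes n
// inserts cost O(n^2) overall.
public class ProblemListDemo {
  public static void main(String[] args) {
    List<String> problems = new ArrayList<>();
    long start = System.nanoTime();
    for (int i = 0; i < 50_000; i++) {
      String msg = "problem-" + i;
      if (!problems.contains(msg)) { // O(n) scan of an ArrayList, every time
        problems.add(msg);
      }
    }
    System.out.printf("50k guarded inserts took %d ms%n",
        (System.nanoTime() - start) / 1_000_000);
  }
}
```

With 12,000 messages already in such a list, each new insert pays for a scan of all of them, which would match the slowdown observed here.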

@al-niessner (Contributor)

Made progress on this one: sped up the unreferenced check by 10x. Since the problem was the ValidateTarget cache not being used, the fix requires more memory (hundreds of bytes per product) but saves the time spent re-creating all of the metadata associated with each target, i.e. repeatedly re-reading the product's XML file just to get its product type.

The next step in the performance clean-up is to fix Identifier and deal with all the fallout from those changes.
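
A rough sketch of that trade-off, with hypothetical names (the real cache lives in validate's target handling, not in this class):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the cache trade-off described above; names and
// types are assumptions, not the actual ValidateTarget implementation.
public class TargetMetadataCache {
  private final Map<String, String> productTypeByLocation = new ConcurrentHashMap<>();

  // Return the cached product type, parsing the label XML only on the
  // first request for a given location. Costs a few hundred bytes per
  // product in exchange for never re-reading the same label.
  public String productType(String location) {
    return productTypeByLocation.computeIfAbsent(location, this::parseProductTypeFromXml);
  }

  private String parseProductTypeFromXml(String location) {
    // Stand-in for the expensive step: opening and parsing the label file.
    return "Product_Observational"; // placeholder value
  }
}
```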

@al-niessner (Contributor)

@jordanpadams @rgdeen

Completed the second and third steps. What took 73 seconds at the beginning of this task now takes 700 ms, a 100x improvement. The overall time has dropped from 1,500 seconds to 1,200 seconds, which is more than the 73 seconds from just the unreferenced list. Six minutes are still locked up in ReferenceType. That matches the analysis of the log file, which shows each label has dropped to 65 ms and about 400 seconds missing from the log, which seems to be the 6 minutes in ReferenceType. Diving in...

@jordanpadams (Member, Author)

@al-niessner

> Okay, the time waster is not at all what I thought. This eats up 99% of the time (73 seconds, versus about 100 ms for everything else). Tell me what you want done with the offending code; basically, can I delete it?
>
> ```java
> getListener().addProblem(new ValidationProblem(
>     new ProblemDefinition(ExceptionType.INFO, ProblemType.REFERENCED_MEMBER,
>         "Identifier '" + id.toString() + "' is a member of '"
>             + getRegistrar().getIdentifierReferenceLocation(id) + "'"),
>     locationUrl));
> ```
>
> Yup, that is right: reporting the INFO message takes all of the time. I have not spent much time in the reporting code, but I have the vague notion that it uses lists to store things for collation and processing later. I am not sure why it takes so long, but if there are 12,000 messages already and it sorts or scans on each insert, I can see how the time to insert would grow as the list grows. The funniest part: I see the log.info output but never these INFO messages. So not only do they take forever, they are never seen.
>
> I am going to keep digging because this is just bad, but let me know what you want done with this.

You can remove this info message.

@al-niessner (Contributor)

It seems all cleaned up for this usage. It is now twice as fast overall (down from 1,500 seconds to 700 seconds), and the reference checking itself is about 100x faster. Give it a try when you can.

@rgdeen commented Aug 28, 2024

Wow, that's great! I suspect some of those speedups are non-linear too, so if you are twice as fast on a 25-minute run you might be proportionally much faster on a 25-hour or 25-day run.

It takes a fair push to make that amount of data available for a larger test. Is it worth testing in the near term, or should we wait until the next time we need to use it (a couple of months, probably)? How would I even get the updated code?

@jordanpadams (Member, Author)

Skipping I&T. This is very hard to test, and the developers have done rigorous regression testing here.
