Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate MiniLcm types #1344

Open
wants to merge 22 commits into
base: develop
Choose a base branch
from
Open

Validate MiniLcm types #1344

wants to merge 22 commits into from

Conversation

rmunn
Copy link
Contributor

@rmunn rmunn commented Jan 6, 2025

Fixes #1275.
Fixes #1276.
Fixes #1277.
Fixes #1278.
Fixes #1279.
Fixes #1280.

Work completed:

  • Entry validator now validates all fields
  • Sense validator now validates all fields
  • Example sentence validator now validates all fields
  • Writing system validator created
  • Semantic domain validator created
  • Part of speech validator created
  • Tests for entry validator
  • Tests for sense validator
  • Tests for example sentence validator
  • Tests for writing system validator
  • Tests for semantic domain validator
  • Tests for part of speech validator
  • Adjusted some existing unit tests to now create valid data

rmunn added 5 commits January 6, 2025 13:44
Contains validators for:

- Entry
- Sense
- Example Sentence
- Part of Speech
- Semantic Domain
Entry validation now includes "lexeme must not be empty", so we add a
non-empty lexeme to the existing entry validation tests.
These tests are quite similar to each other; a test helper method is
probably needed here.
Refactored citation form tests to be more generic so they can be resued
for other similar fields.
@rmunn rmunn self-assigned this Jan 6, 2025
@rmunn
Copy link
Contributor Author

rmunn commented Jan 6, 2025

My thoughts on SemDom.xml and GOLDEtic.xml - we load these XML files as resources into the app (TODO: determine which DLL the resources should live in) and create a singleton service (or just a static class) that parses those XML files at system startup and provides an IDictionary interface for looking up POS / semdom data. (Or a hash set if all we need is to validate GUIDs).

Then validation code can say "Is this a predefined / canonical item?" And if it's supposed to be canonical, ensure the GUID is correct. And optionally, verify that the name and description of the canonical items hasn't been modified.

@jasonleenaylor
Copy link

GUIDs are all that needs to be used to identify if a POS or SemDom is predefined. Unless you want to support versioning, which you probably don't.


private bool NotBeComponentSelfReference(Entry entry, ComplexFormComponent component)
{
return component.ComponentEntryId != entry.Id;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should not be Guid.Empty otherwise it won't make sense

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried adding a .When() condition to this check, but FluentValidation's .When takes a Func<Entry, bool> and there isn't a WhenForEach. So I'd either have to use a .ForEach with a .When, or I could just make these predicates pass when the GUID in question is empty. I chose the latter approach in commit 87b2f8b.

Copy link
Collaborator

@hahn-kev hahn-kev Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this may have gotten mixed up with the HaveCorrectComponentEntryReference change I missed it on my first pass too. This check needs to prevent the Id from being empty, as it can't be inferred from the parent. It needs to be not the parent and and not empty, otherwise we don't know what you're referencing.

@rmunn rmunn force-pushed the feat/validate-minilcm-types branch from 9b3d43e to 3bcd237 Compare January 7, 2025 20:11
@rmunn
Copy link
Contributor Author

rmunn commented Jan 7, 2025

Commit 3bcd237, which actually turns on validation, makes lots of unit tests fail, because their test data is now considered invalid. I'll check through and see if my validation rules are too strict or if test data needs to be updated. I expect it to be a little of both.

rmunn added 6 commits January 7, 2025 15:19
As long as senses only have a PartOfSpeechId in them, it's hard to check
that property statelessly because we need to look up the PartOfSpeech in
order to determine whether it's predefined (and thus whether its GUID
needs to match one of the canonical GUIDs). For now, we'll skip checking
part of speech GUIDs until senses have an actual PartOfSpeech reference.
Now, instead of semantic domains always being considered predefined when
they come from fwdata, we can now look up their GUIDs in the canonical
list and set Predefined correctly. This also makes two failing tests pass.
The Sena3SyncTests are failing because some parts of speech are being
created with non-canonical GUIDs. Let's comment this out for now to make
the tests pass, then uncomment it once we've investigated where the
non-canonical PoS GUIDs are coming from.
@rmunn
Copy link
Contributor Author

rmunn commented Jan 7, 2025

Many tests are now failing with the error "Fieldname 'HumanNoOpinionNumber' does not exist." I don't get those failures when running tests locally, and I have no idea where that "HumanNoOpinionNumber" name is coming from. @hahn-kev - Any ideas on this one?

@rmunn rmunn marked this pull request as ready for review January 7, 2025 21:01
@rmunn
Copy link
Contributor Author

rmunn commented Jan 7, 2025

#1350 will be useful here; I had to comment out some of the GUID validation for parts of speech because when all you have is the GUID (in Sense.PartOfSpeechId), it's actually impossible to verify it statelessly. You need to get the PartOfSpeech object in order to see if its Predefined property is true, and only check canonical GUIDs for predefined parts of speech. But FluentAssertion wants validation to be stateless, so that's impossible.

But once #1350 is merged, it will become possible to validate parts of speech as objects, and the PartOfSpeechId validation will just need to be "Should be the same GUID as the PartOfSpeech". The PartOfSpeech validation can then, statelessly, check the Predefined property and look up the GUID in the canonical list only if needed.

@rmunn
Copy link
Contributor Author

rmunn commented Jan 8, 2025

Change decided in meeting with Kevin: example sentences are allowed to have an empty .Sentence property.

A Sentence property that has no content at all should be allowed. (Still
should not have any empty writing systems in a MultiString, of course).
{
// GUID list taken from src/SIL.LCModel/Templates/GOLDEtic.xml in liblcm
// TODO: Consider loading GOLDEtic.xml into app as a resource and add singleton providing access to it, then look up GUIDs there rather than using this hardcoded list
public static HashSet<Guid> CanonicalPosGuids = [
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be readonly, right now the field is mutable, and the HashSet is mutable. Obviously make the field readonly, instead I'd use a FrozenSet<T>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// GUID list taken from src/SIL.LCModel/Templates/SemDom.xml in liblcm
// TODO: Consider loading SemDom.xml into app as a resource and add singleton providing access to it, then look up GUIDs there rather than using this hardcoded list
public static HashSet<Guid> CanonicalSemDomGuids = [
new Guid("63403699-07C1-43F3-A47C-069D6E4316E5"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to spend a bunch of time on it if it's not easy, but looking at this code, at runtime it's going to have to parse all 1,000 strings. An alternative would be to use the constructor which takes a ReadOnlySpan<byte>. This would require creating the guids via parsing then calling guid.ToByteArray() and converting that into source like this:

new Guid([128, 117, 208, 48, 82, 80, 145, 77, 188, 36, 70, 155, 139, 45, 125, 249]), //30d07580-5052-4d91-bc24-469b8b2d7df9

again if that turns out to be too hard don't worry about it, we can always improve it later

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do later. It's probably easy to automate this with a quick script, so I'll try that.

@@ -162,6 +162,7 @@

public Task<WritingSystem> CreateWritingSystem(WritingSystemType type, WritingSystem writingSystem)
{
validators.ValidateAndThrow(writingSystem);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's a number of these warnings, I see you've fixed a few of them but there's still some more

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. This makes a change I made, splitting ValidateAndThrowAsync out from ValidateAndThrow, irrelevant, and I'll move the names back to ValidateAndThrow everywhere. (I did this precisely because CreateWritingSystem was synchronous, and I thought it should therefore use sync validation — and also because of the "You should not use asynchronous rules when using automatic validation with ASP.NET as ASP.NET’s validation pipeline is not asynchronous" warning in the FluentValidation docs).

@@ -301,7 +305,7 @@
Id = semanticDomain.Guid,
Name = FromLcmMultiString(semanticDomain.Name),
Code = semanticDomain.Abbreviation.UiString ?? "",
Predefined = true, // TODO: Look up in a GUID list of predefined data
Predefined = CanonicalGuidsSemanticDomain.CanonicalSemDomGuids.Contains(semanticDomain.Guid),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there should be similar code for PartOfSpeech, we should do the same thing there

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -75,6 +75,7 @@ private async Task WorkaroundMissingWritingSystems()
}

[Fact]
[Trait("Category", "Integration")]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than adding this trait to each test you could just add it to the class

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did not realize that was a possibility. Done.

var entryId = Guid.NewGuid();
var entry = new Entry() { Id = entryId, LexemeForm = new MultiString(){{"en", "lexeme"}}, ComplexForms = [new ComplexFormComponent(){ ComplexFormEntryId = entryId, ComponentEntryId = Guid.Empty }] };
_validator.TestValidate(entry).ShouldHaveValidationErrorFor("ComplexForms[0]");
// _validator.TestValidate(entry).ShouldNotHaveAnyValidationErrors();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can probably remove this commented code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

public void Succeeds_WhenComponentsContainEmptyGuid()
{
var entryId = Guid.NewGuid();
var entry = new Entry() { Id = entryId, LexemeForm = new MultiString(){{"en", "lexeme"}}, Components = [new ComplexFormComponent(){ ComplexFormEntryId = entryId, ComponentEntryId = Guid.Empty }] };
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made another comment about this elsewhere, but the ComponentEntryId must be defined here, otherwise we don't know what's being referenced, the ComplexFormEntryId can be empty because it can be inferred from the parent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the second time that I've struggled to understand something you said about complex form components, because my intuition isn't tracking with what you're saying. I may be misunderstanding what ComponentEntryId and ComplexFormEntryId mean in a ComplexFormComponent object. Here's what I think it means:

Entry "hatstand", ID 1. Is a complex form.
Entry "hat", ID 2. Is a component of ID 1.
Entry "stand", ID 3. Is a component of ID 1.

"hatstand" will have a Components property with two ComplexFormComponent objects in it:

  • ComplexFormEntryId = 1, ComponentEntryId = 2
  • ComplexFormEntryId = 1, ComponentEntryId = 3

"hat" will have a ComplexForm property with one ComplexFormComponent object in it:

  • ComplexFormEntryId = 1, ComponentEntryId = 2

"stand" will have a ComplexForm property with one ComplexFormComponent object in it:

  • ComplexFormEntryId = 1, ComponentEntryId = 3

Is that correct? Or do I have it backwards somehow?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's all correct.

It is really easy to mess up when looking at the code, especially because each entry can have Components and ComplexForms. I've messed stuff up a number of times too. If you have a suggestion of how to improve it I'm open to suggestions.


[Theory]
[InlineData("Sentence")]
[InlineData("Translation")]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should use the nameof syntax here as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

await SemanticDomainValidator.ValidateAndThrowAsync(value);
}

public void ValidateAndThrow(ComplexFormType value)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to not expose these if we can avoid it, when using the non async version you can't run any async validation rules. We don't have any now but I don't want to block us from using them in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the sync versions as they were no longer called once I fixed the sync blocking warnings from #1344 (review)


namespace MiniLcm.Validators;

public class PartOfSpeechIdValidator : AbstractValidator<Guid?>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we do end up keeping this then I might change the name to IsCanonicalPartOfSpeechIdValidator as it doesn't just validate any Id, it's validating that's it's canonical.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in commit 0a39681. We might remove it later but the rename is simple so it's worth doing now even if we do remove it later.

Copy link
Collaborator

@hahn-kev hahn-kev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left some feedback, it looks good so far.

It looks like that issue with HumanNoOpinionNumber is not consistent. Let me know if you see it again, I've sent a slack message to the flex team about it.

@rmunn
Copy link
Contributor Author

rmunn commented Jan 8, 2025

@hahn-kev wrote:

left some feedback, it looks good so far.

Addressed most review comments in commit c3f0908. The GUID constructor that doesn't parse strings is one I'll tackle tomorrow morning. Everything else is either done or waiting until later (such as renaming PartOfSpeechIdValidator, which I'm waiting on because I think I'll end up removing it) except for #1344 (review). There, I need a bit of help because I think I might be misunderstanding what the properties of ComplexFormComponent mean. (Either that, or I'm misunderstanding what you mean in that comment and I need you to expand on it a little).

It looks like that issue with HumanNoOpinionNumber is not consistent. Let me know if you see it again, I've sent a slack message to the flex team about it.

Yeah, that might have been a one-off caused by something entirely different. Commit 922f0a0, which only changed the validation rule for example sentences, also made that error go away. So I'll chalk it up to weirdness and ignore it unless it comes back.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants