Refactor documentation generator to be modular #262
Comments
One small step that could simplify the Configuration part (Point 3 from above) is to adopt a common naming convention to aid the configuration. For example, take this configuration, a Python dictionary:
{
"dex": {
"examples": {
"examples": f"{IMPORT_CSV_PATH}/Example.csv",
},
},
"dpv": {
"TOM": {
"taxonomy": f"{IMPORT_CSV_PATH}/TOM.csv",
"properties": f"{IMPORT_CSV_PATH}/TOM_properties.csv",
},
"technical_measures": {
"taxonomy": f"{IMPORT_CSV_PATH}/TechnicalMeasure.csv",
},
"entities_authority": {
"taxonomy": f"{IMPORT_CSV_PATH}/Entities_Authority.csv",
"properties": f"{IMPORT_CSV_PATH}/Entities_Authority_properties.csv",
},
},
"ai": {
"core": {
"taxonomy": f"{IMPORT_CSV_PATH}/ai-core.csv",
"properties": f"{IMPORT_CSV_PATH}/ai-properties.csv",
}
},
"sector-infra": {
"purposes": {
"taxonomy": f"{IMPORT_CSV_PATH}/Purpose_Infrastructure.csv",
},
},
"eu-dga": {
"legal_basis": {
"taxonomy": f"{IMPORT_CSV_PATH}/DGA_LegalBasis.csv",
},
},
}
If we can agree on the casing and the use of a separator (dashes in extension names, underscore-prefixed suffixes), a compact configuration could use this suffix mapping:
"classes" = "_classes"
"examples" = "_examples"
"properties" = "_properties"
"purposes" = "_purposes"
"taxonomy" = ""
[dex]
examples = ["examples"]
[dpv]
TOM = ["taxonomy", "properties"]
technical_measures = ["taxonomy"]
entity_authority = ["properties"]
[ai]
core = ["taxonomy", "properties"]
[sector-infra]
purposes = ["taxonomy"]
[eu-dga]
legal-basis = ["taxonomy"]
This could generate an equivalent Python dictionary, below, based on the filename pattern:
{
"dex": {
"examples": {
"examples": f"{IMPORT_CSV_PATH}/dex_examples_examples.csv",
},
},
"dpv": {
"TOM": {
"taxonomy": f"{IMPORT_CSV_PATH}/dpv_TOM.csv",
"properties": f"{IMPORT_CSV_PATH}/dpv_TOM_properties.csv",
},
"technical_measures": {
"taxonomy": f"{IMPORT_CSV_PATH}/dpv_technical_measures.csv",
},
"entities_authority": {
"taxonomy": f"{IMPORT_CSV_PATH}/dpv_entities_authority.csv",
"properties": f"{IMPORT_CSV_PATH}/dpv_entities_authority_properties.csv",
},
},
"ai": {
"core": {
"taxonomy": f"{IMPORT_CSV_PATH}/ai_core.csv",
"properties": f"{IMPORT_CSV_PATH}/ai_core_properties.csv",
}
},
"sector-infra": {
"purposes": {
"taxonomy": f"{IMPORT_CSV_PATH}/sector-infra_purposes.csv",
},
},
"eu-dga": {
"legal_basis": {
"taxonomy": f"{IMPORT_CSV_PATH}/eu-dga_legal_basis.csv",
},
},
}
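To make the idea concrete, here is a minimal sketch of how the compact configuration plus the suffix mapping could be expanded into the full dictionary of CSV paths. The function name `expand` and the `IMPORT_CSV_PATH` value are illustrative assumptions, not existing code.

```python
# Hypothetical sketch: expanding the compact config into full CSV paths.
IMPORT_CSV_PATH = "./csv"  # placeholder value for illustration

# Suffix mapping as proposed above
SUFFIXES = {
    "classes": "_classes",
    "examples": "_examples",
    "properties": "_properties",
    "purposes": "_purposes",
    "taxonomy": "",
}

def expand(config):
    """Generate {vocab: {module: {kind: path}}} from the compact form."""
    out = {}
    for vocab, modules in config.items():
        out[vocab] = {}
        for module, kinds in modules.items():
            out[vocab][module] = {
                kind: f"{IMPORT_CSV_PATH}/{vocab}_{module}{SUFFIXES[kind]}.csv"
                for kind in kinds
            }
    return out

compact = {
    "dpv": {"TOM": ["taxonomy", "properties"]},
    "ai": {"core": ["taxonomy", "properties"]},
}
paths = expand(compact)
# e.g. paths["dpv"]["TOM"]["properties"] -> "./csv/dpv_TOM_properties.csv"
```

The filenames produced match the pattern in the generated dictionary above, so a new extension would only need one entry in the compact form.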
@bact agreed. I think the apparent inconsistency is because the underscores come from our initial filenames and the dashes come from HTML ids. I think there is some magic code that relies on them. The mass renaming of spreadsheet tabs is really cumbersome! BUT ... we can change the underscores and naming conventions to dashes in filenames trivially.
- HTML docs for .OWL are not generated (the option is turned off in 300.py), as OWL generation creates blank nodes with random IDs and undoes the benefits of recent PRs that give stability to RDFS+SKOS - to be fixed along with #262. OWL HTML can be generated at the very end of the development process as part of the release-prep workflow.
https://harshp.com/dev/dpv/docgen-modular-01 describes some of my thoughts on how we can implement this and also remove a lot of the redundant boilerplate around maintaining metadata. For example, the code below provides all the information needed to generate CSVs, RDFs, and HTMLs (common variables like folder paths and version are in other files):
# config/vocab/dpv.py
modules = ['m1', 'm2', 'm3']
# in CSV script, this is used to generate dpv_m1.csv, dpv_m2.csv, dpv_m3.csv
# in RDF script, this is used to generate dpv/modules/m1 ...
# in HTML script, this is used to read dpv/modules/m1 and generate m1 section in HTML
# however, there is more metadata for modules, so we use a dict
modules = {
'm1': {
'title': 'M1',
'parser': 'taxonomy', # method to parse the CSV to generate RDF
'source': {
'gsheet_id': '...', # ID of the Google Sheet
# not all vocab modules have both classes and properties
'classes': 'tab name', # will generate CSV dpv_m1_classes.csv
'properties': 'tab name', # will generate CSV dpv_m1_properties.csv
},
"html_template": "path...", # optional - will generate HTML output for module
}
}
folder = "/dpv"
name = "dpv"
# RDF path is <DPV_VERSION>/<folder>/<name>.ttl
html_template = f"{TEMPLATE_PATH}/template_dpv.jinja2"
metadata = {
"dct:title": "Data Privacy Vocabulary (DPV)",
"dct:description": "The Data Privacy Vocabulary (DPV) provides ...",
"dct:created": "2022-08-18",
"dct:modified": DPV_PUBLISH_DATE,
"dct:creator": "Harshvardhan J. Pandit, Beatriz Esteves, Georg P. Krog, Paul Ryan, Delaram Golpayegani, Julian Flake",
"schema:version": DPV_VERSION,
"profile:isProfileOf": "",
"bibo:status": "published",
}
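The comment above can be illustrated with a small sketch of how generator scripts might consume such a per-vocabulary config module. All names here (`csv_outputs`, `rdf_path`, the placeholder values) are assumptions for illustration, not the actual API; the key point is that the `modules` dict alone drives CSV, RDF, and HTML generation.

```python
# Hypothetical sketch of generator scripts consuming a vocab config module.
DPV_VERSION = "2.1"        # placeholder; real value lives in a shared config
IMPORT_CSV_PATH = "./csv"  # placeholder

name = "dpv"
folder = "/dpv"
modules = {
    "m1": {"source": {"classes": "tab1", "properties": "tab2"}},
    "m2": {"source": {"classes": "tab3"}},
}

def csv_outputs():
    """CSV files the download script would fetch, one per source tab."""
    for module, meta in modules.items():
        for kind in meta["source"]:
            if kind == "gsheet_id":  # metadata key, not a tab
                continue
            yield f"{IMPORT_CSV_PATH}/{name}_{module}_{kind}.csv"

def rdf_path():
    """RDF path follows <DPV_VERSION>/<folder>/<name>.ttl as noted above."""
    return f"{DPV_VERSION}{folder}/{name}.ttl"

outputs = list(csv_outputs())
# outputs -> ["./csv/dpv_m1_classes.csv", "./csv/dpv_m1_properties.csv",
#             "./csv/dpv_m2_classes.csv"]
```

With this shape, adding a new module means adding one dict entry rather than editing several places in a central `vocab_management.py`.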
- see #262 for discussion - NOTE: This commit has DELETED files

Squashed commit of the following:

commit 4d04b59dc39a463b6280e77c0a1a0418b24116f9
Author: Harshvardhan Pandit <[email protected]>
Date: Sun May 4 16:07:33 2025 +0100

Modular RDF and HTML generation
- This commit changes the `200.py` and `300.py` files so that the outputs can be selectively generated.
- For use, use the `--help` argument to see options.
- Both changes allow specific vocabularies to be generated such that only the outputs for the specified vocabulary/extension will be written to disk.
- The default option is to generate all outputs for all vocabularies.
- The README.md in code has been updated to mention this.

commit 85afc2cd6c95b192cb136638afb567b7979eaad7
Author: Harshvardhan Pandit <[email protected]>
Date: Sun May 4 16:04:19 2025 +0100

disables RDF and OWL serialisations in CG-DRAFT
- If document status is set as CG-DRAFT, only .ttl and .csv files are generated and the other formats are not generated (including no owl files) - a previous commit removed the matching files for these.
- If document status is set as CG-FINAL, all formats are generated, and as a result there will be new files created (which won't be automatically removed if status is changed back to CG-DRAFT).
- This change allows consistent outputs to be generated on repeated runs so that only the actual changes in files show up in git diffs, and other formats which are not stable (e.g. xml) are not generated.
- This change was discussed and approved in DPVCG meetings as part of the larger process to streamline and make it easy to generate and review changes made in commits.

commit 1bc2a811f76ca6b24ea5609d4fb660292a635a3b
Author: Harshvardhan Pandit <[email protected]>
Date: Sun May 4 15:54:11 2025 +0100

deterministic/consistent outputs of RDF and CSV
- RDF turtle files have serialisations removed due to changes in draft supported serialisations (they are added back in production/final); as a result there are fewer triples generated.
- CSV files contained parents in a random order, which showed up as changes; fixed by ensuring parents are sorted before writing output.

commit c45efc0a9e254cf47f15da388af41d8b97f41cd3
Author: Harshvardhan Pandit <[email protected]>
Date: Sun May 4 14:56:03 2025 +0100

deletes jsonld, rdf, n3 files, and all owl files
87ed62e provides some of the features discussed here for generating outputs for specific extensions. Examples:
# Generate only some RDF outputs
./200_serialise_RDF.py --vocab=tech
# Generate all RDF outputs
./200_serialise_RDF.py
# Generate only some HTML outputs
./300_generate_HTML.py --vocab=dpv,tech,ai
# Generate all HTML outputs
./300_generate_HTML.py
# Skip loading some extensions (to speed up)
./300_generate_HTML.py --skip=loc
# Skip loading all extensions which match pattern
./300_generate_HTML.py --skip=loc*,legal*
The commit also deletes serialisation formats which aren't stable/consistent.
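The `--vocab`/`--skip` filtering shown in the examples above can be sketched with `argparse` and `fnmatch` for the glob patterns. This is an assumed reconstruction, not the actual script code: the flag names mirror the examples, while `ALL_VOCABS` and `select_vocabs` are hypothetical.

```python
# Hypothetical sketch of --vocab/--skip filtering with glob pattern support.
import argparse
from fnmatch import fnmatch

ALL_VOCABS = ["dpv", "tech", "ai", "loc", "legal-eu", "legal-us"]  # illustrative

def select_vocabs(argv):
    """Return the vocabularies to generate, after applying --vocab and --skip."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--vocab", default="", help="comma-separated vocabs to generate")
    parser.add_argument("--skip", default="", help="comma-separated glob patterns to skip")
    args = parser.parse_args(argv)
    vocabs = args.vocab.split(",") if args.vocab else list(ALL_VOCABS)
    skips = args.skip.split(",") if args.skip else []
    return [v for v in vocabs if not any(fnmatch(v, p) for p in skips)]

print(select_vocabs(["--skip=loc*,legal*"]))  # ['dpv', 'tech', 'ai']
print(select_vocabs(["--vocab=dpv,tech"]))    # ['dpv', 'tech']
```

Defaulting to all vocabularies when `--vocab` is absent matches the behaviour described in the commit message: running the script with no arguments generates everything.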
Problem: Currently the documentation generator is made up of three parts: `100.py` for downloading CSVs, `200.py` for producing RDFs, and `300.py` for producing HTMLs. Except for CSVs, both other scripts produce outputs for all configured extensions, which causes git to show modifications to work that has not been changed (since RDF formats are not consistent in structure or blank nodes). This then causes issues with committing changes, as manual work is needed to discard the unwanted changes and add only the items that were intended.

Problem: The `vocab_management.py` file is a large collection of configurations that dictate where to find source files, metadata for RDF and HTML, and other items. It must be edited whenever any of these details change, e.g. when creating a new extension. The file is large, there are multiple places corresponding to each extension, and there is a high chance that something is missed.

Problem: For people who are not me, it is likely to be confusing and cumbersome to figure out how this code works. Documentation is available, but is at high risk of becoming out of date, and any changes require highly technical knowledge that should not be necessary simply to generate or update files.

Solution: Change the way the documentation works to be: