
Refactor documentation generator to be modular #262

Open · coolharsh55 opened this issue Mar 23, 2025 · 4 comments


Problem: Currently the documentation generator is made up of three parts: 100.py for downloading CSVs, 200.py for producing RDF, and 300.py for producing HTML. Except for the CSVs, both other scripts produce outputs for all configured extensions, which causes git to show modifications to files that have not actually changed (since RDF serialisations are not consistent in structure or blank nodes). This then causes issues with committing changes, as manual work is needed to discard the unwanted changes and stage only the items that were intended.

Problem: The vocab_management.py is a large file made up of various configurations that dictate where to find source files, metadata for RDF and HTML, and other items. It must be edited whenever any of these details change, e.g. when creating a new extension. The file is large, there are multiple places corresponding to each extension, and there is a high chance that something is missed.

Problem: For anyone other than me, it is likely to be confusing and cumbersome to figure out how this code works. Documentation is available, but is at high risk of being out of date, and making any change requires knowledge of a highly technical nature which should not be necessary simply to generate or update files.

Solution: Change the way the documentation generation works to provide:

  • A single executable script that takes parameters to do specific things like updating CSVs or producing RDF and HTML, calling other scripts internally (see the sketch after this list).
  • Modular outputs for each process, i.e. it should be possible to generate outputs for a specific extension without any other outputs also being generated.
  • Modular configuration for each extension, with all configurations for a given extension residing in a single place/file. E.g. for extension X, the CSVs to download, the RDF and HTML paths, and the vocabulary metadata should all be in one file.
  • Updated documentation (in the wiki) with simpler instructions for how to regenerate documentation, how to fix a typo, and how to submit a PR using the above - which should result in a simpler and replicable process.
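
A minimal sketch of what the single entry point could look like (the script name docgen.py and its subcommand/flag set are hypothetical, not an agreed interface):

# docgen.py - hypothetical single entry point for the documentation generator
import argparse

def main():
    parser = argparse.ArgumentParser(description="DPV documentation generator")
    parser.add_argument("steps", nargs="+", choices=["csv", "rdf", "html"],
                        help="which generation steps to run")
    parser.add_argument("--vocab", default="all",
                        help="comma-separated extensions to generate, e.g. dpv,ai")
    args = parser.parse_args()
    for step in args.steps:
        # each step would internally call the existing 100/200/300 logic
        print(f"running {step} for {args.vocab}")

if __name__ == "__main__":
    main()

Invocation would then look like ./docgen.py csv rdf --vocab=ai.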
bact (Collaborator) commented Mar 24, 2025

One small step that could simplify the configuration part (point 3 above) is to adopt a common naming convention to aid the configuration.

For example, take this configuration (a Python dictionary) in vocab_management.py:

{
    "dex": {
        "examples": {
            "examples": f"{IMPORT_CSV_PATH}/Example.csv",
        },
    },
    "dpv": {
        "TOM": {
            "taxonomy": f"{IMPORT_CSV_PATH}/TOM.csv",
            "properties": f"{IMPORT_CSV_PATH}/TOM_properties.csv",
        },
        "technical_measures": {
            "taxonomy": f"{IMPORT_CSV_PATH}/TechnicalMeasure.csv",
        },
        "entities_authority": {
            "taxonomy": f"{IMPORT_CSV_PATH}/Entities_Authority.csv",
            "properties": f"{IMPORT_CSV_PATH}/Entities_Authority_properties.csv",
        },
    },
    "ai": {
        "core": {
            "taxonomy": f"{IMPORT_CSV_PATH}/ai-core.csv",
            "properties": f"{IMPORT_CSV_PATH}/ai-properties.csv",
        }
    },
    "sector-infra": {
        "purposes": {
            "taxonomy": f"{IMPORT_CSV_PATH}/Purpose_Infrastructure.csv",
        },
    },
    "eu-dga": {
        "legal_basis": {
            "taxonomy": f"{IMPORT_CSV_PATH}/DGA_LegalBasis.csv",
        },
    },
}

If we can agree on the casing and the separator (dash - vs underscore _, and PascalCase vs snake_case), then this TOML config:

[_suffix]
"classes" = "_classes"
"examples" = "_examples"
"properties" = "_properties"
"purposes" = "_purposes"
"taxonomy" = ""

[dex]
examples = ["examples"]

[dpv]
TOM = ["taxonomy", "properties"]
technical_measures = ["taxonomy"]
entity_authority = ["properties"]

[ai]
core = ["taxonomy", "properties"]

[sector-infra]
purposes = ["taxonomy"]

[eu-dga]
legal-basis = ["taxonomy"]

could generate the equivalent Python dictionary below, based on the filename pattern {cat}_{subcat}{suffix}.csv, where the suffix for each third-level key is looked up in the [_suffix] table:

{
    "dex": {
        "examples": {
            "examples": f"{IMPORT_CSV_PATH}/dex_examples_examples.csv",
        },
    },
    "dpv": {
        "TOM": {
            "taxonomy": f"{IMPORT_CSV_PATH}/dpv_TOM.csv",
            "properties": f"{IMPORT_CSV_PATH}/dpv_TOM_properties.csv",
        },
        "technical_measures": {
            "taxonomy": f"{IMPORT_CSV_PATH}/dpv_technical_measures.csv",
        },
        "entities_authority": {
            "taxonomy": f"{IMPORT_CSV_PATH}/dpv_entities_authority.csv",
            "properties": f"{IMPORT_CSV_PATH}/dpv_entities_authority_properties.csv",
        },
    },
    "ai": {
        "core": {
            "taxonomy": f"{IMPORT_CSV_PATH}/ai_core.csv",
            "properties": f"{IMPORT_CSV_PATH}/ai_core_properties.csv",
        }
    },
    "sector-infra": {
        "purposes": {
            "taxonomy": f"{IMPORT_CSV_PATH}/sector-infra_purposes.csv",
        },
    },
    "eu-dga": {
        "legal_basis": {
            "taxonomy": f"{IMPORT_CSV_PATH}/eu-dga_legal_basis.csv",
        },
    },
}
  • Note the differences in some of the resulting filenames.
    • This, of course, requires a mass renaming of spreadsheet filenames and all of their tabs, plus related code. (We can do the renaming in advance.)
  • The same naming convention can be used in other scripts.
  • Since this tends towards a "convention over configuration" approach, it introduces a layer of indirection, so the relevant code should be well commented.
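
As an illustration, a loader for this convention could be quite small (a sketch assuming the TOML above; tomllib requires Python 3.11+, and the function name load_csv_paths is hypothetical):

import tomllib  # Python 3.11+; the external `toml` package works on older versions

def load_csv_paths(toml_path, import_csv_path):
    """Expand the naming-convention TOML into the full dictionary of CSV paths."""
    with open(toml_path, "rb") as f:
        config = tomllib.load(f)
    suffixes = config.pop("_suffix")  # maps third-level keys to filename suffixes
    return {
        cat: {
            subcat: {
                kind: f"{import_csv_path}/{cat}_{subcat}{suffixes[kind]}.csv"
                for kind in kinds
            }
            for subcat, kinds in subcats.items()
        }
        for cat, subcats in config.items()
    }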

coolharsh55 (Collaborator, Author) commented:

@bact agreed. I think the apparent inconsistency is because the underscores come from our initial filenames and the dashes come from HTML ids, and I think there is some magic code that relies on them. The mass renaming of spreadsheet tabs is really cumbersome! BUT ... we can trivially change the underscores and naming conventions to dashes in filenames (in 100.py). I wouldn't put this as the top priority at the moment, as this change might also break workflows and things can go missing, etc. -- especially when we want to regenerate past release data (which we need for generating changelogs etc.). So we can do this gradually, e.g. whenever we are renaming things or changing our spreadsheet sources.

coolharsh55 added a commit that referenced this issue Apr 7, 2025
- HTML docs for .owl are not generated (option turned off in 300.py) as
  this creates blank nodes that have random IDs and undoes the benefits
  of recent PRs that give stability to RDFS+SKOS
- to be fixed along with #262
- OWL HTML can be generated at the very end of the development process
  as part of the release prep workflow
coolharsh55 (Collaborator, Author) commented Apr 26, 2025

https://harshp.com/dev/dpv/docgen-modular-01 describes some of my thoughts on how we can implement this and also remove a lot of the redundant boilerplate around maintaining metadata. E.g. the code below provides all the information needed to generate CSVs, RDFs, and HTMLs (common variables like folder paths and version are in other files):

# config/vocab/dpv.py
modules = ['m1', 'm2', 'm3']
# in CSV script, this is used to generate dpv_m1.csv, dpv_m2.csv, dpv_m3.csv
# in RDF script, this is used to generate dpv/modules/m1 ...
# in HTML script, this is used to read dpv/modules/m1 and generate m1 section in HTML

# however, there is more metadata for modules, so we use a dict
modules = {
    'm1': {
        'title': 'M1',
        'parser': 'taxonomy',  # method to parse the CSV to generate RDF
        'source': {
            'gsheet_id': '...',  # ID of the Google Sheet
            # not all vocab modules have both classes and properties
            'classes': 'tab name',  # will generate CSV dpv_m1_classes.csv
            'properties': 'tab name',  # will generate CSV dpv_m1_properties.csv
        },
        "html_template": "path...", # optional - will generate HTML output for module
    }
}

folder = "/dpv" 
name = "dpv"
# RDF path is <DPV_VERSION>/<folder>/<name>.ttl
html_template = f"{TEMPLATE_PATH}/template_dpv.jinja2"

metadata = {
    "dct:title": "Data Privacy Vocabulary (DPV)",
    "dct:description": "The Data Privacy Vocabulary (DPV) provides ...",
    "dct:created": "2022-08-18",
    "dct:modified": DPV_PUBLISH_DATE,
    "dct:creator": "Harshvardhan J. Pandit, Beatriz Esteves, Georg P. Krog, Paul Ryan, Delaram Golpayegani, Julian Flake",
    "schema:version": DPV_VERSION,
    "profile:isProfileOf": "",
    "bibo:status": "published",
}
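
To make the intent concrete, a generator script could consume such per-vocabulary config modules roughly as follows (a sketch; load_vocab_config and csv_filenames are hypothetical helpers, not existing code):

import importlib

def load_vocab_config(name):
    """Import config/vocab/<name>.py as a module, e.g. load_vocab_config('dpv')."""
    return importlib.import_module(f"config.vocab.{name}")

def csv_filenames(vocab):
    """Derive the CSV filenames the download step would produce for a vocabulary."""
    files = []
    for module, meta in vocab.modules.items():
        for kind in ("classes", "properties"):
            if kind in meta["source"]:
                # e.g. dpv_m1_classes.csv, per the comments in the config above
                files.append(f"{vocab.name}_{module}_{kind}.csv")
    return files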

coolharsh55 added a commit that referenced this issue May 4, 2025
- see #262 for discussion
- NOTE: This commit has DELETED files

Squashed commit of the following:

commit 4d04b59dc39a463b6280e77c0a1a0418b24116f9
Author: Harshvardhan Pandit <[email protected]>
Date:   Sun May 4 16:07:33 2025 +0100

    Modular RDF and HTML generation

    - This commit changes the `200.py` and `300.py` files so that the
      outputs can be selectively generated.
    - Use the `--help` argument to see the available options.
    - Both changes allow specific vocabularies to be generated such that
      only the outputs for the specified vocabulary/extension will be
      written to disk.
    - The default option is to generate all outputs for all vocabularies.
    - The README.md in code has been updated to mention this.

commit 85afc2cd6c95b192cb136638afb567b7979eaad7
Author: Harshvardhan Pandit <[email protected]>
Date:   Sun May 4 16:04:19 2025 +0100

    disables RDF and OWL serialisations in CG-DRAFT

    - If document status is set as CG-DRAFT, only .ttl and .csv files are
      generated; the other formats (including OWL files) are not - a
      previous commit removed the existing files for these.
    - If document status is set as CG-FINAL, all formats are generated, and
      as a result, there will be new files created (which won't be
      automatically removed if status is changed back to CG-DRAFT).
    - This change allows consistent outputs to be generated on repeated runs
      so that only the actual changes in files show up in git diffs, and
      other formats which are not stable (e.g. xml) are not generated.
    - This change was discussed and approved in DPVCG meetings as part of
      the larger process to streamline and make it easy to generate and
      review changes made in commits.

commit 1bc2a811f76ca6b24ea5609d4fb660292a635a3b
Author: Harshvardhan Pandit <[email protected]>
Date:   Sun May 4 15:54:11 2025 +0100

    deterministic/consistent outputs of RDF and CSV

    - RDF turtle files have some serialisations removed due to changes in
      the serialisations supported for drafts (they are added back in
      production/final); as a result, fewer triples are generated
    - CSV files contained parents in a random order, which showed up as
      changes; fixed by ensuring parents are sorted before writing output

commit c45efc0a9e254cf47f15da388af41d8b97f41cd3
Author: Harshvardhan Pandit <[email protected]>
Date:   Sun May 4 14:56:03 2025 +0100

    deletes jsonld,rdf,n3 files, and all owl files
coolharsh55 (Collaborator, Author) commented May 4, 2025

87ed62e provides some of the features discussed here for generating outputs for specific extensions. Examples:

# Generate only some RDF outputs
./200_serialise_RDF.py --vocab=tech
# Generate all RDF outputs
./200_serialise_RDF.py

# Generate only some HTML outputs
./300_generate_HTML.py --vocab=dpv,tech,ai
# Generate all HTML outputs
./300_generate_HTML.py
# Skip loading some extensions (to speed up)
./300_generate_HTML.py --skip=loc
# Skip loading all extensions which match pattern
./300_generate_HTML.py --skip=loc*,legal*

The commit also deletes serialisation formats which aren't stable/consistent (.n3, .rdf, .jsonld) and retains those that are (.ttl, .csv). This means we can now produce outputs for a specific extension, and git status/diff will consistently show only what has actually changed.
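
For reference, this kind of --vocab/--skip filtering can be implemented with argparse plus fnmatch along these lines (a sketch using the flag names from the examples above; everything else is hypothetical):

import argparse
import fnmatch

def select_vocabs(all_vocabs):
    """Return the subset of configured vocabularies selected on the command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--vocab", default="",
                        help="comma-separated vocabularies to generate, e.g. dpv,tech,ai")
    parser.add_argument("--skip", default="",
                        help="comma-separated skip patterns, e.g. loc*,legal*")
    args = parser.parse_args()
    selected = args.vocab.split(",") if args.vocab else list(all_vocabs)
    for pattern in filter(None, args.skip.split(",")):
        selected = [v for v in selected if not fnmatch.fnmatch(v, pattern)]
    return selected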
