
Refactor documentation generator to be modular #262

Open · coolharsh55 opened this issue Mar 23, 2025 · 4 comments


Problem: Currently the documentation generator is made up of three parts: 100.py for downloading CSVs, 200.py for producing RDF, and 300.py for producing HTML. Except for the CSVs, both other scripts produce outputs for all configured extensions, which causes git to show modifications to files that have not actually changed (since RDF serialisations are not consistent in structure or blank nodes). This then causes issues with committing changes, as manual work is needed to discard the unwanted changes and stage only the items that were intended.

Problem: The vocab_management.py is a large file made up of various configurations that dictate where to find source files, metadata for RDF and HTML, and other items. It must be edited whenever any of these details change, e.g. when creating a new extension. The file is large, there are multiple places corresponding to each extension, and there is a high chance that something is missed.

Problem: For anyone other than me, it is likely to be confusing and cumbersome to figure out how this code works. Documentation is available, but is at high risk of being out of date, and making any change requires knowledge of a highly technical nature which should not be necessary simply to generate or update files.

Solution: Change the way the documentation generation works to provide:

  • A single executable script that takes parameters to do specific things like updating CSVs or producing RDF and HTML, calling other scripts internally (see the sketch after this list).
  • Modular outputs for each process, i.e. it should be possible to generate outputs for a specific extension without any other outputs also being generated.
  • Modular configuration for each extension, with all configurations for a given extension residing in a single place/file. E.g. for extension X, the CSVs to download, the RDF and HTML paths, and the vocabulary metadata should all be in one file.
  • Updated documentation (in the wiki) with simpler instructions for how to regenerate documentation, how to fix a typo, and how to submit a PR using the above - which should result in a simpler and replicable process.
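
A minimal sketch of what the single entry point could look like (the script name docgen.py and its subcommand/flag set are hypothetical, not an agreed interface):

# docgen.py - hypothetical single entry point for the documentation generator
import argparse

def main():
    parser = argparse.ArgumentParser(description="DPV documentation generator")
    parser.add_argument("steps", nargs="+", choices=["csv", "rdf", "html"],
                        help="which generation steps to run")
    parser.add_argument("--vocab", default="all",
                        help="comma-separated extensions to generate, e.g. dpv,ai")
    args = parser.parse_args()
    for step in args.steps:
        # each step would internally call the existing 100/200/300 logic
        print(f"running {step} for {args.vocab}")

if __name__ == "__main__":
    main()

Invocation would then look like ./docgen.py csv rdf --vocab=ai.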
bact (Collaborator) commented Mar 24, 2025

One small step that could simplify the configuration part (point 3 above) is to adopt a common naming convention to aid the configuration.

For example, take this configuration (a Python dictionary) in vocab_management.py:

{
    "dex": {
        "examples": {
            "examples": f"{IMPORT_CSV_PATH}/Example.csv",
        },
    },
    "dpv": {
        "TOM": {
            "taxonomy": f"{IMPORT_CSV_PATH}/TOM.csv",
            "properties": f"{IMPORT_CSV_PATH}/TOM_properties.csv",
        },
        "technical_measures": {
            "taxonomy": f"{IMPORT_CSV_PATH}/TechnicalMeasure.csv",
        },
        "entities_authority": {
            "taxonomy": f"{IMPORT_CSV_PATH}/Entities_Authority.csv",
            "properties": f"{IMPORT_CSV_PATH}/Entities_Authority_properties.csv",
        },
    },
    "ai": {
        "core": {
            "taxonomy": f"{IMPORT_CSV_PATH}/ai-core.csv",
            "properties": f"{IMPORT_CSV_PATH}/ai-properties.csv",
        }
    },
    "sector-infra": {
        "purposes": {
            "taxonomy": f"{IMPORT_CSV_PATH}/Purpose_Infrastructure.csv",
        },
    },
    "eu-dga": {
        "legal_basis": {
            "taxonomy": f"{IMPORT_CSV_PATH}/DGA_LegalBasis.csv",
        },
    },
}

If we can agree on the casing and the separator (dash - vs underscore _, and PascalCase vs snake_case), then this TOML config:

[_suffix]
"classes" = "_classes"
"examples" = "_examples"
"properties" = "_properties"
"purposes" = "_purposes"
"taxonomy" = ""

[dex]
examples = ["examples"]

[dpv]
TOM = ["taxonomy", "properties"]
technical_measures = ["taxonomy"]
entity_authority = ["properties"]

[ai]
core = ["taxonomy", "properties"]

[sector-infra]
purposes = ["taxonomy"]

[eu-dga]
legal-basis = ["taxonomy"]

could generate the equivalent Python dictionary below, based on the filename pattern {cat}_{subcat}{suffix}.csv, where the suffix for each third-level key is looked up in the [_suffix] table:

{
    "dex": {
        "examples": {
            "examples": f"{IMPORT_CSV_PATH}/dex_examples_examples.csv",
        },
    },
    "dpv": {
        "TOM": {
            "taxonomy": f"{IMPORT_CSV_PATH}/dpv_TOM.csv",
            "properties": f"{IMPORT_CSV_PATH}/dpv_TOM_properties.csv",
        },
        "technical_measures": {
            "taxonomy": f"{IMPORT_CSV_PATH}/dpv_technical_measures.csv",
        },
        "entities_authority": {
            "taxonomy": f"{IMPORT_CSV_PATH}/dpv_entities_authority.csv",
            "properties": f"{IMPORT_CSV_PATH}/dpv_entities_authority_properties.csv",
        },
    },
    "ai": {
        "core": {
            "taxonomy": f"{IMPORT_CSV_PATH}/ai_core.csv",
            "properties": f"{IMPORT_CSV_PATH}/ai_core_properties.csv",
        }
    },
    "sector-infra": {
        "purposes": {
            "taxonomy": f"{IMPORT_CSV_PATH}/sector-infra_purposes.csv",
        },
    },
    "eu-dga": {
        "legal_basis": {
            "taxonomy": f"{IMPORT_CSV_PATH}/eu-dga_legal_basis.csv",
        },
    },
}
  • Note the differences in some of the resulting filenames.
    • This, of course, requires a mass renaming of spreadsheet filenames and all of their tabs, plus related code. (We can do the renaming in advance.)
  • The same naming convention can be used in other scripts.
  • Since this tends towards a "convention over configuration" approach, it introduces a layer of indirection, so the relevant code should be well commented.
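
As an illustration, a loader for this convention could be quite small (a sketch assuming the TOML above; tomllib requires Python 3.11+, and the function name load_csv_paths is hypothetical):

import tomllib  # Python 3.11+; the external `toml` package works on older versions

def load_csv_paths(toml_path, import_csv_path):
    """Expand the naming-convention TOML into the full dictionary of CSV paths."""
    with open(toml_path, "rb") as f:
        config = tomllib.load(f)
    suffixes = config.pop("_suffix")  # maps third-level keys to filename suffixes
    return {
        cat: {
            subcat: {
                kind: f"{import_csv_path}/{cat}_{subcat}{suffixes[kind]}.csv"
                for kind in kinds
            }
            for subcat, kinds in subcats.items()
        }
        for cat, subcats in config.items()
    }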

coolharsh55 (Collaborator, Author) commented:

@bact agreed. I think the apparent inconsistency is because the underscores come from our initial filenames and the dashes come from HTML ids, and I think there is some magic code that relies on them. The mass renaming of spreadsheet tabs is really cumbersome! BUT ... we can trivially change the underscores and naming conventions to dashes in filenames (in 100.py). I wouldn't put this as the top priority at the moment, as this change might also break workflows and things can go missing, etc. -- especially when we want to regenerate past release data (which we need for generating changelogs etc.). So we can do this gradually, e.g. whenever we are renaming things or changing our spreadsheet sources.

coolharsh55 added a commit that referenced this issue Apr 7, 2025
- HTML docs for .owl are not generated (option turned off in 300.py) as
  this creates blank nodes that have random IDs and undoes the benefits
  of recent PRs that give stability to RDFS+SKOS
- to be fixed along with #262
- OWL HTML can be generated at the very end of the development process
  as part of the release prep workflow
coolharsh55 (Collaborator, Author) commented Apr 26, 2025

https://harshp.com/dev/dpv/docgen-modular-01 describes some of my thoughts on how we can implement this and also remove a lot of the redundant boilerplate around maintaining metadata. E.g. the code below provides all the information needed to generate CSVs, RDFs, and HTMLs (common variables like folder paths and version are in other files):

# config/vocab/dpv.py
modules = ['m1', 'm2', 'm3']
# in CSV script, this is used to generate dpv_m1.csv, dpv_m2.csv, dpv_m3.csv
# in RDF script, this is used to generate dpv/modules/m1 ...
# in HTML script, this is used to read dpv/modules/m1 and generate m1 section in HTML

# however, there is more metadata for modules, so we use a dict
modules = {
    'm1': {
        'title': 'M1',
        'parser': 'taxonomy',  # method to parse the CSV to generate RDF
        'source': {
            'gsheet_id': '...',  # ID of the Google Sheet
            # not all vocab modules have both classes and properties
            'classes': 'tab name',  # will generate CSV dpv_m1_classes.csv
            'properties': 'tab name',  # will generate CSV dpv_m1_properties.csv
        },
        "html_template": "path...", # optional - will generate HTML output for module
    }
}

folder = "/dpv" 
name = "dpv"
# RDF path is <DPV_VERSION>/<folder>/<name>.ttl
html_template = f"{TEMPLATE_PATH}/template_dpv.jinja2"

metadata = {
    "dct:title": "Data Privacy Vocabulary (DPV)",
    "dct:description": "The Data Privacy Vocabulary (DPV) provides ...",
    "dct:created": "2022-08-18",
    "dct:modified": DPV_PUBLISH_DATE,
    "dct:creator": "Harshvardhan J. Pandit, Beatriz Esteves, Georg P. Krog, Paul Ryan, Delaram Golpayegani, Julian Flake",
    "schema:version": DPV_VERSION,
    "profile:isProfileOf": "",
    "bibo:status": "published",
}
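
To make the intent concrete, a generator script could consume such per-vocabulary config modules roughly as follows (a sketch; load_vocab_config and csv_filenames are hypothetical helpers, not existing code):

import importlib

def load_vocab_config(name):
    """Import config/vocab/<name>.py as a module, e.g. load_vocab_config('dpv')."""
    return importlib.import_module(f"config.vocab.{name}")

def csv_filenames(vocab):
    """Derive the CSV filenames the download step would produce for a vocabulary."""
    files = []
    for module, meta in vocab.modules.items():
        for kind in ("classes", "properties"):
            if kind in meta["source"]:
                # e.g. dpv_m1_classes.csv, per the comments in the config above
                files.append(f"{vocab.name}_{module}_{kind}.csv")
    return files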

coolharsh55 added a commit that referenced this issue May 4, 2025
- see #262 for discussion
- NOTE: This commit has DELETED files

Squashed commit of the following:

commit 4d04b59dc39a463b6280e77c0a1a0418b24116f9
Author: Harshvardhan Pandit <[email protected]>
Date:   Sun May 4 16:07:33 2025 +0100

    Modular RDF and HTML generation

    - This commit changes the `200.py` and `300.py` files so that the
      outputs can be selectively generated.
    - Use the `--help` argument to see the available options.
    - Both changes allow specific vocabularies to be generated such that
      only the outputs for the specified vocabulary/extension will be
      written to disk.
    - The default option is to generate all outputs for all vocabularies.
    - The README.md in code has been updated to mention this.

commit 85afc2cd6c95b192cb136638afb567b7979eaad7
Author: Harshvardhan Pandit <[email protected]>
Date:   Sun May 4 16:04:19 2025 +0100

    disables RDF and OWL serialisations in CG-DRAFT

    - If document status is set as CG-DRAFT, only .ttl and .csv files are
      generated; the other formats (including OWL files) are not - a
      previous commit removed the existing files for these.
    - If document status is set as CG-FINAL, all formats are generated, and
      as a result, there will be new files created (which won't be
      automatically removed if status is changed back to CG-DRAFT).
    - This change allows consistent outputs to be generated on repeated runs
      so that only the actual changes in files show up in git diffs, and
      other formats which are not stable (e.g. xml) are not generated.
    - This change was discussed and approved in DPVCG meetings as part of
      the larger process to streamline and make it easy to generate and
      review changes made in commits.

commit 1bc2a811f76ca6b24ea5609d4fb660292a635a3b
Author: Harshvardhan Pandit <[email protected]>
Date:   Sun May 4 15:54:11 2025 +0100

    deterministic/consistent outputs of RDF and CSV

    - RDF turtle files have some serialisations removed due to changes in
      the serialisations supported for drafts (they are added back in
      production/final); as a result, fewer triples are generated
    - CSV files contained parents in a random order, which showed up as
      changes; fixed by ensuring parents are sorted before writing output

commit c45efc0a9e254cf47f15da388af41d8b97f41cd3
Author: Harshvardhan Pandit <[email protected]>
Date:   Sun May 4 14:56:03 2025 +0100

    deletes jsonld,rdf,n3 files, and all owl files
coolharsh55 (Collaborator, Author) commented May 4, 2025

87ed62e provides some of the features discussed here for generating outputs for specific extensions. Examples:

# Generate only some RDF outputs
./200_serialise_RDF.py --vocab=tech
# Generate all RDF outputs
./200_serialise_RDF.py

# Generate only some HTML outputs
./300_generate_HTML.py --vocab=dpv,tech,ai
# Generate all HTML outputs
./300_generate_HTML.py
# Skip loading some extensions (to speed up)
./300_generate_HTML.py --skip=loc
# Skip loading all extensions which match pattern
./300_generate_HTML.py --skip=loc*,legal*

The commit also deletes serialisation formats which aren't stable/consistent (.n3, .rdf, .jsonld) and retains those that are (.ttl, .csv). This means we can now produce outputs for a specific extension, and git status/diff will consistently show only what has actually changed.
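
For reference, this kind of --vocab/--skip filtering can be implemented with argparse plus fnmatch along these lines (a sketch using the flag names from the examples above; everything else is hypothetical):

import argparse
import fnmatch

def select_vocabs(all_vocabs):
    """Return the subset of configured vocabularies selected on the command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--vocab", default="",
                        help="comma-separated vocabularies to generate, e.g. dpv,tech,ai")
    parser.add_argument("--skip", default="",
                        help="comma-separated skip patterns, e.g. loc*,legal*")
    args = parser.parse_args()
    selected = args.vocab.split(",") if args.vocab else list(all_vocabs)
    for pattern in filter(None, args.skip.split(",")):
        selected = [v for v in selected if not fnmatch.fnmatch(v, pattern)]
    return selected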
