Expose dataset generation in CLI (#983)
Ndpnt authored Jan 16, 2023
2 parents d5df561 + a3e147f commit 969963b
Showing 11 changed files with 87 additions and 75 deletions.
5 changes: 4 additions & 1 deletion CHANGELOG.md
@@ -5,7 +5,10 @@ All notable changes to this project will be documented in this file.
This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) and the format is based on [Common Changelog](https://common-changelog.org).\
Unlike Common Changelog, the `unreleased` section of [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) is preserved with the addition of a tag to specify which type of release should be published and to foster discussions about it inside pull requests. This tag should be one of the names mandated by SemVer, within brackets: `[patch]`, `[minor]` or `[major]`. For example: `## Unreleased [minor]`.

## Unreleased
## Unreleased [minor]

### Added
- Add dataset command to CLI; this command can be discovered in the documentation and by running `ota dataset help`

## 0.20.0 - 2022-12-13
_Full changeset and discussions: [#959](https://github.com/ambanum/OpenTermsArchive/pull/959)._
9 changes: 9 additions & 0 deletions CONTRIBUTING.md
@@ -72,6 +72,15 @@ Avoid “you” and “we” (who is “we” in a common anyway?). Use neutral

#### CLI

##### Options case

Use kebab case for multi-word options:

```diff
- ota validate --schemaOnly
+ ota validate --schema-only
```
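Hyphen-separated flags such as `--schema-only` are commonly exposed to JavaScript code under a camelCase key (`schemaOnly`), which is how `--remove-local-copy` ends up as `removeLocalCopy` in `bin/ota-dataset.js`. The following is a minimal sketch of that mapping. `flagToOptionKey` is a hypothetical helper written for illustration only; it is not part of the engine or of the CLI parsing library.

```javascript
// Hypothetical helper (not part of the engine): converts a hyphen-separated
// long flag into the camelCase key under which its parsed value is exposed.
function flagToOptionKey(flag) {
  return flag
    .replace(/^--/, '') // drop the leading dashes
    .replace(/-([a-z0-9])/g, (_, char) => char.toUpperCase()); // "-o" becomes "O"
}

console.log(flagToOptionKey('--schema-only')); // schemaOnly
console.log(flagToOptionKey('--remove-local-copy')); // removeLocalCopy
console.log(flagToOptionKey('--terms-types')); // termsTypes
```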

##### docopt

For command-line examples and documentation, we follow the [docopt usage patterns](http://docopt.org) syntax. Quick recap of the main points:
42 changes: 35 additions & 7 deletions README.md
@@ -184,7 +184,7 @@ npx ota track --services "<service_id>" ["<service_id>"...]
##### Track specific terms of specific services

```sh
npx ota track --services "<service_id>" ["<service_id>"...] --termsTypes "<terms_type>" ["<terms_type>"...]
npx ota track --services "<service_id>" ["<service_id>"...] --terms-types "<terms_type>" ["<terms_type>"...]
```

##### Track documents four times a day
@@ -196,7 +196,7 @@ npx ota track --schedule
#### `ota validate`

```sh
npx ota validate [--services <service_id>...] [--termsTypes <terms_type>...]
npx ota validate [--services <service_id>...] [--terms-types <terms_type>...]
```

Check that all declarations allow recording a snapshot and a version properly.
@@ -206,7 +206,7 @@ If one or several `<service_id>` are provided, check only those services.
##### Validate schema only

```sh
npx ota validate --schema-only [--services <service_id>...] [--termsTypes <terms_type>...]
npx ota validate --schema-only [--services <service_id>...] [--terms-types <terms_type>...]
```

Check that all declarations are readable by the engine.
@@ -227,6 +227,38 @@ Automatically correct formatting mistakes and ensure that all declarations are s

If one or several `<service_id>` are provided, check only those services.

#### `ota dataset`

Export the versions dataset into a ZIP file and optionally publish it to GitHub releases.

The dataset title and the URL of the versions repository are defined in the [configuration](#configuring).

To export the dataset into a local ZIP file:

```sh
npx ota dataset [--file <filename>]
```

To export the dataset into a ZIP file and publish it on GitHub releases:

```sh
GITHUB_TOKEN=ghp_XXXXXXXXX npx ota dataset --publish
```

The `GITHUB_TOKEN` can also be defined in a [`.env` file](#environment-variables).

To export the dataset, publish it, and remove the local copy once it has been uploaded:

```sh
GITHUB_TOKEN=ghp_XXXXXXXXX npx ota dataset --publish --remove-local-copy
```

To schedule export, publishing and local copy removal:

```sh
GITHUB_TOKEN=ghp_XXXXXXXXX npx ota dataset --schedule --publish --remove-local-copy
```
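The option combinations above follow two constraints stated in the command's option descriptions: `--remove-local-copy` works only together with `--publish`, and publishing requires a `GITHUB_TOKEN`. A minimal sketch of checking those constraints up front is shown below. `checkDatasetOptions` and its `githubToken` parameter are hypothetical names introduced for illustration; the engine itself may handle these cases differently.

```javascript
// Hypothetical validation helper (not part of the engine): returns a list of
// human-readable problems for a given combination of `ota dataset` options.
function checkDatasetOptions({ publish = false, removeLocalCopy = false, githubToken } = {}) {
  const problems = [];

  // The local copy is only removed after a successful upload.
  if (removeLocalCopy && !publish) {
    problems.push('--remove-local-copy works only in combination with --publish');
  }

  // Publishing to GitHub releases needs authentication.
  if (publish && !githubToken) {
    problems.push('--publish requires the GITHUB_TOKEN environment variable');
  }

  return problems;
}

const valid = checkDatasetOptions({ publish: true, removeLocalCopy: true, githubToken: 'ghp_XXXXXXXXX' });
const invalid = checkDatasetOptions({ removeLocalCopy: true });

console.log(valid.length); // 0
console.log(invalid.length); // 1
```

In a real script, `githubToken` would presumably be read from `process.env.GITHUB_TOKEN`.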

### API

Once added as a dependency, the engine exposes a JavaScript API that can be called in your own code. The following modules are available.
@@ -277,10 +309,6 @@ import pageDeclaration from '@opentermsarchive/engine/page-declaration';

The `PageDeclaration` format is defined [in source code](./src/archivist/services/pageDeclaration.js).

### Dataset generation

See the [`dataset` script documentation](./scripts/dataset/README.md).

## Configuring

### Configuration file
33 changes: 33 additions & 0 deletions bin/ota-dataset.js
@@ -0,0 +1,33 @@
#! /usr/bin/env node
import './env.js';

import { program } from 'commander';
import cron from 'croner';

import { release } from '../scripts/dataset/index.js';
import logger from '../src/logger/index.js';

program
  .name('ota dataset')
  .description('Export the versions dataset into a ZIP file and optionally publish it to GitHub releases')
  .option('-f, --file <filename>', 'file name of the generated dataset')
  .option('-p, --publish', 'publish dataset to GitHub releases on versions repository. Mandatory authentication to GitHub is provided through the `GITHUB_TOKEN` environment variable')
  .option('-r, --remove-local-copy', 'remove local copy of dataset after publishing. Works only in combination with --publish option')
  .option('--schedule', 'schedule automatic dataset generation');

const { schedule, publish, removeLocalCopy, file: fileName } = program.parse().opts();

const options = {
  fileName,
  shouldPublish: publish,
  shouldRemoveLocalCopy: removeLocalCopy,
};

if (!schedule) {
  await release(options);
} else {
  logger.info('The scheduler is running…');
  logger.info('Dataset will be published every Monday at 08:30 in the timezone of this machine');

  cron('30 8 * * MON', () => release(options));
}
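The pattern `30 8 * * MON` reads as minute 30, hour 8, any day of the month, any month, on Mondays. As a rough illustration of what that schedule means in local time, the sketch below computes the next such occurrence with plain `Date` arithmetic. `nextMondayAt0830` is a hypothetical helper, not part of the engine, and it does not reproduce the scheduler library's internals.

```javascript
// Hypothetical helper (not part of the engine): returns the next local
// "Monday 08:30" strictly after `from`, i.e. the next run of '30 8 * * MON'.
function nextMondayAt0830(from = new Date()) {
  const next = new Date(from.getTime());

  next.setHours(8, 30, 0, 0);

  // Days until Monday (getDay(): 0 = Sunday, 1 = Monday, ..., 6 = Saturday).
  let daysAhead = (1 - next.getDay() + 7) % 7;

  // On a Monday where 08:30 has already passed, wait a full week.
  if (daysAhead === 0 && next <= from) {
    daysAhead = 7;
  }

  next.setDate(next.getDate() + daysAhead);

  return next;
}

// Monday 16 January 2023, 09:00 local time: the 08:30 slot has passed.
const run = nextMondayAt0830(new Date(2023, 0, 16, 9, 0));

console.log(run.toDateString()); // Mon Jan 23 2023
```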
2 changes: 1 addition & 1 deletion bin/ota-track.js
@@ -14,7 +14,7 @@ program
.name('ota track')
.description('Retrieve declared documents, record snapshots, extract versions and publish the resulting records')
.option('-s, --services [serviceId...]', 'service IDs of services to track')
.option('-t, --termsType [termsType...]', 'terms types to track')
.option('-t, --terms-types [termsType...]', 'terms types to track')
.option('-r, --refilter-only', 'refilter existing snapshots with latest declarations and engine, without recording new snapshots')
.option('--schedule', 'schedule automatic document tracking');

2 changes: 1 addition & 1 deletion bin/ota-validate.js
@@ -21,7 +21,7 @@ program
.name('ota validate')
.description('Run a series of tests to check the validity of document declarations')
.option('-s, --services [serviceId...]', 'service IDs of services to validate')
.option('-t, --termsTypes [termsType...]', 'terms types to validate')
.option('-t, --terms-types [termsType...]', 'terms types to validate')
.option('-m, --modified', 'target only services modified in the current git branch')
.option('-o, --schema-only', 'much faster check of declarations, but does not check that the documents are actually accessible');

1 change: 1 addition & 0 deletions bin/ota.js
@@ -13,4 +13,5 @@ program
.command('track', 'Track the current terms of services according to provided declarations')
.command('validate', 'Run a series of tests to check the validity of document declarations')
.command('lint', 'Check format and stylistic errors in declarations and auto fix them')
.command('dataset', 'Export the versions dataset into a ZIP file and optionally publish it to GitHub releases')
.parse(process.argv);
4 changes: 2 additions & 2 deletions package.json
@@ -30,8 +30,8 @@
".eslintrc.yaml"
],
"scripts": {
"dataset:generate": "node scripts/dataset/main.js",
"dataset:release": "node scripts/dataset/main.js --publish --remove-local-copy",
"dataset:generate": "node bin/ota.js dataset",
"dataset:release": "node bin/ota.js dataset --publish --remove-local-copy",
"dataset:scheduler": "npm run dataset:release -- --schedule",
"declarations:lint": "node bin/ota.js lint",
"declarations:validate": "node bin/ota.js validate",
37 changes: 0 additions & 37 deletions scripts/dataset/README.md

This file was deleted.

25 changes: 0 additions & 25 deletions scripts/dataset/main.js

This file was deleted.

2 changes: 1 addition & 1 deletion src/index.js
@@ -6,7 +6,7 @@ import logger from './logger/index.js';
import Notifier from './notifier/index.js';
import Tracker from './tracker/index.js';

export default async function track({ services = [], termsType: documentTypes, refilterOnly, schedule }) {
export default async function track({ services = [], termsTypes: documentTypes, refilterOnly, schedule }) {
const archivist = new Archivist({ recorderConfig: config.get('recorder') });

archivist.attach(logger);
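
The renamed parameter matters because the destructuring key has to match the key produced by the CLI for `--terms-types` (assumed here to be `termsTypes`, following the usual camelCase conversion). A minimal sketch of the behaviour follows. `readTrackOptions` is a hypothetical stand-in for the parameter handling of `track`, not the real function.

```javascript
// Hypothetical stand-in for `track`, reduced to its parameter destructuring.
// `termsTypes: documentTypes` reads the `termsTypes` key and binds it locally
// under the name `documentTypes`.
function readTrackOptions({ services = [], termsTypes: documentTypes } = {}) {
  return { services, documentTypes };
}

// Matching key: the value is picked up.
console.log(readTrackOptions({ termsTypes: ['Privacy Policy'] }).documentTypes); // [ 'Privacy Policy' ]

// Non-matching key: the value is silently ignored and no filter is applied.
console.log(readTrackOptions({ termsType: ['Privacy Policy'] }).documentTypes); // undefined
```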