
[TTAHUB-1974] Monitoring Data #1803

Merged · 611 commits merged into main from TTAHUB-1974/monitoring-data · Mar 4, 2024
Conversation

@GarrettEHill (Contributor) commented Nov 1, 2023

Description of change

This adds three major areas of functionality:

  • A database-stored definition of the datasets to be fetched from a remote SFTP server, including each dataset's location, cadence, and schema
  • A streaming pipeline that carries the data from remote files to records suitable for insertion into the database
  • A data model within the hub that mirrors the source data, supplemented with link tables to bridge the gap between the shape of that data and the kinds of joins Sequelize knows how to perform

The immediate implementation uses cron to fetch the job definition for Monitoring files from the database. It starts by streaming a zip file from Monitoring's SFTP server to a file on the TTA Hub S3 bucket. From there we stream-decompress the component files into memory*. We then re-encode them from UTF-16 to UTF-8 text streams, parse them as XML, convert them to rows, and upsert those rows into the database.
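As a shape reference only, here is a minimal sketch of that kind of stage chaining using Node's stream pipeline and iconv-lite, with gzip standing in for the zip stage; the file names are illustrative, and the actual job composes the SFTP, S3, zip, and XML classes described below.

```ts
import { pipeline } from 'node:stream/promises';
import { createReadStream, createWriteStream } from 'node:fs';
import { createGunzip } from 'node:zlib';
import iconv from 'iconv-lite';

async function run() {
  await pipeline(
    createReadStream('monitoring-export.xml.gz'), // stand-in for the SFTP/S3/zip stages
    createGunzip(),                               // stand-in for the unzipper stage
    iconv.decodeStream('utf16le'),                // UTF-16 bytes -> JS strings
    iconv.encodeStream('utf8'),                   // JS strings -> UTF-8 bytes
    createWriteStream('monitoring-export.utf8.xml'),
  );
}
run().catch(console.error);
```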

The resulting tables mirror the current state of the relevant columns of the relevant tables in the Monitoring source data. In addition, there is one link table for each of the joining data elements: 1) grant numbers, 2) review IDs, and 3) status IDs. The link tables are kept populated by hooks on the main tables and do nothing but provide Sequelize with tables in which the join elements are primary keys.
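To illustrate the link-table idea, here is a hedged Sequelize sketch; the model and column names are illustrative, not the hub's actual schema. Because both mirror tables associate through a table whose primary key is the shared element, Sequelize can generate includes it otherwise could not.

```ts
import { Sequelize, DataTypes } from 'sequelize';

const sequelize = new Sequelize('sqlite::memory:'); // requires the sqlite3 package

// The link table: its primary key IS the joining element.
const MonitoringReviewLink = sequelize.define('MonitoringReviewLink', {
  reviewId: { type: DataTypes.TEXT, primaryKey: true },
});

// Mirror tables: shaped like the source data, no shared surrogate keys.
const MonitoringReview = sequelize.define('MonitoringReview', {
  reviewId: { type: DataTypes.TEXT, allowNull: false },
});
const MonitoringReviewGrantee = sequelize.define('MonitoringReviewGrantee', {
  reviewId: { type: DataTypes.TEXT, allowNull: false },
  grantNumber: DataTypes.TEXT,
});

// Both sides point at the link table's primary key, so the generated JOIN
// compares text to text rather than text to an integer surrogate id.
MonitoringReview.belongsTo(MonitoringReviewLink, { foreignKey: 'reviewId', targetKey: 'reviewId' });
MonitoringReviewGrantee.belongsTo(MonitoringReviewLink, { foreignKey: 'reviewId', targetKey: 'reviewId' });
MonitoringReviewLink.hasMany(MonitoringReview, { foreignKey: 'reviewId', sourceKey: 'reviewId' });
MonitoringReviewLink.hasMany(MonitoringReviewGrantee, { foreignKey: 'reviewId', sourceKey: 'reviewId' });
```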

The overall implementation goals were to minimize the memory footprint, to minimize transformation of data between the source and TTA Hub's representation, and to allow easy reconfiguration of the process.

*This is the only point at which we must hold a whole file in memory; it is due to zip-handling limitations that would be onerous to work around.

Additional functionality introduced to meet the goals above:

  • The LockManager class (lockManager.ts) manages distributed locks with Redis, allowing execution of callbacks within these locks. It is initialized with a lock key, TTL, and Redis configuration, and provides methods to acquire and release locks, check lock ownership, and renew the lock TTL. The class ensures that only one instance can execute a callback with the lock at a time and handles lock renewal automatically. Errors during lock operations are logged and managed within the class methods.
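A minimal sketch of this pattern, assuming ioredis; the constructor and method names here are illustrative rather than the actual API.

```ts
import Redis from 'ioredis';
import { randomUUID } from 'node:crypto';

class LockManager {
  private redis = new Redis(); // default localhost:6379
  private holderId = randomUUID();

  constructor(private lockKey: string, private ttlSeconds: number) {}

  // SET key value EX ttl NX: only one holder can acquire the lock at a time.
  async acquire(): Promise<boolean> {
    const res = await this.redis.set(this.lockKey, this.holderId, 'EX', this.ttlSeconds, 'NX');
    return res === 'OK';
  }

  async release(): Promise<void> {
    // Only delete the lock if we still own it. This get/del pair is not
    // atomic; the real implementation would use a Lua script for that.
    if ((await this.redis.get(this.lockKey)) === this.holderId) {
      await this.redis.del(this.lockKey);
    }
  }

  // Run a callback while holding the lock, renewing the TTL in the background.
  async withLock<T>(callback: () => Promise<T>): Promise<T | undefined> {
    if (!(await this.acquire())) return undefined;
    const renewal = setInterval(
      () => this.redis.expire(this.lockKey, this.ttlSeconds),
      (this.ttlSeconds * 1000) / 2,
    );
    try {
      return await callback();
    } finally {
      clearInterval(renewal);
      await this.release();
    }
  }
}
```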

  • The BufferStream class extends Node.js's Writable stream class and collects incoming data chunks into an internal buffer. It provides a mechanism to retrieve a Readable stream representing the collected data, either immediately if the stream has finished or via a promise that resolves once it does. The class overrides the _write method to push incoming chunks into an array after converting them to Buffer objects, and a finish listener marks the stream as finished and resolves a promise with a Readable stream created from the buffered data. getReadableStream returns that promise; getSize returns the number of chunks in the buffer; getBuffer concatenates all chunks into a single Buffer, which can throw a TypeError if any chunk is not a valid type for concatenation.
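A condensed sketch of the described behavior; the method names follow the description above.

```ts
import { Writable, Readable } from 'node:stream';

class BufferStream extends Writable {
  private chunks: Buffer[] = [];
  private finished = new Promise<void>((resolve) => this.once('finish', resolve));

  _write(chunk: Buffer, _enc: BufferEncoding, cb: (error?: Error | null) => void): void {
    this.chunks.push(Buffer.from(chunk)); // normalize every chunk to a Buffer
    cb();
  }

  getSize(): number {
    return this.chunks.length; // number of chunks, not bytes
  }

  getBuffer(): Buffer {
    return Buffer.concat(this.chunks);
  }

  // Resolves once writing has finished, then replays the collected data.
  async getReadableStream(): Promise<Readable> {
    await this.finished;
    return Readable.from(this.getBuffer());
  }
}
```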

  • The EncodingConverter class extends the Node.js Transform stream to convert text data between encodings. It is initialized with a target encoding and an optional source encoding, both of which must be among the supported BufferEncoding types; an unsupported encoding throws an error. When the source encoding is not specified, the class uses the chardet library to detect it. It implements the _transform and _flush methods to handle streaming conversion, with helper methods convertBuffer and convertChunk performing the re-encoding, and it maintains a private buffer for accumulating data plus a static set of supported encodings for validation.
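A simplified sketch with a known source encoding, using Node's built-in StringDecoder in place of the class's chardet-based detection and private buffer.

```ts
import { Transform, TransformCallback } from 'node:stream';
import { StringDecoder } from 'node:string_decoder';

class EncodingConverter extends Transform {
  private decoder: StringDecoder;

  constructor(private target: BufferEncoding, source: BufferEncoding = 'utf16le') {
    super();
    this.decoder = new StringDecoder(source);
  }

  _transform(chunk: Buffer, _enc: BufferEncoding, cb: TransformCallback): void {
    // StringDecoder holds back incomplete multi-byte sequences between chunks,
    // so code units split across chunk boundaries are decoded correctly.
    cb(null, Buffer.from(this.decoder.write(chunk), this.target));
  }

  _flush(cb: TransformCallback): void {
    cb(null, Buffer.from(this.decoder.end(), this.target));
  }
}
```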

  • The Hasher class extends the PassThrough stream to provide a streaming interface for generating cryptographic hashes. It holds a private hash object from the crypto module, a promise hashPromise for the asynchronous result, and resolver/rejector functions resolveHash and rejectHash. The constructor accepts an Algorithm type (with a default value) and sets up the hash object and promise. Listeners on the stream's 'data', 'end', and 'error' events update the hash, resolve the promise with the final hash value in hexadecimal format, or reject the promise on error. The public getHash method returns the promise that resolves to the hash value.
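A sketch matching the described structure, using only Node built-ins; the default algorithm here is an assumption.

```ts
import { PassThrough } from 'node:stream';
import { createHash, Hash } from 'node:crypto';

class Hasher extends PassThrough {
  private hash: Hash;
  private hashPromise: Promise<string>;

  constructor(algorithm = 'sha256') {
    super();
    this.hash = createHash(algorithm);
    this.hashPromise = new Promise((resolve, reject) => {
      // Data passes through untouched while feeding the hash.
      this.on('data', (chunk: Buffer) => this.hash.update(chunk));
      this.on('end', () => resolve(this.hash.digest('hex')));
      this.on('error', reject);
    });
  }

  getHash(): Promise<string> {
    return this.hashPromise;
  }
}
```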

  • The S3Client class interacts with AWS S3. It provides methods to upload files as streams, download files as streams, retrieve file metadata, delete files, and list all files in a specified S3 bucket. The constructor accepts a configuration object with an optional S3 configuration and a bucket name. uploadFileAsStream takes a file key and a readable stream and uploads the file to the bucket; downloadFileAsStream returns a readable stream of a file's content given its key; getFileMetadata retrieves metadata for a specified file; deleteFile deletes a file by key; and listFiles returns a list of all objects in the bucket. All methods are asynchronous and return promises; any errors are logged via auditLogger.error and rethrown.
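A hedged sketch of the same surface, assuming the AWS SDK v3 packages @aws-sdk/client-s3 and @aws-sdk/lib-storage; the hub's actual SDK version and error handling may differ, and the auditLogger calls are omitted.

```ts
import { S3, GetObjectCommand } from '@aws-sdk/client-s3';
import { Upload } from '@aws-sdk/lib-storage';
import { Readable } from 'node:stream';

class S3Client {
  private s3 = new S3({});

  constructor(private bucket: string) {}

  async uploadFileAsStream(key: string, body: Readable): Promise<void> {
    // lib-storage's Upload handles multipart upload of streams of unknown length.
    await new Upload({
      client: this.s3,
      params: { Bucket: this.bucket, Key: key, Body: body },
    }).done();
  }

  async downloadFileAsStream(key: string): Promise<Readable> {
    const res = await this.s3.send(new GetObjectCommand({ Bucket: this.bucket, Key: key }));
    return res.Body as Readable;
  }

  async deleteFile(key: string): Promise<void> {
    await this.s3.deleteObject({ Bucket: this.bucket, Key: key });
  }

  async listFiles() {
    return (await this.s3.listObjectsV2({ Bucket: this.bucket })).Contents ?? [];
  }
}
```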

  • The SftpClient class wraps interaction with an SFTP server using the ssh2 library. It includes methods for connecting (connect), disconnecting (disconnect), checking connection status (isConnected), listing files (listFiles), and downloading files as streams (downloadAsStream). connect establishes a connection and handles errors by logging them and updating the connection state; disconnect closes the connection and removes event listeners. listFiles asynchronously retrieves a list of files from the server, with options to filter by file name, list files after a certain file, and include read streams for each file. downloadAsStream initiates a download of a remote file as a readable stream, with an option to use compression. The class also handles system signals to ensure proper disconnection and cleanup. The module defines interfaces for connection configuration (ConnectConfig), file listing options (ListFileOptions), file information (FileInfo), and the file listing itself (FileListing), plus a utility function (modeToPermissions) that converts numeric file modes to human-readable permission strings; the class and interfaces are exported for use in other modules.
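A condensed sketch of the core connect/list/download surface on top of ssh2; the signal handling, filtering options, and modeToPermissions are omitted.

```ts
import { Client, type ConnectConfig, type SFTPWrapper } from 'ssh2';
import { Readable } from 'node:stream';

class SftpClient {
  private conn = new Client();
  private sftp?: SFTPWrapper;

  connect(config: ConnectConfig): Promise<void> {
    return new Promise((resolve, reject) => {
      this.conn
        .on('ready', () => {
          // Open an SFTP session once the SSH connection is ready.
          this.conn.sftp((err, sftp) => {
            if (err) return reject(err);
            this.sftp = sftp;
            resolve();
          });
        })
        .on('error', reject)
        .connect(config);
    });
  }

  listFiles(dir: string): Promise<string[]> {
    return new Promise((resolve, reject) => {
      this.sftp!.readdir(dir, (err, list) =>
        (err ? reject(err) : resolve(list.map((f) => f.filename))));
    });
  }

  downloadAsStream(remotePath: string): Readable {
    return this.sftp!.createReadStream(remotePath);
  }

  disconnect(): void {
    this.conn.end();
  }
}
```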

  • The XMLStream class parses XML data from a Readable stream using a SAX parser. It provides methods to initialize the parser, count parsed objects, check whether parsing is complete, retrieve the next parsed object, and obtain the JSON schema of the parsed XML. The constructor accepts an XML stream and an optional flag for a virtual root node. initialize sets up event handlers for parsing and returns a promise that resolves when parsing is complete or rejects on error. getObjectCount returns the number of parsed objects, while processingComplete checks whether the stream has been fully read. getNextObject asynchronously retrieves the next object from the parsed data, optionally simplifying it, and getObjectSchema returns a promise that resolves to the schema of the parsed data. The class handles XML node construction, schema generation, and text content processing, and supports virtual root nodes to encapsulate the XML data.
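A much-simplified sketch of SAX-based streaming parsing with the sax package: it emits one plain object per top-level child element. The tag-depth handling here is illustrative and far less general than the real class.

```ts
import sax from 'sax';
import { Readable } from 'node:stream';

function parseRecords(xml: Readable): Promise<Record<string, string>[]> {
  return new Promise((resolve, reject) => {
    const saxStream = sax.createStream(true); // strict mode
    const records: Record<string, string>[] = [];
    let current: Record<string, string> | null = null;
    let field = '';
    let depth = 0;

    saxStream.on('opentag', (node) => {
      depth += 1;
      if (depth === 2) current = {};      // a record element under the root
      if (depth === 3) field = node.name; // a field inside a record
    });
    saxStream.on('text', (text) => {
      if (current && field) current[field] = (current[field] ?? '') + text.trim();
    });
    saxStream.on('closetag', () => {
      if (depth === 2 && current) { records.push(current); current = null; }
      if (depth === 3) field = '';
      depth -= 1;
    });
    saxStream.on('error', reject);
    saxStream.on('end', () => resolve(records));
    xml.pipe(saxStream);
  });
}

// e.g. <reviews><review><id>1</id></review></reviews> -> [{ id: '1' }]
```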

  • The ZipStream class handles ZIP file streams for reading and extracting file information and contents. The constructor accepts a ZIP file stream, an optional password for encrypted archives, and an array specifying which files need their streams extracted. It provides methods to list file paths (listFiles), get details of a specific file (getFileDetails), retrieve details for all files (getAllFileDetails), and obtain a stream for a particular file (getFileStream). Each method returns a Promise, ensuring that the requested data is available once ZIP processing is complete. The class uses the unzipper library for parsing ZIP contents and manages errors and logging through the auditLogger. A FileInfo interface structures the file details; both the class and the interface are exported for external use.
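A sketch of the listing portion using unzipper's streaming Parse interface; password support and per-file detail lookups are omitted.

```ts
import unzipper, { type Entry } from 'unzipper';
import { Readable } from 'node:stream';

function listZipEntries(zip: Readable): Promise<string[]> {
  return new Promise((resolve, reject) => {
    const paths: string[] = [];
    zip
      .pipe(unzipper.Parse())
      .on('entry', (entry: Entry) => {
        paths.push(entry.path);
        entry.autodrain(); // we only want the listing, not the contents
      })
      .on('finish', () => resolve(paths))
      .on('error', reject);
  });
}
```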

  • A collection of utilities for data manipulation, including object type checks, undefined value removal, object property remapping and pruning, deep value comparison, object merging, and key transformation. Functions handle equality checks for numbers and dates, collect changes between objects, flatten nested structures, cast values to their detected types, and convert object keys to lowercase. Errors are thrown for invalid inputs or operations.
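A few illustrative implementations in the spirit of these utilities; the names and exact semantics are assumptions, not the actual exports.

```ts
// Strip undefined values so they don't clobber columns during an upsert.
function removeUndefined<T extends object>(obj: T): Partial<T> {
  return Object.fromEntries(
    Object.entries(obj).filter(([, v]) => v !== undefined),
  ) as Partial<T>;
}

// Lowercase all keys, e.g. to match case-insensitive source columns.
function lowercaseKeys(obj: Record<string, unknown>): Record<string, unknown> {
  return Object.fromEntries(
    Object.entries(obj).map(([k, v]) => [k.toLowerCase(), v]),
  );
}

// Cast a string to its detected type: boolean, number, date, or string.
function castValue(value: string): number | boolean | Date | string {
  if (value === 'true' || value === 'false') return value === 'true';
  if (value.trim() !== '' && !Number.isNaN(Number(value))) return Number(value);
  const date = new Date(value);
  if (!Number.isNaN(date.getTime())) return date;
  return value;
}
```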

  • A collection of utility functions for working with Sequelize models and data types. It defines a custom type SequelizeDataTypes that represents all available data types in Sequelize. A mapping between Sequelize data types and TypeScript data types is established in dataTypeMapping. The modelForTable function retrieves a model for a given table name from a database object and throws an error if no model is found. getColumnInformation asynchronously retrieves column information from a model, including column names, data types, and nullability. getColumnNamesFromModelForType fetches column names from a model that match a specific Sequelize data type. filterDataToModel filters a data object based on a model's column information and returns objects with matched and unmatched properties. includeToFindAll is a function that includes a model and its associations, retrieves all records matching given conditions, and supports additional conditions and attribute selection. Lastly, nestedRawish transforms a nested object or array of objects by stripping out Sequelize model instance metadata, retaining only the raw data values, and preserving the input structure.
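As one example of the described contracts, a hedged sketch of filterDataToModel's matched/unmatched split; the signature is assumed from the description above.

```ts
import { Model, ModelStatic } from 'sequelize';

async function filterDataToModel(
  data: Record<string, unknown>,
  model: ModelStatic<Model>,
): Promise<{ matched: Record<string, unknown>; unmatched: Record<string, unknown> }> {
  // Column names come from the model definition itself.
  const columns = Object.keys(model.getAttributes());
  const matched: Record<string, unknown> = {};
  const unmatched: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(data)) {
    (columns.includes(key) ? matched : unmatched)[key] = value;
  }
  return { matched, unmatched };
}
```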

How to test

Issue(s)

Checklists

Every PR

  • Meets issue criteria
  • JIRA ticket status updated
  • Code is meaningfully tested
  • Meets accessibility standards (WCAG 2.1 Levels A, AA)
  • API Documentation updated
  • Boundary diagram updated
  • Logical Data Model updated
  • Architectural Decision Records written for major infrastructure decisions
  • UI review complete

Production Deploy

  • Staging smoke test completed

After merge/deploy

  • Update JIRA ticket status

@thewatermethod previously approved these changes Jan 11, 2024
@thewatermethod dismissed their stale review January 11, 2024 21:32:

whoops, wrong tab

@thewatermethod self-requested a review January 26, 2024 20:26
@thewatermethod (Collaborator) commented:

I'm not seeing too much to complain about with the code. If you could add instructions for local dev, I can attempt to set this up Monday morning and start testing it. We probably also want to deploy it to sandbox or dev ASAP now that CI is passing.

@GarrettEHill marked this pull request as ready for review January 30, 2024 18:41
@kcirtapfromspace (Contributor) commented:

Is there a need to release all this in one go? Can this be broken out into manageable stages/segments? From the TL;DR these seem like obvious segments:

  • Stage 1: SFTP to S3
    • seems like an easy thing to spin off and get out the door
    • can spot-test to prod
  • Stage 2: Unzip and deal with UTF encodings
    • Check that UTF conversion is correct
  • Stage 3: Hash the XML data / preprocessing / filtering
  • Stage 4: CRUD to DB

@GarrettEHill (Contributor, Author) commented Jan 31, 2024

> Is there a need to release all this in one go? Can this be broken out into manageable stages/segments? From the TL;DR these seem like obvious segments:
>
>   • Stage 1: SFTP to S3
>     • seems like an easy thing to spin off and get out the door
>     • can spot-test to prod
>   • Stage 2: Unzip and deal with UTF encodings
>     • Check that UTF conversion is correct
>   • Stage 3: Hash the XML data / preprocessing / filtering
>   • Stage 4: CRUD to DB

@kcirtapfromspace That would have been a good way to split this into stories had it been done before we got to this point, but now that we need this delivered, it's too late to split.

@thewatermethod (Collaborator) commented:

Problem with a different association:
[Screenshot 2024-01-31 at 1:20:34 PM]

The SQL error is:

```
ERROR:  operator does not exist: text = integer
LINE 25: ...gReview" ON "monitoringReviewGrantees"."reviewId" = "monitor...
```

This is line 25 of the generated query:

```sql
INNER JOIN "MonitoringReviews" AS "monitoringReviewGrantees->monitoringReview" ON "monitoringReviewGrantees"."reviewId" = "monitoringReviewGrantees->monitoringReview"."id"
```

@kryswisnaskas (Collaborator) left a comment:

I marked the places where we need to remove comments.

@GarrettEHill merged commit b704871 into main Mar 4, 2024
11 checks passed
@GarrettEHill deleted the TTAHUB-1974/monitoring-data branch March 4, 2024 20:03
@GarrettEHill mentioned this pull request Mar 5, 2024