
[TTAHUB-1974] Monitoring Data #1803

Merged · 611 commits merged into main from TTAHUB-1974/monitoring-data · Mar 4, 2024
Conversation

@GarrettEHill (Contributor) commented Nov 1, 2023

Description of change

This adds three major areas of functionality:

  • A database-stored definition of the datasets to be fetched from a remote SFTP server, including each dataset's location, cadence, and schema
  • A streaming pipeline that carries the data from remote files to records suitable for insertion into the database
  • A data model within the hub that mirrors the source data, supplemented with link tables to bridge the gap between the shape of that data and the kinds of joins Sequelize knows how to perform

The immediate implementation uses cron to fetch the job definition for Monitoring files from the database. It starts by streaming a zip file from Monitoring's SFTP server to a file on the TTA Hub S3 bucket. From there we stream-decompress the component files into memory*. We then re-encode them from UTF-16 to UTF-8 text streams, parse them as XML, convert them to rows, and upsert those rows into the database.
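As a shape reference only, here is a minimal sketch of that kind of stage chaining using Node's stream pipeline and iconv-lite, with gzip standing in for the zip stage; the file names are illustrative, and the actual job composes the SFTP, S3, zip, and XML classes described below.

```ts
import { pipeline } from 'node:stream/promises';
import { createReadStream, createWriteStream } from 'node:fs';
import { createGunzip } from 'node:zlib';
import iconv from 'iconv-lite';

async function run() {
  await pipeline(
    createReadStream('monitoring-export.xml.gz'), // stand-in for the SFTP/S3/zip stages
    createGunzip(),                               // stand-in for the unzipper stage
    iconv.decodeStream('utf16le'),                // UTF-16 bytes -> JS strings
    iconv.encodeStream('utf8'),                   // JS strings -> UTF-8 bytes
    createWriteStream('monitoring-export.utf8.xml'),
  );
}
run().catch(console.error);
```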

The resulting tables mirror the current state of the relevant columns of the relevant tables in the Monitoring source data. In addition, there is one link table for each of the joining data elements: 1) grant numbers, 2) review IDs, and 3) status IDs. The link tables are kept populated by hooks on the main tables and do nothing but provide Sequelize with tables in which the join elements are primary keys.
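To illustrate the link-table idea, here is a hedged Sequelize sketch; the model and column names are illustrative, not the hub's actual schema. Because both mirror tables associate through a table whose primary key is the shared element, Sequelize can generate includes it otherwise could not.

```ts
import { Sequelize, DataTypes } from 'sequelize';

const sequelize = new Sequelize('sqlite::memory:'); // requires the sqlite3 package

// The link table: its primary key IS the joining element.
const MonitoringReviewLink = sequelize.define('MonitoringReviewLink', {
  reviewId: { type: DataTypes.TEXT, primaryKey: true },
});

// Mirror tables: shaped like the source data, no shared surrogate keys.
const MonitoringReview = sequelize.define('MonitoringReview', {
  reviewId: { type: DataTypes.TEXT, allowNull: false },
});
const MonitoringReviewGrantee = sequelize.define('MonitoringReviewGrantee', {
  reviewId: { type: DataTypes.TEXT, allowNull: false },
  grantNumber: DataTypes.TEXT,
});

// Both sides point at the link table's primary key, so the generated JOIN
// compares text to text rather than text to an integer surrogate id.
MonitoringReview.belongsTo(MonitoringReviewLink, { foreignKey: 'reviewId', targetKey: 'reviewId' });
MonitoringReviewGrantee.belongsTo(MonitoringReviewLink, { foreignKey: 'reviewId', targetKey: 'reviewId' });
MonitoringReviewLink.hasMany(MonitoringReview, { foreignKey: 'reviewId', sourceKey: 'reviewId' });
MonitoringReviewLink.hasMany(MonitoringReviewGrantee, { foreignKey: 'reviewId', sourceKey: 'reviewId' });
```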

The overall implementation goals were to minimize the memory footprint, to minimize transformation of data between the source and TTA Hub's representation, and to allow easy reconfiguration of the process.

*This is the only point at which we must hold a whole file in memory; it is due to zip-handling limitations that would be onerous to work around.

Additional functionality introduced to meet the goals above:

  • The LockManager class (lockManager.ts) manages distributed locks with Redis, allowing execution of callbacks within these locks. It is initialized with a lock key, TTL, and Redis configuration, and provides methods to acquire and release locks, check lock ownership, and renew the lock TTL. The class ensures that only one instance can execute a callback with the lock at a time and handles lock renewal automatically. Errors during lock operations are logged and managed within the class methods.
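A minimal sketch of this pattern, assuming ioredis; the constructor and method names here are illustrative rather than the actual API.

```ts
import Redis from 'ioredis';
import { randomUUID } from 'node:crypto';

class LockManager {
  private redis = new Redis(); // default localhost:6379
  private holderId = randomUUID();

  constructor(private lockKey: string, private ttlSeconds: number) {}

  // SET key value EX ttl NX: only one holder can acquire the lock at a time.
  async acquire(): Promise<boolean> {
    const res = await this.redis.set(this.lockKey, this.holderId, 'EX', this.ttlSeconds, 'NX');
    return res === 'OK';
  }

  async release(): Promise<void> {
    // Only delete the lock if we still own it. This get/del pair is not
    // atomic; the real implementation would use a Lua script for that.
    if ((await this.redis.get(this.lockKey)) === this.holderId) {
      await this.redis.del(this.lockKey);
    }
  }

  // Run a callback while holding the lock, renewing the TTL in the background.
  async withLock<T>(callback: () => Promise<T>): Promise<T | undefined> {
    if (!(await this.acquire())) return undefined;
    const renewal = setInterval(
      () => this.redis.expire(this.lockKey, this.ttlSeconds),
      (this.ttlSeconds * 1000) / 2,
    );
    try {
      return await callback();
    } finally {
      clearInterval(renewal);
      await this.release();
    }
  }
}
```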

  • The BufferStream class extends Node.js's Writable stream class and collects incoming data chunks into an internal buffer. It provides a mechanism to retrieve a Readable stream representing the collected data, either immediately if the stream has finished or via a promise that resolves once it does. The class overrides the _write method to push incoming chunks into an array after converting them to Buffer objects, and a finish listener marks the stream as finished and resolves a promise with a Readable stream created from the buffered data. getReadableStream returns that promise; getSize returns the number of chunks in the buffer; getBuffer concatenates all chunks into a single Buffer, which can throw a TypeError if any chunk is not a valid type for concatenation.
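A condensed sketch of the described behavior; the method names follow the description above.

```ts
import { Writable, Readable } from 'node:stream';

class BufferStream extends Writable {
  private chunks: Buffer[] = [];
  private finished = new Promise<void>((resolve) => this.once('finish', resolve));

  _write(chunk: Buffer, _enc: BufferEncoding, cb: (error?: Error | null) => void): void {
    this.chunks.push(Buffer.from(chunk)); // normalize every chunk to a Buffer
    cb();
  }

  getSize(): number {
    return this.chunks.length; // number of chunks, not bytes
  }

  getBuffer(): Buffer {
    return Buffer.concat(this.chunks);
  }

  // Resolves once writing has finished, then replays the collected data.
  async getReadableStream(): Promise<Readable> {
    await this.finished;
    return Readable.from(this.getBuffer());
  }
}
```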

  • The EncodingConverter class extends the Node.js Transform stream to convert text data between encodings. It is initialized with a target encoding and an optional source encoding, both of which must be among the supported BufferEncoding types; an unsupported encoding throws an error. When the source encoding is not specified, the class uses the chardet library to detect it. It implements the _transform and _flush methods to handle streaming conversion, with helper methods convertBuffer and convertChunk performing the re-encoding, and it maintains a private buffer for accumulating data plus a static set of supported encodings for validation.
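A simplified sketch with a known source encoding, using Node's built-in StringDecoder in place of the class's chardet-based detection and private buffer.

```ts
import { Transform, TransformCallback } from 'node:stream';
import { StringDecoder } from 'node:string_decoder';

class EncodingConverter extends Transform {
  private decoder: StringDecoder;

  constructor(private target: BufferEncoding, source: BufferEncoding = 'utf16le') {
    super();
    this.decoder = new StringDecoder(source);
  }

  _transform(chunk: Buffer, _enc: BufferEncoding, cb: TransformCallback): void {
    // StringDecoder holds back incomplete multi-byte sequences between chunks,
    // so code units split across chunk boundaries are decoded correctly.
    cb(null, Buffer.from(this.decoder.write(chunk), this.target));
  }

  _flush(cb: TransformCallback): void {
    cb(null, Buffer.from(this.decoder.end(), this.target));
  }
}
```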

  • The Hasher class extends the PassThrough stream to provide a streaming interface for generating cryptographic hashes. It holds a private hash object from the crypto module, a promise hashPromise for the asynchronous result, and resolver/rejector functions resolveHash and rejectHash. The constructor accepts an Algorithm type (with a default value) and sets up the hash object and promise. Listeners on the stream's 'data', 'end', and 'error' events update the hash, resolve the promise with the final hash value in hexadecimal format, or reject the promise on error. The public getHash method returns the promise that resolves to the hash value.
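A sketch matching the described structure, using only Node built-ins; the default algorithm here is an assumption.

```ts
import { PassThrough } from 'node:stream';
import { createHash, Hash } from 'node:crypto';

class Hasher extends PassThrough {
  private hash: Hash;
  private hashPromise: Promise<string>;

  constructor(algorithm = 'sha256') {
    super();
    this.hash = createHash(algorithm);
    this.hashPromise = new Promise((resolve, reject) => {
      // Data passes through untouched while feeding the hash.
      this.on('data', (chunk: Buffer) => this.hash.update(chunk));
      this.on('end', () => resolve(this.hash.digest('hex')));
      this.on('error', reject);
    });
  }

  getHash(): Promise<string> {
    return this.hashPromise;
  }
}
```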

  • The S3Client class interacts with AWS S3. It provides methods to upload files as streams, download files as streams, retrieve file metadata, delete files, and list all files in a specified S3 bucket. The constructor accepts a configuration object with an optional S3 configuration and a bucket name. uploadFileAsStream takes a file key and a readable stream and uploads the file to the bucket; downloadFileAsStream returns a readable stream of a file's content given its key; getFileMetadata retrieves metadata for a specified file; deleteFile deletes a file by key; and listFiles returns a list of all objects in the bucket. All methods are asynchronous and return promises; any errors are logged via auditLogger.error and rethrown.
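A hedged sketch of the same surface, assuming the AWS SDK v3 packages @aws-sdk/client-s3 and @aws-sdk/lib-storage; the hub's actual SDK version and error handling may differ, and the auditLogger calls are omitted.

```ts
import { S3, GetObjectCommand } from '@aws-sdk/client-s3';
import { Upload } from '@aws-sdk/lib-storage';
import { Readable } from 'node:stream';

class S3Client {
  private s3 = new S3({});

  constructor(private bucket: string) {}

  async uploadFileAsStream(key: string, body: Readable): Promise<void> {
    // lib-storage's Upload handles multipart upload of streams of unknown length.
    await new Upload({
      client: this.s3,
      params: { Bucket: this.bucket, Key: key, Body: body },
    }).done();
  }

  async downloadFileAsStream(key: string): Promise<Readable> {
    const res = await this.s3.send(new GetObjectCommand({ Bucket: this.bucket, Key: key }));
    return res.Body as Readable;
  }

  async deleteFile(key: string): Promise<void> {
    await this.s3.deleteObject({ Bucket: this.bucket, Key: key });
  }

  async listFiles() {
    return (await this.s3.listObjectsV2({ Bucket: this.bucket })).Contents ?? [];
  }
}
```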

  • The SftpClient class wraps interaction with an SFTP server using the ssh2 library. It includes methods for connecting (connect), disconnecting (disconnect), checking connection status (isConnected), listing files (listFiles), and downloading files as streams (downloadAsStream). connect establishes a connection and handles errors by logging them and updating the connection state; disconnect closes the connection and removes event listeners. listFiles asynchronously retrieves a list of files from the server, with options to filter by file name, list files after a certain file, and include read streams for each file. downloadAsStream initiates a download of a remote file as a readable stream, with an option to use compression. The class also handles system signals to ensure proper disconnection and cleanup. The module defines interfaces for connection configuration (ConnectConfig), file listing options (ListFileOptions), file information (FileInfo), and the file listing itself (FileListing), plus a utility function (modeToPermissions) that converts numeric file modes to human-readable permission strings; the class and interfaces are exported for use in other modules.
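A condensed sketch of the core connect/list/download surface on top of ssh2; the signal handling, filtering options, and modeToPermissions are omitted.

```ts
import { Client, type ConnectConfig, type SFTPWrapper } from 'ssh2';
import { Readable } from 'node:stream';

class SftpClient {
  private conn = new Client();
  private sftp?: SFTPWrapper;

  connect(config: ConnectConfig): Promise<void> {
    return new Promise((resolve, reject) => {
      this.conn
        .on('ready', () => {
          // Open an SFTP session once the SSH connection is ready.
          this.conn.sftp((err, sftp) => {
            if (err) return reject(err);
            this.sftp = sftp;
            resolve();
          });
        })
        .on('error', reject)
        .connect(config);
    });
  }

  listFiles(dir: string): Promise<string[]> {
    return new Promise((resolve, reject) => {
      this.sftp!.readdir(dir, (err, list) =>
        (err ? reject(err) : resolve(list.map((f) => f.filename))));
    });
  }

  downloadAsStream(remotePath: string): Readable {
    return this.sftp!.createReadStream(remotePath);
  }

  disconnect(): void {
    this.conn.end();
  }
}
```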

  • The XMLStream class parses XML data from a Readable stream using a SAX parser. It provides methods to initialize the parser, count parsed objects, check whether parsing is complete, retrieve the next parsed object, and obtain the JSON schema of the parsed XML. The constructor accepts an XML stream and an optional flag for a virtual root node. initialize sets up event handlers for parsing and returns a promise that resolves when parsing is complete or rejects on error. getObjectCount returns the number of parsed objects, while processingComplete checks whether the stream has been fully read. getNextObject asynchronously retrieves the next object from the parsed data, optionally simplifying it, and getObjectSchema returns a promise that resolves to the schema of the parsed data. The class handles XML node construction, schema generation, and text content processing, and supports virtual root nodes to encapsulate the XML data.
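A much-simplified sketch of SAX-based streaming parsing with the sax package: it emits one plain object per top-level child element. The tag-depth handling here is illustrative and far less general than the real class.

```ts
import sax from 'sax';
import { Readable } from 'node:stream';

function parseRecords(xml: Readable): Promise<Record<string, string>[]> {
  return new Promise((resolve, reject) => {
    const saxStream = sax.createStream(true); // strict mode
    const records: Record<string, string>[] = [];
    let current: Record<string, string> | null = null;
    let field = '';
    let depth = 0;

    saxStream.on('opentag', (node) => {
      depth += 1;
      if (depth === 2) current = {};      // a record element under the root
      if (depth === 3) field = node.name; // a field inside a record
    });
    saxStream.on('text', (text) => {
      if (current && field) current[field] = (current[field] ?? '') + text.trim();
    });
    saxStream.on('closetag', () => {
      if (depth === 2 && current) { records.push(current); current = null; }
      if (depth === 3) field = '';
      depth -= 1;
    });
    saxStream.on('error', reject);
    saxStream.on('end', () => resolve(records));
    xml.pipe(saxStream);
  });
}

// e.g. <reviews><review><id>1</id></review></reviews> -> [{ id: '1' }]
```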

  • The ZipStream class handles ZIP file streams for reading and extracting file information and contents. The constructor accepts a ZIP file stream, an optional password for encrypted archives, and an array specifying which files need their streams extracted. It provides methods to list file paths (listFiles), get details of a specific file (getFileDetails), retrieve details for all files (getAllFileDetails), and obtain a stream for a particular file (getFileStream). Each method returns a Promise, ensuring that the requested data is available once ZIP processing is complete. The class uses the unzipper library for parsing ZIP contents and manages errors and logging through the auditLogger. A FileInfo interface structures the file details; both the class and the interface are exported for external use.
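A sketch of the listing portion using unzipper's streaming Parse interface; password support and per-file detail lookups are omitted.

```ts
import unzipper, { type Entry } from 'unzipper';
import { Readable } from 'node:stream';

function listZipEntries(zip: Readable): Promise<string[]> {
  return new Promise((resolve, reject) => {
    const paths: string[] = [];
    zip
      .pipe(unzipper.Parse())
      .on('entry', (entry: Entry) => {
        paths.push(entry.path);
        entry.autodrain(); // we only want the listing, not the contents
      })
      .on('finish', () => resolve(paths))
      .on('error', reject);
  });
}
```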

  • A collection of utilities for data manipulation, including object type checks, undefined value removal, object property remapping and pruning, deep value comparison, object merging, and key transformation. Functions handle equality checks for numbers and dates, collect changes between objects, flatten nested structures, cast values to their detected types, and convert object keys to lowercase. Errors are thrown for invalid inputs or operations.
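A few illustrative implementations in the spirit of these utilities; the names and exact semantics are assumptions, not the actual exports.

```ts
// Strip undefined values so they don't clobber columns during an upsert.
function removeUndefined<T extends object>(obj: T): Partial<T> {
  return Object.fromEntries(
    Object.entries(obj).filter(([, v]) => v !== undefined),
  ) as Partial<T>;
}

// Lowercase all keys, e.g. to match case-insensitive source columns.
function lowercaseKeys(obj: Record<string, unknown>): Record<string, unknown> {
  return Object.fromEntries(
    Object.entries(obj).map(([k, v]) => [k.toLowerCase(), v]),
  );
}

// Cast a string to its detected type: boolean, number, date, or string.
function castValue(value: string): number | boolean | Date | string {
  if (value === 'true' || value === 'false') return value === 'true';
  if (value.trim() !== '' && !Number.isNaN(Number(value))) return Number(value);
  const date = new Date(value);
  if (!Number.isNaN(date.getTime())) return date;
  return value;
}
```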

  • A collection of utility functions for working with Sequelize models and data types. It defines a custom type SequelizeDataTypes that represents all available data types in Sequelize. A mapping between Sequelize data types and TypeScript data types is established in dataTypeMapping. The modelForTable function retrieves a model for a given table name from a database object and throws an error if no model is found. getColumnInformation asynchronously retrieves column information from a model, including column names, data types, and nullability. getColumnNamesFromModelForType fetches column names from a model that match a specific Sequelize data type. filterDataToModel filters a data object based on a model's column information and returns objects with matched and unmatched properties. includeToFindAll is a function that includes a model and its associations, retrieves all records matching given conditions, and supports additional conditions and attribute selection. Lastly, nestedRawish transforms a nested object or array of objects by stripping out Sequelize model instance metadata, retaining only the raw data values, and preserving the input structure.
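As one example of the described contracts, a hedged sketch of filterDataToModel's matched/unmatched split; the signature is assumed from the description above.

```ts
import { Model, ModelStatic } from 'sequelize';

async function filterDataToModel(
  data: Record<string, unknown>,
  model: ModelStatic<Model>,
): Promise<{ matched: Record<string, unknown>; unmatched: Record<string, unknown> }> {
  // Column names come from the model definition itself.
  const columns = Object.keys(model.getAttributes());
  const matched: Record<string, unknown> = {};
  const unmatched: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(data)) {
    (columns.includes(key) ? matched : unmatched)[key] = value;
  }
  return { matched, unmatched };
}
```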

How to test

Issue(s)

Checklists

Every PR

  • Meets issue criteria
  • JIRA ticket status updated
  • Code is meaningfully tested
  • Meets accessibility standards (WCAG 2.1 Levels A, AA)
  • API Documentation updated
  • Boundary diagram updated
  • Logical Data Model updated
  • Architectural Decision Records written for major infrastructure decisions
  • UI review complete

Production Deploy

  • Staging smoke test completed

After merge/deploy

  • Update JIRA ticket status

@thewatermethod previously approved these changes Jan 11, 2024
@thewatermethod dismissed their stale review January 11, 2024 21:32:

whoops, wrong tab

@thewatermethod self-requested a review January 26, 2024 20:26
@thewatermethod (Collaborator) commented:

I'm not seeing too much to complain about with the code. If you could add instructions for local dev, I can attempt to set this up Monday morning and start testing it. We probably also want to deploy it to sandbox or dev ASAP now that CI is passing.

@GarrettEHill marked this pull request as ready for review January 30, 2024 18:41
@kcirtapfromspace (Contributor) commented:

Is there a need to release all this in one go? Can this be broken out into manageable stages/segments? From the TL;DR these seem like obvious segments:

  • Stage 1: SFTP to S3
    • seems like an easy thing to spin off and get out the door
    • can spot-test to prod
  • Stage 2: Unzip and deal with UTF encodings
    • Check that UTF conversion is correct
  • Stage 3: Hash the XML data / preprocessing / filtering
  • Stage 4: CRUD to DB

@GarrettEHill (Contributor, Author) commented Jan 31, 2024

> Is there a need to release all this in one go? Can this be broken out into manageable stages/segments? From the TL;DR these seem like obvious segments:
>
>   • Stage 1: SFTP to S3
>     • seems like an easy thing to spin off and get out the door
>     • can spot-test to prod
>   • Stage 2: Unzip and deal with UTF encodings
>     • Check that UTF conversion is correct
>   • Stage 3: Hash the XML data / preprocessing / filtering
>   • Stage 4: CRUD to DB

@kcirtapfromspace That would have been a good way to split this into stories had it been done before we got to this point, but now that we need this delivered, it's too late to split.

@thewatermethod (Collaborator) commented:

Problem with a different association:
[Screenshot 2024-01-31 at 1:20:34 PM]

The SQL error is:

```
ERROR:  operator does not exist: text = integer
LINE 25: ...gReview" ON "monitoringReviewGrantees"."reviewId" = "monitor...
```

This is line 25 of the generated query:

```sql
INNER JOIN "MonitoringReviews" AS "monitoringReviewGrantees->monitoringReview" ON "monitoringReviewGrantees"."reviewId" = "monitoringReviewGrantees->monitoringReview"."id"
```

@kryswisnaskas (Collaborator) left a comment:

I marked the places where we need to remove comments.

@GarrettEHill merged commit b704871 into main Mar 4, 2024
11 checks passed
@GarrettEHill deleted the TTAHUB-1974/monitoring-data branch March 4, 2024 20:03
@GarrettEHill mentioned this pull request Mar 5, 2024