Feat/paragraph-extraction (#7623)
* Add zod library to dependencies and update yarn.lock

* Implement AbstractController and UseCase interface for API structure

* Add method to filter properties by type in Template class

* Add ExtractorDataSource interface for extractor creation

* Add Extractor class with validation and status management

* Add CreateExtractorUseCase for creating extractors with template validation

* Add extractor ID generation and update Extractor class to include ID property

* Add MongoDB data source and error handling for extractors

* Fix import path for CreateExtractorUseCase in specs

* Add getAll method to TemplatesDataSource and MongoTemplatesDataSource for retrieving all templates

* fix eslint rule and change naming conventions

* IdGenerator as dependency instead of DS.nextId

* getAllFrom on testingEnvironment to get data from db

* unified PX validation errors in a single class

* wip extract paragraphs

* fix: update Document instantiation in S3FileStorage.spec.ts to include additional parameter

* fix: update LanguageISO6391 type usage in FilesMappers and commonSchemas

* fix: enforce required ISO639_1 field in language schemas and update usage in FileMappers

* Revert "fix: enforce required ISO639_1 field in language schemas and update usage in FileMappers"

This reverts commit 317a121.

* feat: introduce FileType and HttpClientFactory, enhance Segmentation and SettingsDataSource interfaces

* wip

* feat: implement PXExtractionId class and refactor extraction ID handling in PXExtractParagraphsFromEntity

* fix: add TODO comment to clarify tenant name prepending in S3FileStorage

* feat: enhance SegmentationMapper and FileSystemStorage, implement missing methods in TestFileStorage

* fix: remove TODO comment regarding tenant name prepending in S3FileStorage

* feat: expose PXErrorCode in PXValidationError for improved error handling

* feat: add HttpField class for improved handling of form data fields

* feat: refactor postFormData method to improve file and field handling in SuperAgentHttpClient

* feat: integrate HttpField for improved form data handling in PXExternalExtractionService

* feat: update PXExtractionId to use a custom separator for improved ID handling

* add validator

* feat: add GET method to HttpClient interface and implement in SuperAgentHttpClient

* feat: implement getParagraphsResult method in PXExternalExtractionService and add related types

* feat: add EXTRACTION_ID_INVALID error code and refactor PXValidationError name assignment

* feat: integrate Validator for PXExtractionId validation and refactor schema handling

* feat: implement PXCreateParagraph use case and associated tests

* refactor: clean up imports in PXCreateExtractor.ts

* feat: implement PXCreateParagraphs use case for paragraph extraction

* feat: add PXParagraphsResultListener for handling paragraph extraction results

* feat: implement PXExtractParagraphsFromEntities use case for paragraph extraction

* wip

* feat: update MongoPXExtractorsQueryService to use paragraphsQuantity instead of entityCount

* feat: enhance MongoPXExtractorsQueryService with additional lookups and fix extractorId assignment

* refactor: move InputSchema and PXCreateExtractor to named exports for better modularity

* refactor: rename GetExtractorsOutput to ExtractorDTO for clarity and consistency

* feat: add Application dependency to AbstractController for enhanced functionality

* feat: add PXCreateExtractorController for handling extractor creation requests

* refactor: simplify PXExtractionId constructor and improve ID handling methods

* wip

* refactor: remove unused create method from EntitiesDataSource interface

* feat: implement PXExtractParagraphsFromEntity use case for paragraph extraction

* test: add unit tests for PXExtractParagraphsFromEntity use case

* refactor: rename entityDS to entitiesDS for consistency in PXCreateParagraph spec

* feat: extend PXExtractionId to include tenantName and userId for enhanced extraction context

* feat: add tenantName and userId to PXExtractionId for improved context in paragraph extraction

* refactor: simplify validation logic in Validator class by removing unnecessary error handling

* feat: enhance language handling in PXExternalExtractionService to select default language from documents

* refactor: remove unused methods and tests from PXCreateParagraph to streamline codebase

* wip

* feat: add isMainLanguage flag to translations and enhance error handling in paragraph creation

* refactor: improve method naming and enhance result handling in paragraph extraction

* refactor: rename paragraphsQuantity to sourceEntitiesCount and update related logic in MongoPXExtractorsQueryService

* refactor: add notes for paragraph property selection and inheritance in PXCreateParagraph

* test: add additional test cases for PXCreateParagraphs functionality

* WIP, manual integration test with dummy service

* testing files

* feat: add paragraph property IDs to extractor and validation error handling

* wip

* finish controllers

* AbstractController refactor

* wip

* last changes

* replace hardcoded url

* add authorization middleware

* typo

* fix function errors

* last changes

* add paragraph extraction routes

* refactor on controllers

* remove unused code

* add segmentation type

* fix test

* fix File mapper

* refactor: replace segment_type to type

* turn Extract paragraph use case sequential

---------

Co-authored-by: Daneryl <[email protected]>
Joao-vi and daneryl authored Feb 26, 2025
1 parent 473d62a commit 587f52e
Showing 89 changed files with 4,121 additions and 54 deletions.
6 changes: 5 additions & 1 deletion .eslintrc.js
@@ -234,7 +234,11 @@ module.exports = {
excludedFiles: './**/*.cy.tsx',
parser: '@typescript-eslint/parser',
parserOptions: { project: './tsconfig.json' },
- rules: { ...rules },
+ rules: {
+ ...rules,
+ 'no-empty-function': ['error', { allow: ['constructors'] }],
+ 'no-useless-constructor': 'off',
+ },
},
{
files: ['./cypress/**/*.ts', './cypress/**/*.d.ts', './**/*.cy.tsx'],
2 changes: 2 additions & 0 deletions app/api/api.js
@@ -44,4 +44,6 @@ export default (app, server) => {
require('./relationships.v2/routes/routes').default(app);
require('./stats/routes').default(app);
require('./testing_errors/routes').default(app);
+
+ require('./paragraphExtraction/adapters/PXRoutes').paragraphExtractionRoutes(app);
};
87 changes: 87 additions & 0 deletions app/api/common.v2/AbstractController.ts
@@ -0,0 +1,87 @@
import util from 'util';
import { ValidationError } from 'ajv';
import { ZodError } from 'zod';
import { Request, Response } from 'express';

import { config } from 'api/config';
import { LanguageISO6391 } from 'shared/types/commonTypes';

export type Dependencies<RequestBody = any> = {
response: Response;
request: Request<unknown, any, RequestBody>;
};

export abstract class AbstractController<RequestBody = any> {
constructor(private dependencies: Dependencies<RequestBody>) {}

protected abstract handle(): Promise<void>;

async handleAsync() {
try {
await this.handle();
} catch (e) {
if (e instanceof ZodError) {
const error = new ValidationError(
e.errors.map(issue => ({
instancePath: issue.path.join('.'),
message: issue.message,
}))
);

error.message = util.inspect(error, false, null);

throw error;
}

throw e;
}
}

/**
* Adapts a controller class to an Express route handler.
*
* This method takes a controller class (not an instance), instantiates it with
* the request and response objects, and calls its `handleAsync` method.
*
*/
static adapt<Controller extends AbstractController>(
ControllerClass: new (dependencies: Dependencies) => Controller
) {
return async (request: Request, response: Response) =>
new ControllerClass({ request, response }).handleAsync();
}

protected get request() {
return this.dependencies.request;
}

protected get response() {
return this.dependencies.response;
}

protected get language() {
return this.request.language as LanguageISO6391;
}

protected get tenantName() {
return this.request.get('tenant') ?? config.defaultTenant.name;
}

protected serverError(error: Error) {
this.response.status(500).json({
message: error.message,
});
}

protected clientError(message: string) {
this.response.status(400).json({ message });
}

protected jsonResponse(body: any) {
this.response.status(200).json(body);
}

protected ok() {
this.response.status(200).send();
}
}
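
As a rough illustration of the `adapt` pattern in this file, the sketch below rebuilds it without Express, zod or ajv. `FakeRequest`, `FakeResponse`, `MiniController` and `HelloController` are hypothetical stand-ins invented for this example and are not part of the commit; the real class works with Express `Request` and `Response` objects and also translates `ZodError` into an ajv `ValidationError`.

```typescript
// Hypothetical, dependency-free stand-ins for the Express request/response objects.
type FakeRequest = { body: { name?: string } };

class FakeResponse {
  statusCode = 0;

  payload: unknown = undefined;

  status(code: number): this {
    this.statusCode = code;
    return this;
  }

  json(body: unknown): this {
    this.payload = body;
    return this;
  }
}

type MiniDependencies = { request: FakeRequest; response: FakeResponse };

// Simplified counterpart of AbstractController: same adapt/handleAsync shape,
// without the ZodError translation.
abstract class MiniController {
  constructor(protected dependencies: MiniDependencies) {}

  protected abstract handle(): Promise<void>;

  async handleAsync(): Promise<void> {
    await this.handle();
  }

  // Takes the class (not an instance) and returns a route-handler-shaped function.
  static adapt(ControllerClass: new (dependencies: MiniDependencies) => MiniController) {
    return async (request: FakeRequest, response: FakeResponse) =>
      new ControllerClass({ request, response }).handleAsync();
  }
}

// Hypothetical concrete controller: validates the body and answers with JSON.
class HelloController extends MiniController {
  protected async handle(): Promise<void> {
    const { name } = this.dependencies.request.body;
    if (!name) {
      this.dependencies.response.status(400).json({ message: 'name is required' });
      return;
    }
    this.dependencies.response.status(200).json({ greeting: `Hello, ${name}` });
  }
}

const helloHandler = MiniController.adapt(HelloController);
```

Because a new controller instance is created per call, each request gets its own `request`/`response` pair and no state leaks between requests.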
19 changes: 19 additions & 0 deletions app/api/common.v2/contracts/HttpClient.ts
@@ -0,0 +1,19 @@
import { File } from 'api/files.v2/model/File';
import { HttpField } from './HttpField';

type PostFormDataInput = {
url: string;
fields: Record<string, HttpField>;
files: Record<string, File[]>;
};

type GetInput = {
url: string;
};

interface HttpClient {
postFormData<T>(input: PostFormDataInput): Promise<T>;
get<Response>(input: GetInput): Promise<Response>;
}

export type { PostFormDataInput, HttpClient, GetInput };
15 changes: 15 additions & 0 deletions app/api/common.v2/contracts/HttpField.ts
@@ -0,0 +1,15 @@
export class HttpField {
value: string;

constructor(value: any) {
this.value = HttpField.parseValue(value);
}

private static parseValue(value: any): string {
if (Array.isArray(value) || typeof value === 'object') {
return JSON.stringify(value);
}

return `${value}`;
}
}
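
A minimal usage sketch for `HttpField`. The class is repeated from this file so the example runs on its own; the field names (`extraction_id`, `languages`, `page_count`, `options`) are made up for illustration and are not taken from the commit.

```typescript
// Copy of HttpField from this commit, repeated so the example is self-contained.
class HttpField {
  value: string;

  constructor(value: any) {
    this.value = HttpField.parseValue(value);
  }

  private static parseValue(value: any): string {
    // Arrays, plain objects and null (typeof null === 'object') are JSON-encoded.
    if (Array.isArray(value) || typeof value === 'object') {
      return JSON.stringify(value);
    }

    // Everything else goes through string interpolation.
    return `${value}`;
  }
}

// Hypothetical field names, for illustration only.
const exampleFields: Record<string, HttpField> = {
  extraction_id: new HttpField('abc-123'),
  languages: new HttpField(['en', 'es']),
  page_count: new HttpField(12),
  options: new HttpField({ ocr: false }),
};

// Flatten to the plain strings that would end up in a multipart form.
const flatFields: Record<string, string> = {};
Object.entries(exampleFields).forEach(([key, field]) => {
  flatFields[key] = field.value;
});
```

Wrapping values this way means the HTTP client only ever deals with strings, whatever the caller passed in.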
3 changes: 3 additions & 0 deletions app/api/common.v2/contracts/UseCase.ts
@@ -0,0 +1,3 @@
export interface UseCase<Input, Output> {
execute(input: Input): Promise<Output>;
}
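
A short sketch of how a class might implement the `UseCase` contract. `CountWordsUseCase`, its input/output types and the `runUseCase` helper are hypothetical and exist only for this example; the interface itself is repeated from this file so the block is self-contained.

```typescript
// The contract from this file, repeated so the example is self-contained.
interface UseCase<Input, Output> {
  execute(input: Input): Promise<Output>;
}

// Hypothetical input/output types, not taken from the commit.
type CountWordsInput = { text: string };
type CountWordsOutput = { words: number };

// Pure helper holding the actual logic.
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

class CountWordsUseCase implements UseCase<CountWordsInput, CountWordsOutput> {
  async execute(input: CountWordsInput): Promise<CountWordsOutput> {
    return { words: countWords(input.text) };
  }
}

// Callers can depend on the interface rather than on a concrete class.
async function runUseCase<I, O>(useCase: UseCase<I, O>, input: I): Promise<O> {
  return useCase.execute(input);
}
```

Keeping every use case behind the same single-method interface makes it straightforward to swap in a test double wherever a `UseCase` is expected.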
38 changes: 38 additions & 0 deletions app/api/common.v2/contracts/specs/HttpField.spec.ts
@@ -0,0 +1,38 @@
import { HttpField } from '../HttpField';

describe('HttpField', () => {
it('should parse a string value correctly', () => {
const field = new HttpField('test');
expect(field.value).toBe('test');
});

it('should parse a number value correctly', () => {
const field = new HttpField(123);
expect(field.value).toBe('123');
});

it('should parse a boolean value correctly', () => {
const field = new HttpField(true);
expect(field.value).toBe('true');
});

it('should parse an array value correctly', () => {
const field = new HttpField([1, 2, 3]);
expect(field.value).toBe('[1,2,3]');
});

it('should parse an object value correctly', () => {
const field = new HttpField({ key: 'value' });
expect(field.value).toBe('{"key":"value"}');
});

it('should parse a null value correctly', () => {
const field = new HttpField(null);
expect(field.value).toBe('null');
});

it('should parse an undefined value correctly', () => {
const field = new HttpField(undefined);
expect(field.value).toBe('undefined');
});
});
10 changes: 6 additions & 4 deletions app/api/common.v2/database/MongoDataSource.ts
@@ -16,10 +16,12 @@ export abstract class MongoDataSource<TSchema extends Document = Document> {
this.transactionManager = transactionManager;
}

- protected getCollection(collectionName = this.collectionName) {
- return new SyncedCollection<TSchema>(
- new SessionScopedCollection<TSchema>(
- this.db.collection<TSchema>(collectionName),
+ protected getCollection<Collection extends Document = TSchema>(
+ collectionName = this.collectionName
+ ) {
+ return new SyncedCollection<Collection>(
+ new SessionScopedCollection<Collection>(
+ this.db.collection<Collection>(collectionName),
this.transactionManager
),
this.transactionManager,
7 changes: 7 additions & 0 deletions app/api/common.v2/infrastructure/HttpClientFactory.ts
@@ -0,0 +1,7 @@
import { SuperAgentHttpClient } from './SuperAgentHttpClient';

export class HttpClientFactory {
static createDefault() {
return new SuperAgentHttpClient();
}
}
44 changes: 44 additions & 0 deletions app/api/common.v2/infrastructure/SuperAgentHttpClient.ts
@@ -0,0 +1,44 @@
import superagent from 'superagent';
import { File } from 'api/files.v2/model/File';
import { GetInput, HttpClient, PostFormDataInput } from '../contracts/HttpClient';
import { HttpField } from '../contracts/HttpField';

export class SuperAgentHttpClient implements HttpClient {
private client = superagent;

async get<Response>(input: GetInput): Promise<Response> {
const response = await this.client.get(input.url);

return response.body as Response;
}

async postFormData<T>(input: PostFormDataInput): Promise<T> {
const request = this.client.post(input.url);

await SuperAgentHttpClient.attachFiles(request, input.files);
SuperAgentHttpClient.appendFields(request, input.fields);

const response = await request;

return response.body as T;
}

private static async attachFiles(request: superagent.Request, files: Record<string, File[]>) {
const promises = Object.entries(files).flatMap(([key, _files]) =>
_files.map(async file => {
const buffer = await file.toBuffer();

// This is necessary because when we actually 'await' for 'request.[attach/field]' the 'superagent' library kicks off the request
// This is not what we want here.
// eslint-disable-next-line @typescript-eslint/no-floating-promises
request.attach(key, buffer, file.filename);
})
);

return Promise.all(promises);
}

private static appendFields(request: superagent.Request, fields: Record<string, HttpField>) {
Object.entries(fields).forEach(([key, value]) => request.field(key, value.value));
}
}
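
The comment in `attachFiles` relies on superagent requests being thenable: the request is only sent once `then` is called, which `await` does implicitly. The sketch below imitates that behaviour with a hypothetical `LazyRequest` class; it is not superagent's API, only a model of why the builder calls are left un-awaited.

```typescript
// Hypothetical stand-in for a thenable request builder; this is not superagent's API.
class LazyRequest implements PromiseLike<string[]> {
  sent = false;

  private parts: string[] = [];

  // Builder method: records a part and returns the same thenable object.
  attach(name: string): this {
    this.parts.push(name);
    return this;
  }

  // Calling then() is what "sends" the request in this model.
  then<TResult1 = string[], TResult2 = never>(
    onfulfilled?: ((value: string[]) => TResult1 | PromiseLike<TResult1>) | null,
    onrejected?: ((reason: any) => TResult2 | PromiseLike<TResult2>) | null
  ): PromiseLike<TResult1 | TResult2> {
    this.sent = true;
    return Promise.resolve(this.parts.slice()).then(onfulfilled, onrejected);
  }
}

const lazyRequest = new LazyRequest();

// Not awaited: the builder call only mutates the request, nothing is sent yet.
lazyRequest.attach('first.pdf');
const sentAfterAttach = lazyRequest.sent; // false

// Awaiting the request (or calling then) triggers the send with everything attached so far.
async function sendLazyRequest(): Promise<string[]> {
  lazyRequest.attach('second.pdf');
  return await lazyRequest;
}
```

Awaiting `lazyRequest.attach(...)` directly would call `then` on the returned object and send the request before the remaining files and fields were added, which is the situation the original comment describes.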
3 changes: 3 additions & 0 deletions app/api/config.ts
@@ -72,6 +72,9 @@ export const config = {
},
},
externalServices: Boolean(process.env.EXTERNAL_SERVICES) || false,
+ externalServicesUrls: {
+ paragraphExtraction: process.env.PARAGRAPH_EXTRACTION_URL || 'http://localhost:5056',
+ },

redis: {
activated: CLUSTER_MODE,
9 changes: 9 additions & 0 deletions app/api/files.v2/contracts/FileStorage.ts
@@ -1,7 +1,16 @@
import { StoredFile } from '../model/StoredFile';
import { UwaziFile } from '../model/UwaziFile';
+ import { FileType } from '../model/FileType';
+ import { File } from '../model/File';

+ export type GetFileInput = {
+ type: FileType;
+ filename: string;
+ };
+
export interface FileStorage {
list(): Promise<StoredFile[]>;
getPath(file: UwaziFile): string;
+ getFiles(inputs: GetFileInput[]): Promise<File[]>;
+ getFile(input: GetFileInput): Promise<File>;
}
41 changes: 41 additions & 0 deletions app/api/files.v2/contracts/FileStorageStrategy.ts
@@ -0,0 +1,41 @@
import { Tenant } from 'api/tenants/tenantContext';
import { FileStorage, GetFileInput } from './FileStorage';
import { File } from '../model/File';
import { StoredFile } from '../model/StoredFile';
import { UwaziFile } from '../model/UwaziFile';

type Strategy = {
s3Storage: FileStorage;
fileSystemStorage: FileStorage;
};

type FileStorageStrategyProps = {
tenant: Tenant;
strategy: Strategy;
};

export class FileStorageStrategy implements FileStorage {
constructor(private props: FileStorageStrategyProps) {}

private get currentStrategy() {
if (this.props.tenant.featureFlags?.s3Storage) return this.props.strategy.s3Storage;

return this.props.strategy.fileSystemStorage;
}

async list(): Promise<StoredFile[]> {
return this.currentStrategy.list();
}

getPath(file: UwaziFile): string {
return this.currentStrategy.getPath(file);
}

async getFiles(inputs: GetFileInput[]): Promise<File[]> {
return this.currentStrategy.getFiles(inputs);
}

async getFile(input: GetFileInput): Promise<File> {
return this.currentStrategy.getFile(input);
}
}
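
The selection rule in `FileStorageStrategy` can be sketched with in-memory stand-ins. Everything prefixed with `Mini` or `mini`, along with the bucket and upload paths, is hypothetical and trimmed down for this example; only the feature-flag check mirrors the code in this file.

```typescript
// Hypothetical, trimmed-down versions of the types used by FileStorageStrategy.
type MiniTenant = { name: string; featureFlags?: { s3Storage?: boolean } };

interface MiniFileStorage {
  getPath(filename: string): string;
}

// Stand-in storages that differ only in how they build a path.
const miniS3Storage: MiniFileStorage = {
  getPath: filename => `s3://bucket/${filename}`,
};

const miniFileSystemStorage: MiniFileStorage = {
  getPath: filename => `/uploads/${filename}`,
};

class MiniFileStorageStrategy implements MiniFileStorage {
  constructor(
    private props: {
      tenant: MiniTenant;
      strategy: { s3Storage: MiniFileStorage; fileSystemStorage: MiniFileStorage };
    }
  ) {}

  // Same selection rule as in the commit: the tenant's s3Storage flag decides.
  private get currentStrategy(): MiniFileStorage {
    if (this.props.tenant.featureFlags?.s3Storage) return this.props.strategy.s3Storage;

    return this.props.strategy.fileSystemStorage;
  }

  getPath(filename: string): string {
    return this.currentStrategy.getPath(filename);
  }
}

const miniStrategies = { s3Storage: miniS3Storage, fileSystemStorage: miniFileSystemStorage };

// No feature flags: falls back to the file system storage.
const localPath = new MiniFileStorageStrategy({
  tenant: { name: 'tenant-a' },
  strategy: miniStrategies,
}).getPath('report.pdf');

// s3Storage flag set: delegates to the S3 storage.
const s3Path = new MiniFileStorageStrategy({
  tenant: { name: 'tenant-b', featureFlags: { s3Storage: true } },
  strategy: miniStrategies,
}).getPath('report.pdf');
```

Since the wrapper implements the same interface as the storages it delegates to, callers do not need to know which backend a tenant uses.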
7 changes: 6 additions & 1 deletion app/api/files.v2/contracts/FilesDataSource.ts
@@ -1,7 +1,12 @@
import { ResultSet } from 'api/common.v2/contracts/ResultSet';
import { UwaziFile } from '../model/UwaziFile';
+ import { Segmentation } from '../model/Segmentation';
+ import { Document } from '../model/Document';

- export interface FilesDataSource {
+ interface FilesDataSource {
filesExistForEntities(files: { entity: string; _id: string }[]): Promise<boolean>;
getAll(): ResultSet<UwaziFile>;
+ getSegmentations(fileId: string[]): ResultSet<Segmentation>;
+ getDocumentsForEntity(entitySharedId: string): ResultSet<Document>;
}
+ export type { FilesDataSource };
30 changes: 13 additions & 17 deletions app/api/files.v2/database/FilesMappers.ts
@@ -1,22 +1,21 @@
- // import { OptionalId } from 'mongodb';
-
+ import { LanguageUtils } from 'shared/language';
import { FileDBOType } from './schemas/filesTypes';
import { UwaziFile } from '../model/UwaziFile';
import { Document } from '../model/Document';
import { URLAttachment } from '../model/URLAttachment';
import { Attachment } from '../model/Attachment';
import { CustomUpload } from '../model/CustomUpload';

- export const FileMappers = {
- // toDBO(file: UwaziFile): OptionalId<FileDBOType> {
- // return {
- // filename: file.filename,
- // entity: file.entity,
- // type: 'document',
- // totalPages: file.totalPages,
- // };
- // },
+ const toDocumentModel = (fileDBO: FileDBOType) =>
+ new Document(
+ fileDBO._id.toString(),
+ fileDBO.entity,
+ fileDBO.totalPages,
+ fileDBO.filename,
+ LanguageUtils.fromISO639_3(fileDBO.language).ISO639_1!
+ ).withCreationDate(new Date(fileDBO.creationDate));

+ export const FileMappers = {
toModel(fileDBO: FileDBOType): UwaziFile {
if (fileDBO.type === 'attachment' && fileDBO.url) {
return new URLAttachment(
@@ -43,11 +42,8 @@ export const FileMappers = {
fileDBO.filename
).withCreationDate(new Date(fileDBO.creationDate));
}
- return new Document(
- fileDBO._id.toString(),
- fileDBO.entity,
- fileDBO.totalPages,
- fileDBO.filename
- ).withCreationDate(new Date(fileDBO.creationDate));
+ return toDocumentModel(fileDBO);
},
+
+ toDocumentModel,
};