
Commit 67c5299

External dataset (labring#1519)
* perf: local file create collection
* rename middleware
* perf: remove code
* feat: next14
* feat: external file dataset
* collection tags field
* external file dataset doc
* fix: ts
1 parent 2d1ec9b commit 67c5299

102 files changed: +1841 -1284 lines


.npmrc (-1)

@@ -1,2 +1 @@
 public-hoist-pattern[]=*tiktoken*
-public-hoist-pattern[]=*react*

.vscode/nextapi.code-snippets (+1 -1)

@@ -11,7 +11,7 @@
     "prefix": "nextapi",
     "body": [
       "import type { ApiRequestProps, ApiResponseType } from '@fastgpt/service/type/next';",
-      "import { NextAPI } from '@/service/middle/entry';",
+      "import { NextAPI } from '@/service/middleware/entry';",
       "",
       "export type ${TM_FILENAME_BASE}Query = {};",
       "",
3 image files changed (binary, previews not rendered): 163 KB, 122 KB, 74.6 KB.
New file (path not shown in this view) (+26)

@@ -0,0 +1,26 @@
+---
+title: 'External File Dataset'
+description: 'Introduction to the FastGPT external file dataset feature and how to use it'
+icon: 'language'
+draft: false
+toc: true
+weight: 107
+---
+
+The external file dataset is a FastGPT commercial-edition feature. It lets you connect your existing file system without importing another copy of the files into FastGPT.
+
+In addition, read permissions can be controlled by your own file system.
+
+| | | |
+| --------------------- | --------------------- | --------------------- |
+| ![](/imgs/external_file0.png) | ![](/imgs/external_file1.png) | ![](/imgs/external_file2.png) |
+
+
+## Import parameters
+
+- External preview URL: the address used to jump to your file reader; the request carries the "file read ID".
+- File access URL: the address from which the file can be fetched.
+- File read ID: the file access URL is usually temporary. For permanent access, use this file read ID together with the "external preview URL" to jump to a new reading address for the original file.
+- File name: by default it is parsed from the file access URL. If you fill it in manually, the manual value takes precedence.
+
+[See the API import documentation](/docs/development/openapi/dataset/#创建一个外部文件库集合商业版)
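To make the parameter interplay concrete, here is a minimal sketch of how a reader link is assembled from the "external preview URL" and the "file read ID". The query-parameter name and the URLs are illustrative assumptions, not part of the documentation.

```ts
// Illustrative only: your file system defines the real preview route and parameter name.
const externalPreviewUrl = 'https://files.example.com/preview'; // external preview URL
const externalFileId = 'doc-2024-0001'; // file read ID, permanent in your file system

// FastGPT jumps to the preview address and carries the file read ID with it,
// so the (possibly temporary) file access URL is not needed to read the original file.
const readerUrl = `${externalPreviewUrl}?id=${encodeURIComponent(externalFileId)}`;
console.log(readerUrl); // https://files.example.com/preview?id=doc-2024-0001
```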

docSite/content/docs/development/openapi/dataset.md (+82 -2)

@@ -295,6 +295,24 @@ curl --location --request DELETE 'http://localhost:3000/api/core/dataset/delete?
 
 ## Collections
 
+### Common creation parameters
+
+**Request parameters**
+
+| Parameter | Description | Required |
+| --- | --- | --- |
+| datasetId | Dataset ID | ✓ |
+| parentId | Parent ID; defaults to the root directory if omitted | |
+| trainingType | Training mode. chunk: split by text length; qa: QA split; auto: augmented training | ✓ |
+| chunkSize | Estimated chunk size | |
+| chunkSplitter | Custom highest-priority split character | |
+| qaPrompt | QA split prompt | |
+
+**Response**
+
+- collectionId: ID of the newly created collection
+- insertLen: number of chunks inserted
+
 ### Create an empty collection
 
 {{< tabs tabTotal="3" >}}
@@ -500,7 +518,7 @@ data is the collection ID.
 {{< /tab >}}
 {{< /tabs >}}
 
-### Create a file collection (commercial edition)
+### Create a file collection
 
 Pass in a file to create a collection; the file content is read and split. Currently supported: pdf, docx, md, txt, html, csv.
 
@@ -509,7 +527,7 @@ data is the collection ID.
 {{< markdownify >}}
 
 ```bash
-curl --location --request POST 'http://localhost:3000/api/proApi/core/dataset/collection/create/file' \
+curl --location --request POST 'http://localhost:3000/api/core/dataset/collection/create/localFile' \
 --header 'Authorization: Bearer {{authorization}}' \
 --form 'file=@"C:\\Users\\user\\Desktop\\fastgpt测试文件\\index.html"' \
 --form 'data="{\"datasetId\":\"6593e137231a2be9c5603ba7\",\"parentId\":null,\"trainingType\":\"chunk\",\"chunkSize\":512,\"chunkSplitter\":\"\",\"qaPrompt\":\"\",\"metadata\":{}}"'
@@ -565,6 +583,68 @@ data is the collection ID.
 {{< /tab >}}
 {{< /tabs >}}
 
+### Create an external file collection (commercial edition)
+
+{{< tabs tabTotal="3" >}}
+{{< tab tabName="Request example" >}}
+{{< markdownify >}}
+
+```bash
+curl --location --request POST 'http://localhost:3000/api/proApi/core/dataset/collection/create/externalFileUrl' \
+--header 'Authorization: Bearer {{authorization}}' \
+--header 'User-Agent: Apifox/1.0.0 (https://apifox.com)' \
+--header 'Content-Type: application/json' \
+--data-raw '{
+    "externalFileUrl":"https://image.xxxxx.com/fastgpt-dev/%E6%91%82.pdf",
+    "externalFileId":"1111",
+    "filename":"Custom file name",
+    "datasetId":"6642d105a5e9d2b00255b27b",
+    "parentId": null,
+
+    "trainingType": "chunk",
+    "chunkSize":512,
+    "chunkSplitter":"",
+    "qaPrompt":""
+}'
+```
+
+{{< /markdownify >}}
+{{< /tab >}}
+
+{{< tab tabName="Parameters" >}}
+{{< markdownify >}}
+
+| Parameter | Description | Required |
+| --- | --- | --- |
+| externalFileUrl | File access URL (may be a temporary link) | ✓ |
+| externalFileId | External file ID | |
+| filename | Custom file name | |
+
+
+{{< /markdownify >}}
+{{< /tab >}}
+
+{{< tab tabName="Response example" >}}
+{{< markdownify >}}
+
+data is the collection ID.
+
+```json
+{
+    "code": 200,
+    "statusText": "",
+    "message": "",
+    "data": {
+        "collectionId": "6646fcedfabd823cdc6de746",
+        "insertLen": 3
+    }
+}
+```
+
+{{< /markdownify >}}
+{{< /tab >}}
+{{< /tabs >}}
+
 ### Get the collection list
 
 {{< tabs tabTotal="3" >}}
docSite/content/docs/development/upgrading/481.md

+8-5
Original file line numberDiff line numberDiff line change
@@ -35,8 +35,11 @@ curl --location --request POST 'https://{{host}}/api/admin/clearInvalidData' \
3535
## V4.8.1 更新说明
3636

3737
1. 新增 - 知识库重新选择向量模型重建
38-
2. 新增 - 工作流节点版本变更提示,并可以同步最新版本。
39-
3. 优化 - 插件输入的 debug 模式,支持全量参数输入渲染。
40-
4. 修复 - 插件输入默认值被清空问题。
41-
5. 修复 - 工作流删除节点的动态输入和输出时候,没有正确的删除连接线,导致可能出现逻辑异常。
42-
6. 修复 - 定时器清理脏数据任务
38+
2. 新增 - 对话框支持问题模糊检索提示,可自定义预设问题词库。
39+
3. 新增 - 工作流节点版本变更提示,并可以同步最新版本配置,避免存在隐藏脏数据。
40+
4. 新增 - 开放文件导入知识库接口到开源版, [点击插件文档](/docs/development/openapi/dataset/#创建一个文件集合)
41+
5. 新增 - 外部文件源知识库, [点击查看文档](/docs/course/externalfile/)
42+
6. 优化 - 插件输入的 debug 模式,支持全量参数输入渲染。
43+
7. 修复 - 插件输入默认值被清空问题。
44+
8. 修复 - 工作流删除节点的动态输入和输出时候,没有正确的删除连接线,导致可能出现逻辑异常。
45+
9. 修复 - 定时器清理脏数据任务

packages/global/core/dataset/api.d.ts (+15 -1)

@@ -26,18 +26,27 @@ export type DatasetCollectionChunkMetadataType = {
   qaPrompt?: string;
   metadata?: Record<string, any>;
 };
+
+// create collection params
 export type CreateDatasetCollectionParams = DatasetCollectionChunkMetadataType & {
   datasetId: string;
   name: string;
-  type: `${DatasetCollectionTypeEnum}`;
+  type: DatasetCollectionTypeEnum;
+
+  tags?: string[];
+
   fileId?: string;
   rawLink?: string;
+  externalFileId?: string;
+
+  externalFileUrl?: string;
   rawTextLength?: number;
   hashRawText?: string;
 };
 
 export type ApiCreateDatasetCollectionParams = DatasetCollectionChunkMetadataType & {
   datasetId: string;
+  tags?: string[];
 };
 export type TextCreateDatasetCollectionParams = ApiCreateDatasetCollectionParams & {
   name: string;
@@ -58,6 +67,11 @@ export type CsvTableCreateDatasetCollectionParams = {
   parentId?: string;
   fileId: string;
 };
+export type ExternalFileCreateDatasetCollectionParams = ApiCreateDatasetCollectionParams & {
+  externalFileId?: string;
+  externalFileUrl: string;
+  filename?: string;
+};
 
 /* ================= data ===================== */
 export type PgSearchRawType = {
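As a quick illustration of the new type, the object below is a payload that would satisfy `ExternalFileCreateDatasetCollectionParams`, which extends `ApiCreateDatasetCollectionParams` and, through it, the chunk-metadata fields. The import path and the inherited chunk fields are assumptions based on the surrounding diff, not verified against the full file.

```ts
import type { ExternalFileCreateDatasetCollectionParams } from '@fastgpt/global/core/dataset/api.d';

// Hypothetical payload; field names follow the types added in this diff.
const params: ExternalFileCreateDatasetCollectionParams = {
  datasetId: '6642d105a5e9d2b00255b27b',
  tags: ['contract', '2024'], // optional collection tags introduced in this commit
  externalFileUrl: 'https://files.example.com/tmp/report.pdf', // required; may be a temporary link
  externalFileId: 'report-2024-05', // optional permanent ID in the external system
  filename: 'Q2 report.pdf', // optional; otherwise derived from the URL
  chunkSize: 512, // assumed to come from DatasetCollectionChunkMetadataType
  qaPrompt: ''
};
```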

packages/global/core/dataset/collection/constants.ts (+1 -1)

@@ -1,4 +1,4 @@
-/* sourceId = prefix-id; id=fileId;link url;externalId */
+/* sourceId = prefix-id; id=fileId;link url;externalFileId */
 export enum CollectionSourcePrefixEnum {
   local = 'local',
   link = 'link',
New file (path not shown in this view) (+14)

@@ -0,0 +1,14 @@
+import { CollectionWithDatasetType, DatasetCollectionSchemaType } from '../type';
+
+export const getCollectionSourceData = (
+  collection?: CollectionWithDatasetType | DatasetCollectionSchemaType
+) => {
+  return {
+    sourceId:
+      collection?.fileId ||
+      collection?.rawLink ||
+      collection?.externalFileId ||
+      collection?.externalFileUrl,
+    sourceName: collection?.name || ''
+  };
+};
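A short usage sketch of the new helper: it resolves the collection's source identifier in priority order (fileId, then rawLink, then externalFileId, then externalFileUrl). The collection fragment below is hand-written for illustration and cast loosely, since the full schema type has many more fields.

```ts
const { sourceId, sourceName } = getCollectionSourceData({
  name: 'Q2 report.pdf',
  externalFileId: 'report-2024-05',
  externalFileUrl: 'https://files.example.com/tmp/report.pdf'
} as any);

console.log(sourceId); // 'report-2024-05', since externalFileId takes precedence over externalFileUrl
console.log(sourceName); // 'Q2 report.pdf'
```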

packages/global/core/dataset/constants.ts (+7 -2)

@@ -22,7 +22,7 @@ export const DatasetTypeMap = {
     collectionLabel: 'common.Website'
   },
   [DatasetTypeEnum.externalFile]: {
-    icon: 'core/dataset/commonDataset',
+    icon: 'core/dataset/externalDataset',
     label: 'External File',
     collectionLabel: 'common.File'
   }
@@ -44,9 +44,11 @@ export const DatasetStatusMap = {
 /* ------------ collection -------------- */
 export enum DatasetCollectionTypeEnum {
   folder = 'folder',
+  virtual = 'virtual',
+
   file = 'file',
   link = 'link', // one link
-  virtual = 'virtual'
+  externalFile = 'externalFile'
 }
 export const DatasetCollectionTypeMap = {
   [DatasetCollectionTypeEnum.folder]: {
@@ -55,6 +57,9 @@ export const DatasetCollectionTypeMap = {
   [DatasetCollectionTypeEnum.file]: {
     name: 'core.dataset.file'
   },
+  [DatasetCollectionTypeEnum.externalFile]: {
+    name: 'core.dataset.externalFile'
+  },
   [DatasetCollectionTypeEnum.link]: {
     name: 'core.dataset.link'
   },
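A small sketch of how the new collection type plugs into the existing map (import path assumed from the package layout):

```ts
import {
  DatasetCollectionTypeEnum,
  DatasetCollectionTypeMap
} from '@fastgpt/global/core/dataset/constants';

// Resolves to the i18n key added in this commit: 'core.dataset.externalFile'
const label = DatasetCollectionTypeMap[DatasetCollectionTypeEnum.externalFile].name;
console.log(label);
```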

packages/global/core/dataset/read.ts (-2)

@@ -1,7 +1,5 @@
 import { DatasetSourceReadTypeEnum, ImportDataSourceEnum } from './constants';
 
-export const rawTextBackupPrefix = 'index,content';
-
 export const importType2ReadType = (type: ImportDataSourceEnum) => {
   if (type === ImportDataSourceEnum.csvTable || type === ImportDataSourceEnum.fileLocal) {
     return DatasetSourceReadTypeEnum.fileLocal;

packages/global/core/dataset/type.d.ts (+5 -3)

@@ -41,7 +41,7 @@ export type DatasetCollectionSchemaType = {
   datasetId: string;
   parentId?: string;
   name: string;
-  type: `${DatasetCollectionTypeEnum}`;
+  type: DatasetCollectionTypeEnum;
   createTime: Date;
   updateTime: Date;
 
@@ -50,13 +50,15 @@ export type DatasetCollectionSchemaType = {
   chunkSplitter?: string;
   qaPrompt?: string;
 
-  sourceId?: string; // relate CollectionSourcePrefixEnum
+  tags?: string[];
+
   fileId?: string; // local file id
   rawLink?: string; // link url
+  externalFileId?: string; //external file id
 
   rawTextLength?: number;
   hashRawText?: string;
-  externalSourceUrl?: string; // external import url
+  externalFileUrl?: string; // external import url
   metadata?: {
     webPageSelector?: string;
     relatedImgId?: string; // The id of the associated image collections

packages/global/core/dataset/utils.ts (+5 -5)

@@ -3,7 +3,7 @@ import { getFileIcon } from '../../common/file/icon';
 import { strIsLink } from '../../common/string/tools';
 
 export function getCollectionIcon(
-  type: `${DatasetCollectionTypeEnum}` = DatasetCollectionTypeEnum.file,
+  type: DatasetCollectionTypeEnum = DatasetCollectionTypeEnum.file,
   name = ''
 ) {
   if (type === DatasetCollectionTypeEnum.folder) {
@@ -24,13 +24,13 @@ export function getSourceNameIcon({
   sourceName: string;
   sourceId?: string;
 }) {
-  if (strIsLink(sourceId)) {
-    return 'common/linkBlue';
-  }
-  const fileIcon = getFileIcon(sourceName, '');
+  const fileIcon = getFileIcon(decodeURIComponent(sourceName), '');
   if (fileIcon) {
     return fileIcon;
   }
+  if (strIsLink(sourceId)) {
+    return 'common/linkBlue';
+  }
 
   return 'file/fill/manual';
 }
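The reordering in `getSourceNameIcon` changes which icon wins for link sources whose URL ends in a recognizable file extension, and the source name is now URL-decoded before the lookup. A sketch of the effect, with the import path assumed and `getFileIcon`'s exact return value treated as an assumption:

```ts
import { getSourceNameIcon } from '@fastgpt/global/core/dataset/utils';

const icon = getSourceNameIcon({
  sourceName: 'report%20v2.pdf', // decoded to 'report v2.pdf' before the icon lookup
  sourceId: 'https://files.example.com/report%20v2.pdf'
});
// Before this change the link check ran first and returned 'common/linkBlue'.
// Now the extension-based file icon wins (assuming getFileIcon matches .pdf);
// 'common/linkBlue' is used only when no file icon is found, and
// 'file/fill/manual' remains the final fallback.
```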

packages/global/package.json (+1 -1)

@@ -10,7 +10,7 @@
     "js-yaml": "^4.1.0",
     "jschardet": "3.1.1",
     "nanoid": "^4.0.1",
-    "next": "13.5.2",
+    "next": "14.2.3",
     "openai": "4.28.0",
     "openapi-types": "^12.1.3",
     "timezones-list": "^3.0.2"

packages/service/common/file/gridfs/controller.ts (+2 -2)

@@ -7,7 +7,7 @@ import { MongoFileSchema } from './schema';
 import { detectFileEncoding } from '@fastgpt/global/common/file/tools';
 import { CommonErrEnum } from '@fastgpt/global/common/error/code/common';
 import { MongoRawTextBuffer } from '../../buffer/rawText/schema';
-import { readFileRawContent } from '../read/utils';
+import { readRawContentByFileBuffer } from '../read/utils';
 import { PassThrough } from 'stream';
 
 export function getGFSCollection(bucket: `${BucketNameEnum}`) {
@@ -196,7 +196,7 @@ export const readFileContentFromMongo = async ({
     });
   })();
 
-  const { rawText } = await readFileRawContent({
+  const { rawText } = await readRawContentByFileBuffer({
     extension,
     isQAImport,
     teamId,
