Merge branch 'main' of https://github.com/lobehub/lobe-chat
actions-user committed Mar 3, 2025
2 parents 5a5cb52 + ccb56cd commit cf9509d
Showing 23 changed files with 825 additions and 61 deletions.
75 changes: 75 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,81 @@

# Changelog

### [Version 1.68.3](https://github.com/lobehub/lobe-chat/compare/v1.68.2...v1.68.3)

<sup>Released on **2025-03-03**</sup>

#### 🐛 Bug Fixes

- **misc**: Improve url rules.

<br/>

<details>
<summary><kbd>Improvements and Fixes</kbd></summary>

#### What's fixed

- **misc**: Improve url rules, closes [#6669](https://github.com/lobehub/lobe-chat/issues/6669) ([5ee59e3](https://github.com/lobehub/lobe-chat/commit/5ee59e3))

</details>

<div align="right">

[![](https://img.shields.io/badge/-BACK_TO_TOP-151515?style=flat-square)](#readme-top)

</div>

### [Version 1.68.2](https://github.com/lobehub/lobe-chat/compare/v1.68.1...v1.68.2)

<sup>Released on **2025-03-03**</sup>

#### 💄 Styles

- **misc**: Add built-in web search support for Wenxin & Hunyuan.

<br/>

<details>
<summary><kbd>Improvements and Fixes</kbd></summary>

#### Styles

- **misc**: Add built-in web search support for Wenxin & Hunyuan, closes [#6617](https://github.com/lobehub/lobe-chat/issues/6617) ([dfd1f09](https://github.com/lobehub/lobe-chat/commit/dfd1f09))

</details>

<div align="right">

[![](https://img.shields.io/badge/-BACK_TO_TOP-151515?style=flat-square)](#readme-top)

</div>

### [Version 1.68.1](https://github.com/lobehub/lobe-chat/compare/v1.68.0...v1.68.1)

<sup>Released on **2025-03-03**</sup>

#### 🐛 Bug Fixes

- **misc**: Fix page crash with crawler error.

<br/>

<details>
<summary><kbd>Improvements and Fixes</kbd></summary>

#### What's fixed

- **misc**: Fix page crash with crawler error, closes [#6662](https://github.com/lobehub/lobe-chat/issues/6662) ([0c24251](https://github.com/lobehub/lobe-chat/commit/0c24251))

</details>

<div align="right">

[![](https://img.shields.io/badge/-BACK_TO_TOP-151515?style=flat-square)](#readme-top)

</div>

## [Version 1.68.0](https://github.com/lobehub/lobe-chat/compare/v1.67.2...v1.68.0)

<sup>Released on **2025-03-03**</sup>
21 changes: 21 additions & 0 deletions changelog/v1.json
@@ -1,4 +1,25 @@
[
{
"children": {
"fixes": ["Improve url rules."]
},
"date": "2025-03-03",
"version": "1.68.3"
},
{
"children": {
"improvements": ["Add build-in web search support for Wenxin & Hunyuan."]
},
"date": "2025-03-03",
"version": "1.68.2"
},
{
"children": {
"fixes": ["Fix page crash with crawler error."]
},
"date": "2025-03-03",
"version": "1.68.1"
},
{
"children": {
"features": ["Add new model provider PPIO."],
2 changes: 1 addition & 1 deletion package.json
@@ -1,6 +1,6 @@
{
"name": "@lobehub/chat",
"version": "1.68.0",
"version": "1.68.3",
"description": "Lobe Chat - an open-source, high-performance chatbot framework that supports speech synthesis, multimodal, and extensible Function Call plugin system. Supports one-click free deployment of your private ChatGPT/LLM web application.",
"keywords": [
"framework",
67 changes: 47 additions & 20 deletions packages/web-crawler/README.md
Original file line number Diff line number Diff line change
@@ -1,34 +1,61 @@
# @lobechat/web-crawler

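LobeChat's built-in web crawling module, used to extract structured content from web pages and convert it to Markdown format.
LobeChat's built-in web crawling module for intelligent extraction of web content and conversion to Markdown format.

## 📝 Overview
## 📝 Introduction

`@lobechat/web-crawler` is an internal component of the LobeChat project, dedicated to fetching and processing web page content. It intelligently extracts meaningful content from all kinds of web pages, strips out distracting elements such as ads and navigation bars, and converts the result into well-structured Markdown text.
`@lobechat/web-crawler` is a core component of LobeChat responsible for intelligent web content crawling and processing. It extracts valuable content from various webpages, filters out distracting elements, and generates structured Markdown text.

## 🔍 Main Features
## 🛠️ Core Features

- **Web Content Fetching**: Retrieves raw HTML content from all kinds of websites
- **Intelligent Content Extraction**: Uses Mozilla's Readability algorithm to identify the main content of a page
- **Fallback Mechanism**: When standard fetching fails, automatically switches to the Browserless.io service for rendered crawling (requires configuring the environment variables yourself)
- **Markdown Conversion**: Converts the extracted HTML content into Markdown that is easy for AI to process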
- **Intelligent Content Extraction**: Identifies main content based on Mozilla Readability algorithm
- **Multi-level Crawling Strategy**: Supports multiple crawling implementations including basic crawling, Jina, and Browserless rendering
- **Custom URL Rules**: Handles specific website crawling logic through a flexible rule system
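
The multi-level strategy listed above can be pictured as trying each implementation in order until one returns content. The following is a minimal sketch under stated assumptions, not the package's actual code: the names `CrawlImpl`, `CrawlOutcome`, and `crawlWithFallback` are hypothetical, and only the implementation names `'naive'`, `'jina'`, and `'browserless'` come from this README.

```typescript
// Hypothetical types and function names; only the three implementation
// names below are taken from the README.
type CrawlImplName = 'naive' | 'jina' | 'browserless';

// An implementation resolves with Markdown, or undefined when nothing usable was found.
type CrawlImpl = (url: string) => Promise<string | undefined>;

interface CrawlOutcome {
  // The implementation that produced the content, or the last one tried on failure.
  crawler: CrawlImplName | undefined;
  content?: string;
  errorMessage?: string;
}

async function crawlWithFallback(
  url: string,
  order: CrawlImplName[],
  impls: Record<CrawlImplName, CrawlImpl>,
): Promise<CrawlOutcome> {
  let lastCrawler: CrawlImplName | undefined;
  let lastError: string | undefined;

  for (const name of order) {
    lastCrawler = name;
    try {
      const content = await impls[name](url);
      // The first implementation that yields non-empty content wins.
      if (content) return { crawler: name, content };
      lastError = 'empty content';
    } catch (error) {
      // Remember the failure and fall through to the next implementation.
      lastError = error instanceof Error ? error.message : String(error);
    }
  }

  // Every implementation failed (or none was configured).
  return { crawler: lastCrawler, errorMessage: lastError ?? 'no implementation configured' };
}
```

Reporting the last attempted implementation on failure is one possible design choice; it keeps the outcome informative even when no content was retrieved.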

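## 🛠️ Technical Implementation
## 🤝 Contribution

This module mainly relies on the following technologies:
Web structures are diverse and complex. We welcome community contributions for specific website crawling rules. You can participate in improvements through:

- **@mozilla/readability**: Provides a powerful content extraction algorithm
- **happy-dom**: A lightweight server-side DOM implementation
- **node-html-markdown**: An efficient HTML-to-Markdown conversion tool
### How to Contribute URL Rules

## 🤝 Contributing Improvements
1. Add new rules to the [urlRules.ts](https://github.com/lobehub/lobe-chat/blob/main/packages/web-crawler/src/urlRules.ts) file
2. Rule example:

Because web page structures are diverse and complex, content extraction can run into various challenges. If you find that crawling works poorly for certain websites, you are welcome to help improve it in the following ways: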
```typescript
// Example: handling specific websites
const url = [
// ... other URL matching rules
{
// URL matching pattern, supports regex
urlPattern: 'https://example.com/articles/(.*)',

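1. Submit the specific problematic URLs and the expected output
2. Share your experience handling specific types of websites
3. Propose targeted algorithm or configuration adjustments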
// Optional: URL transformation, redirects to an easier-to-crawl version
urlTransform: 'https://example.com/print/$1',

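## 📌 Notes
// Optional: specify crawling implementation, supports 'naive', 'jina', and 'browserless'
impls: ['naive', 'jina', 'browserless'],

This is an internal module of LobeHub (`"private": true`) and is not published as a standalone package. It is designed for LobeChat's specific needs and is tightly integrated with other system components.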
// Optional: content filtering configuration
filterOptions: {
// Whether to enable Readability algorithm for filtering distracting elements
enableReadability: true,
// Whether to convert to plain text
pureText: false,
},
},
];
```
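
To make the `urlPattern` / `urlTransform` pair concrete, here is a rough sketch of how a rule like the one above might be applied to an incoming URL. It assumes `urlPattern` is compiled as a regular expression and that `$1`, `$2`, ... in `urlTransform` refer to its capture groups. The `UrlRule` interface and `applyUrlRule` helper are hypothetical, and the real logic in `urlRules.ts` may differ.

```typescript
// Hypothetical rule shape, limited to the fields shown in the example above.
interface UrlRule {
  urlPattern: string;
  urlTransform?: string;
  impls?: string[];
}

interface RuleMatch {
  rule?: UrlRule;
  transformedUrl?: string;
}

// Returns the first matching rule and, if the rule defines a transform,
// the rewritten URL with $1, $2, ... replaced by the captured groups.
function applyUrlRule(url: string, rules: UrlRule[]): RuleMatch {
  for (const rule of rules) {
    const match = new RegExp(rule.urlPattern).exec(url);
    if (!match) continue;

    if (!rule.urlTransform) return { rule };

    const transformedUrl = rule.urlTransform.replace(
      /\$(\d+)/g,
      (_placeholder: string, index: string) => match[Number(index)] ?? '',
    );
    return { rule, transformedUrl };
  }

  // No rule matched: the caller would crawl the original URL unchanged.
  return {};
}

const exampleRules: UrlRule[] = [
  {
    urlPattern: 'https://example.com/articles/(.*)',
    urlTransform: 'https://example.com/print/$1',
    impls: ['naive', 'jina', 'browserless'],
  },
];

const matched = applyUrlRule('https://example.com/articles/42', exampleRules);
console.log(matched.transformedUrl); // https://example.com/print/42
```

Because the pattern is treated as a regex, characters such as `.` match any character unless escaped, which is worth keeping in mind when writing rules for real sites.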

### Rule Submission Process

1. Fork the [LobeChat repository](https://github.com/lobehub/lobe-chat)
2. Add or modify URL rules
3. Submit a Pull Request describing:

- Target website characteristics
- Problems solved by the rule
- Test cases (example URLs)

## 📌 Note

This is an internal module of LobeHub (`"private": true`), designed specifically for LobeChat and not published as a standalone package.
61 changes: 61 additions & 0 deletions packages/web-crawler/README.zh-CN.md
@@ -0,0 +1,61 @@
# @lobechat/web-crawler

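LobeChat's built-in web crawling module for intelligent extraction of web content and conversion to Markdown format.

## 📝 Introduction

`@lobechat/web-crawler` is a core component of LobeChat responsible for intelligent crawling and processing of web content. It extracts valuable content from all kinds of web pages, filters out distracting elements, and generates structured Markdown text.

## 🛠️ Core Features

- **Intelligent Content Extraction**: Identifies the main content based on the Mozilla Readability algorithm
- **Multi-level Crawling Strategy**: Supports multiple crawling implementations, including basic crawling, Jina, and Browserless rendered crawling
- **Custom URL Rules**: Handles crawling logic for specific websites through a flexible rule system

## 🤝 Contributing

Web page structures are diverse and complex, and we welcome community contributions of crawling rules for specific websites. You can help improve the module in the following ways:

### How to Contribute URL Rules

1. Add new rules to the [urlRules.ts](https://github.com/lobehub/lobe-chat/blob/main/packages/web-crawler/src/urlRules.ts) file
2. Rule example: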

```typescript
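// Example: handling specific websites
const url = [
// ... other URL matching rules
{
// URL matching pattern, regular expressions only
urlPattern: 'https://example.com/articles/(.*)',

// Optional: URL transformation, used to redirect to an easier-to-crawl version
urlTransform: 'https://example.com/print/$1',

// Optional: specify the crawling implementation; 'naive', 'jina', and 'browserless' are supported
impls: ['naive', 'jina', 'browserless'],

// Optional: content filtering configuration
filterOptions: {
// Whether to enable the Readability algorithm for filtering distracting elements
enableReadability: true,
// Whether to convert to plain text
pureText: false,
},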
},
];
```

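### Rule Submission Process

1. Fork the [LobeChat repository](https://github.com/lobehub/lobe-chat)
2. Add or modify URL rules
3. Submit a Pull Request describing:

- Target website characteristics
- Problems solved by the rule
- Test cases (example URLs)

## 📌 Notes

This is an internal module of LobeHub (`"private": true`), designed specifically for LobeChat and not published as a standalone package.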
9 changes: 6 additions & 3 deletions packages/web-crawler/src/__test__/crawler.test.ts
@@ -80,9 +80,12 @@ describe('Crawler', () => {
});

expect(result).toEqual({
content: 'Fail to crawl the page. Error type: CrawlError, error message: Crawl failed',
errorMessage: 'Crawl failed',
errorType: 'CrawlError',
crawler: 'browserless',
data: {
content: 'Fail to crawl the page. Error type: CrawlError, error message: Crawl failed',
errorMessage: 'Crawl failed',
errorType: 'CrawlError',
},
originalUrl: 'https://example.com',
transformedUrl: undefined,
});