Merge branch 'main' of https://github.com/lobehub/lobe-chat

yuanze-dev · Mar 3, 2025 · cf9509d
2 parents 5a5cb52 + ccb56cd

Showing 23 changed files with 825 additions and 61 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,6 +2,81 @@
 
 # Changelog
 
+### [Version 1.68.3](https://github.com/lobehub/lobe-chat/compare/v1.68.2...v1.68.3)
+
+<sup>Released on **2025-03-03**</sup>
+
+#### 🐛 Bug Fixes
+
+- **misc**: Improve url rules.
+
+<br/>
+
+<details>
+<summary><kbd>Improvements and Fixes</kbd></summary>
+
+#### What's fixed
+
+- **misc**: Improve url rules, closes [#6669](https://github.com/lobehub/lobe-chat/issues/6669) ([5ee59e3](https://github.com/lobehub/lobe-chat/commit/5ee59e3))
+
+</details>
+
+<div align="right">
+
+[![](https://img.shields.io/badge/-BACK_TO_TOP-151515?style=flat-square)](#readme-top)
+
+</div>
+
+### [Version 1.68.2](https://github.com/lobehub/lobe-chat/compare/v1.68.1...v1.68.2)
+
+<sup>Released on **2025-03-03**</sup>
+
+#### 💄 Styles
+
+- **misc**: Add built-in web search support for Wenxin & Hunyuan.
+
+<br/>
+
+<details>
+<summary><kbd>Improvements and Fixes</kbd></summary>
+
+#### Styles
+
+- **misc**: Add built-in web search support for Wenxin & Hunyuan, closes [#6617](https://github.com/lobehub/lobe-chat/issues/6617) ([dfd1f09](https://github.com/lobehub/lobe-chat/commit/dfd1f09))
+
+</details>
+
+<div align="right">
+
+[![](https://img.shields.io/badge/-BACK_TO_TOP-151515?style=flat-square)](#readme-top)
+
+</div>
+
+### [Version 1.68.1](https://github.com/lobehub/lobe-chat/compare/v1.68.0...v1.68.1)
+
+<sup>Released on **2025-03-03**</sup>
+
+#### 🐛 Bug Fixes
+
+- **misc**: Fix page crash with crawler error.
+
+<br/>
+
+<details>
+<summary><kbd>Improvements and Fixes</kbd></summary>
+
+#### What's fixed
+
+- **misc**: Fix page crash with crawler error, closes [#6662](https://github.com/lobehub/lobe-chat/issues/6662) ([0c24251](https://github.com/lobehub/lobe-chat/commit/0c24251))
+
+</details>
+
+<div align="right">
+
+[![](https://img.shields.io/badge/-BACK_TO_TOP-151515?style=flat-square)](#readme-top)
+
+</div>
+
 ## [Version 1.68.0](https://github.com/lobehub/lobe-chat/compare/v1.67.2...v1.68.0)
 
 <sup>Released on **2025-03-03**</sup>

diff --git a/changelog/v1.json b/changelog/v1.json
@@ -1,4 +1,25 @@
 [
+  {
+    "children": {
+      "fixes": ["Improve url rules."]
+    },
+    "date": "2025-03-03",
+    "version": "1.68.3"
+  },
+  {
+    "children": {
+      "improvements": ["Add built-in web search support for Wenxin & Hunyuan."]
+    },
+    "date": "2025-03-03",
+    "version": "1.68.2"
+  },
+  {
+    "children": {
+      "fixes": ["Fix page crash with crawler error."]
+    },
+    "date": "2025-03-03",
+    "version": "1.68.1"
+  },
   {
     "children": {
       "features": ["Add new model provider PPIO."],

diff --git a/package.json b/package.json
@@ -1,6 +1,6 @@
 {
   "name": "@lobehub/chat",
-  "version": "1.68.0",
+  "version": "1.68.3",
   "description": "Lobe Chat - an open-source, high-performance chatbot framework that supports speech synthesis, multimodal, and extensible Function Call plugin system. Supports one-click free deployment of your private ChatGPT/LLM web application.",
   "keywords": [
     "framework",

diff --git a/packages/web-crawler/README.md b/packages/web-crawler/README.md
@@ -1,34 +1,61 @@
 # @lobechat/web-crawler
 
-LobeChat's built-in web crawling module, used to extract structured content from webpages and convert it to Markdown format.
+LobeChat's built-in web crawling module for intelligent extraction of web content and conversion to Markdown format.
 
-## 📝 Introduction
+## 📝 Introduction
 
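-`@lobechat/web-crawler` is an internal component of the LobeChat project dedicated to crawling and processing web content. It intelligently extracts meaningful content from all kinds of webpages, strips out distracting elements such as ads and navigation bars, and converts the result into well-structured Markdown text.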
+`@lobechat/web-crawler` is a core component of LobeChat responsible for intelligent web content crawling and processing. It extracts valuable content from various webpages, filters out distracting elements, and generates structured Markdown text.
 
-## 🔍 Main Features
+## 🛠️ Core Features
 
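-- **Web Content Crawling**: Fetches raw HTML content from all kinds of websites
-- **Intelligent Content Extraction**: Uses Mozilla's Readability algorithm to identify the main content of a page
-- **Fallback Mechanism**: When standard crawling fails, automatically switches to the Browserless.io service for rendered crawling (requires configuring the environment variables yourself)
-- **Markdown Conversion**: Converts the extracted HTML content into Markdown format that is easy for AI to process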
+- **Intelligent Content Extraction**: Identifies the main content using the Mozilla Readability algorithm
+- **Multi-level Crawling Strategy**: Supports multiple crawling implementations including basic crawling, Jina, and Browserless rendering
+- **Custom URL Rules**: Handles specific website crawling logic through a flexible rule system
 
-## 🛠️ Technical Implementation
+## 🤝 Contribution
 
-This module mainly relies on the following technologies:
+Webpage structures are diverse and complex, so we welcome community contributions of crawling rules for specific websites. You can help improve the module in the following ways:
 
-- **@mozilla/readability**: Provides a powerful content extraction algorithm
-- **happy-dom**: A lightweight server-side DOM implementation
-- **node-html-markdown**: An efficient HTML-to-Markdown conversion tool
+### How to Contribute URL Rules
 
-## 🤝 Contributing Improvements
+1. Add new rules to the [urlRules.ts](https://github.com/lobehub/lobe-chat/blob/main/packages/web-crawler/src/urlRules.ts) file
+2. Rule example:
 
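-Because webpage structures are diverse and complex, content extraction can run into various challenges. If you find that crawling works poorly for certain websites, you are welcome to help improve it in the following ways: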
+```typescript
+// Example: handling specific websites
+const url = [
+  // ... other URL matching rules
+  {
+    // URL matching pattern, supports regex
+    urlPattern: 'https://example.com/articles/(.*)',
 
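-1. Submit the specific problem URL and the expected output
-2. Share your experience handling specific types of websites
-3. Suggest targeted adjustments to the algorithm or configuration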
+    // Optional: URL transformation, redirects to an easier-to-crawl version
+    urlTransform: 'https://example.com/print/$1',
 
-## 📌 Notes
+    // Optional: specify crawling implementation, supports 'naive', 'jina', and 'browserless'
+    impls: ['naive', 'jina', 'browserless'],
 
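-This is an internal LobeHub module (`"private": true`) and is not published as a standalone package. It is designed for LobeChat's specific needs and is tightly integrated with other system components.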
+    // Optional: content filtering configuration
+    filterOptions: {
+      // Whether to enable Readability algorithm for filtering distracting elements
+      enableReadability: true,
+      // Whether to convert to plain text
+      pureText: false,
+    },
+  },
+];
+```
+
+### Rule Submission Process
+
+1. Fork the [LobeChat repository](https://github.com/lobehub/lobe-chat)
+2. Add or modify URL rules
+3. Submit a Pull Request describing:
+
+- Target website characteristics
+- Problems solved by the rule
+- Test cases (example URLs)
+
+## 📌 Note
+
+This is an internal module of LobeHub (`"private": true`), designed specifically for LobeChat and not published as a standalone package.
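
Note on the rule format in the README above: the diff does not show how `urlPattern` and `urlTransform` are evaluated, so the following is only a minimal sketch of one plausible reading. It assumes the pattern is compiled with `new RegExp`, that `$1`-style placeholders are filled from capture groups, and that the first matching rule wins. The `UrlRule` interface and the `matchUrlRule` helper are hypothetical names introduced for illustration, not the package's actual API.

```typescript
// Hypothetical rule shape, mirroring the fields shown in the README example.
interface UrlRule {
  urlPattern: string;
  urlTransform?: string;
  impls?: Array<'naive' | 'jina' | 'browserless'>;
  filterOptions?: { enableReadability?: boolean; pureText?: boolean };
}

// What a caller gets back once a rule has been applied (hypothetical).
interface AppliedRule {
  transformedUrl?: string;
  impls?: UrlRule['impls'];
  filterOptions?: UrlRule['filterOptions'];
}

// Hypothetical helper: returns the settings of the first rule whose pattern matches.
function matchUrlRule(url: string, rules: UrlRule[]): AppliedRule {
  for (const rule of rules) {
    const match = new RegExp(rule.urlPattern).exec(url);
    if (!match) continue;

    // Fill $1, $2, ... in the transform template from the capture groups.
    const transformedUrl = rule.urlTransform?.replace(
      /\$(\d+)/g,
      (_placeholder: string, index: string) => match[Number(index)] ?? '',
    );

    return { transformedUrl, impls: rule.impls, filterOptions: rule.filterOptions };
  }

  // No rule matched: the caller would fall back to its defaults.
  return {};
}

const exampleRules: UrlRule[] = [
  {
    urlPattern: 'https://example.com/articles/(.*)',
    urlTransform: 'https://example.com/print/$1',
    impls: ['naive', 'jina', 'browserless'],
    filterOptions: { enableReadability: true, pureText: false },
  },
];

const applied = matchUrlRule('https://example.com/articles/hello-world', exampleRules);
console.log(applied.transformedUrl); // https://example.com/print/hello-world
```

A URL that matches no rule yields an empty object here, which leaves the choice of default crawler and filters to the caller; whether the real module behaves this way is not stated in the diff.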
diff --git a/packages/web-crawler/README.zh-CN.md b/packages/web-crawler/README.zh-CN.md
@@ -0,0 +1,61 @@
+# @lobechat/web-crawler
+
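+LobeChat's built-in web crawling module, used to intelligently extract web content and convert it to Markdown format.
+
+## 📝 Introduction
+
+`@lobechat/web-crawler` is a core component of LobeChat, responsible for intelligently crawling and processing web content. It extracts valuable content from all kinds of webpages, filters out distracting elements, and generates structured Markdown text.
+
+## 🛠️ Core Features
+
+- **Intelligent Content Extraction**: Identifies the main content using the Mozilla Readability algorithm
+- **Multi-level Crawling Strategy**: Supports multiple crawling implementations, including basic crawling, Jina, and Browserless rendered crawling
+- **Custom URL Rules**: Handles site-specific crawling logic through a flexible rule system
+
+## 🤝 Contributing
+
+Webpage structures are diverse and complex, so we welcome community contributions of crawling rules for specific websites. You can help improve the module in the following ways:
+
+### How to Contribute URL Rules
+
+1. Add new rules to the [urlRules.ts](https://github.com/lobehub/lobe-chat/blob/main/packages/web-crawler/src/urlRules.ts) file
+2. Rule example: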
+
+```typescript
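+// Example: handling a specific website
+const url = [
+  // ... other URL matching rules
+  {
+    // URL matching pattern; only regular expressions are supported
+    urlPattern: 'https://example.com/articles/(.*)',
+
+    // Optional: URL transformation, used to redirect to an easier-to-crawl version
+    urlTransform: 'https://example.com/print/$1',
+
+    // Optional: specify the crawling implementations; 'naive', 'jina', and 'browserless' are supported
+    impls: ['naive', 'jina', 'browserless'],
+
+    // Optional: content filtering configuration
+    filterOptions: {
+      // Whether to enable the Readability algorithm, used to filter out distracting elements
+      enableReadability: true,
+      // Whether to convert to plain text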
+      pureText: false,
+    },
+  },
+];
+```
+
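+### Rule Submission Process
+
+1. Fork the [LobeChat repository](https://github.com/lobehub/lobe-chat)
+2. Add or modify URL rules
+3. Submit a Pull Request describing:
+
+- Characteristics of the target website
+- The problem the rule solves
+- Test cases (example URLs)
+
+## 📌 Notes
+
+This is an internal LobeHub module (`"private": true`), designed specifically for LobeChat and not published as a standalone package.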
diff --git a/packages/web-crawler/src/__test__/crawler.test.ts b/packages/web-crawler/src/__test__/crawler.test.ts
@@ -80,9 +80,12 @@ describe('Crawler', () => {
     });
 
     expect(result).toEqual({
-      content: 'Fail to crawl the page. Error type: CrawlError, error message: Crawl failed',
-      errorMessage: 'Crawl failed',
-      errorType: 'CrawlError',
+      crawler: 'browserless',
+      data: {
+        content: 'Fail to crawl the page. Error type: CrawlError, error message: Crawl failed',
+        errorMessage: 'Crawl failed',
+        errorType: 'CrawlError',
+      },
       originalUrl: 'https://example.com',
       transformedUrl: undefined,
     });
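
The updated expectation above nests the error fields under `data` and records which implementation produced them in `crawler`. A minimal sketch of a helper that builds a result of that shape follows. `toErrorResult` and the type names are hypothetical, and deriving `errorType` from `error.name` is an assumption rather than something this diff confirms.

```typescript
type CrawlerName = 'naive' | 'jina' | 'browserless';

// Error payload, matching the three fields asserted in the test.
interface CrawlErrorData {
  content: string;
  errorMessage: string;
  errorType: string;
}

// Envelope shape asserted by the updated test.
interface CrawlErrorResult {
  crawler: CrawlerName;
  data: CrawlErrorData;
  originalUrl: string;
  transformedUrl: string | undefined;
}

// Hypothetical helper: wraps a thrown error in the envelope above.
function toErrorResult(
  crawler: CrawlerName,
  error: Error,
  originalUrl: string,
  transformedUrl?: string,
): CrawlErrorResult {
  return {
    crawler,
    data: {
      content: `Fail to crawl the page. Error type: ${error.name}, error message: ${error.message}`,
      errorMessage: error.message,
      // Assumption: the error type is taken from the error's name.
      errorType: error.name,
    },
    originalUrl,
    transformedUrl,
  };
}

const crawlError = new Error('Crawl failed');
crawlError.name = 'CrawlError';

const errorResult = toErrorResult('browserless', crawlError, 'https://example.com');
console.log(errorResult.crawler, errorResult.data.errorType);
```

Keeping the payload under `data` lets callers read `crawler`, `originalUrl`, and `transformedUrl` the same way for failed and successful crawls, assuming successful results use the same envelope.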