♻️ refactor: Add reject pattern for browserless to boost crawl performance #6996

Merged
merged 3 commits into from
Mar 25, 2025

Conversation

cy948
Contributor

@cy948 cy948 commented Mar 16, 2025

💻 Change Type

  • ✨ feat
  • 🐛 fix
  • ♻️ refactor
  • 💄 style
  • 👷 build
  • ⚡️ perf
  • 📝 docs
  • 🔨 chore

🔀 Description of Change

  • packages/web-crawler/src/crawImpl/browserless.ts: Pass browserless a rule for requests to skip while crawling, such as images and audio, so that it returns faster.

📝 Additional Information

.*\\.(?!(html|css|js|json|xml|webmanifest|txt|md)(\\?|#|$))[\\w-]+(?:[\\?#].*)?$
  • Settings received by Browserless:
{
  gotoOptions: { waitUntil: 'networkidle2' },
  rejectRequestPattern: [
    '.*\\.(?!(html|css|js|json|xml|webmanifest|txt|md)(\\?|#|$))[\\w-]+(?:[\\?#].*)?$'
  ],
  url: 'https://www.cloudflare.com/zh-cn/'
}
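
As a rough illustration of how such a request body might be assembled, here is a minimal sketch. The `ContentRequestBody` type and the `buildContentRequest` helper are hypothetical names used only for this example and are not taken from the PR; only the field names and values come from the settings dump above.

```typescript
// Shape of the JSON body, mirroring the settings dump above.
interface ContentRequestBody {
  gotoOptions: { waitUntil: 'networkidle2' };
  rejectRequestPattern: string[];
  url: string;
}

// Hypothetical helper (not from the PR): builds the body for one page.
function buildContentRequest(url: string, rejectPatterns: string[]): ContentRequestBody {
  return {
    gotoOptions: { waitUntil: 'networkidle2' },
    rejectRequestPattern: rejectPatterns,
    url,
  };
}

const body = buildContentRequest('https://www.cloudflare.com/zh-cn/', [
  '.*\\.(?!(html|css|js|json|xml|webmanifest|txt|md)(\\?|#|$))[\\w-]+(?:[\\?#].*)?$',
]);

// Serialized form, as it would be sent in a JSON request.
console.log(JSON.stringify(body, null, 2));
```

Keeping the patterns as a plain string array means they survive JSON serialization unchanged, which matters for a value this full of backslashes.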

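JS has already processed the \ characters, but the environment variable here still needs to be wrapped in "" so that \ is not treated as an escape.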
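
The backslash behaviour on the JavaScript side can be checked in isolation. The sketch below is not from the PR; it only shows that a doubled backslash in a string literal is a single backslash at runtime, and that serializing the string doubles it again, which is consistent with the `\\` seen in the settings dump.

```typescript
// In source code the literal is written with two backslashes...
const escapedDot = '\\.';

// ...but at runtime the string holds one backslash followed by a dot.
console.log(escapedDot.length);

// Serializing the string escapes the backslash again for display or transport.
console.log(JSON.stringify(escapedDot));

// As a regular expression it matches a literal dot only.
const dotRegex = new RegExp(escapedDot);
console.log(dotRegex.test('a.b'), dotRegex.test('ab'));
```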

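Taking this web page as an example, fetching images generally adds latency (highlighted in red). With the filter rule above enabled, browserless actively aborts image requests (the "Aborting request" line highlighted in green), which saves request time.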

-   browserless.io:ChromiumContentPostRoute:trace 172.17.0.1 GET: https://cf-assets.www.cloudflare.com/slt3lc6tev37/R68bUicgjCMdgEBXBM1ey/c350338abc119640172d8876202ffbf8/Webinar-106x165-thumbnail-card.svg +0ms
+   browserless.io:ChromiumContentPostRoute:debug 172.17.0.1 Aborting request GET: https://cf-assets.www.cloudflare.com/slt3lc6tev37/R68bUicgjCMdgEBXBM1ey/c350338abc119640172d8876202ffbf8/Webinar-106x165-thumbnail-card.svg +1ms
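
To see which URLs the rule would abort, the pattern can be tried against a few sample URLs. This is a sketch under one assumption: that the pattern is compiled as a regular expression and tested against the full request URL. The `isRejected` helper and the `example.com` URLs are made up for illustration; the SVG URL is the one from the log above.

```typescript
// Pattern copied from the settings above. In a string literal each "\\"
// is a single backslash at runtime.
const REJECT_PATTERN =
  '.*\\.(?!(html|css|js|json|xml|webmanifest|txt|md)(\\?|#|$))[\\w-]+(?:[\\?#].*)?$';

// Assumption: the pattern is tested against the full request URL.
function isRejected(requestUrl: string): boolean {
  return new RegExp(REJECT_PATTERN).test(requestUrl);
}

const sampleUrls: string[] = [
  'https://cf-assets.www.cloudflare.com/slt3lc6tev37/R68bUicgjCMdgEBXBM1ey/c350338abc119640172d8876202ffbf8/Webinar-106x165-thumbnail-card.svg',
  'https://example.com/photo.png?w=100',
  'https://www.cloudflare.com/zh-cn/',
  'https://example.com/app.js',
  'https://example.com/style.css?v=2',
];

for (const sampleUrl of sampleUrls) {
  console.log(isRejected(sampleUrl) ? 'abort' : 'allow', sampleUrl);
}
```

The negative lookahead is what lets the listed text-like extensions through: a URL is rejected only when its final `.ext` segment is not one of them.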

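Also, the rule only stops browserless from downloading the media; the media URLs themselves are still present. Image URLs therefore still appear in the raw text returned for LLM processing, so the final experience is unaffected.
This is a raw text response obtained with the current environment variable setting: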

随时随地连接、保护和构建 让全球连通云为您服务。  ![checkbox - orange](https://cf-assets.www.cloudflare.com/slt3lc6tev37/5XA6P5ZUYwcjq9LZGBbAcj/1517e2b34ef3bf213fca28586ae33170/ease-of-use-toggle.svg)控制 对本地、公共云、SaaS 和互联网上的 IT 和安全重获可见性和控制 ![](https://cf-assets.www.cloudflare.com/slt3lc6tev37/74GDwwyKnKfYYz1QQEQh1P/c7232082d74a2cb16d2197596662f593/security-shield-protection-2.svg)安全 改善安全和韧性,并减少攻击面、供应商数量和工具扩散
# The image URLs extracted from the raw HTML are still present
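
One way to confirm that image links survive is to pull the Markdown image URLs out of the returned text. The sketch below is illustrative only: `extractImageUrls` is a hypothetical helper, and the sample string is a shortened version of the response above with the surrounding prose replaced by short English placeholders.

```typescript
// Hypothetical helper: collects the URL of every Markdown image, ![alt](url).
function extractImageUrls(markdown: string): string[] {
  const imagePattern = /!\[[^\]]*\]\(([^)\s]+)\)/g;
  const urls: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = imagePattern.exec(markdown)) !== null) {
    urls.push(match[1]);
  }
  return urls;
}

// Shortened sample; the two image links are copied from the response above.
const rawText =
  'Intro ![checkbox - orange](https://cf-assets.www.cloudflare.com/slt3lc6tev37/5XA6P5ZUYwcjq9LZGBbAcj/1517e2b34ef3bf213fca28586ae33170/ease-of-use-toggle.svg)Control ' +
  '![](https://cf-assets.www.cloudflare.com/slt3lc6tev37/74GDwwyKnKfYYz1QQEQh1P/c7232082d74a2cb16d2197596662f593/security-shield-protection-2.svg)Security';

const imageUrls = extractImageUrls(rawText);
console.log(imageUrls);
```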


vercel bot commented Mar 16, 2025

@cy948 is attempting to deploy a commit to the LobeChat Desktop Team on Vercel.

A member of the Team first needs to authorize it.

@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. ⚡️ Performance Performance issue | 性能问题 labels Mar 16, 2025
@lobehubbot
Member

👍 @cy948

Thank you for raising your pull request and contributing to our Community
Please make sure you have followed our contributing guidelines. We will review it as soon as possible.
If you encounter any problems, please feel free to connect with us.

Contributor

gru-agent bot commented Mar 16, 2025

TestGru Assignment

Summary

Link CommitId Status Reason
Detail 9c194be ✅ Finished

Files

File Pull Request
packages/web-crawler/src/crawImpl/browserless.ts ❌ Failure (I failed to write the unit tests for the file.)

Tip

You can @gru-agent and leave your feedback. TestGru will make adjustments based on your input


codecov bot commented Mar 16, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.13%. Comparing base (375f924) to head (c176878).
Report is 9 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff            @@
##             main    #6996     +/-   ##
=========================================
  Coverage   90.13%   90.13%             
=========================================
  Files         740      740             
  Lines       52896    52899      +3     
  Branches     5022     3260   -1762     
=========================================
+ Hits        47678    47681      +3     
  Misses       5218     5218             
Flag Coverage Δ
app 90.13% <100.00%> (+<0.01%) ⬆️
server ?

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@arvinxx
Contributor

arvinxx commented Mar 17, 2025

I don't think this needs to be an environment variable; we can just block these requests directly.

@cy948
Contributor Author

cy948 commented Mar 17, 2025

I don't think this needs to be an environment variable; we can just block these requests directly.

Making it an environment variable is mainly so that a block list can be added more easily later.

@arvinxx
Contributor

arvinxx commented Mar 19, 2025

Making it an environment variable is mainly so that a block list can be added more easily later.

In that case, shouldn't it be a whitelist? Block everything by default and only let through what is on the whitelist?
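
The whitelist idea raised here could be sketched as a helper that derives the reject pattern from a list of allowed extensions. This is one possible reading of the suggestion, not the merged implementation; `buildRejectPattern` is a hypothetical name, and the sketch assumes the extensions are plain alphanumeric strings that need no regex escaping.

```typescript
// Hypothetical helper: reject every URL ending in ".ext" unless ext is allowed.
// Assumes extensions contain only letters and digits (no regex escaping done).
function buildRejectPattern(allowedExtensions: string[]): string {
  const allowed = allowedExtensions.join('|');
  return `.*\\.(?!(${allowed})(\\?|#|$))[\\w-]+(?:[\\?#].*)?$`;
}

// The extension list used in this PR reproduces the pattern from the description.
const defaultAllowed = ['html', 'css', 'js', 'json', 'xml', 'webmanifest', 'txt', 'md'];
const defaultPattern = buildRejectPattern(defaultAllowed);
console.log(defaultPattern);

// Adding an extension to the list lets matching requests through again.
const withSvg = new RegExp(buildRejectPattern([...defaultAllowed, 'svg']));
console.log(withSvg.test('https://example.com/logo.svg'));
```

With this shape, everything with a file extension is blocked by default and only the listed extensions pass, so extending the list is a one-word change.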

@cy948 cy948 force-pushed the refactor/add-browserless-reject-pattern branch from 9c194be to 4e4945f Compare March 21, 2025 11:19

vercel bot commented Mar 25, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
lobe-chat-preview ✅ Ready (Inspect) Visit Preview 💬 Add feedback Mar 25, 2025 2:42am

@arvinxx arvinxx merged commit 184a1ba into lobehub:main Mar 25, 2025
13 of 15 checks passed
@lobehubbot
Member

❤️ Great PR @cy948 ❤️

The growth of the project is inseparable from user feedback and contributions; thank you for your contribution! If you are interested in the LobeHub developer community, please join our Discord and then DM @arvinxx or @canisminor1990. They will invite you to our private developer channel, where we discuss lobe-chat development and share AI news from around the world.

github-actions bot pushed a commit that referenced this pull request Mar 25, 2025
### [Version&nbsp;1.74.9](v1.74.8...v1.74.9)
<sup>Released on **2025-03-25**</sup>

#### ♻ Code Refactoring

- **misc**: Add reject pattern for browserless to boost crawl performance.

<br/>

<details>
<summary><kbd>Improvements and Fixes</kbd></summary>

#### Code refactoring

* **misc**: Add reject pattern for browserless to boost crawl performance, closes [#6996](#6996) ([184a1ba](184a1ba))

</details>

<div align="right">

[![](https://img.shields.io/badge/-BACK_TO_TOP-151515?style=flat-square)](#readme-top)

</div>
@lobehubbot
Member

🎉 This PR is included in version 1.74.9 🎉

The release is available on:

Your semantic-release bot 📦🚀

github-actions bot pushed a commit to bentwnghk/lobe-chat that referenced this pull request Mar 25, 2025
### [Version&nbsp;1.115.3](v1.115.2...v1.115.3)
<sup>Released on **2025-03-25**</sup>

#### ♻ Code Refactoring

- **misc**: Add reject pattern for browserless to boost crawl performance.

<br/>

<details>
<summary><kbd>Improvements and Fixes</kbd></summary>

#### Code refactoring

* **misc**: Add reject pattern for browserless to boost crawl performance, closes [lobehub#6996](https://github.com/bentwnghk/lobe-chat/issues/6996) ([184a1ba](184a1ba))

</details>

<div align="right">

[![](https://img.shields.io/badge/-BACK_TO_TOP-151515?style=flat-square)](#readme-top)

</div>
Labels
⚡️ Performance Performance issue | 性能问题 released size:XS This PR changes 0-9 lines, ignoring generated files.
3 participants