♻️ refactor: Add reject pattern for browserless to boost crawl performance #6996
@cy948 is attempting to deploy a commit to the LobeChat Desktop Team on Vercel. A member of the Team first needs to authorize it.
👍 @cy948 Thank you for raising your pull request and contributing to our Community
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #6996 +/- ##
=========================================
Coverage 90.13% 90.13%
=========================================
Files 740 740
Lines 52896 52899 +3
Branches 5022 3260 -1762
=========================================
+ Hits 47678 47681 +3
Misses 5218 5218
I don't think this needs to be an environment variable; we could just block these requests directly.
It was made an environment variable mainly so that a block list is easier to extend later.
In that case, shouldn't it be a whitelist instead? Block everything by default and only let through what is on the whitelist?
9c194be to 4e4945f
❤️ Great PR @cy948 ❤️ The growth of the project is inseparable from user feedback and contributions. Thanks for your contribution! If you are interested in the lobehub developer community, please join our Discord and then DM @arvinxx or @canisminor1990. They will invite you to our private developer channel, where we talk about lobe-chat development and share AI news from around the world.
### [Version 1.74.9](v1.74.8...v1.74.9)

<sup>Released on **2025-03-25**</sup>

#### ♻ Code Refactoring

- **misc**: Add reject pattern for browserless to boost crawl performance.

<br/>

<details>
<summary><kbd>Improvements and Fixes</kbd></summary>

#### Code refactoring

* **misc**: Add reject pattern for browserless to boost crawl performance, closes [#6996](#6996) ([184a1ba](184a1ba))

</details>

<div align="right">

[](#readme-top)

</div>
🎉 This PR is included in version 1.74.9 🎉 The release is available on: Your semantic-release bot 📦🚀
### [Version 1.115.3](v1.115.2...v1.115.3)

<sup>Released on **2025-03-25**</sup>

#### ♻ Code Refactoring

- **misc**: Add reject pattern for browserless to boost crawl performance.

<br/>

<details>
<summary><kbd>Improvements and Fixes</kbd></summary>

#### Code refactoring

* **misc**: Add reject pattern for browserless to boost crawl performance, closes [lobehub#6996](https://github.com/bentwnghk/lobe-chat/issues/6996) ([184a1ba](184a1ba))

</details>

<div align="right">

[](#readme-top)

</div>
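💻 Change Type
🔀 Description of Change
packages/web-crawler/src/crawImpl/browserless.ts: Pass browserless a set of rules for resources to skip while crawling, such as images and audio, so that responses come back faster.
📝 Additional Information
https://docs.browserless.io/baas/http-apis/content#rejecting-undesired-requests In testing, reject patterns proved more effective, because they can also match resource requests that carry a query string, such as those generated by Next.js.
Expression used: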
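Taking a web page as an example, fetching images generally adds latency (highlighted in red). With the filter rule above enabled, browserless proactively aborts the image requests (the `Aborting request` entries highlighted in green), which saves request time. The rule only stops browserless from downloading the media; the media URLs themselves are untouched, so image URLs still appear in the raw text returned for LLM processing and the final experience is not affected.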
This is the raw text returned with the current environment variable setting:
随时随地连接、保护和构建 让全球连通云为您服务。 控制 对本地、公共云、SaaS 和互联网上的 IT 和安全重获可见性和控制 安全 改善安全和韧性,并减少攻击面、供应商数量和工具扩散
# Image URLs extracted from the raw HTML are still present
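
The idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: it assumes the `rejectRequestPattern` field described in the linked browserless documentation, and assumes each pattern is applied as a regular expression against the request URL. The pattern `MEDIA_REJECT_PATTERN` and the helpers `buildContentRequest` and `wouldReject` are hypothetical names introduced only for this example.

```typescript
// Hypothetical pattern (not the one used in this PR): matches URLs that end in a
// common media or font extension, optionally followed by a query string or
// fragment, so assets such as `logo.png?v=1` are matched too.
const MEDIA_REJECT_PATTERN =
  "\\.(png|jpe?g|gif|webp|svg|ico|mp3|mp4|webm|woff2?)([?#].*)?$";

// Shape of the JSON body in this sketch; only the fields used here are listed.
interface ContentRequestBody {
  url: string;
  rejectRequestPattern: string[];
}

// Build the request body that would be posted to the browserless content API.
function buildContentRequest(
  url: string,
  patterns: string[] = [MEDIA_REJECT_PATTERN],
): ContentRequestBody {
  return { url, rejectRequestPattern: patterns };
}

// Local check of which request URLs the patterns would match. This only
// evaluates the regular expressions; it does not contact browserless.
function wouldReject(
  requestUrl: string,
  patterns: string[] = [MEDIA_REJECT_PATTERN],
): boolean {
  return patterns.some((p) => new RegExp(p, "i").test(requestUrl));
}

const body = buildContentRequest("https://example.com/");
console.log(JSON.stringify(body));
console.log(wouldReject("https://example.com/_next/static/media/logo.png?v=1"));
console.log(wouldReject("https://example.com/index.html"));
```

Checking patterns locally like this is a cheap way to confirm that query-suffixed asset URLs are covered before sending the body to a browserless instance. Because only the download is skipped, the asset URLs themselves would still be present in the returned HTML.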