Presigned URLs can become invalid in LakeFSLoader.load when Unstructured is slow #29130

Open
5 tasks done
dhdaines opened this issue Jan 10, 2025 · 2 comments
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature

@dhdaines

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

Unfortunately this requires a working LakeFS setup, so it isn't straightforward to reproduce (and it may also depend on your specific LakeFS configuration). Basically, call ls_objects with presign=True, wait a while, and then try to access one of the URLs, which is what LakeFSLoader does internally.

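import os
import time

import dotenv
import requests
from langchain_community.document_loaders.lakefs import LakeFSClient

# Read the LakeFS settings from a .env file.
# The environment variable names below are placeholders, not a convention.
dotenv.load_dotenv()
client = LakeFSClient(lakefs_access_key=os.environ["LAKEFS_ACCESS_KEY"],
                      lakefs_secret_key=os.environ["LAKEFS_SECRET_KEY"],
                      lakefs_endpoint=os.environ["LAKEFS_ENDPOINT"])
repo = os.environ["LAKEFS_REPO"]
ref = os.environ["LAKEFS_REF"]
path = os.environ.get("LAKEFS_PATH", "")

objs = client.ls_objects(repo, ref, path, presign=True)
path, url = objs[0]
response = requests.get(url)
response.raise_for_status()  # succeeds while the presigned URL is fresh
time.sleep(1200)  # wait longer than the URL's lifetime (X-Amz-Expires=900 here)
response = requests.get(url)
response.raise_for_status()  # fails with 403 Forbidden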

Error Message and Stack Trace (if applicable)

Traceback (most recent call last):
  File "lakefs_bug.py", line 24, in <module>
    response.raise_for_status()
  File ".venv/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: http://172.17.0.1:9200/mybucket/data/gb80naqqr9gs72ue9qu0/ctvfgfaqr9gs72ue9qv0?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=minioadmin%2F20250110%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250110T151448Z&X-Amz-Expires=900&X-Amz-SignedHeaders=host&x-id=GetObject&X-Amz-Signature=89f8f515de3ee05589a4d7880b5a4a21957d2e3498ed702c436a9564fe8db285

Description

When loading many documents, or large ones, with LakeFSLoader, quite a bit of time frequently passes between the call to ls_objects at line 109 and the call to requests.get on line 172.

This is because Unstructured can be very slow (insisting on "repairing" and OCRing perfectly good PDFs, for instance). The result is that the presigned URLs that LakeFS gives us (in the call to ls_objects) are no longer valid once we get around to actually accessing them.

System Info

System Information

OS: Linux
OS Version: #52~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Dec 9 15:00:52 UTC 2
Python Version: 3.10.12 (main, Nov 6 2024, 20:22:13) [GCC 11.4.0]

Package Information

langchain_core: 0.3.25
langchain: 0.3.12
langchain_community: 0.3.12
langsmith: 0.2.3
langchain_nomic: 0.1.4
langchain_ollama: 0.2.0
langchain_text_splitters: 0.3.3
langchain_unstructured: 0.1.6
langgraph_sdk: 0.1.47

Optional packages not installed

langserve

Other Dependencies

aiohttp: 3.11.10
async-timeout: 4.0.3
dataclasses-json: 0.6.7
httpx: 0.27.2
httpx-sse: 0.4.0
jsonpatch: 1.33
langsmith-pyo3: Installed. No version info available.
nomic: 3.3.4
numpy: 1.26.4
ollama: 0.4.4
onnxruntime: 1.20.1
orjson: 3.10.12
packaging: 24.2
pillow: 10.4.0
pydantic: 2.9.2
pydantic-settings: 2.7.0
PyYAML: 6.0.2
requests: 2.32.3
requests-toolbelt: 1.0.0
SQLAlchemy: 2.0.35
tenacity: 8.5.0
typing-extensions: 4.12.2
unstructured-client: 0.27.0
unstructured[all-docs]: Installed. No version info available.

@dosubot dosubot bot added the 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature label Jan 10, 2025
@keenborder786
Contributor

@dhdaine check if the above PR solves your issue.

@dhdaines
Author

> @dhdaine check if the above PR solves your issue.

Thanks! That is indeed one way to deal with the problem, though it isn't really guaranteed to solve it and might use an enormous amount of memory.

In reality I'm not using LakeFSLoader directly, in part because UnstructuredFileLoader is obsolete/deprecated, but noticed this in a locally customized version of it. My solution (which possibly uses a lot of disk space but is more memory/CPU friendly) is to simply download all the objects at once before processing them.
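A rough sketch of that download-first approach, assuming objs is the list of (path, url) pairs returned by ls_objects with presign=True. The names download_all and fetch_to_file are hypothetical, and the fetch step uses the standard library's urllib rather than requests only to keep the sketch dependency-free. This is an illustration of the idea, not the code actually used.

```python
import shutil
import urllib.request
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def fetch_to_file(url: str, target: Path) -> None:
    """Stream one presigned URL to disk without holding it all in memory."""
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)


def download_all(
    objs: Iterable[Tuple[str, str]],
    dest_dir: str,
    fetch: Callable[[str, Path], None] = fetch_to_file,
) -> List[Tuple[str, Path]]:
    """Download every (path, url) pair right away, while the URLs are fresh.

    Returns (original path, local file) pairs so that slow processing
    can run afterwards against the local copies.
    """
    dest = Path(dest_dir)
    local_files = []
    for path, url in objs:
        # Strip leading slashes so the object path stays inside dest_dir.
        target = dest / path.lstrip("/")
        target.parent.mkdir(parents=True, exist_ok=True)
        fetch(url, target)
        local_files.append((path, target))
    return local_files
```

The trade-off is the one described above: every object is written to disk up front, but nothing large is held in memory, and slow parsing can no longer outlive the presigned URLs. With a very large listing, the downloads themselves could still run past the URL lifetime.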
