Populate Vector Store fails with Operation Failed: Stream has ended unexpectedly #75

Closed
andytael opened this issue Jan 9, 2025 · 1 comment
Labels
duplicate This issue or pull request already exists

andytael commented Jan 9, 2025

When loading a PDF document from the web, Populate Vector Store fails with the following error:

2025-Jan-09 19:19:04 - INFO     - (modules.utilities): Response for https://docs.oracle.com/en/database/oracle/oracle-database/23/vecse/oracle-ai-vector-search-users-guide.pdf: 200
2025-Jan-09 19:19:04 - INFO     - (chunk_embed): Loading PDF from web to /tmp/tmpotjj_jne/oracle-ai-vector-search-users-guide.pdf
2025-Jan-09 19:19:04 - INFO     - (chunk_embed): Wrote /tmp/tmpotjj_jne/oracle-ai-vector-search-users-guide.pdf
2025-Jan-09 19:19:04 - INFO     - (modules.split): Loading oracle-ai-vector-search-users-guide.pdf (6270 bytes)
2025-Jan-09 19:19:04 - WARNING  - (pypdf._reader): invalid pdf header: b'<!DOC'
2025-Jan-09 19:19:04 - WARNING  - (pypdf._reader): EOF marker not found
2025-Jan-09 19:19:04 - ERROR    - (chunk_embed): Operation Failed: Stream has ended unexpectedly
Traceback (most recent call last):
  File "/app/content/split_embed.py", line 427, in main
    split_docos, _ = split.load_and_split_documents(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/modules/split.py", line 171, in load_and_split_documents
    loaded_doc = loader.load()
                 ^^^^^^^^^^^^^
  File "/opt/venv/lib64/python3.11/site-packages/langchain_core/document_loaders/base.py", line 31, in load
    return list(self.lazy_load())
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib64/python3.11/site-packages/langchain_community/document_loaders/pdf.py", line 257, in lazy_load
    yield from self.parser.parse(blob)
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib64/python3.11/site-packages/langchain_core/document_loaders/base.py", line 127, in parse
    return list(self.lazy_parse(blob))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib64/python3.11/site-packages/langchain_community/document_loaders/parsers/pdf.py", line 123, in lazy_parse
    pdf_reader = pypdf.PdfReader(pdf_file_obj, password=self.password)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib64/python3.11/site-packages/pypdf/_reader.py", line 133, in __init__
    self._initialize_stream(stream)
  File "/opt/venv/lib64/python3.11/site-packages/pypdf/_reader.py", line 155, in _initialize_stream
    self.read(stream)
  File "/opt/venv/lib64/python3.11/site-packages/pypdf/_reader.py", line 608, in read
    self._find_eof_marker(stream)
  File "/opt/venv/lib64/python3.11/site-packages/pypdf/_reader.py", line 716, in _find_eof_marker
    line = read_previous_line(stream)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib64/python3.11/site-packages/pypdf/_utils.py", line 288, in read_previous_line
    raise PdfStreamError(STREAM_TRUNCATED_PREMATURELY)
pypdf.errors.PdfStreamError: Stream has ended unexpectedly
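
The log above reports a 6270-byte file whose header is `b'<!DOC'`, which suggests the URL returned an HTML page rather than the PDF itself, even though the response status was 200. A minimal sketch of a pre-check that could run before the file is handed to the loader is below. The names `looks_like_pdf` and `describe_download` are hypothetical and are not part of this project or of pypdf; the check only looks for the `%PDF-` magic bytes near the start of the file.

```python
from pathlib import Path

# Magic bytes that mark the start of a PDF file.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(path, probe_bytes=1024):
    """Return True if the PDF magic bytes appear within the first probe_bytes.

    Hypothetical helper: it only inspects the file header and does not
    validate the rest of the document.
    """
    with open(path, "rb") as fh:
        head = fh.read(probe_bytes)
    return PDF_MAGIC in head


def describe_download(path):
    """Hypothetical helper: summarize what was actually written to disk."""
    p = Path(path)
    with open(p, "rb") as fh:
        head = fh.read(5)
    return f"{p.name}: {p.stat().st_size} bytes, starts with {head!r}"
```

Calling `looks_like_pdf(path)` right after the download is written would let the caller stop with a message such as the one from `describe_download(path)`, instead of surfacing `PdfStreamError` from deep inside the parser.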
andytael added the duplicate label Jan 9, 2025

andytael commented Jan 9, 2025

Duplicate of #68. Closing this issue.

andytael closed this as completed Jan 9, 2025