Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTMLSemanticPreservingSplitter doesn't pull headers when using semantic HTML5 tags like <main> <article>, <section>, etc. #29184

Open
5 tasks done
srkirkland opened this issue Jan 14, 2025 · 1 comment
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature

Comments

@srkirkland
Copy link

srkirkland commented Jan 14, 2025

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain_text_splitters import HTMLSemanticPreservingSplitter

html_string = """
<!DOCTYPE html>
<html>
    <body>
        <main>
            <h1>Section 1</h1>
            <p>This section contains an important table and list that should not be split across chunks.</p>
            <table>
                <tr>
                    <th>Item</th>
                    <th>Quantity</th>
                    <th>Price</th>
                </tr>
                <tr>
                    <td>Apples</td>
                    <td>10</td>
                    <td>$1.00</td>
                </tr>
                <tr>
                    <td>Oranges</td>
                    <td>5</td>
                    <td>$0.50</td>
                </tr>
                <tr>
                    <td>Bananas</td>
                    <td>50</td>
                    <td>$1.50</td>
                </tr>
            </table>
            <h2>Subsection 1.1</h2>
            <p>Additional text in subsection 1.1 that is separated from the table and list.</p>
            <p>Here is a detailed list:</p>
            <ul>
                <li>Item 1: Description of item 1, which is quite detailed and important.</li>
                <li>Item 2: Description of item 2, which also contains significant information.</li>
                <li>Item 3: Description of item 3, another item that we don't want to split across chunks.</li>
            </ul>
        </main>
    </body>
</html>
"""

headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]

splitter = HTMLSemanticPreservingSplitter(
    headers_to_split_on=headers_to_split_on,
    max_chunk_size=50,
    elements_to_preserve=["table", "ul"],
)

documents = splitter.split_text(html_string)
print(documents)

Error Message and Stack Trace (if applicable)

No response

Description

HTMLSemanticPreservingSplitter doesn't produce metadata if the html passed includes semantic HTML5 tags like <main> <article>, <section>, etc. If you take the example usage from the docs (https://python.langchain.com/docs/how_to/split_html/#preserving-tables-and-lists) and replace <div> with <main> you can see an example.

I've made a notebook where you can see the issue: https://colab.research.google.com/drive/19hZQzpIFOfVxtJGcOpT5PtZcuibmYZu-?usp=sharing

System Info

System Information

OS: Linux
OS Version: #1 SMP PREEMPT_DYNAMIC Thu Jun 27 21:05:47 UTC 2024
Python Version: 3.10.12 (main, Nov 6 2024, 20:22:13) [GCC 11.4.0]

Package Information

langchain_core: 0.3.29
langchain: 0.3.7
langsmith: 0.1.147
langchain_text_splitters: 0.3.5

Optional packages not installed

langserve

Other Dependencies

aiohttp: 3.11.11
async-timeout: 4.0.3
httpx: 0.28.1
jsonpatch: 1.33
langsmith-pyo3: Installed. No version info available.
numpy: 1.26.4
orjson: 3.10.13
packaging: 24.2
pydantic: 2.10.4
PyYAML: 6.0.2
requests: 2.32.3
requests-toolbelt: 1.0.0
SQLAlchemy: 2.0.36
tenacity: 9.0.0
typing-extensions: 4.12.2

@dosubot dosubot bot added the 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature label Jan 14, 2025
@keenborder786
Copy link
Contributor

@srkirkland fixed

ccurme pushed a commit that referenced this issue Jan 15, 2025
…vingSplitter to properly extract the metadata. (#29215)

- **Description:** Include `main` in the list of elements whose child
elements needs to be processed for splitting the HTML.
- **Issue:** #29184
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature
Projects
None yet
Development

No branches or pull requests

2 participants