HTMLSemanticPreservingSplitter doesn't pull headers when using semantic HTML5 tags like <main>, <article>, <section>, etc. #29184
Labels
🤖:bug
Checked other resources
Example Code
Error Message and Stack Trace (if applicable)
No response
Description
HTMLSemanticPreservingSplitter doesn't produce metadata if the HTML passed includes semantic HTML5 tags like <main>, <article>, <section>, etc. If you take the example usage from the docs (https://python.langchain.com/docs/how_to/split_html/#preserving-tables-and-lists) and replace <div> with <main>, you can see an example.

I've made a notebook where you can see the issue: https://colab.research.google.com/drive/19hZQzpIFOfVxtJGcOpT5PtZcuibmYZu-?usp=sharing
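For readers who can't open the notebook, here is a minimal sketch of the comparison described above. It is not the notebook's code: the HTML body, the helper names (`wrap`, `header_metadata`, `compare_wrappers`), and the splitter parameter values are all made up for illustration. It assumes `langchain-text-splitters` and `beautifulsoup4` are installed.

```python
# Illustrative HTML body (an assumption, not copied from the docs or the notebook).
BODY = """
<h1>Section 1</h1>
<p>This section contains an important table and list.</p>
<h2>Subsection</h2>
<ul><li>Item 1</li><li>Item 2</li></ul>
"""


def wrap(tag: str) -> str:
    """Wrap the same body in the given container tag, e.g. 'div' or 'main'."""
    return f"<html><body><{tag}>{BODY}</{tag}></body></html>"


def header_metadata(html: str) -> list:
    """Split the HTML and return the metadata dict of every resulting chunk."""
    # Imported lazily so wrap() stays usable without LangChain installed.
    from langchain_text_splitters import HTMLSemanticPreservingSplitter

    splitter = HTMLSemanticPreservingSplitter(
        headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")],
        max_chunk_size=50,  # arbitrary small value chosen for the sketch
        elements_to_preserve=["table", "ul", "ol"],
    )
    return [doc.metadata for doc in splitter.split_text(html)]


def compare_wrappers() -> None:
    """Print chunk metadata for the <div> variant and the <main> variant."""
    for tag in ("div", "main"):
        print(tag, header_metadata(wrap(tag)))
```

Calling `compare_wrappers()` prints the metadata for both variants side by side, so any difference in the header keys between the `<div>` and `<main>` wrappers is easy to spot.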
System Info
System Information
Package Information
Optional packages not installed
Other Dependencies