feat: Add ChineseDocumentSplitter #9494

Closed
wants to merge 28 commits

davidsbatista
Contributor

@davidsbatista davidsbatista commented Jun 5, 2025

Related Issues

No related issue. This PR originated from a community discussion about improving Chinese document splitting support.

Follow up on #9453 by @mc112611

There was a notebook to show how to use the new component in the forked repository: https://github.com/mc112611/haystack/blob/307f8340b2e1a9104efe4e33d8c1885d17143c36/examples/chinese_RAG_test_haystack_chinese.ipynb


Proposed Changes

This PR introduces a ChineseDocumentSplitter that supports accurate sentence and paragraph splitting for Chinese documents. It leverages the HanLP library for Chinese linguistic analysis, including sentence segmentation and tokenization.
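To illustrate what sentence-level splitting of Chinese text involves, here is a minimal rule-based sketch. It is a stand-in for illustration only: it does not use HanLP and is not the code from this PR, and the function name `split_chinese_sentences` is hypothetical. It only splits on end-of-sentence punctuation, whereas a linguistic model can handle cases that simple rules miss.

```python
import re
from typing import List

# End-of-sentence punctuation (full-width and ASCII), optionally followed by
# closing quotes or brackets that belong to the same sentence.
_SENTENCE_END = re.compile(r"([。！？!?]+[”’」』）)]*)")


def split_chinese_sentences(text: str) -> List[str]:
    """Split text into sentences on end-of-sentence punctuation (rule-based sketch)."""
    # With one capturing group, re.split alternates: body, delimiter, body, ...
    parts = _SENTENCE_END.split(text)
    sentences = []
    for i in range(0, len(parts), 2):
        body = parts[i]
        delimiter = parts[i + 1] if i + 1 < len(parts) else ""
        sentence = (body + delimiter).strip()
        if sentence:
            sentences.append(sentence)
    return sentences


if __name__ == "__main__":
    for sentence in split_chinese_sentences("今天天气很好。我们去公园吧！你觉得呢？"):
        print(sentence)
```

Whitespace-based splitting does not work here because Chinese text has no spaces between words, which is why the PR relies on a dedicated library for both segmentation and tokenization.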

How did you test it?

With newly added tests. The notebook mentioned above could also be used for testing, but we haven't done that yet with the latest version of the code.

Notes for the reviewer

We still plan to make a couple of changes:

  • All tests are currently marked as integration tests because they require the downloaded models. We need to change that and convert most of the integration tests to unit tests
  • As with NLTK, which needs to download extra data, we should add a warm_up method. We still need to decide whether this warm_up should also call super().warm_up()
  • ChineseDocumentSplitter currently inherits from DocumentSplitter. We need to check whether that inheritance makes sense and whether we want to keep it
  • We are still discussing whether we can remove the parts of the code that handle English inputs, for example self.language == english
  • We could consider moving this component to haystack-core-integrations and making it a new integration instead, for example hanlp-haystack
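The warm_up and unit-test points above can be sketched together. This is only an illustration under stated assumptions: `BaseSplitter`, `LazyChineseSplitter`, the `model_loader` parameter, and `fake_loader` are hypothetical names, and `BaseSplitter` is a stand-in rather than the real DocumentSplitter. The idea is that model loading is deferred to warm_up, and that injecting the loader lets a unit test substitute a lightweight fake instead of downloading models.

```python
from typing import Callable, List, Optional

SentenceModel = Callable[[str], List[str]]


class BaseSplitter:
    """Stand-in for a parent splitter; not the real Haystack DocumentSplitter."""

    def __init__(self) -> None:
        self.warmed_up = False

    def warm_up(self) -> None:
        self.warmed_up = True


class LazyChineseSplitter(BaseSplitter):
    """Hypothetical subclass that defers model loading to warm_up()."""

    def __init__(self, model_loader: Callable[[], SentenceModel]) -> None:
        super().__init__()
        self._model_loader = model_loader
        self._sentence_model: Optional[SentenceModel] = None

    def warm_up(self) -> None:
        # Calling the parent here is one of the two options still under discussion.
        super().warm_up()
        if self._sentence_model is None:
            # The expensive step (for example, a model download) happens only once.
            self._sentence_model = self._model_loader()

    def split(self, text: str) -> List[str]:
        if self._sentence_model is None:
            raise RuntimeError("Call warm_up() before split().")
        return self._sentence_model(text)


def fake_loader() -> SentenceModel:
    # Test double: splits on the Chinese full stop instead of loading a model.
    return lambda text: [s + "。" for s in text.split("。") if s]


if __name__ == "__main__":
    splitter = LazyChineseSplitter(model_loader=fake_loader)
    splitter.warm_up()
    print(splitter.split("你好。再见。"))
```

With this shape, tests that only check splitting logic can run as unit tests with the fake loader, and only the tests that exercise the real models need to stay marked as integration tests.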

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@coveralls
Collaborator

Pull Request Test Coverage Report for Build 15468825551

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-1.1%) to 89.356%

Files with Coverage Reduction | New Missed Lines | %
components/preprocessors/__init__.py | 2 | 40.0%

Totals
Change from base Build 15464875423: -1.1%
Covered Lines: 11501
Relevant Lines: 12871

💛 - Coveralls

@julian-risch
Member

Closing this PR because we decided to move it to haystack-core-integrations: deepset-ai/haystack-core-integrations#1943

@julian-risch julian-risch removed their assignment Jun 13, 2025
4 participants