feat: add new HanLP integration with ChineseDocumentSplitter #1943

julian-risch · 2025-06-13T08:24:33Z

Related Issues

Addresses parts of New HanLP integration with ChineseDocumentSplitter #1944

Proposed Changes:

This PR introduces a ChineseDocumentSplitter that supports accurate sentence and paragraph splitting for Chinese documents. It leverages the HanLP library for Chinese linguistic analysis, including sentence segmentation and tokenization.
It keeps the commit history from drafts created in the haystack repository: feature: Chinese DocumentSplitter haystack#9453 and feat: Add ChineseDocumentSplitter haystack#9494
In addition, a warm_up method now loads the models, support for English is removed, and ChineseDocumentSplitter no longer inherits from DocumentSplitter.
Following the findings and changes of this PR, we skip tests on the combination of Windows and Python 3.13 because of an incompatibility with the sentencepiece dependency: chore: stop testing instructor-embedders on windows + python 3.13 #1941
A GitHub workflow runs the tests for the integration nightly and on every PR.
The labeler.yml file has been updated
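
To illustrate the warm_up pattern mentioned above, here is a minimal sketch, not the code in this PR. The class name `LazySplitterSketch` and the loader `load_stub_sentence_splitter` are hypothetical; the loader is a regex stand-in for loading a HanLP model, so the example runs without downloading anything.

```python
import re
from typing import Callable, List, Optional


def load_stub_sentence_splitter() -> Callable[[str], List[str]]:
    """Hypothetical stand-in for loading a HanLP model (no download needed)."""
    # Split right after Chinese sentence-ending punctuation.
    # A real model handles far more cases than this regex.
    pattern = re.compile(r"(?<=[。！？])")
    return lambda text: [s for s in pattern.split(text) if s]


class LazySplitterSketch:
    """Illustrative only: defers model loading from __init__ to warm_up()."""

    def __init__(self) -> None:
        # Nothing expensive happens at construction time.
        self._split_sentences: Optional[Callable[[str], List[str]]] = None

    def warm_up(self) -> None:
        # Load once; repeated calls are no-ops.
        if self._split_sentences is None:
            self._split_sentences = load_stub_sentence_splitter()

    def run(self, text: str) -> List[str]:
        if self._split_sentences is None:
            raise RuntimeError("Call warm_up() before run().")
        return self._split_sentences(text)


splitter = LazySplitterSketch()
splitter.warm_up()
sentences = splitter.run("今天天气很好。我们去公园吧！")
```

Keeping the constructor cheap means the component can be created and serialized without touching external data; the cost is paid only when warm_up is called.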

How did you test it?

We should test with this notebook. It shows how to use the new component in the forked repository: https://github.com/mc112611/haystack/blob/307f8340b2e1a9104efe4e33d8c1885d17143c36/examples/chinese_RAG_test_haystack_chinese.ipynb

Notes for the reviewer

Before this can be reviewed we need to work on:

The current implementation of ChineseDocumentSplitter inherits from DocumentSplitter. That's probably not needed.
We can remove some parts that are only needed to handle English, for example self.language == english
We need to define warm_up in ChineseDocumentSplitter, which should load external data
We need to add a usage example to the component docstring
py.typed needs to be added
Tests. There should be proper unit tests and only a limited number of integration tests. Similar to NLTK (which needs to download extra data), right now all tests are integration tests. We should change that.
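
One possible way to get unit tests that do not download data is to patch the model loader. The sketch below is an assumption about how that could look, not the tests in this PR: `SplitterUnderTest`, `_load_model`, and `fake_model` are hypothetical names, and only the standard library's unittest and unittest.mock are used.

```python
import unittest
from typing import Callable, List, Optional
from unittest.mock import patch


class SplitterUnderTest:
    """Hypothetical component; _load_model stands in for an expensive download."""

    def __init__(self) -> None:
        self._model: Optional[Callable[[str], List[str]]] = None

    def _load_model(self) -> Callable[[str], List[str]]:
        # In a real component this would fetch model data; unit tests patch it out.
        raise RuntimeError("would download model data")

    def warm_up(self) -> None:
        if self._model is None:
            self._model = self._load_model()

    def run(self, text: str) -> List[str]:
        if self._model is None:
            raise RuntimeError("Call warm_up() before run().")
        return self._model(text)


def fake_model(text: str) -> List[str]:
    # Deterministic replacement for the real model.
    return text.split("|")


class TestSplitterUnit(unittest.TestCase):
    def test_run_uses_loaded_model(self) -> None:
        # Replace the loader so no external data is needed.
        with patch.object(SplitterUnderTest, "_load_model", return_value=fake_model):
            splitter = SplitterUnderTest()
            splitter.warm_up()
            self.assertEqual(splitter.run("a|b"), ["a", "b"])

    def test_run_without_warm_up_raises(self) -> None:
        with self.assertRaises(RuntimeError):
            SplitterUnderTest().run("a")
```

Tests that exercise the real models could then stay in a small, separately marked integration suite.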

I had a look at the other tokenizers that HanLP supports. All of them seem to be worse than the two tokenizers that we support in this integration. Therefore, I'd limit the user's options to just the two. https://hanlp.hankcs.com/docs/api/hanlp/pretrained/tok.html
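
Limiting users to two tokenizers could be enforced with a small validated option. This is a sketch under assumptions: the option names "coarse" and "fine" and the helper `validate_granularity` are hypothetical and are not taken from this PR.

```python
from typing import Literal, get_args

# Hypothetical option names for the two supported tokenizers.
Granularity = Literal["coarse", "fine"]


def validate_granularity(value: str) -> str:
    """Return value unchanged if it names a supported option, else raise."""
    allowed = get_args(Granularity)
    if value not in allowed:
        raise ValueError(f"granularity must be one of {allowed}, got {value!r}")
    return value


choice = validate_granularity("coarse")
```

A Literal type lets static type checkers flag unsupported values, while the runtime check covers values that arrive from configuration files.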

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.


mc112611 and others added 17 commits June 13, 2025 10:17

Add Chinese DocumentSplitter support with examples

a1f30c7

fix: update tests and release notes

d4a665f

Fix according to review: - Removed notebook and original_pipeline.png…

e9ccf11

… - Added release note YAML file in notes/ - Reverted config.yaml - Implemented lazy import for hanlp - Removed main guard block from module

cleaning up

99eb335

fixing lazy import

671ca13

Add test script for ChineseDocumentSplitter, remove Chinese comments,…

830a525

… and fix lint issues

wip

75413e4

adding LICENSE header to tests

bd5ffd5

wip

b908ee5

adding LICENSE header

29de330

fixing linting issues

86e5c8f

fixing tests

14bcf08

fixing tests

fbfe639

wip: trying to make tests work with downloaded data

7609675

wip: trying to make tests work with downloaded data

ca18eb2

move ChineseDocumentSplitter to integration folder

8b8ea1f

create scaffold of the new integration

fe4784b

github-actions bot added the type:documentation Improvements or additions to documentation label Jun 13, 2025

julian-risch added 3 commits June 13, 2025 11:11

add test workflow

1747d6a

add new integration to labeler.yml

b1f802c

add to overview table in README

d4165f0

github-actions bot added the topic:CI label Jun 13, 2025

julian-risch changed the title ~~feat: add new HanNLP integration with ChineseDocumentSplitter~~ feat: add new HanLP integration with ChineseDocumentSplitter Jun 13, 2025

julian-risch added the integration:hanlp label Jun 13, 2025

julian-risch mentioned this pull request Jun 13, 2025

feat: Add ChineseDocumentSplitter deepset-ai/haystack#9494

Closed

julian-risch added 5 commits June 18, 2025 23:16

add warmup, do not inherit from DocumentSplitter, remove english support

7415b83

fmt

b803e7c

fmt

3a18f8d

align Hatch script of hanlp with other integrations

ec070b2

lint:typing more_itertools

bbfce8e

julian-risch added 11 commits June 19, 2025 00:16

move files to hanlp directory

0f90ce1

fix tests using enumerate only for prints

7ac831a

fmt

928d949

add unit tests, handle edge case when split_length < split_overlap

5f77903

stop testing on windows + python 3.13

797d294

Merge branch 'main' into hanlp

9b09dac

refactor tests, add usage example, simplify tokenizers

af7015f

add py.typed

3909a25

ignore RUF002, fmt

2460646

to_dict, from_dict

5e8fc7b

Merge branch 'main' into hanlp

a274b72

julian-risch marked this pull request as ready for review June 19, 2025 17:55

julian-risch requested a review from a team as a code owner June 19, 2025 17:55

julian-risch requested review from vblagoje and removed request for a team June 19, 2025 17:55
