Feature: JSON document format #603

elshize · 2025-03-08T21:04:23Z

Support JSONL collection format. This should make it easy to prepare a script that transforms any format into JSON and streams it to the parsing tool.
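
As a rough illustration of the workflow described above, the sketch below converts tab-separated input into JSONL, one object per line. The `title` and `content` field names follow the `ir-datasets` script in this PR. The tab-separated input layout and the `to_jsonl` helper are assumptions made for the example, and any further fields the format may define are not covered here.

```python
import io
import json


def to_jsonl(lines, out):
    """Write one JSON object per input line of the form '<id>\t<text>'.

    Hypothetical helper: the tab-separated layout is an assumption
    made for this sketch, not something the PR prescribes.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        doc_id, _, text = line.partition("\t")
        out.write(json.dumps({"title": doc_id, "content": text}) + "\n")


# Demo with in-memory data; a real script would pass sys.stdin and sys.stdout.
source = io.StringIO("doc1\tfirst document\ndoc2\tsecond document\n")
sink = io.StringIO()
to_jsonl(source, sink)
print(sink.getvalue(), end="")
```

With `sys.stdin` and `sys.stdout` in place of the in-memory buffers, the output could be piped straight into the parsing tool. The exact command-line flags are not shown in this thread.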

codecov · 2025-03-08T21:24:55Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.92%. Comparing base (403342e) to head (d7cb511).
Report is 2 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #603   +/-   ##
=======================================
  Coverage   93.91%   93.92%           
=======================================
  Files          82       82           
  Lines        3929     3931    +2     
=======================================
+ Hits         3690     3692    +2     
  Misses        239      239


JMMackenzie

This is great, nice work!

amallia · 2025-03-12T14:24:22Z

docs/book/guide/parsing.html

@@ -262,6 +262,8 @@ <h2 id="supported-formats"><a class="header" href="#supported-formats">Supported
 <li><code>plaintext</code>: every line contains the document's title first, then any
 number of whitespaces, followed by the content delimited by a new line
 character.</li>
+<li><code>jsonl</code>: every line is a JSON document with three fields: <code>title</code>,


Should we make title optional?

Can we clarify how we treat title and content? Do we concatenate?

"title" is what we call a string document ID in PISA. We must have one. This has nothing to do with a title that could be part of the content.

I'll try to clarify.
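
To make the requirement concrete, here is a small sketch of a pre-check that a caller could run before handing data to the parser. It is not PISA code: `check_line` is a hypothetical helper, and it assumes that `title` must be a non-empty string and `content` a string, based on the fields used by the script in this PR.

```python
import json


def check_line(line):
    """Return (title, content) for one JSONL line, or raise ValueError.

    Hypothetical pre-check, not part of PISA. Assumes "title" holds the
    string document ID and must be present and non-empty.
    """
    record = json.loads(line)  # malformed JSON raises a ValueError subclass
    if not isinstance(record, dict):
        raise ValueError("each line must be a JSON object")
    title = record.get("title")
    if not isinstance(title, str) or not title:
        raise ValueError("missing or empty 'title' (document ID)")
    content = record.get("content", "")
    if not isinstance(content, str):
        raise ValueError("'content' must be a string")
    return title, content


print(check_line('{"title": "doc1", "content": "some text"}'))
```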

amallia · 2025-03-12T14:25:35Z

docs/book/print.html

@@ -420,6 +420,8 @@ <h2 id="supported-formats"><a class="header" href="#supported-formats">Supported
 <li><code>plaintext</code>: every line contains the document's title first, then any
 number of whitespaces, followed by the content delimited by a new line
 character.</li>
+<li><code>jsonl</code>: every line is a JSON document with three fields: <code>title</code>,


amallia · 2025-03-12T14:28:29Z

script/ir-datasets.py

+def parse_dataset(dataset_name):
+    dataset = ir.load(dataset_name)
+    for doc in dataset.docs_iter():
+        print(json.dumps({"title": doc.doc_id, "content": doc.text}))


Why use doc_id as the title? For example:

import ir_datasets
dataset = ir_datasets.load("beir/climate-fever")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, text, title>

import ir_datasets
dataset = ir_datasets.load("msmarco-passage")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, text>

These are two examples: one dataset has a title field and the other does not.
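
One way a conversion script could handle both document shapes is sketched below. The namedtuples are stand-ins for the `ir_datasets` document types, so the example runs without downloading anything. Prepending the dataset's own title to the content is purely an assumption for illustration; the thread does not settle whether that is desirable.

```python
import json
from collections import namedtuple

# Stand-ins for the two document shapes mentioned above (hypothetical names).
DocWithTitle = namedtuple("DocWithTitle", ["doc_id", "text", "title"])
DocPlain = namedtuple("DocPlain", ["doc_id", "text"])


def to_record(doc):
    """Map a document to the JSONL fields used by the script in this PR.

    "title" carries the document ID. If the dataset also provides a
    human-readable title, it is prepended to the content (an assumption).
    """
    dataset_title = getattr(doc, "title", None)
    content = f"{dataset_title} {doc.text}" if dataset_title else doc.text
    return {"title": doc.doc_id, "content": content}


docs = [
    DocWithTitle("d1", "body text", "A Heading"),
    DocPlain("d2", "passage text"),
]
for doc in docs:
    print(json.dumps(to_record(doc)))
```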

elshize self-assigned this Mar 8, 2025

elshize requested review from JMMackenzie and amallia March 8, 2025 21:33

elshize added 2 commits March 9, 2025 16:23

Feature: JSON document format

7d23051

Support JSONL collection format. This should make it easy to prepare a script that transforms any format into JSON and streams it to the parsing tool. Signed-off-by: Michal Siedlaczek <[email protected]>

Feature: ir-datasets script

d7cb511

The script utilizes `ir-datasets` to download a dataset and prints it out in a format that can be understood by `parse_collection`. Signed-off-by: Michal Siedlaczek <[email protected]>

elshize force-pushed the json-parser branch from 07e067b to d7cb511 Compare March 9, 2025 20:32

JMMackenzie approved these changes Mar 9, 2025


elshize merged commit 6616ce3 into main Mar 12, 2025
8 checks passed

elshize deleted the json-parser branch March 12, 2025 16:47

amallia reviewed Mar 12, 2025
