Feature: JSON document format #603
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

@@ Coverage Diff @@
##             main     #603   +/- ##
=====================================
  Coverage   93.91%   93.92%
=====================================
  Files          82       82
  Lines        3929     3931     +2
=====================================
+ Hits         3690     3692     +2
  Misses        239      239
Support JSONL collection format. This should make it easy to prepare a script that transforms any format into JSON and streams it to the parsing tool. Signed-off-by: Michal Siedlaczek <[email protected]>
The script utilizes `ir-datasets` to download a dataset and prints it out in a format that can be understood by `parse_collection`. Signed-off-by: Michal Siedlaczek <[email protected]>
This is great, nice work!
@@ -262,6 +262,8 @@ <h2 id="supported-formats"><a class="header" href="#supported-formats">Supported
<li><code>plaintext</code>: every line contains the document's title first, then any
number of whitespaces, followed by the content delimited by a new line
character.</li>
<li><code>jsonl</code>: every line is a JSON document with three fields: <code>title</code>,
Should we make title optional?
Can we clarify how we treat title and content? Do we concatenate?
"title" is what in PISA we call a string document ID. We must have one. It has nothing to do with a title that could be part of the content.
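To make that concrete, a minimal sketch of one line in this format (the document ID and text shown are made up; only the `title` and `content` fields that appear elsewhere in this thread are used):

```python
import json

# Sketch: in the JSONL collection format discussed here, `title` carries
# the string document ID, not a human-readable headline (a headline, if
# any, stays inside `content`).
line = json.dumps({"title": "doc-42", "content": "full text of the document"})
print(line)
# -> {"title": "doc-42", "content": "full text of the document"}
```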
I'll try to clarify
@@ -420,6 +420,8 @@ <h2 id="supported-formats"><a class="header" href="#supported-formats">Supported
<li><code>plaintext</code>: every line contains the document's title first, then any
number of whitespaces, followed by the content delimited by a new line
character.</li>
<li><code>jsonl</code>: every line is a JSON document with three fields: <code>title</code>,
same here
import json

import ir_datasets as ir


def parse_dataset(dataset_name):
    # Stream every document as one JSON object per line, using the
    # dataset's doc_id as PISA's string document ID (`title`).
    dataset = ir.load(dataset_name)
    for doc in dataset.docs_iter():
        print(json.dumps({"title": doc.doc_id, "content": doc.text}))
why doc_id as title?
For example:

import ir_datasets

dataset = ir_datasets.load("beir/climate-fever")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, text, title>

import ir_datasets

dataset = ir_datasets.load("msmarco-passage")
for doc in dataset.docs_iter():
    doc  # namedtuple<doc_id, text>

Two examples: one has a title and the other one does not.