(not a bug) question about bert `create_pretraining_data.tokenize_lines()` #1592

kiukchung · 2022-08-30T17:52:21Z

Description

In the function scripts.pretraining.bert.create_pretraining_data.tokenize_lines()

The code snippet:

    for line in lines:
        if not line:
            break
        line = line.strip()
        # Empty lines are used as document delimiters
        if not line:
            results.append([])
        else:
            #<OMITTED FOR BREVITY...>
    return results

This suggests that an empty or null line (e.g. "" or None) breaks out of the for-loop, so the function returns only the lines processed up to that point, whereas a line that is empty only after stripping (e.g. " ") is used as a document delimiter.
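To make the two cases concrete, here is a minimal sketch that mirrors the control flow of the snippet. The function name `tokenize_lines_sketch` is hypothetical, and `str.split` is only a stand-in for the omitted tokenizer branch, so this illustrates the loop logic rather than the script's actual tokenization.

```python
def tokenize_lines_sketch(lines):
    """Mirror the snippet's control flow; str.split stands in for the real tokenizer."""
    results = []
    for line in lines:
        if not line:  # "" or None: stop and return what has been collected
            break
        line = line.strip()
        if not line:  # whitespace-only line: record a document delimiter
            results.append([])
        else:
            results.append(line.split())  # assumption: placeholder for the omitted branch
    return results


delimited = tokenize_lines_sketch(["a b\n", " \n", "c d\n"])
truncated = tokenize_lines_sketch(["a b\n", "", "c d\n"])
print(delimited)  # [['a', 'b'], [], ['c', 'd']]
print(truncated)  # [['a', 'b']]
```

In the second call, the line after the empty string is never reached, which is the behavior the question is about.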

Could someone shed light as to what the (empty line + break-from-loop) is meant to accomplish? Are empty/null lines used as delimiters?


szhengac · 2022-08-31T00:55:58Z

The function expects a text file in which documents are separated by empty lines. The following example contains two documents, and each line is treated as one sentence of its document.

We use wikipedia corpus.
Wikipedia is great.

We also use bookcorpus.
Bookcorpus is also helpful.

So when the `tokenize_lines` function reaches the end of the file, the first `if not line` is triggered and we break out of the loop. When the function reads the empty line between two documents, `line` is `'\n'`, which is non-empty and passes the first check; only after `line.strip()` reduces it to `''` is the second `if not line` triggered.
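The difference between the two checks can be sketched as follows, assuming the lines come from repeated `readline()` calls (the thread does not show how the script actually reads the file). The helper `read_documents` is hypothetical and groups raw sentences instead of tokenizing them.

```python
import io

corpus = (
    "We use wikipedia corpus.\n"
    "Wikipedia is great.\n"
    "\n"
    "We also use bookcorpus.\n"
    "Bookcorpus is also helpful.\n"
)


def read_documents(reader):
    """Group lines into documents, splitting on blank lines (illustrative helper)."""
    documents = [[]]
    while True:
        line = reader.readline()
        if not line:  # readline() returns '' only at end of file
            break
        line = line.strip()
        if not line:  # the blank line was '\n' before strip(): start a new document
            documents.append([])
        else:
            documents[-1].append(line)
    return documents


docs = read_documents(io.StringIO(corpus))
print(len(docs))  # 2
print(docs[1])    # ['We also use bookcorpus.', 'Bookcorpus is also helpful.']
```

A blank line inside the file never reaches the first `break`, because it still carries its newline character until `strip()` is applied.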

kiukchung added the bug (Something isn't working) label · 2022-08-30
