Problems with length of oracle output #1

tbrodbeck · 2020-07-29T20:14:35Z

The example command in the description is not working:
$ extoracle source.txt target.txt -method greedy -output oracle.txt

Traceback (most recent call last):
  File "/Users/tillmann/dev/textSum/.direnv/python-3.7.3/bin/extoracle", line 10, in <module>
    sys.exit(main())
  File "/Users/tillmann/dev/textSum/.direnv/python-3.7.3/lib/python3.7/site-packages/extoracle/bin/cmd.py", line 34, in main
    n_thread=args.n_thread)
  File "/Users/tillmann/dev/textSum/.direnv/python-3.7.3/lib/python3.7/site-packages/extoracle/extoracle.py", line 82, in from_files
    "Argument [summary_length, length_oracle] "
ValueError: Argument [summary_length, length_oracle] cannot be both None/False

Also, my outputs are very long, even when I specify -length 1 or -length_oracle.
Furthermore, I am not sure whether this works with languages other than English: in that case it even returned twice as many sentences as the source contains.

$ pip freeze | grep extoracle      
extoracle==0.1


pltrdy · 2020-07-30T14:45:13Z

Thanks for pointing this out.

First, you're right: one must set either -length or -length_oracle. I will update the README accordingly.
There's no consideration of language; it just compares words (i.e. virtually anything separated by whitespace). However, it does look for sentences, which should be terminated by a period «.».

If you set -length 1, each output should be at most 1 sentence long. Unless your text has no «.», it should be fine. Could you share an extract so I can reproduce?
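
To make the "sentences terminated by a period" rule concrete, here is a minimal sketch of how a whitespace-based splitter would see a text. It is an illustration only: `split_sentences` is a hypothetical helper, not extoracle's actual code, and it assumes that a sentence ends only at a period that stands alone as its own whitespace-separated word.

```python
def split_sentences(text):
    """Split whitespace-tokenized text into sentences on standalone '.' tokens.

    Hypothetical helper for illustration; not extoracle's actual code.
    """
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token == ".":
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words without a final standalone period
        sentences.append(" ".join(current))
    return sentences


untokenized = "this happened years ago. i grew up in germany."
tokenized = "this happened years ago . i grew up in germany ."

print(len(split_sentences(untokenized)))  # periods glued to words: one sentence
print(len(split_sentences(tokenized)))    # standalone periods: two sentences
```

Under that assumption, a text whose periods are attached to the preceding word counts as a single sentence, so a one-sentence limit would still return the whole text.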

tbrodbeck · 2020-07-30T19:56:21Z

Sure. Thanks for your help.
I just created this minimal example with the following files:
sourceTest.txt
targetTest.txt
The following commands

extoracle sourceTest.txt targetTest.txt -length 1 -output oracleTest.txt
extoracle sourceTest.txt targetTest.txt -length_oracle -output oracleTest.txt

both give the same output (basically just a copy of sourceTest.txt):
oracleTest.txt

pltrdy · 2020-07-31T09:07:39Z

Alright, the thing is that your texts are not tokenized. It is quite common to tokenize text before NLP tasks so that it's easier to process. In particular, tokenization adds whitespace around punctuation etc., so that it's easy to distinguish from words.
E.g. your source becomes:
this actually happened a couple of years ago . i grew up in germany where i went to a german secondary school that went from 5th to 13th grade -LRB- we still had 13 grades then , they have since changed that -RRB- . my school was named after anne frank and we had a club that i was very active in from 9th grade on , which was dedicated to teaching incoming 5th graders about anne franks life , discrimination , anti-semitism , hitler , the third reich and that whole spiel . basically a day where the students ' classes are cancelled and instead we give them an interactive history and social studies class with lots of activities and games . this was my last year at school and i already had a lot of experience doing these project days with the kids . i was running the thing with a friend , so it was just the two of us and 30-something 5th graders . we start off with a brief introduction and brainstorming : what do they know about anne frank and the third reich ? you 'd be surprised how much they know . anyway after the brainstorming we do a few activities , and then we take a short break . after the break we split the class into two groups to make it easier to handle . one group watches a short movie about anne frank while the other gets a tour through our poster presentation that our student group has been perfecting over the years . then the groups switch . i 'm in the classroom to show my group the movie and i take attendance to make sure no one decided to run away during break . i 'm going down the list when i come to the name sandra -LRB- name changed -RRB- . a kid with a boyish haircut and a somewhat deeper voice , wearing clothes from the boy 's section at a big clothing chain in germany , pipes up . now keep in mind , these are all 11 year olds , they are all pre-pubescent , their bodies are not yet showing any sex specific features one would be able to see while they are fully clothed.

I suggest using the Stanford CoreNLP tokenizer, e.g.:

# Get Stanford CoreNLP
mkdir ~/corenlp
cd ~/corenlp
wget http://nlp.stanford.edu/software/stanford-corenlp-latest.zip
unzip stanford-corenlp-latest.zip

# Replace X.Y.Z with your version number
export CLASSPATH=$HOME/corenlp/stanford-corenlp-X.Y.Z.jar

tokenize(){
    # Tokenize function using Stanford PTB Tokenizer
    path="$1"
    java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines \
        < "$path" \
        > "$path.tok"
}

tokenize "sourceTest.txt"
tokenize "targetTest.txt"

# Finally run extoracle on tokenized text
extoracle sourceTest.txt.tok targetTest.txt.tok -length_oracle -output oracleTest.txt.tok
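
For a quick check without Java, the core idea (whitespace around punctuation) can be approximated in a few lines of Python. This is a rough sketch, not a substitute for the PTB tokenizer: `crude_tokenize` and its punctuation set are assumptions made for illustration. It will wrongly split decimals and abbreviations such as "3.5" or "e.g.", and it does not produce PTB conventions like -LRB- or split contractions.

```python
import re

# Punctuation to pad with spaces; an illustrative choice, not the PTB rule set.
_PUNCT = re.compile(r"([.,!?;:()\"])")


def crude_tokenize(line):
    """Pad common punctuation with spaces and collapse repeated whitespace."""
    padded = _PUNCT.sub(r" \1 ", line)
    return " ".join(padded.split())


print(crude_tokenize("I grew up in Germany (near Kiel). It was fine."))
```

Applied line by line, this keeps one output line per input line, which matches what -preserveLines does in the command above.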

tbrodbeck · 2020-07-31T12:52:33Z

Thank you! It was not clear to me that the text had to be tokenized in that manner. You could think about adding this to the README as well 👍

As a remark, my shell did not like that I changed the $path variable, so I had to change the function to this one:

tokenize(){
    # Tokenize function using Stanford PTB Tokenizer
    java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines \
        < "$1" \
        > "$1.tok"
}
