Problems with length of oracle output #1

Open
tbrodbeck opened this issue Jul 29, 2020 · 4 comments

@tbrodbeck

The example command in the description is not working:
$ extoracle source.txt target.txt -method greedy -output oracle.txt

Traceback (most recent call last):
  File "/Users/tillmann/dev/textSum/.direnv/python-3.7.3/bin/extoracle", line 10, in <module>
    sys.exit(main())
  File "/Users/tillmann/dev/textSum/.direnv/python-3.7.3/lib/python3.7/site-packages/extoracle/bin/cmd.py", line 34, in main
    n_thread=args.n_thread)
  File "/Users/tillmann/dev/textSum/.direnv/python-3.7.3/lib/python3.7/site-packages/extoracle/extoracle.py", line 82, in from_files
    "Argument [summary_length, length_oracle] "
ValueError: Argument [summary_length, length_oracle] cannot be both None/False

Also, my outputs are very long, even when I specify -length 1 or -length_oracle.
Furthermore, I am not sure whether this works with languages other than English: in that case it even returned twice as many sentences as the source contains.

$ pip freeze | grep extoracle
extoracle==0.1
@pltrdy
Owner

pltrdy commented Jul 30, 2020

Thanks for pointing this out.

First, you're right: one must set either -length or -length_oracle. I will update the README accordingly.
Language is not taken into account; it just compares words (i.e. virtually anything separated by whitespace). However, it looks for sentences, which should be terminated by a period «.».

If you set -length 1, each output should be at most one sentence long. Unless your text has no «.», it should be fine. Could you share an extract so I can reproduce this?
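
A quick way to check whether a file contains the period-terminated sentences described above is to count the standalone «.» tokens on each line. The sketch below is only a diagnostic, under the assumption that a sentence boundary is a «.» separated from the neighbouring words by whitespace; the helper name count_period_tokens is made up for illustration and is not part of extoracle.

```shell
# Hypothetical diagnostic helper (not part of extoracle).
# Prints "<line number>: <count>" for every line of the given file, where
# <count> is the number of whitespace-separated tokens that are exactly ".".
# Assumption: such a standalone "." is what marks the end of a sentence.
count_period_tokens() {
    awk '{
        n = 0
        for (i = 1; i <= NF; i++) if ($i == ".") n++
        print NR ": " n
    }' "$1"
}
```

A count of 0 on a line that visibly contains several sentences would suggest that the periods are still attached to the preceding words.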

@tbrodbeck
Author

tbrodbeck commented Jul 30, 2020

Sure. Thanks for your help.
I just created this minimal example with the following files:
sourceTest.txt
targetTest.txt
The following commands

extoracle sourceTest.txt targetTest.txt -length 1 -output oracleTest.txt
extoracle sourceTest.txt targetTest.txt -length_oracle -output oracleTest.txt

both give the same output (basically just a copy of sourceTest.txt):
oracleTest.txt

@pltrdy
Owner

pltrdy commented Jul 31, 2020

Alright, the thing is that your texts are not tokenized. It is quite common to tokenize text before NLP tasks so that it is easier to process. In particular, tokenization adds whitespace around punctuation etc., so punctuation is easy to distinguish from words.
E.g. your source becomes:
this actually happened a couple of years ago . i grew up in germany where i went to a german secondary school that went from 5th to 13th grade -LRB- we still had 13 grades then , they have since changed that -RRB- . my school was named after anne frank and we had a club that i was very active in from 9th grade on , which was dedicated to teaching incoming 5th graders about anne franks life , discrimination , anti-semitism , hitler , the third reich and that whole spiel . basically a day where the students ' classes are cancelled and instead we give them an interactive history and social studies class with lots of activities and games . this was my last year at school and i already had a lot of experience doing these project days with the kids . i was running the thing with a friend , so it was just the two of us and 30-something 5th graders . we start off with a brief introduction and brainstorming : what do they know about anne frank and the third reich ? you 'd be surprised how much they know . anyway after the brainstorming we do a few activities , and then we take a short break . after the break we split the class into two groups to make it easier to handle . one group watches a short movie about anne frank while the other gets a tour through our poster presentation that our student group has been perfecting over the years . then the groups switch . i 'm in the classroom to show my group the movie and i take attendance to make sure no one decided to run away during break . i 'm going down the list when i come to the name sandra -LRB- name changed -RRB- . a kid with a boyish haircut and a somewhat deeper voice , wearing clothes from the boy 's section at a big clothing chain in germany , pipes up . now keep in mind , these are all 11 year olds , they are all pre-pubescent , their bodies are not yet showing any sex specific features one would be able to see while they are fully clothed.

I suggest using the Stanford CoreNLP tokenizer, e.g.:

# Get Stanford CoreNLP
mkdir ~/corenlp
cd ~/corenlp
wget http://nlp.stanford.edu/software/stanford-corenlp-latest.zip
unzip stanford-corenlp-latest.zip

# Replace X.Y.Z with your version number
export CLASSPATH=$HOME/corenlp/stanford-corenlp-X.Y.Z.jar

tokenize(){
    # Tokenize function using Stanford PTB Tokenizer
    path="$1"
    java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines \
        < "$path" \
        > "$path.tok"
}

tokenize "sourceTest.txt"
tokenize "targetTest.txt"

# Finally run extoracle on tokenized text
extoracle sourceTest.txt.tok targetTest.txt.tok -length_oracle -output oracleTest.txt.tok
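
Since -preserveLines keeps the input line breaks, each tokenized file should have as many lines as its original. A small sanity check before running extoracle could look like the sketch below. It assumes that source and target are meant to be aligned line by line, which this thread does not state explicitly, and same_line_count is a hypothetical helper name.

```shell
# Hypothetical helper (not part of extoracle or CoreNLP).
# Compares the number of lines in two files and reports the result.
# Assumption: line i of the first file corresponds to line i of the second.
same_line_count() {
    src_lines=$(awk 'END { print NR }' "$1")
    tgt_lines=$(awk 'END { print NR }' "$2")
    if [ "$src_lines" -eq "$tgt_lines" ]; then
        echo "OK: both files have $src_lines lines"
    else
        echo "MISMATCH: $1 has $src_lines lines, $2 has $tgt_lines lines"
        return 1
    fi
}
```

For example, same_line_count sourceTest.txt.tok targetTest.txt.tok returns a non-zero status if the two files differ in length.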

@tbrodbeck
Author

tbrodbeck commented Jul 31, 2020

Thank you! It was not clear to me that the text had to be tokenized in that manner. It might be worth adding this to the README as well 👍

One remark: my shell did not like the function assigning to the $path variable, so I had to change the function to this one:

tokenize(){
    # Tokenize function using Stanford PTB Tokenizer
    java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines \
        < "$1" \
        > "$1.tok"
}
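
A possible explanation, assuming the shell in question is zsh (the thread does not say): in zsh, path is a special array tied to PATH, so assigning to it inside the function overwrites the command search path and java can no longer be found. A variant that keeps a named variable but avoids the clash might look like this; input_file is just an illustrative name.

```shell
tokenize() {
    # Tokenize function using Stanford PTB Tokenizer.
    # "input_file" is an illustrative name; unlike "path", it has no
    # special meaning in zsh. "local" keeps it out of the calling scope.
    local input_file="$1"
    java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines \
        < "$input_file" \
        > "$input_file.tok"
}
```

The version above that uses "$1" directly works just as well; the named variable only helps readability.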
