Problems with length of oracle output #1
Comments
Thanks for pointing this out. First, you're right, one must set […]. If you set […]
Sure. Thanks for your help.
[…] both give the same output (basically just a copy of […]).
Alright, the thing is that your texts are not tokenized. It is quite common to tokenize text before NLP tasks so it is easier to process; in particular, tokenization adds whitespace around punctuation so it is easy to distinguish it from words. I would suggest using the Stanford CoreNLP tokenizer (PTBTokenizer), e.g.:
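A minimal sketch of such an invocation (an illustration only: it assumes the Stanford CoreNLP jar is on the CLASSPATH and that `source.txt` contains one sentence per line; file names are illustrative):

```bash
# Tokenize a plain-text file with the Stanford PTB tokenizer,
# keeping the original line breaks, and write the result to a .tok file.
# Assumes stanford-corenlp-*.jar is on the CLASSPATH.
java edu.stanford.nlp.process.PTBTokenizer -preserveLines source.txt > source.txt.tok
```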
Thank you! It was not clear to me that the text had to be tokenized in that manner. You could think about adding this to the README as well 👍 As a remark, my shell did not like […], so I changed it to:

```bash
tokenize(){
  # Tokenize function using the Stanford PTB tokenizer
  java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines \
    < "$1" \
    > "$1.tok"
}
```
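As a usage sketch (file names are illustrative, and it is assumed that the function above is defined in the current shell and that extoracle accepts the tokenized files):

```bash
# Tokenize both sides, then run extoracle on the tokenized files.
tokenize source.txt
tokenize target.txt
extoracle source.txt.tok target.txt.tok -method greedy -output oracle.txt
```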
The example command in the description is not working:
$ extoracle source.txt target.txt -method greedy -output oracle.txt
Also, the output is very long, even when I specify -length 1 or -length_oracle. Furthermore, I am not sure whether this works with languages other than English, because in that case it even returned twice as many sentences as the source.
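A sketch of the invocations described above, combining the example command with the reported length flags (whether these are the exact extoracle options and argument forms is an assumption based on this report):

```bash
# Reported attempts to limit the oracle length; flag semantics assumed from the report.
extoracle source.txt target.txt -method greedy -length 1 -output oracle.txt
extoracle source.txt target.txt -method greedy -length_oracle -output oracle.txt
```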
$ pip freeze | grep extoracle
extoracle==0.1