Improper result for the trained model for NER #2

Open
siddharthBohra opened this issue May 19, 2017 · 0 comments

Hi,
My own training file:

Abdera implementation of the Atom Syndication Format and Atom Publishing Protocol
Accumulo secure implementation of BigTable
ActiveMQ message broker supporting different communication protocols and clients, including a full Java Message Service (JMS) 1.1 client.
Allura Python-based an open source implementation of a software forge
Ant Java-based build tool
Apache Arrow "A high-performance cross-system data layer for columnar in-memory analytics".[1][2]
APR Apache Portable Runtime, a portability library written in C
Archiva Build Artifact Repository Manager
Apache Beam ,an uber-API for big data
Beehive Java visual object model
Bloodhound defect tracker based on Trac[3]
Calcite dynamic data management framework
Camel declarative routing and mediation rules engine which implements the Enterprise Integration Patterns using a Java-based domain specific language

Training and saving the NER model [TrainGeneTag.java]:
import com.aliasi.chunk.CharLmHmmChunker;
import com.aliasi.corpus.parsers.GeneTagParser;
import com.aliasi.hmm.HmmCharLmEstimator;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.util.AbstractExternalizable;
import java.io.File;
import java.io.IOException;

public class LingPipeNER {

    static final int MAX_N_GRAM = 8;
    static final int NUM_CHARS = 256;
    static final double LM_INTERPOLATION = MAX_N_GRAM; // default behavior

    public static void main(String[] args) throws IOException {
        File corpusFile = new File("D:\\resources\\lingPipe-technologies-trainFile.txt");
        File modelFile = new File("D:\\resources\\muc6.HmmChunker");

        System.out.println("Setting up Chunker Estimator");
        TokenizerFactory factory          = IndoEuropeanTokenizerFactory.INSTANCE;
        HmmCharLmEstimator hmmEstimator   = new HmmCharLmEstimator(MAX_N_GRAM, NUM_CHARS, LM_INTERPOLATION);
        CharLmHmmChunker chunkerEstimator = new CharLmHmmChunker(factory, hmmEstimator);

        System.out.println("Setting up Data Parser");
        GeneTagParser parser = new GeneTagParser();
        parser.setHandler(chunkerEstimator);

        System.out.println("Training with Data from File=" + corpusFile);
        parser.parse(corpusFile);

        System.out.println("Compiling and Writing Model to File=" + modelFile);
        AbstractExternalizable.compileTo(chunkerEstimator, modelFile);
    }
}
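
For reference, here is a minimal sketch of training the same estimator directly from an in-memory Chunking instead of a file (this assumes the LingPipe chunk API's ChunkingImpl, ChunkFactory and CharLmHmmChunker.handle(Chunking); the TECHNOLOGY label, the sentence and the output path are only illustrative):

import com.aliasi.chunk.CharLmHmmChunker;
import com.aliasi.chunk.ChunkFactory;
import com.aliasi.chunk.ChunkingImpl;
import com.aliasi.hmm.HmmCharLmEstimator;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.util.AbstractExternalizable;
import java.io.File;

public class TrainFromChunking {

    public static void main(String[] args) throws Exception {
        CharLmHmmChunker chunkerEstimator =
            new CharLmHmmChunker(IndoEuropeanTokenizerFactory.INSTANCE,
                                 new HmmCharLmEstimator(8, 256, 8.0));

        // One hand-labelled sentence; the span [0, 5) covers "Camel".
        String text = "Camel declarative routing and mediation rules engine";
        ChunkingImpl chunking = new ChunkingImpl(text);
        chunking.add(ChunkFactory.createChunk(0, 5, "TECHNOLOGY"));

        // Each labelled Chunking is one training example.
        chunkerEstimator.handle(chunking);

        AbstractExternalizable.compileTo(chunkerEstimator,
                new File("D:\\resources\\technology.HmmChunker"));
    }
}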

Now loading the model and passing an input sentence to get the entity output [RunChunker.java]:

import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunker;
import com.aliasi.chunk.Chunking;
import com.aliasi.util.AbstractExternalizable;
import java.io.File;
import java.util.Iterator;
import java.util.Set;

public class RunChunker {

    public static void main(String[] args) throws Exception {
        File modelFile = new File("D:\\resources\\muc6.HmmChunker");

        System.out.println("Reading chunker from file");
        Chunker chunker = (Chunker) AbstractExternalizable.readObject(modelFile);

        String sentence = "Camel declarative routing and Camel rules engine which implements the Enterprise Integration Patterns";
        System.out.println(sentence);

        String[] strList = sentence.split(" ");
        for (int i = 0; i < strList.length; ++i) {
            // Chunk each whitespace-separated token on its own.
            Chunking chunking = chunker.chunk(strList[i]);
            // System.out.println("Chunking=" + chunking);

            Set<Chunk> chunkSet = chunking.chunkSet();
            Iterator<Chunk> it = chunkSet.iterator();
            while (it.hasNext()) {
                Chunk chunk = it.next();
                int start = chunk.start();
                int end = chunk.end();
                // Offsets are relative to the string passed to chunk(),
                // i.e. strList[i], not to the full sentence.
                String text = strList[i].substring(start, end);
                System.out.println(" chunk=" + chunk + " text=" + text);
            }
        }
    }
}
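
And for comparison, a minimal sketch that chunks the whole sentence in one call instead of word by word (same Chunker/Chunking API as above; note that chunk.start() and chunk.end() are offsets into whatever string was passed to chunk(), so here substring() on the full sentence is correct):

import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunker;
import com.aliasi.chunk.Chunking;
import com.aliasi.util.AbstractExternalizable;
import java.io.File;
import java.util.Set;

public class RunChunkerWholeSentence {

    public static void main(String[] args) throws Exception {
        Chunker chunker = (Chunker) AbstractExternalizable.readObject(
                new File("D:\\resources\\muc6.HmmChunker"));

        String sentence = "Camel declarative routing and Camel rules engine "
                + "which implements the Enterprise Integration Patterns";

        // Chunk the full sentence; offsets in each Chunk refer to it.
        Chunking chunking = chunker.chunk(sentence);
        Set<Chunk> chunks = chunking.chunkSet();
        for (Chunk chunk : chunks) {
            System.out.println(chunk.type() + ": "
                    + sentence.substring(chunk.start(), chunk.end()));
        }
    }
}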

Using LingPipe 3.9.3:

<dependency>
    <groupId>com.aliasi</groupId>
    <artifactId>lingpipe</artifactId>
    <version>3.9.3</version>
</dependency>

Expected output:
Technology: camel
Technology: camel

But the output is empty: no chunks are returned by chunker.chunk(strList[i]). Please guide me.
If you have a better example program for training and finding named entities, please share it.

Thanks in advance.
