kr.ac.kaist.swrc.jhannanum.plugin.SupplementPlugin.PlainTextProcessor.SentenceSegmentor
Class SentenceSegmentor

java.lang.Object
  extended by kr.ac.kaist.swrc.jhannanum.plugin.SupplementPlugin.PlainTextProcessor.SentenceSegmentor.SentenceSegmentor
All Implemented Interfaces:
Plugin, PlainTextProcessor

public class SentenceSegmentor
extends java.lang.Object
implements PlainTextProcessor

This plug-in reads a document that consists of more than one sentence and recognizes the end of each sentence based on punctuation marks. If punctuation marks are not used correctly in the text, this plug-in will not work well.

It treats '.', '!', and '?' as end-of-sentence marks. Because these symbols can also be used for other purposes, it handles such cases separately.

For example,
- 12.42 : number
- A. Introduction : section title
- I'm fine... : ellipsis
- U.S. : abbreviation
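
The cases above can be illustrated with a minimal sketch. The class SentenceEndSketch and its rules are hypothetical and are not taken from the SentenceSegmentor source; they only show one way a mark might be rejected as a sentence end when it sits inside a number, follows a single-letter label, belongs to an ellipsis, or closes an abbreviation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: these rules are assumptions, not SentenceSegmentor's code.
class SentenceEndSketch {

    static boolean isEndMark(char c) {
        return c == '.' || c == '!' || c == '?';
    }

    static boolean isAsciiLetter(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    /** Returns true if the character at index i plausibly ends a sentence. */
    static boolean isSentenceEnd(String text, int i) {
        if (!isEndMark(text.charAt(i))) {
            return false;
        }
        // A mark inside a token ("12.42", "U.S") or inside a run of marks ("...")
        // is skipped: only a mark followed by whitespace or end of text counts.
        if (i + 1 < text.length() && !Character.isWhitespace(text.charAt(i + 1))) {
            return false;
        }
        // Locate the whitespace-delimited token that ends at this mark.
        int start = i;
        while (start > 0 && !Character.isWhitespace(text.charAt(start - 1))) {
            start--;
        }
        // Drop the trailing run of marks, so "fine..." is reduced to "fine".
        int end = i;
        while (end > start && isEndMark(text.charAt(end - 1))) {
            end--;
        }
        String word = text.substring(start, end);
        // The label and abbreviation rules apply only to a single closing period.
        if (text.charAt(i) == '.' && end == i) {
            // "A." : a lone ASCII letter is treated as a section label.
            if (word.length() == 1 && isAsciiLetter(word.charAt(0))) {
                return false;
            }
            // "U.S." : letters separated by inner periods are treated as an abbreviation.
            if (word.indexOf('.') >= 0
                    && word.replace(".", "").chars().allMatch(Character::isLetter)) {
                return false;
            }
        }
        return true;
    }

    /** Splits text at every position accepted by isSentenceEnd. */
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        int begin = 0;
        for (int i = 0; i < text.length(); i++) {
            if (isSentenceEnd(text, i)) {
                String s = text.substring(begin, i + 1).trim();
                if (!s.isEmpty()) {
                    sentences.add(s);
                }
                begin = i + 1;
            }
        }
        String rest = text.substring(begin).trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);
        }
        return sentences;
    }
}
```

With these assumed rules, an abbreviation that really does close a sentence is not split, which is the kind of trade-off any punctuation-based approach has to accept.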

It is a Plain Text Processor plug-in, which is a supplement plug-in of phase 1 in the HanNanum work flow.

Author:
Sangwon Park (hudoni@world.kaist.ac.kr), CILab, SWRC, KAIST

Field Summary
private  java.lang.String[] bufEojeols
          the buffer for storing the remaining part after one sentence is returned
private  int bufEojeolsIdx
          the index of the buffer for storing the remaining part
private  java.lang.String bufRes
          the buffer for storing intermediate results
private  int documentID
          the ID of the document
private  boolean endOfDocument
          the flag to check whether the current sentence is the end of the document
private  boolean hasRemainingData
          the flag to check if there is remaining data in the input buffer
private  int sentenceID
          the ID of the sentence
 
Constructor Summary
SentenceSegmentor()
           
 
Method Summary
 PlainSentence doProcess(PlainSentence ps)
          It recognizes the end of each sentence and returns the first sentence.
 PlainSentence flush()
          It returns the text which has been stored in the internal buffer.
 boolean hasRemainingData()
          It checks if there is any remaining text.
 void initialize(java.lang.String baseDir, java.lang.String configFile)
          This method is called before the work flow starts in order to initialize the plug-in.
private  boolean isSym(char c)
          Checks if the specified symbol can appear with previous symbols.
 void shutdown()
          This method is called before the work flow is closed.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

documentID

private int documentID
the ID of the document


sentenceID

private int sentenceID
the ID of the sentence


hasRemainingData

private boolean hasRemainingData
the flag to check if there is remaining data in the input buffer


bufRes

private java.lang.String bufRes
the buffer for storing intermediate results


bufEojeols

private java.lang.String[] bufEojeols
the buffer for storing the remaining part after one sentence is returned


bufEojeolsIdx

private int bufEojeolsIdx
the index of the buffer for storing the remaining part


endOfDocument

private boolean endOfDocument
the flag to check whether the current sentence is the end of the document

Constructor Detail

SentenceSegmentor

public SentenceSegmentor()
Method Detail

isSym

private boolean isSym(char c)
Checks if the specified symbol can appear with previous symbols.

Parameters:
c - the character to check
Returns:
true if the character can appear together with the previous symbols, false otherwise

doProcess

public PlainSentence doProcess(PlainSentence ps)
It recognizes the end of each sentence and returns the first sentence.

Specified by:
doProcess in interface PlainTextProcessor
Parameters:
ps - the plain sentence, which can consist of several sentences
Returns:
the first sentence recognized

initialize

public void initialize(java.lang.String baseDir,
                       java.lang.String configFile)
                throws java.io.FileNotFoundException,
                       java.io.IOException
Description copied from interface: Plugin
This method is called before the work flow starts in order to initialize the plug-in. A configuration file can be passed to the plug-in, which makes the plug-in more flexible.

Specified by:
initialize in interface Plugin
Parameters:
baseDir - the base directory of HanNanum files
configFile - the path for the configuration file
Throws:
java.io.FileNotFoundException
java.io.IOException

shutdown

public void shutdown()
Description copied from interface: Plugin
This method is called before the work flow is closed.

Specified by:
shutdown in interface Plugin

flush

public PlainSentence flush()
Description copied from interface: PlainTextProcessor
It returns the text that has been stored in the internal buffer. This method is called by the HanNanum work flow only if hasRemainingData() returns true.

Specified by:
flush in interface PlainTextProcessor
Returns:
the data in the internal buffer, or null if the internal buffer is empty

hasRemainingData

public boolean hasRemainingData()
Description copied from interface: PlainTextProcessor
It checks if there is any remaining text. If it returns true, the HanNanum work flow will not give more data to this plug-in; instead, it passes null to doProcess(). This is because, from the next phase on, the processing unit should be exactly one sentence. This mechanism frees the plug-in from managing an input buffer.

Specified by:
hasRemainingData in interface PlainTextProcessor
Returns:
true if there is remaining data, false if all of the given text has been processed
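
The null-passing mechanism described for hasRemainingData() can be sketched as a small driver loop. ToySegmentor and WorkflowSketch are hypothetical stand-ins that work on String rather than PlainSentence so that the example is self-contained; the actual HanNanum work flow code may differ, and flush() is left out for brevity.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for a sentence-splitting plug-in (not the HanNanum class).
class ToySegmentor {
    private final Deque<String> pending = new ArrayDeque<>();

    /** Non-null input is split and buffered; null means "continue with the buffer". */
    String doProcess(String text) {
        if (text != null) {
            // Naive split after '.', '!' or '?' followed by whitespace (illustration only).
            for (String s : text.split("(?<=[.!?])\\s+")) {
                if (!s.isEmpty()) {
                    pending.add(s);
                }
            }
        }
        // Return the first buffered sentence, or null if nothing is buffered.
        return pending.poll();
    }

    boolean hasRemainingData() {
        return !pending.isEmpty();
    }
}

// Hypothetical driver that follows the documented hasRemainingData() contract.
class WorkflowSketch {
    static List<String> run(ToySegmentor plugin, List<String> inputs) {
        List<String> out = new ArrayList<>();
        for (String input : inputs) {
            String sentence = plugin.doProcess(input);
            if (sentence != null) {
                out.add(sentence);
            }
            // While text is still buffered, no new input is given:
            // null is passed so the plug-in hands out one sentence per call.
            while (plugin.hasRemainingData()) {
                out.add(plugin.doProcess(null));
            }
        }
        return out;
    }
}
```

In this sketch the caller holds the next input until the buffer is drained, so each call to doProcess() yields at most one sentence for the following phase.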