International Conference on Advanced Computing, Communication and Networks - CCN 2011
Author(s) : B ESWARA REDDY, K MUNIVELU REDDY
Most document clustering techniques are based on statistical analysis of a term, either a word or a phrase. Statistical analysis of term frequency captures the importance of a term within the document only. The underlying mining model should therefore indicate terms that capture the semantics of the text. In that case, the mining model can capture terms that present the concepts of the sentence, which leads to the discovery of the topic of the document. A new concept-based mining model focuses on web document clustering; the model consists of three components: a concept-based statistical analyzer, a COG, and a concept extractor. The statistical analyzer analyzes terms at the sentence and document levels. The COG extracts the most important terms with respect to the meaning of the text. The concept extractor selects the concepts that have the maximum weights. The similarity between documents is calculated with the concept-based document similarity measure; it is the combination of , and . The experimental results demonstrate an extensive comparison between the concept-based analysis and the statistical analysis.
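To make the pipeline described in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a term's weight can be approximated by multiplying its document-level frequency by the number of sentences that contain it, that the "extractor" keeps the k highest-weighted terms, and that similarity is plain cosine similarity over those weights. The function names (`term_weights`, `top_terms`, `cosine`) and the weighting formula are hypothetical; the COG step and the actual concept-based similarity measure are not reproduced here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def split_sentences(document):
    """Naive sentence splitter on '.', '!' and '?' (an assumption)."""
    return [s for s in re.split(r"[.!?]+", document) if s.strip()]


def term_weights(document):
    """Combine document-level and sentence-level statistics.

    Hypothetical weighting: (occurrences in the document) multiplied by
    (number of sentences containing the term). This stands in for the
    paper's analyzer, whose exact formula is not given in the abstract.
    """
    sentences = [tokenize(s) for s in split_sentences(document)]
    doc_tf = Counter(t for s in sentences for t in s)
    sent_count = Counter(t for s in sentences for t in set(s))
    return {t: doc_tf[t] * sent_count[t] for t in doc_tf}


def top_terms(weights, k):
    """Keep the k highest-weighted terms (ties broken alphabetically)."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:k])


def cosine(a, b):
    """Cosine similarity between two sparse weight dictionaries."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    d1 = "Cats chase mice. Cats sleep."
    d2 = "Dogs chase cats. Dogs bark."
    w1 = top_terms(term_weights(d1), k=3)
    w2 = top_terms(term_weights(d2), k=3)
    print(w1, w2, cosine(w1, w2))
```

The sketch does no stop-word removal or stemming, so on real web documents frequent function words would dominate the weights; a semantic analysis such as the one the abstract describes is meant to avoid exactly that limitation.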