Improved Feature Selection Approach TFIDF in Text Mining Paper Discussed
Abstract
This paper, appearing in the Proceedings of the First International Conference on Machine Learning and Cybernetics, discusses the TFIDF scoring method and suggests an improvement that increased classification accuracy using a Vector Space Model (VSM) and a Naive Bayes classifier. Term Frequency Inverse Document Frequency (TFIDF) is a method used by many text mining systems to score individual words within text documents in order to select concepts that accurately represent the content of each document. Once representative concepts are chosen, the documents can be either classified or clustered based on them.
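The TFIDF calculation walked through in this post can be sketched in a few lines of Python. This is only a minimal illustration, not code from the paper: the function name `tfidf` and its parameter names are made up for the example, and it assumes a base-10 logarithm, since that is what reproduces the figures quoted in the post.

```python
import math


def tfidf(term_count: int, total_docs: int, docs_with_term: int) -> float:
    """Hypothetical helper: term count times log10(total docs / docs containing the term)."""
    if docs_with_term <= 0 or docs_with_term > total_docs:
        raise ValueError("docs_with_term must be between 1 and total_docs")
    # Inverse document frequency, assuming a base-10 log (an assumption, see above).
    idf = math.log10(total_docs / docs_with_term)
    return term_count * idf


# A word appearing 10 times in document A and in 15 of 2,000 documents.
rare_word_score = tfidf(10, 2000, 15)
# A word appearing 10 times in document A and in 400 of 2,000 documents.
common_word_score = tfidf(10, 2000, 400)

print(f"{rare_word_score:.2f}")    # 21.25
print(f"{common_word_score:.2f}")  # 6.99, which the post rounds to 7
```

Real text mining systems often use variants of this formula (natural log, smoothing, length normalization), so the absolute scores can differ even though the ranking of words usually stays similar.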
TFIDF can be calculated by taking the number of times a word appears in a document and multiplying that number by the log (base 10 in the figures here) of the total number of documents divided by the number of documents in which the term appears. For example: in document A a word appears 10 times, there are 2,000 documents in the set, and the word in question appears in 15 of them. The score would then be 10 X log(2,000/15), or 10 X 2.125 = 21.25. If we looked at another word in document A that also appears 10 times, but this time the word appears in 400 of the 2,000 total documents, then the score would be 10 X log(2,000/400), or 10 X 0.7 = 7. The first word, with a score of 21.25, is a better choice for clustering and classification than the second because it is concentrated in far fewer documents and so does more to distinguish document A from the rest of the set.

When creating a Vector Space Model (VSM) using a Naive Bayes classifier, the use of TFIDF allowed for a classification accuracy of 76% (I will leave it to interested readers to refer back to the original article for the specific details of the study). While this is fairly good and is representative of similar efforts using TFIDF, the authors of this paper have proposed an alternative means of word scoring. In place of Inverse Document Frequency (the IDF part), the paper proposes that a mutual information (MI) evaluation function be used. The math of this function was not terribly well explained, so I will leave it to those who are better at mathematical functions than I am to have a look and decipher the equation.

For me the interesting outcome was that classification using TF X MI instead of TFIDF produced an accuracy of 88%. Results over 80% are usually pretty rare, so this improvement could be a significant advance. Hopefully someone at Micropatent/Aurigin or OmniViz will see this and try the appropriate experiments with ThemeScape and OmniViz.

Posted: Wed - April 30, 2003 at 07:16 AM, Patinformatics, Interesting Reference Articles
Published On: Jun 06, 2003 06:53 AM