
33. Introduction to Text Mining in Python

These exercises were adapted from Mining the Social Web, 2nd Edition (see the original here), which is distributed under the Simplified BSD License that governs its use.

33.1. Key Terms for Text Mining

  • Corpus – a collection of documents

  • Document – a single piece of text

  • Term/token – an individual word in a document

  • Entity – a person, place, or organization mentioned in a document

 corpus = { 
 'a' : "Mr. Green killed Colonel Mustard in the study with the candlestick. \
Mr. Green is not a very nice fellow.",
 'b' : "Professor Plum has a green plant in his study.",
 'c' : "Miss Scarlett watered Professor Plum's green plant while he was away \
from his office last week."
}

#This will separate the documents (sentences) into terms/tokens/words.
terms = {
 'a' : [ i.lower() for i in corpus['a'].split() ],
 'b' : [ i.lower() for i in corpus['b'].split() ],
 'c' : [ i.lower() for i in corpus['c'].split() ]
 }
terms
{'a': ['mr.',
  'green',
  'killed',
  'colonel',
  'mustard',
  'in',
  'the',
  'study',
  'with',
  'the',
  'candlestick.',
  'mr.',
  'green',
  'is',
  'not',
  'a',
  'very',
  'nice',
  'fellow.'],
 'b': ['professor',
  'plum',
  'has',
  'a',
  'green',
  'plant',
  'in',
  'his',
  'study.'],
 'c': ['miss',
  'scarlett',
  'watered',
  'professor',
  "plum's",
  'green',
  'plant',
  'while',
  'he',
  'was',
  'away',
  'from',
  'his',
  'office',
  'last',
  'week.']}
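
As a quick sanity check on the tokenization, we could tally term counts with Python's collections.Counter (a small sketch; Counter is not part of the original exercise):

from collections import Counter

#Count how often each token appears in document 'a'.
counts_a = Counter(terms['a'])
print(counts_a.most_common(3))   #e.g., [('mr.', 2), ('green', 2), ('the', 2)]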

33.2. Term Frequency

  • A very common measure is how frequently a term occurs within a document.

  • This is essentially how early web search engines worked (not very well).

  • A common basic standardization is to normalize the count by the number of words in the document.

from math import log

#These are the query terms we would like to score.
QUERY_TERMS = ['mr.', 'green']

#This calculates the term frequency, optionally normalized by document length.
def tf(term, doc, normalize):
    doc = doc.lower().split()
    if normalize:
        return doc.count(term.lower()) / float(len(doc))
    else:
        return float(doc.count(term.lower()))
 
#This prints the raw documents. We can see that 'Mr. Green' appears only in the first document.
for (k, v) in sorted(corpus.items()):
    print (k, ':', v)
print('\n')
    
a : Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.
b : Professor Plum has a green plant in his study.
c : Miss Scarlett watered Professor Plum's green plant while he was away from his office last week.


# Print per-document tf for each query term (normalized, then unnormalized).
# A cumulative query score is sketched after the output below.

#This iterates over each query term
for term in [t.lower() for t in QUERY_TERMS]:
    #This iterates over each document in the corpus
    for doc in sorted(corpus):
        print ('TF(%s): %s' % (doc, term), tf(term, corpus[doc], True))
        
print('\n')   #Let's skip a line.     
print ("This does the same thing but unnormalized.")
for term in [t.lower() for t in QUERY_TERMS]:
    #This iterates over each document in the corpus
    for doc in sorted(corpus):
        print ('TF(%s): %s' % (doc, term), tf(term, corpus[doc], False))
TF(a): mr. 0.10526315789473684
TF(b): mr. 0.0
TF(c): mr. 0.0
TF(a): green 0.10526315789473684
TF(b): green 0.1111111111111111
TF(c): green 0.0625


This does the same thing but unnormalized.
TF(a): mr. 2.0
TF(b): mr. 0.0
TF(c): mr. 0.0
TF(a): green 2.0
TF(b): green 1.0
TF(c): green 1.0
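
The printout above lists per-term tf values. To rank documents for the whole query, we can sum the normalized tf of each query term per document. Below is a minimal sketch using the tf helper defined above (tf_query_scores is a name introduced here for illustration):

tf_query_scores = {doc: 0.0 for doc in corpus}
for term in [t.lower() for t in QUERY_TERMS]:
    for doc in corpus:
        tf_query_scores[doc] += tf(term, corpus[doc], True)

#Document 'a' wins on raw term frequency: roughly a 0.2105, b 0.1111, c 0.0625.
for doc, score in sorted(tf_query_scores.items()):
    print(doc, score)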

33.3. TF-IDF

  • TF-IDF incorporates the inverse document frequency into the analysis. This factor limits the impact of common words that show up in a large number of documents.

  • The tf-idf calculation involves multiplying against a tf value less than 1, so idf is defined to return a value greater than or equal to 1 for consistent scoring. (Multiplying two values less than 1 returns a value less than each of them.)
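
Written out, the formulas implemented below are (this notebook's variant; exact idf definitions differ across libraries):

tf(t, d) = (count of t in d) / (number of terms in d)
idf(t, C) = 1 + ln(|C| / (number of documents in C that contain t))
tf-idf(t, d, C) = tf(t, d) * idf(t, C)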

def idf(term, corpus):
    #Count the documents in the corpus that contain the term.
    num_texts_with_term = len([True for text in corpus if term.lower()
                              in text.lower().split()])
    try:
        #Add 1 so idf is always >= 1 (see the note above).
        return 1.0 + log(float(len(corpus)) / num_texts_with_term)
    except ZeroDivisionError:
        return 1.0

    
for term in [t.lower() for t in QUERY_TERMS]:
    print ('IDF: %s' % (term, ), idf(term, corpus.values()))
        


IDF: mr. 2.09861228866811
IDF: green 1.0
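
We can verify these by hand (math.log is the natural log): 'mr.' appears in 1 of 3 documents, and 'green' appears in all 3.

from math import log

print(1.0 + log(3 / 1))   #idf for 'mr.': 1 + ln(3) ≈ 2.0986
print(1.0 + log(3 / 3))   #idf for 'green': 1 + ln(1) = 1.0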

#TF-IDF just multiplies the term frequency and inverse document frequency together.
def tf_idf(term, doc, corpus):
    return tf(term, doc, True) * idf(term, corpus)

query_scores = {'a': 0, 'b': 0, 'c': 0}
for term in [t.lower() for t in QUERY_TERMS]:
    for doc in sorted(corpus):
        print ('TF(%s): %s' % (doc, term), tf(term, corpus[doc], True))
    print ('IDF: %s' % (term, ), idf(term, corpus.values()))
    print('\n')

    for doc in sorted(corpus):
        score = tf_idf(term, corpus[doc], corpus.values())
        print ('TF-IDF(%s): %s' % (doc, term), score)
        query_scores[doc] += score
        print('\n')

print ("Overall TF-IDF scores for query '%s'" % (' '.join(QUERY_TERMS), ))
for (doc, score) in sorted(query_scores.items()):
    print (doc, score)
TF(a): mr. 0.10526315789473684
TF(b): mr. 0.0
TF(c): mr. 0.0
IDF: mr. 2.09861228866811


TF-IDF(a): mr. 0.22090655670190631


TF-IDF(b): mr. 0.0


TF-IDF(c): mr. 0.0


TF(a): green 0.10526315789473684
TF(b): green 0.1111111111111111
TF(c): green 0.0625
IDF: green 1.0


TF-IDF(a): green 0.10526315789473684


TF-IDF(b): green 0.1111111111111111


TF-IDF(c): green 0.0625


Overall TF-IDF scores for query 'mr. green'
a 0.3261697145966431
b 0.1111111111111111
c 0.0625
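
In practice, libraries implement tf-idf for you. Below is a minimal sketch using scikit-learn (an assumption: scikit-learn must be installed, and its TfidfVectorizer uses a slightly different idf formula plus L2 normalization, so the numbers will not match the hand-rolled version above):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus.values())   #rows = documents, columns = vocabulary terms

#Inspect the tf-idf weight of 'green' in each document.
green_idx = vectorizer.vocabulary_['green']
print(X[:, green_idx].toarray().ravel())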