● NMF = "non-negative matrix factorization"
● Dimension reduction technique
● NMF models are interpretable (unlike PCA)
● Easy to interpret means easy to explain!
● However, all sample features must be non-negative (>= 0) — NMF cannot be applied to data with negative values
● NMF expresses documents as combinations of topics (or "themes")
● NMF expresses images as combinations of patterns
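To make the "combinations of topics" idea concrete, here is a minimal sketch of NMF on a toy word-frequency matrix (the matrix values and `n_components=2` are illustrative assumptions, not from the lecture):

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative "word frequency" matrix: 4 documents x 5 words
X = np.array([
    [3, 0, 1, 0, 2],
    [2, 0, 0, 1, 3],
    [0, 4, 2, 3, 0],
    [0, 3, 1, 4, 0],
], dtype=float)

model = NMF(n_components=2, init='nndsvda', random_state=0, max_iter=500)
W = model.fit_transform(X)   # document-topic weights, shape (4, 2)
H = model.components_        # topic-word weights, shape (2, 5)

# Both factors are non-negative, so each document is an additive
# combination of "topics" -- this is what makes NMF interpretable.
print(W.shape, H.shape)
print((W >= 0).all() and (H >= 0).all())
```

Each row of `W` says how much of each topic a document contains, and each row of `H` says which words define a topic.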
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

plt.imshow(plt.imread('img.jpg'))
plt.axis('off')
● Word frequency array, 4 words, many documents
● Measure presence of words in each document using "tf-idf"
● "tf" = frequency of word in document
● "idf" reduces influence of frequent words
TF: Term Frequency, which measures how frequently a term occurs in a document. Since documents differ in length, a term is likely to appear many more times in a long document than in a short one. The term frequency is therefore often divided by the document length (i.e., the total number of terms in the document) as a form of normalization:
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).
IDF: Inverse Document Frequency, which measures how important a term is. When computing TF, all terms are treated as equally important. However, certain terms, such as "is", "of", and "that", may appear many times yet carry little meaning. We therefore weigh down the frequent terms while scaling up the rare ones by computing:

IDF(t) = log(Total number of documents / Number of documents with term t in it),

where the base of the logarithm is a free choice (the worked example below uses base 10).
Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log(10,000,000 / 1,000) = 4. Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12.
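The arithmetic in the example above can be checked in a few lines (using base-10 log, as the example does):

```python
import math

tf = 3 / 100                           # 'cat' appears 3 times in a 100-word document
idf = math.log10(10_000_000 / 1_000)   # 10 million documents, 'cat' in one thousand
tfidf = tf * idf

print(tf)      # 0.03
print(idf)     # 4.0
print(tfidf)   # 0.12
```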
from sklearn.decomposition import PCA

# Extract a readable class name from the repr, e.g. "PCA"
print(str(PCA).split('.')[-1].replace("'>", ''))
cat = plt.imread('cat.jpg')
print(cat.shape)
plt.imshow(cat)
(194, 259, 3)
def curiosity(image, n):
    from sklearn.decomposition import PCA, NMF, TruncatedSVD

    # Turn a colour image into greyscale by averaging the channels
    if image.ndim == 3:
        image = image.mean(axis=2)

    models = [PCA, NMF, TruncatedSVD]
    results = []
    plt.figure(figsize=(12, 12))
    for c, model_class in enumerate(models, start=1):
        m = model_class(n)
        d = m.fit_transform(image)    # compress to n components
        a = m.inverse_transform(d)    # reconstruct the image
        results.append(a)
        plt.subplot(1, 3, c)
        plt.imshow(a, cmap='gray')
        plt.axis('off')
        title = str(model_class).split('.')[-1].replace("'>", '')
        plt.title(title, size=15)
    print('3 dimension-reduction algorithms, compressed to')
    print('Number of components: ' + str(n))
    return results
test10 = curiosity(cat,10)
3 dimension-reduction algorithms, compressed to Number of components: 10
test30 = curiosity(cat, 30)

3 dimension-reduction algorithms, compressed to Number of components: 30
test50 = curiosity(cat, 50)

3 dimension-reduction algorithms, compressed to Number of components: 50