For a dfm object, returns a (weighted) document frequency for each term. The default is a simple count of the number of documents in which a feature occurs more often than a given frequency threshold. (The default threshold is zero, so any feature occurring at least once in a document is counted.)
```r
docfreq(
  x,
  scheme = c("count", "inverse", "inversemax", "inverseprob", "unary"),
  base = 10,
  smoothing = 0,
  k = 0,
  threshold = 0
)
```
Argument | Description |
---|---|
x | a dfm |
scheme | type of document frequency weighting: one of "count", "inverse", "inversemax", "inverseprob", or "unary". In the weighting formulas, \(N\) is the number of documents in the dfm and \(s\) is the smoothing constant. |
base | the base with respect to which logarithms in the inverse document frequency weightings are computed; default is 10 (see Manning, Raghavan, and Schütze 2008, p. 123). |
smoothing | added to the quotient before taking the logarithm |
k | added to the denominator in the "inverse" weighting types, to prevent a zero document count for a term |
threshold | numeric value of the threshold above which a feature will be considered in the computation of document frequency. The default is 0, meaning that a feature's document frequency is the number of documents in which it occurs more than zero times. |
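
Writing \(df_j\) for the number of documents in which feature \(j\) occurs more than `threshold` times, the `scheme` options can be sketched as follows. This is a reconstruction from the argument descriptions (`smoothing` is added to the quotient, `k` to the denominator) and from the example output on this page, not a quotation of the package source. The floor at zero for "inverseprob" in particular is an assumption, inferred from the example returning 0 rather than a negative or infinite value.

```latex
\begin{aligned}
\text{count:} \quad & df_j \\
\text{inverse:} \quad & \log_{base}\left(s + \frac{N}{k + df_j}\right) \\
\text{inversemax:} \quad & \log_{base}\left(s + \frac{\max_{i} df_{i}}{k + df_j}\right) \\
\text{inverseprob:} \quad & \max\left(0,\ \log_{base}\frac{N - df_j}{k + df_j}\right) \\
\text{unary:} \quad & 1
\end{aligned}
```

For instance, with \(N = 2\), \(df_j = 1\), \(s = 0\), \(k = 0\), and base 10, the "inverse" weight is \(\log_{10}(2/1) \approx 0.30103\).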
Returns a numeric vector of document frequencies, one for each feature.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge: Cambridge University Press. https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
```r
dfmat1 <- dfm(data_corpus_inaugural[1:2])
#> Warning: 'dfm.corpus()' is deprecated. Use 'tokens()' first.
docfreq(dfmat1[, 1:20])
#> fellow-citizens              of             the          senate             and
#>               1               2               2               1               2
#>           house representatives               :           among    vicissitudes
#>               1               1               2               1               1
#>        incident              to            life              no           event
#>               1               2               1               1               1
#>           could            have          filled              me            with
#>               1               2               1               2               1

# replication of worked example from
# https://en.wikipedia.org/wiki/Tf-idf#Example_of_tf.E2.80.93idf
dfmat2 <-
    matrix(c(1,1,2,1,0,0, 1,1,0,0,2,3),
           byrow = TRUE, nrow = 2,
           dimnames = list(docs = c("document1", "document2"),
                           features = c("this", "is", "a", "sample",
                                        "another", "example"))) %>%
    as.dfm()
dfmat2
#> Document-feature matrix of: 2 documents, 6 features (33.33% sparse) and 0 docvars.
#>            features
#> docs        this is a sample another example
#>   document1    1  1 2      1       0       0
#>   document2    1  1 0      0       2       3

docfreq(dfmat2)
#>    this      is       a  sample another example
#>       2       2       1       1       1       1
docfreq(dfmat2, scheme = "inverse")
#>    this      is       a  sample another example
#> 0.00000 0.00000 0.30103 0.30103 0.30103 0.30103
docfreq(dfmat2, scheme = "inverse", k = 1, smoothing = 1)
#>      this        is         a    sample   another   example
#> 0.2218487 0.2218487 0.3010300 0.3010300 0.3010300 0.3010300
docfreq(dfmat2, scheme = "unary")
#>    this      is       a  sample another example
#>       1       1       1       1       1       1
docfreq(dfmat2, scheme = "inversemax")
#>    this      is       a  sample another example
#> 0.00000 0.00000 0.30103 0.30103 0.30103 0.30103
docfreq(dfmat2, scheme = "inverseprob")
#>    this      is       a  sample another example
#>       0       0       0       0       0       0
```
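
To make the arithmetic behind the "count" and "inverse" results concrete, here is a minimal base-R sketch that works on a plain matrix and does not call quanteda. `docfreq_sketch()` is a hypothetical helper written for illustration. Its "inverse" branch assumes the form log(smoothing + N / (k + df)), which matches the argument descriptions and the output shown for the two-document example but is not taken from the package source.

```r
# Hypothetical helper for illustration only; not a quanteda function.
docfreq_sketch <- function(m, scheme = c("count", "inverse"),
                           base = 10, smoothing = 0, k = 0, threshold = 0) {
  scheme <- match.arg(scheme)
  # For each feature (column), count the documents (rows) whose
  # count exceeds the threshold.
  df <- colSums(m > threshold)
  if (scheme == "count") {
    return(df)
  }
  # Assumed form of the "inverse" scheme: log_base(smoothing + N / (k + df)).
  log(smoothing + nrow(m) / (k + df), base = base)
}

# Same two-document matrix as in the worked example.
m <- matrix(c(1, 1, 2, 1, 0, 0,
              1, 1, 0, 0, 2, 3),
            byrow = TRUE, nrow = 2,
            dimnames = list(docs = c("document1", "document2"),
                            features = c("this", "is", "a", "sample",
                                         "another", "example")))

docfreq_sketch(m)                                            # 2 2 1 1 1 1
docfreq_sketch(m, scheme = "inverse")                        # 0 for "this" and "is", log10(2) otherwise
docfreq_sketch(m, scheme = "inverse", k = 1, smoothing = 1)  # log10(1 + 2/3) and log10(2)
docfreq_sketch(m, threshold = 1)                             # 0 0 1 0 1 1
```

The last call shows the effect of `threshold`: with `threshold = 1`, only counts of 2 or more qualify, so "this", "is", and "sample" drop to a document frequency of zero.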