Take a random sample of documents of the specified size from a dfm, with or without replacement, optionally by grouping variables or with probability weights.
dfm_sample(x, size = NULL, replace = FALSE, prob = NULL, by = NULL)
x | the dfm object whose documents will be sampled |
---|---|
size | a positive number, the number of documents to select; when used
with |
replace | if |
prob | a vector of probability weights for obtaining the elements of the
vector being sampled. May not be applied when |
by | optional grouping variable for sampling. This will be evaluated in
the docvars data.frame, so that docvars may be referred to by name without
quoting. This also changes previous behaviours for |
a dfm object (re)sampled on the documents, containing the document variables for the documents sampled.
set.seed(10) dfmat <- dfm(tokens(c("a b c c d", "a a c c d d d", "a b b c"))) dfmat #> Document-feature matrix of: 3 documents, 4 features (16.67% sparse) and 0 docvars. #> features #> docs a b c d #> text1 1 1 2 1 #> text2 2 0 2 3 #> text3 1 2 1 0 dfm_sample(dfmat) #> Document-feature matrix of: 3 documents, 4 features (16.67% sparse) and 0 docvars. #> features #> docs a b c d #> text3 1 2 1 0 #> text1 1 1 2 1 #> text2 2 0 2 3 dfm_sample(dfmat, replace = TRUE) #> Document-feature matrix of: 3 documents, 4 features (25.00% sparse) and 0 docvars. #> features #> docs a b c d #> text3.1 1 2 1 0 #> text2.1 2 0 2 3 #> text3.2 1 2 1 0 # by groups dfmat <- dfm(tokens(data_corpus_inaugural[50:58])) dfm_sample(dfmat, by = Party, size = 2) #> Document-feature matrix of: 4 documents, 3,000 features (74.72% sparse) and 4 docvars. #> features #> docs senator mathias , chief justice burger vice president bush #> 2009-Obama 0 0 130 0 0 0 0 1 1 #> 2013-Obama 0 0 99 1 2 0 1 2 0 #> 1989-Bush 2 0 166 1 2 0 1 6 0 #> 2001-Bush 0 0 110 0 3 0 1 3 0 #> features #> docs speaker #> 2009-Obama 0 #> 2013-Obama 0 #> 1989-Bush 3 #> 2001-Bush 0 #> [ reached max_nfeat ... 2,990 more features ]