Take a random sample of documents of the specified size from a corpus, with or without replacement, optionally by grouping variables or with probability weights.
tokens_sample(x, size = NULL, replace = FALSE, prob = NULL, by = NULL)
x | a tokens object whose documents will be sampled |
---|---|
size | a positive number, the number of documents to select; when used
with |
replace | if |
prob | a vector of probability weights for obtaining the elements of the
vector being sampled. May not be applied when |
by | optional grouping variable for sampling. This will be evaluated in
the docvars data.frame, so that docvars may be referred to by name without
quoting. This also changes previous behaviours for |
a tokens object (re)sampled on the documents, containing the document variables for the documents sampled.
set.seed(123) toks <- tokens(data_corpus_inaugural[1:6]) toks #> Tokens consisting of 6 documents and 4 docvars. #> 1789-Washington : #> [1] "Fellow-Citizens" "of" "the" "Senate" #> [5] "and" "of" "the" "House" #> [9] "of" "Representatives" ":" "Among" #> [ ... and 1,525 more ] #> #> 1793-Washington : #> [1] "Fellow" "citizens" "," "I" "am" "again" #> [7] "called" "upon" "by" "the" "voice" "of" #> [ ... and 135 more ] #> #> 1797-Adams : #> [1] "When" "it" "was" "first" "perceived" "," #> [7] "in" "early" "times" "," "that" "no" #> [ ... and 2,565 more ] #> #> 1801-Jefferson : #> [1] "Friends" "and" "Fellow" "Citizens" ":" "Called" #> [7] "upon" "to" "undertake" "the" "duties" "of" #> [ ... and 1,911 more ] #> #> 1805-Jefferson : #> [1] "Proceeding" "," "fellow" "citizens" #> [5] "," "to" "that" "qualification" #> [9] "which" "the" "Constitution" "requires" #> [ ... and 2,368 more ] #> #> 1809-Madison : #> [1] "Unwilling" "to" "depart" "from" "examples" "of" #> [7] "the" "most" "revered" "authority" "," "I" #> [ ... and 1,249 more ] #> tokens_sample(toks) #> Tokens consisting of 6 documents and 4 docvars. #> 1797-Adams : #> [1] "When" "it" "was" "first" "perceived" "," #> [7] "in" "early" "times" "," "that" "no" #> [ ... and 2,565 more ] #> #> 1809-Madison : #> [1] "Unwilling" "to" "depart" "from" "examples" "of" #> [7] "the" "most" "revered" "authority" "," "I" #> [ ... and 1,249 more ] #> #> 1793-Washington : #> [1] "Fellow" "citizens" "," "I" "am" "again" #> [7] "called" "upon" "by" "the" "voice" "of" #> [ ... and 135 more ] #> #> 1801-Jefferson : #> [1] "Friends" "and" "Fellow" "Citizens" ":" "Called" #> [7] "upon" "to" "undertake" "the" "duties" "of" #> [ ... and 1,911 more ] #> #> 1805-Jefferson : #> [1] "Proceeding" "," "fellow" "citizens" #> [5] "," "to" "that" "qualification" #> [9] "which" "the" "Constitution" "requires" #> [ ... and 2,368 more ] #> #> 1789-Washington : #> [1] "Fellow-Citizens" "of" "the" "Senate" #> [5] "and" "of" "the" "House" #> [9] "of" "Representatives" ":" "Among" #> [ ... and 1,525 more ] #> tokens_sample(toks, replace = TRUE) %>% docnames() #> [1] "1805-Jefferson.1" "1801-Jefferson.1" "1809-Madison.1" #> [4] "1809-Madison.2" "1789-Washington.1" "1793-Washington.1" tokens_sample(toks, size = 3, replace = TRUE) %>% docnames() #> [1] "1797-Adams.1" "1805-Jefferson.1" "1797-Adams.2" # sampling using by docvars(toks) #> Year President FirstName Party #> 1 1789 Washington George none #> 2 1793 Washington George none #> 3 1797 Adams John Federalist #> 4 1801 Jefferson Thomas Democratic-Republican #> 5 1805 Jefferson Thomas Democratic-Republican #> 6 1809 Madison James Democratic-Republican tokens_sample(toks, size = 2, replace = TRUE, by = Party) %>% docnames() #> [1] "1809-Madison.1" "1801-Jefferson.1" "1797-Adams.1" #> [4] "1797-Adams.2" "1789-Washington.1" "1789-Washington.2"