Returns the subset of documents in a tokens object that meet certain conditions, including
direct logical operations on docvars (document-level variables).
tokens_subset() functions identically to subset.data.frame(), using
non-standard evaluation to evaluate conditions based on the docvars in the
tokens.
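Because the behaviour is modelled on subset.data.frame(), a base R sketch can illustrate the two properties that carry over: the condition is written with bare column names, and rows where the condition evaluates to NA are dropped. The data frame `docvars_df` below is a hypothetical stand-in for a set of docvars, not part of the package.

```r
# Hypothetical stand-in for document-level variables, one row per document
docvars_df <- data.frame(grp = c(1, NA, 3),
                         row.names = c("d1", "d2", "d3"))

# Non-standard evaluation: `grp` is looked up inside the data frame,
# so no docvars_df$grp prefix is needed
kept <- subset(docvars_df, grp > 1)

# d1 fails the condition; d2 yields NA, which subset() treats as FALSE
rownames(kept)
#> [1] "d3"
```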
tokens_subset(x, subset, drop_docid = TRUE, ...)
Argument | Description |
---|---|
x | tokens object to be subsetted |
subset | logical expression indicating the documents to keep: missing values are taken as false |
drop_docid | if |
... | not used |
tokens object, with a subset of documents (and docvars) selected according to arguments
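The note that missing values are taken as false can be checked with a small sketch. It assumes the quanteda package is installed and that a docvar may hold NA; the object names `corp_na`, `toks_na`, and `sub_na` are illustrative only.

```r
library(quanteda)

# Three documents, one of which has a missing value for the docvar `grp`
corp_na <- corpus(c(d1 = "a b c", d2 = "c d e", d3 = "e f g"),
                  docvars = data.frame(grp = c(1, NA, 3)))
toks_na <- tokens(corp_na)

# For d2 the condition evaluates to NA; per the argument description
# this is treated as FALSE, so d2 should not be kept
sub_na <- tokens_subset(toks_na, grp > 1)

# Inspect which documents remain
docnames(sub_na)
ndoc(sub_na)
```

If the documented rule holds, only d3 should remain: d1 fails the condition and d2 has a missing value.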
corp <- corpus(c(d1 = "a b c d", d2 = "a a b e", d3 = "b b c e",
                 d4 = "e e f a b"),
               docvars = data.frame(grp = c(1, 1, 2, 3)))
toks <- tokens(corp)

# selecting on a docvars condition
tokens_subset(toks, grp > 1)
#> Tokens consisting of 2 documents and 1 docvar.
#> d3 :
#> [1] "b" "b" "c" "e"
#>
#> d4 :
#> [1] "e" "e" "f" "a" "b"
#>

# selecting on a supplied vector
tokens_subset(toks, c(TRUE, FALSE, TRUE, FALSE))
#> Tokens consisting of 2 documents and 1 docvar.
#> d1 :
#> [1] "a" "b" "c" "d"
#>
#> d3 :
#> [1] "b" "b" "c" "e"
#>