R/tokens_group.R
Combine documents in a tokens object by a grouping variable, concatenating the tokens within each group in the order of the documents.
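The concatenation described above can be sketched in base R, without quanteda. This is only a conceptual illustration and not the package's implementation; the names `docs`, `groups`, and `grouped` are made up for the example, and plain character vectors stand in for tokenized documents.

```r
# Four "documents", each a character vector of tokens (hypothetical data
# mirroring the example corpus further down this page)
docs <- list(
  d1 = c("a", "a", "b"),
  d2 = c("a", "b", "c", "c"),
  d3 = c("a", "c", "d", "d"),
  d4 = c("a", "c", "c", "d")
)

# One group label per document
groups <- c("grp1", "grp1", "grp2", "grp2")

# Split the documents by group, then concatenate the tokens of each group
# in document order
grouped <- lapply(split(docs, groups), function(g) unlist(g, use.names = FALSE))

grouped$grp1
grouped$grp2
```

Each element of `grouped` holds the tokens of all documents in that group, in their original order, which is what the grouped tokens object contains.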
tokens_group(x, groups = docid(x), fill = FALSE)
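Argument | Description |
---|---|
x | tokens object |
groups | grouping variable for sampling, equal in length to the number of documents. This will be evaluated in the docvars data.frame, so that docvars may be referred to by name without quoting. This also changes previous behaviours for `groups`. See `news(Version >= "3.0", package = "quanteda")` for details. |
fill | logical; if `TRUE` and `groups` is a factor, then use all levels of the factor when forming the new documents of the grouped object. This results in a new "document" with empty content for each level not observed. Has no effect if `groups` is not a factor. |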
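Because `groups` is evaluated in the docvars data.frame, a docvar can be named without quotes, or the same values can be passed as an ordinary vector. The sketch below assumes quanteda version 3 or later is installed; `by_name`, `by_vector`, and `grp_vec` are illustrative names only.

```r
library(quanteda)

corp <- corpus(
  c("a a b", "a b c c", "a c d d", "a c c d"),
  docvars = data.frame(grp = c("grp1", "grp1", "grp2", "grp2"))
)
toks <- tokens(corp)

# Unquoted docvar name, looked up inside the docvars data.frame
by_name <- tokens_group(toks, groups = grp)

# The same grouping supplied as an explicit vector, one value per document
grp_vec <- docvars(toks, "grp")
by_vector <- tokens_group(toks, groups = grp_vec)

# Both calls should yield the same grouped tokens
identical(as.list(by_name), as.list(by_vector))
```

Referring to the docvar by name is the shorter form; the explicit vector is useful when the grouping values are not stored in the object.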
a tokens object whose documents are equal to the unique group combinations, and whose tokens are the concatenations of the tokens by group. Document-level variables that have no variation within groups are saved in docvars. Document-level variables that are lists are dropped from grouping, even when these exhibit no variation within groups.
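To see how document-level variables are handled, the sketch below adds two docvars to the example corpus: `year`, which is constant within each group, and `doc_no`, which varies within each group. Both variables are invented for this illustration. Following the description above, `year` would be expected to survive grouping and `doc_no` would not.

```r
library(quanteda)

corp <- corpus(
  c("a a b", "a b c c", "a c d d", "a c c d"),
  docvars = data.frame(
    grp    = c("grp1", "grp1", "grp2", "grp2"),
    year   = c(2020, 2020, 2021, 2021),  # no variation within groups
    doc_no = 1:4                         # varies within groups
  )
)
toks <- tokens(corp)

grouped <- tokens_group(toks, groups = grp)

# One document per group
ndoc(grouped)

# Inspect which document-level variables remain after grouping
names(docvars(grouped))
```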
    corp <- corpus(c("a a b", "a b c c", "a c d d", "a c c d"),
                   docvars = data.frame(grp = c("grp1", "grp1", "grp2", "grp2")))
    toks <- tokens(corp)
    tokens_group(toks, groups = grp)
    #> Tokens consisting of 2 documents and 1 docvar.
    #> grp1 :
    #> [1] "a" "a" "b" "a" "b" "c" "c"
    #>
    #> grp2 :
    #> [1] "a" "c" "d" "d" "a" "c" "c" "d"
    #>
    tokens_group(toks, groups = c(1, 1, 2, 2))
    #> Tokens consisting of 2 documents and 1 docvar.
    #> 1 :
    #> [1] "a" "a" "b" "a" "b" "c" "c"
    #>
    #> 2 :
    #> [1] "a" "c" "d" "d" "a" "c" "c" "d"
    #>

    # with fill
    tokens_group(toks, groups = factor(c(1, 1, 2, 2), levels = 1:3))
    #> Tokens consisting of 2 documents and 1 docvar.
    #> 1 :
    #> [1] "a" "a" "b" "a" "b" "c" "c"
    #>
    #> 2 :
    #> [1] "a" "c" "d" "d" "a" "c" "c" "d"
    #>
    tokens_group(toks, groups = factor(c(1, 1, 2, 2), levels = 1:3), fill = TRUE)
    #> Tokens consisting of 3 documents and 1 docvar.
    #> 1 :
    #> [1] "a" "a" "b" "a" "b" "c" "c"
    #>
    #> 2 :
    #> [1] "a" "c" "d" "d" "a" "c" "c" "d"
    #>
    #> 3 :
    #> character(0)
    #>