Replace multi-token sequences with a multi-word, or "compound" token. The
resulting compound tokens will represent a phrase or multi-word expression,
concatenated with the concatenator (by default, the "_" character) to form a
single "token". This ensures that the sequences will be processed
subsequently as single tokens, for instance in constructing a dfm.
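As a quick sketch of that last point, a compounded phrase then appears as a single feature when a dfm is built from the tokens (standard quanteda usage):

```r
library(quanteda)

toks <- tokens("The United Kingdom is leaving the European Union.",
               remove_punct = TRUE)
toks <- tokens_compound(toks, phrase("European Union"))

# "European_Union" is now counted as one feature, not two
dfm(toks)
```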
tokens_compound(
  x,
  pattern,
  valuetype = c("glob", "regex", "fixed"),
  concatenator = "_",
  window = 0L,
  case_insensitive = TRUE,
  join = TRUE
)
| Argument | Description |
|---|---|
| x | an input tokens object |
| pattern | a character vector, list of character vectors, dictionary, or collocations object. See pattern for details. |
| valuetype | the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching |
| concatenator | the concatenation character that will connect the words making up the multi-word sequences. The default "_" is recommended since it will not be removed during normal cleaning and tokenization |
| window | integer; a vector of length 1 or 2 that specifies the size of the window of tokens adjacent to pattern that will be compounded with matches to pattern. The window can be asymmetric if two elements are specified, with the first giving the window size before pattern and the second the window size after |
| case_insensitive | logical; if TRUE, ignore case when matching a pattern or dictionary values |
| join | logical; if TRUE, join overlapping compounds into a single compound; otherwise, form these separately |
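To illustrate two of these arguments, the following sketch (using the same sentence as the examples below) changes the concatenator and uses a symmetric window to pull adjacent tokens into the compound:

```r
library(quanteda)

toks <- tokens("The United Kingdom is leaving the European Union.",
               remove_punct = TRUE)

# join with a hyphen instead of the default "_"
tokens_compound(toks, phrase("European Union"), concatenator = "-")

# compound one token on each side of every match to "leaving"
tokens_compound(toks, "leaving", window = 1)
```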
A tokens object in which the token sequences matching pattern
have been replaced by new compounded "tokens" joined by the concatenator.
Patterns to be compounded (naturally) consist of multi-word sequences,
and how these are expected in pattern is very specific. If the elements
to be compounded are supplied as space-delimited elements of a character
vector, wrap the vector in phrase(). If the elements to be compounded
are separate elements of a character vector, supply it as a list where each
list element is the sequence of character elements.
See the examples below.
```r
txt <- "The United Kingdom is leaving the European Union."
toks <- tokens(txt, remove_punct = TRUE)

# character vector - not compounded
tokens_compound(toks, c("United", "Kingdom", "European", "Union"))
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "The"      "United"   "Kingdom"  "is"       "leaving"  "the"      "European"
#> [8] "Union"

# elements separated by spaces - not compounded
tokens_compound(toks, c("United Kingdom", "European Union"))
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "The"      "United"   "Kingdom"  "is"       "leaving"  "the"      "European"
#> [8] "Union"

# list of characters - is compounded
tokens_compound(toks, list(c("United", "Kingdom"), c("European", "Union")))
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "The"            "United_Kingdom" "is"             "leaving"
#> [5] "the"            "European_Union"

# elements separated by spaces, wrapped in phrase() - is compounded
tokens_compound(toks, phrase(c("United Kingdom", "European Union")))
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "The"            "United_Kingdom" "is"             "leaving"
#> [5] "the"            "European_Union"

# supplied as values in a dictionary (same as list) - is compounded
# (keys do not matter)
tokens_compound(toks, dictionary(list(key1 = "United Kingdom",
                                      key2 = "European Union")))
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "The"            "United_Kingdom" "is"             "leaving"
#> [5] "the"            "European_Union"

# pattern as dictionaries with glob matches
tokens_compound(toks, dictionary(list(key1 = c("U* K*"))), valuetype = "glob")
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "The"            "United_Kingdom" "is"             "leaving"
#> [5] "the"            "European"       "Union"

# note the differences caused by join = FALSE
compounds <- list(c("the", "European"), c("European", "Union"))
tokens_compound(toks, pattern = compounds, join = TRUE)
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "The"                "United"             "Kingdom"
#> [4] "is"                 "leaving"            "the_European_Union"

tokens_compound(toks, pattern = compounds, join = FALSE)
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "The"            "United"         "Kingdom"        "is"
#> [5] "leaving"        "the_European"   "European_Union"

# use window to form ngrams
tokens_remove(toks, pattern = stopwords("en")) %>%
    tokens_compound(pattern = "leav*", join = FALSE, window = c(0, 3))
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "United"                 "Kingdom"                "leaving_European_Union"
```