Segment tokens by splitting on a pattern match. This is useful for breaking
the tokenized texts into smaller document units, based on a regular pattern
or a user-supplied annotation. While it normally makes more sense to do this
at the corpus level (see corpus_segment()), tokens_segment() provides the
option to perform this operation on tokens.
    tokens_segment(
      x,
      pattern,
      valuetype = c("glob", "regex", "fixed"),
      case_insensitive = TRUE,
      extract_pattern = FALSE,
      pattern_position = c("before", "after"),
      use_docvars = TRUE
    )
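Argument | Description |
---|---|
x | tokens object whose token elements will be segmented |
pattern | a character vector, list of character vectors, dictionary, or collocations object. See pattern for details. |
valuetype | the type of pattern matching: "glob" for glob-style wildcard expressions, "regex" for regular expressions, or "fixed" for exact matching |
case_insensitive | logical; if TRUE, ignore case when matching the pattern |
extract_pattern | if TRUE, remove matched patterns from the texts and save them in docvars |
pattern_position | either "before" or "after", depending on whether the pattern precedes the text (as with a tag) or follows it (as with punctuation delimiters) |
use_docvars | if TRUE, repeat the docvar values for each segmented text; if FALSE, drop the docvars |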
tokens_segment() returns a tokens object whose documents have been split at
each match of the pattern.
    library("quanteda")

    txts <- paste(
      "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate.",
      "When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor."
    )
    toks <- tokens(txts)

    # split on sentence-terminating punctuation (Unicode property Sterm)
    tokens_segment(toks, "^\\p{Sterm}$", valuetype = "regex",
                   extract_pattern = TRUE, pattern_position = "after")
    #> Tokens consisting of 2 documents and 1 docvar.
    #> text1.1 :
    #>  [1] "Fellow"   "citizens" ","        "I"        "am"       "again"
    #>  [7] "called"   "upon"     "by"       "the"      "voice"    "of"
    #> [ ... and 10 more ]
    #>
    #> text1.2 :
    #>  [1] "When"     "the"      "occasion" "proper"   "for"      "it"
    #>  [7] "shall"    "arrive"   ","        "I"        "shall"    "endeavor"
    #> [ ... and 11 more ]
    #>

    # the same split, listing the delimiters as fixed patterns
    tokens_segment(toks, c(".", "?", "!"), valuetype = "fixed",
                   extract_pattern = TRUE, pattern_position = "after")
    #> Tokens consisting of 2 documents and 1 docvar.
    #> text1.1 :
    #>  [1] "Fellow"   "citizens" ","        "I"        "am"       "again"
    #>  [7] "called"   "upon"     "by"       "the"      "voice"    "of"
    #> [ ... and 10 more ]
    #>
    #> text1.2 :
    #>  [1] "When"     "the"      "occasion" "proper"   "for"      "it"
    #>  [7] "shall"    "arrive"   ","        "I"        "shall"    "endeavor"
    #> [ ... and 11 more ]
    #>
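The examples above split on delimiters that follow each segment. The description also mentions user-supplied annotations, which usually come before the text they introduce. The following is a minimal sketch of that case, assuming a made-up marker token "SECTION" that is not part of quanteda. The tokens are built with as.tokens() from a plain list so that the result does not depend on how tokens() would split the marker.

```r
library("quanteda")

# Hypothetical input: "SECTION" is an illustrative marker chosen for this sketch.
toks <- as.tokens(list(doc1 = c("SECTION", "alpha", "beta",
                                "SECTION", "gamma")))

# The marker precedes the text it introduces, so use pattern_position = "before".
# extract_pattern = TRUE removes the marker from the tokens and keeps it in docvars.
segs <- tokens_segment(toks, "SECTION", valuetype = "fixed",
                       extract_pattern = TRUE, pattern_position = "before")

ndoc(segs)      # number of segments produced
as.list(segs)   # tokens in each segment, with the marker removed
docvars(segs)   # document variables, including the extracted marker
```

With real text, the marker would have to survive tokenization as a single token for a fixed match like this to work, so it is worth inspecting the tokens before segmenting.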