Developer function to match patterns in quanteda objects against token types.

object2id(
  x,
  types,
  valuetype = c("glob", "fixed", "regex"),
  case_insensitive = TRUE,
  concatenator = "_",
  levels = 1,
  remove_unigram = FALSE,
  keep_nomatch = FALSE
)

object2fixed(
  x,
  types,
  valuetype = c("glob", "fixed", "regex"),
  case_insensitive = TRUE,
  concatenator = "_",
  levels = 1,
  remove_unigram = FALSE,
  keep_nomatch = FALSE
)

Argument | Description |
---|---|
x | a list of character vectors, a dictionary, or a collocations object |
types | token types against which patterns are matched |
valuetype | the type of pattern matching: "glob", "fixed", or "regex" |
case_insensitive | logical; if TRUE, ignore case when matching patterns |
concatenator | the concatenation character that joins multi-word expressions in types |
levels | integers specifying the levels of entries in a hierarchical dictionary that will be applied. The top level is 1, and subsequent levels describe lower nesting levels. Values may be combined, even if these levels are not contiguous. |
remove_unigram | logical; if TRUE, ignore single-word patterns |
keep_nomatch | keep patterns that did not match |
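
object2id returns a list of integer vectors containing the indices of matched types. As the examples show, object2fixed returns the matched types themselves as character vectors.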
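
To make the idea of "indices of matched types" concrete, here is a minimal base R sketch of glob matching against a vector of types. It is not quanteda's implementation: `match_types` is a hypothetical helper written for illustration, it relies only on `utils::glob2rx()` and `grep()`, and it ignores multi-word patterns, `concatenator`, and dictionary `levels`.

```r
# Hypothetical helper (not part of quanteda): match glob patterns against
# token types and return the integer positions of the matching types.
match_types <- function(patterns, types, case_insensitive = TRUE) {
  out <- lapply(patterns, function(p) {
    rx <- utils::glob2rx(p)  # translate the glob into an anchored regex
    grep(rx, types, ignore.case = case_insensitive)
  })
  names(out) <- patterns
  out
}

types <- c("A", "AA", "B", "BB", "B_B", "C", "C-C")
ids <- match_types(c("a", "aa", "b*", "zz"), types)

ids[["a"]]   # position of "A" in types
ids[["b*"]]  # positions of "B", "BB" and "B_B"
ids[["zz"]]  # empty: no type matches
```

Each element of `ids` holds positions in `types`, so `types[ids[["b*"]]]` recovers the matched strings. A pattern with no match yields an empty integer vector, which is the kind of entry the `keep_nomatch` argument decides whether to keep.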

library(quanteda)  # provides dictionary() and phrase()

types <- c("A", "AA", "B", "BB", "B_B", "C", "C-C")

# dictionary
dict <- dictionary(list(A = c("a", "aa"),
                        B = c("BB", "B B"),
                        C = c("C", "C-C")))
object2fixed(dict, types)
#> $A
#> [1] "A"
#>
#> $A
#> [1] "AA"
#>
#> $B
#> [1] "BB"
#>
#> $B
#> [1] "B" "B"
#>
#> $B
#> [1] "B_B"
#>
#> $C
#> [1] "C"
#>
#> $C
#> [1] "C-C"
#>
object2fixed(dict, types, remove_unigram = TRUE)
#> $B
#> [1] "B" "B"
#>

# phrase
pats <- phrase(c("a", "aa", "zz", "bb", "b b"))
object2fixed(pats, types)
#> $a
#> [1] "A"
#>
#> $aa
#> [1] "AA"
#>
#> $bb
#> [1] "BB"
#>
#> $`b b`
#> [1] "B" "B"
#>
object2fixed(pats, types, keep_nomatch = TRUE)
#> $a
#> [1] "A"
#>
#> $aa
#> [1] "AA"
#>
#> $zz
#> character(0)
#>
#> $bb
#> [1] "BB"
#>
#> $`b b`
#> [1] "B" "B"
#>