Developer function to match patterns in quanteda objects against token types.

object2id(
  x,
  types,
  valuetype = c("glob", "fixed", "regex"),
  case_insensitive = TRUE,
  concatenator = "_",
  levels = 1,
  remove_unigram = FALSE,
  keep_nomatch = FALSE
)

object2fixed(
  x,
  types,
  valuetype = c("glob", "fixed", "regex"),
  case_insensitive = TRUE,
  concatenator = "_",
  levels = 1,
  remove_unigram = FALSE,
  keep_nomatch = FALSE
)

Arguments

x

a list of character vectors, dictionary or collocations object

types

token types against which patterns are matched

valuetype

the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.
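The three value types can be illustrated with a minimal base R sketch. The helper `match_type_ids()` below is hypothetical and is not part of quanteda; it only approximates the idea of turning one pattern into indices of `types`, assuming that globs are translated to anchored regular expressions with `utils::glob2rx()`.

```r
# Hypothetical helper (not part of quanteda): return the indices of `types`
# matched by a single pattern under each value type.
match_type_ids <- function(pattern, types,
                           valuetype = c("glob", "fixed", "regex"),
                           case_insensitive = TRUE) {
  valuetype <- match.arg(valuetype)
  if (valuetype == "fixed") {
    # Exact comparison, optionally after lower-casing both sides
    if (case_insensitive) {
      return(which(tolower(types) == tolower(pattern)))
    }
    return(which(types == pattern))
  }
  # Translate a glob into an anchored regular expression
  if (valuetype == "glob") pattern <- utils::glob2rx(pattern)
  grep(pattern, types, ignore.case = case_insensitive)
}

types <- c("A", "AA", "B", "BB", "B_B", "C", "C-C")
match_type_ids("a*", types, "glob")    # indices of "A" and "AA"
match_type_ids("^b", types, "regex")   # indices of "B", "BB" and "B_B"
match_type_ids("c", types, "fixed")    # index of "C" only
```

The real functions may resolve patterns differently internally; the sketch is only meant to show what "indices of matched types" means for each value type.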

case_insensitive

logical; if TRUE, ignore case when matching a pattern or dictionary values

concatenator

the concatenation character that joins multi-word expressions in types
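As a rough base R illustration of why the concatenator matters, a multi-word pattern such as "B B" can correspond both to a sequence of single-word types and to one compound type joined by the concatenator. The variable names below are made up for this sketch, which does not use quanteda itself.

```r
types <- c("A", "AA", "B", "BB", "B_B", "C", "C-C")
pattern <- "B B"
concatenator <- "_"

# Split the pattern on whitespace into its component words
words <- strsplit(pattern, " ", fixed = TRUE)[[1]]

# Candidate 1: every word is itself a known type, so the sequence can match
as_sequence <- if (all(words %in% types)) words else character(0)

# Candidate 2: the words joined by the concatenator form a single known type
joined <- paste(words, collapse = concatenator)
as_compound <- if (joined %in% types) joined else character(0)

as_sequence  # "B" "B"
as_compound  # "B_B"
```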

levels

integers specifying the levels of entries in a hierarchical dictionary that will be applied. The top level is 1, and subsequent levels describe lower nesting levels. Values may be combined, even if these levels are not contiguous, e.g. levels = c(1, 3) will collapse the second level into the first, but record the third level (if present) collapsed below the first (see examples).
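The default levels = 1 can be pictured, under the assumption that all nested values are pooled under their top-level key, with a plain nested list. The list `nested` is a hypothetical stand-in for a hierarchical dictionary and the code is base R only.

```r
# Hypothetical nested dictionary, written as a plain list for illustration
nested <- list(
  A = list(A1 = c("a", "aa"), A2 = c("aaa")),
  B = list(B1 = c("bb"))
)

# Rough analogue of levels = 1: pool every nested value under its top-level key
flat <- lapply(nested, function(key) unlist(key, use.names = FALSE))

flat$A  # "a" "aa" "aaa"
flat$B  # "bb"
```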

remove_unigram

logical; if TRUE, ignore single-word patterns

keep_nomatch

logical; if TRUE, keep patterns that did not match any type

Value

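object2id() returns a list of integer vectors containing indices of matched types. object2fixed() returns a list of character vectors containing the matched types themselves, as in the examples below.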
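The relationship between a list of indices and the types they point to can be sketched in base R. The index list here is built by hand for illustration and is only assumed to have the same shape as the documented return value; it is not produced by quanteda. The last step mimics dropping unmatched patterns, which is how keep_nomatch = FALSE is understood here.

```r
types <- c("A", "AA", "B", "BB", "B_B", "C", "C-C")

# Hypothetical index list, built with a case-insensitive exact comparison
patterns <- c("a", "aa", "zz", "bb")
ids <- lapply(patterns, function(p) which(tolower(types) == p))
names(ids) <- patterns

# Look the indices up in `types` to recover the matched types themselves
fixed <- lapply(ids, function(i) types[i])

# Drop patterns with no match, mirroring keep_nomatch = FALSE
fixed_matched <- fixed[lengths(fixed) > 0]

names(fixed_matched)  # "a" "aa" "bb"
fixed$zz              # character(0)
```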

Examples

library(quanteda)

# token types against which the patterns below are matched
types <- c("A", "AA", "B", "BB", "B_B", "C", "C-C")

# dictionary
dict <- dictionary(list(A = c("a", "aa"), 
                        B = c("BB", "B B"),
                        C = c("C", "C-C")))
object2fixed(dict, types)
#> $A
#> [1] "A"
#> 
#> $A
#> [1] "AA"
#> 
#> $B
#> [1] "BB"
#> 
#> $B
#> [1] "B" "B"
#> 
#> $B
#> [1] "B_B"
#> 
#> $C
#> [1] "C"
#> 
#> $C
#> [1] "C-C"
#> 
object2fixed(dict, types, remove_unigram = TRUE)
#> $B
#> [1] "B" "B"
#> 

# phrase
pats <- phrase(c("a", "aa", "zz", "bb", "b b"))
object2fixed(pats, types)
#> $a
#> [1] "A"
#> 
#> $aa
#> [1] "AA"
#> 
#> $bb
#> [1] "BB"
#> 
#> $`b b`
#> [1] "B" "B"
#> 
object2fixed(pats, types, keep_nomatch = TRUE)
#> $a
#> [1] "A"
#> 
#> $aa
#> [1] "AA"
#> 
#> $zz
#> character(0)
#> 
#> $bb
#> [1] "BB"
#> 
#> $`b b`
#> [1] "B" "B"
#>