Substitute token types based on vectorized one-to-one matching. This
function was created for lemmatization and user-defined stemming. It supports
substitution of multi-word features by multi-word features, but substitution
is fastest when pattern and replacement are character vectors and
valuetype = "fixed", because in that case the function only substitutes the
types of tokens. To replace dictionary values, use tokens_lookup() with
exclusive = FALSE instead.
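
The contrast between the two functions can be sketched as follows. This is a minimal illustration rather than an excerpt from the package documentation: the sample sentence, the object names (`toks`, `words`, `dict`) and the dictionary key `tax` are made up for the example, and `capkeys = FALSE` is set explicitly so that the dictionary key is inserted as written.

```r
library(quanteda)

# A tiny made-up text; "Taxing" is capitalised on purpose
toks <- tokens("Taxing the taxed is a form of taxation")

# One-to-one type substitution: each pattern maps to the replacement
# at the same position. Character vectors plus valuetype = "fixed"
# is the fast path described above.
words <- c("taxing", "taxed", "taxation")
toks_rep <- tokens_replace(toks,
                           pattern = words,
                           replacement = rep("tax", length(words)),
                           valuetype = "fixed")

# Dictionary-based alternative: matched values are replaced by their key,
# and exclusive = FALSE keeps the tokens that did not match.
dict <- dictionary(list(tax = c("taxing", "taxed", "taxation")))
toks_look <- tokens_lookup(toks, dict, exclusive = FALSE, capkeys = FALSE)

print(toks_rep)
print(toks_look)
```

With `tokens_replace()` the mapping is positional, so `pattern` and `replacement` must have the same length; with `tokens_lookup()` the mapping is carried by the dictionary structure.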
tokens_replace(x, pattern, replacement, valuetype = "glob", case_insensitive = TRUE, verbose = quanteda_options("verbose"))
x | tokens object whose token elements will be replaced |
---|---|
pattern | a character vector or list of character vectors. See pattern for more details. |
replacement | a character vector or (if |
valuetype | the type of pattern matching: |
case_insensitive | logical; if |
verbose | print status messages if |
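
As a small illustration of how the `case_insensitive` argument changes matching, the sketch below runs the same fixed-pattern replacement twice. The sample text and the replacement word "levy" are invented for the example.

```r
library(quanteda)

# Made-up text with the same word in two different cases
toks <- tokens("Tax cuts and tax hikes")

# Default (case_insensitive = TRUE): "Tax" and "tax" are both replaced
toks_ci <- tokens_replace(toks, pattern = "tax", replacement = "levy",
                          valuetype = "fixed")

# case_insensitive = FALSE: only the exact lower-case type "tax" matches
toks_cs <- tokens_replace(toks, pattern = "tax", replacement = "levy",
                          valuetype = "fixed", case_insensitive = FALSE)

print(toks_ci)
print(toks_cs)
```

Case-sensitive matching is what the stemming example on this page relies on: patterns taken from `types()` already carry the exact case of each type, so matching them case-insensitively could map differently cased types onto the wrong stem.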
See also: tokens_lookup()
    toks1 <- tokens(data_corpus_inaugural, remove_punct = TRUE)

    # lemmatization
    taxwords <- c("tax", "taxing", "taxed", "taxed", "taxation")
    lemma <- rep("TAX", length(taxwords))
    toks2 <- tokens_replace(toks1, taxwords, lemma, valuetype = "fixed")
    kwic(toks2, "TAX") %>% tail(10)
    #> Keyword-in-context with 10 matches.
    #>  [1925-Coolidge, 3004] a living we must have | TAX | reform The method of raising
    #>  [1925-Coolidge, 3116] correct course to follow in | TAX | and all other economic legislation
    #>  [1981-Reagan, 273] for their labor by a | TAX | system which penalizes successful achievement
    #>  [1981-Reagan, 290] productivity But great as our | TAX | burden is it has not
    #>  [1981-Reagan, 1521] and to lighten our punitive | TAX | burden And these will be
    #>  [1985-Reagan, 496] were right to believe that | TAX | rates have been reduced inflation
    #>  [1985-Reagan, 1106] lives We must simplify our | TAX | system make it more fair
    #>  [1985-Reagan, 1418] permanently control Government's power to | TAX | and spend We must act
    #>  [1985-Reagan, 1438] spend its citizens money and | TAX | them into servitude when the
    #>  [2013-Obama, 739] remake our government revamp our | TAX | Code reform our schools and

    # stemming
    type <- types(toks1)
    stem <- char_wordstem(type, "porter")
    toks3 <- tokens_replace(toks1, type, stem, valuetype = "fixed", case_insensitive = FALSE)
    identical(toks3, tokens_wordstem(toks1, "porter"))
    #> [1] TRUE

    # multi-multi substitution
    toks4 <- tokens_replace(toks1, phrase(c("Supreme Court")),
                            phrase(c("Supreme Court of the United States")))
    kwic(toks4, phrase(c("Supreme Court of the United States")))
    #> Keyword-in-context with 4 matches.
    #>  [1857-Buchanan, 441:446] which legitimately belongs to the | Supreme Court of the United States | of the United States before
    #>  [1861-Lincoln, 2323:2328] to be decided by the | Supreme Court of the United States | nor do I deny that
    #>  [1861-Lincoln, 2465:2470] fixed by decisions of the | Supreme Court of the United States | the instant they are made
    #>  [1889-Harrison, 408:413] by the organization of the | Supreme Court of the United States | shall have been suitably observed