A reliable string processing toolkit is a must-have for any data scientist.
A new release of the stringi package is available on CRAN (please wait a few days for Windows and OS X binary builds). As of now, about 850 CRAN packages depend (either directly or recursively) on stringi. Quite recently, the package was also listed among the most downloaded R extensions.
# install.packages("stringi") or update.packages()
library("stringi")
stri_info(TRUE)
## [1] "stringi_0.5.2; en_US.UTF-8; ICU4C 55.1; Unicode 7.0"
apkg <- available.packages(contriburl="http://cran.rstudio.com/src/contrib")
length(tools::dependsOnPkgs('stringi', installed=apkg, recursive=TRUE))
## [1] 845
If you compile stringi from sources (mostly Linux users), refer to the INSTALL file for more details.
Here’s a list of changes in version 0.5-2. There are many major (like date and time processing) and minor new features and enhancements, as well as bug fixes. In the current release we also focused on giving users of the stringr package an even better string processing experience, since as of its 1.0.0 release it is powered by stringi.
[BACKWARD INCOMPATIBILITY] The second argument to stri_pad_*() has been renamed width.
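The rename only matters for calls that pass the second argument by name; positional calls are unaffected. A minimal sketch, assuming stringi 0.5-2 or later is installed (the strings and widths are arbitrary illustration values):

```r
library("stringi")

# The target field width is now passed as `width` (the second argument).
stri_pad_left("abc", width = 6)               # pad with spaces on the left
stri_pad_right("abc", width = 6, pad = "_")   # pad with underscores on the right

# Positional use does not depend on the argument's name.
stri_pad_both("abc", 7)
```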
[GENERAL] #69: stringi is now bundled with ICU4C 55.1.
[NEW FUNCTIONS] Date and time processing functions (the API may change in future stringi releases; any comments are welcome):
stri_timezone_list() - lists all known time zone identifiers
sample(stri_timezone_list(), 10)
## [1] "Etc/GMT+12" "Antarctica/Macquarie"
## [3] "Atlantic/Faroe" "Antarctica/Troll"
## [5] "America/Fort_Wayne" "PLT"
## [7] "America/Goose_Bay" "America/Argentina/Catamarca"
## [9] "Africa/Juba" "Africa/Bissau"
stri_timezone_set(), stri_timezone_get() - manage the current default time zone
stri_timezone_info() - basic information on a given time zone
str(stri_timezone_info('Europe/Warsaw'))
## List of 6
## $ ID : chr "Europe/Warsaw"
## $ Name : chr "Central European Standard Time"
## $ Name.Daylight : chr "Central European Summer Time"
## $ Name.Windows : chr "Central European Standard Time"
## $ RawOffset : num 1
## $ UsesDaylightTime: logi TRUE
stri_timezone_info('Europe/Warsaw', locale='de_DE')$Name
## [1] "Mitteleuropäische Normalzeit"
stri_datetime_symbols() - localizable date-time formatting data
stri_datetime_symbols()
## $Month
## [1] "January" "February" "March" "April" "May"
## [6] "June" "July" "August" "September" "October"
## [11] "November" "December"
##
## $Weekday
## [1] "Sunday" "Monday" "Tuesday" "Wednesday" "Thursday" "Friday"
## [7] "Saturday"
##
## $Quarter
## [1] "1st quarter" "2nd quarter" "3rd quarter" "4th quarter"
##
## $AmPm
## [1] "AM" "PM"
##
## $Era
## [1] "Before Christ" "Anno Domini"
stri_datetime_symbols("th_TH_TRADITIONAL")$Month
## [1] "มกราคม" "กุมภาพันธ์" "มีนาคม" "เมษายน" "พฤษภาคม" "มิถุนายน" "กรกฎาคม"
## [8] "สิงหาคม" "กันยายน" "ตุลาคม" "พฤศจิกายน" "ธันวาคม"
stri_datetime_symbols("he_IL@calendar=hebrew")$Month
## [1] "תשרי" "חשון" "כסלו" "טבת" "שבט" "אדר א׳" "אדר"
## [8] "ניסן" "אייר" "סיון" "תמוז" "אב" "אלול" "אדר ב׳"
stri_datetime_now() - return the current date-time
stri_datetime_fstr() - convert a strptime-like format string to an ICU date/time format string
stri_datetime_format() - convert date/time to string
stri_datetime_format(stri_datetime_now(), "datetime_relative_medium")
## [1] "today, 6:21:45 PM"
stri_datetime_parse() - convert string to date/time object
stri_datetime_parse(c("2015-02-28", "2015-02-29"), "yyyy-MM-dd")
## [1] "2015-02-28 18:21:45 CET" NA
stri_datetime_parse(c("2015-02-28", "2015-02-29"), stri_datetime_fstr("%Y-%m-%d"))
## [1] "2015-02-28 18:21:45 CET" NA
stri_datetime_parse(c("2015-02-28", "2015-02-29"), "yyyy-MM-dd", lenient=TRUE)
## [1] "2015-02-28 18:21:45 CET" "2015-03-01 18:21:45 CET"
stri_datetime_parse("19 lipca 2015", "date_long", locale="pl_PL")
## [1] "2015-07-19 18:21:45 CEST"
stri_datetime_create() - construct date-time objects from numeric representations
stri_datetime_create(2015, 12, 31, 23, 59, 59.999)
## [1] "2015-12-31 23:59:59 CET"
stri_datetime_create(5775, 8, 1, locale="@calendar=hebrew") # 1 Nisan 5775 -> 2015-03-21
## [1] "2015-03-21 12:00:00 CET"
stri_datetime_create(2015, 02, 29)
## [1] NA
stri_datetime_create(2015, 02, 29, lenient=TRUE)
## [1] "2015-03-01 12:00:00 CET"
stri_datetime_fields() - get values for date-time fields
stri_datetime_fields(stri_datetime_now())
## Year Month Day Hour Minute Second Millisecond WeekOfYear WeekOfMonth
## 1 2015 6 23 18 21 45 52 26 4
## DayOfYear DayOfWeek Hour12 AmPm Era
## 1 174 3 6 2 2
stri_datetime_fields(stri_datetime_now(), locale="@calendar=hebrew")
## Year Month Day Hour Minute Second Millisecond WeekOfYear WeekOfMonth
## 1 5775 11 6 18 21 45 56 40 2
## DayOfYear DayOfWeek Hour12 AmPm Era
## 1 272 3 6 2 1
stri_datetime_symbols(locale="@calendar=hebrew")$Month[
stri_datetime_fields(stri_datetime_now(), locale="@calendar=hebrew")$Month
]
## [1] "Tamuz"
stri_datetime_add() - add a specific number of date-time units to a date-time object
x <- stri_datetime_create(2015, 12, 31, 23, 59, 59.999)
stri_datetime_add(x, units="months") <- 2
print(x)
## [1] "2016-02-29 23:59:59 CET"
stri_datetime_add(x, -2, units="months")
## [1] "2015-12-29 23:59:59 CET"
[NEW FUNCTIONS] stri_extract_*_boundaries() extract text between text boundaries.
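As a rough illustration, the sketch below extracts the chunks of text lying between word boundaries. It assumes the break iterator type is selected via stri_opts_brkiter(), as in the other boundary-analysis functions; the sample string is made up for the example.

```r
library("stringi")

x <- "stringi: THE string processing package"

# Extract every chunk of text lying between consecutive word boundaries.
# The result is a list with one character vector per input string.
pieces <- stri_extract_all_boundaries(
   x, opts_brkiter = stri_opts_brkiter(type = "word")
)[[1]]
print(pieces)

# With no skip rules set, the chunks should cover the whole string,
# so gluing them back together is expected to reproduce the input.
stri_flatten(pieces) == x
```

Other iterator types ("character", "line_break", "sentence") can be chosen in the same way.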
[NEW FUNCTION] #46: stri_trans_char() is a stringi-flavoured chartr() equivalent.
stri_trans_char("id.123", ".", "_")
## [1] "id_123"
stri_trans_char("babaab", "ab", "01")
## [1] "101001"
stri_width() approximates the width of a string in a more Unicodish fashion than nchar(..., "width"):
stri_width(LETTERS[1:5])
## [1] 1 1 1 1 1
nchar(stri_trans_nfkd("\u0105"), "width") # provides incorrect information
## [1] 0
stri_width(stri_trans_nfkd("\u0105")) # A and ogonek (width = 1)
## [1] 1
stri_width( # Full-width equivalents of ASCII characters:
stri_enc_fromutf32(as.list(c(0x3000, 0xFF01:0xFF5E)))
)
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
stri_pad() and stri_wrap() are now by default based on code point widths instead of the number of code points. Moreover, by default stri_wrap() no longer gets rid of non-breaking, zero-width, etc. spaces.
x <- stri_flatten(c(
stri_dup(LETTERS, 2),
stri_enc_fromutf32(as.list(0xFF21:0xFF3a))
), collapse=' ')
# Note that your web browser may have problems with properly aligning
# this (try it in RStudio)
cat(stri_wrap(x, 11), sep='\n')
## AA BB CC DD
## EE FF GG HH
## II JJ KK LL
## MM NN OO PP
## QQ RR SS TT
## UU VV WW XX
## YY ZZ Ａ Ｂ
## Ｃ Ｄ Ｅ Ｆ
## Ｇ Ｈ Ｉ Ｊ
## Ｋ Ｌ Ｍ Ｎ
## Ｏ Ｐ Ｑ Ｒ
## Ｓ Ｔ Ｕ Ｖ
## Ｗ Ｘ Ｙ Ｚ
[NEW FEATURE] #133: stri_wrap() silently allows for width <= 0 (for compatibility with strwrap()).
[NEW FEATURE] #139: stri_wrap() gained a new argument: whitespace_only.
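A small sketch of the new argument. It assumes that whitespace_only = TRUE restricts line breaks to white space, whereas the default lets ICU's line break iterator choose break opportunities (which may include, for instance, positions after hyphens). The sample text and width are arbitrary.

```r
library("stringi")

x <- "a well-known state-of-the-art string-processing toolkit"

# Default: break opportunities are determined by ICU's line break iterator.
cat(stri_wrap(x, width = 12), sep = "\n")

# Break lines at white space only; hyphenated words stay in one piece.
res <- stri_wrap(x, width = 12, whitespace_only = TRUE)
cat(res, sep = "\n")
```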
[GENERAL] #144: Performance improvements in handling ASCII strings (these affect stri_sub(), stri_locate() and other string index-based operations).
[GENERAL] #143: Searching for short fixed patterns (stri_*_fixed()) now relies on the current libC’s implementation of strchr() and strstr(). This is very fast, e.g., on glibc utilizing the SSE2/3/4 instruction set.
x <- stri_rand_strings(100, 10000, "[actg]")
microbenchmark::microbenchmark(
stri_detect_fixed(x, "acgtgaa"),
grepl("actggact", x),
grepl("actggact", x, perl=TRUE),
grepl("actggact", x, fixed=TRUE)
)
## Unit: microseconds
## expr min lq mean
## stri_detect_fixed(x, "acgtgaa") 349.153 354.181 381.2391
## grepl("actggact", x) 14017.923 14181.416 14457.3996
## grepl("actggact", x, perl = TRUE) 8280.282 8367.426 8516.0124
## grepl("actggact", x, fixed = TRUE) 3599.200 3637.373 3726.6020
## median uq max neval cld
## 362.7515 391.0655 681.267 100 a
## 14292.2815 14594.4970 15736.535 100 d
## 8463.4490 8570.0080 9564.503 100 c
## 3686.6690 3753.4060 4402.397 100 b
[GENERAL] #141: a local copy of icudt*.zip may be used on package install; see the INSTALL file for more information.
[GENERAL] #165: the ./configure option --disable-icu-bundle forces the use of system ICU when building the package.
[BUGFIX] locale specifiers are now normalized in a more intelligent way: e.g., @calendar=gregorian expands to DEFAULT_LOCALE@calendar=gregorian.
[BUGFIX] #134: stri_extract_all_words() did not accept simplify=NA.
[BUGFIX] #132: incorrect behavior in stri_locate_regex() for zero-length matches.
[BUGFIX] stringr/#73: stri_wrap() returned CHARSXP instead of STRSXP on empty string input with the simplify=FALSE argument.
[BUGFIX] #164: libicu-dev usage used to fail on Ubuntu.
[BUGFIX] #135: C++11 is now used by default (see the INSTALL file, however) to build stringi from sources. This is because ICU4C uses the long long type, which is not part of the C++98 standard.
[BUGFIX] #154: Dates and other objects with a custom class attribute were not coerced to the character type correctly.
[BUGFIX] #168: Build now fails if icudt is not available.
[BUGFIX] Force ICU u_init() call on stringi dynlib load.
[BUGFIX] #157: many overfull hboxes in the package PDF manual have been corrected.
Enjoy! Any comments and suggestions are welcome.