stringi 0.2-2 (2014-04-19) Compatibility Tables: Computing the "length" of a string

Determining whether a given string is empty

Basic Functionality

base
nzchar() – returns a logical vector; determines whether a string is NOT empty

Note that missing values are not handled properly: NA is reported as a non-empty string.

!nzchar(c("", "not empty", NA))
## [1]  TRUE FALSE FALSE
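
If only base R is available, the NA issue shown above can be worked around by masking missing values explicitly. The following is a minimal sketch; the helper name is_empty_na is hypothetical and not part of base R, stringr, or stringi:

```r
# Hypothetical helper (not part of any package):
# behaves like !nzchar(), but propagates missing values.
is_empty_na <- function(x) {
  x <- as.character(x)
  out <- !nzchar(x)      # TRUE for "", FALSE otherwise (NA wrongly yields FALSE)
  out[is.na(x)] <- NA    # restore the missing values
  out
}

is_empty_na(c("", "not empty", NA))
## [1]  TRUE FALSE    NA
```
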
stringr
(not available directly)

As a workaround, you may compare the result of str_length() with 0. Keep in mind that this is slow, especially for UTF-8-encoded strings:

(str_length(c("", "not empty", NA)) == 0)
## [1]  TRUE FALSE    NA
stringi
stri_isempty() – handles missing values properly.
stri_isempty(c("", "not empty", NA))
## [1]  TRUE FALSE    NA

Performance comparison

# packages needed to run the comparisons
library(microbenchmark)
library(stringr)
library(stringi)
test1 <- rep(c("", "not empty", NA), 100)
microbenchmark(nzchar(test1), str_length(test1) == 0, stri_isempty(test1))
## Unit: nanoseconds
##                    expr   min      lq  median      uq    max neval
##           nzchar(test1)   817   951.5  1103.5  1214.5   3492   100
##  str_length(test1) == 0 54940 56892.5 58188.5 61168.5 174070   100
##     stri_isempty(test1)  2284  2882.5  3547.0  4477.5  21730   100

Calculate the number of code points (characters) in a string

Basic Functionality

base
nchar() – does not handle NAs properly
nchar(c("ąśćźół", "abc", NA, ""))
## [1] 6 3 2 0
stringr
str_length() – handles NAs properly
str_length(c("ąśćźół", "abc", NA, ""))
## [1]  6  3 NA  0
stringi
stri_length() – handles NAs properly
stri_length(c("ąśćźół", "abc", NA, ""))
## [1]  6  3 NA  0

General Remark

If a given string is in UTF-8 and has not been properly Unicode-normalized (e.g. by stri_trans_nfc), the returned number may sometimes be misleading.
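
To illustrate the normalization caveat, the sketch below builds the letter ą in decomposed form (a base letter followed by a combining ogonek) and compares the code point counts before and after stri_trans_nfc(). The variable names are only illustrative, and the string is written with \u escapes so that the example does not depend on the encoding of the source file:

```r
library(stringi)

decomposed <- "a\u0328"                   # 'a' followed by U+0328 COMBINING OGONEK
composed   <- stri_trans_nfc(decomposed)  # NFC merges the pair into U+0105

stri_length(decomposed)  # 2 code points, although one character is displayed
stri_length(composed)    # 1 code point
```

Both strings render as the same letter, so normalizing before counting gives the number a reader would expect.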

Determining the length of an 8-bit-encoded string is O(1), as it is the same as the number of bytes in the string; for a UTF-8 string it has linear time complexity.

Performance comparison

test1 <- rep(c("ąśćźół", "abc", NA, ""), 100)          # first string is in UTF-8
microbenchmark(nchar(test1), str_length(test1), stri_length(test1))
## Unit: microseconds
##                expr    min      lq  median      uq     max neval
##        nchar(test1) 36.749 37.7550 38.0050 38.2800  77.063   100
##   str_length(test1) 66.064 67.9150 69.3035 70.2555 125.020   100
##  stri_length(test1)  6.322  6.9685  7.4345  8.5595  31.301   100

Calculate the number of bytes in a string

Basic Functionality

base
nchar() with argument type='bytes' – does not handle NAs properly
nchar(c("ąśćźół", "abc", NA, ""), type='bytes')
## [1] 12  3  2  0
stringr
(none)
stringi
stri_numbytes() – handles missing values properly.
stri_numbytes(c("ąśćźół", "abc", NA, ""))
## [1] 12  3 NA  0

General Remark

This group of functions counts the number of bytes needed to store all the characters of each string in the computer's memory. These are not the functions you would normally use in your string processing activities; use stri_length() instead.
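
The difference between bytes and code points can be checked with base R alone. The sketch below is only an illustration: it spells the example string with \u escapes, so that R marks it as UTF-8 regardless of the encoding of the source file, and compares both counts:

```r
# "ąśćźół" written with \u escapes, hence stored as UTF-8
x <- "\u0105\u015b\u0107\u017a\u00f3\u0142"

nchar(x, type = "chars")  # 6 code points
nchar(x, type = "bytes")  # 12 bytes: each of these letters takes 2 bytes in UTF-8
length(charToRaw(x))      # 12, counting the raw bytes directly
```
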

Performance comparison

test1 <- rep(c("ąśćźół", "abc", NA, ""), 100)        # first string is in UTF-8
microbenchmark(nchar(test1, type='bytes'), stri_numbytes(test1))
## Unit: microseconds
##                          expr   min     lq median     uq    max neval
##  nchar(test1, type = "bytes") 9.564 9.7995  9.999 10.286 33.683   100
##          stri_numbytes(test1) 2.531 2.7240  2.894  3.117 24.026   100

Count the width of characters in a string

Basic Functionality

base
nchar() with argument type='width' – does not handle NAs properly. It returns the estimated number of columns that the cat() function will use to print the string in a monospaced font, or the same value as type='chars' if this cannot be calculated.

The R manual does not state how the numbers are determined.

nchar(c("gryzeldą", "", NA,
   "持続可能な統計環境"), type='width')
## [1]  8  0  2 18
stringr
(none)
stringi
(not available yet)

TO DO: add stri_width().