Fuzzy matching and regular expressions in R

Learning to code with regular expressions is challenging at first because the format is not intuitive. I find myself looking up the same rules repeatedly for that reason. But learning regular expressions is highly valuable. Here are some notes on regular expressions in R, including approximate or fuzzy matching.

The easy functions include sub, grep, and substr. sub replaces one text fragment with another:

strings = c('abc', 'grownup', 'schaal', 'caaks', 'baaks')
fragment = 'aa'
replacement = 'oo'
sub(fragment, replacement, strings)
[1] "abc"     "grownup" "school"  "cooks"   "books"

If the fragment appears more than once in the same string, sub will only replace its first appearance. Use gsub to replace them all.
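To make the difference between sub and gsub concrete, here is a small sketch. The word 'banana' is just an illustrative input (it is not one of the strings used above); it contains the fragment 'an' twice.

```r
# 'banana' contains the fragment 'an' twice
word <- 'banana'

# sub() replaces only the first occurrence
first_only <- sub('an', 'AN', word)
print(first_only)
# [1] "bANana"

# gsub() replaces every occurrence
all_of_them <- gsub('an', 'AN', word)
print(all_of_them)
# [1] "bANANa"
```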

substr will extract a portion of the input string, such as the first three characters:

substr(strings, 1, 3)
[1] "abc" "gro" "sch" "caa" "baa"

grep tells you which elements in a vector match a particular text fragment:

grep(fragment, strings)
[1] 3 4 5

grepl is the same as grep except it returns TRUE or FALSE for each element in the vector:

grepl(fragment, strings)
[1] FALSE FALSE  TRUE  TRUE  TRUE

Those examples use plain text fragments...but the word 'regular' has multiple meanings. A 'regular expression' is a string that describes the pattern you are looking for, which often differs from the literal text. I won't go into detail, but for those who are unfamiliar with them, here are some examples:

# ^a means 'starts with a' 
grep('^a', strings)
[1] 1

# s$ means 'ends with s' (notice the order: the $ comes last)
grep('s$', strings)
[1] 4 5

# x|y means 'either x or y'
grep('abc|up', strings)
[1] 1 2

# [0-9] means 'any digit from zero to nine'
grep('[0-9]', c('characs', 'chars33srahc'))
[1] 2

These are nice if you want to subset a vector, list, or data.frame by index:

idx = grep('s$', strings)
strings[ idx ]
[1] "caaks" "baaks"
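The same index works for the rows of a data.frame. The sketch below assumes a small data.frame built from the same strings; the names df and ends_with_s are only for illustration.

```r
strings <- c('abc', 'grownup', 'schaal', 'caaks', 'baaks')
df <- data.frame(id = seq_along(strings), str = strings)

# row numbers where 'str' ends with s
idx <- grep('s$', df$str)

# keep only those rows (note the comma: rows first, then columns)
ends_with_s <- df[idx, ]
print(ends_with_s)
#   id   str
# 4  4 caaks
# 5  5 baaks
```

Using grepl in place of grep here, as in df[grepl('s$', df$str), ], selects the same rows with a logical index.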

It's also valuable to know about paste-collapse:

# collapse a character vector into a single string
vec <- c('abc', 'up')
rx <- paste(vec, collapse = '|')
print(rx)
[1] "abc|up"

# then you can do this
grep(rx, strings)
[1] 1 2

R can also do approximate matching, or fuzzy matching. (I was unaware until recently; this is the motivation for this post.) When drawing data from multiple sources, you often encounter different spellings of the same names (which are not always misspellings). To decide if two strings approximately match, you can count the differences between them. For example, 'connor' and 'conner' (which is a misspelling) differ by one character.

# none of the strings contain 'abd'
grep('abd', strings)
integer(0)

# but the first is pretty close to it
agrep('abd', strings)
[1] 1

# the canonical example
agrep('connor', c('jane', 'jon', 'Conner'), ignore.case = TRUE)
[1] 3

You can control the level of tolerance using the max.distance argument (read the docs). According to the documentation, agrep uses the generalized Levenshtein edit distance: "the minimal possibly weighted number of insertions, deletions and substitutions needed to transform one string into another."
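To see the edit distances themselves, one option is adist, a function in the utils package that ships with R (it is not used elsewhere in this post). The sketch below computes the distance from 'connor' to a few made-up candidate spellings, then shows how an integer max.distance changes which of them agrep accepts. Treat it as an illustration under those assumptions and check the documentation for the finer points of max.distance.

```r
# made-up candidate spellings, for illustration only
candidates <- c('conner', 'conor', 'cannar')

# generalized Levenshtein distance from 'connor' to each candidate
d <- adist('connor', candidates)
print(d)
#      [,1] [,2] [,3]
# [1,]    1    1    2

# allow at most one edit: 'cannar' is too far away
one_edit <- agrep('connor', candidates, max.distance = 1)

# allow up to two edits: all three candidates match
two_edits <- agrep('connor', candidates, max.distance = 2)
```

A max.distance below 1 is read as a fraction of the pattern length, which is why the default of 0.1 still tolerates one edit for short patterns like these.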

There is one common task that base R does not have a single function for. Say you want to identify all strings that match a given regular expression, and instead of merely identifying which strings they are, you want to pull out the fragment that matches. For example,

# desired behavior
pull_out('a|b', 'b')
 [1] 'b'

We can easily do this using base R; we just need to combine two functions like so (from this StackOverflow answer):

# match a or b
txt <- "aaa12xxx"
regmatches(txt, regexpr("a|b", txt))
[1] "a"

# match a or b, return all matches
regmatches(txt, gregexpr("a|b", txt))
[[1]]
[1] "a" "a" "a"

# match 'a, repeated one or more times in a row'
regmatches(txt, gregexpr("a+|b", txt))
[[1]]
[1] "aaa"

This is nice but it becomes a problem if you want to use it with a vector (and R is excellent with vectorization).

regmatches(strings, regexpr('a|b', strings))
[1] "a" "a" "a" "b"

Why is that a problem? The result has length four, but 'strings' has five elements. That means you cannot use it to add or modify a column of a data.frame, like this:

# create a data.frame 
df <- data.frame(id = 1:length(strings), str = strings)

# add a new column using a regular expression match
transform(df, new_column = regmatches(strings, regexpr('a|b', strings)))
Error in data.frame(list(id = 1:5, str = c("abc", "grownup", "schaal",  : 
  arguments imply differing number of rows: 5, 4

Above, I tried to create a new column based on the regular expression match, but this (thankfully) produced an error because the function output was the wrong size.

It would be great if that would work because the alternative would be to use a table join, which requires more code and would be harder to review for mistakes.
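For comparison, the table-join route might look something like the sketch below. This is only one way to do it; the names m, matched, and joined are illustrative, and merge with all.x = TRUE is used as the left join.

```r
strings <- c('abc', 'grownup', 'schaal', 'caaks', 'baaks')
df <- data.frame(id = seq_along(strings), str = strings)

# position of the first match in each string (-1 where there is none)
m <- regexpr('a|b', strings)

# lookup table holding only the rows that matched
matched <- data.frame(id = which(m != -1),
                      new_column = regmatches(strings, m))

# left join: rows without a match get NA in new_column
joined <- merge(df, matched, by = 'id', all.x = TRUE)
print(joined)
#   id     str new_column
# 1  1     abc          a
# 2  2 grownup       <NA>
# 3  3  schaal          a
# 4  4   caaks          a
# 5  5   baaks          b
```

It works, but it takes three steps and an id column to carry the alignment, which is the extra code and review burden mentioned above.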

The 'stringr' package has a function for this, intuitively named str_extract. But I couldn't use that because str_extract replaces non-matches with NA and I needed to control what was returned for non-matching strings. (More importantly, I am stubborn and decided to stop using 'tidyverse' altogether after getting burned by some 'deprecations' and after realizing that it's mostly superfluous, and anyways data.table is so much better.)

So I wrote this grab function. It uses the composite base R function above but it is safe for vector replacements. It's an alternative to str_extract with some control over the return value.

#' Extract regular expression matches from a string; safe for vectors and vector replacements
#' @param regex A regular expression to match (a single character string).
#' @param text A character string, possibly a vector of strings, in which to search for 'regex'.
#' @param fill_rule What to return when there is no match: NA ('na'), the original string ('self'), or nothing ('drop'). With 'drop', the return vector may be shorter than the input text (which is not safe for replacing values in a vector).
#'
#' @source
#' See also: https://stackoverflow.com/questions/2192316/extract-a-regular-expression-match
grab <- function(regex, text, fill_rule = c('na', 'self', 'drop')) {
    fill_rule <- match.arg(fill_rule)
    stopifnot(inherits(regex, "character"))
    stopifnot(length(regex) == 1)
    stopifnot(inherits(text, "character"))
    # position of the first match in each string (-1 if there is no match)
    mx <- regexpr(regex, text)
    # TRUE where a match was found; missing strings count as non-matches
    found <- !is.na(mx) & mx != -1
    # 'drop': return only the matched fragments
    if (fill_rule == 'drop')
        return(regmatches(text, mx))
    # 'na': blank out the non-matches; 'self': leave them as they are
    if (fill_rule == 'na')
        text[ !found ] <- NA
    text[ found ] <- regmatches(text, mx)
    return(text)
}

df <- data.frame(id = 1:length(strings), str = strings)
transform(df, str_fragment = grab('a|b', strings, fill_rule = 'self'))
  id     str str_fragment
1  1     abc            a
2  2 grownup      grownup
3  3  schaal            a
4  4   caaks            a
5  5   baaks            b