Comparing Data

Techniques and tools to compare data in R

By Chi Kit Yeung in R Data Analysis Notes

August 5, 2022

Operators


Custom ‘NOT IN’ Operator

R has a built-in %in% operator that tests whether each value of one vector appears in another, similar to SQL’s IN operator. However, it has no built-in counterpart to SQL’s NOT IN, which is useful in some cases.

# Defining the operator

`%!in%` <- Negate(`%in%`)

Example:

fav_fruits <- c('apple', 'oranges', 'bananas', 'grape')
shopping_list <- c('bananas', 'persimmon', 'peach', 'apple', 'custard apple')

# Normal `%in%` operator use
fav_fruits[fav_fruits %in% shopping_list]
## [1] "apple"   "bananas"

Above, we can see that two of our favorite fruits are on the shopping list.

Next, using the custom-defined %!in% operator:

# Favorite fruits not being bought
fav_fruits[fav_fruits %!in% shopping_list]
## [1] "oranges" "grape"

# Not so favorite fruits being bought :(
shopping_list[shopping_list %!in% fav_fruits]
## [1] "persimmon"     "peach"         "custard apple"
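For reference, `Negate()` returns a function that calls the original function and applies `!` to its result, so the custom operator should behave the same as wrapping `%in%` in `!( )`. A minimal sketch to check that equivalence (the example vectors are redefined so the snippet runs on its own; `not_in_base` and `not_in_custom` are names made up for this illustration):

```r
# Redefine the example data so this snippet is self-contained
fav_fruits <- c('apple', 'oranges', 'bananas', 'grape')
shopping_list <- c('bananas', 'persimmon', 'peach', 'apple', 'custard apple')

# Custom operator, as defined earlier
`%!in%` <- Negate(`%in%`)

# Base R equivalent without a custom operator
not_in_base <- !(fav_fruits %in% shopping_list)
not_in_custom <- fav_fruits %!in% shopping_list

# Both approaches give the same logical vector
identical(not_in_base, not_in_custom)
## [1] TRUE
```

The parentheses are optional, since `%in%` binds more tightly than `!`, but they make the intent easier to read.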

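Intersect and SetDiff

Base R's set functions intersect() and setdiff() return the common and the differing values directly, without subsetting.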

intersect(fav_fruits, shopping_list)
## [1] "apple"   "bananas"
setdiff(fav_fruits, shopping_list)
## [1] "oranges" "grape"
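One difference worth knowing: the set functions treat their inputs as sets and drop duplicates, while subsetting with `%in%` keeps every matching element. A small sketch, assuming a hypothetical `basket` vector with a repeated value (not part of the original example):

```r
# Hypothetical vector with a repeated value, for illustration only
basket <- c('apple', 'apple', 'peach', 'grape')
fav_fruits <- c('apple', 'oranges', 'bananas', 'grape')

# Subsetting with %in% keeps every matching element, duplicates included
in_result <- basket[basket %in% fav_fruits]
in_result
## [1] "apple" "apple" "grape"

# intersect() treats its inputs as sets, so duplicates are dropped
set_result <- intersect(basket, fav_fruits)
set_result
## [1] "apple" "grape"

# setdiff() is not symmetric: the order of the arguments matters
setdiff(basket, fav_fruits)
## [1] "peach"
setdiff(fav_fruits, basket)
## [1] "oranges" "bananas"
```

If duplicates carry meaning (for example, repeated rows in a data set), `%in%` is probably the safer choice.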

Uncategorized

Regex

Regular expressions can be used to match values with the str_detect() function from the stringr package, which is loaded as part of the tidyverse.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
str_detect(fav_fruits, "a..le")
## [1]  TRUE FALSE FALSE FALSE
# Matching
fav_fruits[str_detect(fav_fruits, "a..le")]
## [1] "apple"
# Not matching
fav_fruits[!str_detect(fav_fruits, "a..le")]
## [1] "oranges" "bananas" "grape"
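Note that the pattern "a..le" is unanchored, so it can match anywhere inside a string, and each `.` stands for any single character. The sketch below uses base R's `grepl()` so it runs without extra packages; `str_detect()` should give the same results for these simple patterns. The variable names `unanchored` and `anchored` are made up for this illustration.

```r
shopping_list <- c('bananas', 'persimmon', 'peach', 'apple', 'custard apple')

# Unanchored: the pattern may match anywhere inside the string
unanchored <- grepl("a..le", shopping_list)
shopping_list[unanchored]
## [1] "apple"         "custard apple"

# Anchored with ^ and $: the whole string must match the pattern
anchored <- grepl("^a..le$", shopping_list)
shopping_list[anchored]
## [1] "apple"
```

Anchoring is worth considering whenever a partial match would count as a false positive.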

Unique

Getting the unique values of a vector using the unique() function.

# A lot of dupes here, I just want the unique values!
locales <- c("ar_SA", "ar_SA", "de_DE", "de_DE", "de_DE", "en_AU", "es_ES", "es_ES", "es_ES", "es_ES", "es_MX", "es_MX", "es_MX", "es_MX", "es_US", "es_US", "es_US", "es_US", "es_US", "es_US", "fr_FR", "he_IL", "he_IL", "it_IT", "it_IT", "it_IT", "nb_NO", "nb_NO", "ru_RU", "ru_RU", "ru_RU", "sv_SE", "sv_SE", "sv_SE", "tr_TR", "tr_TR", "tr_TR")

locales
##  [1] "ar_SA" "ar_SA" "de_DE" "de_DE" "de_DE" "en_AU" "es_ES" "es_ES" "es_ES"
## [10] "es_ES" "es_MX" "es_MX" "es_MX" "es_MX" "es_US" "es_US" "es_US" "es_US"
## [19] "es_US" "es_US" "fr_FR" "he_IL" "he_IL" "it_IT" "it_IT" "it_IT" "nb_NO"
## [28] "nb_NO" "ru_RU" "ru_RU" "ru_RU" "sv_SE" "sv_SE" "sv_SE" "tr_TR" "tr_TR"
## [37] "tr_TR"
unique(locales)
##  [1] "ar_SA" "de_DE" "en_AU" "es_ES" "es_MX" "es_US" "fr_FR" "he_IL" "it_IT"
## [10] "nb_NO" "ru_RU" "sv_SE" "tr_TR"
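Two related base R functions are often useful alongside `unique()`: `duplicated()` flags the repeats, and `table()` counts how often each value occurs. A minimal sketch on a small hypothetical sample (`locales_sample` is made up here so the output is easy to check by hand):

```r
# Small hypothetical sample so the output is easy to check by hand
locales_sample <- c("de_DE", "de_DE", "en_AU", "es_ES", "es_ES", "es_ES")

# duplicated() flags every repeat after the first occurrence
dup_flags <- duplicated(locales_sample)
dup_flags
## [1] FALSE  TRUE FALSE FALSE  TRUE  TRUE

# Dropping the flagged elements gives the same result as unique()
identical(locales_sample[!dup_flags], unique(locales_sample))
## [1] TRUE

# table() counts how often each value occurs
locale_counts <- table(locales_sample)
locale_counts[["es_ES"]]
## [1] 3
```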