Skip to content

Latest commit

 

History

History
226 lines (157 loc) · 10.9 KB

2-DateTime-RegEx.md

File metadata and controls

226 lines (157 loc) · 10.9 KB

Data Tidying: Dates, Time, and Regular Expressions

Drs. Sarangan Ravichandran and Randall Johnson

Table of Contents

  1. Dates and Time

    1.1 Base R

    1.2 lubridate

    1.3 Arithmetic

    1.4 Exercises

  2. Regular Expressions

    2.1 Definition

    2.2 Find and Replace

    2.3 Exercises

Dates and Time

Base R

Dates and times are stored in one of three formats in R:

  • Date - number of days since January 1, 1970
  • POSIXct - number of seconds since January 1, 1970
  • POSIXlt - a list containing information about the time (see Table 1)
Position Name Value
1 sec seconds
2 min minutes
3 hour hours
4 mday day of the month (1-31)
5 mon month of the year (0-11)
6 year years since 1900
7 wday day of the week (0: Sunday - 6: Saturday)
8 yday day of the year (0-365)
9 isdst daylight savings (Yes/No)
10 zone time zone
11 gmtoff seconds off of GMT

If you are interested, there are some examples in Examples.R comparing the format of these two variables, starting on line 60.

You can use the as.date() function to create a date variable in R, and you can also use it to convert strings into dates. In either case, you will want to be aware of how the text you give to as.Date() is formatted. According to the documentation (see ?as.Date), the default format is "YYYY-MM-DD" or "YYYY/MM/DD". If neither works and you didn't specify a different format, as.date() will return NA.

To specify formatting, see the values in Table 2. For example, the default formats as.Date() expects would be coded as %Y-%m-%d or %Y/%m/%d. If you have a date, "May 3, 2018", you would need to specify the format as %B %d, %Y

Code Value
%d Day of the month (integer, 01-31)
%m Month (integer, 01-12)
%b Month (abbreviation)
%B Month (full name)
%y Year (2 digit)
%Y Year (4 digit)
as.Date('1920-6-16')
## [1] "1920-06-16"
as.Date('2017/03/07')
## [1] "2017-03-07"
as.Date("May 3, 2018", format = "%B %d, %Y")
## [1] "2018-05-03"
## Get current Date and Time from the system 
Sys.Date() # returns a Date object
## [1] "2018-03-13"
Sys.time() # returns a POSIXct object
## [1] "2018-03-13 08:48:25 EST"

Formatting times is a little more complicated. For this task strftime() and strptime() are our friends.

  • strftime() returns a character string (or character vector of strings) representing the input.
  • strptime() returns POSIXlt formatted values represented by the input.

These function both have a default format of %Y-%m-%d %H:%M:%S if any element has a time component which is not midnight, and %Y-%m-%d otherwise (see ?strptime for the full list of codes). For example:

# get the current system time as a string
Sys.time() %>% strftime()
## [1] "2018-03-13 08:48:25"
# convert this string to POSIXlt format (assumes local time zone)
strptime("May 3, 2018 1:45 PM", format = "%B %d, %Y %H:%M %p")
## [1] "2018-05-03 01:45:00 EST"

lubridate

The lubridate package, which is part of the tidyverse, has some additional functions that will help us work with dates and time.

  • now() essentially does the same thing as Sys.time() by default, but you can optionally specify a different timezone.
  • second(), minute(), hour(), day(), year(), ... allow you to extract or set (change) the specified part of a POSIXlt formatted variable.
  • seconds(), minutes(), hours(), days(), years(), ... allow you to specify a period of time (e.g. days(5) is 5 days).

See line 85of Examples.R for some examples using the lubridate package.

Note: there seems to be a bug with R's handling of the default time zone settings on some systems (as of early 2018). If you happen to run into this error,

Error in (function (dt, year, month, yday, mday, wday, hour, minute, second,  : 
  Invalid timezone of input vector: ""

this bit of code will set things straight (or something similar, e.g. EDT or GMT):

Sys.setenv(TZ="EST")

Arithmetic

"If anyone drove a time machine, they would crash" - Hadley Wickham, author of the tidyverse

As Hadley Wickham points out here, arithmetic with dates isn't as straight forward as it could be. Take the following operation for example: January 31 + one month. There are at least three possible answers:

  • February 31 (doesn't exist)
  • March 4 (31 days after January 31st)
  • February 28 (or the 29th if it is a leap year)

Ug! To avoid abiguity, lubridate implements addition in a very specific way.

  • By default, adding time, increments the specified slot approriately in the data structure. For example, when you add a month, you simply increment the month slot by 1.
    • If you add a month to December 23, the month slot goes back to 1 and the year is incrmented by 1.
    • If you add a month to January 31, you get NA, because February 31 doesn't exist. If you want one of the other two options above, you need to craft your statement a little more carefully (e.g. by adding days(30) instead of months(1)).
  • Alternately, some shorcut functions allow the addition of a fixed period of time.
    • For example, dyears(1) is 365 days, even on a leap year, while years(1) increments the years slot by 1 year, without respect to leap years.

See line 108 of Examples.R for some examples of date arithmetic.

Excercises

Practice exercises for this section can be found in Exercsies.Rmd.

Regular Expressions

Define regular expressions

A regular expression (sometimes referred to as a regex) is a string of special charcaters defining search criteria for a pattern search of some text.

  • Was originally developed for PERL
  • Regular Expressions help us identify patterns in text
  • Cross-platform compatible (mostly)
  • Speed up calculations

There are three types of regular expression parts:

  • Anchors, used to specify a position on the line of text
  • Character Sets, used to match characters
  • Modifiers, used to specify how many times the previous character should be repeated
Type Regex Meaning
Anchors ^ Beginning of the line
$ End of the line
Character Sets [a-z] A lower case character
[A-Z] An upper case character
[0-9] An number character
. Any single character
[abc] Any letter in {a, b, c}
[^a] Any character that is not an 'a'
\e Escape
\f Form feed
\n New line
\r Carriage return
\t Tab
Modifiers ? 0 or 1 repetitions
* 0 or more repetitions
+ 1 or more repetitions
\{n\} exactly n repetitions
\{n,m\} at least n and not more than m repetitions

As you can see, some characters have special meaning in regular expressions. If you want to search for a '[' or a '.', you will need to escape the special meaning with a \ (e.g. \[ or \.). Note that \ is an escape character in R strings, so you will need to enter it as \\ in your strings (e.g. if you want to include \[ in your regular expression, you'll need to give R the string: "\[")

You can also search for multiple patterns at the same time by including the | character between regular expressions. For example, if you wanted to search for vcf files, you may want to use this regular expression to find both compressed and uncompressed files with the proper ending: vcf$|vcf.tar.gz$.

Check out line 121 of Examples.R for some examples of using the grep() and grepl() functions to search for IDs containing a specific pattern.

Find and replace

Sometimes we want to find and replace specific text in a string. For this, we generally will use the sub() or gsub() functions.

  • sub() replaces the first occurance of pattern with replacement.
  • gsub() replaces all occurances of pattern with replacement.

See line 145 of Examples.R for some examples using sub() and gsub().

Exercises

Practice exercises for this section can be found in Exercsies.Rmd.