aliases | category | classification | date | date_modified | draft | id | image | links | local_archive_links | pinned | series | tags | title | type | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
sed |
public |
2020-10-06 04:14:53 -0700 |
2020-10-06 04:14:53 -0700 |
false |
20201006111453 |
false |
false |
|
Sed Backreferences and RegEx Capturing Groups |
tech-note |
Sed is very powerful, and I've just learnt about how you can utilise regular expression capturing groups, and sed backreferences to do some particularly powerful edits.
At work we had some synthetic data and I needed to strip the first 3 characters from strings in two specific columns in a 1GB+ CSV. I had to make sure that it was only these columns, so I couldn't just match the first 3 digits, as that would match on all numerical strings. Thankfully the structure of the strings in the SigningNum
and SigningNumAdd
columns were unique.
The sed
command would look like this:
sed --regexp-extended --in-place=.orig "s/(,[0-9]{3})([0-9]{2}\ \-)/,\2/g" <file>
Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8,Column9,Column10,Column11,Column12,Column13,Column14
1GYS4FEJ3CR110987,BD9B6E186B5C0D967911B96F697E697A6DC61D4B,1FTNF2B51AE007310,216.239.174.105,92.151.147.165,Z,10001 - 12/08/1990,2009,FALSE,TRUE,10001 - 12/08/1990,2008,2020-11-30,2020-10-01
JM3TB2BVXD0238933,EECA35C73E66A5A58BA4E763C26B2C4083C684BA,SCBLC43F26C295723,244.218.199.98,64.49.209.152,Z,10002 - 12/08/1990,1986,FALSE,TRUE,10002 - 12/08/1990,2008,2020-11-30,2020-10-01
JN8AZ2NC2E9366591,C607477F79B9CCE088EE6EAC12CB50D9FE70C11C,WBALM5C51AE038487,181.252.242.234,135.155.158.99,Z,10003 - 12/08/1990,1995,FALSE,TRUE,10003 - 12/08/1990,2005,2020-11-30,2020-10-01
The string structure for columns SigningNum
and SigningNumAdd
is the same, it's structured like this NNNNN - DD/MM/YYYY
but we can't specify a column like we can in Excel, so we have to match the string based on something in the structure that is unique to these columns.
It's enough to match based on this pattern ,NNNNN -
where N
is a number. This is because no other columns have strings like this. We also include the comma delimiter ,
at the start because it helps ensure we're matching the start of the column, and not the middle of a string (if there were other similar patterns to this column).
We then split the pattern into two Regular Expression Capturing Group's.
- Pattern:
,NNN
- Regex to match pattern:(,[0-9]{3})
- Pattern:
NN -
- Regex to match pattern:([0-9]{2}\ \-)
Capturing Group 1 then becomes the part of the string we want to remove, and group 2 is the part of the string we want to keep.
It's then just a matter of telling sed
to do just that in the "replace" section of the command with a backreference to the 2nd capturing group. So ,\2
tells sed to replace (,[0-9]{3})([0-9]{2}\ \-)
with the contents of the 2nd capturing group, prepended with a comma (so we make sure we still have the delimiter).
So you end up with the strings in both columns looking like this: ,NN - DD/MM/YYYY
(where the N
and DD/MM/YYYY
will be a number and date) which are stripped of the match in the 1st capturing group, and thus have had the first 3 characters removed.
The resulting data then looks like the below.
Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8,Column9,Column10,Column11,Column12,Column13,Column14
1GYS4FEJ3CR110987,BD9B6E186B5C0D967911B96F697E697A6DC61D4B,1FTNF2B51AE007310,216.239.174.105,92.151.147.165,Z,01 - 12/08/1990,2009,FALSE,TRUE,01 - 12/08/1990,2008,2020-11-30,2020-10-01
JM3TB2BVXD0238933,EECA35C73E66A5A58BA4E763C26B2C4083C684BA,SCBLC43F26C295723,244.218.199.98,64.49.209.152,Z,02 - 12/08/1990,1986,FALSE,TRUE,02 - 12/08/1990,2008,2020-11-30,2020-10-01
JN8AZ2NC2E9366591,C607477F79B9CCE088EE6EAC12CB50D9FE70C11C,WBALM5C51AE038487,181.252.242.234,135.155.158.99,Z,03 - 12/08/1990,1995,FALSE,TRUE,03 - 12/08/1990,2005,2020-11-30,2020-10-01