aliases

Command

The sed command would look like this:

sed --regexp-extended --in-place=.orig "s/(,[0-9]{3})([0-9]{2}\ \-)/,\2/g" <file>

Source Data

Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8,Column9,Column10,Column11,Column12,Column13,Column14
1GYS4FEJ3CR110987,BD9B6E186B5C0D967911B96F697E697A6DC61D4B,1FTNF2B51AE007310,216.239.174.105,92.151.147.165,Z,10001 - 12/08/1990,2009,FALSE,TRUE,10001 - 12/08/1990,2008,2020-11-30,2020-10-01
JM3TB2BVXD0238933,EECA35C73E66A5A58BA4E763C26B2C4083C684BA,SCBLC43F26C295723,244.218.199.98,64.49.209.152,Z,10002 - 12/08/1990,1986,FALSE,TRUE,10002 - 12/08/1990,2008,2020-11-30,2020-10-01
JN8AZ2NC2E9366591,C607477F79B9CCE088EE6EAC12CB50D9FE70C11C,WBALM5C51AE038487,181.252.242.234,135.155.158.99,Z,10003 - 12/08/1990,1995,FALSE,TRUE,10003 - 12/08/1990,2005,2020-11-30,2020-10-01

Process

The string structure for columns SigningNum and SigningNumAdd is the same, it's structured like this NNNNN - DD/MM/YYYY but we can't specify a column like we can in Excel, so we have to match the string based on something in the structure that is unique to these columns.

It's enough to match based on this pattern ,NNNNN - where N is a number. This is because no other columns have strings like this. We also include the comma delimiter , at the start because it helps ensure we're matching the start of the column, and not the middle of a string (if there were other similar patterns to this column).

We then split the pattern into two Regular Expression Capturing Group's.

Pattern: ,NNN - Regex to match pattern: (,[0-9]{3})
Pattern: NN - - Regex to match pattern: ([0-9]{2}\ \-)

Capturing Group 1 then becomes the part of the string we want to remove, and group 2 is the part of the string we want to keep.

It's then just a matter of telling sed to do just that in the "replace" section of the command with a backreference to the 2nd capturing group. So ,\2 tells sed to replace (,[0-9]{3})([0-9]{2}\ \-) with the contents of the 2nd capturing group, prepended with a comma (so we make sure we still have the delimiter).

So you end up with the strings in both columns looking like this: ,NN - DD/MM/YYYY (where the N and DD/MM/YYYY will be a number and date) which are stripped of the match in the 1st capturing group, and thus have had the first 3 characters removed.

Result

The resulting data then looks like the below.

Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8,Column9,Column10,Column11,Column12,Column13,Column14
1GYS4FEJ3CR110987,BD9B6E186B5C0D967911B96F697E697A6DC61D4B,1FTNF2B51AE007310,216.239.174.105,92.151.147.165,Z,01 - 12/08/1990,2009,FALSE,TRUE,01 - 12/08/1990,2008,2020-11-30,2020-10-01
JM3TB2BVXD0238933,EECA35C73E66A5A58BA4E763C26B2C4083C684BA,SCBLC43F26C295723,244.218.199.98,64.49.209.152,Z,02 - 12/08/1990,1986,FALSE,TRUE,02 - 12/08/1990,2008,2020-11-30,2020-10-01
JN8AZ2NC2E9366591,C607477F79B9CCE088EE6EAC12CB50D9FE70C11C,WBALM5C51AE038487,181.252.242.234,135.155.158.99,Z,03 - 12/08/1990,1995,FALSE,TRUE,03 - 12/08/1990,2005,2020-11-30,2020-10-01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sed-backreferences-regex-capture-groups.md

sed-backreferences-regex-capture-groups.md

Command

Source Data

Process

Result

Files

sed-backreferences-regex-capture-groups.md

Latest commit

History

sed-backreferences-regex-capture-groups.md

File metadata and controls

Command

Source Data

Process

Result