Data Wrangling with sed

How to get clean data is important, however Excel or any GUI application might not be able to help us clean the raw data. Moreover, if the data set is large, the application might crush. If you are also using unix system like I did, there are many useful utilities that could help with data wrangling. I’ll give some example of using sed in this post.

How to use sed

sed 's/old/new/' file.txt

Substitute old with new in file.txt (Only first occurrence)

sed 's/old/new/g' file.txt

Substitute old with new in file.txt(All occurrence)

sed -n '3p' file.txt

Print the third line. The -n flag suppresses automatic printing.

sed '2d' file.txt

Delete second line

sed '2a\New text' file.txt

Appends “New text” after the 2nd line.

sed '3i\New text' file.txt

Inserts “New text” before the 3rd line.

sed '2,4s/old/new/g' file.txt

Replaces all occurrences of old with new only in lines 2 to 4.

sed '/^$/d' file.txt

Deletes lines that are empty (“^$” matches an empty line).

sed '/pattern/d' file.txt

Deletes lines containing pattern.

sed 's/\(old\) text/\1 new/' file.txt

Change old text to old new.
\(old\) is the first capture group
\1: Replaces with the text matched by the first capture group (old).

Let’s say we have some “dirty” data as below. The desired data we want is leave only tcp or udp in each line and blank line if the line does not contain tcp or udp.

icmppermit
192.168.0.1tcp
192.168.21.252udp
10.11.230.4tcp
10.153.23.177tcp
tcpdeny
anyntp
sed -e "s/[[:digit:]]//g" -e "s/\.//g" -e "s/icmppermit//g" -e "s/deny//g" -e "s/anyntp//g" dirty.csv

It would be huge pain if you would like to clear this with “Replace” function in any GUI application. The numbers are changing and length is not fixed.

Here is a little trick, the “\d” which most regular expression engines will recognize as digits does not work in sed. For digits in sed we can use [0-9] or [[:digit:]]. There are other difference in sed when using regular expression as shown in the table below.

FeatureModern Regexsed Basic Regexsed Extended Regex (-E)
Digit (\d)\d[0-9], [[:digit:]]Same as Basic
Word Char (\w)\w[a-zA-Z0-9_], [[:alnum:]_]Same as Basic
Whitespace (\s)\s[[:space:]]Same as Basic
Word Boundary (\b)\b\<, \> (no equivalent for \b)Same as Basic
Grouping()\(\)()
Alternation|Not supported|
Quantifiers (+)+, {m,n}\+, \{m,n\}+, {m,n}
Summary of Key Differences in sed

Sample clean data


tcp
udp
tcp
tcp
tcp

Another “dirty” data example. We want to remove leading 4 spaces, remove top 2 line and remove anything inside parenthesis.

Vlan 276 Out
IP Access list vlan276
    10 permit icmp any any (2732matches)
    20 permit ip 192.168.30.163 0.0.255.255 any
    30 deny ip 192.168.100.253 0.0.255.255 172.20.15.0 0.0.255.255
sed -n -E -e '3,$p' -e 's/    //g' -e 's/\(.*\)//g' dirty2.csv

-n flag with argument ‘3,$p’ means to print from line 3 to the end.

Sample clear data

10 permit icmp any any
20 permit ip 192.168.30.163 0.0.255.255 any
30 deny ip 192.168.100.253 0.0.255.255 172.20.15.0 0.0.255.255

Leave a comment

Your email address will not be published. Required fields are marked *