How to get clean data is important, however Excel or any GUI application might not be able to help us clean the raw data. Moreover, if the data set is large, the application might crush. If you are also using unix system like I did, there are many useful utilities that could help with data wrangling. I’ll give some example of using sed in this post.
How to use sed
sed 's/old/new/' file.txt
Substitute old with new in file.txt (Only first occurrence)
sed 's/old/new/g' file.txt
Substitute old with new in file.txt(All occurrence)
sed -n '3p' file.txt
Print the third line. The -n flag suppresses automatic printing.
sed '2d' file.txt
Delete second line
sed '2a\New text' file.txt
Appends “New text” after the 2nd line.
sed '3i\New text' file.txt
Inserts “New text” before the 3rd line.
sed '2,4s/old/new/g' file.txt
Replaces all occurrences of old
with new
only in lines 2 to 4.
sed '/^$/d' file.txt
Deletes lines that are empty (“^$” matches an empty line).
sed '/pattern/d' file.txt
Deletes lines containing pattern
.
sed 's/\(old\) text/\1 new/' file.txt
Change old text to old new.
\(old\) is the first capture group
\1: Replaces with the text matched by the first capture group (old).
Let’s say we have some “dirty” data as below. The desired data we want is leave only tcp or udp in each line and blank line if the line does not contain tcp or udp.
icmppermit
192.168.0.1tcp
192.168.21.252udp
10.11.230.4tcp
10.153.23.177tcp
tcpdeny
anyntp
sed -e "s/[[:digit:]]//g" -e "s/\.//g" -e "s/icmppermit//g" -e "s/deny//g" -e "s/anyntp//g" dirty.csv
It would be huge pain if you would like to clear this with “Replace” function in any GUI application. The numbers are changing and length is not fixed.
Here is a little trick, the “\d” which most regular expression engines will recognize as digits does not work in sed. For digits in sed we can use [0-9] or [[:digit:]]. There are other difference in sed when using regular expression as shown in the table below.
Feature | Modern Regex | sed Basic Regex | sed Extended Regex (-E ) |
---|---|---|---|
Digit (\d ) | \d | [0-9] , [[:digit:]] | Same as Basic |
Word Char (\w ) | \w | [a-zA-Z0-9_] , [[:alnum:]_] | Same as Basic |
Whitespace (\s ) | \s | [[:space:]] | Same as Basic |
Word Boundary (\b ) | \b | \< , \> (no equivalent for \b ) | Same as Basic |
Grouping | () | \(\) | () |
Alternation | | | Not supported | | |
Quantifiers (+ ) | + , {m,n} | \+ , \{m,n\} | + , {m,n} |
Sample clean data
tcp
udp
tcp
tcp
tcp
Another “dirty” data example. We want to remove leading 4 spaces, remove top 2 line and remove anything inside parenthesis.
Vlan 276 Out
IP Access list vlan276
10 permit icmp any any (2732matches)
20 permit ip 192.168.30.163 0.0.255.255 any
30 deny ip 192.168.100.253 0.0.255.255 172.20.15.0 0.0.255.255
sed -n -E -e '3,$p' -e 's/ //g' -e 's/\(.*\)//g' dirty2.csv
-n flag with argument ‘3,$p’ means to print from line 3 to the end.
Sample clear data
10 permit icmp any any
20 permit ip 192.168.30.163 0.0.255.255 any
30 deny ip 192.168.100.253 0.0.255.255 172.20.15.0 0.0.255.255