Tuesday, November 10, 2015

Parsing Substrings w/Sed, Grep

Sometimes a simple substitution on a regex match is not enough, and it would be useful to match a regex in a string and then run a further regex substitution on just the matched part. This would probably be a useful and welcome improvement to sed, if it were incorporated. For example, suppose you want to make a replacement only within the quoted text in a string. Let's say you have a csv file named test.csv that looks like:

test.csv:
2015-03-23 08:50:22,Jogn.Doe,1,1,Ineo 4000p,"Microsoft, Word, Document1"
2015-03-23 09:34:11,"John,Doe",3,4,Canon 2000,"Further, comma, trouble"

When it's parsed as a csv, the commas within the quotes cause trouble, so it's necessary to get rid of those commas, at least temporarily. It would be useful if we could do it with sed in one pass.
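To make the trouble concrete, here is a small sketch that counts fields the way a naive comma split would. It assumes awk is available, and it recreates the two sample rows in a scratch file (the temp path is arbitrary) so it can be run on its own.

```shell
# Recreate the two sample rows in a scratch file (the path is arbitrary).
csv="$(mktemp)"
cat > "$csv" <<'EOF'
2015-03-23 08:50:22,Jogn.Doe,1,1,Ineo 4000p,"Microsoft, Word, Document1"
2015-03-23 09:34:11,"John,Doe",3,4,Canon 2000,"Further, comma, trouble"
EOF

# Count the fields in each row when every comma is treated as a separator.
field_counts="$(awk -F, '{ print NF }' "$csv")"
echo "$field_counts"   # prints 8, then 9, although each row really has 6 fields
rm -f "$csv"
```

The quoted commas inflate the count, and by a different amount on each row, so the columns no longer line up.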

Wishful (non-existent) sed command: sed 's/"[^"]*"/{s/,/ /g}/g'

Instead we can use bash and grep to loop over each line and then search and replace the text with our parsed versions.
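FILE=test.csv
# Read the file line by line; IFS= and -r keep whitespace and backslashes intact.
while IFS= read -r line; do
 var="${line}"
 # grep -o prints each quoted field of the line on its own line.
 while IFS= read -r b; do
  # Replace each comma (and any spaces after it) inside the field with one space.
  repl="$(sed 's/, */ /g' <<< "${b}")"
  # Swap the field in literally; quoting ${b} keeps it from being read as a pattern.
  var=${var/"${b}"/"${repl}"}
 done < <(grep -o '"[^"]*"' <<< "${line}")
 printf '%s\n' "${var}"
done < "${FILE}"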
When run, every comma inside the quotes is replaced with a space:
2015-03-23 08:50:22,Jogn.Doe,1,1,Ineo 4000p,"Microsoft Word Document1"
2015-03-23 09:34:11,"John Doe",3,4,Canon 2000,"Further comma trouble"
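For comparison, sed's label-and-branch loop can get close to the wished-for one-pass behavior. The following is only a sketch: the function name strip_quoted_commas is made up for illustration, and it assumes the quotes on every line are balanced and that no quoted field spans more than one line. The pattern skips any complete quoted fields, steps inside the next opening quote, and replaces the first comma it finds there; `ta` repeats the substitution until nothing more matches.

```shell
# Hypothetical helper: reads CSV on stdin, writes the cleaned rows to stdout.
strip_quoted_commas() {
  # :a   defines a label to loop back to
  # s/// replaces one comma (plus trailing spaces) inside a quoted field
  # ta   jumps back to :a whenever the substitution changed the line
  sed -E -e ':a' \
         -e 's/^(([^"]*"[^"]*")*[^"]*"[^",]*), */\1 /' \
         -e 'ta'
}

# Feed the two sample rows through the helper.
printf '%s\n' \
  '2015-03-23 08:50:22,Jogn.Doe,1,1,Ineo 4000p,"Microsoft, Word, Document1"' \
  '2015-03-23 09:34:11,"John,Doe",3,4,Canon 2000,"Further, comma, trouble"' |
  strip_quoted_commas
```

With a real file this would be called as strip_quoted_commas < test.csv. The regex is harder to read than the bash loop, so which one to use is mostly a matter of taste.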
