
Note to self: Crawling the web and stripping HTML and entities on the shell

Ever tried to download a list of strings from a web page? There are numerous solutions to such problems. Here is my toolbox-style solution, which uses only shell commands, so it can be scripted for many sites or URLs.

In my case the HTML contained the desired list of strings, each on its own line and each surrounded by <b> tags. So we can filter out all lines not starting with a <b> tag and then strip the remaining tags with sed:

curl http://sitename | egrep "^<b>.*" | sed -e 's/<[^>]*>//g' > out.txt
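To try the filter without hitting a real site, here is a minimal sketch that feeds a made-up page through the same grep and sed steps. The function names `sample_page` and `extract_bold_lines` and the sample HTML are assumptions for illustration only; `grep -E` is used as the equivalent of `egrep`.

```shell
# Hypothetical sample page standing in for the output of curl.
sample_page() {
  printf '%s\n' \
    '<html><body>' \
    '<b>first entry</b><br>' \
    '<i>not wanted</i>' \
    '<b>second entry</b><br>' \
    '</body></html>'
}

# Keep only lines that start with <b>, then strip every tag.
extract_bold_lines() {
  grep -E '^<b>' | sed -e 's/<[^>]*>//g'
}

sample_page | extract_bold_lines
```

Only the two lines that begin with <b> survive, and the sed expression removes the trailing <br> along with the <b> tags.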

If you try to crawl several sites, the for loop would look like this:

for sitename in site1 site2 site3; do
  curl "http://$sitename" | egrep "^<b>.*" | sed -e 's/<[^>]*>//g' > "$sitename.txt"
done

This leaves us with one or more files that still contain HTML entities. To decode them into plain characters you can use a text-based HTML browser like w3m:

echo "H&auml;llo" | w3m -dump -T text/html

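With our for loop over sites we have several text files, which all need to be filtered. Redirecting the output straight back into the input file would truncate it before it is read, so use a “triangle swap” through a temporary file for that: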

for sitename in site1 site2 site3; do
  cat "$sitename.txt" | w3m -dump -T text/html > tmp.txt && mv tmp.txt "$sitename.txt"
done
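The temporary-file swap can also be wrapped in a small helper. This is only a sketch: the name `filter_in_place` is made up here, and it assumes `mktemp` is available. It replaces the file only when the filter command succeeds.

```shell
# filter_in_place FILE CMD [ARGS...]
# Runs CMD with FILE on stdin and replaces FILE with the output,
# but only if CMD exits successfully. (Hypothetical helper.)
filter_in_place() {
  file=$1
  shift
  tmp=$(mktemp) || return 1
  if "$@" < "$file" > "$tmp"; then
    mv "$tmp" "$file"
  else
    rm -f "$tmp"
    return 1
  fi
}

# Usage with the w3m call from the loop (site names are placeholders):
# for sitename in site1 site2 site3; do
#   filter_in_place "$sitename.txt" w3m -dump -T text/html
# done
```

Note that the replaced file takes on the permissions of the file created by mktemp, which are usually more restrictive than those of the original.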

Happy crawling!


Numbering lines with Unix

Have you ever had a CSV file and wanted to import it into a database? And would you like to add a leading ID column numbered from 0, separated by, let’s say, a comma? Here’s a hint: use the Unix pr (for print) utility:

pr -tn, -N0 test.csv | sed -e 's/^[ \t]*//' > new.csv

My test.csv contains a list of all world manufacturer identifiers (WMI) for car VINs (vehicle identification numbers). The first few rows look like this:

AFA,Ford South Africa
AAV,Volkswagen South Africa
JA3,Mitsubishi

Please note that column headers are added later on. Now the output looks like this:

0,AFA,Ford South Africa
1,AAV,Volkswagen South Africa
2,JA3,Mitsubishi

Now for the curious: what does the command line do?
First for the pr part:

  • -t means: omit headers (remember: normally pr is used to print paginated content …)
  • -n, means: number lines. Use a comma as the separator
  • -N0 means: start with 0

So much for that part. The pr utility normally numbers lines within a given column width (standard is 5 chars). This results in leading whitespace. We don’t want that, so the sed command removes spaces and tabs at the beginning of the line.
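To see the padding and its removal side by side, here is a small sketch using two of the sample rows. The function names `number_lines_raw` and `number_lines` are made up for this example, the -N option assumes GNU pr, and `[[:space:]]` is used as a portable stand-in for `[ \t]`.

```shell
# Number lines from 0 with a comma separator; pr right-aligns the number.
number_lines_raw() {
  pr -tn, -N0
}

# Same numbering, with the leading blanks stripped by sed.
number_lines() {
  pr -tn, -N0 | sed -e 's/^[[:space:]]*//'
}

# First the padded output, then the cleaned-up version.
printf 'AFA,Ford South Africa\nAAV,Volkswagen South Africa\n' | number_lines_raw
printf 'AFA,Ford South Africa\nAAV,Volkswagen South Africa\n' | number_lines
```

The first call shows the numbers indented by pr, the second starts each line directly with the ID.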
Enough Unix magic for now. Happy hacking!

Update: Detlef Kreuz just mentioned on Twitter that this task could also be accomplished with awk:

awk '{print NR-1 "," $0;}' test.csv > new.csv

Here awk executes the commands inside the curly braces for every line of input. For each line it prints the line number minus 1 (NR is awk's built-in record counter, which starts at 1), followed by a comma and the complete line. $0 is an internal awk variable containing the complete current line, while $1, $2 … contain the split-up fields (where to split is determined by FS, the field separator, which by default splits on runs of whitespace). Thanks Detlef!
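As a small illustration of the field variables, the following sketch sets the field separator to a comma with -F and prints only the first column next to the zero-based ID. The function name `first_field_with_id` is made up for this example.

```shell
# Print a zero-based ID followed by the first comma-separated field only.
first_field_with_id() {
  awk -F, '{ print NR-1 "," $1 }'
}

printf 'AFA,Ford South Africa\nJA3,Mitsubishi\n' | first_field_with_id
```

With -F, the manufacturer name lands in $2, so the same pattern could be used to reorder or drop columns while numbering.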