
Note to self: Crawling the web and stripping HTML and entities on the shell

Ever tried to download a list of strings from a web page? There are numerous solutions to such problems. Here is my toolbox solution, which only uses shell commands and is therefore scriptable across many sites/URLs.

In my case the HTML contained the desired list of strings, each on its own line and each surrounded by <b> tags. So we can filter out all lines not starting with a <b> tag, then strip the tags themselves:

curl http://sitename | egrep "^<b>" | sed -e 's/<[^>]*>//g' > out.txt
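
To see what the sed expression does on its own, feed it a single line; the sample content below is made up:

echo '<b>Alice &amp; Bob</b>' | sed -e 's/<[^>]*>//g'
# prints: Alice &amp; Bob  (tags stripped, entities still there)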

If you want to crawl several sites, the for loop looks like this:

for sitename in site1 site2 site3; do
  curl "http://$sitename" | egrep "^<b>" | sed -e 's/<[^>]*>//g' > "$sitename.txt"
done
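
If the list of sites gets longer, you can read the hostnames from a file instead of hardcoding them; sites.txt is a hypothetical name here, one hostname per line:

# sites.txt: hypothetical file, one hostname per line
while read -r sitename; do
  curl -s "http://$sitename" | egrep "^<b>" | sed -e 's/<[^>]*>//g' > "$sitename.txt"
done < sites.txt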

This leaves us with one or more files that still contain HTML entities. To strip them, you can use a text-based HTML browser like w3m:

echo "Hällo" | w3m -dump -T text/html

With our for loop over sites we have several text files which all need to be filtered. Since w3m cannot edit a file in place, use a “triangle swap” via a temporary file:

for sitename in site1 site2 site3; do
  w3m -dump -T text/html < "$sitename.txt" > tmp.txt; mv tmp.txt "$sitename.txt"
done
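
If you don’t need the intermediate files at all, both passes can be combined into one pipeline per site; a sketch using the same commands as above:

for sitename in site1 site2 site3; do
  curl -s "http://$sitename" | egrep "^<b>" | sed -e 's/<[^>]*>//g' \
    | w3m -dump -T text/html > "$sitename.txt"
done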

Happy crawling!
