Ever tried to download a list of strings from a web page? There are numerous solutions to such problems. Here is my toolbox solution, which uses only shell commands, so it is easily scripted across many sites/URLs.
In my case the HTML contained the desired list of strings, each on its own line, each wrapped in <b> tags. So we can keep only the lines starting with a <b> tag and then strip the tags themselves:
curl http://sitename | grep '^<b>' | sed -e 's/<[^>]*>//g' > out.txt
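As a quick local check (no network needed), the same grep-and-sed pipeline can be run against a few hand-written sample lines; the sample content below is an assumption resembling the page described above:

```shell
# Hypothetical sample: two wanted lines wrapped in <b>, one noise line
printf '%s\n' '<b>alpha</b>' '<p>noise</p>' '<b>beta</b>' \
  | grep '^<b>' \
  | sed -e 's/<[^>]*>//g'
# prints:
# alpha
# beta
```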
If you want to crawl several sites, the for loop looks like this:
for sitename in site1 site2 site3; do curl "http://$sitename" | grep '^<b>' | sed -e 's/<[^>]*>//g' > "$sitename.txt"; done
This leaves us with one or more files that still contain HTML entities. To decode them you can use a text-based HTML browser like w3m:
echo "H&auml;llo" | w3m -dump -T text/html
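If w3m is not installed, a small sed fallback can decode a handful of common entities; this is a sketch that only covers the entities listed below (an assumption about the input), whereas w3m handles the general case:

```shell
# Fallback decoder for a few common HTML entities (assumption: only these occur).
# &amp; is decoded last so that "&amp;lt;" does not wrongly become "<".
decode_entities() {
  sed -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&auml;/ä/g' -e 's/&amp;/\&/g'
}

echo "H&auml;llo &amp; friends" | decode_entities
# prints: Hällo & friends
```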
With our for loop over sites we have several text files, all of which need filtering. Since you cannot redirect a command’s output back into its own input file, use a “triangle swap” through a temporary file:
for sitename in site1 site2 site3; do w3m -dump -T text/html < "$sitename.txt" > tmp.txt; mv tmp.txt "$sitename.txt"; done
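The temp-file swap is needed because a redirection like `cmd < file > file` truncates the file before the command ever reads it. A minimal demonstration, with tr standing in for w3m so it runs anywhere:

```shell
# Create a small input file in a scratch directory
tmpdir=$(mktemp -d)
printf 'hello\n' > "$tmpdir/demo.txt"

# Wrong: tr 'a-z' 'A-Z' < demo.txt > demo.txt  -- the shell empties demo.txt first.
# Safe: write to a temp file, then rename it over the original.
tr 'a-z' 'A-Z' < "$tmpdir/demo.txt" > "$tmpdir/demo.tmp" \
  && mv "$tmpdir/demo.tmp" "$tmpdir/demo.txt"

cat "$tmpdir/demo.txt"
# prints: HELLO
```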
Happy crawling!