0 comment

Note to self: Crawling the web and stripping HTML and entities on the shell

notetoselfEver tried to download a list of strings from a web page? There are numerous solutions to such problems. Here is my sort of a toolbox solution which only uses shell commands. This means it’s scriptable for many sites/urls.

In my case the HTML contained the desired list of strings, each on it’s own line, each surrounded by <b> Tags. So we can filter out all lines not starting with a <b> tag:

If you try to crawl several sites, the for loop would look like this:

This will leave us with (a) file(s) still containing HTML entities. To strip them from the file you can use a text based HTML browser like w3m:

With our for loop over sites we have several text files which all need to be filtered. Use a “triangle swap” for that:

Happy crawling!

Leave a Reply

Required fields are marked *.