How to parse HTML pages? |
This is a bit more complicated than you might expect. First of all, you do not want to process HTML pages as text: parsing HTML without first translating it into a data structure is bound to be sensitive to the many ways in which pages are authored. Worse, HTML found in the wild is often invalid, and sometimes it is in fact XHTML, while documents that claim to be XHTML often contain HTML features such as missing close tags, use of uppercase, etc.
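As a small illustration of what "parsing into a data structure" means here, the sketch below (assuming an interactive SWI-Prolog session, and assuming load_html/3 accepts the string(...) input form provided by library(iostream)) parses a tiny snippet and yields nested element(Tag, Attributes, Content) terms rather than text:

    ?- use_module(library(sgml)).
    ?- load_html(string("<p>Hello <b>world</b></p>"), DOM, []).
    % DOM is bound to a list of nested element(Tag, Attributes, Content)
    % terms, e.g. a <p> element whose content holds text and an embedded
    % element(b, ..., ...) term, rather than a flat string.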
SWI-Prolog's SGML/XML parser can produce results that are often good enough for scraping. The best option set depends a little on the nature of the document: notably, documents in a different encoding or sites using strict and proper XML may ask for different options. Below is some code to get started. It silently ignores all errors, which is often what you want for real scraping, but viewing the messages may be helpful if you get weird output.
:- use_module(library(http/http_open)).
:- use_module(library(sgml)).

http_load_html(URL, DOM) :-
    setup_call_cleanup(
        http_open(URL, In, [ timeout(60) ]),
        ( dtd(html, DTD),
          load_structure(stream(In), DOM,
                         [ dtd(DTD),
                           dialect(sgml),
                           shorttag(false),
                           max_errors(-1),
                           syntax_errors(quiet)
                         ])
        ),
        close(In)).
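Once the DOM is available, library(xpath) is a convenient way to extract data from it. The sketch below uses page_links/2, a hypothetical helper rather than a library predicate, to collect the href attribute of every link on a page:

    :- use_module(library(xpath)).

    %   Hypothetical helper: collect the href attribute of every <a>
    %   element in the page identified by URL.
    page_links(URL, HREFs) :-
        http_load_html(URL, DOM),
        findall(HREF,
                xpath(DOM, //a(@href), HREF),
                HREFs).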
Note that the use of setup_call_cleanup/3 guarantees that the opened connection is eventually closed. It does not, however, stop exceptions, and errors are a normal part of processing web documents. Often the scraper uses a control structure like the one below to process a URL and simply ignore it if no data can be retrieved from it. Note that we give this as example code rather than as a library because the demands on an actual scraper vary a lot: it may be unacceptable to simply give up, you may want to wait and retry after an error, your decision may depend on the error returned, etc.
scrape_no_error(URL) :-
    catch(http_load_html(URL, DOM), Error,
          ( print_message(warning, Error),
            fail
          )),
    !,
    process_dom(DOM).
scrape_no_error(_).
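For example, a variant that waits and retries a few times before giving up could look like the sketch below; scrape_retry/2 and the one-second delay are illustrative choices, not part of any library, and process_dom/1 is again whatever your application does with the DOM:

    %   Hypothetical retry variant: try at most Tries times, sleeping a
    %   second between attempts, and process the DOM of the first
    %   successful download.
    scrape_retry(URL, Tries) :-
        Tries > 0,
        (   catch(http_load_html(URL, DOM), Error,
                  ( print_message(warning, Error),
                    fail
                  ))
        ->  process_dom(DOM)
        ;   sleep(1),
            Tries1 is Tries - 1,
            scrape_retry(URL, Tries1)
        ).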