About Extracting Information from an HTML File

You can extract information from an HTML file by subclassing html.parser.HTMLParser and overriding its handle_*() methods. For example, this class extracts Open Graph information from a web page:

from html.parser import HTMLParser
import requests
from pprint import pprint

class OpenGraphParser(HTMLParser):
    OG_PROPERTIES = ["og:title", "og:type", "og:image", "og:url"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.og_data = {}

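    def handle_starttag(self, tag, attrs):
        # Open Graph data lives in <meta property="og:..." content="..."> tags
        if tag.lower() == "meta":
            # attrs is a list of (name, value) pairs
            attrs_dict = dict(attrs)
            if (
                (prop := attrs_dict.get("property"))
                and (content := attrs_dict.get("content"))
                and prop in self.OG_PROPERTIES
            ):
                # Store the value under the property name without the "og:" prefix
                self.og_data[prop.replace("og:", "")] = content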

    def get_data(self):
        return self.og_data

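if __name__ == "__main__":
    # The 10-second timeout is an arbitrary choice; without one, requests can wait indefinitely
    response = requests.get(
        "https://www.djangotricks.com/tricks/3J96KxVxbApk/", timeout=10
    )
    response.raise_for_status()  # stop early on HTTP errors instead of parsing an error page
    og_parser = OpenGraphParser()
    og_parser.feed(response.text)
    og_data = og_parser.get_data()
    pprint(og_data)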

The parser calls each handle_*() method once per occurrence in the markup, so you can collect all occurrences or search for a specific tag, text, character reference, or comment:

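  • handle_startendtag(self, tag, attrs) - for each self-closing (XHTML-style) tag, e.g. <br />; by default it calls handle_starttag() and then handle_endtag()
  • handle_starttag(self, tag, attrs) - for each opening tag
  • handle_endtag(self, tag) - for each closing tag
  • handle_charref(self, name) - for each numeric character reference, e.g. &#129321; (🤩); called only if the parser was created with convert_charrefs=False
  • handle_entityref(self, name) - for each named entity reference, e.g. &euro; (€); called only if the parser was created with convert_charrefs=False
  • handle_data(self, data) - for each piece of inner text, including inline scripts and styles; with the default convert_charrefs=True, character and entity references arrive here already decoded
  • handle_comment(self, data) - for each HTML comment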
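
To see which handler fires for which piece of markup, here is a minimal sketch that records every call as a tuple. The EventLogger class, the log_events() helper, and the sample string are hypothetical names made up for this illustration; only HTMLParser and its handle_*() methods come from the standard library. The parser is created with convert_charrefs=False so that handle_charref() and handle_entityref() are actually called.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: record each handler call as an (event, value) tuple."""

    def __init__(self):
        # With the default convert_charrefs=True, references are decoded and
        # passed to handle_data(); False keeps the two reference handlers active.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_startendtag(self, tag, attrs):
        # Overriding this replaces the default behavior of calling
        # handle_starttag() followed by handle_endtag().
        self.events.append(("startend", tag))

    def handle_data(self, data):
        # Skip whitespace-only text to keep the log short
        if data.strip():
            self.events.append(("data", data))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_comment(self, data):
        self.events.append(("comment", data.strip()))


def log_events(html):
    """Feed an HTML string to a fresh EventLogger and return its event list."""
    parser = EventLogger()
    parser.feed(html)
    parser.close()
    return parser.events


if __name__ == "__main__":
    sample = "<p>Price: 5&euro; &#129321;<br/><!-- note --></p>"
    for event in log_events(sample):
        print(event)
```

The reference handlers receive only the name or number between the delimiters ("euro" and "129321" for the sample above), so decoding is left to you, for example with chr(int(name)) for a decimal character reference.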
