Beautiful Soup

Intro

Docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Install:

pip install beautifulsoup4

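Create a BeautifulSoup object from a string containing HTML. The second argument names the parser: "lxml" requires the separate lxml package (pip install lxml), while "html.parser" ships with Python: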

from bs4 import BeautifulSoup

soup = BeautifulSoup("html string", "lxml")
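As a slightly fuller sketch of the same step: the markup below is a made-up sample, and the built-in "html.parser" is used here only so nothing beyond beautifulsoup4 needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = "<html><body><p>Hello, <b>world</b>!</p></body></html>"

# "html.parser" is the parser bundled with Python; swap in "lxml" if it is installed.
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching Tag (or None if there is no match).
first_p = soup.find("p")
print(first_p.name)  # p
```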

Tags and NavigableStrings

When you’re searching and navigating around in the HTML document, your results will be Tags and NavigableStrings.

A Tag represents an HTML tag and everything inside it. The tag name is tag.name (str), and its direct children are available as tag.contents (list of Tags and NavigableStrings) or tag.children (iterator over the same items).
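A small sketch of name, contents and children on a made-up paragraph (the markup and variable names are assumptions for illustration):

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical sample markup.
soup = BeautifulSoup("<p>Hello, <b>world</b>!</p>", "html.parser")
p = soup.find("p")

print(p.name)           # p
print(len(p.contents))  # 3: text, the <b> tag, text

# .children walks the same direct children without building a list.
for child in p.children:
    kind = "Tag" if isinstance(child, Tag) else "NavigableString"
    print(kind, repr(str(child)))
```

Only direct children show up here: the text "world" belongs to the b tag, not to p.contents.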

A NavigableString represents a piece of text that has no further HTML tags inside it.

For both, if you want a str representation of the part of the document they represent, just call str on them: str(some_tag) or str(nav_string). (For tags, the result includes the tag's own opening and closing markup, not just what’s inside it.)
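To make the difference concrete, here is a minimal sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
soup = BeautifulSoup("<p>Hello, <b>world</b>!</p>", "html.parser")
b = soup.find("b")
text_node = b.contents[0]  # the NavigableString inside <b>

print(str(b))          # <b>world</b>
print(str(text_node))  # world
```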

If you have a Tag, you can call .get_text() to get all the text inside it, concatenated across any nested tags. Optional arguments (separator, strip) control how the pieces are joined and whether surrounding whitespace is trimmed:

s = tag.get_text()
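A short sketch of the separator and strip arguments, on a made-up snippet with stray whitespace:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with padding around the text.
soup = BeautifulSoup("<div><p> Hello </p><p> world </p></div>", "html.parser")
div = soup.find("div")

# Default: every piece of text is concatenated as-is.
raw = div.get_text()
print(repr(raw))    # ' Hello  world '

# strip=True trims each piece; separator is placed between the pieces.
clean = div.get_text(separator="|", strip=True)
print(repr(clean))  # 'Hello|world'
```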

If you have a NavigableString, .string gives you the string itself. NavigableString is a subclass of str, so it can be used wherever a str is expected; calling str() on it returns a plain str with the same content.

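You can access .string on a Tag, but the meaning in that case is convoluted: if the tag has exactly one child and it is a NavigableString, you get that string; if the only child is another tag, the lookup recurses into it; otherwise you get None. I find it easier to just avoid it. str() and get_text() are enough anyway.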
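For reference, a sketch of how .string behaves on three made-up paragraphs, next to get_text() on the awkward case:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: three paragraphs with different shapes.
html = "<div><p>one</p><p><b>two</b></p><p>a<b>b</b></p></div>"
soup = BeautifulSoup(html, "html.parser")
p1, p2, p3 = soup.find_all("p")

print(p1.string)      # one   (single text child)
print(p2.string)      # two   (single child tag, so .string recurses into it)
print(p3.string)      # None  (more than one child)
print(p3.get_text())  # ab    (get_text() still works)
```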