Algroveon-Parser – Self-built RSS/Atom Parser in Python

The Starting Point

Anyone who wants to parse RSS feeds in Python will typically reach for an established library, that is, ready-made code that you integrate into your own project instead of developing the functionality yourself. In this context, "parsing" simply means reading the XML content of a feed, recognizing its structure, and extracting fields like title, link, date, or description so that your own code can work with them. This is pragmatic, and it works well for a while. But eventually you reach a point where it no longer feels right: the dependency is deeply embedded in the core of the project, the library handles things automatically in ways that are not fully transparent, and every new requirement leads you to someone else's documentation instead of directly to your own code.

That was the starting point for Algroveon-Parser. External solutions were not the problem; on the contrary, many of them have matured over years and make sense for many use cases.

What a Feed Parser Actually Needs to Do

RSS and Atom are XML formats. On paper, it sounds simple: the file is read, individual entries are recognized, and then the most important information like title, link, date, or description is extracted. In practice, however, it is significantly messier.

Format Proliferation: RSS 2.0 is the most common RSS variant today, but RSS 0.91 still turns up in the wild. Atom 1.0 appears in newer sources, such as The Verge. RDF/RSS 1.0 is rare but should at least be recognized cleanly or intercepted. A parser that handles only RSS 2.0 well will quickly hit its limits with real-world sources.
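A minimal sketch of how such format recognition could look, based only on the root element. The helper name detect_format and the returned labels are hypothetical and not taken from the parser itself.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def detect_format(raw: bytes) -> str:
    """Return a coarse format label based on the root element (hypothetical helper)."""
    root = ET.fromstring(raw)
    if root.tag == "rss":
        # RSS 0.91 and RSS 2.0 share the <rss> root and differ in the version attribute.
        return f"rss-{root.get('version', 'unknown')}"
    if root.tag == f"{{{ATOM_NS}}}feed":
        return "atom-1.0"
    if root.tag == f"{{{RDF_NS}}}RDF":
        return "rdf-rss-1.0"
    return "unknown"
```

Returning a label for RDF instead of failing is one way to "intercept" the rare case: the caller can decide whether to reject the feed or route it to a dedicated code path.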

Namespace Chaos: Almost no feed limits itself to the core elements of its format. content:encoded carries the full article text, dc:creator the author's name, media:thumbnail the images, with media:content as an alternative, and every feed combines these in slightly different ways.

Encoding Lies: One specific feed (motorsport_magazin) declares encoding="ISO-8859-1" in its XML declaration but actually delivers UTF-8. Python's xml.etree.ElementTree trusts this declaration, which can lead to parsing problems in such cases. The pragmatic fallback: ignore the declaration, decode the content as UTF-8 on a trial basis, and parse again.
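One way to sketch this fallback with the standard library alone. Note the assumption: instead of retrying after a failed parse, this variant checks up front whether the bytes are valid UTF-8 and, if so, overrides the declared encoding through the encoding argument of XMLParser. The helper name parse_xml_lenient is hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_xml_lenient(raw: bytes) -> ET.Element:
    """Parse XML, preferring UTF-8 whenever the bytes are valid UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")  # strict validity check only; the decoded text is discarded
    except UnicodeDecodeError:
        # Not valid UTF-8: trust whatever the XML declaration says.
        return ET.fromstring(raw)
    # Valid UTF-8: override the declared encoding, which may be wrong.
    return ET.fromstring(raw, parser=ET.XMLParser(encoding="utf-8"))
```

The check works because non-ASCII text encoded as ISO-8859-1 is almost never valid UTF-8 by accident, while pure ASCII content is identical in both encodings.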

Date Diversity: RSS typically uses dates close to RFC 2822 (Sat, 21 Mar 2026 20:03:31 +0100), while Atom usually uses ISO 8601 (2026-03-21T13:00:00-04:00). Both can appear with numeric timezone offsets, GMT, or -0000. email.utils.parsedate_to_datetime helps with RFC 2822 but resolves -0000 to a naive datetime, and that is exactly the case that must then be corrected explicitly.

Images that aren't placed cleanly anywhere: Some feeds deliver images via media:thumbnail, others via media:content, and others hide the image as the first <img> tag in the HTML body of content:encoded or description. Without explicit image extraction, the image simply remains undetected in many feeds.

Design Decisions

Zero External Dependencies

This was the most important and earliest decision. pyproject.toml has dependencies = []. This sounds radical, but in essence it isn't: Python's standard library brings everything needed for this parser, namely xml.etree.ElementTree for XML parsing, html.parser for the HTML sanitizer, email.utils for RFC 2822 date parsing, and re for ISO 8601 and image extraction.

The advantage is very concrete: no additional pip dependencies that could clash with other projects, and no external updates that change behavior unnoticed. The parser runs anywhere Python 3.12 runs—without further preparation. This doesn't fundamentally make it better than established libraries, but for this narrowly defined purpose, it is deliberately manageable and easy to control.

Strict Input Interface: raw bytes

The public API accepts raw bytes—no URL, no HTTP client, no automatic downloading. This is a conscious limitation. It ensures that the transport method—whether HTTP, file, or test fixture—lies completely outside the parser and remains separately testable there. For testing, this means: read fixture files, parse directly, no network required.

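import urllib.request
# Transport stays outside the parser: fetch the bytes yourself, then hand them over.
raw = urllib.request.urlopen(url).read()
feed = parse(raw, source_url=url)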

Typed Output

The result is always a Feed object with a typed Entry list—no dicts, no optional keys that you have to defensively check everywhere. Entry has fixed fields: title, url, published (timezone-aware datetime or None), summary, summary_text, content, author, guid, image_url.

The two summary variants are deliberately separated: summary as cleaned HTML for display in the browser, and summary_text as plain text for passing to an LLM.

HTML Sanitizer: Allowlist instead of Blocklist

The sanitizer in sanitize.py works with an allowlist of permitted tags: every other tag is silently removed while its text content is preserved. script, style, iframe, and form are deleted completely, along with their content. Links are kept only if they point to http or https URLs; javascript: URLs are discarded. Images only keep a src that uses a safe scheme.

This is not an extra, but a requirement. Feed content comes from arbitrary sources and is rendered in the browser. Without a sanitizer, XSS is not a theoretical possibility, but a very real problem.

The Challenges in Detail

Parsing ISO-8601 manually

email.utils can handle RFC 2822, and Python's datetime.fromisoformat has supported much of ISO 8601 since 3.11. In practice, however, there are still enough variants that I didn't want to rely on it blindly—especially when milliseconds, Z, or different offset notations come together. The decision was therefore: a hand-written regex that covers exactly the patterns relevant to the project and constructs a timezone-aware datetime from them. Manageable, controllable, and sufficient for the specific use case—without the claim of providing a general reference implementation for all ISO-8601 variants.

`media:thumbnail` with correct Namespace URIs

Feeds declare the media: namespace as http://search.yahoo.com/mrss/. ElementTree resolves this correctly, but only if the lookup names the namespace explicitly, either in full Clark notation ({http://search.yahoo.com/mrss/}thumbnail) or through a prefix mapping passed in the namespaces argument of find(). This is not particularly intuitive and was one of the places where the first test runs yielded silently incorrect results: the thumbnail extraction returned None even though images were present in the feed.
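A short demonstration with a made-up feed item. Besides Clark notation, ElementTree also accepts a prefix-to-URI mapping through the namespaces argument, shown here as an alternative; a search for the local name alone finds nothing.

```python
import xml.etree.ElementTree as ET

MEDIA_NS = "http://search.yahoo.com/mrss/"

item = ET.fromstring(
    '<item xmlns:media="http://search.yahoo.com/mrss/">'
    '<media:thumbnail url="https://example.com/thumb.jpg"/>'
    "</item>"
)

# Clark notation: the namespace URI in braces, followed by the local name.
clark = item.find(f"{{{MEDIA_NS}}}thumbnail")

# Alternative: keep the prefix and pass an explicit prefix-to-URI mapping.
mapped = item.find("media:thumbnail", namespaces={"media": MEDIA_NS})

# The local name without any namespace does not match the element.
plain = item.find("thumbnail")
```

Here clark and mapped both point to the thumbnail element, while plain is None.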

Atom Links: `rel="alternate"` is optional

In the Atom standard, <link> has a rel attribute. rel="alternate" refers to the article link. However, many feeds omit this attribute—according to the specification, alternate is the default. An XPath search like [@rel='alternate'] will then find nothing. The fallback is accordingly simple: if no link with rel="alternate" is found, the first <link> tag with an href attribute is taken.

Where Algroveon-Parser is used

The parser is integrated into Algroveon-Agent via API. There, it reads the configured news feeds, extracts articles, and prepares them for LLM summarization—summary_text goes to Ollama, while summary and image_url go to the display.

Decoupling was the right decision here: Algroveon-Agent needs no XML knowledge, and Algroveon-Parser needs no knowledge of Ollama. This keeps both sides simpler, clearer, and more testable than if everything were contained in a single block.
