---
date: "2004-10-25T02:29:48Z"
title: New Raggle Engine in CVS
---

<p>
What will probably become the new <a href='http://raggle.org/'>Raggle</a> engine is now in <a href='/cvs/'><acronym title='Concurrent Versions System'>CVS</acronym></a>, under the module name <code>squaggle</code>.  Here's what I've got so far:
</p>

<ul>
<li><a href='http://sqlite.org/'>SQLite</a> backend.</li>
<li>Full <a
href='http://fishbowl.pastiche.org/2002/10/21/http_conditional_get_for_rss_hackers'>Conditional
<acronym title='HyperText Transfer Protocol'>HTTP</acronym> GET</a> support
(both <code>ETag</code> and <code>Last-Modified</code>)</li>
<li><acronym title='HyperText Transfer Protocol'>HTTP</acronym> proxy support (via the <code>http_proxy</code> <abbr title='Environment'>env</abbr> variable or the config hash; there's a stub for win32 proxy support at the moment)</li>
<li><acronym title='HyperText Transfer Protocol'>HTTP</acronym> 1.0 basic authentication support</li>
<li>Simple adding and listing of feeds (via the <code>Squaggle#feeds</code> and <code>Squaggle#feed_items</code> methods)</li>
<li>Engine should be <a href='http://ruby-lang.org/'>Ruby</a> thread safe, but at the moment there's some quirk with the <a href='http://sqlite.org/'>SQLite</a> behavior.</li>
<li>Significantly better memory consumption (memory use will ultimately depend on the interface implementation, but the engine is designed so the interface can query as much or as little information about feeds and feed items as it wants)</li>
<li>Basic <acronym title='Really Simple Syndication / RDF Site Summary / Rich Site Summary / probably more'>RSS</acronym> 0.91-0.92 (Userland), 1.0, and 2.0 support (presumably it'll work with Netscape 0.90-0.91 and Userland 0.93-0.94 feeds as well, although I haven't tested with those).  There are stubs for <acronym title='Really Simple Syndication / RDF Site Summary / Rich Site Summary / probably more'>RSS</acronym> 1.0 modules (via the <code>feed_attrs</code> table), for elements I haven't implemented yet, and for <a href='http://www.mnot.net/drafts/draft-nottingham-atom-format-02.html'>Atom</a> support as well.  I have more to say about this one below.</li>
</ul>
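<p>
For anyone unfamiliar with conditional GET, here is a rough sketch of the idea using Ruby's standard <code>net/http</code> library.  This is not the Squaggle code: the method names <code>conditional_get</code> and <code>validators_from</code> and the <code>cached</code> hash are made up for illustration.  The client sends back the <code>ETag</code> and <code>Last-Modified</code> values it saved from the last fetch, and a <code>304 Not Modified</code> response means the stored copy can be reused.
</p>

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: build a GET request that carries the validators
# saved from the previous fetch of this feed (if there are any).
def conditional_get(url, cached = {})
  req = Net::HTTP::Get.new(URI.parse(url).request_uri)
  req['If-None-Match'] = cached[:etag] if cached[:etag]
  req['If-Modified-Since'] = cached[:last_modified] if cached[:last_modified]
  req
end

# Hypothetical helper: return nil when the server answered 304 (the
# cached copy is still good), otherwise the validators to store for
# the next request.
def validators_from(res)
  return nil if res.is_a?(Net::HTTPNotModified)
  { etag: res['ETag'], last_modified: res['Last-Modified'] }
end
```

<p>
The request would then be sent with something like <code>Net::HTTP.start(host, port) { |http| http.request(req) }</code>, and the feed body only needs to be parsed when <code>validators_from</code> returns something other than <code>nil</code>.
</p>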

<p>
I spent a bunch of time in the last month reading through as many <acronym title='Really Simple Syndication / RDF Site Summary / Rich Site Summary / probably more'>RSS</acronym> specs as I could get my hands on.  I read through the <a href='http://www.mnot.net/drafts/draft-nottingham-atom-format-02.html'>Atom spec</a> as well.  The three biggest problems users have had with <a href='http://raggle.org/'>Raggle</a> are speed, memory use, and supported feeds.  I'm attempting to address the speed issue by deferring as much of the internal searching and sorting as possible to <a href='http://sqlite.org/'>SQLite</a> (aside: this also has the side benefit of dramatically simplifying the code, since all the funky array indexing, time conversions, ID hashing, etc. goes away and becomes <acronym title='Structured Query Language'>SQL</acronym> queries :D).  The memory use has also been addressed, with a caveat (see my note above about the end-user interfaces and memory requirements).  Paradoxically, the Ncurses interface may end up using more memory than the web interface, because the Ncurses interface has more speed and caching requirements than the web interface.  As for proper feed support, that one is a little bit trickier.
</p>

<p>
Supporting <acronym title='Really Simple Syndication / RDF Site Summary /
Rich Site Summary / probably more'>RSS</acronym> properly is actually
kind of a bitch, because there is no official standard (although there
are <a
href='http://diveintomark.org/archives/2004/02/04/incompatible-rss'>plenty
of specifications</a>).  Even worse, a <em>lot</em> of feeds play fast
and loose with requirements, so strict <acronym title='Really Simple
Syndication / RDF Site Summary / Rich Site Summary / probably
more'>RSS</acronym> parsers (like the undocumented one included with <a
href='http://ruby-lang.org/'>Ruby</a> 1.8, or <a
href='http://www.chadfowler.com/'>Chad Fowler's</a> <a
href='http://www.chadfowler.com/ruby/rss/'>Ruby/RSS</a> module) are nice
pieces of code, but useless for writing an <acronym title='Really Simple Syndication / RDF Site Summary / Rich Site Summary / probably more'>RSS</acronym> aggregator, in the same way that strict <acronym title='HyperText Markup Language'>HTML</acronym> parsers are useless for web browsers.
</p> 

<p>
The way I dealt with this problem in previous versions of <a
href='http://raggle.org/'>Raggle</a> was to simply ignore the specs that
were out there and look for specific elements in feeds.  This has worked
so well I'm going to keep doing it, with a twist.  My goal with Squaggle
is to keep <a href='http://raggle.org/'>Raggle</a> aware of as much of
the <acronym title='Really Simple Syndication / RDF Site Summary / Rich Site Summary / probably more'>RSS</acronym> spectrum as I can, but have the engine (Squaggle) only pay attention to what it absolutely has to.  For example, if a feed has mixed RSS 0.92/1.0 elements, <a href='http://raggle.org/'>Raggle</a> will parse it blindly and save what it can.
</p>
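<p>
To give a feel for what "look for specific elements and save what you can" might mean in practice, here is a small sketch using REXML (which ships with Ruby).  It is an illustration only, not the actual Squaggle parser: the <code>FIELDS</code> table, <code>walk</code>, and <code>loose_items</code> are names invented for this example.  It ignores the feed version entirely, collects every <code>item</code> element wherever it appears, and keeps whichever of a few known child elements happen to be present.
</p>

```ruby
require 'rexml/document'

# Hypothetical lookup table: for each field we care about, the child
# element names to try, in order of preference (prefixes are ignored).
FIELDS = {
  'title' => %w[title],
  'link'  => %w[link guid],
  'body'  => %w[description encoded summary],
}

# Visit every descendant element of +el+, depth first.
def walk(el, &block)
  el.elements.each do |child|
    block.call(child)
    walk(child, &block)
  end
end

# Return one hash per <item>, containing only the fields that were found.
# Note that the XML still has to be well-formed for REXML to read it.
def loose_items(xml)
  doc = REXML::Document.new(xml)
  items = []
  walk(doc) do |el|
    next unless el.name == 'item'   # Element#name is the name without prefix
    kids = el.elements.to_a
    item = {}
    FIELDS.each do |key, names|
      hit = names.map { |n| kids.find { |c| c.name == n && c.text } }.compact.first
      item[key] = hit.text.strip if hit
    end
    items << item unless item.empty?
  end
  items
end
```

<p>
Because the walk matches on element name alone, an <code>item</code> sitting inside <code>channel</code> (RSS 0.9x/2.0 style) and one sitting next to it (RSS 1.0 style) are both picked up, and unknown elements are simply skipped.
</p>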

<p>
What I've got so far is available in <acronym title='Concurrent Versions System'>CVS</acronym> under the module <a href='http://cvs.pablotron.org/?m=squaggle'><code>squaggle</code></a>.  Play around with it and let me know what you think.
</p>