The SIMPLE WG has been working on some HTTP extensions that use XML to let instant messaging clients interoperably edit buddy lists (stored on the IM server) and other configuration data. Special-purpose functionality to modify or retrieve XML stored on an HTTP server is rampant these days, so it seemed like a good idea to consider general mechanisms rather than design mechanisms limited to SIMPLE use cases. So Jari Urpalainen has been working on a general XML diff, or patch, algorithm -- like Unix diff files, only specialized for XML (operations that add or remove branches of the XML tree structure, rather than operations on lines as in text diffs).
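To make the "branches, not lines" idea concrete, here is a minimal sketch in Python. It is not Jari's format: the `(op, path, payload)` tuples, the `apply_patch` helper, and the buddy-list element names are all made up for illustration, and the paths use only the small XPath subset that `xml.etree.ElementTree` supports.

```python
import xml.etree.ElementTree as ET

# Hypothetical buddy-list document; element and attribute names are illustrative.
BUDDY_LIST = """<buddies>
  <buddy name="alice"/>
  <buddy name="bob"/>
</buddies>"""


def apply_patch(root, operations):
    """Apply (op, path, payload) tuples to the tree in place.

    "add" appends the parsed payload under the element found at path;
    "remove" deletes the element found at path. This tuple format is an
    assumption made for this sketch, not a standardized patch format.
    """
    for op, path, payload in operations:
        if op == "add":
            parent = root if path == "." else root.find(path)
            if parent is None:
                raise ValueError(f"no element matches {path!r}")
            parent.append(ET.fromstring(payload))
        elif op == "remove":
            target = root.find(path)
            if target is None:
                raise ValueError(f"no element matches {path!r}")
            # ElementTree keeps no parent pointers, so scan for the parent.
            for parent in root.iter():
                if target in list(parent):
                    parent.remove(target)
                    break
        else:
            raise ValueError(f"unknown operation {op!r}")
    return root


root = ET.fromstring(BUDDY_LIST)
apply_patch(root, [
    ("add", ".", '<buddy name="carol"/>'),      # add a branch
    ("remove", "buddy[@name='bob']", None),     # remove a branch
])
names = [b.get("name") for b in root.findall("buddy")]
print(names)
```

The point of the sketch is only that the patch addresses nodes in the tree, so the client sends two small operations instead of re-uploading the whole buddy list.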
Once the SIMPLE WG was potentially working on such general mechanisms, it seemed like a good idea to hold a BOF (Birds of a Feather) meeting to see whether there were general use cases and to identify other potential IETF participants. Some places where we thought we'd see interest:
- WebDAV allows authors to collaborate on documents stored on HTTP servers. Sometimes these documents are quite large, and it would be useful to be able to upload changes without sending the entire file again. In fact, Adobe engineers have talked to me about this -- some of their WebDAV functionality is intentionally designed to limit the number of times large files are exchanged between client and server, so that the user isn't constantly waiting for slow uploads or downloads. Obviously an XML patch format only works if the document is in XML, but some Adobe tools do support XML formats (e.g. InDesign). Another piece of this puzzle is the HTTP PATCH operation I've proposed, an idea I intend to come back to shortly, particularly if I get any help (hint, hint).
- The NETCONF WG is pursuing ways to interoperably configure network devices and has also settled on using XML and HTTP. They've got very similar problems of wanting to make small changes to large data sets.
- Large Web pages in XHTML could be edited using an XML diff format to upload only changes.
- Large Web pages in XHTML could be downloaded faster using RFC 3229 and an XML diff format. A text diff is used today, but an XML diff format could be even more efficient, particularly for...
- Blog feeds. Today, a blog feed can be a large XML file, in Atom or RSS format. If the ETag or Last-Modified timestamp of the blog feed changes, the newsreader client downloads the entire file. Similarly, to add a single new post to the feed, blog editing tools may have to upload a new feed file (unless the server does this magically somehow). This is really just a special case of the general "large files being shared" case, but since blogging generates so much traffic it seemed worth mentioning.
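For the download side of the feed case, here is a rough sketch of how a newsreader might ask for a delta using RFC 3229. The `A-IM` and `If-None-Match` request headers, the `IM` response header, and the `226 IM Used` status come from that RFC; the `xmlpatch` instance-manipulation name and both helper functions are assumptions for illustration, since no such XML diff format is registered.

```python
def build_delta_request_headers(cached_etag, manipulations=("xmlpatch",)):
    """Headers for a conditional GET that also accepts a delta.

    "xmlpatch" is a hypothetical instance-manipulation name used only
    for this sketch.
    """
    return {
        "If-None-Match": cached_etag,       # the version the client already holds
        "A-IM": ", ".join(manipulations),   # delta formats the client can apply
    }


def classify_response(status, headers):
    """Decide what the client should do with the server's answer."""
    normalized = {k.lower(): v for k, v in headers.items()}
    if status == 304:
        return "unchanged"      # cached feed is still current
    if status == 226 and "im" in normalized:
        return "apply-delta"    # body is a patch against the cached feed
    if status == 200:
        return "replace"        # server sent the full feed
    return "error"


request_headers = build_delta_request_headers('"abc123"')
action = classify_response(226, {"IM": "xmlpatch", "Delta-Base": '"abc123"'})
```

A server that doesn't understand `A-IM` just ignores it and answers with a plain 200 or 304, so a client written this way still works against today's servers.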
to form a separate effort. So the work proceeds on the SIMPLE mailing list. Still, I plan to keep up with Jari's work and possibly help him generalize it further -- for example, we may add the ability to make changes to text values of XML elements without replacing the entire text value.
Note that other XML diff formats exist, but none of them are standardized. Microsoft's got one, the W3C has tackled this both for RDF and more generally (though the W3C didn't have any guidance for the IETF when we asked about this BOF), and it's been the subject of several theses: treepatch, diffxml and a survey.