Comparing one URI to another turns out to be a commonly desired feature. Browsers look up cached pages based on URI comparison. If I click a link to bookmark a page in delicious.com, I'd like it to be bookmarked only once; if I've already bookmarked it, the site should bring up that bookmark so I can see what I tagged it with and when. Outside of HTTP URIs, I'd still like to know if I'm already subscribed to an XMPP user, etc.
Comparison is harder than it sounds, but you already know that if you've dealt with any code requiring canonicalization, conversion or encoding. If a link in a Web page contains a space, at some point my browser has to convert that space to %20 to use in an HTTP request. Should the browser do that conversion before or after looking up the URL in the cache? Should delicious.com bookmark the URL with the space or the one that's used in HTTP requests? This is the tip of a very large iceberg that potentially includes all of internationalization and Unicode. Be very clear on what character set is used by each part of a URI, and if it's all ASCII, say so.
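As a rough illustration of the space-versus-%20 question, here is a minimal Python sketch that encodes a URL before using it as a lookup key, so the link as written and the URL as requested land on the same cache entry. The function name `cache_key`, the `SAFE` character set, and the choice to encode before lookup are assumptions made for this example, not a description of what any particular browser or bookmarking service does.

```python
from urllib.parse import quote

# Characters left untouched: the RFC 3986 reserved characters plus '%',
# so an existing escape such as %20 is not encoded a second time.
# (Assumption for this sketch; a real scheme must state its own rules.)
SAFE = ":/?#[]@!$&'()*+,;=%"


def cache_key(url: str) -> str:
    """Percent-encode anything outside the safe set (spaces, non-ASCII)."""
    return quote(url, safe=SAFE)


as_written = "http://example.org/my page.html"      # link text in the page
as_requested = "http://example.org/my%20page.html"  # what goes on the wire

print(as_written == as_requested)                        # plain string comparison
print(cache_key(as_written) == cache_key(as_requested))  # comparison after encoding
```

Plain string comparison treats the two as different, while the encoded keys match. Note that `quote` encodes non-ASCII characters as UTF-8, which is itself a character-set decision of exactly the kind a scheme definition has to spell out.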
Case sensitivity is a frequent issue. Be very clear on which parts of the URI are case sensitive.
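To make the point concrete, here is a small Python sketch for URIs that follow the generic syntax with an authority part, where the scheme and host are case-insensitive but the path is not. The helper name `normalize_case` is hypothetical, and the rule it applies is only the generic one; a new scheme would need to say which of its own components behave this way.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_case(uri: str) -> str:
    """Lowercase the scheme and host; leave userinfo, path, query, fragment alone."""
    parts = urlsplit(uri)
    # Keep any userinfo ("user@") as written, since it may be case-sensitive.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    return urlunsplit(
        (parts.scheme.lower(), netloc, parts.path, parts.query, parts.fragment)
    )


a = normalize_case("HTTP://Example.ORG/faq")
b = normalize_case("http://example.org/faq")
c = normalize_case("http://example.org/FAQ")

print(a == b)  # differ only in scheme and host case
print(b == c)  # differ in path case
```

The first pair compares equal after normalization; the second does not, because the path is left exactly as written.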
For new URIs, giving options makes the comparison job much harder. Let's say a new URI scheme needed to include a country designation: it seems nice to let users put a two-letter country code, a three-letter country code, a TLD or an OID in there. Only now one needs a horrid table to convert and compare these, and string comparison is no longer enough.
Optional syntaxes are similarly difficult; even allowing for '/' vs '\' can lead to error.
When the URI form is an alternate form for an identifier that already exists, now the URI may have to be comparable to something that's not a URI. For example, both IRI and URN forms exist for ISO OIDs. Don't they need to be compared to each other?
Can the URI form have query syntax? Is that part of the comparison or is that stripped off first? In HTTP URIs if I stripped off the query syntax I'd retrieve quite a different resource, but in some URI forms, the query syntax is used to carry information other than resource-identifying information. For example, can I compare two mailto URIs that have the same mail address, even if one of them has a query part with "?subject=Hey%20There"?
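One possible answer for the mailto case, sketched in Python: compare only the mailbox and ignore the query part. This is an illustration of a policy, not the rule from the mailto specification. The helper name `mailto_address` is hypothetical, and the sketch assumes a single address per URI and treats only the domain as case-insensitive.

```python
from urllib.parse import unquote, urlsplit


def mailto_address(uri: str) -> tuple:
    """Return (local-part, lowercased domain), dropping any query part."""
    parts = urlsplit(uri)
    if parts.scheme != "mailto":
        raise ValueError("not a mailto URI: " + uri)
    # For mailto, the address ends up in .path and "subject=..." in .query.
    local, _, domain = unquote(parts.path).rpartition("@")
    return (local, domain.lower())


plain = "mailto:user@example.com"
with_subject = "mailto:user@Example.COM?subject=Hey%20There"

print(plain == with_subject)                                   # string comparison
print(mailto_address(plain) == mailto_address(with_subject))   # mailbox comparison
```

Whether the second comparison is the "right" one depends on what the application is asking: "is this the same mailbox?" and "is this the same link?" are different questions, which is why the scheme definition should say which one comparison means.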
Can the URI form contain multiple values? The SMS URI definition had to include text on comparison when multiple SMS addresses were packed into the same URI. Does order matter to comparison?
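Here is a minimal Python sketch of what an order-insensitive comparison could look like for a URI that packs several recipients into one identifier. The assumption that order does not matter is made purely for illustration; it is exactly the kind of decision the scheme definition has to state. The helper name `sms_recipients` and the phone numbers are made up for the example.

```python
def sms_recipients(uri: str) -> frozenset:
    """Return the recipients of an sms: URI as an unordered set."""
    scheme, _, rest = uri.partition(":")
    if scheme.lower() != "sms":
        raise ValueError("not an sms URI: " + uri)
    # Drop any query part (e.g. "?body=..."), then split the recipient list.
    recipients, _, _query = rest.partition("?")
    return frozenset(r for r in recipients.split(",") if r)


first = "sms:+15105550101,+15105550102"
second = "sms:+15105550102,+15105550101"

print(first == second)                                   # string comparison
print(sms_recipients(first) == sms_recipients(second))   # set comparison
```

Using a set also collapses duplicate recipients, which may or may not be what a given scheme wants; a list compared in order would be the alternative if order were significant.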
Frequently URI schemes need to avoid collisions, so that there isn't an attempt to give two different things the same identifier. The problem here is delegating the ability to create new URIs, while still avoiding collisions. The major fallback here is the DNS: URIs that contain a domain name, where the resource being named belongs to that domain, effectively delegate the uniqueness concern to that domain holder.
For example, we don't need to worry that "xmpp:email@example.com" will conflict with other resources, because the domain in the URI ('example.com' here) assigns usernames uniquely within that domain and prevents collisions.
The other main option is to use registries. For example, all OIDs, defined by ISO, use the ISO process to register numerical values and string values for use in the parts of an OID. Other times, IANA is the registry (e.g. for port numbers in HTTP URIs). If there is a new registry needed by the URI, this is more work and more to get right.
Some processes for ensuring uniqueness are quite heavyweight. Many IANA registries have processes which can take weeks or months to resolve. If the registrar is not IANA, who is going to actually run the registration process and under which rules? The OGC URN defined in RFC 5165 includes sub-namespaces issued by OGC itself. The first consequence of this is that the OGC organization must be referred to for any new OGC URN unless it explicitly delegates that part of the namespace. To reduce the burden of being a registrar in the case of non-permanent, test or experimental OGC URNs, the URN definition mentions the possibility of an experimental sub-namespace and the possibility of collisions within that namespace. Now implementors have to consider the possibility of leaked experimental names and dealing with collisions. The approval discussion of RFC 5165 was lengthy because of these nuances.
Some URIs need to refer to the same thing over only a short time, but typically the desired stable period is long, or even indefinite. Domain names can be a problem here. Initially it might seem great to use HTTP URIs as XML namespaces, but consider whether the holder of the "example.org" domain will change over time, and whether the new holder will have the same policies regarding use and allocation of URIs in that namespace.
If a registry is used to achieve unique assignment, and the registrar is not IANA, then the stability of the registry must be considered. How long is the organization going to exist and maintain the registry publicly? We look for a public commitment, existing Web pages, a long-lived organization and so on. An explanation of the process and deciding factors for how names are assigned, and of how the organization ensures they are not reassigned, shows that the organization has thought about this commitment. See RFC 5328 for an example.
Random List of Gotchas
- Does your URI scheme include use of fragment identifiers (like the #iri part of http://example.org/faq#iri)? Forget it; fragment identifiers relate to the media type of the *resource*, not the type of the URI. So if the URI "foo:bar:baz" retrieves an HTML page, then the fragment identifier would act like an HTML document fragment identifier.
- ABNF is hard to get right. Get it reviewed by an expert. Use an ABNF generator or something like that to test your instincts. Refer to existing productions where possible. One common issue is to use a separator like "=" between two constructs, and then define one of those constructs in a way that it includes the separator character itself.
- Another ABNF/syntax issue is to accidentally use a character that has an obscure meaning in URI syntax, or is simply reserved.
- URIs that contain phone numbers include a whole barrel of troublesome monkeys. It's so hard to get telephone numbers right, with variable length and special encodings, '+' prefixes, dashes or spaces, and extension numbers, that it's worth trying very hard to use an existing phone-based URI instead of defining a new one.
- If query parameters are used, can the set of keys be extended? Can new keys be defined by anybody? Do they have global meaning (like mailto URIs) or purely local meaning (like HTTP URIs)?
- The "community considerations" section required for URN registrations is frequently misunderstood. What the IETF looks for in this section is an indication that the work done to standardize this scheme and allocate a new scheme or URN type will add value; that there is benefit to the Internet community and not only to a private consortium or private company.
- If any reviewer blithely says "whyever invent a new URI scheme, use HTTP for everything!" just ignore them until they provide actual reasoning for this proposal.
- Embedding URIs within URIs, or any syntax that is infinitely extensible, is asking for trouble.
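The separator gotcha in the ABNF item above can be seen with a few lines of Python. This sketch assumes a hypothetical grammar of the form `pair = key "=" value`, where `key` and `value` are both allowed to contain "=", and simply lists every way a string can be split under that grammar. The function name `possible_splits` is made up for the example.

```python
def possible_splits(pair: str, sep: str = "=") -> list:
    """List every (key, value) reading of `pair` when both sides may contain sep."""
    return [
        (pair[:i], pair[i + 1:])
        for i, ch in enumerate(pair)
        if ch == sep
    ]


print(possible_splits("a=b"))    # one separator, one reading
print(possible_splits("a=b=c"))  # two separators, two readings
```

For "a=b=c" the grammar admits both ("a", "b=c") and ("a=b", "c"), so two conforming parsers can disagree about the same URI. Excluding the separator from at least one of the two productions, or requiring it to be percent-encoded there, removes the ambiguity.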
Here are the documents and registries that govern the registration and syntax of new URI schemes and new URNs.
- RFC3406: How to define a new URN namespace or "NID" or "Namespace Identifier".
- RFC2141: URN syntax, or the syntax of URIs that begin with 'urn:'.
- RFC3986: URI syntax, how to parse all URIs and URNs regardless of scheme.
- RFC4395: Guidelines and Registration Procedures for new URI schemes.
- IANA Scheme Registry: Existing registered URI schemes
- IANA URN-NID registry: Existing URN Namespace registrations