Thursday, November 04, 2004

Today I managed to explain (better than I've ever explained before) a few principles in the design of a network system. I used a client/server system as the example, although you can easily generalize this to P2P. This is the diagram I drew on the whiteboard.



If it's hard to grok this completely abstractly, an IMAP client and server are good ones to plug in mentally. There are so many different IMAP clients and servers, and they all have different APIs, storage models, and internal data models. By "data model" I mean data structures or object models, including caching and relationships between things. So if your Java code instantiates a MailMessage object which has a link to an EmailAddress instance for the 'From' field, that's all part of the internal model. The protocol's data model is similar: in IMAP there are folders and mail messages, mail messages have headers, one of which is the From header, and so on.
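
To make "internal model" concrete, here is a minimal sketch, written in Python rather than Java just to keep it short. MailMessage and EmailAddress are the names used above; AddressCache and its intern method are hypothetical names invented for this illustration, not taken from any real mail client.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class EmailAddress:
    """Internal object for one address (illustrative only)."""
    display_name: str
    address: str


@dataclass
class MailMessage:
    """Internal message object that links to an EmailAddress for 'From'."""
    subject: str
    sender: EmailAddress


class AddressCache:
    """Hypothetical internal cache: one shared EmailAddress per address."""

    def __init__(self) -> None:
        self._by_address: Dict[str, EmailAddress] = {}

    def intern(self, display_name: str, address: str) -> EmailAddress:
        # Assumption for this sketch: addresses compare case-insensitively.
        key = address.lower()
        if key not in self._by_address:
            self._by_address[key] = EmailAddress(display_name, address)
        return self._by_address[key]


cache = AddressCache()
first = MailMessage("Hello", cache.intern("Ann", "ann@example.com"))
second = MailMessage("Re: Hello", cache.intern("Ann", "ANN@example.com"))

# Both messages point at the very same object. That sharing is a caching and
# relationship decision of this program only; nothing on the wire sees it.
shares_sender = first.sender is second.sender
```

The protocol only ever says "this message has a From header"; whether two messages share one sender object is purely an internal choice, and a different client could make a different one.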

So I intended this diagram to convey a whole lot of stuff.

The protocol is the most inflexible part of this system. If you've got any interoperability at all with your protocol, even between unsynchronized releases of client and server, then your protocol is the most fixed element in the system. People constantly use new clients to talk to old servers, and old clients to talk to new servers, which means that even when new clients talk to new servers you're likely using an old protocol. Since your protocol, both its syntax and its data model, is the hardest thing to change, it had better be extensible and basic, and it had better support many usage models.
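
One way a new client copes with an old server is to ask what the server supports and fall back when a feature is missing. The sketch below is loosely modeled on IMAP's capability advertisement, with simplified parsing; the function names and the "idle"/"poll" strategy labels are assumptions made for illustration.

```python
from typing import Set


def parse_capabilities(line: str) -> Set[str]:
    """Parse a line such as '* CAPABILITY IMAP4rev1 IDLE' into a set of names."""
    parts = line.split()
    if parts[:2] != ["*", "CAPABILITY"]:
        raise ValueError("not a capability line: %r" % line)
    return {name.upper() for name in parts[2:]}


def choose_wait_strategy(server_caps: Set[str]) -> str:
    """Prefer the newer feature, but keep working against an older server."""
    return "idle" if "IDLE" in server_caps else "poll"


new_server = parse_capabilities("* CAPABILITY IMAP4rev1 IDLE")
old_server = parse_capabilities("* CAPABILITY IMAP4rev1")

strategy_with_new = choose_wait_strategy(new_server)
strategy_with_old = choose_wait_strategy(old_server)
```

The same client binary talks to both servers; only the set of features it uses changes.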

The internal data model is the most flexible part of this system. APIs and protocols must continue to be supported across releases. Storage formats are easier to change than APIs but often require upgrade steps. Thus, the internal model is actually the easiest thing to change. Doesn't that mean that it's less important to get that right, because it can be tweaked and refactored to do new things or benefit from new knowledge? Yet many architects focus deeply on the internal model, spending much more time getting it right than the API or the protocol.

Client, server and protocol data models and content models diverge. Many architects design a networked system that starts with the same data model on the client and server and thus naturally they want the same data model expressed in the protocol. But these diverge naturally, sometimes even before Server 1.0 and Client 1.0 ship. For example the implementors of Server 1.0 discover that they need to cache certain strings together for scalability and subtly the data model on the server begins to change. Be aware from the beginning that this will happen. It's not a bad thing. It may even be a good thing to allow the server to be fast and the client to innovate on features.

Practice information hiding at the dotted lines. These are the places to really focus on modularization. Many software developers already understand that your API shouldn't be tied directly through to your storage model, and this principle can easily be extended to the protocol modules. I've written bad code that tied too much to the protocol, so I'm guilty of that one myself. Unless there's a good reason, the protocol implementation shouldn't be tied directly to the storage model (the implementation should instead retain the freedom to change how things are stored without rewriting everything). It might not be so bad to tie the protocol to the API, i.e., by implementing the protocol logic using only the API. That way, any internal changes that leave the API unchanged also leave the protocol unchanged. But that isn't always the best choice: sometimes the protocol support needs access to internals, and you don't want to complicate the API too much just to make the protocol fast.
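
As a rough sketch of "implement the protocol logic using only the API": below, the protocol handler sees nothing but a small interface, and two different storage layouts sit behind it. All of the names (MailStore, InMemoryStore, PackedStore, handle_list_command) and the response format are made up for illustration; this is not IMAP's actual wire format.

```python
from typing import Dict, List, Protocol, Tuple


class MailStore(Protocol):
    """The API boundary: the only thing the protocol layer may call."""

    def list_subjects(self, folder: str) -> List[str]: ...


class InMemoryStore:
    """Storage layout A: a dict mapping folder name to subjects."""

    def __init__(self, folders: Dict[str, List[str]]) -> None:
        self._folders = folders

    def list_subjects(self, folder: str) -> List[str]:
        return list(self._folders.get(folder, []))


class PackedStore:
    """Storage layout B: one flat list of (folder, subject) rows."""

    def __init__(self, rows: List[Tuple[str, str]]) -> None:
        self._rows = rows

    def list_subjects(self, folder: str) -> List[str]:
        return [subject for f, subject in self._rows if f == folder]


def handle_list_command(store: MailStore, folder: str) -> str:
    """Protocol logic: builds a response using only the API above."""
    subjects = store.list_subjects(folder)
    lines = ["%d %s" % (i, s) for i, s in enumerate(subjects, start=1)]
    return "\r\n".join(lines + ["OK"])


response_a = handle_list_command(
    InMemoryStore({"INBOX": ["Hi", "Lunch"]}), "INBOX")
response_b = handle_list_command(
    PackedStore([("INBOX", "Hi"), ("Sent", "Bye"), ("INBOX", "Lunch")]), "INBOX")
```

Swapping the storage layout leaves the bytes on the wire untouched, which is the freedom described above. The cost is also visible here: if the protocol needed something list_subjects can't provide efficiently, you'd have to either widen the API or reach past it.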

Corollary: Use the best technology and architecture choice for each component independently. Because your client model will diverge from your protocol model, and that one from the server model, data model consistency is not a good reason to use the exact same table structure or even the same database software on the client and server. (There may be other good reasons, like expertise.) Don't try to create the same indexes; the client and the server data access patterns will also diverge, if they're even the same to begin with. Don't try to recreate the same caches. Send your server and client teams to different countries to work, maybe! That way the protocol becomes one of the most important ways the client and server teams communicate, and they can make fewer hidden assumptions about how the code on the other side works (but they will make some anyway, which will bite you in the ass).

Standard protocols and proprietary protocols aren't much different. If the protocol data model and the client and server data models naturally diverge, then even if your system starts out with highly similar models by implementing a proprietary protocol, that advantage erodes and becomes a disadvantage, hindering extensibility. OTOH if you start out implementing a standard protocol and enforcing good separation between the data models, this is a good long-term strategy. You know from the start that there will be translation between the data models: every protocol message that comes in will have to result in objects or data structures being instantiated in the internal format, and every protocol message that goes out is a transformation from internal objects or data structures. So that translation layer is solid from the beginning. Furthermore, if the system is using a proven protocol, the extensibility and performance features are likely to be better than one can easily design from scratch.
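
The translation layer can be sketched in a few lines. Here the "protocol message" is simplified to a dict of header fields, and Sender, Message, from_wire and to_wire are hypothetical names chosen for this example; parseaddr and formataddr come from Python's standard email.utils module.

```python
from dataclasses import dataclass
from email.utils import formataddr, parseaddr
from typing import Dict


@dataclass
class Sender:
    name: str
    address: str


@dataclass
class Message:
    sender: Sender
    subject: str
    unread: bool = True  # internal-only state; never sent by to_wire below


def from_wire(headers: Dict[str, str]) -> Message:
    """Incoming protocol message -> freshly instantiated internal objects."""
    name, address = parseaddr(headers.get("From", ""))
    return Message(Sender(name, address), headers.get("Subject", ""))


def to_wire(message: Message) -> Dict[str, str]:
    """Internal objects -> outgoing protocol message."""
    return {
        "From": formataddr((message.sender.name, message.sender.address)),
        "Subject": message.subject,
    }


incoming = {"From": "Ann <ann@example.com>", "Subject": "Hi"}
internal = from_wire(incoming)
outgoing = to_wire(internal)
```

Nothing outside these two functions needs to know what the wire looks like, so the internal Message can grow fields like unread without the protocol noticing.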

Protocol syntax isn't very important as long as it's extensible. Translating between models that are different is harder than translating between different syntaxes. It's like translating a business course from American into Chinese -- the language is the easy part, the culture and environment are so different that you can easily mean something you didn't intend to mean. So it's not the end of the world if the syntax is header fields or XML documents, as long as there's a clear way to extend either one. The extensibility is key so that as the clients and servers evolve they're not totally hamstrung by an inflexible protocol.
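
For a header-field syntax, "a clear way to extend" can be as simple as agreeing that unknown fields are kept or skipped rather than rejected. A minimal sketch, assuming a "Name: value" line format; KNOWN_FIELDS and parse_fields are hypothetical names, and the X-Priority field is just a stand-in for something a newer peer might send.

```python
from typing import Dict, Iterable, Tuple

# What this (older) implementation understands; an assumption for the sketch.
KNOWN_FIELDS = {"from", "subject"}


def parse_fields(lines: Iterable[str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split 'Name: value' lines into understood fields and extensions.

    Unknown field names are not an error; they are set aside so an older
    implementation keeps working when a newer peer sends more fields.
    """
    known: Dict[str, str] = {}
    extensions: Dict[str, str] = {}
    for line in lines:
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError("malformed field: %r" % line)
        key = name.strip().lower()
        target = known if key in KNOWN_FIELDS else extensions
        target[key] = value.strip()
    return known, extensions


known, extensions = parse_fields([
    "From: ann@example.com",
    "Subject: Hi",
    "X-Priority: high",  # sent by a newer peer; this parser has never seen it
])
```

The same rule works for XML (ignore elements you don't recognize), which is why the choice of syntax matters less than having the rule at all.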

Whew. That's asking a lot of a l'il ol' whiteboard sketch. Comments welcome.

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License.