Tuesday, November 30, 2004
I'm hiring again at OSAF -- this position is for our very first server developer. I'm looking for somebody who can lead the charge, crank out code, and make the server do us proud.
Monday, November 29, 2004
Data modeling is hard. Some loosely correlated thoughts and links:
The "relatively flat" observation seems to hold at least some validity in data formats, programs and even books. Experienced programmers, with the help of good indenting, can see quickly that they're within an 'else' statement inside a loop inside another loop inside an 'if' statement, but even experienced programmers screw this up sometimes (and even more experienced programmers flatten out the code by delegating some reasonable piece off to another method). Books are better if there's no more than three (maybe four) layers -- chapter, section, sub-section, and even this much organization requires human-readable text to link from one section to another and summarize what a bunch of sections are going to say.
- Model-driven architecture is rigid, at least with the tools as we know them today.
- RDF has a simple basic model but leads to very complex structures, as Adam Bosworth explains.
- Pictures express complex relationships relatively readably, like the picture in this paper. Unfortunately we need to translate the pictures into text in order to use these in software and network protocols.
- The more complex your picture is, the more unreadable your text is.
- Text has to be relatively flat to be readable.
- References in data formats are like "goto" jumps in programming -- you lose context.
- Maybe if data modelers put a little more thought into flattening their models we'd find them easier to use? This may make the models seem less "rich" but "KISS" is good too.
The "relatively flat" observation seems to hold at least some validity in data formats, programs and even books. Experienced programmers, with the help of good indenting, can see quickly that they're within an 'else' statement inside a loop inside another loop inside an 'if' statement, but even experienced programmers screw this up sometimes (and even more experienced programmers flatten out the code by delegating some reasonable piece off to another method). Books are better if there's no more than three (maybe four) layers -- chapter, section, sub-section, and even this much organization requires human-readable text to link from one section to another and summarize what a bunch of sections are going to say.
Thursday, November 25, 2004
There's a higher quality of homeless people in Palo Alto. I just saw a hand-lettered sign outside of Whole Foods, explaining that the homeless of Palo Alto need food donated in the holiday season (and a very shifty-looking guy collecting food on their behalf). Among the list of requested foods was "organic turkey". No cheap turkey for the Palo Alto homeless, please!
Tuesday, November 16, 2004
Friday, November 12, 2004
A while back I posted on honesty in journalistic bias. This week TechCentralStation has a longer essay about why we might see more openness around bias and why that's a fine thing.
Being aware of bias is something I agree with, but I do worry about blinkered views of the world. Too many people reading some highly biased source will simply not read any opposing source, or do so with only mockery in mind. We've got plenty of polarization, thank you. So my preferred model is journalists who say "Here is my natural bias, and here is me being as unbiased as I can be in covering this topic, through rigorous reasoning and discourse with others who disagree with me."
Sunday, November 07, 2004
Writing protocol standards is hard work, harder than writing specifications, although they are similar tasks. One of the reasons is that you have to describe the protocol in sufficient detail that somebody who wasn't involved in the process and has different software experience (different features, different user interactions, different architecture, different platform or different programming language) can still implement the standard and interoperate with other implementations. (Actually, it's so hard to do this that no standard gets it "right". At the IETF we're well aware that we do successive approximations, first doing internet-drafts and then doing RFCs at different stages of maturity.) But we can at least try to do it right, and a proper effort requires a lot of work, including:
- A description of the model
- Implementation requirements
- Examples of protocol usage
- Definitions/schemas
The model
The model is key for first-time readers and for people who need to know something shallow about the protocol. There are different kinds of models that are important for protocols, and some of them are described (and examples given) in one of Ekr's works-in-progress:
- The protocol messaging model. Do messages have headers and bodies, or do they have XML element containers? Does the server respond to messages in the same connection? In a fixed order? Can the server originate messages?
- The protocol state machine. Are there different states (e.g. pre-handshake, pre-authentication, and main state)? (A small sketch of this idea follows this list.)
- The protocol's data model. What data types are there and what relationship do they have to each other -- folders and messages and flags (IMAP), or collections, resources and properties (WebDAV)?
- The addressing model, which is almost part of the data model. In SIMPLE you can address other people whereas in XMPP you can address not only human actors but specific software instances running on behalf of those humans. And not to be speciesist, non-humans as well.
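Here's a minimal Java sketch of the state-machine idea from the list above. The states are loosely modeled on IMAP's not-authenticated / authenticated / selected states, but the class and method names are my own invention, not anything from a real implementation:

    // Toy protocol state machine, loosely modeled on IMAP's session states.
    enum State { NOT_AUTHENTICATED, AUTHENTICATED, SELECTED, LOGGED_OUT }

    final class Session {
        private State state = State.NOT_AUTHENTICATED;

        void login(String user, String password) {
            require(state == State.NOT_AUTHENTICATED, "LOGIN is only valid before authentication");
            // credential checking omitted in this sketch
            state = State.AUTHENTICATED;
        }

        void select(String mailbox) {
            require(state == State.AUTHENTICATED || state == State.SELECTED,
                    "SELECT requires an authenticated session");
            state = State.SELECTED;
        }

        void logout() {
            state = State.LOGGED_OUT; // legal from any state
        }

        private static void require(boolean ok, String message) {
            if (!ok) throw new IllegalStateException(message);
        }
    }

A reader who has this picture in mind can guess, before reading any requirement, that a SELECT issued before LOGIN has to fail somehow; the requirements then pin down exactly how.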
The model is important not just for first-time readers and shallow users but also later on for deep users who want to extend the protocol. HTTP has been extended in many ways by people unfamiliar with the way the model is supposed to work. For example, HTTP normally uses the Content-Type header to declare the type of the message body, just as one would expect from a concept borrowed from MIME and a messaging system. However, one extension to HTTP (now part of HTTP 1.1, RFC 2616) breaks that model by applying an encoding to the body, and that encoding is specified in a different header. So if that feature is used, the Content-Type no longer strictly works that way. RFC 3229 moves further away from the MIME-like model as it extends HTTP -- it defines an alternative model, where the Content-Type refers to the type of the resource that is addressed. So now, of course, there's a schism in the HTTP community about which model to proceed with, to the point of having academic papers written about the alternatives. More clarity about the model in the first place would have helped not only first-time readers of the HTTP spec but might also have led to fewer problems with these extensions.
Finally, a clear model helps implementors remember and understand each of the requirements. Humans have trouble fitting a bald list of requirements into some memorable pattern, so give implementors a mental model (or several) and they'll do so much faster, with less confusion and fewer mistakes.
Requirements
The requirements are deeply important, as much so as the model. At the IETF we place so much importance on the wording of requirements that we have a whole standard, RFC 2119, describing how to word them. Why?
First, models can be interpreted differently by different people. This can happen very easily. IMAPv4 was originally defined in RFC 1730, and there was a lot of text about the model, particularly the different states. However, a lot of people implemented the details differently, and RFC 2060 had to get more specific. Finally, RFC 3501 revised RFC 2060, and most of the changes made in RFC 3501 were simply clarifying what the consequences of the model were for various cases -- because implementors made different assumptions, came to different conclusions, and argued persistently about the validity of their incompatible conclusions. Chris Newman explained this to me today when the topic of models + requirements came up, and he should know -- he authored/edited RFC 3501.
Second, a model explains how things fit together, whereas requirements explain what an implementation must do. Implementors are human and operating under different pressures, so it is easy for them to read a lot of flexibility into the model and the examples. Clients want to believe that servers will do things similarly (it makes their logic easier) so they tend to assume that is the case. So when things are flexible, they must be explained to be so, to encourage client implementors to account for differences. For example, RFC 3501 says:
"Server implementations are permitted to "hide" otherwise accessible mailboxes from the wildcard characters, by preventing certain characters or names from matching a wildcard in certain situations."When things aren't flexible, the document needs to say so so that implementors aren't given any wiggle room or room for confusion. In RFC3501 we see
The STATUS command MUST NOT be used as a "check for new messages in the selected mailbox" operation.
This text is much stronger than saying that the "STATUS command requests the status of the indicated mailbox" (that sentence is also in RFC 3501). It's even stronger than saying that the STATUS command isn't intended as a way to check for new messages. (It might be even clearer to say that "client implementations MUST NOT use the STATUS command..." but this is good enough.) IETF standards-writers and implementors have learned painfully that they need to use well-defined terms in attention-getting ALL CAPS in order to keep implementors from misunderstanding, wilfully or accidentally, whether something is a requirement.
A few more reasons why requirements are needed:
- Requirements often add more detail than the model should hold. Since the model should be high-level and readably concise, it can't be expected to define all behaviors.
- Sometimes requirements are examples of the conclusions that somebody would draw if they fully understood the model and all its implications. These have to be complete, however, not only selected examples, because no two people have the same full understanding of the model and all its implications. The requirements help people go back to the model and understand it the same way.
- Human readers need repetition in order to understand things. Sometimes the requirements restate the model in a different form, and that's fine. When essay writers want their audience to understand, they say what they're going to say, say it, then say what they said. We can make our standards more interoperable by balancing that approach against our typical engineering love of elegance through avoiding redundancy. Humans aren't computers, so the engineering avoidance of redundancy in code isn't fully applicable to human-readable text.
Examples
Examples are, thankfully, better understood. It's pretty rare to see a protocol go to RFC without a few good examples. Readers expect and demand them (more so than the model or requirements) because we know from reading many kinds of technical documents how useful examples are. I hope I don't need to justify this too much; in fact I find I need to do the opposite and remind people that examples do not replace requirements or models. Implementors need examples to understand the requirements and models, but they can easily draw conclusions from examples that are counter to the requirements and don't fit the model. When a specification has an inconsistency between a requirement and an example, trust most developers to implement to match the example, not the requirement.
Definitions/Schemas
Definitions and schemas also tend not to need much justification in a techie crowd. We're attracted by the idea of having absolute certainty about what's valid by trusting a program to compare an example to a definition or schema and validate it. So once again, I have a caveat to offer rather than a justification: make sure that definitions or schemas are put in context very carefully. Can an implementor use the schema to validate incoming XML and reject anything that doesn't match the schema? Probably not, or else it would be impossible to extend the protocol. Early WebDAV implementors built XML schema validators into their servers and rejected client requests that extended the protocol in minor ways that should have been compatible, so I'm taking this lesson from actual experience.
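As a sketch of that "don't reject what you don't recognize" lesson, here's a small Java program that parses a made-up, WebDAV-flavored request body and processes the element it knows while ignoring an unknown extension element, rather than running the whole document through a strict schema validator. The element names are simplified and unnamespaced for illustration -- real WebDAV uses the DAV: namespace:

    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;
    import java.io.ByteArrayInputStream;

    public class LenientParser {
        public static void main(String[] args) throws Exception {
            // A made-up request body: one element we understand, one extension we don't.
            String body = "<propfind><prop><displayname/><newfangled-extension/></prop></propfind>";
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(body.getBytes("UTF-8")));
            NodeList children = doc.getElementsByTagName("prop").item(0).getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node n = children.item(i);
                if (n.getNodeType() != Node.ELEMENT_NODE) continue;
                if ("displayname".equals(n.getNodeName())) {
                    System.out.println("known property requested: displayname");
                } else {
                    // Unknown extension element: ignore it instead of rejecting the request.
                    System.out.println("ignoring unrecognized element: " + n.getNodeName());
                }
            }
        }
    }

A strict validator would have rejected the whole request because of the extension element; the lenient reader answers the part it understands.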
I certainly can't say that when I'm a protocol author, I succeed in doing all of this. But after eight years reviewing and implementing good and bad protocol specifications, I'm beginning to see what works.
Comments welcome.
Thursday, November 04, 2004
Today I managed to explain (better than I've ever explained before) a few principles in the design of a network system. I use a client/server network system, although you can generalize this to P2P easily. This is the diagram I drew on the whiteboard.
If it's hard to grok this completely abstractly, an IMAP client and server are a good pair to mentally plug in. There are so many different IMAP clients and servers, and they all have different APIs, storage models, and internal data models. By "data model" I mean data structures or object models, including caching and relationships between things. So if your Java code instantiates a MailMessage object which has a link to an EmailAddress instance for the 'From' field, that's all part of the internal model. The protocol's data model is similar: in IMAP there are folders and mail messages, mail messages have headers, one of which is the From header, and so on.
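Here's a tiny Java sketch of that split between the protocol's data model and an internal one. Every name in it (MailMessage, EmailAddress, MessageTranslator) is hypothetical, and the address parsing is deliberately naive -- the point is only that protocol-shaped data (header lines) gets translated into whatever objects the client actually wants to work with:

    import java.util.Map;

    // Hypothetical internal model: shaped by what this client needs, not by the protocol.
    final class EmailAddress {
        final String displayName;
        final String addrSpec;
        EmailAddress(String displayName, String addrSpec) {
            this.displayName = displayName;
            this.addrSpec = addrSpec;
        }
    }

    final class MailMessage {
        final EmailAddress from;
        final String subject;
        boolean cachedLocally; // internal-only concern; nothing in the protocol maps to this
        MailMessage(EmailAddress from, String subject) {
            this.from = from;
            this.subject = subject;
        }
    }

    // Translation from the protocol's data model (header lines) to the internal one.
    final class MessageTranslator {
        static MailMessage fromHeaders(Map<String, String> headers) {
            String rawFrom = headers.getOrDefault("From", "");
            // Deliberately naive address parsing, just for illustration.
            int lt = rawFrom.indexOf('<');
            int gt = rawFrom.indexOf('>');
            EmailAddress from = (lt >= 0 && gt > lt)
                    ? new EmailAddress(rawFrom.substring(0, lt).trim(), rawFrom.substring(lt + 1, gt))
                    : new EmailAddress("", rawFrom.trim());
            return new MailMessage(from, headers.getOrDefault("Subject", ""));
        }
    }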
So I intended this diagram to convey a whole lot of stuff.
The protocol is the most inflexible part of this system. If you've got any interoperability at all with your protocol, even between unsynchronized releases of client and server, then your protocol is the most fixed element in the system. People constantly use new clients to talk to old servers, and old clients to talk to new servers, which means that even when new clients talk to new servers you're likely using an old protocol. Since your protocol is the hardest thing to change, both its syntax and its data model, it had better be extensible, basic, and able to support many usage models.
The internal data model is the most flexible part of this system. APIs and protocols must continue to be supported across releases. Storage formats are easier to change than APIs but often require upgrade steps. Thus, the internal model is actually the easiest thing to change. Doesn't that mean that it's less important to get it right, because it can be tweaked and refactored to do new things or benefit from new knowledge? Yet many architects focus deeply on the internal model, spending much more time getting it right than on the API or the protocol.
Client, server and protocol data models and content models diverge. Many architects design a networked system that starts with the same data model on the client and server, and thus naturally they want the same data model expressed in the protocol. But these diverge naturally, sometimes even before Server 1.0 and Client 1.0 ship. For example, the implementors of Server 1.0 discover that they need to cache certain strings together for scalability, and subtly the data model on the server begins to change. Be aware from the beginning that this will happen. It's not a bad thing. It may even be a good thing, allowing the server to be fast and the client to innovate on features.
Practice information hiding at the dotted lines. These are the places to really focus on modularization. Many software developers already understand that your API shouldn't be tied directly through to your storage model, and this principle can easily be extended to the protocol modules. I've written bad code that tied too much to the protocol, so I'm guilty of that one myself. Unless there's a good reason, the protocol implementation shouldn't be tied directly to the storage model (the implementation should instead retain the freedom to change how things are stored without rewriting everything). It might not be so bad to tie the protocol to the API, i.e., by implementing the protocol logic using only the API. That way, any internal changes that leave the API unchanged also leave the protocol unchanged. But that isn't always the best choice -- sometimes the protocol support needs access to internals, and you don't want to complicate the API too much just to make the protocol fast.
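A minimal Java sketch of that layering, with hypothetical names (MailStore, ListCommandHandler) and a made-up response format: the protocol handler talks only to the API interface, so the storage layer underneath can change without the protocol code noticing.

    import java.util.List;

    // Hypothetical API layer: the only surface the protocol code is allowed to see.
    interface MailStore {
        List<String> listFolders();
        int countMessages(String folder);
    }

    // Protocol handler written against the API, not against tables or files.
    final class ListCommandHandler {
        private final MailStore store;
        ListCommandHandler(MailStore store) { this.store = store; }

        String handle() {
            StringBuilder response = new StringBuilder();
            for (String folder : store.listFolders()) {
                response.append("* LIST ").append(folder)
                        .append(" (").append(store.countMessages(folder)).append(" messages)\r\n");
            }
            return response.toString();
        }
    }
    // The storage layer can move from flat files to a database without the protocol
    // code changing, as long as something still implements MailStore.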
Corollary: Use the best technology and architecture choice for each component independently. Because your client model will diverge from your protocol model, and that one from the server model, data-model consistency is not a good reason to use the exact same table structure or even the same database software on the client and server. (There may be other good reasons, like expertise.) Don't try to create the same indexes; the client and the server data access patterns will also diverge, if they're even the same to begin with. Don't try to recreate the same caches. Send your server and client teams to different countries to work, maybe! That way the protocol becomes one of the most important ways the client and server teams communicate, and they can make fewer hidden assumptions about how the code on the other side works (but they will make some anyway, which will bite you in the ass).
Standard protocols and proprietary protocols aren't much different. If the protocol data model and the client and server data models naturally diverge, then even if your system starts out with highly similar models by implementing a proprietary protocol, that advantage erodes and becomes a disadvantage, hindering extensibility. OTOH if you start out implementing a standard protocol and enforcing good separation between the data models, that's a good long-term strategy. You know from the start that there will be translation between the data models -- every protocol message that comes in will have to result in objects or data structures being instantiated in the internal format, and every protocol message that goes out is a transformation from internal objects or data structures. So that translation layer is solid from the beginning. Furthermore, if the system is using a proven protocol, the extensibility and performance features are likely to be better than what one can easily design from scratch.
Protocol syntax isn't very important as long as it's extensible. Translating between models that are different is harder than translating between different syntaxes. It's like translating a business course from American into Chinese -- the language is the easy part, the culture and environment are so different that you can easily mean something you didn't intend to mean. So it's not the end of the world if the syntax is header fields or XML documents, as long as there's a clear way to extend either one. The extensibility is key so that as the clients and servers evolve they're not totally hamstrung by an inflexible protocol.
Whew. That's asking a lot of a l'il ol' whiteboard sketch. Comments welcome.
I have to say, I love a good Fisking, or to Canadianize that, a Frumming. On TCS, Radley Balko takes on David Frum's National Review column on obesity and taxes. It's a good read in its entirety, but I thought it would be fun to summarize anyway, to show how each argument was demolished.
Frum argues:
- Canadians are less obese than Americans
- Portion sizes are smaller in Canada than in US.
- It's because Canadians are less wealthy that portion sizes are smaller.
- Smaller portions lead to less obesity.
- Obesity leads to health care costs.
- Making sodas more expensive (by taxation) will cause lower consumption of sodas (conclusion: also reduce obesity, also reduce health care costs).
Balko counters each point:
- Canadians are similarly obese to Americans, and Frum's evidence was only anecdotal.
- Portion sizes are similar and Frum's evidence was only anecdotal.
- Since portion sizes aren't smaller in Canada, wealth isn't a factor in portion sizes (at least the wealth difference between CA/US doesn't matter to that). Also note that total consumption of caloric sodas has been steady for decades as Canadians have gotten significantly richer (and soda cheaper).
- This one requires more data to completely demolish, but the evidence that total consumption of caloric sodas has been steady for decades does cast doubt on the idea that smaller cans of sodas will reduce consumption.
- There's more evidence that poor fitness (sedentary lifestyles) has a much greater health care cost than obesity.
- Such a small increase in price of soda is unlikely to change consumption, given that consumption has been steady for decades as soda production has gotten cheaper and people richer.