Open Source and Open Data

Last update: Nov 29, 1998

Because of the recent "Halloween Documents", a lot of attention has been paid lately to Microsoft's comments about "de-commoditizing protocols." Something that has been overlooked in the mess that's been stirred up is the fact that Microsoft has already effectively "de-commoditized" the data that nearly everyone uses on the desktop, and has been doing so for the past several years. Computers, applications and networks don't do anything without data to act upon. Microsoft has locked in a large percentage of the data on machines out there by continuing to develop closed, proprietary file formats that change with every consecutive version of their Microsoft Office suite.

What better way to beat Microsoft at their own game than to "commoditize" those file formats? By creating open standards for file formats, the Open Source community can take a huge step toward the idea of not just commoditizing the computer hardware, but the operating system as well 1. Make the applications the important part of the system. If all computers speak the same, open protocols and read the same, open format files, then the only thing that will differentiate systems in the marketplace will be the quality and feature-set of the applications they run. (Currently, a major factor in the "quality" of an office suite that doesn't come from Microsoft is whether or not it can read the latest files created by the Microsoft Office Flavor-of-the-Week. Anyone writing office applications today has to somehow comply with the de-facto standards set by Microsoft, who can raise the bar higher and higher with every new rev of their Microsoft Office suite, causing competitors to expend resources constantly just to be able to read and write the same files as "everyone else.")

By standardizing file formats, the Open Source community can also avoid "competing" against Microsoft and instead concentrate on taking and retaining the lead. If enough support can be brought behind standardized file formats (and it seems that a lot of major players such as Corel, Star Division, Applix and others are actively looking for new ways to compete against Microsoft) then perhaps it can be Microsoft who is forced into the position of "do you support that file format?" when people buy office software. Let Microsoft worry about "chasing the competition" while we "chase the dream" instead.

What Data?

The first step is to determine what the core applications are in a "complete" office suite. From looking around on the web at products like Microsoft Office, WordPerfect Office, Applixware, StarOffice and the various Open/Free Software projects like KOffice, Siag, GNOME and so forth, the following categories become distinct:

Word Processing, Spreadsheet

A word processor and spreadsheet are at the core of nearly every office suite. The quality and feature set of other components can be lacking as long as the word processor and spreadsheet are exceptional.

Presentation, Graphics

Presentation graphics and some sort of minimal vector graphics application are common. Graphing and charting components are often present as well, and their feature sets overlap with the presentation graphics, i.e. the ability to draw basic geometric primitives. (The ability to draw lines, boxes, circles, etc. translates into the ability to generate line graphs, bar and pie charts, etc.)

Database

Occasionally there is an application that has some ability to display and manipulate the data in a remote database, but this is usually the exception, rather than the rule.

From a very high-level viewpoint, it seems that the functionality that needs to be represented for basic office applications is the ability to store textual mark-up and display information, handle different number formats and understand formulas, understand basic geometric shapes and, if we want to go all the way, support some standard for database connectivity.

Reuse the Wheel

There are a number of standards available right now that can do a large portion of this. RTF 2 is a fairly well-established format for distributing textual data. HTML and Stylesheets are close to being able to effectively handle basic document display issues and may even be able to take advantage of XML and DOM or an "embeddable" language like Javascript or Perl to handle spreadsheet-style formulas. (I don't know as much about XML and DOM as I should, so I might be off-base with that idea.) Alternately, take a step up the evolutionary ladder and utilize SGML, the great granddaddy of HTML and XML and all the other *MLs. There are a number of page-formatting languages such as TeX that have a number of translators to the wonderful graphics-representation language, Postscript. Database connectivity is a protocol issue and we already have a number of those available, such as ODBC, or for more platform independence, JDBC or CORBA/IIOP. (Again, I'm not an expert with CORBA, IIOP, etc, so don't flame me for incorrect usage.)

It's been pointed out by a couple of people that Microsoft has started an initiative to have their Microsoft Office suite of products utilize XML (or XSL) as the standard file format, which could be A Good Thing. It was also pointed out that an important step in using XML as a standardized file format is to prevent Microsoft from controlling the visualization tools, i.e. to make sure Microsoft doesn't get a lock on the products that can actually use XML for anything useful. (Could be difficult as Microsoft is one of the prime developers/proponents of XML.) If they have a chance, they'll just add their own proprietary extensions to XML that are supported by Office, and only by Office. Embrace and Extend...

I also got EMail letting me know that Adobe is working on an XML-based vector file format, possibly something that could replace Postscript/PDF. This would be interesting to see. The same person made the point that if Microsoft and Adobe continue along their current lines of platform- and application-neutral file formats, they stand a great chance of losing one of the only advantages they have in the marketplace, namely their ability to lock customers into their application line by being the only application vendor who can read/write the files they already have. (This is not quite so much in the case of Adobe, as they have made specs available for at least their Photoshop PSD files and PDF.) Further, by not going through with their promises of XML-based file formats, they stand to lose face in the marketplace by breaking a promise, (nothing new to either company) which could also cost customers.

Another point that has been made in EMail is that it would be a great benefit to create Open Source, easily usable filters/converters for the current most commonly used file formats. If there were a library available that could easily read/write Word97 files, then it would be fairly simple to include this in whatever application you were writing as an import filter and use some other more-easily-read/write format as the "native" file format. (Some of the recent Open Source application works may have import filters for some document types like this. It would be worth a look to see how easily they could be ripped out of the code and turned into a generic library function or an external utility that applications could take advantage of by piping files through.)
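To make the filter idea a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the "toy-legacy" format, its "TOY1" header line, and the names IMPORT_FILTERS, import_filter and convert_stream are all hypothetical and do not correspond to any real file format or existing library. The point is only the shape: each import filter is a small function that turns foreign bytes into the application's "native" representation (plain UTF-8 text here), and a single stream-to-stream entry point lets the same code serve as a library call or as an external utility.

```python
import io

# Registry of import filters, keyed by a format name (hypothetical names).
IMPORT_FILTERS = {}

def import_filter(name):
    """Decorator that registers a reader function under a format name."""
    def register(func):
        IMPORT_FILTERS[name] = func
        return func
    return register

@import_filter("toy-legacy")
def read_toy_legacy(raw: bytes) -> str:
    """Read a made-up format: a 'TOY1' header line, then Latin-1 text
    with CRLF line endings. Not a real file format."""
    header, _, body = raw.partition(b"\r\n")
    if header != b"TOY1":
        raise ValueError("not a toy-legacy file")
    return body.decode("latin-1").replace("\r\n", "\n")

def convert_stream(fmt, src, dst):
    """Read all of src, run the filter registered for fmt, and write
    the result to dst as UTF-8 (the 'native' format in this sketch)."""
    try:
        reader = IMPORT_FILTERS[fmt]
    except KeyError:
        raise ValueError(f"no import filter for {fmt!r}") from None
    dst.write(reader(src.read()).encode("utf-8"))

# Library-style use, with in-memory streams standing in for files.
src = io.BytesIO(b"TOY1\r\nDear reader,\r\ncaf\xe9\r\n")
dst = io.BytesIO()
convert_stream("toy-legacy", src, dst)
```

Passing sys.stdin.buffer and sys.stdout.buffer to convert_stream instead of the in-memory streams would give the external utility that applications could pipe files through. A real Word97 reader would of course be vastly more work than the toy function above.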

The Advantages

The advantages of defining open file format standards are immense. Once methods have been written for reading and writing the file formats, Office application authors don't have to worry about implementing them themselves. Likewise, if effective objects can be created for display, then, say, an object with the ability to handle word processing can be easily embedded inside of a spreadsheet application and so forth.

Not the least advantage is, of course, that companies like Microsoft would no longer be able to corner the market in office application software simply by creating a de-facto standard for file formats that they, and only they, control. Office applications can once again be judged on quality, and not what file formats they are able to read/write.

(In my little Utopian world, I can even imagine one single file format that is extensible enough to contain word processing documents, spreadsheets, vector and bitmap graphics, and so forth... Write your thousand-page thesis, generate hundreds of graphs from the data you've tabulated in dozens of spreadsheets, scan a dozen photos that you've used to corroborate your theories, then save it all in one file. But then maybe I'm a hopeless dreamer...)

(Many people have sent EMail mentioning IFF (Interchange File Format) from the long-ago Amiga days. I remembered it myself after I posted the original document online. I've looked back over some specs on IFF and it seems to mirror a lot of the ideas I was thinking about (Repressed memories? :) regarding well-defined CHUNKs containing various forms of data.)
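For anyone who hasn't seen IFF, the chunk idea can be sketched in a few lines of Python. The layout follows the basic IFF convention as I understand it: a 4-byte chunk ID, a 32-bit big-endian length, the data, and a pad byte when the length is odd, with a FORM chunk wrapping a 4-byte form type plus nested chunks. This is a simplified sketch and not a full IFF implementation, and the "OFFC" form type and the "TEXT", "SHET" and "PICT" chunk IDs are made up for this example.

```python
import struct

def pack_chunk(chunk_id: bytes, data: bytes) -> bytes:
    """Pack one chunk: 4-byte ID, big-endian 32-bit length, data,
    plus a zero pad byte if the data length is odd."""
    if len(chunk_id) != 4:
        raise ValueError("chunk ID must be exactly 4 bytes")
    pad = b"\x00" if len(data) % 2 else b""
    return chunk_id + struct.pack(">I", len(data)) + data + pad

def iter_chunks(buf: bytes):
    """Yield (chunk_id, data) for each chunk laid end to end in buf."""
    pos = 0
    while pos + 8 <= len(buf):
        chunk_id = buf[pos:pos + 4]
        (length,) = struct.unpack(">I", buf[pos + 4:pos + 8])
        data = buf[pos + 8:pos + 8 + length]
        if len(data) != length:
            raise ValueError("truncated chunk")
        yield chunk_id, data
        pos += 8 + length + (length % 2)  # skip the pad byte if any

def pack_form(form_type: bytes, chunks) -> bytes:
    """Wrap a form type and a list of (id, data) chunks in a FORM chunk."""
    body = form_type + b"".join(pack_chunk(cid, d) for cid, d in chunks)
    return pack_chunk(b"FORM", body)

def read_form(buf: bytes):
    """Return (form_type, [(chunk_id, data), ...]) from a FORM container."""
    chunk_id, body = next(iter_chunks(buf))
    if chunk_id != b"FORM":
        raise ValueError("not a FORM container")
    return body[:4], list(iter_chunks(body[4:]))

# One file holding text, a spreadsheet cell and (fake) picture data.
doc = pack_form(b"OFFC", [
    (b"TEXT", b"Hello"),
    (b"SHET", b"A1=42"),
    (b"PICT", b"\x01\x02\x03\x04"),
])
form_type, chunks = read_form(doc)

# A reader that only understands text skips every chunk it doesn't know.
text_only = [data for cid, data in chunks if cid == b"TEXT"]
```

The last line is the part that matters for extensibility: because every chunk carries its own length, an application can step over chunks it has never heard of and still read the ones it does understand.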

The Disadvantages

I am very aware of the fact that this isn't a simple matter of "Oh, let's define some new standards" and next week, everyone is happy. It will be a lot of work to come up with standards that everyone is happy with and that are extensible enough so that application vendors can still add their own capabilities into the files and still have other vendors' applications able to read them.

Existing "standards" such as HTML already cause lots of controversy ("But HTML is a mark-up language, not a page-display language!!", "I know, let's make it a page-display language!", "ARRGGHH!!", etc.), and a push like this would certainly cause more.

Can it even be done? I don't know. I think it would be very cool if it could get accomplished, but after seeing what has happened with splits like BSD License/GNU General Public License/etc., KDE/GNOME, and others, even inside of the "Open Software" community, I suspect that maybe there is no way enough people can agree on something this sweeping for it ever to happen.

Conclusion

Obviously, there's no way I can examine every office application that's available out there and find out every file format they are capable of reading, but it looks like there are at least a few that are trying to conform to existing standards like RTF and use things like XML to extend their functionality inside the application. This is a good start.

I'm sure stuff like this has been proposed, and even implemented before. Postscript is a pretty good language for describing just about anything - same with TeX - but obviously something was lacking in those formats or they would have caught on quicker and been more prevalent. HTML is a great way of distributing textual information, but with very little ability to control appearance. Maybe with all the positive press that Open Software has been getting, it's time to try again?

If this goal could be accomplished, the advantages would be great on several fronts. It would not only be a massive technical achievement, but it would be an addition to the already impressive accomplishments of OSS: "See, not only can we produce good software and develop useful open protocols but we don't need to chase someone's taillights to innovate in other areas, either!"

It would be a way to raise the bar ourselves.

Author - Derek Glidden. Commentary and suggestions gratefully accepted. Flames will be gleefully ignored.

Footnotes

1. Re: commoditizing the OS: Since I wrote this document, a lot of people have used this particular phrase in the media, so now I guess I'll never figure out where I first heard it...

2. Many people have sent pointers to locations for finding more data on the RTF specs. I should clarify that I've only referred to RTF as a fairly "standard" word processor file format, not that I entirely recommend it or anything. I know that it was designed and originally produced by Microsoft, which isn't a big plus, in my and others' opinions.