Monday, June 18, 2007

I don't like XML! But what are the alternatives?

I think XML is one of the worst formats for data. It is extremely ambiguous. There are so many ways to put a simple data structure into XML. For example:
<book 
title="The Return of the King"
author="J.R.R. Tolkien"/>
or
<book>
<title>The Return of the King</title>
<author>J.R.R. Tolkien</author>
</book>
And in the second case, can there be more than one author? And more titles?
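To make the point concrete, here is a minimal Python sketch of a reader that has to accept both encodings. The function name `read_book` and the decision to return every field as a list are assumptions made for illustration, not a recommended mapping:

```python
import xml.etree.ElementTree as ET

ATTRIBUTE_FORM = '<book title="The Return of the King" author="J.R.R. Tolkien"/>'
ELEMENT_FORM = (
    "<book>"
    "<title>The Return of the King</title>"
    "<author>J.R.R. Tolkien</author>"
    "</book>"
)

def read_book(xml_text):
    """Return {'title': [...], 'author': [...]} for either encoding.

    Every field comes back as a list, because the element form
    cannot rule out repeated <title> or <author> children.
    """
    root = ET.fromstring(xml_text)
    book = {}
    for field in ("title", "author"):
        values = []
        # Encoding 1: the field is an attribute (at most one value).
        if field in root.attrib:
            values.append(root.attrib[field])
        # Encoding 2: the field is a child element (any number of values).
        values.extend(child.text for child in root.findall(field))
        book[field] = values
    return book

print(read_book(ATTRIBUTE_FORM) == read_book(ELEMENT_FORM))
```

The reader needs two code paths per field, and it still cannot tell from the data whether a second author would be legal.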

XML is not self-describing. Look at the XML below. It is very ambiguous unless you use one of the many schema descriptions (DTD, XML Schema, XMI, DSD, ...).
<data
x="null"
y="true"
z="42">
<a>NULL</a>
<a>false</a>
<b>TRUE</b>
</data>
What is a string? What is a boolean? What is a number? Is x a list? You simply can't infer it from the XML. And when you want to write the data, which fields are written as tags and which as attributes?
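A small Python sketch shows the difference. The JSON document is only one guess at what the XML above might mean (that guess is exactly the problem); the variable names are made up for illustration:

```python
import json
import xml.etree.ElementTree as ET

XML_TEXT = '<data x="null" y="true" z="42"><a>NULL</a><a>false</a><b>TRUE</b></data>'
# Assumed JSON equivalent; the XML itself does not tell us this is the intent.
JSON_TEXT = '{"x": null, "y": true, "z": 42, "a": [null, false], "b": true}'

root = ET.fromstring(XML_TEXT)
# Attribute values and element text both arrive as plain strings.
xml_types = {name: type(value).__name__ for name, value in root.attrib.items()}
xml_types.update({child.tag: type(child.text).__name__ for child in root})

# The JSON parser hands back typed values without any schema.
json_types = {name: type(value).__name__ for name, value in json.loads(JSON_TEXT).items()}

print(xml_types)
print(json_types)
```

With the XML parser every value is a `str`; with the JSON parser the null, the booleans, the number and the list are already distinguished.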

There is also a huge problem with IDs and references. From a plain XML file there's no way to figure out what the ID of an object is and what the references are. Good examples are plugin.xml files. They are full of IDs and references, but it is so hard to know which string refers to which other XML element. Control-click does not work. Why? Because references are difficult to resolve!

Are there any good alternatives?

JSON is much simpler, self-describing, less verbose and much better suited for data storage and exchange. But it has no notion of IDs and references. And it does not name object (record) types (it has only lists, maps, strings, numbers, booleans and null).

What else is out there? I want something like JSON + an ID/Reference model + named Records...
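One way to picture that wish is plain JSON plus a naming convention. The sketch below is only an illustration: the keys `"$type"`, `"$id"` and `"$ref"` and the helper names are hypothetical, and nothing in JSON itself gives them any meaning.

```python
import json

# Hypothetical convention, not part of JSON:
#   "$type" names the record type, "$id" declares an ID,
#   and {"$ref": "..."} points at the object with that ID.
DOCUMENT = """
{
  "books": [
    {"$type": "Book", "$id": "rotk", "title": "The Return of the King",
     "author": {"$ref": "tolkien"}}
  ],
  "authors": [
    {"$type": "Author", "$id": "tolkien", "name": "J.R.R. Tolkien"}
  ]
}
"""

def collect_ids(node, index):
    """Walk the parsed JSON and record every object that declares a "$id"."""
    if isinstance(node, dict):
        if "$id" in node:
            index[node["$id"]] = node
        for value in node.values():
            collect_ids(value, index)
    elif isinstance(node, list):
        for item in node:
            collect_ids(item, index)
    return index

def resolve_refs(node, index):
    """Replace every {"$ref": id} in place with the object it names."""
    if isinstance(node, dict):
        if set(node) == {"$ref"}:
            return index[node["$ref"]]  # KeyError signals a dangling reference
        for key, value in node.items():
            node[key] = resolve_refs(value, index)
    elif isinstance(node, list):
        node[:] = [resolve_refs(item, index) for item in node]
    return node

data = json.loads(DOCUMENT)
index = collect_ids(data, {})
resolve_refs(data, index)
book = data["books"][0]
print(book["$type"], "by", book["author"]["name"])
```

Because the convention lives outside the format, every reader and writer has to agree on it separately, which is why I would prefer a format that has IDs, references and record names built in.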

19 comments:

  1. What do you think of YAML?

    It is nearly a superset of JSON. I am sure that it has internal references. I don't know about external ones, though.

  2. Maybe you should have a look at YAML: http://www.yaml.org/

  3. JSON is as self-describing as XML is; it's up to you to come up with good names:

    {"data": {"x": "null", "y": 42, "a": ["null"]}}

    Garbage in, garbage out. It doesn't matter what syntax you use for the garbage.

  4. Maybe UBF will meet your requirements.

    Cheers!

  5. Try YAML? Its "Anchor/Alias" mechanism works like ID/Reference. See http://www.yaml.org/start.html

  6. S-expressions (what Lisp source is made of) used as data are more readable than XML or JSON, and easier to write parsers for.

    (body
      (h1 "This is a heading")
      (p "This is a "
         (font :color "red" "normal paragraph ")
         "with some markup"))

    You can add your own idea of what an ID is if you like.

  7. Al,

    > {"data": {"x": "null", "y": 42, "a": ["null"]}}

    > Garbage in, garbage out.
    > It doesn't matter what syntax you use for the garbage.

    It's not the names, it's the ambiguity! With JSON I know that a is a list and y is a number. With XML I don't know this.

  8. UBF has two levels - A and B. A is just a transport mechanism. B includes support for contract checking. Also, UBF is easy to parse. You can find more about UBF here:

    http://www.sics.se/~joe/ubf/site/home.html

    Cheers!

  9. BTW, your so-called "XML" is not even well-formed.

    Try again when you know something about what you trash...

  10. Rebel1,

    You are right! I just fixed it :-D!

    Michael

  11. I am just reading Terence Parr's book on ANTLR. He agrees with you: "(I implore everyone to please stop using XML as a human interface!)". His solution: create a domain-specific language that is at least human-readable and write a translator for it (using ANTLR, of course :). But, again, it won't be self-describing. You'll still need a language spec. But at least you won't have to type in all the '<' | '>' s.

  12. Terence Parr just posted to the ANTLR list about an example config parser he built for a talk he's giving in Sydney.

    This could be just what you are looking for

    Fig

  13. Try regular JSON expressions:

    http://laurentszyster.be/jsonr/

    Regards,

  14. > It's not the names, it's the ambiguity! With JSON
    > I know that a is a list and y is a number. With
    > XML I don't know this.

    Actually, attributes are always text, and child elements are a mixture of text and nested elements. So you do know something about the types from the format.

    In any case, just knowing whether something is a list or an integer doesn't tell you much about the data itself; you've just deciphered the container format. It says nothing about the data itself, which was my point.

    Then again, neither XML nor JSON ever claimed to support your concept of human readability. It's just a structural format. You can choose to organise a JSON data structure as a single string with some kind of separator, like a CSV file. It wouldn't be good, but it would still be JSON.

    {"data": "1,2,3,4,5,6,7,a,b,2,x,6"}

    So, my advice: don't confuse the format of a structure with the choice of how to arrange that structure :-)

  15. Al,
    > Then again, neither XML nor JSON
    > ever claimed to support your
    > concept of human readability.
    > It's just a structural format.

    I did not talk about human readability! I am talking about machine readability. I am talking about getting a file and converting it into a simple in-memory representation of records with list, string, boolean and number attributes. And then converting this data structure back into a file.

    With a schema you can read XML, but there are so many schema formats, and an XML reader that can deal with all possible schemata gets *very* complex.

    I believe that the data should be self describing and you should be able to infer the schema from the data.

    The problem of references is not only a problem for humans, it's also a problem for programs.

  16. True, but when it comes to programs, few people should be trying to implement an XML parser themselves. The meat is to map the output of the parser into the objects you care about. The XML syntax itself and its handling is just gravy.

  17. This comment has been removed by the author.

  18. Nitin,

    But mapping the XML to data structures is the problem! Parsing the XML itself can be done by a library. But there are soooo many ways to map data into XML. In something like JSON, that mapping is very simple.

    Michael

  19. > I want something like JSON + an ID/Reference model + named Records...

    Try Harpoon:
    http://harpoon.sourceforge.net

    It has named records (tagged), as well as lists and tuples. It doesn't have an ID/reference model, but you can make your own by some convention. I plan to add metadata in the future, which will solve that issue.

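As a footnote to comment 6, the claim that S-expressions are easy to write parsers for can be sketched in a few lines of Python. This is a rough illustration under simplifying assumptions, not a full Lisp reader: strings have no escape sequences, malformed input is only partly detected, and the `("symbol", name)` representation is made up here.

```python
import re

# One token is "(", ")", a double-quoted string, or a bare symbol.
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')

def parse(text):
    """Parse one S-expression into nested Python lists.

    Quoted strings become str; bare symbols such as h1 or :color are
    kept as ("symbol", name) tuples so the two cannot be confused.
    """
    tokens = TOKEN.findall(text)
    position = 0

    def read():
        nonlocal position
        token = tokens[position]
        position += 1
        if token == "(":
            items = []
            while tokens[position] != ")":
                items.append(read())
            position += 1  # skip the closing ")"
            return items
        if token == ")":
            raise ValueError("unexpected ')'")
        if token.startswith('"'):
            return token[1:-1]  # strip the surrounding quotes
        return ("symbol", token)

    result = read()
    if position != len(tokens):
        raise ValueError("trailing input after expression")
    return result

tree = parse('(p "This is a " (font :color "red" "normal paragraph ") "with some markup")')
print(tree)
```

The parser tells strings from symbols and lists from atoms, but, as with the other formats discussed above, what an ID or a reference is remains a matter of convention.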