Choosing among JSON, YAML, and TOML

To continue the discussion from the previous post: we want a folder structure standard, instead of HDF5, to store datasets temporarily for processing or permanently for sharing. To keep this folder structure approach flexible, we impose only minimal requirements on the folder itself and leave the finer definitions to a metadata file. So what is the best format for that metadata?

Basically, we want a hash table that maps keywords to values that are meaningful to the user or audience. A human should be able to read and understand the file, while it should not be too harsh for a computer to parse. The three options I consider are JSON, YAML, and the newcomer, TOML.

This post for the Hugo community did a good job of comparing these three formats and recommended TOML as the config standard. Coincidentally, the Rust community decided to use TOML as [their choice for config files](https://users.rust-lang.org/t/why-does-cargo-use-toml/3577) (in Cargo) as well. Digging a little further into the specifications of these formats, you will soon see that JSON is the most machine-friendly and TOML the most human-friendly, while YAML tries to do both, which makes it the most complex of the three. Since we only need the format for a very simple job, specifying a folder structure, simplicity matters most. Both JSON and TOML are sufficient for the job. However, editing JSON files by hand is less enjoyable than editing TOML. More importantly, the deal-breaker is that TOML allows comments and JSON does not. JSON is fast for machines, but we want users to understand the folder structure directly by reading (cat) the meta file. YAML can do both, yet it is overkill for such a simple job. The downside of TOML is that it is still young and its specification is still changing. Yet we have no intention of making a widely used, backward-compatible standard. Thus, flexibility wins over stability.
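To make the point about comments concrete, here is a minimal sketch that uses only Python's standard json module. The keys raw and processed are hypothetical and only stand in for whatever the meta file would describe; in TOML the same note would be written after a # sign.

```python
import json

# Hypothetical metadata with an inline note, written the way a human might want to.
text_with_comment = """
{
    "raw": "raw/",  // where the unprocessed files live
    "processed": "processed/"
}
"""

def parses_as_json(text):
    """Return True if the standard json module accepts the text."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True

print(parses_as_json(text_with_comment))   # False: the // comment is rejected
print(parses_as_json('{"raw": "raw/"}'))   # True: plain JSON is accepted
```

The usual workaround in JSON is to store the note as an ordinary key, which then becomes part of the data rather than a comment.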

I will continue discussing what the format will look like in a future post.


UPDATE 10-22-17

After actually using both json and toml in Python, I realized two things:

  1. toml requires installing an additional package, e.g. toml, while the json package is built into the Python distribution.

  2. Python’s json module has a parameter called indent that makes JSON files much more readable.

I still think TOML is the better choice between these two formats, and the ability to store an object beyond Python's standard types is attractive. However, the stable, standard-library json package is very tempting for portable code. Since both libraries share a similar interface, I will go with JSON for now and wait until the TOML standard stabilizes.

More importantly, JSON is not that terrible in terms of readability. The indent parameter formats the output for human eyes, as the Python documentation describes:

If indent is a non-negative integer or string, then JSON array elements and object members will be pretty-printed with that indent level. An indent level of 0, negative, or "" will only insert newlines. None (the default) selects the most compact representation. Using a positive integer indent indents that many spaces per level. If indent is a string (such as "\t"), that string is used to indent each level.

Here is an example:

print(json.dumps({'a': 'test', 'b': [str(c) for c in list(range(20))]}))
{"a": "test", "b": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19"]}

becomes

print(json.dumps({'a': 'test', 'b': [str(c) for c in list(range(20))]}, indent = 4))
{
    "a": "test",
    "b": [
        "0",
        "1",
        "2",
        "3",
        "4",
        "5",
        "6",
        "7",
        "8",
        "9",
        "10",
        "11",
        "12",
        "13",
        "14",
        "15",
        "16",
        "17",
        "18",
        "19"
    ]
}

The version without indent gives very compact output; however, if file size concerns you, a binary format may be better suited. Moreover, a simple gzip will compress such a text-based file easily without losing portability. This is the primary reason I did not choose a customized binary format in the first place.
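As a rough sketch of the gzip idea, the standard gzip module can wrap the same json calls. The helper names save_json_gz and load_json_gz and the file name meta.json.gz are my own placeholders, not part of any standard.

```python
import gzip
import json
import os
import tempfile

data = {'a': 'test', 'b': [str(c) for c in range(20)]}

def save_json_gz(obj, path):
    """Write obj as indented JSON through a gzip text stream."""
    with gzip.open(path, 'wt', encoding='utf-8') as f:
        json.dump(obj, f, indent=4)

def load_json_gz(path):
    """Read a gzip-compressed JSON file back into Python objects."""
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        return json.load(f)

# Use a temporary directory so the sketch does not touch real data.
path = os.path.join(tempfile.mkdtemp(), 'meta.json.gz')
save_json_gz(data, path)
restored = load_json_gz(path)
print(restored == data)
```

The compressed file can still be inspected from the shell with zcat, so the readability argument for a text format is mostly preserved.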

As a comparison, the TOML version of the above dictionary looks like the following:

print(toml.dumps({'a': 'test', 'b': list(range(20))}))
a = "test"
b = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,]

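Moreover, the toml package writes the range object out directly, although the resulting range(0, 20) is not a standard TOML value and may not load back: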

print(toml.dumps({'a': 'test', 'b': range(20)}))
a = "test"
b = range(0, 20)

TOML is clearly more readable than JSON (and it is designed to be); however, we will go with JSON for now and keep following the development of TOML.


UPDATE 10-26-17

I ran into a problem when trying to store the following structure with JSON:

dic = {('a', 1): [(1, 'I'),
                  (2, 'II')],
       ('b', 2): [(3, 'III'),
                  (4, 'IV')]}

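The problem is that JSON does not distinguish between lists and tuples, and its object keys must be strings. With Python's json module, dumping this dictionary raises a TypeError because of the tuple keys, and tuples stored as values are written as arrays, e.g. (1, 'I') becomes [1, 'I'], so they come back as lists. Even if a key such as ('a', 1) were written out as ['a', 1], it could not be loaded back as a dictionary key, since the immutable tuple would have turned into a mutable, unhashable list in the process.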
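A quick check of what the standard json module does with this structure, assuming Python 3. The second dictionary, values_only, is a simplified variant I made up to isolate what happens to tuple values.

```python
import json

dic = {('a', 1): [(1, 'I'), (2, 'II')],
       ('b', 2): [(3, 'III'), (4, 'IV')]}

# Tuple keys: json.dumps refuses them, because JSON object keys must be strings.
try:
    json.dumps(dic)
    dump_error = None
except TypeError as exc:
    dump_error = exc
print(type(dump_error).__name__)   # TypeError

# Tuple values: written as JSON arrays and loaded back as lists.
values_only = {'a': [(1, 'I'), (2, 'II')]}
restored = json.loads(json.dumps(values_only))
print(restored)                    # {'a': [[1, 'I'], [2, 'II']]}
print(restored == values_only)     # False: a list never equals a tuple
```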

I initially tried to solve this problem; however, it did not take long for me to realize that I was using JSON for the wrong task. JSON is a text-based format mostly suited to storing configuration files. If I want to store structured data, a binary format is a better fit. In this case, I replaced json with pickle. This may lose the portability of the data, but I use it for temporary files that only my own script/program reads. Thus, it is easier to use a binary format native to Python than a portable format such as HDF5.
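A minimal sketch of the pickle round trip for the same dictionary. The file name scratch.pkl is a placeholder, and this is only reasonable for files the script wrote itself, since unpickling data from an untrusted source can execute arbitrary code.

```python
import os
import pickle
import tempfile

dic = {('a', 1): [(1, 'I'), (2, 'II')],
       ('b', 2): [(3, 'III'), (4, 'IV')]}

# Write to a temporary location; pickle files must be opened in binary mode.
path = os.path.join(tempfile.mkdtemp(), 'scratch.pkl')
with open(path, 'wb') as f:
    pickle.dump(dic, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == dic)                          # True
print(type(next(iter(restored))).__name__)      # tuple: the keys survive as tuples
```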
