Saturday, August 18, 2012

Replacing standard JDK serialization using Jackson (JSON/Smile), java.io.Externalizable

1. Background

The default Java serialization provided by the JDK is a two-edged sword: on one hand, it is a simple, convenient way to "freeze and thaw" Objects you have, handling just about any kind of Java object graph. It is possibly the most powerful serialization mechanism on the Java platform, bar none.

But on the other hand, its shortcomings are well-documented (and, I hope, well-known) at this point. Problems include:

  • Poor space-efficiency (especially for small data), due to inclusion of all class metadata: that is, the size of output can be huge, larger than just about any alternative, including XML
  • Poor performance (especially for small data), partly due to size inefficiency
  • Brittleness: even small changes to class definitions may break compatibility, preventing deserialization. This makes it a poor choice both for data exchange between (Java) systems and for long-term storage

Still, the convenience factor has led many systems to use JDK serialization as their default serialization method.

Is there anything we could do to address the downsides listed above? Plenty, actually. Although there is not much more to squeeze out of the default implementation (the JDK serialization implementation is in fact ridiculously well optimized for what it tries to achieve -- it's just that the goal is very ambitious), one can customize what gets used by making objects implement the java.io.Externalizable interface. If so, the JDK will happily use the alternate implementation under the hood.

Now: although writing custom serializers can be fun -- and for a specific case you can write a very efficient solution, given enough time -- it would be nicer to use an existing component to address the listed shortcomings.

And that's what we'll do! Here's one possible way to improve on all problems listed above:

  1. Use an efficient Jackson serializer (to produce either JSON, or perhaps more interestingly, Smile binary data)
  2. Wrap it in a java.io.Externalizable implementation, to make it transparent to code that uses JDK serialization (albeit not transparent to maintainers of the class -- but we will try to minimize the amount of intrusive code)

2. Challenges with java.io.Externalizable

First things first: while conceptually simple, a couple of rather odd design decisions make the use of java.io.Externalizable a bit tricky:

  1. Instead of passing instances of java.io.InputStream and java.io.OutputStream, java.io.ObjectInput and java.io.ObjectOutput are used; and these do NOT extend the stream versions (even though they define mostly the same methods!). This means additional wrapping is needed
  2. Externalizable.readExternal() requires updating the object itself, rather than constructing a new instance: most serialization frameworks do not support such an operation
  3. How do we access the external serialization library, given that no context is passed to either method?

None of these is a fundamental problem for Jackson: the first requires adapter classes (see below); the second means we need the "updating reader" approach that Jackson has supported for a while (yay!). And to solve the third, we have at least two choices: a ThreadLocal for passing an ObjectMapper, or a static helper class (the approach shown below).

So here are the helper classes we need:

final static class ExternalizableInput extends InputStream
{
  private final ObjectInput in;

  public ExternalizableInput(ObjectInput in) {
   this.in = in;
  }

  @Override
  public int available() throws IOException {
    return in.available();
  }

  @Override
  public void close() throws IOException {
    in.close();
  }

  @Override
  public boolean markSupported() {
    return false;
  }

  @Override
  public int read() throws IOException {
   return in.read();
  }

  @Override
  public int read(byte[] buffer) throws IOException {
    return in.read(buffer);
  }

  @Override
  public int read(byte[] buffer, int offset, int len) throws IOException {
    return in.read(buffer, offset, len);
  }

  @Override
  public long skip(long n) throws IOException {
   return in.skip(n);
  }
}

final static class ExternalizableOutput extends OutputStream
{
  private final ObjectOutput out;

  public ExternalizableOutput(ObjectOutput out) {
    this.out = out;
  }

  @Override
  public void flush() throws IOException {
    out.flush();
  }

  @Override
  public void close() throws IOException {
    out.close();
  }

  @Override
  public void write(int ch) throws IOException {
    out.write(ch);
  }

  @Override
  public void write(byte[] data) throws IOException {
    out.write(data);
  }

  @Override
  public void write(byte[] data, int offset, int len) throws IOException {
    out.write(data, offset, len);
  }
}

/* Use of helper class here is unfortunate, but necessary; alternative would
* be to use ThreadLocal, and set instance before calling serialization.
* Benefit of that approach would be dynamic configuration; however, this
* approach is easier to demonstrate.
*/
class MapperHolder {
  private final ObjectMapper mapper = new ObjectMapper();

  private final static MapperHolder instance = new MapperHolder();

  public static ObjectMapper mapper() { return instance.mapper; }
}
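
For completeness, the ThreadLocal alternative mentioned above could look roughly like this -- a sketch only; the MapperContext name and its methods are illustrative, not part of Jackson:

class MapperContext {
  private final static ThreadLocal<ObjectMapper> CURRENT = new ThreadLocal<ObjectMapper>();

  public static void set(ObjectMapper mapper) { CURRENT.set(mapper); }
  public static ObjectMapper get() { return CURRENT.get(); }
  public static void clear() { CURRENT.remove(); }
}

// calling code would then bracket serialization like so:
//   MapperContext.set(myConfiguredMapper);
//   try { objectOut.writeObject(pojo); } finally { MapperContext.clear(); }

The benefit of that approach is that configuration (say, switching between JSON and Smile) can be chosen per call; the cost is having to remember to set and clear the mapper around each serialization.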

And given these classes, we can implement the Jackson-for-default-serialization solution.

3. Let's Do a Serialization!

So with that, here's a class that is serializable using the Jackson JSON serializer:


  static class MyPojo implements Externalizable
  {
        public int id;
        public String name;
        public int[] values;

        public MyPojo() { } // for deserialization
        public MyPojo(int id, String name, int[] values)
        {
            this.id = id;
            this.name = name;
            this.values = values;
        }

        public void readExternal(ObjectInput in) throws IOException {
            MapperHolder.mapper().readerForUpdating(this).readValue(new ExternalizableInput(in));
        }

        public void writeExternal(ObjectOutput oo) throws IOException {
            MapperHolder.mapper().writeValue(new ExternalizableOutput(oo), this);
        }
  }

To use that class, use JDK serialization normally:


  // serialize as bytes (to demonstrate):
MyPojo input = new MyPojo(13, "Foobar", new int[] { 1, 2, 3 });
ByteArrayOutputStream bytes = new ByteArrayOutputStream();
ObjectOutputStream obs = new ObjectOutputStream(bytes);
obs.writeObject(input);
obs.close();
byte[] ser = bytes.toByteArray();

// and to get it back:
ObjectInputStream ins = new ObjectInputStream(new ByteArrayInputStream(ser));
MyPojo output = (MyPojo) ins.readObject();
ins.close();

And that's it.

4. So what's the benefit?

At this point, you may be wondering if and how this would actually help you. Since JDK serialization uses a binary format, and since (allegedly!) textual formats are generally more verbose than binary formats, how could this possibly help with size or performance?

Turns out that if you test the code above and compare it with the case where the class does NOT implement Externalizable, the sizes are:

  • Default JDK serialization: 186 bytes
  • Serialization as embedded JSON: 130 bytes

Whoa! Quite an unexpected result? The JSON-based alternative is 30% SMALLER than JDK serialization!

Actually, not really. The problem with JDK serialization is not the way data is stored, but rather the fact that in addition to (compact) data, much of Class definition metadata is included. This metadata is needed to guard against Class incompatibilities (which it can do pretty well), but it comes with a cost. And that cost is particularly high for small data.

Similarly, performance typically follows data size: while I don't have publishable results (I may do that for a future post), I expect embedded-JSON to also perform significantly better for single-object serialization use cases.

5. Further ideas: Smile!

But perhaps you think we should be able to do better, size-wise (and performance-wise), than using JSON?

Absolutely. Since the results are not exactly readable anyway (to use Externalizable, a bit of binary data is used to indicate the class name, along with a little stream metadata), we probably do not greatly care what the actual underlying format is.
Given that, an obvious choice would be the Smile data format, a binary counterpart to JSON that Jackson supports 100% via the Smile module.

The only change needed is to replace the ObjectMapper construction in "MapperHolder" to read:

private final ObjectMapper mapper = new ObjectMapper(new SmileFactory());

and we will see further reduced size, as well as faster reading and writing -- Smile is typically 30-40% smaller, and 30-50% faster to process, than JSON.
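
For reference, SmileFactory lives in its own format module (assuming Jackson 2.x, the class is com.fasterxml.jackson.dataformat.smile.SmileFactory), so one additional Maven dependency is needed; roughly (version shown is just an example):

<dependency>
  <groupId>com.fasterxml.jackson.dataformat</groupId>
  <artifactId>jackson-dataformat-smile</artifactId>
  <version>2.0.0</version>
</dependency>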

6. Even More compact? Consider Jackson 2.1, "POJO as array!"

But wait! In the very near future, we may be able to do EVEN BETTER! Jackson 2.1 (see the Sneak Peek) will introduce an interesting feature that further reduces the size of JSON/Smile Object serialization. By using the following annotation:

@JsonFormat(shape=JsonFormat.Shape.ARRAY)

you can further reduce the size: property names are excluded from serialization (think of output similar to CSV, just using JSON Arrays).

For our toy use case, size is reduced further, from 130 bytes to 109: a further reduction of almost 20%. But wait! It gets better -- the same is true for Smile as well: while Smile reduces space in general, it normally still has to retain some amount of name information; with POJO-as-array it benefits from the same exclusion!
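
As a sketch of how this could look for the toy class above (assuming Jackson 2.1; @JsonPropertyOrder pins the order in which values appear in the array):

@JsonFormat(shape=JsonFormat.Shape.ARRAY)
@JsonPropertyOrder({ "id", "name", "values" })
static class MyPojo implements Externalizable
{
    // fields, constructors and readExternal()/writeExternal() exactly as before
}

// serialized JSON then looks roughly like: [13,"Foobar",[1,2,3]]
// instead of:  {"id":13,"name":"Foobar","values":[1,2,3]}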

7. But how about actual real-life results?

At this point I am actually planning on doing something based on the code I showed above. But planning is in early stages, so I do not yet have results from "real data", meaning objects of more realistic sizes. I hope to get those soon: the use case is storing entities (whose data is read from a DB) in memcache. The existing system is getting CPU-bound, both from basic serialization/deserialization work and especially from a higher number of GCs. I fully expect the new approach to help with this; and most importantly, to be quite easy to deploy, because I do not have to change any of the code that actually serializes/deserializes Beans -- I just have to modify the Beans themselves a bit.

Posted by Tatu Saloranta at Saturday, August 18, 2012 4:26 PM
Categories: Java, JSON, Performance

Forcing escaping of HTML characters (less-than, ampersand) in JSON using Jackson

1. The problem

Jackson handles escaping of JSON String values in a minimal way, escaping only where necessary: by default it escapes two characters -- double quote and backslash -- as well as non-visible control characters. It does not escape other characters, since this is not required for producing valid JSON documents.

There are systems, however, that may run into problems with some characters that are valid in JSON documents. There are also use cases where you might prefer to add more escaping. For example, if you are to embed a JSON fragment in an XML attribute (or in Javascript code), you might want to use the apostrophe (') as the quote character in XML, and force escaping of all apostrophes in JSON content; this allows you to simply embed the encoded JSON value without other transformations.

Another specific use case is that of escaping "HTML funny characters", like less-than, greater-than, ampersand and apostrophe characters (double quotes are escaped by default).

Let's see how you can do that with Jackson.

2. Not as easy to change as you might think

Your first thought may be "I'll just do it myself". The problem is two-fold:

  1. When using the API via data-binding, or the regular streaming generator, you must pass an unescaped String, and it gets escaped using Jackson's escaping mechanism -- you cannot pre-process it (*)
  2. If you decide to post-process content after JSON gets written, you need to be careful with replacements, and this has a negative impact on performance (it is likely to double the time serialization takes)

(*) actually, there is the method 'JsonGenerator.writeRaw(...)' which you can use to control the exact output, but its use is cumbersome and you can easily break things if you are not careful. Plus, it is only applicable via the Streaming API.

3. Jackson (1.8) has you covered

Luckily, there is no need for you to write custom post-processing code to change details of content escaping.

Version 1.8 of Jackson added a feature that lets users customize the details of character escaping in JSON String values.
This is done by defining a CharacterEscapes object to be used by JsonGenerator; it is registered on the JsonFactory. If you use data-binding, you can get the factory via ObjectMapper.getJsonFactory(), then set the CharacterEscapes to use.

The functionality is handled at a low level, during writing of JSON String values, and the CharacterEscapes abstract class is designed to minimize performance overhead.
While there is some overhead (a little additional processing is required), it should not have a significant impact unless a significant portion of content requires escaping.
As usual, if you care a lot about performance, you may want to measure the impact of the change with test data.

4. The Code

Here is a way to force escaping of HTML "funny characters", using functionality that Jackson 1.8 (and above) has.


import org.codehaus.jackson.SerializableString;
import org.codehaus.jackson.io.CharacterEscapes;

// First, definition of what to escape
public class HTMLCharacterEscapes extends CharacterEscapes
{
    private final int[] asciiEscapes;

    public HTMLCharacterEscapes() {
        // start with set of characters known to require escaping (double-quote, backslash etc)
        int[] esc = CharacterEscapes.standardAsciiEscapesForJSON();
        // and force escaping of a few others:
        esc['<'] = CharacterEscapes.ESCAPE_STANDARD;
        esc['>'] = CharacterEscapes.ESCAPE_STANDARD;
        esc['&'] = CharacterEscapes.ESCAPE_STANDARD;
        esc['\''] = CharacterEscapes.ESCAPE_STANDARD;
        asciiEscapes = esc;
    }

    // this method gets called for character codes 0 - 127
    @Override
    public int[] getEscapeCodesForAscii() {
        return asciiEscapes;
    }

    // and this for others; we don't need anything special here
    @Override
    public SerializableString getEscapeSequence(int ch) {
        // no further escaping (beyond ASCII chars) needed:
        return null;
    }
}

// and then an example of how to apply it
public ObjectMapper getEscapingMapper() {
    ObjectMapper mapper = new ObjectMapper();
    mapper.getJsonFactory().setCharacterEscapes(new HTMLCharacterEscapes());
    return mapper;
}

// so we could do:
public byte[] serializeWithEscapes(Object ob) throws IOException {
    return getEscapingMapper().writeValueAsBytes(ob);
}


And that's it.

Posted by Tatu Saloranta at Saturday, August 18, 2012 3:14 PM
Categories: JSON

Thursday, May 24, 2012

Doing actual non-blocking, incremental HTTP access with async-http-client

The Async-http-client library, originally developed at Ning (by Jean-Francois, Tom, Brian and maybe others, and since then by quite a few more contributors), has been around for a while now.
Its main selling point is the claim of better scalability compared to alternatives like Jakarta HTTP Client (and that is not the only selling point: its API also seems more intuitive).

But although the library itself is capable of working well in non-blocking mode, most examples (and probably most users) use it in plain old blocking mode; or at most use a Future to simply defer handling of responses, without handling content incrementally as it becomes available.

While this lack of documentation is a bit unfortunate in itself, the bigger problem is that the usage shown by the sample code requires reading the whole response into memory.
This may not be a big deal for small responses, but when response sizes run into megabytes, it often becomes problematic.

1. Blocking, fully in-memory usage

The usual (and potentially problematic) usage pattern is something like:

  AsyncHttpClient asyncHttpClient = new AsyncHttpClient();
  Future<Response> f = asyncHttpClient.prepareGet("http://www.ning.com/").execute();
  Response r = f.get();
  byte[] contents = r.getResponseBodyAsBytes();

which gets the whole response as a byte array; no surprises there.

2. Use InputStream to avoid buffering the whole entity?

The first obvious workaround attempt is to have a look at the Response object and notice that there is a method "getResponseBodyAsStream()". This would seemingly allow one to read the response piece by piece and process it incrementally, for example by writing it to a file.

Unfortunately, this method is just a facade, implemented like so:

 public InputStream getResponseBodyAsStream() {
   return new ByteArrayInputStream(getResponseBodyAsBytes());
 }

which actually is no more efficient than accessing the whole content as a byte array. :-/

(Why is it implemented that way? Mostly because the underlying non-blocking I/O library, like Netty or Grizzly, provides content using a "push" style interface, which makes it very hard to support "pull" style abstractions like java.io.InputStream -- so it is not really AHC's fault, but rather a consequence of the NIO/async style of I/O processing.)

3. Go fully async

So what can we do to actually process large response payloads (or large PUT/POST request payloads, for that matter)?

To do that, it is necessary to use the following callback abstractions:

  1. To handle response payloads (for HTTP GETs), we need to implement the AsyncHandler interface (the code below implements it directly; AsyncCompletionHandler is a convenience base class that implements it for you).
  2. To handle PUT/POST request payloads, we need to implement BodyGenerator (which is used for creating a Body instance, an abstraction for feeding content)

Let's have a look at what is needed for the first case.

(note: there are existing default implementations for some of the pieces -- but here I will show how to do it from the ground up)

4. A simple download-a-file example

Let's start with the simple case of downloading large content into a file, without keeping more than a small chunk in memory at any given time. This can be done as follows:


public class SimpleFileHandler implements AsyncHandler<File>
{
 private File file;
 private final FileOutputStream out;
 private boolean failed = false;

 public SimpleFileHandler(File f) throws IOException {
  file = f;
  out = new FileOutputStream(f);
 }

 public com.ning.http.client.AsyncHandler.STATE onBodyPartReceived(HttpResponseBodyPart part)
   throws IOException
 {
  if (!failed) {
   part.writeTo(out);
  }
  return STATE.CONTINUE;
 }

 public File onCompleted() throws IOException {
  out.close();
  if (failed) {
   file.delete();
   return null;
  }
  return file;
 }

 public com.ning.http.client.AsyncHandler.STATE onHeadersReceived(HttpResponseHeaders h) {
  // nothing to check here as of yet
  return STATE.CONTINUE;
 }

 public com.ning.http.client.AsyncHandler.STATE onStatusReceived(HttpResponseStatus status) {
  failed = (status.getStatusCode() != 200);
  return failed ?  STATE.ABORT : STATE.CONTINUE;
 }

 public void onThrowable(Throwable t) {
  failed = true;
 }
}

Voila. The code is not very brief (event-based code seldom is), and it could use more handling for error cases.
But it should at least show the general processing flow -- nothing very complicated, beyond basic state-machine style operation.
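
Usage is then just a matter of passing the handler to execute(); for example (URL and file name are placeholders):

AsyncHttpClient ahc = new AsyncHttpClient();
// execute() accepts an AsyncHandler<T>; the Future resolves to whatever onCompleted() returns
Future<File> future = ahc.prepareGet("http://example.com/big-file.bin")
    .execute(new SimpleFileHandler(new File("big-file.bin")));
File result = future.get(); // blocks here, but the download itself was processed chunk by chunk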

5. Booooring. Anything more complicated?

Downloading a large file is useful, and while not a contrived example, it is rather plain. So let's consider the case where we not only want to download a piece of content, but also want to uncompress it, in one fell swoop. This serves as an example of additional processing we may want to do in incremental/streaming fashion -- as an alternative to storing an intermediate copy in a file, then uncompressing it to another file.

Before showing the code, however, it is necessary to explain why this is a bit tricky.

First, remember that we can't really use InputStream-based processing here: all content we get is "pushed" to us (without our code ever blocking on input); whereas with an InputStream the consumer would pull content itself, possibly blocking the thread.

Second, most decompressors present either an InputStream-based abstraction, or an uncompress-the-whole-thing interface: neither works for us, since we are getting incremental chunks; to use either, we would first have to buffer the whole content -- which is what we are trying to avoid.

As luck would have it, however, the Ning Compress package (version 0.9.4, specifically) just happens to have a push-style uncompressor interface (aptly named "com.ning.compress.Uncompressor"), and two implementations:

  1. com.ning.compress.lzf.LZFUncompressor
  2. com.ning.compress.gzip.GZIPUncompressor (uses JDK native zlib under the hood)

So why is that fortunate? Because the interface they expose is push style:

 public abstract class Uncompressor
 {
  public abstract void feedCompressedData(byte[] comp, int offset, int len) throws IOException;
  public abstract void complete() throws IOException;
}

and is thereby usable for our needs here. Especially when we use an additional class called "UncompressorOutputStream", which makes an OutputStream out of an Uncompressor and a target stream (an OutputStream is needed for efficient access to the content AHC exposes via HttpResponseBodyPart).

6. Show me the code

Here goes:


public class UncompressingFileHandler implements AsyncHandler<File>,
   DataHandler // for Uncompressor
{
 private File file;
 private final OutputStream out;
 private boolean failed = false;
 private UncompressorOutputStream uncompressingStream; // not final: assigned in onHeadersReceived()

 public UncompressingFileHandler(File f) throws IOException {
  file = f;
  out = new FileOutputStream(f);
 }

 public com.ning.http.client.AsyncHandler.STATE onBodyPartReceived(HttpResponseBodyPart part)
   throws IOException
 {
  if (!failed) {
   // if compressed, pass through uncompressing stream
   if (uncompressingStream != null) {
    part.writeTo(uncompressingStream);
   } else { // otherwise write directly
    part.writeTo(out);
   }
  }
  return STATE.CONTINUE;
 }

 public File onCompleted() throws IOException {
  // close the uncompressing stream first (it may flush remaining output), then the file
  if (uncompressingStream != null) {
   uncompressingStream.close();
  }
  out.close();
  if (failed) {
   file.delete();
   return null;
  }
  return file;
 }

 public com.ning.http.client.AsyncHandler.STATE onHeadersReceived(HttpResponseHeaders h) {
  // must verify that we are getting compressed stuff here:
  String compression = h.getHeaders().getFirstValue("Content-Encoding");
  if (compression != null) {
   if ("lzf".equals(compression)) {
    uncompressingStream = new UncompressorOutputStream(new LZFUncompressor(this));
   } else if ("gzip".equals(compression)) {
    uncompressingStream = new UncompressorOutputStream(new GZIPUncompressor(this));
   }
  }
  // nothing to check here as of yet
  return STATE.CONTINUE;
 }

 public com.ning.http.client.AsyncHandler.STATE onStatusReceived(HttpResponseStatus status) {
  failed = (status.getStatusCode() != 200);
  return failed ?  STATE.ABORT : STATE.CONTINUE;
 }

 public void onThrowable(Throwable t) {
  failed = true;
 }

 // DataHandler implementation for Uncompressor; called with uncompressed content:
 public void handleData(byte[] buffer, int offset, int len) throws IOException {
  out.write(buffer, offset, len);
 }
}

Handling gets a bit more complicated here, since we have to handle both the case where content is compressed and the case where it is not (the server is ultimately responsible for deciding whether to apply compression).

And to make the call, you also need to indicate the capability to accept compressed data. For example, we could define a helper method like:


public File download(String url) throws Exception
{
 AsyncHttpClient ahc = new AsyncHttpClient();
 Request req = ahc.prepareGet(url)
  .addHeader("Accept-Encoding", "lzf,gzip")
  .build();
 ListenableFuture<File> futurama = ahc.executeRequest(req,
   new UncompressingFileHandler(new File("download.txt")));
 try {
  // wait for up to 30 seconds for completion
  return futurama.get(30, TimeUnit.SECONDS);
 } catch (TimeoutException e) {
  throw new IOException("Failed to download due to timeout");
 }
}

which would use handler defined above.

7. Easy enough?

I hope the above shows that while doing incremental, "streaming" processing is a bit more work, it is not super difficult to do.

Not even when you have a bit of pipelining to do, like uncompressing (or compressing) data on the fly.

Posted by Tatu Saloranta at Thursday, May 24, 2012 5:26 PM
Categories: Java, Open Source, Performance

Thursday, May 03, 2012

Jackson Data-binding: Did I mention it can do YAML as well?

Note: as useful earlier reading, consider "Jackson 2.0: CSV-compatible as well" and "Jackson 2.0: now with XML, too!"

1. Inspiration

Before jumping into the actual beef -- the new module -- I want to mention my inspiration for this extension: the Greatest New Thing to hit Java World Since JAX-RS called DropWizard.

For those who have not yet tried it out and are unaware of its Kung-Fu Panda like Awesomeness, please go and check it out. You won't be disappointed.

DropWizard is a sort of mini-framework that combines great Java libraries (I may be biased, as it does use Jackson), starting with the trusty JAX-RS/Jetty 8 combination, building with Jackson for JSON, jDBI for DB/JDBC/SQL, the Java Validation API (implementation from the Hibernate project) for data validation, and logback for logging; adding a bit of Jersey-client for client-building and an optional FreeMarker plug-in for UI, all bundled up in a nice, modular and easily understandable package.
Most importantly, it "Just Works" and comes with an intuitive configuration and bootstrapping system. It also builds easily into a single deployable jar file that contains all the code you need, with just a bit of Maven setup; all of which is well documented. Oh, and the documentation is very accessible, accurate and up-to-date. All in all, a very rare combination of things -- and something that would give RoR and other "easier than Java" frameworks a good run for their money, if hipsters ever decided to check out the best that Java has to offer.

The most relevant part here is the configuration system. Configuration can use either basic JSON or full YAML. And as I mentioned earlier, I am beginning to appreciate YAML for configuring things.

1.1. The Specific inspirational nugget: YAML converter

The way DropWizard uses YAML is to parse it using the SnakeYAML library, convert the resulting document into a JSON tree, and then use Jackson for data binding. This is useful since it allows one to use the full power of Jackson configuration, including annotations and polymorphic type handling.

But this got me thinking -- given that the whole converter implementation is about a dozen lines or so (to work to the degree needed for configs), wouldn't it make sense to add "full support" for YAML to the Jackson family of plug-ins?

I thought it would.

2. And Then There Was One More Backend for Jackson

Turns out that the implementation was, indeed, quite easy. I was able to improve certain things -- for example, the module can use a lower-level API to keep performance a bit better, and the output side also works, not just reading -- but in a way there isn't all that much to do, since all the module has to do is convert YAML events into JSON events, and maybe help with some conversions.

Some of the more advanced things include:

  • Format auto-detection works, thanks to the "---" document prefix (which the generator also produces by default)
  • Although YAML itself exposes all scalars as text (unless type hints are enabled, which adds more noise to the content), the module uses heuristics to make the parser implementation a bit more natural; so although data-binding can also coerce types, this should usually not be needed
  • Configuration includes settings to change output style, to allow more aesthetically pleasing output (for those who prefer the "wiki look", for example)

At this point, the functionality has been tested with a broad if shallow set of unit tests; but because the data-binding used is 100% the same as with JSON, this is actually sufficient to put the module to some real use.

3. Usage? So boring I tell you

Oh. And you might be interested in knowing how to use the module. This is the boring part, since.... there isn't really much to it.

You just use "YAMLFactory" wherever you would normally use "JsonFactory"; and then under the hood you get "YAMLParser" and "YAMLGenerator" instances, instead of JSON equivalents. And then you either use parser/generator directly, or, more commonly, construct an "ObjectMapper" with "YAMLFactory" like so (code snippet itself is from test "SimpleParseTest.java")


  ObjectMapper mapper = new ObjectMapper(new YAMLFactory());
  User user = mapper.readValue("firstName: Billy\n"
      + "lastName: Baggins\n"
      + "gender: MALE\n"
      + "userImage: AQIDBAY=",
      User.class);

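Writing works the same way (the output side of the module); a minimal sketch, reusing the mapper and the (hypothetical) User bean from above, with approximate output shown:

  String yaml = mapper.writeValueAsString(user);
  // yaml starts with the "---" document marker by default, and looks roughly like:
  // ---
  // firstName: "Billy"
  // lastName: "Baggins"
  // gender: "MALE"
  // userImage: !!binary AQIDBAY=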

and to get the functionality itself, the Maven dependency is:

<dependency>
  <groupId>com.fasterxml.jackson.dataformat</groupId>
  <artifactId>jackson-dataformat-yaml</artifactId>
  <version>2.0.0</version>
</dependency>

4. That's all Folks -- until you give us some Feedback!

That's it for now. I hope some of you will try out this new backend, and help us further make Jackson 2.0 the "Universal Java Data Processor".

Posted by Tatu Saloranta at Thursday, May 03, 2012 10:12 PM
Categories: Java, JSON, Open Source

Tuesday, April 10, 2012

What me like YAML? (Confessions of a JSON advocate)

Ok. I have to admit that I learnt something new and gained a bit more respect for the YAML data format recently, when working on the proof-of-concept for YAML-on-Jackson (jackson-dataformat-yaml; more on this in yet another Jackson 2.0 article, soon).
And since it would be intellectually dishonest not to mention that my formerly negative view of YAML has brightened up a notch, here's my write-up on this bit of enlightenment.

1. Bad First Impressions Stick

My first look at YAML via its definition basically made my stomach turn. It just looked so much like a bad American Ice Cream: "Too Much of Everything" -- hey, if it isn't enough to have chocolate, banana and walnut, let's throw in a bit of caramel, root beer essence and a touch of balsamic vinegar; along with a bit of organic arugula to spice things up! That isn't the official motto, I thought, but it might as well be. If there is an O'Reilly book on YAML, it surely must have a platypus as the cover animal.

That was my thinking up until a few weeks ago.

2. Tale of the Two Goals

I have read most of the YAML specification (which is not badly written at all) multiple times, as well as shorter descriptions. My overall conclusion has always been that there are multiple high-level design decisions I disagree with, and that they can mostly be summarized as: it tries to do too many things, and to solve multiple conflicting use cases.

But recently, when working on adding YAML support as a Jackson module (based on the nice SnakeYAML library -- a solid piece of code, very unlike most parsers/generators I have seen), I realized that fundamentally there are just two conflicting goals:

  1. Define a Wiki-style markup for data (on the assumption that a format that is easy to write prose in is also easy to write data in)
  2. Create a straight-forward Object serialization data format

(It is worth noting that these goals are orthogonal, functionality-wise; but they conflict at the level of syntax and visual appearance, and complicate handling significantly, mostly because there is always "more than one way to do it" -- the Perl motto!)

I still think one could solve the problem better by defining two formats, not one: the first a Wiki dialect, the second a clean data format.
But this led me to think about something: what if those weird Wiki-style aspects were removed from YAML? Would I still dislike the format?

And I came to the conclusion that no, I would not dislike it. In fact, I might like it. A lot.

Why? Let's see which things I like in YAML; things that JSON does not have, but really really should have in the ideal world.

3. Things that YAML has and JSON should have

Here's the quick rundown (with a small illustrative YAML document after the list):

  1. Comments: oh lord, what kind of textual data format does NOT have comments? JSON is the only one I know of; and even it had them before the spec was finalized. I can only imagine that a brain fart of colossal proportions caused them to be removed from the spec...
  2. (optional) Document start and end markers ("---" header, "..." footer). This is such a nice thing to have, both for format auto-detection and for framing data feeds. It's a bit of a no-brainer; but suspiciously, JSON has nothing of the sort (XML does have the XML declaration, which _almost_ works well, but not quite; but I digress)
  3. Type tags for type metadata: in YAML, one can add optional type tags to further indicate the type of an Object (or of any value, actually). This is such an essential thing to have; with JSON one must use in-band constructs that can conflict with data. XML at least has attributes ("xsi:type").
  4. Aliases/anchors for Object Identity (aka "id / idref"): although data is data, not objects with identity, having the means to optionally pass identity information is very, very useful. Here too XML has some support (having attributes for metadata is convenient); JSON has nada.
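
For illustration, here is a small (hypothetical) YAML document that uses all four features:

# (1) comments are allowed anywhere
---                          # (2) document start marker
customer: !Customer          # (3) an (application-specific) type tag
  name: "Bob"
billing:
  address: &home             # (4) anchor defining an identity
    street: "123 Main St"
    city: "Springfield"
shipping:
  address: *home             # (4) alias referring back to that identity
...                          # (2) document end marker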

The common theme with the above is that all of this extra information is optional; but if used, it is included discreetly and can be used as appropriate by encoders and decoders, with or without language- or platform-specific resolution mechanisms.
And I think YAML actually defines these things pretty well: it is neither over- nor under-engineered with respect to these features. This is a surprisingly delicate balance, and very well chosen. I have seen over-complicated data formats (at Amazon, for example) that didn't know where to stop; and we can see how JSON stopped short of even the most rudimentary things (... comments). Interestingly, XML almost sort-of has these features; but they come about via extra constructs (xsi:type via XML Schema), or as side effects of otherwise quirky features (element/attribute separation).

Having had to implement equivalent functionality on top of simplistic JSON constructs ("add yet another meta-property, in-line with actual data; allow a way to configure it to reduce conflicts"), I envy having these constructs as first-level concepts: convenient little additions that allow proper separation of data and metadata (type, object id, comments).

4. Uses for YAML

Still, having solved or worked around all of the above problems -- Jackson 1.5 added full support for polymorphic types ("type tags"); 2.0 finally added Object Identity ("alias/anchor"); and use of linefeeds for framing can substitute for document boundaries -- I do not have a compelling case for using YAML for data transfer. It's almost a pity: I have come to realize that YAML could have been a great data format (it is also old enough to have challenged the popularity of JSON; both seem to have been conceived at about the same time). As is, it is almost one.

Somewhat ironically, then, maybe the Wiki features are acceptable for the other main use case: configuration files. This is the use case I have for YAML, and the main reason for writing the compatibility module (inspired by libs/frameworks like DropWizard, which use YAML as the main config file format).

Posted by Tatu Saloranta at Tuesday, April 10, 2012 9:52 PM
Categories: JSON
