

Automating Object Serialization/Deserialization in C++
Using C++ Template Meta-Programming to generate visitors for objects

by John Ryland (c) 2016
-------------


Background

Why is serialization/deserialization needed?

There are multiple uses for serialization. The most obvious is saving some state to disk so that it can be restored later, e.g. saving a document or file and being able to open it again. In this particular case, if the data will be loaded again on the same machine and by the same program, then endianness and interoperability issues can likely be ignored. Other uses include interoperability between programs, as well as transmission of objects over a network.

Generally pointers can't be serialized and deserialized (array indexes are okay if what they reference is also serialized/deserialized). Compound objects are usually serialized and deserialized by calling the serializing and deserializing functions of their component parts, so you generally end up with a hierarchy, and formats that can represent a hierarchy are used, XML and JSON being two notable ones. The downside of these is that they are text-based formats, so the data is inflated as text and more time is spent parsing. There are binary forms of XML and JSON. Generally, serializing/deserializing in binary requires fixing the size of basic types, the number representations, the endian order and so on. Google's solution is protocol buffers, for which Google provides implementations in quite a number of languages, including C++.
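
To make "fixing the size and endian order" a bit more concrete, here is a minimal sketch of writing and reading a 32-bit value as exactly four bytes, least significant byte first, whatever the host machine does natively. The helper names (writeU32LE, readU32LE, roundTripExample) are made up for this illustration and don't come from any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: append 'value' as four bytes, least significant first.
// Shifting and masking makes the result independent of the host's byte order.
inline void writeU32LE(std::vector<std::uint8_t>& out, std::uint32_t value)
{
    for (int shift = 0; shift < 32; shift += 8)
        out.push_back(static_cast<std::uint8_t>((value >> shift) & 0xFFu));
}

// Hypothetical helper: rebuild the value from the four bytes at 'offset'.
// The caller is assumed to have checked that offset + 4 <= in.size().
inline std::uint32_t readU32LE(const std::vector<std::uint8_t>& in, std::size_t offset)
{
    std::uint32_t value = 0;
    for (int i = 0; i < 4; ++i)
        value |= static_cast<std::uint32_t>(in[offset + i]) << (8 * i);
    return value;
}

// Round trip: the byte layout is fixed, and reading gives back the input.
inline void roundTripExample()
{
    std::vector<std::uint8_t> buffer;
    writeU32LE(buffer, 0x11223344u);
    assert(buffer.size() == 4);
    assert(buffer[0] == 0x44 && buffer[3] == 0x11);
    assert(readU32LE(buffer, 0) == 0x11223344u);
}
```

A real binary format also has to pin down things like floating point representation and string lengths, which is the kind of work protocol buffers does for you.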

In various languages, there is some degree of built in support for serializing and deserializing objects.
In Python, it is called pickling (https://docs.python.org/3/library/pickle.html); it is well supported and quite convenient to use, not to mention that converting Python objects to JSON and back is also quite easy.
In C#, the framework provides some convenience attributes and classes for serializing and deserializing objects to/from XML (see: https://msdn.microsoft.com/en-us/library/58a18dwa(v=vs.110).aspx).

    using System.IO;
    using System.Xml.Serialization;

    public class Address
    {
       // XML attribute annotations are provided by the framework
       [XmlAttribute]
       public string Name;
       public string Line1;
    }

    // Code to serialize
    XmlSerializer serializer = new XmlSerializer(typeof(Address));
    TextWriter writer = new StreamWriter(filename);
    Address address = new Address();
    ... populating the object ...
    serializer.Serialize(writer, address);
    writer.Close();

While Python (and C#) are useful for particular tasks, there are times when C/C++ is required, and you may need to add serializing and deserializing to an existing project that already uses C/C++. So how can serializing and deserializing be made more convenient in C++? Certainly C++11 has made some great improvements, but it doesn't solve this yet. A survey of existing libraries that come close to what we would like turns up some solutions. One notable one that looks quite promising is this:

https://github.com/Loki-Astari/ThorsSerializer

How it looks in practice:

    #include "ThorSerialize/Traits.h"
    #include "ThorSerialize/JsonThor.h"

    struct Color
    {
        int     red;
        int     green;
        int     blue;
    };
    // Annotation of fields to serialize
    ThorsAnvil_MakeTrait(Color, red, green, blue);


    // Code to serialize
    using ThorsAnvil::Serialize::jsonExport;
    Color color = {255,0,0};
    std::cout << jsonExport(color) << "\n";


That looks pretty close to ideal, wouldn't you think? A single annotation line per type (ThorsAnvil_MakeTrait is itself a macro, but it is the only one you have to write) and the rest is plain C++. Looks nice.
I'm pretty pleased with the syntax Loki-Astari has managed to create and the minimal amount of code needed to annotate classes. But there are a few things I don't quite like. When calling MakeTrait, the names of the variables are entered again. Also the syntax feels a bit clunky, with the type followed by the members in the parameters to MakeTrait. The particular solution also ships with its own JSON serializer/deserializer, coupling those together, although with some work I imagine it is possible to hook it up with RapidJSON, which is my preferred JSON implementation in C++ at the moment. I think the API might be better with a separation of concerns: isolate the traversal of the objects from the more mundane matter of reading and writing files in a particular format, which existing libraries may be better tuned for.
Unfortunately it also appears that it requires C++14, and may not work with C++11.

So what interests me is the generation of the object traversal, rather than the detail of reading/writing involved in serialization. The GOF (https://en.wikipedia.org/wiki/Design_Patterns) would call this the visitor pattern. Serialization is just one specialization or use of this visitor pattern. For example, instead of actually generating all that XML or JSON or whatever you plan to serialize, you could traverse the objects in the same way and generate a hash of the objects. I've used the visitor pattern this way in unit testing, to check the state of objects against a known hash that the objects are expected to have, which saves dumping the entire state and comparing a large amount of data. I've also used it in a client-server design which is server authoritative, where the state on the client and the server can be compared by hash to validate client actions. The good thing about the visitor pattern is that it can be non-intrusive to an implementation, so it does not impact performance.

Before we look at how to do a visitor pattern correctly, let's look at some other ways people attempt to do this and the impact each has.

One way is using MACROs (excuse my screaming; can you tell I'm not a huge fan of MACROs, despite having done a lot of 'clever' MACROs in my time?).

Say we have our color example again:

    struct Color
    {
        int     red;
        int     green;
        int     blue;
    };

This is a nice POD (plain old data) type, which has nice properties.
With MACROs, commonly you see people do something like this:

In the header:

    DECLARE_CLASS(Color)
       DECLARE_MEMBER(int, red)
       DECLARE_MEMBER(int, green)
       DECLARE_MEMBER(int, blue)
    END_DECLARE_CLASS(Color)

And then in a CPP file:

    DEFINE_CLASS(Color)
       DEFINE_MEMBER(int, red)
       DEFINE_MEMBER(int, green)
       DEFINE_MEMBER(int, blue)
    END_DEFINE_CLASS(Color)

If you've done something like this, don't feel bad, this is pretty common.

Unfortunately, usually the DECLARE_CLASS is defined as something like this:

    #define DECLARE_CLASS(classname) \
        class classname : public SerializableBase {

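The consequence is that every object grows by the size of SerializableBase, using more memory. If SerializableBase has virtual functions, each object typically also carries a vtable pointer, and the type is no longer POD.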
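
A quick way to see the cost is to compare the plain struct against one that inherits from a base with virtual functions. This is only a sketch: the SerializableBase below is a guess at what such a base class might look like (the article doesn't show one), and PlainColor, DerivedColor and baseClassGrowsObject are names made up for the comparison. The size increase is what typical compilers do, not something the standard promises.

```cpp
#include <cassert>
#include <type_traits>

// The plain struct from above.
struct PlainColor { int red; int green; int blue; };

// Hypothetical base class of the kind DECLARE_CLASS might inherit from.
struct SerializableBase
{
    virtual ~SerializableBase() {}
    virtual void serialize() {}
};

// The same three members, but now behind the base class.
struct DerivedColor : public SerializableBase
{
    int red; int green; int blue;
};

// The plain struct is trivial and standard-layout (i.e. POD) ...
static_assert(std::is_trivial<PlainColor>::value &&
              std::is_standard_layout<PlainColor>::value,
              "PlainColor should be POD");
// ... while the virtual functions make the derived one non-trivial.
static_assert(!std::is_trivial<DerivedColor>::value,
              "DerivedColor should no longer be trivial");

// On typical implementations the derived struct is bigger, because of
// the vtable pointer (plus any padding that comes with it).
inline bool baseClassGrowsObject()
{
    return sizeof(DerivedColor) > sizeof(PlainColor);
}
```

For a few objects that doesn't matter, but for large arrays of small structs the extra pointer per element adds up.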

Hopefully you didn't do that. If you didn't, give yourself a pat on the back. Perhaps you then did this instead to remove the inheritance:

    #define DECLARE_CLASS(classname) \
        class classname { \
          public: \
           void serialize(Serializer& s);

    #define DECLARE_MEMBER(typ, nam) \
           typ nam;

    #define DEFINE_CLASS(classname) \
        void classname::serialize(Serializer& s) {

    // The stream operators are applied to the serializer 's' passed in above
    #define DEFINE_MEMBER(typ, nam) \
           if (s.isWriter())        \
             s << nam;              \
           else                     \
             s >> nam;

Not bad. Your set of macros can handle serializing and deserializing. There is no inheritance, and a non-virtual member function should still keep POD types as POD. Unfortunately this still misses the more interesting possibilities of traversing the objects for something other than serializing, such as hashing, or whatever algorithm you wish or need to apply, rather than specifically serializing with a specific implementation.

An improvement is this:

    #define DECLARE_CLASS(classname) \
        class classname { \
          public: \
           template <typename Visitor> \
           void visit(Visitor& v);

    #define DECLARE_MEMBER(typ, nam) \
        typ nam;

    #define DEFINE_CLASS(classname) \
        template <typename Visitor> \
        void classname::visit(Visitor& v) {

    #define DEFINE_MEMBER(typ, nam) \
            v.visit(nam);

Instead of that horrible branching inside the macro (yuck) for whether we are serializing or deserializing, it is instead controlled by which Visitor implementation we pass in.

The Visitor implementation needs to implement the visit function. This function can be templated so that it is specialized for basic types, and can then call the visit function of other compound types like ones the macros are creating.
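
As a rough sketch of what such a Visitor could look like: overloads handle the basic types, and a catch-all template forwards to the visit member of anything else. PrintVisitor, NamedColor and printVisitorExample are names invented for this example, and the two structs are written out by hand to stand in for what the macros above would generate.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical visitor: writes every leaf value it is shown to a string.
class PrintVisitor
{
public:
    // Overloads for the basic types this visitor understands directly.
    void visit(int& value)         { m_out << value << ' '; }
    void visit(std::string& value) { m_out << value << ' '; }

    // Anything else is assumed to be a compound type with its own
    // visit(Visitor&) member, so just recurse into it.
    template <typename T>
    void visit(T& compound)        { compound.visit(*this); }

    std::string str() const { return m_out.str(); }

private:
    std::ostringstream m_out;
};

// Hand-written stand-ins for what the macros would produce.
struct Color
{
    int red, green, blue;

    template <typename Visitor>
    void visit(Visitor& v) { v.visit(red); v.visit(green); v.visit(blue); }
};

struct NamedColor
{
    std::string name;
    Color       color;

    template <typename Visitor>
    void visit(Visitor& v) { v.visit(name); v.visit(color); }
};

// The nested Color is reached through the catch-all template.
inline void printVisitorExample()
{
    NamedColor c = { "red", { 255, 0, 0 } };
    PrintVisitor printer;
    printer.visit(c);
    assert(printer.str() == "red 255 0 0 ");
}
```

A basic type without its own overload (say float) would fall through to the template and fail to compile, which is a reasonable way to find out that an overload is missing.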

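So that is not bad. If you managed to do this, give yourself a couple more pats on the back.

But there is still the duplication of declaring things in two sets of macros, in both the header and the CPP file, which hurts maintainability and is error prone. (And since visit is now a template, its definition has to be visible wherever it is instantiated, so the DEFINE_ macros effectively have to move into a header too, unless you explicitly instantiate it for every visitor type.)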

But really, what are we saving ourselves from writing with these macros anyway?

We could, if we don't mind duplication, simply write out explicitly what the macros expand to, inside the header file, as in this example:


    struct Person
    {
      int32_t        id;
      std::string    name;
      std::string    email;
      uint64_t       phone;

      // being a member function the members could be private and this will still work
      template <class V>
      void Visit(V& v)
      {
        v.Enter("person");
        v.Visit("id",id);
        v.Visit("name",name);
        v.Visit("email",email);
        v.Visit("number",phone);
        v.Exit("person");
      }
    };


Or if we don't like having the member function, and want a clearer separation between the struct and the visitor, one way is like this:


    struct Person
    {
      int32_t        id;
      std::string    name;
      std::string    email;
      uint64_t       phone;
    };

    template <class V>
    void Visit(V& v, Person& p)
    {
      v.Enter("person");
      v.Visit("id",p.id);
      v.Visit("name",p.name);
      v.Visit("email",p.email);
      v.Visit("number",p.phone);
      v.Exit("person");
    }

Is that so bad? Everything is together in the one place. It avoids MACROs. It is quite idiomatic C++ code. I believe this will maintain the PODness of structs that would originally be POD without the visit function (whether it is a member function or not). This particular example also shows how the members can be named, and that the name doesn't need to match the member name in the class (phone is visited as "number"). The calls to Enter() and Exit() are there to name the type, and to help particular serialization implementations of the visitor deal with arrays.
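
To show the other side of that interface, here is a minimal sketch of a visitor that the free Visit function could be handed. TextVisitor and textVisitorExample are invented for this illustration; the output is a crude XML-like string with no escaping and no support for nested compound types, so it is only meant to show the shape of Enter/Visit/Exit. Person and Visit are repeated from above so the block stands on its own.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

struct Person
{
    std::int32_t   id;
    std::string    name;
    std::string    email;
    std::uint64_t  phone;
};

template <class V>
void Visit(V& v, Person& p)
{
    v.Enter("person");
    v.Visit("id", p.id);
    v.Visit("name", p.name);
    v.Visit("email", p.email);
    v.Visit("number", p.phone);
    v.Exit("person");
}

// Hypothetical visitor: wraps each field in a tag named after it.
class TextVisitor
{
public:
    void Enter(const char* type) { m_out << '<' << type << '>'; }
    void Exit(const char* type)  { m_out << "</" << type << '>'; }

    // One template covers every field type that can be streamed.
    template <class T>
    void Visit(const char* name, const T& value)
    {
        m_out << '<' << name << '>' << value << "</" << name << '>';
    }

    std::string str() const { return m_out.str(); }

private:
    std::ostringstream m_out;
};

inline void textVisitorExample()
{
    Person p = { 7, "Ann", "ann@example.com", 5551234 };
    TextVisitor text;
    Visit(text, p);
    assert(text.str() ==
           "<person><id>7</id><name>Ann</name>"
           "<email>ann@example.com</email><number>5551234</number></person>");
}
```

Note that Person itself knows nothing about TextVisitor; a deserializing or hashing visitor could be swapped in without touching the struct or the Visit function.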

Depending on taste, the visit function could be declared elsewhere, but there is a greater chance that someone adds a new member and doesn't update the visit function if these are in different files. A static_assert on the sizeof the type, placed in the visit function, may help detect this.
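
A sketch of that static_assert idea, using a small made-up Vec3 struct and a made-up CountingVisitor so the block is self-contained. The size check assumes three floats are laid out without padding, which holds on the usual compilers.

```cpp
#include <cassert>

struct Vec3
{
    float x;
    float y;
    float z;
};

template <class V>
void Visit(V& v, Vec3& p)
{
    // If someone adds a member to Vec3 its size changes, this stops
    // compiling, and the message points at the function to update.
    static_assert(sizeof(Vec3) == 3 * sizeof(float),
                  "Vec3 changed: update Visit(V&, Vec3&) to match");
    v.Enter("vec3");
    v.Visit("x", p.x);
    v.Visit("y", p.y);
    v.Visit("z", p.z);
    v.Exit("vec3");
}

// Hypothetical visitor that just counts the fields it is shown.
struct CountingVisitor
{
    int fields = 0;

    void Enter(const char*) {}
    void Exit(const char*)  {}

    template <class T>
    void Visit(const char*, T&) { ++fields; }
};

inline void sizeGuardExample()
{
    Vec3 p = { 1.0f, 2.0f, 3.0f };
    CountingVisitor counter;
    Visit(counter, p);
    assert(counter.fields == 3);
}
```

It is a tripwire rather than a proof: a change that leaves the size the same, such as a new member that fits into existing padding or one member swapped for another of the same size, will slip past it.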

Looking back at ThorsAnvil_MakeTrait to compare: ThorsAnvil does look like a bit less typing, but requires C++14 and pulls in more code. The above formulation, however, doesn't require anything exotic, including headers, or pulling in large amounts of outside code. The syntax also feels nice, and gives an opportunity to name the fields. (As JSON, the data can be quite large, and shorter field names can cut down the size of the JSON. It can also help with compatibility/interoperability when adapting to externally provided JSON.)

If you're not happy enough with this, and don't mind MACROs, perhaps with a bit of MACRO magic we can declare things just once. I don't give any guarantees that this will be pretty or nice or non-exotic under the hood. It's going to get ugly inside the MACROs, as it always seems to, but we might be able to avoid a bit of duplication and save a bit of typing.

So here we go:

Say this is what we would like to end up writing when we declare a class, just once, as the entire declaration and definition of these types:


    DECLARE_STRUCT(TestBaseStruct)
      DECLARE_MEMBER(int,    m_number,  9   /* default value */ )
      DECLARE_MEMBER(bool,   m_bool,    false)
    END_STRUCT()


    DECLARE_STRUCT(TestStruct)
      DECLARE_MEMBER(TestBaseStruct,  m_base)
      DECLARE_MEMBER(int32_t,         m_int1)   /* specifying a default is optional */
      DECLARE_MEMBER(int32_t,         m_int2,  100)
      DECLARE_MEMBER(float,           m_flt,   9.0)
    END_STRUCT()


To make this work, the macros need to generate our visitor function. This is how I've come up with doing it:


    #define DECLARE_STRUCT(name) \
      struct name { \
        private: \
          typedef name _this_type; \
          static const char* _this_name() { return #name; } \
        public: \
          typedef struct { \
            template <class V> \
            static void Visit(_this_type* o, V& v) { \
            }

    #define DECLARE_MEMBER(type, name, ...) \
          } blah##name; \
          \
          type name = type(__VA_ARGS__); \
          \
          typedef struct { \
            template <class V> \
            static void Visit(_this_type* o, V& v) { \
              blah##name::Visit(o, v); \
              v.Visit(#name, o->name); \
            }

    #define END_STRUCT() \
          } last; \
          template <class V> \
          void Visit(V& v) \
          { \
            v.Enter(_this_name()); \
            last::Visit(this, v); \
            v.Exit(_this_name()); \
          } \
      };


It's not pretty. I believe this will work with C++11, and, with the default-value initializer removed from DECLARE_MEMBER, probably before C++11 also. I've only tested with g++, but it will possibly work (hopefully without tweaks) with other compilers.

So in the end, we can end up with something reasonably close to what can be done in C#, but with added flexibility for options other than just serializing and deserializing XML.


    DECLARE_STRUCT(TestStruct)
      DECLARE_MEMBER(TestBaseStruct,  m_base)
      DECLARE_MEMBER(int32_t,         m_int1)   /* specifying a default is optional */
      DECLARE_MEMBER(int32_t,         m_int2,  100)
      DECLARE_MEMBER(float,           m_flt,   9.0)
    END_STRUCT()


    XMLSerializerVisitor serializer;
    TestStruct test;
    .. initialization of values ...
    serializer.Visit(test);
    printf("%s", serializer.Output().c_str());


    // Getting the hash of the objects
    MD5SumVisitor hasher;
    hasher.Visit(test);
    printf("hash: %s", hasher.hash().c_str());


    // Getting as json
    JsonVisitor jsonVisitor;
    jsonVisitor.Visit(test);
    printf("json: %s", jsonVisitor.value().toCompactString().c_str());
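
The visitors used above aren't shown in this article, so as a stand-in here is a minimal sketch of a hashing one. To keep it short it uses the 64-bit FNV-1a hash instead of MD5, and returns the hash as a number rather than a string. HashVisitor, Inner, Outer and hashExample are all names invented for this sketch, and the two structs are written by hand in the shape the macros produce.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for MD5SumVisitor, using 64-bit FNV-1a.
class HashVisitor
{
public:
    HashVisitor() : m_hash(14695981039346656037ull) {}   // FNV offset basis

    // Type names and field names are ignored; only values are hashed.
    void Enter(const char*) {}
    void Exit(const char*)  {}

    void Visit(const char*, std::int32_t& value) { mix(&value, sizeof(value)); }
    void Visit(const char*, bool& value)
    {
        unsigned char byte = value ? 1 : 0;
        mix(&byte, 1);
    }
    void Visit(const char*, std::string& value)  { mix(value.data(), value.size()); }

    // Compound members: recurse through their own Visit member.
    template <class T>
    void Visit(const char*, T& compound) { compound.Visit(*this); }

    // Entry point, so that hasher.Visit(object) works as in the usage above.
    template <class T>
    void Visit(T& root) { root.Visit(*this); }

    std::uint64_t hash() const { return m_hash; }

private:
    void mix(const void* data, std::size_t size)
    {
        const unsigned char* bytes = static_cast<const unsigned char*>(data);
        for (std::size_t i = 0; i < size; ++i)
        {
            m_hash ^= bytes[i];
            m_hash *= 1099511628211ull;                   // FNV prime
        }
    }

    std::uint64_t m_hash;
};

// Hand-written stand-ins for structs declared with the macros.
struct Inner
{
    std::int32_t number;
    bool         flag;

    template <class V>
    void Visit(V& v)
    {
        v.Enter("Inner");
        v.Visit("number", number);
        v.Visit("flag", flag);
        v.Exit("Inner");
    }
};

struct Outer
{
    Inner        base;
    std::int32_t count;

    template <class V>
    void Visit(V& v)
    {
        v.Enter("Outer");
        v.Visit("base", base);
        v.Visit("count", count);
        v.Exit("Outer");
    }
};

// Equal state gives equal hashes; flipping one field changes the hash.
inline void hashExample()
{
    Outer a = { { 9, false }, 100 };
    Outer b = a;
    HashVisitor ha, hb;
    ha.Visit(a);
    hb.Visit(b);
    assert(ha.hash() == hb.hash());

    b.base.flag = true;
    HashVisitor hc;
    hc.Visit(b);
    assert(hc.hash() != ha.hash());
}
```

Because the integers are hashed as raw bytes, the result depends on the byte order of the machine. For comparing a client against a server you would want to feed the bytes in a fixed order instead.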


So is it worth it? I'll let you be the judge. I'm somewhat partial to the formulation with the explicitly provided visitor function declared outside of the type. Although defining each member and its properties just once, in one place, does have its advantages, despite having to use MACROs.

So how does it work?

Basically, each call to DECLARE_MEMBER generates a static member function, and each of these calls the one generated before it. But how do we call the one from before? Each static function is put inside its own struct, which acts as a kind of namespace for it. The struct is opened by one macro invocation and closed by the next one, and it is named at the point it is closed, using the name of the member currently being declared (blah##name). That way the function for the current member knows what the previous struct is called and can call into it. Using a typedef of an unnamed struct is what allows the name to be supplied after the body has been written. The final struct, closed by END_STRUCT, is named 'last', so the visitor function calls last::Visit, which in turn calls the previous function, and so on back to the empty one opened by DECLARE_STRUCT. Because each function calls its predecessor before visiting its own member, the members are visited in declaration order. Hope that makes sense. I'm not sure whether there are simplifications that could shorten this formulation, but this way does appear to work.
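
It may be easier to follow with the chain written out by hand. The sketch below is roughly what a two-member struct looks like after expansion, with one liberty taken: the helper structs are given names up front (before_x, before_y) instead of being unnamed structs named afterwards by a typedef. A macro can't do that, since it doesn't yet know the next member's name when it opens the struct, but it makes the sketch easier to read. It also sidesteps a portability worry: I believe C++20 tightened the rules on member functions inside an unnamed struct that gets its name from a typedef, so newer compilers may warn about or reject that form. Point, TraceVisitor and expansionExample are names made up for this sketch.

```cpp
#include <cassert>
#include <string>

struct Point
{
    // Opened by DECLARE_STRUCT: the empty start of the chain.
    struct before_x
    {
        template <class V>
        static void Visit(Point*, V&) {}
    };

    int x = 1;

    // Opened by DECLARE_MEMBER(int, x, 1): calls the previous link,
    // then visits x.
    struct before_y
    {
        template <class V>
        static void Visit(Point* o, V& v)
        {
            before_x::Visit(o, v);
            v.Visit("x", o->x);
        }
    };

    int y = 2;

    // Opened by DECLARE_MEMBER(int, y, 2) and closed by END_STRUCT.
    struct last
    {
        template <class V>
        static void Visit(Point* o, V& v)
        {
            before_y::Visit(o, v);
            v.Visit("y", o->y);
        }
    };

    // Added by END_STRUCT: the public entry point.
    template <class V>
    void Visit(V& v)
    {
        v.Enter("Point");
        last::Visit(this, v);
        v.Exit("Point");
    }
};

// Hypothetical visitor that records the order of the calls it receives.
struct TraceVisitor
{
    std::string log;

    void Enter(const char* type) { log += std::string("[") + type; }
    void Exit(const char*)       { log += "]"; }
    void Visit(const char* name, int& value)
    {
        log += " " + std::string(name) + "=" + std::to_string(value);
    }
};

// Members come out in declaration order, with their default values.
inline void expansionExample()
{
    Point p;
    TraceVisitor trace;
    p.Visit(trace);
    assert(trace.log == "[Point x=1 y=2]");
}
```

All of the helper functions are tiny static templates, so a compiler can reasonably be expected to inline the whole chain into a flat sequence of Visit calls.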


Conclusion


Instead of duplication, such as doing this:


In the header:

    DECLARE_CLASS(Color)
       DECLARE_MEMBER(int, red)
       DECLARE_MEMBER(int, green)
       DECLARE_MEMBER(int, blue)
    END_DECLARE_CLASS(Color)

And then in a CPP file duplicating similar/same information:

    DEFINE_CLASS(Color)
       DEFINE_MEMBER(int, red)
       DEFINE_MEMBER(int, green)
       DEFINE_MEMBER(int, blue)
    END_DEFINE_CLASS(Color)


Alternatively, this way has duplication too:

    struct Person
    {
      int32_t        id;
      std::string    name;
      std::string    email;
      uint64_t       phone;
    };

Then in the visitor, we have to name the members again:

    template <class V>
    void Visit(V& v, Person& p)
    {
      v.Enter("person");
      v.Visit("id",p.id);
      v.Visit("name",p.name);
      v.Visit("email",p.email);
      v.Visit("number",p.phone);
      v.Exit("person");
    }


Instead we can just do this:


    DECLARE_STRUCT(Color)
       DECLARE_MEMBER(int, red)
       DECLARE_MEMBER(int, green)
       DECLARE_MEMBER(int, blue)
    END_STRUCT()


And done. No duplicated info anywhere. We also saw how this isn't locked in to just serializing/deserializing, or to doing so for a certain format such as JSON/XML. It is not tightly coupled to a serialization implementation. Other algorithms can be applied to the objects, such as hashing. If this is important, it can preserve POD data as POD, provided you refrain from giving the members initializer values (which needs a minor change to that MACRO to not do that; I think it is perhaps the only C++11 specific thing in the macros too).

This reminds me: in case you aren't using static_asserts, they are really useful. There is little reason not to use them, as they are only an overhead when compiling and have zero overhead on the size and speed of the generated code, but they allow catching errors at the right time, at compile time instead of at runtime. The type of error they can catch is not limited to what can be evaluated by the preprocessor. For example, if you want to ensure a given type is POD and stays POD, you can statically assert this, so that if someone else came along and modified a struct to make it non-POD, the code would refuse to compile because of the static_assert. Nice, isn't it? You can annotate your code with assertions about the kind of properties you want for a type and have them enforced. No need to let other people guess what your intent is, or inadvertently break the performance of critical code/data. In the case of asserting something is POD, you would do it like this:


    #include <type_traits>  // for std::is_pod
    static_assert(std::is_pod<Color>::value == true, "Color not pod");


If the compiler is pre-C++11 and doesn't support static_assert, it can be emulated with a macro, just google for 'static_assert macro' for one of many options.
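
For reference, one common way such a macro is built is to declare an array type whose size is negative when the condition is false, which the compiler has to reject. The sketch below is just one of those options; MY_STATIC_ASSERT and the tag argument are made up for this example, and the tag has to be a valid identifier that is unique within its scope.

```cpp
// Hypothetical pre-C++11 stand-in for static_assert.
// When 'cond' is false the array size is -1 and compilation fails,
// with the tag showing up in the error as part of the typedef name.
#define MY_STATIC_ASSERT(cond, tag) \
    typedef char static_assertion_##tag[(cond) ? 1 : -1]

struct Color
{
    int red;
    int green;
    int blue;
};

// Passes: declares a harmless typedef for char[1].
MY_STATIC_ASSERT(sizeof(Color) == 3 * sizeof(int), Color_is_three_ints);

// Changing the condition to something false, for example
//   MY_STATIC_ASSERT(sizeof(Color) == 1, Color_is_one_byte);
// would stop the build with an error about a negative array size.
```

The error message is nowhere near as nice as the one from a real static_assert, but the effect is the same: the mistake is caught when compiling rather than when running.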


In another article, I can elaborate on implementing visitors. They are pretty easy, but there are some template tricks that are needed. I have a simple XML serializer which is under 80 lines of code and a simple JSON one of similar size, but I'll save this for later.