Serialization and deserialization are operations that the modern developer treats as trivial. We talk to databases, send HTTP requests, receive data through REST APIs, and rarely think about how any of it works. Today I suggest writing our own JSON serializer and deserializer to find out what's under the hood.
Disclaimer
As last time, let me note up front: we will be writing a primitive serializer, a reinvented wheel, so to speak. If you need a turnkey solution, use Json.NET. Its authors have released a wonderful product that is highly customizable, does a lot, and has already solved the problems that come up when working with JSON. Rolling your own solution only makes sense if you need maximum performance, special customization, or you simply enjoy reinventing wheels the way I do.
Subject area
A service that converts between JSON and an object representation consists of at least two subsystems. The deserializer is the subsystem that turns valid JSON (text) into an object representation inside our program; deserialization involves tokenization, that is, parsing the JSON into logical elements. The serializer is the subsystem that performs the inverse task: it turns the object representation of the data back into JSON.
A consumer most often sees the following interface. I deliberately simplified it down to the two methods that are used most.
```csharp
public interface IJsonConverter
{
    T Deserialize<T>(string json);

    string Serialize(object source);
}
```
“Under the hood,” deserialization includes tokenization (parsing the JSON text) and building primitives that make it easier to construct the object representation later. For training purposes we will skip the intermediate primitives (for example, JObject and JProperty in Json.NET) and write data straight into the object. This is a drawback, since it reduces the options for customization, but it is impossible to build a whole library within a single article.
Tokenization
Let me remind you that tokenization, or lexical analysis, is the parsing of text with the aim of obtaining a different, more rigorous representation of the data it contains. This representation usually consists of tokens, or lexemes. For the purposes of parsing JSON, we must pick out the properties, their values, and the symbols marking the beginning and end of structures, that is, tokens, which can be represented in code as a JsonToken.
JsonToken is a structure that contains a value (text) as well as the token type. JSON is a strict notation, so all token types can be reduced to a single enum. Of course, it would be nice to also store the token's coordinates in the input (row and column), but debugging is beyond the scope of this implementation, so JsonToken does not carry that data.
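To make this concrete, here is a minimal sketch of what the token structure and the enum might look like. The names here are assumptions for illustration; the real code in the repository may differ:

```csharp
// Hypothetical sketch of the token types covering the JSON grammar.
public enum JsonTokenType
{
    ObjectStart, ObjectEnd,   // { and }
    ArrayStart, ArrayEnd,     // [ and ]
    Property,                 // a property name
    String, Number,
    True, False, Null
}

// A token is just its type plus the raw text (empty for structural tokens).
public readonly struct JsonToken
{
    public readonly JsonTokenType Type;
    public readonly string Value;

    public JsonToken(JsonTokenType type, string value = "")
    {
        Type = type;
        Value = value;
    }
}
```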
So, the easiest way to split text into tokens is to read each character sequentially and compare it against patterns: we need to understand what a given character means. Perhaps a keyword (true, false, null) begins with it, perhaps it opens a string (a quotation mark), or perhaps the character is a token all by itself ([, ], {, }). The general idea looks like this:
```csharp
var tokens = new List<JsonToken>();
for (int i = 0; i < json.Length; i++)
{
    char ch = json[i];
    switch (ch)
    {
        case '[':
            tokens.Add(new JsonToken(JsonTokenType.ArrayStart));
            break;
        case ']':
            tokens.Add(new JsonToken(JsonTokenType.ArrayEnd));
            break;
        case '"':
            string stringValue = ReadString();
            tokens.Add(new JsonToken(JsonTokenType.String, stringValue));
            break;
        // ...
    }
}
```
Looking at this code, it becomes clear that we can act on each piece of data as soon as it is read: tokens do not need to be stored, they can be handed to the consumer immediately. An IEnumerator-style design that parses the text piece by piece suggests itself. First, this reduces allocation, since we no longer need to keep intermediate results (an array of tokens). Second, it speeds things up: in our example the input is a string, but in a real situation it would be replaced by a Stream (from a file or the network) that we read sequentially.
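A streaming tokenizer along these lines can be sketched with a C# iterator. This is a deliberately simplified version (no escape sequences, no numbers or keywords), with compact copies of the token types so the sketch compiles on its own; the names are illustrative:

```csharp
using System;
using System.Collections.Generic;

public enum JsonTokenType { ObjectStart, ObjectEnd, ArrayStart, ArrayEnd, String }

public readonly struct JsonToken
{
    public readonly JsonTokenType Type;
    public readonly string Value;
    public JsonToken(JsonTokenType type, string value = "") { Type = type; Value = value; }
}

public static class StreamingTokenizer
{
    // Yields tokens one by one; nothing is accumulated, so the consumer
    // can start working before the whole input has been scanned.
    public static IEnumerable<JsonToken> Tokenize(string json)
    {
        for (int i = 0; i < json.Length; i++)
        {
            switch (json[i])
            {
                case '{': yield return new JsonToken(JsonTokenType.ObjectStart); break;
                case '}': yield return new JsonToken(JsonTokenType.ObjectEnd); break;
                case '[': yield return new JsonToken(JsonTokenType.ArrayStart); break;
                case ']': yield return new JsonToken(JsonTokenType.ArrayEnd); break;
                case '"':
                    int end = json.IndexOf('"', i + 1);   // simplification: no \" escapes
                    yield return new JsonToken(JsonTokenType.String, json.Substring(i + 1, end - i - 1));
                    i = end;
                    break;
                // whitespace, commas, colons, numbers and keywords are omitted for brevity
            }
        }
    }
}
```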
I have prepared the JsonTokenizer code, which can be found here. The idea is the same: the tokenizer walks along the input, trying to determine what a character or sequence of characters refers to. Once it can tell, it creates a token and hands control to the consumer; if it cannot tell yet, it reads on.
Preparing to Deserialize Objects
Most often, a request to convert data from JSON is a call to the generic Deserialize method, where TOut is the data type onto which the JSON tokens should be mapped. And where there is a Type, it is time to apply Reflection and expression trees. I covered the basics of working with expression trees, and why compiled expressions beat "bare" Reflection, in a previous article about how to make your own AutoMapper. If Expression.Lambda(...).Compile() means nothing to you, I recommend reading it first; the mapper example, I think, explains it quite clearly.
So, the plan for the object deserializer rests on the fact that at any time we can get the property types of TOut, that is, its PropertyInfo collection. At the same time, property types are constrained by JSON notation: numbers, strings, arrays, and objects. Even counting null, there are fewer cases than it might seem at first glance. And while for each primitive type we have to create a separate deserializer, for arrays and objects we can write generic classes. With a little thought, all serializer-deserializers (or converters) can be reduced to the following interface:
```csharp
public interface IJsonConverter<T>
{
    T Deserialize(JsonTokenizer tokenizer);

    void Serialize(T value, StringBuilder builder);
}
```
The code of a strongly typed converter for a primitive type is as simple as it gets: extract the current JsonToken from the tokenizer and turn it into a value by parsing, for example float.Parse(currentToken.Value). Take a look at BoolConverter or FloatConverter: nothing complicated. Later, if you need a deserializer for bool? or float?, it can be added in the same way.
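As an illustration, here is roughly what such a primitive converter could look like. Note this is a simplification: the signature takes the raw token text instead of the tokenizer, and the names are assumptions rather than the repository code. Using CultureInfo.InvariantCulture matters, because JSON always uses a dot as the decimal separator regardless of the machine's locale:

```csharp
using System;
using System.Globalization;
using System.Text;

public sealed class FloatConverter
{
    // JSON numbers always use '.' as the decimal separator,
    // so parsing must not depend on the current culture.
    public float Deserialize(string tokenValue) =>
        float.Parse(tokenValue, CultureInfo.InvariantCulture);

    public void Serialize(float value, StringBuilder builder) =>
        builder.Append(value.ToString(CultureInfo.InvariantCulture));
}
```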
Array deserialization
The generic class that converts an array from JSON is also relatively simple. It is parameterized by the element type, which we can extract with Type.GetElementType(). Determining that a type is an array is just as simple: Type.IsArray. Array deserialization comes down to calling tokenizer.MoveNext() until a token of type ArrayEnd is reached. Deserializing the array's elements means deserializing values of the element type, so when an ArrayConverter is created, the element deserializer is passed into it.
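The reading loop can be sketched in a simplified form. To keep the sketch tiny, I use plain strings in place of JsonToken and "]" as the ArrayEnd marker; the real ArrayConverter works with the tokenizer and the injected element deserializer:

```csharp
using System;
using System.Collections.Generic;

public static class ArrayReader
{
    // Consumes tokens until the end-of-array marker; each element token is
    // converted by the injected reader, mirroring how ArrayConverter receives
    // the element deserializer in its constructor.
    public static T[] ReadArray<T>(IEnumerator<string> tokens, Func<string, T> readElement)
    {
        var items = new List<T>();
        while (tokens.MoveNext() && tokens.Current != "]")
            items.Add(readElement(tokens.Current));
        return items.ToArray();
    }
}
```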
Sometimes instantiating generic implementations causes difficulties, so let me show right away how to do it. Reflection allows you to construct generic types at runtime, which means we can use the constructed type as an argument to Activator.CreateInstance. Let's take advantage of this:
```csharp
Type elementType = arrayType.GetElementType();
Type converterType = typeof(ArrayConverter<>).MakeGenericType(elementType);
// args is an object[] with the constructor arguments (here, the element converter)
object converterInstance = Activator.CreateInstance(converterType, args);
```
To finish the preparations for the object deserializer, we can put all the infrastructure code for creating and storing deserializers into a facade, JConverter. It is responsible for all JSON serialization and deserialization operations and is exposed to consumers as a service.
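A minimal sketch of that facade's converter storage, assuming a dictionary keyed by Type (the names and details here are illustrative, not the actual JConverter code):

```csharp
using System;
using System.Collections.Generic;

// A stand-in primitive converter for the sketch.
public sealed class IntConverter
{
    public int Deserialize(string token) => int.Parse(token);
}

public sealed class JConverterFacade
{
    // Primitive converters are registered once in the constructor;
    // array and object converters would be created on demand and cached.
    private readonly Dictionary<Type, object> _converters = new Dictionary<Type, object>();

    public JConverterFacade()
    {
        _converters[typeof(int)] = new IntConverter();
        // ... register bool, float, string and the rest here
    }

    public object GetConverter(Type type)
    {
        if (!_converters.TryGetValue(type, out object converter))
            throw new NotSupportedException($"No converter registered for {type}.");
        return converter;
    }
}
```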
Object deserialization
Let me remind you that all the properties of a type T can be obtained like this: typeof(T).GetProperties(). For each property we can extract PropertyInfo.PropertyType, which lets us create a typed IJsonConverter for serializing and deserializing data of that particular type. If the property type is an array, we instantiate an ArrayConverter or find a suitable one among the existing converters. If the property type is primitive, its deserializer (converter) has already been created in the JConverter constructor.
The resulting code can be seen in the generic ObjectConverter class. Its constructor creates an activator, extracts the properties into a specially prepared dictionary, and builds a deserialization method for each of them: an Action<TObject, JsonTokenizer>. This is needed, first, to bind each IJsonConverter to the right property immediately, and second, to avoid boxing when reading and writing primitive types. Each deserialization method knows exactly which property of the resulting object it writes to; the value deserializer is strongly typed and returns the value in exactly the form it is needed.
The binding of an IJsonConverter to a property is as follows:
```csharp
Type converterType = propertyValueConverter.GetType();
ConstantExpression converter = Expression.Constant(propertyValueConverter, converterType);
MethodInfo deserializeMethod = converterType.GetMethod("Deserialize");
var value = Expression.Call(converter, deserializeMethod, tokenizer);
```
Expression.Constant creates a constant directly in the expression tree that stores a reference to the deserializer instance for the property value. This is not quite the constant we write in "regular C#", since it can hold a reference type. Next, the Deserialize method, which returns a value of the desired type, is retrieved from the deserializer type and then invoked with Expression.Call. We end up with a method that knows exactly where to write and what. All that remains is to put it into the dictionary and call it whenever a Property token with the matching name arrives from the tokenizer. As a bonus, all of this works very fast.
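To make the expression part tangible, here is a self-contained example of compiling a typed property setter. The Person type is purely illustrative; the point is that the compiled Action writes an int without boxing, which is exactly the trick the deserialization methods rely on:

```csharp
using System;
using System.Linq.Expressions;
using System.Reflection;

public class Person
{
    public int Age { get; set; }
}

public static class SetterDemo
{
    // Builds the equivalent of (target, value) => target.Age = value.
    // Unlike PropertyInfo.SetValue(object, object), the compiled delegate
    // is strongly typed, so writing an int does not box it.
    public static Action<Person, int> BuildAgeSetter()
    {
        PropertyInfo property = typeof(Person).GetProperty("Age");
        ParameterExpression target = Expression.Parameter(typeof(Person), "target");
        ParameterExpression value = Expression.Parameter(typeof(int), "value");
        BinaryExpression body = Expression.Assign(Expression.Property(target, property), value);
        return Expression.Lambda<Action<Person, int>>(body, target, value).Compile();
    }
}
```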
How fast is it?
As noted at the very beginning, reinventing a wheel makes sense in a few cases: as an attempt to understand how the technology works, or to achieve some special results, for example, speed. You can verify that the deserializer really deserializes with the prepared tests (I use AutoFixture to generate test data). By the way, you have probably noticed that I also wrote object serialization; since the article turned out quite long already, I will not describe it and will just show the benchmarks. As with the previous article, the benchmarks are written with the BenchmarkDotNet library.
Naturally, I compared the deserialization speed with Newtonsoft.Json (Json.NET), as the most common and recommended solution for working with JSON. Their website even states: 50% faster than DataContractJsonSerializer, and 250% faster than JavaScriptSerializer. In short, I wanted to know how badly my code would lose. The results surprised me: allocation turned out to be almost three times lower, and deserialization about two times faster.
Comparing speed and allocation during serialization yielded even more interesting results: the hand-rolled serializer allocated almost five times less and worked almost three times faster. If speed really mattered to me (really, really mattered), that would be a clear success.
That said, when measuring I did not use the performance tips published on the Json.NET website. I measured the out-of-the-box experience, that is, the most commonly used scenario: JsonConvert.DeserializeObject. There may be other ways to improve its performance that I am not aware of.
Conclusions
Despite the relatively high serialization and deserialization speed, I would not recommend abandoning Json.NET in favor of your own solution. The gain is measured in milliseconds, and those easily "drown" in network delays, disk I/O, or the code that sits above the place where serialization happens. Maintaining such homegrown solutions is hell, and only developers who know the subject well should be allowed near them.
The niche for such hand-rolled code is applications designed end to end for high performance, or pet projects where you want to understand how a particular technology works. I hope I have helped you a little with the latter.