Tuesday 19 April 2016

SynchronizationContext

As promised in a previous post, it's time to talk about the SynchronizationContext, something that at first I found a bit confusing. It's not really that complex, but having some notes to come back to when needed will come in handy.

The main point is: in what thread will the "continuation code" (the code after an await, or the delegate passed to ContinueWith) run? Sometimes we don't care, but on other occasions we need it to run in a specific thread (basically the UI thread).

This article is essential to understand the whole thing. If you are using await, the compiler already takes care of running the continuation in the correct thread (if necessary). It does so by capturing the current context at the await point (so it captures the context of the calling thread), and then, when the continuation must run, if this captured context is not null the continuation code is dispatched to it. Think of the SynchronizationContext as a sort of task scheduler. If you are not using await but calling ContinueWith directly, you don't get this for free; you'll have to pass a scheduler in one of the ContinueWith overloads.

As explained in that article, you can think of the following code:

await FooAsync();
RestOfMethod();

as being similar in nature to this:

var t = FooAsync();
var currentContext = SynchronizationContext.Current;
t.ContinueWith(delegate
{
    if (currentContext == null)
        RestOfMethod();
    else
        currentContext.Post(delegate { RestOfMethod(); }, null);
}, TaskScheduler.Current);

After reading this impressive post about how waiting for an I/O operation to complete actually works, one question comes to mind: the "if" conditional inside the ContinueWith, in what thread does it run?

Since the library/BCL is using the standard P/Invoke overlapped I/O system, it has already registered the handle with the I/O Completion Port (IOCP), which is part of the thread pool. So an I/O thread pool thread is borrowed briefly to execute the APC, which notifies the task that it’s complete.

The task has captured the UI context, so it does not resume the async method directly on the thread pool thread. Instead, it queues the continuation of that method onto the UI context, and the UI thread will resume executing that method when it gets around to it.

One could think that it runs in the I/O thread pool, but that would mean that "RestOfMethod" also runs there when the context is null, which would seem strange to me. So I would say (but it's just an assumption, I could be completely wrong) that the I/O thread pool thread hands the condition over to another thread, which then either continues there or dispatches it to the corresponding thread for that context.
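
One way to explore this yourself is a small console sketch (the FooAsync placeholder below is just a Task.Delay I'm making up for the test; this is not the answer to the question, just a way to print on which thread the conditional ends up running):

using System;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    // Placeholder for FooAsync in the snippet above, just an async pause
    static Task FooAsync() => Task.Delay(500);

    static string Describe() =>
        string.Format("thread {0}, pool thread: {1}, context: {2}",
            Thread.CurrentThread.ManagedThreadId,
            Thread.CurrentThread.IsThreadPoolThread,
            SynchronizationContext.Current == null ? "null" : SynchronizationContext.Current.GetType().Name);

    static void Main()
    {
        Console.WriteLine("Before the await point: " + Describe());

        var t = FooAsync();
        var currentContext = SynchronizationContext.Current;

        t.ContinueWith(delegate
        {
            // This is where the "if" conditional of the expansion runs
            Console.WriteLine("Inside ContinueWith: " + Describe());

            if (currentContext == null)
                Console.WriteLine("No captured context, RestOfMethod would run right here");
            else
                currentContext.Post(delegate { Console.WriteLine("RestOfMethod posted to the captured context"); }, null);
        }, TaskScheduler.Current);

        Console.ReadLine();
    }
}

In a console application the current SynchronizationContext is null, so this only shows on which thread the conditional itself runs; under WinForms or WPF the captured context would be non-null and the Post would marshal RestOfMethod back to the UI thread.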

The other fundamental element of the SynchronizationContext story is the ConfigureAwait method. If you call it with false, the dispatch to the captured context will not be done. This is important for library code. If you are writing code that creates tasks in your library, you should not care there about capturing the synchronization context and dispatching to it; that is a concern that belongs only at the upper level (the one managing the UI). Let's see:

//UI, button handler
string st = await new Downloader(this.logger).DownloadAndFormat(address);
this.ResultDisplay.Text = st;

//library code (in the Downloader class)
public async Task<string> DownloadAndFormat(string address)
{
    HttpClient client = new HttpClient();
    HttpResponseMessage response = await client.GetAsync(address);
    string formatted = await response.Content.ReadAsStringAsync();
    return formatted.Trim().ToUpper();
}

So the UI thread calls into DownloadAndFormat and keeps executing until the await client.GetAsync call. As we've seen, the compiler will capture the SynchronizationContext and execute the continuation in it. As the await happens in the UI thread, the continuation will be dispatched there too, and the same will happen with the next continuation (the one after ReadAsStringAsync). In principle we should not care about those continuations running in the UI thread; it's not necessary, but it should not cause trouble, right? Well, as you can see in different articles, if the caller (the code in the UI handler) did something a bit weird, we would get into a deadlock. How?

//UI, button handler
string st = new Downloader(this.logger).DownloadAndFormat(address).Result;
this.ResultDisplay.Text = st;
 
 

The above code looks a bit strange: rather than awaiting, it blocks on the .Result access. Weird, but valid code. So there the UI thread is blocked waiting for the Result of the Task, but in the DownloadAndFormat method the continuation is being dispatched to the captured context, so it is also blocked waiting for the UI thread; hence, we have a deadlock!

To avoid this, our library code should ensure the continuation is not dispatched to the UI thread, by doing:


//library code (in the Downloader class)
public async Task<string> DownloadAndFormat(string address)
{
    HttpClient client = new HttpClient();
    HttpResponseMessage response = await client.GetAsync(address).ConfigureAwait(false);
    string formatted = await response.Content.ReadAsStringAsync().ConfigureAwait(false);
    return formatted.Trim().ToUpper();
}

Apart from the risk of deadlocks, dispatching code to the UI thread when it could just run in a thread pool thread can hurt performance (your UI thread can already have enough real work to do). You can read about all this here.

Friday 15 April 2016

64 bits and Performance

I'd never thought too much about the performance implications of compiling an application as 64 bits rather than 32. For .Net applications the thought process was simple: am I using any native component that forces me to target one specific architecture? If not, just set it to AnyCPU and the runtime will use the corresponding architecture for that machine when JITting. This seems to indicate that if the architecture is x64 it's always better to compile to 64.
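
As a quick check (just a small sketch, nothing specific to any particular project), you can ask the runtime at execution time what an AnyCPU binary ended up running as:

using System;

class BitnessCheck
{
    static void Main()
    {
        // IntPtr.Size is 4 in a 32 bit process and 8 in a 64 bit one
        Console.WriteLine("Pointer size: " + IntPtr.Size + " bytes");
        Console.WriteLine("64 bit process: " + Environment.Is64BitProcess);
        Console.WriteLine("64 bit OS: " + Environment.Is64BitOperatingSystem);
    }
}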

Of course, if your application is heavy and can need more than 2-3 GBs of RAM, for sure you have to set it to 64 bits, but otherwise you'd better think twice. Apart from using 64 bit memory addresses and extending the existing registers to 64 bits, x64 also added 8 new general purpose registers (r8 to r15). If your application does some heavy calculations it can take advantage of these extra registers and gain in performance. OK, good, so what are the downsides?

Basically, your application will consume much more memory! Why? Well, objects are made up of some data and references (pointers) to other objects. Good practices tell us to prefer Composition over Inheritance, so more and more our objects point to many other objects. Each reference is now 64 bits rather than 32, so that is going to make a difference (though for sure the overall memory consumption will not simply double, as numbers and strings take up the same space as in 32 bits).

There's another important point to bear in mind. Each .Net instance of a reference type has a header with 2 fields: the SyncBlock and the RTTI (vTable if you want to keep it simple) address. While in 32 bits this header takes 8 bytes, in x64 it takes 16 bytes. You can read more here and here. It's interesting that they mention that references point to the second field rather than the first, so the SyncBlock sits at a negative offset.

The sync block sits at a negative offset from the object pointer. The first field at offset 0 is the method table pointer, 8 bytes on x64. So on x86 it is SB + MT + X + Y = 4 + 4 + 4 + 4 = 16 bytes. The sync block index is still 4 bytes in x64. But the object header also participates in the garbage collected heap, acting as a node in a linked list after it is released. That requires a back and a forward pointer, each 8 bytes in x64, thus requiring 8 bytes before the object pointer. 8 + 8 + 4 + 4 = 24 bytes.

So your object is likely laid out like this:

x86: (aligned to 8 bytes)

  Syncblk | TypeHandle |  X  |  Y
  --------+------------+-----+-----
     4    |     4      |  4  |  4      -> 16 bytes in total

x64: (aligned to 8 bytes)

  Syncblk | TypeHandle |  X  |  Y
  --------+------------+-----+-----
     8    |     8      |  4  |  4      -> 24 bytes in total

I've never been particularly concerned about the memory consumption of my applications, but there are things that are important to bear in mind. Let's say that you have a class with 2 integer data fields. In a 64 bit application each instance will take 16 bytes of header + 4 + 4 (for the 2 integers), that is 24 bytes. If you were using a struct (value type) rather than a class, as there is no header, it would be just 8 bytes, 3 times less! Furthermore, if you are putting these objects in an array (or a List, as it's based on arrays), there's one more difference. With a class, the collection holds references to the instances, so each element costs 8 bytes for the reference plus the 24 bytes of the instance, 32 bytes in total; with a struct, the values are embedded in the array itself, so there's no extra level of indirection and we stay at 8 bytes per element (4 times less). If you have many instances of these objects the difference in memory pressure will be more than noticeable.
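
To get a feeling for these numbers you can do a rough measurement yourself (just a sketch: the PointClass/PointStruct types are invented for the example, and GC.GetTotalMemory only gives an approximation):

using System;

class PointClass { public int X; public int Y; }
struct PointStruct { public int X; public int Y; }

class MemoryComparison
{
    const int Count = 1000000;

    static void Main()
    {
        Console.WriteLine($"Classes: {Measure(FillWithClasses):N0} bytes");
        Console.WriteLine($"Structs: {Measure(FillWithStructs):N0} bytes");
    }

    // Runs the allocation and returns the approximate growth of the managed heap
    static long Measure(Func<object> allocate)
    {
        long before = GC.GetTotalMemory(forceFullCollection: true);
        object keepAlive = allocate();
        long after = GC.GetTotalMemory(forceFullCollection: true);
        GC.KeepAlive(keepAlive);
        return after - before;
    }

    static object FillWithClasses()
    {
        var items = new PointClass[Count];
        for (int i = 0; i < Count; i++)
            items[i] = new PointClass { X = i, Y = i };   // 8 byte reference + 24 byte instance (x64)
        return items;
    }

    static object FillWithStructs()
    {
        var items = new PointStruct[Count];
        for (int i = 0; i < Count; i++)
            items[i] = new PointStruct { X = i, Y = i };  // 8 bytes embedded directly in the array
        return items;
    }
}

On a 64 bit run the class figure should come out at roughly 4 times the struct one, matching the 32 vs 8 bytes per element estimate above; in a 32 bit process the gap is smaller.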

This article about Visual Studio sticking to 32 bits is a good link to close this post.

Friday 8 April 2016

Interpreters and JITs

The other day I came across this interesting article about the improvements to the Android runtime. Summarizing, it will start to run an application via the interpreter. Then some parts will be JITted, injecting some profiling code, so that when the machine is idle and charging the hot sections of the code will be recompiled. This recompilation can be applied multiple times, and I think the compiled code is saved, not just kept in memory, so further executions already benefit from this.

Years ago I would think of a world split between Interpreters and JITs. It was while reading about the HotSpot JVM that I found that both worlds could be mixed. You start by interpreting the code so that you don't have any initial delay because of compilation, and when it's detected that a method runs very often the JIT compiles it. The astonishing feature provided by the HotSpot JVM is that once a method has been compiled it can be further optimized based on runtime information and "hot swapped". Another interesting feature that I have just learnt about is that it can also use OSR (on stack replacement). This means that a method that is being run only once, but for a long while (a big loop), will be detected as a hot spot and replaced even while it is running! Another cool feature is that, as in some cases optimizations can turn out to be based on wrong assumptions, a method can also be replaced by a deoptimized version. You'll find this paragraph interesting:

Remember how HotSpot works. It starts by running your program with an interpreter. When it discovers that some method is "hot" -- that is, executed a lot, either because it is called a lot or because it contains loops that loop a lot -- it sends that method off to be compiled. After that one of two things will happen, either the next time the method is called the compiled version will be invoked (instead of the interpreted version) or the currently long running loop will be replaced, while still running, with the compiled method. The latter is known as "on stack replacement", or OSR.

The .Net runtime does not use an interpreter, just a JIT (and also AOT compilation via ngen). I could be wrong, but this gives you the impression that it's less advanced than the JVM HotSpot. In .Net 4.6 a new, more performant JIT (RyuJIT) is used. From what I've read it's a traditional JIT; neither interpretation nor hot swapping has been added to the mix. Bearing in mind that RyuJIT has been in the works for quite a few years and that Microsoft has put a lot of effort into it, I assume they consider that hot swapping is not necessary for huge performance gains.

Reading about all this has woken up my interest in how modern JavaScript engines work (interpreter, JIT, both...).

  • V8: (Chrome, standard node.js) Rather than an interpreter and a JIT, it includes a fast, non-optimizing JIT to start with, and then hot methods are compiled via a slower, optimizing JIT. Copy pasted from somewhere on the net:

    V8 never interprets, it always compiles. The first compiler is a very fast, very slim compiler that starts up very quick. The code it produces isn't very fast, though. This compiler also injects profiling code into the code it generates. The other compiler is slower and uses more memory, but produces much faster code, and it can use the profiling information collected by running the code compiled by the first compiler.

  • Mozilla SpiderMonkey: Mozilla's runtime has evolved quite a bit over the years. It started as an interpreter, then they added a particular kind of JIT, a tracing JIT (sequences of code are compiled rather than whole methods), and then they replaced it with a conventional per-method JIT (alongside the interpreter).
  • Microsoft's ChakraCore: From this overview it seems pretty advanced. It uses an interpreter for fast start-up and 2 multithreaded JITs, a fast one and an optimized one. It seems that apart from JITting methods it can also JIT specific loops. Of course the compiled code can be hot-swapped.

If you want to read more about Interpreters and JITs, this makes a good reading.