Durable Task Framework Episode I - The Problem of Robust Execution


Imagine you are writing a piece of software that looks like the following code snippet:

MyProgram.Run(args)
{
      var result1 = Task1();
      if (result1 == 1)   // comparison, not assignment
          Task2();
      else
          Task3();
}

Imagine now that this program needs to be started multiple times, possibly with different ‘args’. The idea behind starting it multiple times is to scale horizontally and increase throughput: you get more work done in parallel by starting more and more instances. This approach is usually called “Load Balancing”, and it is how a web server works. The only difference is that a web server does not spawn multiple instances; it creates multiple threads instead.

Long Running and Robust Execution

But let’s now make a slightly different assumption. The program no longer executes quickly, and it may not even be a small one. For example, Task1 can take 2 minutes and Task2 can take 3 minutes. Such programs are called “long running”. The longer a program runs, the higher the probability, in theory and in practice, that an error occurs while it is executing.

That could mean that Task1 completes successfully and, immediately after completion, the machine where Task1 was running happens to crash. As a traditional developer you would probably simply start the program again, and Task1 would be executed again. In most cases this is not a problem, but what if Task1 contacts one or more external applications that do not support repeating the same task? In the world of integration this is the correct default assumption. Remember also our other assumption: tasks execute for a long time, usually because they have a large amount of work to complete, so repeating them is expensive.

A better approach would be to start a new machine, deploy the solution there and start it again, so that the program continues exactly after Task1 and keeps the previously calculated result. Wouldn’t that be good? Please try to think for a moment about how to implement such behavior.
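One possible answer, as a minimal sketch rather than the way any particular framework does it: persist the result of each step to a checkpoint and skip steps that are already recorded when the program starts again. The function names (`run_with_checkpoint`, `save_state`), the JSON file and the `done` flag are assumptions made for this illustration.

```python
import json
import os


def save_state(path, state):
    """Write the checkpoint to a temporary file, then swap it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # the old checkpoint is replaced in a single step


def run_with_checkpoint(path, task1, task2, task3):
    """Run Task1, then Task2 or Task3, skipping steps already recorded in `path`."""
    state = {}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)

    # Task1 runs only if no earlier run has recorded its result.
    if "result1" not in state:
        state["result1"] = task1()
        save_state(path, state)

    # The branch runs only if it has not been recorded as finished.
    if not state.get("done"):
        if state["result1"] == 1:
            task2()
        else:
            task3()
        state["done"] = True
        save_state(path, state)
    return state
```

If the checkpoint file lives on storage that survives the machine, a restarted program continues after Task1 and reuses its result. The sketch still has a gap: a crash after Task1 finishes but before the checkpoint is written would repeat Task1.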

The previous scenario described what happens in the case of a failure. But sometimes the program is required to stop after Task1 has completed, because Task2 cannot run yet: a business rule forbids it, or some system is not available. This is typically the case when you integrate different systems. For example, Task2 should send an SMS, but the SMS gateway is currently down because somebody is installing a patch. In such scenarios ‘somebody’ most likely does not mean you.
To reduce this organizational dependency, you must build a system that is robust enough to survive such situations, which are unwanted but simply happen in real life.
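To make the SMS example concrete, here is a small sketch of a step that stays pending while its dependency is down and is simply attempted again later. The exception `GatewayUnavailable`, the function `attempt_pending_step` and the layout of `record` are hypothetical names for this illustration; in a real system the record would be stored durably rather than kept in memory.

```python
class GatewayUnavailable(Exception):
    """Raised by a task when an external system cannot be reached (hypothetical)."""


def attempt_pending_step(record, task):
    """Try to run the pending step once; keep it waiting if the dependency is down.

    `record` stands in for a persisted row such as
    {"step": "Task2", "status": "pending", "attempts": 0}.
    Returns True if the step completed, False if it has to be tried again later.
    """
    record["attempts"] += 1
    try:
        task()
    except GatewayUnavailable:
        # Nothing is lost: the step is still recorded and a later run retries it.
        record["status"] = "waiting"
        return False
    record["status"] = "completed"
    return True
```

The program does not fail when the gateway is unavailable. It records that Task2 is still open and ends the current run, and whoever restarts it later does not need to know why it stopped.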

So the industry found a solution. It would be cool if the operating system were able to execute exactly this kind of application by default. Because we already have an operating system, the industry decided to write something lightweight on top of it that is capable of executing such programs. The piece of software that orchestrates such execution is typically called a scheduler.

The scheduler runs your program, but before every task is called, a message is sent to the message box (typically a database that stores messages). The scheduler then reads message m1 (start Task1) and executes Task1. After that, the scheduler evaluates the condition (if (result1 == 1)) and sends a new message m2 to the store. If the machine crashes at this moment, the message remains persisted in the store. Once the machine is running again, the scheduler reads the next message m2 and starts executing Task2.
There is no data loss, because everything the system needs is stored in the message.
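The message flow described above can be sketched with a small SQLite table acting as the message box. This is an illustration under assumptions, not the API of the Durable Task Framework: the table layout and the names `open_store`, `send` and `run_scheduler` are invented for this example, and the routing rule (Task2 if the result is 1, otherwise Task3) is hard-coded to match the snippet at the top of the article.

```python
import json
import sqlite3


def open_store(path):
    """Open (or create) the message box: one row per message."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "task TEXT NOT NULL, "
        "payload TEXT NOT NULL, "
        "done INTEGER NOT NULL DEFAULT 0)"
    )
    db.commit()
    return db


def send(db, task, payload=None):
    """Add a 'start <task>' message; the caller is responsible for committing."""
    db.execute(
        "INSERT INTO messages (task, payload) VALUES (?, ?)",
        (task, json.dumps(payload)),
    )


def run_scheduler(db, tasks):
    """Process pending messages in order until none are left."""
    while True:
        row = db.execute(
            "SELECT id, task, payload FROM messages "
            "WHERE done = 0 ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return
        msg_id, task, payload = row
        result = tasks[task](json.loads(payload))
        # One transaction: mark this message as handled and, after Task1,
        # store the follow-up message (m2) that carries Task1's result.
        with db:
            db.execute("UPDATE messages SET done = 1 WHERE id = ?", (msg_id,))
            if task == "Task1":
                next_task = "Task2" if result == 1 else "Task3"
                send(db, next_task, {"result1": result})
```

To start a run, store the first message with `with db: send(db, "Task1", {"args": ...})` and call `run_scheduler(db, {...})` with a dictionary that maps task names to functions. If the process dies while Task2 is executing, m2 is still marked as not done, so the next call to `run_scheduler` on the same database file starts with Task2 and does not repeat Task1. A crash between the end of a task and the commit would still cause that one task to run again.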

Recap

In this article we introduced the execution of programs consisting of one or many long-running tasks. This is the case when some part of a program, called a “Task”, executes longer than some short interval such as 1 second. Long execution does not have to take weeks, but it could; sometimes 1 minute is already a very long time. In such constellations tasks should not or cannot be repeated, for several reasons. In the case of a failure, or when the program is simply stopped in the middle, we want to restart the program exactly at the place where it was stopped.

Posted Aug 11 2015, 06:41 AM by Damir Dobric

Filed under: BizTalk, cloud, azure, windows azure, Integration, Microsoft Azure, LogicApps