Clemens Vasters

October 19, 2012

@ 09:29 AM

Comments [1]

Service Bus: BeginSend is no magic async pixie dust

I just got off the call with a customer and had a bit of a déjà vu from a meeting at the beginning of the week, so it looks like the misconception I'll explain here is a bit more common than I expected.

In both cases, the folks I talked to had roughly the equivalent of the following code in their app:

var qc = factory.CreateQueueClient(…);
for( int i = 0; i < 1000; i++ )
{
… create message …
qc.BeginSend( msg, null, null );
}
qc.Close();

In both cases, the complaint was that messages were lost and strange exceptions occurred in the logs – which is because, well, this doesn't do what they thought it does.

BeginSend in the Service Bus APIs, like the Begin methods of other networking APIs or BeginWrite on the file system, doesn't actually do the requested work. It puts a job into a job queue: the job queue of the I/O thread scheduler.

That means that once the code reaches qc.Close(), and if you have been mighty lucky, a few messages may indeed have been sent, but the remaining messages still sit in that job queue, scheduled against an object that the code just forced to close. As a result, every send operation that is queued but hasn't run yet will throw, because it is trying to send on a disposed object. Those messages will fail out and be lost inside the sender's process.

What's worse is that such code stuffs a queue that is both out of the app's control and out of the app's sight, and all the arguments (which can be pretty big when we talk about messages) dangle on those jobs, filling up memory. Also, since the app doesn't call EndSend(), it doesn't pick up whatever exceptions the Send operation may raise, and flies completely blind. If there is an EndXXX method for an async operation, you _must_ call that method even if it doesn't return any values, because it may well throw back at you what went wrong.
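To make the EndXXX point concrete, here is a minimal sketch. It uses Stream.BeginWrite/EndWrite on an in-memory stream as a stand-in for a network send; the stand-in and the names ApmDemo and WriteAndObserve are assumptions for illustration, and the Service Bus client is not involved. The callback calls EndWrite inside a try/catch, which is where a failure of the asynchronous operation would surface.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

static class ApmDemo
{
    // Starts an asynchronous write and returns the exception that EndWrite
    // surfaced, or null if the write completed successfully.
    public static Exception WriteAndObserve(Stream stream, byte[] payload)
    {
        var outcome = new TaskCompletionSource<Exception>();

        stream.BeginWrite(payload, 0, payload.Length, ar =>
        {
            try
            {
                // EndWrite returns nothing, but it rethrows whatever went
                // wrong while the write was executing.
                stream.EndWrite(ar);
                outcome.SetResult(null);
            }
            catch (Exception e)
            {
                // Hand the failure to the caller instead of losing it.
                outcome.SetResult(e);
            }
        }, null);

        // Blocking here only keeps the sketch short; real code would await.
        return outcome.Task.Result;
    }
}
```

If the callback were left out, as in the BeginSend( msg, null, null ) loop above, nothing would ever look at the outcome of the operation.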

So how should you do it? Don't throw messages blindly into the job queue. It's ok to queue up a few to make sure there's a job in the queue as another one completes (which is just slightly trickier than what I want to illustrate here), but generally you should make subsequent sends depend on previous sends completing. In .NET 4.5 with async/await that's a lot easier now:

var qc = factory.CreateQueueClient(…);
for( int i = 0; i < 1000; i++ )
{
… create message …
await Task.Factory.FromAsync(qc.BeginSend, qc.EndSend, msg, null );
}
qc.Close();

Keep in mind that the primary goal of async I/O is to avoid wasting threads and losing time through excessive thread switching while threads hang on I/O operations. It doesn't make the I/O magically faster per se. We achieve that goal in the example above because the compiler breaks that code up into distinct methods, and the loop continues on an I/O thread callback once the Send operation has completed.
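The "slightly trickier" variant mentioned earlier, keeping a few sends in flight rather than exactly one, could be sketched as follows. This is one possible approach under stated assumptions, not what the Service Bus client does internally: BoundedSender, SendAllAsync, and the sendAsync delegate are hypothetical names, and a SemaphoreSlim is used as the gate that caps the number of pending operations.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

static class BoundedSender
{
    // Pushes every item through 'sendAsync' while never allowing more than
    // 'maxInFlight' operations to be pending at the same time.
    public static async Task SendAllAsync<T>(
        IEnumerable<T> items, Func<T, Task> sendAsync, int maxInFlight)
    {
        if (maxInFlight < 1) throw new ArgumentOutOfRangeException("maxInFlight");

        using (var gate = new SemaphoreSlim(maxInFlight))
        {
            var pending = new List<Task>();
            foreach (var item in items)
            {
                // Wait for a free slot before starting the next send.
                await gate.WaitAsync();
                pending.Add(SendOneAsync(item, sendAsync, gate));
            }

            // Awaiting all tasks is what observes their exceptions; a failed
            // send surfaces here, after the remaining sends were attempted.
            await Task.WhenAll(pending);
        }
    }

    static async Task SendOneAsync<T>(T item, Func<T, Task> sendAsync, SemaphoreSlim gate)
    {
        try
        {
            await sendAsync(item);
        }
        finally
        {
            // Free the slot whether the send succeeded or failed.
            gate.Release();
        }
    }
}
```

With the queue client from the example above, the delegate could be m => Task.Factory.FromAsync(qc.BeginSend, qc.EndSend, m, null), and qc.Close() would be called only after SendAllAsync has been awaited. The right limit depends on the workload; any particular number would be a guess.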

Summary:

Don't stuff the I/O scheduler queue with loads of blind calls to BeginXXX without consideration for how the work gets done and completed and that it can actually fail
Always call End and think about how many operations you want to have in flight and what happens to the objects that are attached to the in-flight jobs

Categories: Architecture | Technology

October 12, 2012

@ 06:45 AM

Comments [0]

About the 'D' in ACID Transactions

I just got prompted to write this in an email reply and I think it's worth sharing.

My personal definition for ACID's Durability tenet is as follows:

"The outcome of the transaction outlasts the transaction, meaning that the result of the transaction is published to other consumers as the transaction completes. ‘I’ resolves into ‘D’."

People seem to think that there’s an implied guarantee on the outcome of the transaction lasting forever. I don’t think that’s true and also believe that even the implication is out of scope of the transaction mechanism. To me, all that ‘D’ says is that the result of the transaction is published to subsequent consumers with the same reliability assurances that apply for the same kind of data within the same system. That is to say that if the system holds its operational data in volatile memory, committing to and publishing in memory is sufficient. If the system stores operational data in a replicated ring store in memory or if the system stores it on nonvolatile media, the transaction outcome must be verifiably published following the common storage strategy to satisfy ‘D’. It doesn’t have to be a spindle.

Categories:

September 6, 2012

@ 07:08 PM

Comments [14]

Are you catching falling knives?

As I thumb through some people's code on Github, I see a fairly large number of "catch all" exception handling cases. It's difficult to blame folks for that, since there's generally (and sadly) very little discipline about exception contracts and exception masking, i.e. wrapping exceptions to avoid bubbling through failure conditions of underlying implementation details.

If you're calling a function that sits on a mountain of dependencies and folks don't care about masking exceptions, there are many dozens of candidate exceptions that can bubble back up to you, and there's little chance of dealing with them all or even knowing them. Java has been trying to enforce more discipline in that regard, but people cheat there with "catch all" as well. There's also the question of what the right way to deal with most exceptions is. In many cases, folks implement "intercept, shrug and log" and mask the failure by telling users that something went wrong. In other common cases, folks implement retries. It's actually fairly rare to see deeply customized and careful reactions to particular exceptions. Again - things are complicated and exceptions are supposed to be exceptional (reminder: throwing exceptions as part of the regular happy path is horrifyingly bad for performance and terrible from a style perspective), so these blanket strategies are typically an efficient way of dealing with things.

That all said ...

Never, never ever do this:

try
{
    Work();
}
catch
{
}

And not even this:

try
{
    Work();
}
catch(Exception e)
{
    Trace.TraceError(e.ToString());
}

Those examples are universally bad. (Yes, you will probably find examples of that type even in the archive of this blog and some of my public code. Just goes to show that I've learned some better coding practices here at Microsoft in the past 6 1/2 years.)

The problem with them is that they catch not only the benign stuff, but they also catch and suppress the C# runtime equivalent of the Zombie Apocalypse. If you get thread-abort, out-of-memory, or stack-overflow exceptions thrown back at you, you don't want to suppress those. Once you run into these, your code has ignored all the red flags and exhausted its resources, and whatever it was that you called didn't get its job done and likely sits there as a zombie in an undefined state. That class of exceptions is raining down your call stack like a shower of knife blades. They can't happen. Your code must be written defensively enough to never run into that situation and overtax resources in that way; if it does without you knowing what the root cause is, this is an automatic "Priority 0", "drop-everything-you're-working-on" class bug. It certainly is if you're writing services that need to stay up 99.95%+.

What do we do? If we see any of those exceptions, it's an automatic death penalty for the process. Once you see an unsafe out-of-memory exception or stack overflow, you can't trust the state of the respective part of the system and likely not the stability of the system. Mind that there's also an "it depends" here; I would follow a different strategy if I were talking about software for an autonomous Mars Rover that can't crash even if it's gravely ill. There I would likely spend a few months on the exception design and "what could go wrong here" before even thinking about functionality, so that's a different ballgame. In a cloud system, booting a cluster machine that has the memory flu is a good strategy.

Here's a variation of the helper we use:

// Needs: using System; using System.Reflection;
//        using System.Runtime.InteropServices; using System.Threading;
public static bool IsFatal(this Exception exception)
{
    while (exception != null)
    {
        // InsufficientMemoryException derives from OutOfMemoryException, but it
        // signals a failed up-front memory check, so it is excluded here.
        if ((exception is OutOfMemoryException && !(exception is InsufficientMemoryException)) ||
            exception is ThreadAbortException ||
            exception is AccessViolationException ||
            exception is SEHException ||
            exception is StackOverflowException)
        {
            return true;
        }

        // Only look inside the wrapper exceptions that hide the real failure.
        if (!(exception is TypeInitializationException) && !(exception is TargetInvocationException))
        {
            break;
        }
        exception = exception.InnerException;
    }
    return false;
}

If you put this into a static utility class, you can use this on any exception as an extension. And whenever you want to do a "catch all", you do this:

try
{
    DoWork();
}
catch (Exception e)
{
    if (e.IsFatal())
    {
        throw;
    }
    Trace.TraceError(..., e);
}

If the exception is fatal, you simply throw it up as high as you can. Eventually it'll end up at the bottom of whatever thread it happened on (where you might log and rethrow) and will hopefully take the process with it. Threads marked as background threads don't do that, so it's actually not a good idea to use those. These exceptions are unhandled, process-terminating disasters with a resulting process crash-dump you want to force in a 24/7 system so that you can weed them out one by one.

(Update) As Richard Blewett pointed out after reading this post, the StackOverflowException can't be caught in .NET 2.0+, at all, and the ThreadAbortException automatically rethrows even if you try to suppress it. There are two reasons for them to be on the list: first, to shut up any code review debates about which of the .NET stock exceptions are fatal and ought to be there; second, because code might (ab-)use these exceptions as fail-fast exceptions and fake-throw them, or the exceptions might be blindly rethrown when marshaled from a terminated background thread where they were caught at the bottom of the thread. However they show up, it's always bad for them to show up.

If you catch a falling knife, rethrow.

Categories: Technology | CLR

September 2, 2012

@ 05:31 PM

Comments [2]

Going Home

After just over six years in the United States, our family is going to relocate back to Germany sometime in the second half of this month.

Thanks to a lot of effort by our management and HR teams at Microsoft, including our VP Scott Guthrie, I will be staying with the Windows Azure engineering group and with the Service Bus feature team and will come back to the mothership fairly frequently, likely 5-6 times a year; you can look at this as "working from home" with a 14h each-way commute to work.

Since it's soon going to be fairly obvious looking at my Twitter timeline that that move is happening, I thought it would make sense to let you know here as well with a few more than 140 characters. I already told a few folks who're good at reading tea leaves and writing about it (like Darryl Taft and Mary-Jo Foley) back at TechEd North America that this was coming up, so there wouldn't be speculation about me jumping ship. I'm not.

There are two sets of reasons for why we're moving back at this time. The primary set of reasons is around family concerns. Our daughter is now 5 and the grandparents and the rest of the family deserve extended time with her and us. We also have a choice between having her set her cultural and educational roots in America or in Europe, and of whether our daughter is going to communicate with us in English or German in the long run.

Everyone has their notions of patriotism and I'm a proud European. And irrespective of what the media panic says, I'm bullish on Europe - and I'm also bullish on the Middle East and Africa. I see an awesome number of really sophisticated customer cloud solutions and concepts in progress in manufacturing, commerce, and energy in the EMEA region, and setting up camp near Mönchengladbach/Düsseldorf will mostly put me within 1-2 flying hours of most of these customers and my colleagues working with them. I think that having more folks from the core Windows Azure engineering organization over in Europe - my Service Bus colleague David Ingham is in Newcastle/England already - will be a good thing.

And even though we're still working on calibrating the exact shape of my remote role and coordinating with the local colleagues, I'm fairly certain that conference and event attendees in EMEA will see a bit more of me again. The first European conference I'll be speaking at that I would otherwise not have been able to go to will be the German ADC conference in early November. If you run/chair a conference in Europe and would be interested in having me speak, drop me an email at clemensv@microsoft.com.

Also - this isn't the last word in the U.S. for us. Our daughter is a dual citizen and we're keeping the door open to come back in a few years' time, so this is technically a temporary relocation. We obviously have a lot of friends here, and the Puget Sound area is one of the most beautiful places in the world (even when it's gray) and there's no better place in the world for one of my newly acquired hobbies, which is, probably oddly, Civil and Military Aviation History. I also acquired an appreciation for Baseball and the up-and-coming (you just have to believe) Seattle Mariners - and, of course, Football and the Seahawks.

Bottom line: Same job, different continent, different time-zone.

Categories:

September 1, 2012

@ 04:49 AM

Comments [3]

Sagas

Today has been a lively day in some parts of the Twitterverse debating the Saga pattern. As it stands, there are a few frameworks for .NET out there that use the term "Saga" for some framework implementation of a state machine or workflow. Trouble is, that's not what a Saga is. A Saga is a failure management pattern.

Sagas come out of the realization that particularly long-lived transactions (originally even just inside databases), but also far distributed transactions across location and/or trust boundaries, can't easily be handled using the classic ACID model with 2-Phase commit and holding locks for the duration of the work. Instead, a Saga splits work into individual transactions whose effects can be, somehow, reversed after work has been performed and committed.

The picture shows a simple Saga. If you book a travel itinerary, you want a car and a hotel and a flight. If you can't get all of them, it's probably not worth going. It's also very certain that you can't enlist all of these providers into a distributed ACID transaction. Instead, you'll have an activity for booking rental cars that knows both how to perform a reservation and also how to cancel it - and one for a hotel and one for flights.

The activities are grouped in a composite job (routing slip) that's handed along the activity chain. If you want, you can sign/encrypt the routing slip items so that they can only be understood and manipulated by the intended receiver. When an activity completes, it adds a record of the completion to the routing slip along with information on where its compensating operation can be reached (e.g. via a Queue). When an activity fails, it cleans up locally and then sends the routing slip backwards to the last completed activity's compensation address to unwind the transaction outcome.

If you're a bit familiar with travel, you'll also notice that I've organized the steps by risk. Reserving a rental car almost always succeeds if you book in advance, because the rental car company can move more cars on-site if there is high demand. Reserving a hotel is slightly more risky, but you can commonly back out of a reservation without penalty until 24h before the stay. Airfare often comes with a refund restriction, so you'll want to do that last.
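The forward-and-unwind mechanics described above can be sketched in a few lines. This is a deliberately simplified, in-process illustration and not the Gist mentioned below: SagaStep and RoutingSlipSketch are hypothetical names, the steps are plain delegates, and the queues, addresses, and signing of routing slip items are left out entirely.

```csharp
using System;
using System.Collections.Generic;

// One activity of the saga: the forward work and the action that undoes it.
sealed class SagaStep
{
    public string Name;
    public Func<bool> DoWork;     // returns false to signal failure
    public Action Compensate;     // reverses the effect of a completed DoWork
}

// Carries the steps still to be done and a log of the ones already completed.
sealed class RoutingSlipSketch
{
    readonly Queue<SagaStep> next;
    readonly Stack<SagaStep> completed = new Stack<SagaStep>();

    public RoutingSlipSketch(IEnumerable<SagaStep> steps)
    {
        next = new Queue<SagaStep>(steps);
    }

    // Returns true if every step completed. On the first failure, the
    // completed steps are compensated in reverse order and false is returned.
    public bool Run()
    {
        while (next.Count > 0)
        {
            var step = next.Dequeue();
            if (step.DoWork())
            {
                completed.Push(step);   // record the completion on the slip
                continue;
            }

            // The failed step cleans up locally (nothing to do in this
            // sketch); then walk backwards through the completed steps.
            while (completed.Count > 0)
            {
                completed.Pop().Compensate();
            }
            return false;
        }
        return true;
    }
}
```

With car, hotel, and flight steps in that order and a flight step that fails, Run() cancels the hotel first and then the car, which is why the step most likely to fail, and the most expensive to undo, goes last.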

I created a Gist on Github that you can run as a console application. It illustrates this model in code. Mind that it is a mockup and not a framework. I wrote this in less than 90 minutes, so don't expect to reuse this.

The main program sets up an exemplary routing slip (all the classes are in the one file) and creates three completely independent "processes" (activity hosts) that are each responsible for handling a particular kind of work. The "processes" are linked by a "network" and each kind of activity has an address for forward progress work and one for compensation work. The network resolution is simulated by "Send".

static ActivityHost[] processes;

static void Main(string[] args)
{
    var routingSlip = new RoutingSlip(new WorkItem[]
        {
            new WorkItem<ReserveCarActivity>(new WorkItemArguments{{"vehicleType", "Compact"}}),
            new WorkItem<ReserveHotelActivity>(new WorkItemArguments{{"roomType", "Suite"}}),
            new WorkItem<ReserveFlightActivity>(new WorkItemArguments{{"destination", "DUS"}})
        });

    // imagine these being completely separate processes with queues between them
    processes = new ActivityHost[]
        {
            new ActivityHost<ReserveCarActivity>(Send),
            new ActivityHost<ReserveHotelActivity>(Send),
            new ActivityHost<ReserveFlightActivity>(Send)
        };

    // hand off to the first address
    Send(routingSlip.ProgressUri, routingSlip);
}

static void Send(Uri uri, RoutingSlip routingSlip)
{
    // this is effectively the network dispatch
    foreach (var process in processes)
    {
        if (process.AcceptMessage(uri, routingSlip))
        {
            break;
        }
    }
}

The activities each implement a reservation step and an undo step. Here's the one for cars:

class ReserveCarActivity : Activity
{
    static Random rnd = new Random(2);

    public override WorkLog DoWork(WorkItem workItem)
    {
        Console.WriteLine("Reserving car");
        var car = workItem.Arguments["vehicleType"];
        var reservationId = rnd.Next(100000);
        Console.WriteLine("Reserved car {0}", reservationId);
        return new WorkLog(this, new WorkResult { { "reservationId", reservationId } });
    }

    public override bool Compensate(WorkLog item, RoutingSlip routingSlip)
    {
        var reservationId = item.Result["reservationId"];
        Console.WriteLine("Cancelled car {0}", reservationId);
        return true;
    }

    public override Uri WorkItemQueueAddress
    {
        get { return new Uri("sb://./carReservations"); }
    }

    public override Uri CompensationQueueAddress
    {
        get { return new Uri("sb://./carCancellactions"); }
    }
}
