What’s the state of your Cortex?

Monday, September 26th, 2011 by Miro Samek

Recently, I’ve been involved in a fascinating bug hunt related to a very peculiar behavior of the ARM Cortex-M3 core. Given the incredible popularity of this core, I thought that digging a little deeper into the mysteries of ARM Cortex could be interesting and informative.

First, I need to provide some background. The bug was related to a unique ARM Cortex-M exception type called PendSV. This is an exception triggered by software, but unlike any regular software interrupt, PendSV is an asynchronous exception. This means that PendSV typically does not run immediately after it is triggered, but only after the Nested Vectored Interrupt Controller (NVIC) determines that the priority of the currently executing code drops below the priority associated with PendSV.

At this point, you might wonder why and where such a “Pended Software Interrupt” would be useful. Well, it turns out that PendSV is the only reliable way on ARM Cortex-M to find out when all (possibly nested) interrupt service routines (ISRs) have completed. And this determination is essential to run the scheduler in any preemptive real-time kernel.

Virtually all preemptive RTOSes for ARM Cortex-M processors work as follows. Upon initialization the priority associated with PendSV is set to be the lowest of all exceptions (0xFF). All ISRs in the system, prioritized above PendSV, trigger the PendSV exception by writing 1 to the PENDSVSET bit in the NVIC ICSR register, like this:

*((uint32_t volatile *)0xE000ED04) = 0x10000000;
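
A slightly more self-documenting version of these two steps might look like the sketch below. The register addresses and bit positions (ICSR at 0xE000ED04, SHPR3 at 0xE000ED20 with the PendSV priority in bits 23:16) are those of the ARMv7-M System Control Block. The macro and function names are made up for illustration, and the functions take the register address as a parameter only so that the logic can also be exercised off-target.

```c
#include <stdint.h>

/* ARMv7-M System Control Block registers */
#define SCB_ICSR_ADDR   0xE000ED04u  /* Interrupt Control and State Register */
#define SCB_SHPR3_ADDR  0xE000ED20u  /* System Handler Priority Register 3   */

#define ICSR_PENDSVSET        (1u << 28)    /* same value as 0x10000000      */
#define SHPR3_PENDSV_PRI_MASK (0xFFu << 16) /* PendSV priority field (PRI_14) */

/* Hypothetical helper: give PendSV the largest priority number (0xFF),
 * which makes it the least urgent exception in the system. */
void pendsv_set_lowest_priority(volatile uint32_t *shpr3)
{
    *shpr3 |= SHPR3_PENDSV_PRI_MASK;
}

/* Hypothetical helper: request PendSV, typically at the end of an ISR.
 * A plain write is enough, because writing 0 to the other ICSR bits
 * has no effect. */
void pendsv_trigger(volatile uint32_t *icsr)
{
    *icsr = ICSR_PENDSVSET;
}
```

On the target, the calls would be `pendsv_set_lowest_priority((volatile uint32_t *)SCB_SHPR3_ADDR)` during initialization and `pendsv_trigger((volatile uint32_t *)SCB_ICSR_ADDR)` from the ISRs.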

Now, the heavy lifting is left entirely to the NVIC hardware. NVIC will activate PendSV only after the last of all nested interrupts completes and is about to return to the preempted task context. This is exactly the right time for a context switch. In other words, the PendSV exception is designed to call the scheduler and perform the task preemption. ARM Cortex is so smart that it eliminates the overhead of exiting one exception (the last nested interrupt) and activating another (the PendSV) in the trick called “tail-chaining”.

Everything looks easy so far, but ARM Cortex has one more trick up its sleeve, and this optimization, called “late-arrival”, has interesting side effects related to PendSV. This subtle interaction between PendSV and late-arrival leads essentially to a hardware race condition that I’ve recently had the pleasure of chasing down.

To illustrate the events that lead up to the bug, I’ve prepared a distilled hardware trace available for viewing at ARM-Cortex-M3_bug.txt. Please go ahead and click on this link to follow along.

The trace starts with an interrupt entry (labelled as Exception 83). This system runs under the preemptive kernel called QK, so the ISR calls QK_ISR_ENTRY() and later QK_ISR_EXIT() macros to inform the kernel about the interrupt. At trace index 069545 the QK_ISR_EXIT() macro triggers the PendSV exception by writing 0x10000000 into the ICSR register.

After this, the Exception 83 runs to completion and eventually tail-chains to Exception 14 (PendSV). This is all as expected.

However, the real problem starts at trace index 069618, at which the execution of the first instruction of PendSV (CPSID i) is cancelled due to arrival of a higher-priority Exception 36 (another interrupt).

This cancellation of the low-priority Exception 14 in favor of the higher-priority Exception 36 is another ARM Cortex specialty called late arrival. The ARM core optimizes the interrupt entry (which is identical for all exceptions), and instead of entering the low-priority exception and then immediately the high-priority exception, it simply enters the high-priority exception.

The problem is that just before the late arrival, the PENDSVSET bit in the NVIC-ICSR register is already cleared.

However, the late-arriving Exception 36 sets this bit again in QK_ISR_EXIT(), which is normal for any interrupt (trace index 070126).

The Exception 36 eventually exits to the original PendSV (trace index 070130), but this is not the usual tail-chaining (the trace indicates tail-chaining by the pair Exception Exit/Exception Entry). This time around the trace shows only Exception Exit, but no entry.

This difference has a very important implication, which is that the PENDSVSET bit in the NVIC-ICSR register is not cleared (remember that Exception 36 has just set it again).

What unfolds next is the consequence of the PENDSVSET bit being set. PendSV executes, fakes its own return to the QK scheduler, and eventually it unlocks interrupts. But before SVCall (Exception 11) can execute, the PendSV Exception 14 is taken again (because it is triggered by the PENDSVSET bit). This makes no sense and should never happen, because PendSV should never be in the triggered state at this point.
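
The sequence can be replayed with a toy model that runs on a PC. This is only a sketch of the events as the trace is interpreted above, not a description of the real NVIC logic: the struct, the function names, and the assumption that the pending state is cleared exactly once per PendSV entry are all illustrative.

```c
#include <stdbool.h>

typedef struct {
    bool pendsv_pending;  /* models the PENDSVSET state in ICSR        */
    int  pendsv_runs;     /* how many times the PendSV body executed   */
} nvic_model;

/* What QK_ISR_EXIT() does in this model: request a PendSV. */
static void isr_exit(nvic_model *m)
{
    m->pendsv_pending = true;
}

/* Replays: Exception 83 -> PendSV entry -> (optional) Exception 36
 * cutting in before the first PendSV instruction -> PendSV body.
 * Returns the number of times the PendSV body ran. */
int run_scenario(bool late_isr, bool manual_clear)
{
    nvic_model m = { false, 0 };

    isr_exit(&m);                      /* Exception 83 pends PendSV      */

    while (m.pendsv_pending) {         /* PendSV is taken while pending  */
        m.pendsv_pending = false;      /* pending state cleared on entry */

        if (late_isr) {                /* Exception 36 cuts in ...       */
            isr_exit(&m);              /* ... and pends PendSV again     */
            late_isr = false;          /* it happens only once           */
            /* resuming PendSV is not a new entry: nothing is cleared    */
        }
        if (manual_clear) {
            m.pendsv_pending = false;  /* explicit PENDSVCLR in handler  */
        }
        m.pendsv_runs++;               /* PendSV body runs               */
    }
    return m.pendsv_runs;
}
```

In this model `run_scenario(true, false)` runs the PendSV body twice, while `run_scenario(false, false)` and `run_scenario(true, true)` run it once.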

***
So, what are the consequences of this behavior and what is the fix?

Well, as you can see, due to late-arrival PendSV can occasionally be entered with the PENDSVSET bit set, so it will be triggered again immediately after it completes. This might or might not have grave consequences. In the case of the QK kernel, this was unacceptable and led to a Hard Fault. In other RTOSes it might simply cause another scheduler call, waste some CPU cycles, and delay the task-level response, but perhaps not cause a catastrophic failure.

The actual fix of the problem is very simple. Since you cannot rely on the automatic clearing of the PENDSVSET bit in the NVIC-ICSR register, you need to clear it manually (by writing 1 to the PENDSVCLR bit in the NVIC-ICSR register). Of course this is wasteful, because only one time in a million is this bit actually not cleared automatically.
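
In C, the manual clear could be sketched as follows. PENDSVCLR is bit 27 of the same ICSR register used earlier (address 0xE000ED04). The function name is hypothetical, and where exactly in the PendSV handler the write belongs depends on the kernel, so treat this as an illustration rather than the actual QK code.

```c
#include <stdint.h>

#define ICSR_PENDSVCLR  (1u << 27)  /* write 1 to remove the PendSV pending state */

/* Hypothetical helper, meant to be called inside the PendSV handler
 * before the scheduling decision is made. The register address is a
 * parameter only so that the function can be tried off-target. */
void pendsv_clear_pending(volatile uint32_t *icsr)
{
    *icsr = ICSR_PENDSVCLR;  /* plain write; zeros in other bits are ignored */
}
```

On the target the call would be `pendsv_clear_pending((volatile uint32_t *)0xE000ED04)`.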

Interestingly, I have not seen such a write to the PENDSVCLR bit in open source RTOSes for ARM Cortex-M (such as FreeRTOS.org). Recently, I’ve come across some posts on the ARM Community Forums indicating that this problem exists for the Freescale MQX RTOS (see PendSV pending inside PendSV handler? (Cortex-M4)).

If you use a preemptive kernel on ARM Cortex-M0/M3, perhaps you could check how your kernel handles PendSV. If you don’t see an explicit write to the PENDSVCLR bit, I would recommend that you think through the consequences of re-entering PendSV. I’d be very interested to collect a survey of how the existing kernels for ARM Cortex-M handle this situation.

Tags: ARM Cortex-M, hardware race condition

This entry was posted on Monday, September 26th, 2011 at 3:24 pm and is filed under Firmware Bugs, MCUs, RTOS Multithreading.

48 Responses to “What’s the state of your Cortex?”

  1. Nick Merriam says:
    September 26, 2011 at 3:57 pm

    Great post Miro. I love this kind of thing. It reminds me of a tricky behaviour of the C167 family (it could be still there for all I know), where you could enter an interrupt and the state on the stack showed that the interrupted context was running at a higher priority than the interrupt. This happened when you interrupted an instruction that was writing the interrupt level bits of the PSW. The processor decided to handle the interrupt but also completed the instruction to raise the interrupted context’s interrupt level. The fix was to wrap such writes to the PSW with an ATOMIC sequence.

    Our gliwOS microkernel does not support pre-emptive task switching because (a) most automotive applications do not actually require it and (b) we have seen soooo many cases like this of nasty race conditions that affect a preemptive OS. Also, there are cases where a non-preemptive OS outperforms a preemptive OS.

  2. Richard Barry says:
    September 27, 2011 at 8:08 am

    I’m always grateful for a mention of FreeRTOS in any article. I need to triple read what you are saying in this article, but from my first read, it appears you are implying there is a bug in how FreeRTOS interacts with the Cortex-M core – when in fact there is no such thing. The worst case scenario would be that the PendSV runs twice, which has zero consequence over and above wasting a tiny amount of time.

    FreeRTOS has no special interrupt entry or exit code, does not pend a PendSV unless it is necessary (that is if the ISR unblocked a task with a priority higher than the task currently in the Running state), and never globally disables interrupts (CPSID i). In the standard version the only SVC call used is to start the scheduler, although the MPU version makes multiple use of SVC calls.

    Regards,
    Richard
    Author, FreeRTOS

    • Miro Samek says:
      September 27, 2011 at 10:53 am

      I have mentioned FreeRTOS.org only because it is popular and truly open source, so it was very easy for me to go and check its source code without needing to register or relinquish my personal information (as it is the case in so many other RTOSes out there). As an author of the open source QP framework I also constantly deal with such scrutiny of the source code. I see it as a very good thing, because it contributes to constant improvement of the quality of the code.

      I am really glad to hear that after performing analysis of the FreeRTOS.org port to ARM Cortex-M you conclude that occasional entry to the PendSV exception with the PENDSVSET bit set causes no problem for FreeRTOS. This is all I was asking for.

      The real question is whether you *knew* about this peculiar interference between late-arrival and the PENDSVSET bit, so that you *know* that you need to think through the consequences of this particular scenario.

      I certainly didn’t know about it and I would have been very grateful to anybody who would share this piece of information with me. As far as I know this behavior is undocumented and is untestable with a single-step debugger where you trigger various exceptions at various places in the code. The bug raises its ugly head only in the highly dynamic situation of late-arrival, so it can be revealed only in a system running at full speed and heavily hammered by interrupts.

      • Paul Kimelman says:
        September 27, 2011 at 12:11 pm

        Miro, you are focused on the wrong issue. I am not even sure this is late arrival – it looks more like preemption, but without time or stack info, no way to verify. But, the simpler case is that PendSV has entered and then your ISR 36 fires. The exact same scenario would occur as you describe. I assume your PendSV handler protects its critical data, so ISR 36 would occur before or after.
        In either case, the side effect of ISR 36 needs to be verified by PendSV. Since ISR36 could fire anywhere in PendSV, it makes sense for the core to re-run PendSV to make sure the change is handled. Since you are using critical sections, there is no risk of corruption, so it is only time.

      • Richard Barry says:
        September 27, 2011 at 1:17 pm

        Well, Paul Kimelman is obviously far more knowledgeable on this than I am, so I won’t say too much, but I don’t understand your use of the word “bug”. How you are describing the core working is exactly how I would expect it to work, and I think how it is intended to work, and how I would infer it worked from the documentation. I would go as far as to say that if the core did not exhibit this behaviour, it would be a bug in the core, not if it does exhibit it. For example, if the pend bit is cleared, and a preempting interrupt could not re-set it, then that would be a problem. There is no race condition, is there? Then again, I have been using Cortex-M devices for many years, so maybe my knowledge (?) of its workings come more from experience than the documentation.

        • Dan says:
          September 27, 2011 at 2:10 pm

          Hi Richard,

          When Miro said “the bug raises its ugly [head]”, I don’t think Miro was referring to the CM3 core, I think he was referring to the QK kernel code.

          (Miro, feel free to correct me if I have mis-represented your intent.)

          • Paul Kimelman says:
            September 27, 2011 at 3:24 pm

            Dan, he is treating it as a core problem: “actual fix of the problem is very simple. Since you cannot rely on the automatic clearing of the PENDSVSET bit in the NVIC-ICSR register, you need to clear it manually (by writing 1 to the PENDSVCLR bit in the NVIC-ICSR register.)”.

            If he cannot tolerate PendSV being called as soon as it exits, then he has a big problem since this “fix” will not fix that. As I noted, the high priority ISR could come in at instruction 0x08023278 (where PendSV enables interrupts) and it will of course re-execute.

            But, the trace does not make sense anyway. He has an SVC instruction in his PendSV handler apparently. For this to not fault, the SVCall exception must be higher priority than PendSV. If it is, then it should execute before PendSV can be re-executed. Further, PendSV will not re-enter until you return. That is, the PendSV bit does *not* cause PendSV to pre-empt itself – that cannot happen. So, his trace is not reflecting something, but we do not have enough info to see what is going on.

          • Dan says:
            September 27, 2011 at 4:46 pm

            Hi Paul,

            (Staying at current level of indentation for readability)…

            I hear you – I was only responding to Richard’s remark about the use of “bug”, I thought Miro was only referring to the QK implementation, not the core, when he used that term. Just how I read it.

            This is totally my interpretation, but I read the “you cannot rely on the automatic clearing of the PENDSVSET bit” not as a slam on the CM3, but more pointing out a situation where the QK doesn’t handle the particular sequence of events properly (gotta love the asynchronous nature of interrupts, right?). My experience with Miro is that he’s usually the first to assume the problem is in his own code rather than blame someone else’s code (or processor design).

            By the way, thanks for joining in the conversation. I’d say you’re uniquely qualified to weigh in on this topic, so any insight & suggestions are welcome by all of us.

        • Miro Samek says:
          September 27, 2011 at 8:46 pm

          I certainly realize that many millions or perhaps now even billions of ARM Cortex-M processors have been deployed in all sorts of products, so it is quite safe to say that the core doesn’t have bugs. The point of my post was rather to describe an interesting, and non-obvious (at least to me) behavior related to the interference of two features (PendSV and late-arrival), both being unique to ARM Cortex-M.

          Perhaps I should also note that the hardware trace referenced in the original post contains very useful end-of-line comments, which were generated by the trace probe, not by me. These comments provide additional information from the hardware, from which you can clearly see that the Exception 14 was entered (index 069617), but it was cancelled and Exception 36 was entered instead. Also, the exit from Exception 36 (index 070130) is *not* followed by the “Exception Entry” comment, but the first instruction of PendSV is executed. These two pieces of evidence make me still think that my interpretation of this trace is correct, in that we are dealing here not with “normal” preemption, but with late-arrival.

          The non-obvious behavior to me is that late-arrival causes entry to *both* exceptions simultaneously, including execution of the side effects of both exceptions, such as clearing the pending bits (PENDSVSET bit in case of PendSV and some other pending bit for Exception 36). But then again, perhaps the right way of thinking about late-arrival is that it *is* the preemption of a low-priority exception by a higher-priority exception only compressed to one machine instruction (exception entry). With this interpretation I admit that I should have *expected* that the PENDSVSET bit won’t be cleared as part of resuming the preempted PendSV. (Although, interestingly, the cancelled first instruction of PendSV *is* recalled and executed.)

          But this whole discussion brings into focus one important point, I hope. And this is that occasionally PendSV is triggered while another instance of PendSV is already active. When this happens and when the PENDSVSET bit is *not* explicitly cleared right before the context switch, another instance of PendSV is executed immediately after the first one. I understand that this is not catastrophic in the FreeRTOS.org implementation. However, the amount of code executed from PendSV is non-trivial and delays the task-level response. My gut feeling is that adding explicit write to the PENDSVCLR bit right before invoking the context switch would improve the determinism of the kernel. Of course, the average execution time would increase by three short instructions, but the *worst-case* task-level response would improve. I believe this is something to think about.

          • Paul Kimelman says:
            September 27, 2011 at 11:37 pm

            I think we should clarify a few things. Late arrival simply means that while we are pushing regs, a higher priority interrupt comes along and so we pass control to it vs. the original ISR – when it returns, the one that was prevented from running will be entered by tail chaining (which just means skip the pop and then push). This works by keeping the pend bit set until the ISR is “activated” (1st location loaded into PC).
            I do not know what the trace tool you are using does, but the PendSV pending bit is not cleared until the 1st instruction is about to start (after it is too late for late arrival) and the active bit is also set. However, if a higher pri interrupt comes in, that 1st instruction is not executed (canceled) and the interrupt taken. The registers are all stacked again since this is a pre-emption. So, if you see it cleared, then it must have been entered into PendSV but 1st instruction canceled by pre-emption.
            If your PendSV code does a lot of work before anything that would be affected by another interrupt, then by all means clear the PendSV pend bit. But, you cannot do it after that critical point and be safe of course. For most kernels, this is near or at the start. For you, it is worse since you disable interrupts right away. So, the only place it can occur is just before that 1st instruction. So, the value of clearing it as one of the 1st few instructions is very very limited – the odds of catching it on its 1st instruction is low. However you see that when it does happen only because it crashes your code.
            My guess is that you are creating a lot of these effects by global interrupt disable for whole ISRs. This causes interrupts to align to where the interrupts are enabled.
            I still do not understand your trace showing SVC being used in PendSV. I also do not understand the next instruction being PendSV – CM3 will not pre-empt an ISR with itself. Are you using the same function for SVCall and PendSV?

  3. Paul Kimelman says:
    September 27, 2011 at 9:59 am

    This is intentional. First of all, as Richard Barry notes above, ISRs only need to set PendSV if they do something that would need rescheduling (that is, they change a task state or a resource a task is waiting on). Second of all, PendSV can be set from an ISR which has pre-empted PendSV – this is very intentional. If this happens, there is a potential race since it could happen after you finished updating the task list (and released the critical section). So, it runs again (by tail chaining). If the 2nd run finds nothing new to do, that is OK. But, it could.
    The reason you are confused is that you think the PendSV bit is lost on the late arriving high pri interrupt, but that is not the case. My guess is that this is not a late arrival case but a pre-emption. You likely pre-empted it (after it entered). The PendSV bit is not cleared until the PendSV handler is active (stacked). You can tell by looking at the active bits, by looking to see if the higher pri int will be returning to the task level, or by looking at the stack.
    Also as Richard says, the best way to do a critical section is BASEPRI (and BASEPRI_MAX) so that you only impact ISRs which use or affect the same critical data. So, if the top 3 pri levels do not make OS calls, then set BASE pri to the 4th level so that the top 3 can run unimpeded but the priorities below are protected from corrupting your data. You can even use different levels for different resources. That is, tasks may be the 4th pri level, but mailbox CBs may be the 5th level since the 4th does not touch mailbox CBs.
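
To make the BASEPRI arithmetic in this comment concrete, here is a small host-side sketch (not part of the original comment). It assumes a device with `prio_bits` implemented priority bits held in the most significant bits of the 8-bit priority field. The function names are hypothetical, and the actual register write (an MSR to BASEPRI) is left out.

```c
#include <stdint.h>
#include <stdbool.h>

/* Value to write to BASEPRI for a logical level (0 = most urgent).
 * Note that level 0 yields 0, which turns masking off: BASEPRI cannot
 * be used to block the most urgent level. */
uint8_t basepri_for_level(uint8_t level, uint8_t prio_bits)
{
    return (uint8_t)(level << (8u - prio_bits));
}

/* True if an exception with the given 8-bit priority value is held off.
 * BASEPRI blocks exceptions whose priority value is greater than or
 * equal to it; a BASEPRI of 0 disables masking. */
bool basepri_blocks(uint8_t irq_prio, uint8_t basepri)
{
    return basepri != 0u && irq_prio >= basepri;
}
```

With 3 priority bits, protecting "the 4th level and below" means `basepri_for_level(3, 3)`, which is 0x60. Interrupts at levels 0 to 2 (priority values 0x00, 0x20, 0x40) still run, while everything from 0x60 upward is held off.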
    Regards, Paul

  4. Paul Kimelman says:
    September 27, 2011 at 10:19 am

    In response to Nick, I will say that Cortex-M3/M4 (really ARMv7-M) was designed to support OSes including pre-emptive ones. So, it provides a number of facilities to ensure better performance in many cases, faster scheduling, avoidance of critical sections and avoidance of global critical sections, etc. It also supports a protected OS using the MPU, which Richard Barry has added support for in FreeRTOS – this includes fast update of the MPU during scheduling.
    There are still many tricks up its sleeve that are not yet being used by any OS/kernel (that I know of). This includes minimally:
    - Using bit banding to support task sleep/wake with no critical sections. That is ISRs set an atomic bit and PendSV, and the scheduler can perform the task operation without then needing a critical section at all.
    - Use of LDREX/STREX (exclusives) for non-blocking and non-locking FIFOs between ISRs and tasks, such as for peripheral feeds (RX and TX).
    - Support of supervisor and user tasks. Best with MPU, but can be useful in other cases.
    - You can run user code within an ISR context. This was done for Autosar, but can be useful for security or safety.
    - Many local fault handlers to allow for better handling of problems in a layered way with a global catch (Hard fault) for panics.
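
As an illustration of the first item in the list (again, not part of the original comment): with bit banding, each bit of a word in the SRAM bit-band region has its own word-sized alias address, so an ISR can set one "task ready" bit with a single store instead of a read-modify-write. The sketch below only computes the alias address. The region bases are the standard Cortex-M3/M4 ones, the function name is hypothetical, and bit banding is present only on devices that implement it.

```c
#include <stdint.h>

#define SRAM_BB_BASE   0x20000000u  /* start of the SRAM bit-band region  */
#define SRAM_BB_ALIAS  0x22000000u  /* start of the matching alias region */

/* Alias address of bit `bit` (0..7) of the byte at `byte_addr`.
 * Each bit in the region maps to one 32-bit word in the alias region. */
uint32_t bitband_alias(uint32_t byte_addr, uint32_t bit)
{
    return SRAM_BB_ALIAS + (byte_addr - SRAM_BB_BASE) * 32u + bit * 4u;
}
```

Writing 1 to `*(volatile uint32_t *)bitband_alias(addr, n)` on the target sets bit `n` of the byte at `addr` without touching its neighbours.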

    Regards, Paul

    P.S. To Miro, PendSV is not the only way to ensure a handler runs before returning to task level. You can also use the SET PEND bit on any ISR (system or interrupt) to get this effect. PendSV was added to make it really easy and standard (vs. having to do a port for each MCU variant) for the same reason SysTick was added to the core.

  5. 42Bastian says:
    September 27, 2011 at 3:07 pm

    Hi Miro,
    on the first read it sounded like you were right. But the more often I read your article and look at the trace, the more I get the feeling you are doing something wrong.
    One comment: Not every ISR shall call PendSV, only if there is a need for re-scheduling.

    I also do not see, where SVC comes into the play.

    In our RTOS, I do not clear the PendSV bit. I actually think it would be a false action unless one really wants to cancel a pending scheduling.
    I never got a report about this behavior you describe before. And our customers made some high-ISR-rate applications.

  6. Miro Samek says:
    September 27, 2011 at 9:30 pm

    Thank you all for your comments.

    It’s becoming clear to me that I should have given more background information about the particular preemptive kernel that I was working on, because it is very different from all traditional (blocking) kernels.

    So, my kernel, called QK, is fully preemptive and priority based, but it can only manage single-shot tasks that run to completion and cannot block. (This class of kernels is known in the OSEK/VDX terminology as the BCC1-class kernels.) In exchange for the inability to block such a run-to-completion (RTC) kernel can be very simple and very fast, because it can use a *single stack* for all tasks. The RTC kernel works in the same way as a prioritized interrupt controller, such as the NVIC, which keeps the context of all nested interrupts on a single stack. An RTC kernel implements in software the same policy as NVIC implements in hardware.
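
To illustrate the single-stack idea, here is a minimal host-side sketch of a run-to-completion scheduler. It is not the QK source: all names are invented, and the interrupt locking that a real kernel needs around the ready-set manipulation is left out. The point is only that a higher-priority task "preempts" by being called as an ordinary nested function on the same stack.

```c
#include <stdint.h>

#define MAX_PRIO 8u

typedef void (*task_fn)(void);

task_fn  rtc_tasks[MAX_PRIO + 1u];  /* index = priority, 1..MAX_PRIO   */
uint32_t rtc_ready;                 /* bit (p-1) set: task p is ready  */
uint8_t  rtc_curr_prio;             /* running priority, 0 = idle      */

char rtc_log[8];                    /* demo only: order of execution   */
int  rtc_log_len;

static uint8_t highest_ready(void)
{
    for (uint8_t p = MAX_PRIO; p > 0u; --p) {
        if (rtc_ready & (1u << (p - 1u))) {
            return p;
        }
    }
    return 0u;
}

/* Run every ready task that is more urgent than the current one.
 * Each task is a plain function call, so all contexts nest on one stack. */
void rtc_schedule(void)
{
    uint8_t saved = rtc_curr_prio;
    uint8_t p;
    while ((p = highest_ready()) > saved) {
        rtc_ready &= ~(1u << (p - 1u));
        rtc_curr_prio = p;
        rtc_tasks[p]();             /* runs to completion, cannot block */
    }
    rtc_curr_prio = saved;
}

/* Make task `prio` ready and let it preempt if it is urgent enough. */
void rtc_post(uint8_t prio)
{
    rtc_ready |= 1u << (prio - 1u);
    rtc_schedule();
}

/* Demo tasks: the low-priority task readies a high-priority one midway. */
static void log_char(char c)
{
    if (rtc_log_len < 7) {
        rtc_log[rtc_log_len++] = c;
    }
}

void task_high(void) { log_char('H'); }
void task_low(void)  { log_char('L'); rtc_post(3u); log_char('l'); }
```

Registering `task_low` at priority 1 and `task_high` at priority 3 and then calling `rtc_post(1)` logs `L`, `H`, `l`: the high-priority task runs nested inside the low-priority one, which then resumes and completes.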

    At this point, you might question the usefulness of such a crippled kernel, but it turns out that an RTC kernel is ideal for execution of state machines that exactly need the RTC semantics, but never need to block in the middle of RTC processing. My ESD article “Build a Super-Simple Tasker”, which I wrote together with Robert Ward, explains the inner workings of an RTC kernel and its benefits for executing state machines (www.eetimes.com/design/embedded/4025691/Build-a-Super-Simple-Tasker).

    Going back to the ARM Cortex-M, I found it tricky to implement the RTC kernel, because I couldn’t find a way to send the End-Of-Interrupt (EOI) command to the NVIC. The implementation I ended up with is to fake the PendSV exception return to get to the QK scheduler at the task level, but then I needed to return to the preempted task context, at which point I needed another exception return, for which I employed SVCall. The details of the QK port to ARM Cortex-M are described in the Application Note “QP and ARM Cortex-M with IAR” available at www.state-machine.com/arm/AN_QP_and_ARM-Cortex-M-IAR.pdf. The QK code, including the PendSV assembly code is available from SourceForge.net at sourceforge.net/projects/qpc/files/QP-nano/4.2.04/ (The QP-nano framework is simplest to experiment with).

    @Paul Kimelman: I hope that the provision of the context clarifies some of your concerns about my sanity. The BCC1-class QK kernel I employ to execute state machines lies off the beaten path of traditional kernels, so perhaps my PendSV/SVCall code doesn’t look right to you at first glance. But I hope that if you delve into it just a little deeper, you might find the ultra-simple and ultra-fast QK kernel interesting, in which case I’d love to enlist your help. My basic question is this: Is it possible to send the EOI command to NVIC in any simpler way than I am currently doing?

    • Paul Kimelman says:
      September 27, 2011 at 11:55 pm

      I am somewhat confused by what all the EOI needs to do. Traditionally, the popup thread model (run to completion) works just like any other scheme except that all tasks must yield by return (and so are like polling loop handlers – hold all state locally and make decisions by non-blocking calls (e.g.
      if (MailBoxHasData(&data)) { process data…}
      return
      which return either way).
      So, the stack is created and discarded each time (so is the same). In Cortex-M3, this could be the process stack or the main stack, but the process stack works well for this. Normally the tasks are setup with a return link (LR) which invokes SVC. So, the task is started by “returning” from SVCall to a function like:
      void TaskStart(void (*task_start)(void))
      {
          task_start();   // run the task body to completion
          svc(OS_DONE);   // use SVC to enter SVCall handler
      }

      The PendSV mechanism was designed for what you want: the last ISR to return back to thread level returns into PendSV because the pend bit is set. The PendSV routine can do what it wants and then return into the task level, including messing with the return frame (as a normal kernel does). If you want to add pre-emption on top of what I show above, then you normally would create a fake frame over the real frame in PendSV and return to that TaskStart. When it is finished, the SVCall handler can toss the finished tasks “frame” and get back to the pre-empted task. Since it created the entry frame, it knows how big it is and how to get rid of it.
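
The "fake frame" mentioned here can be sketched as follows (an illustration, not part of the original comment). On exception entry the core stacks eight words in the order R0, R1, R2, R3, R12, LR, PC, xPSR, so a handler can build the same layout by hand and "return" into it. The function name and parameters are hypothetical, the stack is modelled as a plain array so the code runs on a PC, and the 8-byte stack alignment handling of a real port is ignored.

```c
#include <stdint.h>

/* Word offsets within a basic (no FPU) exception stack frame. */
enum { F_R0, F_R1, F_R2, F_R3, F_R12, F_LR, F_PC, F_XPSR, FRAME_WORDS };

#define XPSR_THUMB (1u << 24)  /* T bit: must be set on Cortex-M */

/* Build a fake frame just below `sp` and return the new stack pointer.
 * entry_pc  : address the exception return will jump to (e.g. TaskStart)
 * return_lr : value the entered function sees in LR
 * arg0      : value the entered function sees in R0 (e.g. the task) */
uint32_t *push_fake_frame(uint32_t *sp, uint32_t entry_pc,
                          uint32_t return_lr, uint32_t arg0)
{
    sp -= FRAME_WORDS;
    sp[F_R0]   = arg0;
    sp[F_R1]   = 0u;
    sp[F_R2]   = 0u;
    sp[F_R3]   = 0u;
    sp[F_R12]  = 0u;
    sp[F_LR]   = return_lr;
    sp[F_PC]   = entry_pc & ~1u;  /* stacked PC is kept halfword aligned */
    sp[F_XPSR] = XPSR_THUMB;
    return sp;
}
```

When the task finishes, the SVCall handler can discard the frame it finds in the same way, because it knows the frame is exactly `FRAME_WORDS` words long.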
      In this model, SVCall is only invoked by the task ending and is the same priority as PendSV. That was my intent anyway.
      The only thing to watch for is the stack getting too deep. I guess you can also have some concept of priority inversion but that is not solvable. By that I mean if the lower pri task (below on the stack) has claimed a mutex, the higher pri task obviously cannot “wait” on it and somehow has to delay being run next. Traditional popup schemes use a task invoke list. So, the high pri task registers with the mutex (like a pend list) and when the mutex is freed up, the next task is invoked.
      So, what did I miss about how you want to use EOI?
      Regards, Paul

    • Paul Kimelman says:
      September 28, 2011 at 12:11 am

      As a quick side note: I did envision these popup threads (or state machine polling loop handlers). I used a scheme like that on 8051 years ago to replace ladder logic.
      But, I did not consider pre-emptive tasks. However, I do not see it as a problem if you use the scheme where PendSV/SVCall creates a fake return frame to invoke the task, and the “exit” of the task is an SVCall re-entry. That method allows user tasks but also allows “pre-emption by nesting (stacking)” to work just fine.

    • 42Bastian says:
      September 28, 2011 at 1:04 am

      Miro,
      I think the problem is that you use SVC for scheduling. Actually your scheduling looks strange to me. Why don’t you call the scheduler from PendSV_Handler?
      If the scheduler will be called from some OS function it shall not call QK_schedule_() directly, instead it should make a PendSV.
      Then you have no problem returning from the scheduler.
      Cheers,
      42Bastian
