Super-Simple Tasker - Hardware RTOS for ARM Cortex-M
Miro Samek - EOC 2023 - Duration: 40:31
Super-Simple Tasker (SST) is a preemptive, priority-based RTOS kernel fully compatible with Rate Monotonic Analysis/Scheduling (RMA/RMS). It was initially published in 2006.
This session will present a unique hardware implementation of a modern version of SST for ARM Cortex-M (M0, M0+, M3, M4, M7, etc.).
The session will begin by introducing preemptive but non-blocking multitasking with so-called basic tasks (as defined in the OSEK/VDX and AUTOSAR operating-system standards). Such tasks are ideal for executing event-driven state machines (a.k.a. active objects), which are also non-blocking and handle events in a run-to-completion fashion.
Next, you will see how these basic tasks can be mapped to ARM Cortex-M interrupts and how to program the NVIC interrupt controller to manage both interrupts and basic tasks.
The bulk of the session will demonstrate various preemption scenarios, including inter-task communication and the compatibility of SST with RMA/RMS. The demos will utilize a logic analyzer and will run on a low-end ARM Cortex-M0 and on a high-end ARM Cortex-M7 with FPU.
If you are interested in real-time performance, you should see SST in action. SST is available on GitHub under the permissive MIT open-source license. SST is likely the most efficient and performant RMS-compatible RTOS kernel for ARM Cortex-M.
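The core mechanism can be sketched in a few lines of CMSIS code: a basic task is bound to an otherwise unused interrupt vector, its priority is programmed into the NVIC, and the task is activated in software by pending that interrupt. The device header, IRQ choice, and function names below are illustrative assumptions, not SST's actual API:

```c
/* Illustrative sketch only -- not SST's actual API. A "basic task" is bound
 * to an otherwise unused NVIC interrupt and activated in software by
 * pending that interrupt; the NVIC hardware then handles all preemption.
 */
#include "stm32f4xx.h"            /* assumption: any CMSIS device header    */

#define BLINKY_IRQn  TIM7_IRQn    /* assumption: an IRQ unused by the app   */

void Blinky_task(void);           /* the task body: runs to completion      */

/* The handler of the chosen vector simply calls the task body. */
void TIM7_IRQHandler(void) {
    Blinky_task();                /* one run-to-completion activation       */
}

void Blinky_start(uint32_t prio) {
    /* task priority = IRQ priority (in the NVIC, lower number = more urgent) */
    NVIC_SetPriority(BLINKY_IRQn, prio);
    NVIC_EnableIRQ(BLINKY_IRQn);
}

/* "Scheduling" the task (e.g., after posting an event to its queue) is
 * just pending the interrupt in software.
 */
void Blinky_activate(void) {
    NVIC_SetPendingIRQ(BLINKY_IRQn);
}
```

Because the NVIC itself tracks pending and active priorities, preemption, tail-chaining, and return-from-task all come for free from the hardware.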
I had a bit of a fascination with schedulers a while back, since they were covered so heavily in my grad studies and because they're so easy to get wrong! For those interested in reading more, Phil Koopman has a whole chapter on schedulers in his book "Better Embedded System Software", and some of that material can be found on his teaching website. Miro's book is also freely available.
If the idea of a small, simple scheduler is of interest to you, here are a few that I've found on the internet. Some look like an RTOS but are written in very few lines of C (like SST), which gives them smaller overhead and also helps make them easier to understand.
- RIOS (article | code)
- HeliOS
- cocoOS (website | code)
- Punctual
- ULWOS2 (article | code)
- Protothreads
- Coroutines in C
Thank you, Nathan, for the compilation of the various schedulers. From that list, I'm only familiar with Adam Dunkels' Protothreads. I've also written about how Protothreads compare to state machines. (Readers might enjoy a heated debate at the end of that blog post.)
But if we're making lists, here is a list of approaches related more specifically to the discussed SST kernel (although not necessarily incorporated into SST in any way):
Yes, thanks for posting those! I saw that list at the end of your presentation and I'm very excited to check them out. (My apologies if it seemed like I was trying to hijack your thread with my last post!)
Hi, thanks for the lucid explanation. We have been using a very similar implementation, albeit on a different family of microcontrollers. One qualm that some of us have had with this implementation is that the number of possible priorities depends on the number of interrupt priorities supported by the hardware, because of the way the stack is used. While I agree that the context switch is significantly faster since the OS need not swap out the stack, don't you think that implementations such as FreeRTOS are more flexible with regard to the usage of interrupt priorities?
Hi Aniket,
There is a lot to unpack here, so let me peel off the layers of this question one by one.
Let me start with the number of available priority levels (which is often confused with the number of available tasks). The main technical reason why virtually all RTOS kernels (such as FreeRTOS and also SST) support preemptive, priority-based scheduling is the RMA/RMS method. However, if you really apply this method, you will quickly notice that it becomes hopelessly complex for priorities beyond 2 or 3 levels starting from the top. In practice, people don't apply RMA/RMS (do you?). Instead, task priorities are usually assigned by "intuition" and then endlessly tweaked to compensate for various "glitches" (which can all be traced to some blocking calls wrecking the responsiveness of the system). With this type of approach, it is really hard to say how many task priorities you actually need. So, going back to the presented hardware implementation of SST, the NVIC limit on interrupt/task priorities might be perceived as low. But if you choose Cortex-M silicon with 4 NVIC priority bits (16 priority levels), such as an STM32 with a Cortex-M3 core or higher, you should really have more priority levels than you can possibly know what to do with.
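If you are unsure how many priority bits your particular silicon implements, a quick probe like the following can tell you at startup. This is a rough sketch, assuming a Cortex-M3 or higher (where the NVIC priority registers are byte-addressable); on M0/M0+ the count is fixed at 2 bits, and vendor headers expose the same information as the __NVIC_PRIO_BITS macro:

```c
/* Rough sketch: probe how many NVIC priority bits the silicon implements.
 * Assumes Cortex-M3 or higher (byte-addressable NVIC->IP[]) and that it
 * runs at startup, before IRQ 0 is enabled.
 */
#include "stm32f4xx.h"                 /* assumption: any CMSIS device header */

uint32_t nvic_prio_bits(void) {
    uint8_t saved = NVIC->IP[0];       /* save the priority of IRQ 0          */
    NVIC->IP[0] = 0xFFU;               /* try to set all 8 priority bits      */
    uint8_t readback = NVIC->IP[0];    /* unimplemented low bits read as 0    */
    NVIC->IP[0] = saved;               /* restore                             */

    uint32_t bits = 0U;
    while ((readback & 0x80U) != 0U) { /* count the implemented (high) bits   */
        ++bits;
        readback = (uint8_t)(readback << 1);
    }
    return bits;                       /* e.g., 4 on most STM32 => 16 levels  */
}
```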
The second issue is the number of possible tasks, which goes to the fundamental reason why an RTOS kernel is used in the first place. That reason is that an RTOS allows you to split your application into quasi-independent tasks (so an RTOS is foremost a "divide and conquer" strategy). To that end, the frequent cause of the proliferation of tasks is blocking: blocked tasks waiting for one kind of event are unresponsive to other events, so you split them into more tasks that can wait for those other events in parallel. This, of course, leads to poor cohesion and to sharing of resources, which leads to mutual exclusion and more blocking. For all these reasons, experts advise avoiding blocking, even if your RTOS allows it. (The NASA JPL developers, from the paper I quoted in the presentation, were certainly aware that the VxWorks RTOS would let them block anywhere they wanted. Still, they specifically chose not to.) My point here is that non-blocking tasks remain extensible for new functionality (new events), so you need fewer of them. Such tasks give you the freedom to split your functionality the way it really fits the problem, with proper cohesion, rather than to compensate for unresponsiveness inflicted by blocking.
In the end, the dominant preconception is that for real work you need a "real" RTOS kernel, like FreeRTOS. I disagree. But I understand the immense inertia of the embedded community, and therefore I devoted 1/4 of the presentation to explaining how you can use FreeRTOS in an event-driven way. If this is the only takeaway from this talk, I'd consider my main mission accomplished. For anyone interested in a more complete event-driven framework built on top of FreeRTOS, I recommend the "FreeAct" project available on GitHub.
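The essence of that event-driven style on top of FreeRTOS can be condensed into a short sketch: each task blocks in exactly one place, on its own private event queue, and dispatches every event to completion without any other blocking calls. The names and event type below are illustrative assumptions, not FreeAct's actual API:

```c
/* Sketch of the event-driven style on top of FreeRTOS (the idea behind
 * FreeAct); the names and event type are illustrative, not FreeAct's API.
 */
#include <stdint.h>
#include "FreeRTOS.h"
#include "task.h"
#include "queue.h"

typedef struct {
    uint16_t sig;                      /* event signal    */
    uint16_t param;                    /* event parameter */
} Event;

void Blinky_dispatch(Event const *e);  /* non-blocking, run-to-completion */

static QueueHandle_t blinkyQueue;

/* The ONLY blocking call in the task is on its own event queue. */
static void Blinky_taskFcn(void *arg) {
    (void)arg;
    for (;;) {
        Event e;
        if (xQueueReceive(blinkyQueue, &e, portMAX_DELAY) == pdTRUE) {
            Blinky_dispatch(&e);       /* never blocks, never polls */
        }
    }
}

void Blinky_start(UBaseType_t prio) {
    blinkyQueue = xQueueCreate(10, sizeof(Event));
    (void)xTaskCreate(Blinky_taskFcn, "Blinky",
                      configMINIMAL_STACK_SIZE + 128U, NULL, prio, NULL);
}

/* Other tasks (or ISRs, via xQueueSendFromISR()) post events; they never
 * call into Blinky directly.
 */
void Blinky_post(Event const *e) {
    (void)xQueueSend(blinkyQueue, e, 0U);   /* non-blocking post */
}
```

The task stays responsive to any event posted to it, which is exactly the property that additional blocking calls would destroy.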
Hi Miro, great presentation!
On the topic of blocking, is it possible to avoid all blocking calls in a more complex system, for example one that has a filesystem, networking stack and shell? I am a big fan of event driven development, and I agree that blocking calls are not ideal and should be avoided. But I am struggling to see how it would be possible to completely remove them from a large IoT platform for instance.
Can a substantial, real-life system be implemented without blocking? Yes, I think so! One example is the NASA JPL program of Martian rovers (starting with Mars Pathfinder in the mid-1990s), which is described in the paper discussed in the presentation. The real-time embedded software in the latest Perseverance rover certainly qualifies as a complex system.
The biggest problem is not technical impossibility, but rather the traditional sequential approach taken in middleware (such as communication software, file systems, etc.). Such third-party software often blocks internally and thus requires blocking primitives and a traditional blocking RTOS to run. Also, there is often sharing of resources among concurrent tasks, which requires mutual-exclusion mechanisms (also typically blocking, like mutexes). Interestingly, for reasons of performance, some such third-party software (e.g., the LwIP TCP/IP stack) is internally implemented in an event-driven way, and blocking is added only at the end for "convenience" (e.g., to provide a BSD-style blocking socket interface). But event-driven, non-blocking middleware is becoming increasingly popular and available. A good example is the Mongoose Embedded TCP/IP Web Server from Cesanta.
Regarding embedded file systems in particular, the situation is complicated because these often use microsecond-level delays (e.g., while erasing segments). Such delays are too short for blocking, so they are implemented as busy polling. Obviously, this causes high CPU utilization, and because of that, the tasks running file systems must be given low priority. (Otherwise, they will prevent other tasks from running for a long time.) This last comment applies to any execution environment, blocking or not.
Thank you, I appreciate the detailed response!
This is definitely the best or one of the best presentations at this conference. I've been aware of Miro's scheduler work for a couple years and have occasionally used a (simpler?) C++ event loop from embxx. This talk was a great introduction and I plan to watch some of his YouTube videos to supplement.
Thank you, Paul!
Please also visit the SST repo on GitHub, and try it yourself. The provided examples are intentionally for very inexpensive eval-boards. I also used a cheapo logic analyzer to make testing SST easy.
For a deeper background about event-driven, preemptive but non-blocking kernels, I'd recommend reading the original SST paper. An additional bonus is a comment section attached at the end, which clarifies many misconceptions.
Finally, I hope to see you at the live discussion on Zoom!
--Miro
Miro,
Thanks for the talk. Using the unused NVIC vectors to dispatch tasks is very creative.
I have been using a simple, prioritized, event-driven, non-preemptive dispatcher since the 1990s on numerous projects. In my implementation there is only one event queue, and events are pulled by priority before being sent to their registered event handler (task). This structure allows a task, which is often a state machine, to handle events based on priority rather than in FIFO order. Could you comment on what advantages you see in associating priorities with the tasks instead of the events?
Thanks, Chip Weller
Hi Chip,
Thank you for the questions. Your non-preemptive dispatcher reminds me somewhat of the non-preemptive "SST0" that I mentioned in my talk (see https://github.com/QuantumLeaps/Super-Simple-Tasker#non-preemptive-sst0). But in contrast to your dispatcher, SST0 uses multiple event queues (priority queues).
So, regarding the question about the single but prioritized event queue, I have the following problems with this approach: First, sorting the queue at runtime as events of various priorities are posted (or pulled out) is probably non-trivial (expensive) and maybe not quite deterministic (it likely depends on the number of events in the queue).
But even more importantly, "pulling events by their priority" potentially changes the order of events for the same task (e.g., when a low-priority event is followed by a high-priority event and the low-priority event has not been consumed yet). This is confusing at best and leads to errors at worst. For example, if the low-priority event triggers a state transition, that new state might react completely differently to the following high-priority event than the previous state would have. I hope you see my point.
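A toy illustration of that point (not taken from SST or from any real system): the reaction to the high-priority event depends on whether the earlier low-priority event has already been processed.

```c
/* Toy example of the re-ordering hazard: EVT_CONNECT (low priority) causes
 * a state transition, and the reaction to EVT_DATA (high priority) depends
 * on whether that transition has happened yet.
 */
typedef enum { EVT_CONNECT, EVT_DATA } Signal;
typedef enum { DISCONNECTED, CONNECTED } State;

static State state = DISCONNECTED;

void dispatch(Signal sig) {
    switch (state) {
        case DISCONNECTED:
            if (sig == EVT_CONNECT) { state = CONNECTED; }
            /* EVT_DATA in this state is silently dropped */
            break;
        case CONNECTED:
            if (sig == EVT_DATA)    { /* process the data */ }
            break;
    }
}
/* FIFO delivery (CONNECT, then DATA): the data is processed.
 * Priority delivery that pulls DATA out first: the data is dropped.
 */
```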
And another argument against assigning priorities to events (as opposed to tasks) is that it can lead to priority inversion. This probably applies only to preemptive kernels, but still, let me quickly elaborate. Suppose, for example, that a task is handling a low-priority event when a high-priority event arrives in the common queue. Due to the RTC (run-to-completion) semantics, the high-priority event must wait until the processing of the previous event completes. But that processing runs at a low priority (the priority of the original event). That means that medium-priority events destined for other tasks can preempt the low-priority one while an urgent, high-priority event is waiting in the queue. This would be the classic unbounded priority inversion, as it can go on forever (as long as medium-priority events keep arriving in the queue).
For these reasons, SST (and other kernels of this type, like OSEK/VDX) assigns priorities to tasks but uses multiple (priority) queues. This seems to give you the best of both worlds. The work can be prioritized (through the priority queues) and hard real-time deadlines can be met without the risk of priority inversion. At the same time, the order of events for the same task is preserved (FIFO queueing). There is also no overhead or non-determinism from sorting events by priority in a common queue (either when they are posted or when they are pulled out). Of course, there is a cost to having multiple event queues, but it is relatively low.
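To make the arrangement concrete, here is a minimal sketch (not SST's actual data structures) of "priority on the task, FIFO queue per task": each task owns a small ring buffer that preserves event order, and the kernel always runs the highest-priority task whose queue is non-empty.

```c
/* Minimal sketch -- not SST's actual data structures. The priority belongs
 * to the task; its private queue is a plain FIFO ring buffer, so the event
 * order for that task is always preserved. (A real implementation would
 * protect post/dispatch with a critical section.)
 */
#include <stdint.h>
#include <stdbool.h>

#define QUEUE_LEN 8U

typedef struct { uint16_t sig; uint16_t param; } Event;

typedef struct {
    Event   buf[QUEUE_LEN];           /* FIFO ring buffer                    */
    uint8_t head, tail, used;
    uint8_t prio;                     /* priority of the TASK, not of events */
    void  (*handler)(Event const *e); /* run-to-completion event handler     */
} Task;

bool Task_post(Task *t, Event const *e) {
    if (t->used == QUEUE_LEN) {
        return false;                 /* queue full                          */
    }
    t->buf[t->head] = *e;
    t->head = (uint8_t)((t->head + 1U) % QUEUE_LEN);
    ++t->used;
    /* here the task would also be "activated" -- in the hardware SST by
     * pending its NVIC interrupt at the task's priority */
    return true;
}

bool Task_dispatchOne(Task *t) {
    if (t->used == 0U) {
        return false;                 /* nothing pending                     */
    }
    Event e = t->buf[t->tail];        /* always the OLDEST event: FIFO       */
    t->tail = (uint8_t)((t->tail + 1U) % QUEUE_LEN);
    --t->used;
    t->handler(&e);                   /* runs to completion, never blocks    */
    return true;
}
```

Posting and dispatching are both O(1), which is where the determinism comes from.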
Miro,
Thanks for the reply. You bring up some good points. I have not run into issues with handling events out of order, but that may largely be because I know they can be received out of order. Often a state machine needs to handle race conditions between events, and my approach just makes them more likely.
As far as priority inversion goes, yes, it can happen, but I could argue the other way. Typically the timing requirements can be viewed as a chain of events, so the highest-priority deadline doesn't need to take into account how many events could be queued before the most time-critical one. Doing a deadline-monotonic analysis becomes fairly simple with this approach. (As for RTC, in a few cases I have had to break low-priority event handlers up, firing a new event and terminating, in order to allow higher-priority events to meet their deadlines.)
I hadn't noticed that a single SST task could handle multiple prioritized queues. That certainly would address some of my concerns.
As far as the cost of pulling events in priority order, the cost is bounded, and I am typically surprised by how few events are in the queue.
Thanks, Chip
Hi Chip,
Just to clarify, SST has only one event queue per task. I called all these private, per-task queues collectively "priority queues" because they behave according to the standard priority queue algorithm (every SST queue has the priority of the associated task).
But I'd like to step back and reflect on the bigger picture here, which is the non-blocking property of the event handlers. This alone is what makes all such kernels possible (like your dispatcher, SST, or SST0). This is also what I meant in the presentation by remarking that event-driven code (implying non-blocking code) is more flexible. Yet a simple kernel like that can provide all the benefits of a traditional (blocking) RTOS kernel, such as the ability to decompose an application into tasks and the efficient, real-time execution of those tasks. Due to the non-blocking property (by which I also mean non-polling), the one-shot event handlers typically run very fast (at the microsecond level), so meeting hard real-time deadlines is much easier.
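As a final illustration of what "non-blocking (and non-polling)" looks like in practice, here is a hedged sketch: work that would traditionally be written as "start + wait" is split into "start and return now, finish in a later run-to-completion step". All names below are placeholders.

```c
/* Sketch of the non-blocking style; all names are placeholders. Instead of
 * starting a conversion and waiting for it, the handler starts it and
 * returns; the completion arrives later as another event (e.g., posted by
 * the ADC ISR) and is finished in a separate run-to-completion step.
 */
#include <stdint.h>

typedef enum { EVT_MEASURE_REQ, EVT_ADC_DONE } Signal;
typedef struct { Signal sig; uint16_t data; } Event;

void ADC_startConversion(void);      /* placeholder: kicks off the hardware */
void process_sample(uint16_t raw);   /* placeholder: uses the result        */

void Sensor_dispatch(Event const *e) {
    switch (e->sig) {
        case EVT_MEASURE_REQ:
            ADC_startConversion();   /* start...                            */
            break;                   /* ...and return immediately           */
        case EVT_ADC_DONE:           /* posted from the ADC ISR             */
            process_sample(e->data); /* finish in this later RTC step       */
            break;
    }
}
```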
Outstanding presentation! Thank you for sharing it. It has certainly caused me to consider ways that I may be able to apply this paradigm in my next project design.