Live Q&A - Mars Perseverance Software
Steve Scandore - EOC 2021 - Duration: 58:22
That's correct, the general use of semaphores (and task locks in general) is not allowed in the software. This avoids some classic misuses and unexpected task dependencies (e.g., priority inversions, deadlocks) in an architecture where we want tasks to operate as independently and deterministically as possible. Using semaphores also complicates runtime analysis and testing in a system with processing deadlines. In short, we remove, or use only cautiously, constructs that could call the correct operation of the code into question. In many cases we have easily redesigned code to avoid the casual use of a semaphore. Having said all that, we do have cases where waivers to this rule are granted. For example, IPC waits on messages are implemented using a semaphore. Disallowing semaphores by default, and then granting waivers in the few places where they are really required, helps ensure their safe use and the safe operation of the system overall.
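To make the pattern above concrete, here is a minimal sketch (in Python for brevity, not flight code) of a task that shares no state and takes no explicit locks: it blocks in exactly one place, on its message queue, and processes each message to completion. The names `CounterTask`, `post`, and `demo` are hypothetical and chosen only for illustration. Note that `queue.Queue` synchronizes internally, which loosely mirrors the remark that IPC waits on messages are the place where a semaphore is still used.

```python
import queue
import threading


class CounterTask:
    """A task that owns its state and is driven only by messages."""

    def __init__(self):
        self._inbox = queue.Queue()  # the single blocking point of the task
        self._count = 0              # private state, touched only by the task thread
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def join(self):
        self._thread.join()

    def post(self, msg, reply_to=None):
        """Called from other tasks: enqueue a message and return immediately."""
        self._inbox.put((msg, reply_to))

    def _run(self):
        while True:
            msg, reply_to = self._inbox.get()  # block here, and only here
            if msg == "stop":
                return
            self._handle(msg, reply_to)        # runs to completion, never blocks

    def _handle(self, msg, reply_to):
        if msg == "increment":
            self._count += 1
        elif msg == "report" and reply_to is not None:
            reply_to.put(self._count)          # unbounded queue, so put() does not block


def demo():
    """Four producers post 100 increments each; no caller locks the counter."""
    task = CounterTask()
    task.start()

    def produce():
        for _ in range(100):
            task.post("increment")

    producers = [threading.Thread(target=produce) for _ in range(4)]
    for p in producers:
        p.start()
    for p in producers:
        p.join()

    reply = queue.Queue()
    task.post("report", reply_to=reply)  # queued after all 400 increments
    total = reply.get(timeout=5)
    task.post("stop")
    task.join()
    return total
```

Because only the task thread ever touches `_count`, there is nothing for the producers to lock, and the task's timing depends only on the messages it receives.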
As I mentioned in the talk, we have learned a lot over the years. Here's a link to a related dependency problem from our past: https://www.youtube.com/watch?v=C2xKhxROmhA
Thanks Steve. I used to work on the avionics software for the F-16 at General Dynamics in the late '80s. Back then we had no RTOS, just a homegrown "cyclic executive". We weren't even allowed to pass parameters to a function because it took too long to push and pop data from the stack, so everything was in global data. There was even a complete software team just to manage the global data. I am really glad to hear that VxWorks is now used. I used to work with them at various contractors when I worked for Rational Software. The Ada days. :-)
Thanks for the great presentation.
You mentioned "compression and data streaming" on the Mars Perseverance FSW slide. Which compression formats do you use? Are they custom made or public/known protocols?
I remember NASA's articles about TTEthernet and how they use it. Did you also use (TT)Ethernet? Why?
Yunas, see related comment from DavidKnight below. We do not use time-triggered (TT) Ethernet on this mission. Unfortunately, it takes many years to get new technology (in space terms) introduced into the mission avionics baseline. I do hope it happens.
I often wondered what kind of state machine implementation was used in this kind of mission-critical SW. I was pretty happy to learn that Dr. Samek's awesome QP framework was used alongside a traditional RTOS. This framework deeply changed my way of programming embedded SW.
Also, thank you so much for giving us insight into the SW used for the Mars rover missions: it gives us a slight feeling of having been part of this big adventure.
Yes, I was also happy to hear that "Samek's hierarchical state machines were used". (So they are apparently on Mars now! Awesome!). But Steve didn't actually say that the whole QP framework was used in this mission and from my understanding this was actually NOT the case. But the system was clearly event-driven, which is sufficient to apply at least the state machine part...
Steve, in the past there have been papers published from JPL and NASA about software engineering techniques employed in various subsystems (e.g. in MRO's radios where QP was, at least initially, used and verified using SPIN/Promela).
I would be interested in learning more about how your team designed, implemented and tested the software to achieve this level of reliability. Are there any resources from which we can learn more details about this mission?
Very inspiring, thanks for the presentation Steve. The amount of redundancy and reliability needed in a mission of this scale is crazy. It's also astonishing how much one can achieve on such a limited processor with a good architecture.
Perfect Presentation, thanks a lot for your time and your effort.
Can you give info about UnitTest Coverage in this magnificent project??
Thank you for the comment. We require 100% code coverage in unit testing. Waivers to the 100% rule are allowed in specific cases where coverage is not possible (e.g., intentional spin-loops). We use gcov to measure and report the coverage. It's not perfect: we can't easily measure code path coverage, so we rely more on test reviews to ensure the right tests exist. We can then use gcov to see what parts of the code have not been tested and fill in those test gaps.
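As an illustration of the gap-finding step, here is a small sketch that scans the text of a `.gcov` report for lines that never executed (gcov marks them with `#####`, or `=====` on exceptional paths) and separates out lines carrying a waiver comment. The `COVERAGE_WAIVER` marker, the function name, and the sample report are assumptions made up for this example; the project's actual waiver mechanism was not described.

```python
def uncovered_lines(gcov_text, waiver_marker="COVERAGE_WAIVER"):
    """Return (unwaived, waived) lists of source line numbers that never ran."""
    unwaived, waived = [], []
    for raw in gcov_text.splitlines():
        # Each report line looks like "<count>:<line number>:<source text>".
        parts = raw.split(":", 2)
        if len(parts) != 3:
            continue  # e.g. branch or function summary lines from `gcov -b`
        count, lineno, source = parts[0].strip(), parts[1].strip(), parts[2]
        if not lineno.isdigit() or int(lineno) == 0:
            continue  # line 0 carries metadata such as "Source:..."
        if count in ("#####", "====="):
            # Hypothetical convention: a marker comment on the line waives it.
            target = waived if waiver_marker in source else unwaived
            target.append(int(lineno))
    return unwaived, waived


# Hand-written sample in gcov's report layout (not output from a real run).
SAMPLE = """\
        -:    0:Source:wait.c
        -:    1:#include "hw.h"
        3:    2:void wait_ready(void) {
        3:    3:    if (hw_ready()) return;
    #####:    4:    hw_log_timeout();
    #####:    5:    for (;;) { } /* COVERAGE_WAIVER: intentional spin-loop */
        -:    6:}
"""

unwaived, waived = uncovered_lines(SAMPLE)
print("needs a test:", unwaived, "waived:", waived)
```

This only reports line coverage, which matches the limitation mentioned above: it says nothing about which paths through the code were exercised.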
Hey Steve, Thank you so much for this great presentation.
At one point in the Q&A I think you mentioned having a custom version of gzip for compression. I was wondering if this is the only compression algorithm used or does the rover use other algorithms like huffman coding or rice encoding?
I'm also wondering how the downlink works, does the rover use the CCSDS space packet protocol or some custom packetization protocol? If CCSDS is used do you still use the custom gzip compression for downlink or do you use the standard rice encoding that CCSDS recommends?
My initial response was originally focused on the engineering data and science data aspects of compression. The other types we use in this area are: lzo (data), jpg (image), icer (image), loco (image). All these have their own encoding algorithm methods. Some of these originated from JPL/NASA missions.
For downlink, I would say we use a tailored but compliant version of the CCSDS space packet protocol. The data in the space packets are compressed using gzip or one of the others mentioned above. The CCSDS transfer frames are then streamed through additional telecom-specific encoders for reliability, not really for bandwidth management. This can be Reed-Solomon encoding, but we more commonly use Turbo encoding methods.
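For readers unfamiliar with the packet layer described above, here is a rough sketch of wrapping a gzip-compressed payload in a CCSDS space packet. It builds only the standard 6-byte primary header (version, type, secondary header flag, APID, sequence flags, sequence count, data length minus one). The APID, the payload, and the function names are made up for illustration, Python's stock `gzip` module stands in for the mission's custom gzip, and the mission's tailoring as well as the transfer-frame and Reed-Solomon/Turbo coding layers are not shown.

```python
import gzip
import struct


def build_space_packet(apid, seq_count, payload, compress=True):
    """Wrap payload bytes in a CCSDS space packet with a primary header only."""
    data = gzip.compress(payload) if compress else payload
    if not 0 <= apid < 2**11:
        raise ValueError("APID must fit in 11 bits")
    if not 1 <= len(data) <= 65536:
        raise ValueError("packet data field must be 1..65536 bytes")
    # Word 0: version (3 bits) = 0, type (1 bit) = 0 for telemetry,
    #         secondary header flag (1 bit) = 0, APID (11 bits).
    word0 = apid
    # Word 1: sequence flags (2 bits) = 0b11 (unsegmented), sequence count (14 bits).
    word1 = (0b11 << 14) | (seq_count & 0x3FFF)
    # Word 2: number of bytes in the packet data field, minus one.
    word2 = len(data) - 1
    return struct.pack(">HHH", word0, word1, word2) + data


def parse_space_packet(packet):
    """Split a packet built above back into its header fields and data field."""
    word0, word1, word2 = struct.unpack(">HHH", packet[:6])
    return {
        "apid": word0 & 0x7FF,
        "seq_count": word1 & 0x3FFF,
        "data": packet[6:6 + word2 + 1],
    }


payload = b"temp=21.5;" * 50          # 500 bytes of repetitive engineering data
packet = build_space_packet(apid=0x123, seq_count=5, payload=payload)
fields = parse_space_packet(packet)
print(len(payload), "bytes in,", len(packet), "bytes on the wire")
```

Compression here reduces the size of the data field, while the channel coding applied later adds redundancy for reliability, which is the distinction drawn in the answer above.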
Excellent presentation. Wonderful insight into how this national treasure was constructed, tested, successfully launched, and landed. I learned a great deal about how embedded software development/software modeling is carried out by one of the nation's brightest! Thank you so much for your time!!! 73, Dave Comer, NM5DC
Such a cool system! Thanks for the presentation. By the way, why did you choose the PowerPC 750? Is it the best rad-hard processor on the USA space market now?
The RAD750 is a space-qualified, radiation-hardened processor from the early 2000s. The qualification process is long and expensive. It was the best choice for this mission (Perseverance) given the reuse directive and implementation timeline. There are newer versions and options which were not fully qualified in the early 2010s time frame for this mission.
Hi Steve, thank you for the great presentation!
I'm an embedded systems engineer, and my career started with a simple CubeSat, so I was glad to hear about the Perseverance architecture. Perseverance is one of the largest embedded systems. You could probably talk for more than an hour on each topic (e.g., customized processor, cruise software, fail-safes for radiation tolerance, temperature, ...), and until now I had not heard even a part of them. Today I could. Thank you Steve and the EOC 2021 team for giving me this opportunity.
Steve, great presentation. The work that you guys are doing is simply awe-inspiring to human civilization.
Thank you
Thank you, Steve - this was excellent and very informative. There was also plenty to think about, seeing the complexity of the system and even how the flight software had 1.2 million lines of flight code and over 50 percent more (1.9 million) lines of unit test code.
12:40:54 From Dave Nadler : Could you tell us a bit about how you tested the software?
12:41:10 From Keith J : Thanks for that Steve. That was awesome to see how advanced things have become from a compute standpoint... I'm old enough to remember Apollo - although as a young kid.
12:41:18 From Tom.Davies : Awesome presentation
12:41:28 From Matjaž Finc : There is no room for error on such missions. How do you cope with the stress of "what if my code goes wrong" while developing and also during the mission? Which mission stage makes you the most nervous?
12:42:27 From Jeremy Schreiber : Awesome talk! How large is the development team? What type of development process (agile, waterfall, etc) do you follow to pull off a project of this size and complexity?
12:42:30 From Raul Pando : How much do you rely on Over The Air (Space) updates :)?
12:42:31 From Alex Burka : Can you comment any more on what went wrong with the first Ingenuity flights and what was the fix that "works 85% of the time"?
12:42:43 From Radu Pralea : C only? C++? All 100 tasks handled by a single core running at 200 MHz (<1% of computing power of a Raspberry Pi)?
12:42:47 From Matjaž Finc : Which QP kernel did you use? QKX?
12:42:48 From David : Was all imaging and other high data components passed over RS422 or 1553 or were there additional higher speed buses?
12:43:18 From Radu Pralea : *10%
12:43:18 From afwaanquadri : What framework did you use for state machines ?
12:44:13 From Jonnyvb : Following on from the stress of "what if my code goes wrong" question from Matjaz - what sort of processes do you go through when something does go wrong to stop that kind of issue happening again and to learn the lessons from it?
12:44:18 From David Potter : Can you talk about your top down architecture and associated documentation process?
12:44:47 From Dave Comer : my apologies in advance for the naïve question. I worked on the Galileo mission back in the 1980's. How, or did, that mission help the current efforts on Mars?
12:46:12 From Radu Pralea : Do you use TDD? :)
12:46:36 From ken H : can you tell us how many person-hours went into software development? What percent was test/validation?
12:48:23 From Miro Samek : Very interesting that you mention the following practices used by NASA: event-driven architecture, threads structured as event-loops, blocking in one place only, NO blocking during message processing. These best practices are collectively known as the "Active Object" design pattern. Do you use this name ("Active Object") to quickly reference to your architecture?
12:48:23 From Davy Baker : If you could start over, what would you do differently ?
12:49:35 From Alex : What was the biggest enabler (e.g. test bed/automated builds) for the firmware development?
12:49:51 From David Kanceruk : I imagine you use a build server. How long does it take to compile the code?
12:50:48 From afwaanquadri : Did you have any User-Interface to test specific modules of the software?
12:51:35 From Meenal Burrows : How big is the flight software team?
12:53:55 From Simon Voigt Nesbo : Was there anything that didn't work? That we wouldn't know just watching the news
12:54:06 From Gopinath : What caught my eye is how low the frequencies are in the system - processor running at 135 kHz, buses at 8 Hz, 64 Hz, etc. Is there a reason for this? EMI?
12:55:53 From Miro Samek : For anyone interested in the NASA software architecture used originally on the Pathfinder, which apparently is still very much influencing the current missions, there is a paper: "Managing Concurrency in Complex Embedded Systems" by Dr. David Cummings (you can google for it).
12:56:33 From Simon Voigt Nesbo : The slides said 132 MHz for the CPU, not 135 kHz
12:59:30 From Tim Michals : Are the checklists and design methodology open source or available?
13:00:24 From David Potter : What code analysis tools?
13:01:30 From Gopinath : Correct, my bad. But even 132 MHz is low.
13:01:52 From Leopy : Is everything human-coded, or are some functions on the board computer dealt with by ML/AI?
13:05:00 From jvillasante : Fantastic! Are you hiring? :)
13:06:15 From David Potter : Are your software design rules available to the public?
13:09:56 From Dave Comer : Is there a SysML talk or material that the public can access?
13:09:56 From Michael Kirkhart : https://yurichev.com/mirrors/C/JPL_Coding_Standard_C.pdf
13:15:34 From Tom.Davies : What tools do you use to autogenerate the code?
13:16:10 From Radu Pralea : How do you deal with real-time stuff in the sw simulation environment (I guess the models of the "peripherals" could be slower than the actual hardware), so how do you test the actual software (which I assume would depend on real timings on the real system) in a fully simulated environment? Do you have some timing abstraction layer taking care of this?
13:16:25 From Tom.Davies : Now that Perseverance is on the surface, how long will you remain on the project before moving onto the next project?
13:17:26 From Kurtovic, Tarik (1.59) : On-target unit testing is mandatory in some industries. Do you (need to) do on-target unit testing?
13:18:10 From Dave Nadler : What caused the resets during transit to Mars?
13:19:21 From Dave Comer : Alpha particle....This was a key concern in testing SRAMs, EPROMs, FLASH, EEPROM....
13:23:54 From Jay : Can you talk about telemetry and logging? What do you log, how large memory footprint, how do you encode the logs? How do you ensure that you can use these to diagnose unexpected events?
13:30:24 From Andrei : Might be a silly question (and I may have missed it).. Are the coms being encrypted with publicly available algorithms?
13:34:30 From Dave Nadler : I need a desktop pyro simulation...
13:34:39 From Meenal Burrows : :-)
13:35:09 From afwaanquadri : This was great! Thanks for your presentation!
13:35:10 From Keith J : Thank you very much Steve! Fascinating stuff. Very much appreciated you taking the time.
13:35:37 From Dave Nadler : Thanks Steve - Awesome work and awesome presentation!
13:36:43 From Simon Voigt Nesbo : Yeah thanks for the great presentation. And thanks for answering my question :)
13:36:54 From Stefan Petersen : Thanks Steve! Great hearing about your work and set ups, great presentation and QnA.
13:37:07 From Meenal Burrows : Brilliant keynote and awesome Q&A. Thanks Steve for your time with us all.
13:37:52 From Rob Meades : That was excellent, many, many, thanks.
13:38:03 From Jay : Great presentation and discussion!
13:38:04 From Gopinath : Excellent presentation, Steve. Thank you very much.
13:38:06 From mdohring : Wonderful presentation! Thank you very much!
13:38:08 From Yuriy Kozhynov : Thanks a lot!!!
13:38:10 From Juan : thank you!
13:38:11 From Erwin : Awesome Job you do and a great talk! Fantastic to get this insights on how you work!
13:38:11 From Dan Rittersdorf : Great talk and QA, Steve! Thank you so much.
13:38:17 From Tom.Davies : Thank you Steve
13:38:23 From David Pastl : Great presentation, thank you Steve!
13:38:25 From Jose E. : Thanks for the presentation!
13:38:26 From Doug Peters : Great talk!!
13:38:29 From Eric : Thank you for a great presentation!
13:38:44 From James G : Awesome! Extremely interesting and enlightening, Steve. Thank you!
13:38:45 From Andrey Shevelov : Great presentation! Thanks a lot!
13:38:46 From PeteMehn : Well done. Thanks for sharing!
13:38:54 From Michael Kafarowski : Thank you!
13:39:00 From Sam : thank you.
13:39:06 From Leandro Pérez : Thanks
13:39:27 From Christopher Long : Thank you
Fabulous presentation and Q&A, Steve. You mentioned that, with respect to coding, you do not allow any developers to use semaphores. Just curious why that is the case. I have a hunch :-)