The Whole Truth About RTOS. Article # 31. RTOS Diagnostics and Error Checking

Error handling is rarely a major feature of operating systems designed for embedded applications. This is an inevitable consequence of limited resources, since every embedded system has constraints of some kind. Only a small number of such systems can behave like a desktop system, that is, offer the user a choice of actions when an exceptional event occurs.

Broadly, Nucleus SE provides three types of error checking:

compile-time checks on the selected configuration, which ensure that the chosen parameters do not result in errors;
optionally included code that checks run-time behavior;
specific API functions that help in developing more robust code.

All of these are covered in this article, along with some ideas for user-implemented diagnostics.

Previous articles in the series:

Article # 30. Nucleus SE Initialization and Startup Procedures

Article # 29. Interruptions in Nucleus SE

Article # 28. Software timers

Article # 27. System time

Article # 26. Pipes: Ancillary Services and Data Structures

Article # 25. Pipes: Introduction and Basic Services

Article # 24. Queues: ancillary services and data structures

Article # 23. Queues: introduction and basic services

Article # 22. Mailboxes: Ancillary Services and Data Structures

Article # 21. Mailboxes: Introduction and Basic Services

Article # 20. Semaphores: Ancillary Services and Data Structures

Article # 19. Semaphores: introduction and basic services

Article # 18. Event Flag Groups: Helper Services and Data Structures

Article # 17. Event Flag Groups: Introduction and Basic Services

Article # 16. Signals

Article # 15. Memory Partitions: Services and Data Structures

Article # 14. Memory Partitions: Introduction and Basic Services

Article # 13. Task Data Structures and Unsupported API Calls

Article # 12. Task Services

Article # 11. Tasks: configuration and introduction to the API

Article # 10. Scheduler: advanced features and context preservation

Article # 9. Scheduler: implementation

Article # 8. Nucleus SE: Internal Design and Deployment

Article # 7. Nucleus SE: Introduction

Article # 6. Other RTOS services

Article # 5. Task interaction and synchronization

Article # 4. Tasks, context switching, and interrupts

Article # 3. Tasks and Scheduling

Article # 2. RTOS: Structure and Real-Time

Article # 1. RTOS: introduction.

Configuration Checks

Nucleus SE was designed to be highly user configurable, so that the available resources can be put to the best possible use. This configurability comes at a price: the number of possible options, and of interdependencies between them, is very large. As mentioned in many previous articles, most of the configuration of Nucleus SE is done with #define directives in the file nuse_config.h.

To help identify configuration errors, the file nuse_config.c uses #include to pull in the file nuse_config_check.h, which performs consistency checks on the #define directives. The following is an excerpt from that file:

/*** Tasks and task control ***/

#if NUSE_TASK_NUMBER < 1 || NUSE_TASK_NUMBER > 16
    #error NUSE: invalid number of tasks - must be 1-16
#endif

#if NUSE_TASK_RELINQUISH && (NUSE_SCHEDULER_TYPE == NUSE_PRIORITY_SCHEDULER)
    #error NUSE: NUSE_Task_Relinquish() selected - not valid with priority scheduler
#endif

#if NUSE_TASK_RESUME && !NUSE_SUSPEND_ENABLE
    #error NUSE: NUSE_Task_Resume() selected - task suspend not enabled
#endif

#if NUSE_TASK_SUSPEND && !NUSE_SUSPEND_ENABLE
    #error NUSE: NUSE_Task_Suspend() selected - task suspend not enabled
#endif

#if NUSE_INITIAL_TASK_STATE_SUPPORT && !NUSE_SUSPEND_ENABLE
    #error NUSE: Initial task state enabled - task suspend not enabled
#endif

/*** Partition pools ***/

#if NUSE_PARTITION_POOL_NUMBER > 16
    #error NUSE: invalid number of partition pools - must be 0-16
#endif

#if NUSE_PARTITION_POOL_NUMBER == 0
    #if NUSE_PARTITION_ALLOCATE
        #error NUSE: NUSE_Partition_Allocate() enabled - no partition pools configured
    #endif
    #if NUSE_PARTITION_DEALLOCATE
        #error NUSE: NUSE_Partition_Deallocate() enabled - no partition pools configured
    #endif
    #if NUSE_PARTITION_POOL_INFORMATION
        #error NUSE: NUSE_Partition_Pool_Information() enabled - no partition pools configured
    #endif
#endif

The checks in this file include the following:

confirming that at least one, but no more than sixteen, tasks are configured;
confirming that the selected API functions are compatible with the chosen scheduler and other specified options;
confirming that no more than sixteen instances of any other kernel object type are configured;
confirming that no API functions are enabled for object types that have not been instantiated;
confirming that API functions for signals and system time are not enabled when support for those services is disabled;
verifying the selected scheduler type and its associated options.

In every case, a detected error triggers an #error directive at compile time. This normally stops the compilation and displays the corresponding message.

This file does not make it impossible to create an invalid configuration, but it does make it very unlikely.
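The same technique can be applied to application-level settings. The following is a minimal sketch and is not part of Nucleus SE: every APP_ macro and the app_log_entry type are hypothetical names invented for illustration. It combines #error checks in the style shown above with a C11 _Static_assert, which can test things the preprocessor cannot evaluate, such as sizeof.

```c
#include <assert.h>

/* Hypothetical application settings; none of these names come from Nucleus SE. */
#define APP_SENSOR_TASKS     3
#define APP_LOGGER_ENABLE    1
#define APP_LOG_BUFFER_SIZE  64

/* Range check, in the style of nuse_config_check.h */
#if APP_SENSOR_TASKS < 1 || APP_SENSOR_TASKS > 16
#error APP: invalid number of sensor tasks - must be 1-16
#endif

/* Dependency check: an enabled feature needs its resources configured */
#if APP_LOGGER_ENABLE && (APP_LOG_BUFFER_SIZE == 0)
#error APP: logger enabled - no log buffer configured
#endif

/* Hypothetical log record type used only for the size check below */
typedef struct {
    unsigned char  task;
    unsigned char  code;
    unsigned short data;
} app_log_entry;

/* The preprocessor cannot evaluate sizeof, so use a compile-time assertion */
_Static_assert(APP_LOG_BUFFER_SIZE % sizeof(app_log_entry) == 0,
               "APP: log buffer size must be a multiple of the entry size");
```

If any of these conditions is violated, the build fails with the given message, so the mistake cannot reach the target hardware.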

API Parameter Checking

Like Nucleus RTOS, Nucleus SE can optionally include code that checks API call parameters at run time. This is normally used only during initial debugging and testing, since the additional memory overhead is undesirable in production code.

Parameter checking is enabled by setting NUSE_API_PARAMETER_CHECKING to TRUE in the file nuse_config.h, which causes the required additional code to be compiled. The following is an example of parameter checking in an API function:

STATUS NUSE_Mailbox_Send(NUSE_MAILBOX mailbox, ADDR *message, U8 suspend)
{
    STATUS return_value;

#if NUSE_API_PARAMETER_CHECKING
    if (mailbox >= NUSE_MAILBOX_NUMBER)
    {
        return NUSE_INVALID_MAILBOX;
    }
    if (message == NULL)
    {
        return NUSE_INVALID_POINTER;
    }
#if NUSE_BLOCKING_ENABLE
    if ((suspend != NUSE_NO_SUSPEND) && (suspend != NUSE_SUSPEND))
    {
        return NUSE_INVALID_SUSPEND;
    }
#else
    if (suspend != NUSE_NO_SUSPEND)
    {
        return NUSE_INVALID_SUSPEND;
    }
#endif
#endif

With this checking enabled, an API call may return an error code. This is a negative value of the form NUSE_INVALID_xxx (for example, NUSE_INVALID_POINTER); the complete set of definitions is in the file nuse_codes.h.

Additional application code (possibly included using conditional compilation) can be added to process these error values; however, to detect them it is usually better to use the data monitoring facilities of a modern embedded software debugger.

Parameter checking consumes additional memory (for the extra code) and reduces performance, so enabling it affects the entire system. Since the full source code of Nucleus SE is available to the developer, checking and debugging can be done by hand on the production application code if absolute precision is required.
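As an illustration of application code that handles these error values under conditional compilation, here is a minimal host-buildable sketch. It does not call the Nucleus SE API: every DEMO_ name, the error values, and the app_send() wrapper are hypothetical stand-ins chosen so the example is self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* All DEMO_ names are hypothetical stand-ins, not the Nucleus SE API. */
#define DEMO_PARAMETER_CHECKING 1   /* plays the role of NUSE_API_PARAMETER_CHECKING */
#define DEMO_MAILBOX_NUMBER     2

typedef int demo_status;
#define DEMO_SUCCESS            0
#define DEMO_INVALID_MAILBOX  (-1)  /* illustrative values only */
#define DEMO_INVALID_POINTER  (-2)

static unsigned demo_error_count;   /* can be watched from a debugger */
static void *demo_mailbox_data[DEMO_MAILBOX_NUMBER];

/* Stand-in for an API call that validates its parameters when checking is on */
demo_status demo_mailbox_send(unsigned mailbox, void **message)
{
#if DEMO_PARAMETER_CHECKING
    if (mailbox >= DEMO_MAILBOX_NUMBER) {
        return DEMO_INVALID_MAILBOX;
    }
    if (message == NULL) {
        return DEMO_INVALID_POINTER;
    }
#endif
    demo_mailbox_data[mailbox] = *message;
    return DEMO_SUCCESS;
}

/* Application wrapper: the error handling compiles away when checking is off */
demo_status app_send(unsigned mailbox, void **message)
{
    demo_status status = demo_mailbox_send(mailbox, message);
#if DEMO_PARAMETER_CHECKING
    if (status < 0) {
        demo_error_count++;         /* record the failure for later inspection */
    }
#endif
    return status;
}
```

Counting errors in a single variable keeps the overhead small and gives the debugger one location to monitor, which fits the advice above about relying on debugger data monitoring.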

Task Stack Checking

Unless the Run to Completion scheduler is in use, Nucleus SE provides a task stack checking facility, similar to the one in Nucleus RTOS, which reports the space remaining on the stack. This utility API call, NUSE_Task_Check_Stack(), was described in detail in a previous article (# 12). Some ideas for checking for stack errors are given later in this article, in the User Diagnostics section.

Version Information

Nucleus RTOS and Nucleus SE each have an API function that simply returns kernel version and build information.

Nucleus RTOS API Utility Call

Service call prototype:

CHAR *NU_Release_Information(VOID);

Parameters: none.

Return value:

A pointer to a null-terminated string containing version information.

Nucleus SE API Call

This API call supports the core functionality of the Nucleus RTOS API.

Service call prototype:

char *NUSE_Release_Information(void);

Parameters: none.

Return value:

A pointer to a null-terminated string containing version information.

Nucleus SE Implementation of Release Information

The implementation of this API call is very simple. It returns a pointer to the constant string NUSE_Release_Info, which is declared and initialized in the file nuse_globals.c.

The string has the form Nucleus SE - Xyymmdd, where:

X - build status: A = alpha; B = beta; R = release

yy - year of release

mm - month of release

dd - day of release
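Application code may want to decode this string, for example to report the kernel build in a diagnostic message. The sketch below is one possible way to do so and assumes only the Xyymmdd layout described above; the release_info type, the parse_release_info() function, and the sample string used to exercise it are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

typedef struct {
    char status;   /* 'A' = alpha, 'B' = beta, 'R' = release */
    int  year;     /* two-digit year */
    int  month;
    int  day;
} release_info;

/* Decode the trailing Xyymmdd field of a release string.
   Returns 0 on success, -1 if the field is malformed. */
int parse_release_info(const char *text, release_info *out)
{
    size_t len;
    const char *f;
    int i;

    assert(text != NULL && out != NULL);

    len = strlen(text);
    if (len < 7) {
        return -1;
    }
    f = text + len - 7;                 /* the last seven characters */

    if (f[0] != 'A' && f[0] != 'B' && f[0] != 'R') {
        return -1;
    }
    for (i = 1; i < 7; i++) {
        if (!isdigit((unsigned char)f[i])) {
            return -1;
        }
    }

    out->status = f[0];
    out->year   = (f[1] - '0') * 10 + (f[2] - '0');
    out->month  = (f[3] - '0') * 10 + (f[4] - '0');
    out->day    = (f[5] - '0') * 10 + (f[6] - '0');
    return 0;
}
```

In a Nucleus SE application the first argument would be the pointer returned by NUSE_Release_Information(). Reading the field from the end of the string avoids depending on the exact separator that precedes it.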

Compatibility with Nucleus RTOS

Nucleus RTOS includes optional support for a history log, in which the kernel records details of various system activities. API functions allow programs to:

enable / disable logging;
make a log entry;
retrieve a log entry.

This feature is not supported in Nucleus SE.

Nucleus RTOS also has several error management macros that allow assertions (ASSERT) to be made and provide a way to call a user-defined fatal error function. They are optionally included in the OS build. Nucleus SE does not support this functionality.

User Diagnostics

So far in this article, we have looked at the diagnostic and error checking facilities provided by Nucleus SE itself. It is now worth considering how user-defined or application-specific diagnostics can be implemented, using the facilities provided by the kernel and / or our knowledge of its internal structure and implementation.

Application-specific diagnostics

Almost any application can include additional code to verify its own integrity at run time. With a multitasking kernel, it is simple to dedicate a task to this job. Clearly, this article cannot cover diagnostics that are highly specific to one application, but some general ideas are worth considering.

Memory checks

Obviously, correct memory operation is critical to the integrity of any processor-based system. Equally obviously, a catastrophic failure would prevent any software from running at all, not just the diagnostics ( translator's note: incidentally, this is exactly the case we examined in the article “Fake Blue Pill” ). However, there are situations where a fault appears that is a serious cause for concern but does not prevent code execution. Memory testing is a complex topic that is beyond the scope of this article, so I will give only some general ideas.

The two most common faults that occur in RAM are:

“stuck bits”, where a bit has the value 0 or 1 and cannot be changed;
“crosstalk”, where adjacent bits interfere with each other.

Both faults can be detected by writing a series of test patterns to each area of RAM in turn and reading them back. Some tests can be performed only at startup, before even a stack has been set up ( translator's note: in the article mentioned above it turned out that it was the first use of the stack that destroyed everything at once ). One example is a “walking ones” test, in which each bit of memory in turn is set to one and all other bits are checked to confirm that they are zero. Other bit pattern tests can be performed during operation, provided that no context switch can occur while the RAM area under test holds corrupted data. Using the Nucleus SE critical section macros NUSE_CS_Enter() and NUSE_CS_Exit() is a simple and scalable way to ensure this.
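A walking ones test of the kind just described might be sketched as follows. This is an illustrative, host-buildable version rather than production test code: ram_walking_ones() is a hypothetical name, and the DEMO_CS_ENTER / DEMO_CS_EXIT macros are empty placeholders standing in for NUSE_CS_Enter() and NUSE_CS_Exit(). The test is destructive and its cost grows with the square of the region size, so it suits small regions or startup use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholders so the sketch builds on a host; a Nucleus SE application
   would use NUSE_CS_Enter() and NUSE_CS_Exit() here instead. */
#define DEMO_CS_ENTER()  ((void)0)
#define DEMO_CS_EXIT()   ((void)0)

/* Destructive "walking ones" test over a small RAM region.
   Returns 0 if every bit can be set on its own, otherwise -1.
   The region is left zeroed. */
int ram_walking_ones(volatile uint32_t *ram, size_t words)
{
    size_t w, k;
    unsigned bit;
    int result = 0;

    assert(ram != NULL);

    DEMO_CS_ENTER();                    /* no context switch while RAM holds test data */

    for (w = 0; w < words; w++) {
        ram[w] = 0;
    }

    for (w = 0; w < words && result == 0; w++) {
        for (bit = 0; bit < 32 && result == 0; bit++) {
            ram[w] = (uint32_t)1 << bit;        /* exactly one bit set in the region */

            for (k = 0; k < words; k++) {
                uint32_t expected = (k == w) ? ((uint32_t)1 << bit) : 0;
                if (ram[k] != expected) {       /* stuck bit or crosstalk */
                    result = -1;
                    break;
                }
            }
        }
        ram[w] = 0;
    }

    DEMO_CS_EXIT();
    return result;
}
```

Checking the whole region after each write is what lets the test catch interference between bits as well as bits that will not change state.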

The various types of ROM are also prone to occasional errors, but there are few ways to check for them. A checksum calculated after the code has been built can be useful here. It can be verified at boot time and possibly also at run time.

A fault in the memory addressing logic can affect both ROM and RAM. A specific test for this could be devised, but such a fault will most likely show up in the tests described above.
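A checksum test can be very small. The sketch below uses a simple 16-bit additive checksum purely for illustration; rom_checksum() and rom_is_intact() are hypothetical names, and the expected value is assumed to be produced by a build step and stored outside the checked region. A CRC would detect more kinds of corruption at the cost of more code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 16-bit additive checksum over a memory region (wraps modulo 65536) */
uint16_t rom_checksum(const uint8_t *start, size_t length)
{
    uint16_t sum = 0;
    size_t i;

    assert(start != NULL);

    for (i = 0; i < length; i++) {
        sum = (uint16_t)(sum + start[i]);
    }
    return sum;
}

/* Returns 1 if the region matches the checksum recorded at build time,
   otherwise 0 */
int rom_is_intact(const uint8_t *start, size_t length, uint16_t expected)
{
    return rom_checksum(start, length) == expected;
}
```

At run time the region can be checked in small slices from a low-priority task, so that the test never holds the processor for long.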

Checking Peripherals

Besides the CPU and memory, peripheral circuits can also be prone to faults. Of course, this varies greatly from system to system, but most devices offer some means of verifying their integrity with diagnostic software. For example, a communications channel may have a loopback test mode in which any data sent into the channel is immediately returned.

Watchdog Service

Embedded developers often use a watchdog timer. This is a peripheral device that either interrupts the CPU and expects a response or (preferably) requires periodic access from software. In both cases, the usual result of a watchdog timeout is a system reset.

Using a watchdog effectively in a multitasking environment is a difficult problem. If a single task is given the job of accessing the watchdog periodically, this confirms only that that particular task is running. One possible solution is to implement a “supervisor task”. An example of such a task is given below.

Stack Overflow Checking

Unless you are using the Run to Completion scheduler, a stack is created for each task in a Nucleus SE application. The integrity of these stacks is very important, but RAM is likely to be limited, so it is important to size the stacks optimally. Statically predicting the stack requirements of each task is possible, but very difficult. The stack must be large enough for the deepest function call nesting together with the most demanding interrupt handler. A simpler approach is to use exhaustive run-time testing.

Broadly speaking, there are two approaches to stack checking. If you use a sophisticated embedded software debugger, the stack boundaries can be monitored and any violation detected. The location and size of the Nucleus SE stacks are available in the global ROM data structures NUSE_Task_Stack_Base[] and NUSE_Task_Stack_Size[].

The alternative is run-time testing. A common approach is to place “guard words” at the end of each stack, which is usually the first element of each stack data area. These words are initialized to a recognizable non-zero value. A service or diagnostic task then checks whether these words have changed and takes appropriate action. An overwritten guard word does not necessarily mean that the stack has overflowed, but it indicates that this is about to happen. The software can therefore probably continue running for long enough to take corrective action or report the error to the user.
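The guard word technique can be sketched in a few lines. This example is self-contained and does not touch real Nucleus SE stacks: demo_stack, GUARD_WORD, guard_init() and guard_check() are hypothetical, and the code assumes stacks that grow downward, so that element 0 of each stack area is the last word to be used. In a real application the stack areas would be located through NUSE_Task_Stack_Base[].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_TASKS        2
#define DEMO_STACK_WORDS  32
#define GUARD_WORD        0xC0DEFACEu   /* arbitrary recognizable non-zero value */

/* Hypothetical task stacks used only for this illustration */
static uint32_t demo_stack[DEMO_TASKS][DEMO_STACK_WORDS];

/* Write a guard word at the far end of each stack.
   Assumes downward-growing stacks: element 0 is used last. */
void guard_init(void)
{
    unsigned t;

    for (t = 0; t < DEMO_TASKS; t++) {
        demo_stack[t][0] = GUARD_WORD;
    }
}

/* Returns the index of the first task whose guard word has been
   overwritten, or -1 if all guard words are intact. */
int guard_check(void)
{
    unsigned t;

    for (t = 0; t < DEMO_TASKS; t++) {
        if (demo_stack[t][0] != GUARD_WORD) {
            return (int)t;
        }
    }
    return -1;
}
```

guard_init() would be called once before the tasks start, and guard_check() periodically from the diagnostic task, which can then decide whether to report the problem or restart the system.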

Supervisor Task

Although Nucleus SE does not reserve any of the sixteen possible tasks for its own use, the user can choose to dedicate one task to diagnostics. This could be a low-priority task that simply uses any “spare” processor time, or a high-priority task that runs periodically for a short time, which guarantees that the diagnostics are performed regularly.

The following is an example of how such a task might work.

The signal flags of the supervisor task are used to monitor the operation of six critical system tasks. Each of these tasks is assigned a specific flag (bit 0 to bit 5) and must set it regularly. The supervisor task clears all the flags and then suspends itself for a period of time. When it resumes, it expects all six tasks to have “checked in” by setting their flags, so it looks for an exact match with the value b00111111 (from the file nuse_binary.h). If the value matches, it clears the flags and suspends again. If not, it calls the critical error handling routine, which might, for example, reset the system.

An alternative implementation could use an event flag group. This makes sense if signals are not otherwise used in the application (since enabling signals incurs a RAM overhead for every task), and even more so if event flags are already being used for other purposes.
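The check-in logic can be sketched independently of the kernel. In the illustration below, a plain variable stands in for the supervisor task's signal flags, and task_check_in() and supervisor_poll() are hypothetical names; a real Nucleus SE application would use the signal services described in article # 16 rather than accessing a shared variable directly, since the read-modify-write shown here is not protected against preemption.

```c
#include <assert.h>
#include <stdint.h>

#define ALL_CHECKED_IN  0x3Fu   /* bits 0 to 5 set, i.e. b00111111 */

/* Stand-in for the supervisor task's signal flags */
static volatile uint8_t checkin_flags;

/* Called regularly by each monitored task (numbered 0 to 5) */
void task_check_in(unsigned task)
{
    assert(task < 6);
    checkin_flags |= (uint8_t)(1u << task);
}

/* Called by the supervisor each time it wakes up.
   Clears the flags ready for the next period and returns 1 if all six
   tasks checked in since the previous call, otherwise 0. */
int supervisor_poll(void)
{
    uint8_t seen = checkin_flags;

    checkin_flags = 0;
    return seen == ALL_CHECKED_IN;
}
```

A natural place to service a hardware watchdog is immediately after supervisor_poll() returns 1: if any monitored task stops checking in, the watchdog is no longer serviced and the system resets.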

Tracing and Profiling

Although many modern embedded debuggers are highly configurable and can be made RTOS-aware, debugging a multi-threaded application can still be difficult. A widely used approach is post-execution profiling, in which the code (the RTOS) is instrumented so that a detailed record of its activity can be analyzed in retrospect. An implementation of such a facility typically has two components:

Additional code is added to the RTOS to log activity. It is normally wrapped in preprocessor directives so that it can be conditionally compiled. This code records a few bytes of information when a significant event occurs (for example, an API call or a context switch). This information may include:
- the current address (PC);
- the ID (index) of the current task;
- the indexes of any other objects involved;
- a code identifying the operation performed.
A task dedicated to unloading the profile information buffer to external storage, usually on the host computer.

Analyzing the data obtained in this way also takes some work, but it may need nothing more sophisticated than an Excel spreadsheet.

In the next article, we will examine in detail the compatibility of Nucleus SE and Nucleus RTOS.

About the author: Colin Walls has worked in the electronics industry for more than thirty years, devoting most of his time to firmware. He is now a firmware engineer at Mentor Embedded (a division of Mentor Graphics). He frequently speaks at conferences and seminars and is the author of numerous technical articles and two books on firmware. He lives in the UK. Colin's professional blog, e-mail: colin_walls@mentor.com.
