Dell Reliable Memory Technology PRO: Detect and Isolate Memory Errors

Regardless of the manufacturer or type of RAM, almost all computer memory contains some microdefects. A memory manufacturer may spend 10 to 15% of a DIMM's cost on extensive error testing, yet the memory can still fail while the system is running. A variety of factors, from overheating to aging and the microdefects themselves, can lead to memory errors.







In fact, dynamic random access memory (DRAM) errors occur orders of magnitude more often than previously reported. In a recent large-scale field study of DRAM errors, based on data collected over more than two years, about a third of all machines and more than 8% of DIMMs recorded at least one correctable error per year (DRAM Errors in the Wild: A Large-Scale Field Study). On some platforms, almost 50% of systems had correctable errors (ibid.). On average only about 1.3% of systems experienced uncorrectable errors, although on some platforms the figure was 2-4%.



In standard office PCs, memory errors rarely affect everyday application software. However, in high-end, compute-intensive systems used in finance, oil and gas exploration, medical imaging, media production (rendering and editing) and similar fields, data integrity is an essential part of the overall system architecture. In such high-performance systems, memory is one of the components most often replaced after a failure, and memory errors are among the most common hardware problems that lead to system crashes (ibid.).







Thus, the ability to detect DIMM errors, report them and prevent failures in high-performance workstations becomes a necessity.



Given the high demands placed on RAM, Dell has patented an exclusive technology, used in Dell Precision workstations, that helps mark unusable memory and remove it from service. This feature helps reduce system downtime, simplify IT support, and lower overall maintenance costs, while extending memory life and improving user productivity.



This article covers the basic concepts behind Dell Reliable Memory Technology PRO (RMT PRO), some of the main causes of memory errors, and how RMT PRO helps deal with them.



RAM



As processor technology advances, bus speeds increase and overall architectures improve, computer systems are becoming more complex, and RAM has to keep up with these changes.







In essence (and greatly simplified), a DRAM chip is an array of on/off elements that hold their state (1 or 0) as long as power is present. When power is removed, they lose their contents. Several chips are assembled into a memory subsystem on a printed circuit board: a dual in-line memory module (DIMM).



Most workstations, such as Dell Precision, use a type of DIMM known as DDR4 SDRAM (synchronous dynamic random-access memory). Compared with earlier memory types (for example, DDR3), DDR4 runs faster, offers greater bandwidth and higher memory density, and requires a lower supply voltage.



Memory errors



Memory errors can be caused by many factors, with the result that a DRAM bit spontaneously flips to the opposite state (for example, from 1 to 0 when it should have stayed at 1 during that cycle). Contributing factors include overheating, memory age, defects and so on. Studies have shown that the error rate rises sharply during the first 10 months of a DIMM's operation.



Errors of this kind are called correctable errors: they corrupt bits at random but leave no physical damage, and can be fixed by rewriting the memory contents.



However, uncorrectable errors also occur in many cases. An uncorrectable error is a recurring bit error caused by a physical defect or another DIMM anomaly, or the result of two errors occurring within a single memory block. It can crash the whole system (requiring a reboot) or an application (a system-level Stop Error code, a core dump, or a blue screen of death, BSoD). Frequent correctable errors are a warning sign of approaching uncorrectable errors: in studies, about 65-80% of uncorrectable errors were preceded by a correctable error in the same month.
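Since correctable errors tend to precede uncorrectable ones, one simple way to act on that signal is to count correctable errors per DIMM and flag modules that cross a threshold. The sketch below is purely illustrative: the event format, the DIMM labels, the function name `dimms_at_risk` and the threshold of 3 are all assumptions, not part of any Dell tool.

```python
from collections import Counter

def dimms_at_risk(events, threshold=3):
    """Return labels of DIMMs whose correctable-error count reaches the threshold.

    events: iterable of (dimm_label, kind) pairs, where kind is "CE"
    (correctable error) or "UE" (uncorrectable error). Hypothetical format.
    """
    # Count only correctable errors, grouped by DIMM label.
    ce_counts = Counter(dimm for dimm, kind in events if kind == "CE")
    return sorted(dimm for dimm, n in ce_counts.items() if n >= threshold)

# Hypothetical error log for one month.
log = [
    ("DIMM1", "CE"), ("DIMM2", "CE"), ("DIMM1", "CE"),
    ("DIMM1", "CE"), ("DIMM3", "UE"),
]
flagged = dimms_at_risk(log)
```

With this log, only DIMM1 reaches three correctable errors, so it is the only module flagged for attention.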



Error handling



Today, many workstation-class PCs include memory parity checking which, put simply, verifies each time a data byte is read that the data returned is the same as the data that was written.
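To make the idea concrete, here is a minimal sketch of even parity over a single byte. It assumes one parity bit per byte and uses hypothetical helper names (`parity_bit`, `store`, `check`). Real memory controllers do this in hardware.

```python
from typing import Tuple

def parity_bit(byte: int) -> int:
    """Even-parity bit for an 8-bit value: 1 if the number of 1-bits is odd."""
    return bin(byte & 0xFF).count("1") % 2

def store(byte: int) -> Tuple[int, int]:
    """Simulate a write: keep the byte together with its parity bit."""
    return byte & 0xFF, parity_bit(byte)

def check(byte: int, stored_parity: int) -> bool:
    """Simulate a read: recompute parity and compare with the stored bit."""
    return parity_bit(byte) == stored_parity

data, p = store(0b10110010)
flipped_once = data ^ 0b00000100   # one bit flipped: parity no longer matches
flipped_twice = data ^ 0b00000110  # two bits flipped: parity matches again
```

Here `check(flipped_once, p)` fails, so the error is detected, while `check(flipped_twice, p)` passes even though the data is wrong. Parity can detect an odd number of flipped bits but cannot correct anything, which is why more capable schemes exist.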







More complex systems use other methods of error detection and correction. The most common is error-correcting code (ECC) memory, which is used in servers and workstations such as Dell Precision. ECC memory includes extra bits, and the integrated memory controller uses them to check the stored data; in the event of a single-bit error, the ECC logic corrects the error and returns the corrected data so that the system can continue working.
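The single-bit correction described above can be illustrated with a toy code. The sketch below uses a Hamming(7,4) code plus one overall parity bit, which corrects one flipped bit and detects two. It is an assumption chosen for brevity: ECC DIMMs protect much wider words, and the exact code used by any given memory controller is not stated in this article.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit codeword (Hamming(7,4) + overall parity)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 holds the overall parity bit
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total number of 1-bits even
    return code

def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of set positions is 0 for a valid word
    overall_odd = sum(code) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        code[syndrome] ^= 1             # single flip: syndrome is its position
        status = "corrected"
    else:
        return "uncorrectable", None    # two flips: detected, cannot be repaired
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble

word = encode(0b1011)
one_flip = list(word); one_flip[5] ^= 1
two_flips = list(word); two_flips[5] ^= 1; two_flips[6] ^= 1
```

Decoding `one_flip` returns the original value with status `corrected`, while `two_flips` is reported as `uncorrectable`. That second case is the kind of failure ECC alone cannot repair.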



ECC handles isolated memory errors well and keeps the system stable. However, ECC memory offers no remedy for multiple errors in a single memory block. In these cases, data corruption will occur. This is the situation Dell Reliable Memory Technology PRO is designed to help with.



Benefits of RMT PRO



If a hard disk platter is physically damaged, the system marks the bad sector as unusable. In most computers, however, including workstations with ECC memory, an uncorrectable error or several correctable errors in one memory block of a DIMM can crash the system. The user usually has to report the error to the support team, which in turn must run a diagnostic program to find it. Often a single failure means replacing the entire DIMM.



The result is increased downtime, reduced productivity, loss of IT staff time, the need to replace DIMMs and possible damage to key application files.







Dell Reliable Memory Technology PRO (RMT PRO) comes to the rescue.

Similar in concept to bad-sector handling on hard disk drives, RMT PRO detects uncorrectable errors and multi-bit correctable errors in a DIMM and works around the problem. Costly downtime, diagnostics, opening the system and swapping a failed DIMM are replaced by a single reboot.







After a simple reboot, an RMT PRO workstation hides the defective area from the operating system. Applications and critical system functions bypass the marked area and keep working with no need to replace hardware, as if the bad memory had never existed. This keeps the system running and reduces the number of system crashes and application errors.
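Conceptually, hiding a defective area amounts to subtracting bad regions from the memory map handed to the operating system. The sketch below shows that idea only. It is not Dell's implementation, and the 4 KiB page granularity, the address ranges and the function name `exclude_bad` are assumptions made for illustration.

```python
PAGE = 4096  # assumed masking granularity in bytes

def exclude_bad(usable, bad, page=PAGE):
    """Subtract bad regions, widened to whole pages, from usable [start, end) ranges."""
    # Round each bad range outward to page boundaries and sort by start address.
    aligned = sorted((s // page * page, -(-e // page) * page) for s, e in bad)
    result = []
    for start, end in usable:
        cursor = start
        for bad_start, bad_end in aligned:
            if bad_end <= cursor or bad_start >= end:
                continue                      # no overlap with the remaining range
            if bad_start > cursor:
                result.append((cursor, bad_start))
            cursor = max(cursor, bad_end)     # skip past the bad pages
        if cursor < end:
            result.append((cursor, end))
    return result

# One 64 KiB usable range with a single faulty byte at 0x2345 (hypothetical).
memory_map = exclude_bad([(0x0, 0x10000)], [(0x2345, 0x2346)])
```

The result is two usable ranges, `(0x0, 0x2000)` and `(0x3000, 0x10000)`. The page containing the faulty address is simply absent, so software allocating from this map never touches it.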



RMT PRO can also reduce hardware costs for memory modules. Because memory can deteriorate under intensive use or excessive heat (usually caused by high loads), the number of physical errors may grow over time. Even so, the "bad memory" information remains on the DIMM. In addition, if a DIMM does need replacing, RMT PRO shows in the BIOS which specific DIMMs are causing errors, which speeds up troubleshooting and replacement and helps reduce downtime and total cost of service. RMT PRO thus extends the service life of RAM and helps save money.







Conclusions



Although some error detection schemes, such as ECC memory, can catch memory errors, many of them can handle only correctable errors. When physical defects or uncorrectable DIMM errors occur, Dell RMT PRO provides an additional level of detection and handling of defective memory.



By mapping out and excluding bad memory regions, RMT PRO ensures that compute-intensive applications access only usable memory. This can save significant time and money by reducing hardware and DIMM replacements and cutting downtime. When data integrity is critical, RMT PRO provides the necessary level of confidence, keeping the available memory usable to maximize the performance and reliability of a workstation.


