The key objective of database systems is to manage data reliably, while high query throughput and low query latency are core requirements. To satisfy these requirements for a constantly increasing amount of data, database systems continually adapt to new hardware features [2, 3, 4, 5, 6, 7], for instance new instruction sets, increasing core counts, changing core/cache topologies, increasing DRAM bandwidths, or new persistence technologies (nvRAM) [8, 9, 10]. These advances come with a drawback, though: it has long been known that hardware is subject to soft and hard errors [11, 12, 13]. Soft errors, also called bit flips, may occur due to cosmic rays, heat, hardware aging, or electrical crosstalk; the latter is a consequence of the ongoing miniaturization of integrated circuits [11, 14]. Hardware aging even leads to increasing error rates during a system’s run-time. Despite increasing error rates, database research has been able to focus on improving performance (higher throughput, lower latency) by leveraging hardware improvements, without considering such side effects.
So far, this was possible because soft errors were masked by hardware, i.e., they either did not propagate to the software layer or led to process or system crashes. Server-grade hardware like ECC DRAM can correct single bit flips and detect double bit flips. However, the picture has changed, as large-scale systems are already suffering from increasing error rates. Scaling up today’s hardware detection and correction capabilities is not always sensible and leads to high overheads in terms of additional code space (memory area) as well as coding complexity and latency (multiple or more complex codes). Consequently, error resilience is becoming a major challenge for both hardware and software system designers, and in recent years researchers have gained the insight that a cross-layer approach is required for tackling hardware errors. This opens a challenging new research area.
The idea is that each layer in the hardware/software stack detects and corrects the errors for which it is better suited than the other layers. For the database domain, this requires novel approaches, as resilience has so far mainly been left to the hardware and operating system layers. For instance, database systems could use context knowledge about data types, algorithms, internal data structures, and inherent redundancy to detect and correct hardware errors when and where it is sensible.
We created a main-memory-centric column store prototype that detects transient hardware errors (multi-bit flips) in software. The sources are available on GitHub/brics-db/AHEAD. Other, generic software approaches store and process all data twice or thrice (double or triple modular redundancy). In contrast to these application-oblivious techniques, we use a data encoding that is tailored to the actual data size and can be adapted to a desired error rate. More details on the techniques can be found under AN Coding and Coding Reliability (see navigation bar at the top).
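To make the idea of software-side detection concrete, here is a minimal Python sketch of the general AN-coding principle: each value is multiplied by a constant A, and a stored code word counts as valid only if it is still divisible by A. This is an illustration under stated assumptions, not the prototype's implementation. The constant `A = 641` and all function names are hypothetical and chosen only for this example.

```python
# Minimal AN-coding sketch (illustrative only; not the AHEAD implementation).
# Assumption: A = 641 is an arbitrary odd constant picked for this example.
A = 641


def an_encode(value: int) -> int:
    """Encode a value by multiplying it with the constant A."""
    return value * A


def an_is_valid(code_word: int) -> bool:
    """A code word is valid only if it is a multiple of A."""
    return code_word % A == 0


def an_decode(code_word: int) -> int:
    """Decode a code word, raising an error if corruption is detected."""
    if not an_is_valid(code_word):
        raise ValueError("corrupted code word detected")
    return code_word // A


def flip_bits(code_word: int, positions) -> int:
    """Simulate a transient hardware error by flipping the given bit positions."""
    for position in positions:
        code_word ^= 1 << position
    return code_word


# Encode a small column once.
column = [3, 14, 15, 92]
encoded = [an_encode(v) for v in column]

# Addition works directly on code words: the sum of multiples of A
# is again a multiple of A, so the result can still be checked.
encoded_sum = sum(encoded)
total = an_decode(encoded_sum)

# A simulated double-bit flip turns the code word into a non-multiple of A.
corrupted = flip_bits(encoded[1], [0, 5])
detected = not an_is_valid(corrupted)
```

With an odd A greater than 1, every single-bit flip is detected, because it changes the code word by a power of two, which is never a multiple of A. Multi-bit flips escape detection only when the resulting arithmetic difference happens to be a multiple of A, so the choice of A determines how many flipped bits are reliably caught. Unlike double or triple modular redundancy, each value is stored once, widened only by the bits that A adds.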
In SIGMOD/PODS ’18: 2018 International Conference on Management of Data, June 10-15, 2018,
In Further Improvements in the Boolean Domain. Cambridge Scholars Publishing, 2018.
In 12th International Workshop on Boolean Problems (IWSBP), Freiberg, Germany. 2016.
In Data Management Technologies and Applications. Springer International Publishing, 135-153, 2016.
In Proceedings of 4th International Conference on Data Management Technologies and Applications (DATA), 326-331, 2015.
In Datenbanksysteme für Business, Technologie und Web (BTW), 16. Fachtagung des GI-Fachbereichs Datenbanken und Informationssysteme (DBIS). 675-678, 2015.
In Proceedings of the Tenth International Workshop on Data Management on New Hardware (DaMoN). ACM, 5:1-5:9, 2014.
In Proceedings of the VLDB Endowment (PVLDB). 4:1462-4:1465, 2011.
In Commun. ACM, 59(2):92–99, 2016.
In Commun. ACM, 51(12):77–85, 2008.
In SIGMOD, pages 1891–1906, 2016.
In SIGMOD, pages 1221–1230, 2013.
In PVLDB, 10(7):733–744, 2017.
In SIGMOD, pages 355–370, 2016.
In SIGMOD, pages 371–386, 2016.
In Commun. ACM, 54(5):67–77, 2011.
In IEEE Design & Test, 34(3):4–5, 2017.
In Symposium on Microarchitecture, page 2, 1999.
In IEEE Micro, 25(6):10–16, 2005.
In DAC, pages 99:1–99:10, 2013.
In MTDT, pages 111–116, 2004.
In ASPLOS, pages 111–122, 2012.
In Springer, 2016.