Research project on detecting and correcting transient hardware errors (bit flips) during query processing in main-memory-centric database systems (especially column stores).
The key objective of database systems is to manage data reliably, while high query throughput and low query latency are core requirements [1]. To satisfy these requirements for an ever-increasing amount of data, database systems continually adapt to new hardware features [2, 3, 4, 5, 6, 7], for instance new instruction sets, increasing core counts, changing core/cache topologies, increasing DRAM bandwidths, or new persistence technologies (nvRAM) [8, 9, 10]. These advances come with a drawback, though: it has long been known that hardware is subject to soft and hard errors [11, 12, 13]. Soft errors, also called bit flips, may occur due to cosmic rays, heat, hardware aging, or electrical crosstalk, which, in turn, is due to the ongoing miniaturization of integrated circuits [11, 14]. Hardware aging even leads to increasing error rates during a system’s run-time. Despite increasing error rates, database research has so far been able to focus on improving performance (higher throughput, lower latency) by leveraging hardware improvements, without considering these side effects.
So far, this was possible because soft errors were masked by hardware, i.e., they either did not propagate to the software layer or led to process or system crashes. Server-grade hardware like ECC DRAM can correct single bit flips and detect double bit flips. However, the picture has changed, as large-scale systems are already suffering from increasing error rates. Scaling up today’s hardware detection and correction capabilities is not always sensible and leads to high overheads in terms of additional code space (memory area) as well as coding complexity and latency (multiple or more complex codes). Consequently, error resilience becomes a major challenge for both hardware and software system designers, and in the last couple of years researchers have gained the insight that a cross-layer approach is required for tackling hardware errors [15]. This opens a very interesting and challenging new research area.
The idea is that each layer in the hardware/software stack detects and corrects those errors for which it is better suited than the other layers. For the database domain, this requires novel approaches, as resilience has so far mainly been left to the hardware and operating system layers. For instance, database systems could use context knowledge about data types, algorithms, internal data structures, and inherent redundancy to detect and correct hardware errors when and where it is sensible.
We created a main-memory-centric column store prototype that detects transient hardware errors (multi-bit flips) in software. The sources are available on GitHub under brics-db/AHEAD. Other generic software approaches store and process all data twice or thrice (double or triple modular redundancy). In contrast to these application-oblivious techniques, we use a data encoding that is tailored to the actual data size and can be adapted to a desired error rate. More details on the techniques can be found under AN Coding and Coding Reliability (see navigation bar at the top).
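To make the idea of such an encoding more concrete, here is a minimal sketch of the general AN-coding principle: a value is multiplied by a constant A, and a code word counts as valid only if it is still divisible by A. The constant `A = 641` and all function names are assumptions chosen for illustration; they are not taken from the AHEAD sources, and the prototype's actual implementation may differ considerably.

```python
# Minimal AN-coding sketch. The constant A and the helper names are
# illustrative assumptions, not taken from the AHEAD sources.

A = 641  # hypothetical odd constant; the choice of A determines which
         # multi-bit flip patterns can be detected


def an_encode(value: int, a: int = A) -> int:
    """Encode an integer as a multiple of a."""
    return value * a


def an_is_valid(code: int, a: int = A) -> bool:
    """A code word is valid only if it is divisible by a."""
    return code % a == 0


def an_decode(code: int, a: int = A) -> int:
    """Decode a code word, raising if a corruption is detected."""
    if not an_is_valid(code, a):
        raise ValueError("bit flip detected: code word is not a multiple of A")
    return code // a


def flip_bit(code: int, position: int) -> int:
    """Simulate a transient hardware error by flipping one bit."""
    return code ^ (1 << position)


# Addition works directly on encoded data: A*x + A*y == A*(x + y).
x, y = an_encode(1200), an_encode(34)
total = x + y
assert an_decode(total) == 1234

# A single bit flip changes the code word by +/- 2**k, which is never a
# multiple of an odd A > 1, so the corrupted word fails the check.
corrupted = flip_bit(total, 5)
assert not an_is_valid(corrupted)
```

Unlike double or triple modular redundancy, this scheme keeps a single, slightly wider copy of each value, and the check is one modulo operation. Which multi-bit flips are caught depends on the chosen constant and the data width.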
In SIGMOD/PODS ’18: 2018 International Conference on Management of Data, June 10-15, 2018,
Multi-GPU Approximation for Silent Data Corruption of AN Codes.
In Further Improvements in the Boolean Domain. Cambridge Scholars Publishing, 2018.
Multi-GPU Approximation Methods for Silent Data Corruption of AN-Coding.
In 12th International Workshop on Boolean Problems (IWSBP), Freiberg, Germany. 2016.
Needles in the haystack – Tackling Bit Flips in Lightweight Data Compression.
In Data Management Technologies and Applications. Springer International Publishing, 135-153, 2016.
Resiliency-aware Data Compression for In-memory Database Systems.
In Proceedings of 4th International Conference on Data Management Technologies and Applications (DATA), 326-331, 2015.
Online Bit Flip Detection for In-Memory B-Trees Live!
In Datenbanksysteme für Business, Technologie und Web (BTW), 16. Fachtagung des GI-Fachbereichs Datenbanken und Informationssysteme (DBIS). 675-678, 2015.
Online bit flip detection for in-memory B-trees on unreliable hardware.
In Proceedings of the Tenth International Workshop on Data Management on New Hardware (DaMoN). ACM, 5:1-5:9, 2014.
Resiliency-Aware Data Management.
In Proceedings of the VLDB Endowment (PVLDB). 4:1462-4:1465, 2011.
The Beckman report on database research.
In Commun. ACM, 59(2):92–99, 2016.
Breaking the memory wall in MonetDB.
In Commun. ACM, 51(12):77–85, 2008.
Robust Query Processing in Co-Processor-Accelerated Databases.
In SIGMOD, pages 1891–1906, 2016.
Query processing on Smart SSDs: Opportunities and Challenges.
In SIGMOD, pages 1221–1230, 2013.
Adaptive work placement for query processing on heterogeneous computing resources.
In PVLDB, 10(7):733–744, 2017.
Accelerating relational databases by leveraging remote memory and RDMA.
In SIGMOD, pages 355–370, 2016.
FPTree: A hybrid SCM-DRAM persistent and concurrent B-tree for storage class memory.
In SIGMOD, pages 371–386, 2016.
The future of microprocessors.
In Commun. ACM, 54(5):67–77, 2011.
In IEEE Design & Test, 34(3):4–5, 2017.
New microarchitecture challenges in the coming generations of CMOS process technologies.
In Symposium on Microarchitecture, page 2, 1999.
In IEEE Micro, 25(6):10–16, 2005.
Reliable on-chip systems in the nano-era: lessons learnt and future trends.
In DAC, pages 99:1–99:10, 2013.
Do we need anything more than single bit error correction (ECC)?
In MTDT, pages 111–116, 2004.
In ASPLOS, pages 111–122, 2012.
Reliable Software for Unreliable Hardware – A Cross Layer Perspective.
In Springer, 2016.