BRICS

BRICS - Bit Flip Resilience for In-Memory Column Stores

Research project on detecting and correcting transient hardware errors (bit flips) during query processing in main-memory-centric database systems (especially, column stores).


Introduction

The key objective of database systems is to manage data reliably, while high query throughput and low query latency are core requirements [1]. To satisfy these requirements for a constantly growing amount of data, database systems continually adapt to new hardware features [2, 3, 4, 5, 6, 7], for instance new instruction sets, increasing core counts, changing core/cache topologies, increasing DRAM bandwidths, or new persistence technologies (nvRAM) [8, 9, 10]. These advances come with a drawback, though: it has long been known that hardware is subject to soft and hard errors [11, 12, 13]. Soft errors, also called bit flips, may be caused by cosmic rays, heat, hardware aging, or electrical crosstalk; the latter is a consequence of the ongoing miniaturization of integrated circuits [11, 14]. Hardware aging even leads to increasing error rates during a system’s run-time. Despite increasing error rates, database research has so far been able to focus on improving performance (higher throughput, lower latency) by leveraging hardware improvements, without considering any side effects.

So far, this was possible because soft errors were masked by hardware, i.e., they either did not propagate to the software layer or led to process or system crashes. Server-grade hardware like ECC DRAM can correct single bit flips and detect double bit flips. However, the picture has changed, as large-scale systems are already suffering from increasing error rates. Scaling up today’s hardware detection and correction capabilities is not always sensible, because it leads to high overheads in terms of additional code space (memory area) as well as coding complexity and latency (multiple or more complex codes). Consequently, error resilience is becoming a major challenge for both hardware and software system designers, and in recent years researchers have gained the insight that a cross-layer approach is required for tackling hardware errors [15]. This opens an interesting and challenging new research area.

The idea is that each layer in the hardware/software stack detects and corrects those errors for which it is better suited than the other layers. For the database domain, this requires novel approaches, as resilience has so far mainly been left to the hardware and operating system layers. For instance, database systems could use context knowledge about data types, algorithms, internal data structures, and inherent redundancy to detect and correct hardware errors when and where it is sensible.

AHEAD: Adaptable Data Hardening for On-the-Fly Hardware Error Detection during Database Query Processing

We created a main-memory-centric column store prototype that detects transient hardware errors (multi-bit flips) in software. The sources are available on GitHub/brics-db/AHEAD. Other generic software approaches store and process all data two or three times (double or triple modular redundancy). In contrast to these application-oblivious techniques, we use a data encoding that is tailored to the actual data size and can be adapted to a desired error rate. More details on the techniques can be found under AN Coding and Coding Reliability (see navigation bar at the top).
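To give a rough feel for the encoding mentioned above, the following is a minimal sketch of the general AN coding idea: a value is hardened by multiplying it with a constant A, and a code word is accepted only if it is still a multiple of A. The constant `A = 641` and the helper names (`encode`, `is_valid`, `decode`, `flip_bits`) are illustrative assumptions and are not taken from the AHEAD sources.

```python
# Minimal AN-coding sketch. The constant A is an arbitrary illustrative
# choice (an assumption), not a parameter taken from the AHEAD prototype.
A = 641  # odd constant greater than 1


def encode(value: int) -> int:
    """Harden a value by multiplying it with A."""
    return value * A


def is_valid(code_word: int) -> bool:
    """A code word is valid only if it is a multiple of A."""
    return code_word % A == 0


def decode(code_word: int) -> int:
    """Check the code word and recover the original value."""
    if not is_valid(code_word):
        raise ValueError("corrupted code word detected")
    return code_word // A


def flip_bits(word: int, positions) -> int:
    """Simulate bit flips at the given bit positions."""
    for position in positions:
        word ^= 1 << position
    return word


if __name__ == "__main__":
    hardened = encode(42)
    print(decode(hardened))               # round trip recovers the value
    corrupted = flip_bits(hardened, [3])  # simulate a single bit flip
    print(is_valid(corrupted))            # divisibility by A is broken
    # Sums of code words are code words, so addition works on hardened data.
    print(decode(encode(20) + encode(22)))
```

A single bit flip changes a code word by plus or minus a power of two, which is never a multiple of an odd A greater than 1, so every single flip is detected in this sketch. Which multi-bit patterns are detected depends on the choice of A and on the data width. The storage cost is roughly log2(A) additional bits per value, rather than the doubled or tripled footprint of modular redundancy.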

References

  1. The Beckman report on database research.

    D. Abadi et al.

    In Commun. ACM, 59(2):92–99, 2016.

  2. Breaking the memory wall in MonetDB.

    P. A. Boncz, M. L. Kersten, and S. Manegold.

    In Commun. ACM, 51(12):77–85, 2008.

  3. Robust Query Processing in Co-Processor-Accelerated Databases.

    S. Breß, H. Funke, and J. Teubner.

    In SIGMOD, pages 1891–1906, 2016.

  4. Query processing on Smart SSDs: Opportunities and Challenges.

    J. Do, Y. Kee, J. M. Patel, C. Park, K. Park, and D. J. DeWitt.

    In SIGMOD, pages 1221–1230, 2013.

  5. Adaptive work placement for query processing on heterogeneous computing resources.

    T. Karnagel, D. Habich, and W. Lehner.

    In PVLDB, 10(7):733–744, 2017.

  6. Accelerating relational databases by leveraging remote memory and RDMA.

    F. Li, S. Das, M. Syamala, and V. R. Narasayya.

    In SIGMOD, pages 355–370, 2016.

  7. FPTree: A hybrid SCM-DRAM persistent and concurrent B-Tree for storage class memory.

    I. Oukid, J. Lasperas, A. Nica, T. Willhalm, and W. Lehner.

    In SIGMOD, pages 371–386, 2016.

  8. The future of microprocessors.

    S. Borkar and A. A. Chien.

    In Commun. ACM, 54(5):67–77, 2011.

  9. Emerging memory technologies.

    J. Henkel.

    In IEEE Design & Test, 34(3):4–5, 2017.

  10. New microarchitecture challenges in the coming generations of CMOS process technologies.

    F. J. Pollack.

    In Symposium on Microarchitecture, page 2, 1999.

  11. Designing reliable systems from unreliable components: The challenges of transistor variability and degradation.

    S. Y. Borkar.

    In IEEE Micro, 25(6):10–16, 2005.

  12. Reliable on-chip systems in the nano-era: lessons learnt and future trends.

    J. Henkel, L. Bauer, N. Dutt, P. Gupta, S. R. Nassif, M. Shafique, M. B. Tahoori, and N. Wehn.

    In DAC, pages 99:1–99:10, 2013.

  13. Do we need anything more than single bit error correction (ECC)?

    M. Spica and T. M. Mak.

    In MTDT, pages 111–116, 2004.

  14. Cosmic rays don’t strike twice: understanding the nature of DRAM errors and the implications for system design.

    A. A. Hwang, I. A. Stefanovici, and B. Schroeder.

    In ASPLOS, pages 111–122, 2012.

  15. Reliable Software for Unreliable Hardware – A Cross Layer Perspective.

    S. Rehman, M. Shafique, and J. Henkel.

    In Springer, 2016.