Fault Tolerance of Persistent Memory

2022/09/29 PM

In this article, we will list several papers on local NVM/PM fault tolerance (the details may be filled in later).

note: fault tolerance across networks is out of scope here. Related works mostly use replication, from Mojim (ASPLOS ‘15) to Rowan-KV (OSDI ‘23).

ps: quite a long time since the last update… struggling to build systems and write a paper…🥲

problems

define data reliability problems on PM:

  • media errors
    • cell wear out
    • bit flip
  • software scribbles
    • bugs at the firmware level
    • exposed addresses
  • crash inconsistency

ECC is only useful for small-scale media errors.
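
To make the "small-scale" limit concrete, here is a minimal sketch using a toy Hamming(7,4) code. This code is an assumption chosen for brevity, not the code any real PM device uses, and the helper names (`syndrome`, `encode`, `decode`) are hypothetical. It corrects a single flipped bit but silently returns wrong data once two bits flip in the same codeword.

```python
# Toy Hamming(7,4): positions are numbered 1..7, parity bits sit at 1, 2, 4.
DATA_POSITIONS = (3, 5, 6, 7)
PARITY_POSITIONS = (1, 2, 4)


def syndrome(code):
    """XOR of the (1-based) positions that hold a 1; zero for a valid codeword."""
    s = 0
    for pos, bit in enumerate(code, start=1):
        if bit:
            s ^= pos
    return s


def encode(data):
    """Place 4 data bits, then set parity bits so the syndrome becomes zero."""
    code = [0] * 7
    for pos, bit in zip(DATA_POSITIONS, data):
        code[pos - 1] = bit
    s = syndrome(code)
    for p in PARITY_POSITIONS:
        if s & p:
            code[p - 1] = 1
    return code


def decode(code):
    """Flip the position named by the syndrome (if any) and return the data bits."""
    fixed = list(code)  # do not mutate the caller's buffer
    s = syndrome(fixed)
    if s:
        fixed[s - 1] ^= 1
    return [fixed[pos - 1] for pos in DATA_POSITIONS]


data = [1, 0, 1, 1]
code = encode(data)

one_flip = list(code)
one_flip[4] ^= 1                 # one bit flip: corrected
single_ok = decode(one_flip) == data

two_flips = list(code)
two_flips[2] ^= 1                # two bit flips: "corrected" into wrong data
two_flips[5] ^= 1
double_ok = decode(two_flips) == data
```

In this sketch `single_ok` ends up true and `double_ok` false, which is why the systems below layer checksums, replication, or parity on top of device ECC.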

existing works

System

  • libpmemobj of Intel PMDK
    • replication across different PM devices (pm pools)
    • more details
  • NOVA-Fortis (SOSP ‘17) from NVSL
    • CRC32 checksums to detect errors
    • replicates all metadata
    • RAID-4 style parity
    • replicated checksums of data
    • more details
  • Pangolin (ATC ‘19) from NVSL
    • 1% XOR parities for 99% objects (with checksums)
    • in-place update data with replicated redo logging in PM
      • So Cocytus (FAST ‘16)…
    • Adler32 for incremental checksums
    • builds a library like libpmemobj on Optane PM
    • more details
  • Vilamb (arXiv ‘20) from Rajat Kateja, Andy Pavlo. (also named ANON I guess)
    • Pangolin updates parities synchronously -> expensive -> how to loosen the guarantee?
    • two background threads work asynchronously: one checks parities and one updates them. pros:
      1. checksums are in page granularity -> read amplification. async process can merge several ops to save BW.
      2. utilize wasted “dirty” bits in the page table
    • rich experiments but on emulated NVM
    • more details
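
The building blocks these systems share (checksums to detect corruption, XOR parity to repair it, and incremental updates to keep both cheap) can be sketched in a few lines. This is only an illustration under my own assumptions, not code from NOVA-Fortis, Pangolin, or Vilamb. All function names are hypothetical, and the stripe layout is simplified to equal-sized byte blocks.

```python
import zlib
from functools import reduce

MOD_ADLER = 65521  # modulus from the Adler-32 definition


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def build_parity(blocks):
    """RAID-4 style parity: XOR of all data blocks in a stripe."""
    return reduce(xor_bytes, blocks)


def find_corrupted(blocks, checksums):
    """Indices of blocks whose CRC32 no longer matches the stored checksum."""
    return [i for i, b in enumerate(blocks) if zlib.crc32(b) != checksums[i]]


def recover(blocks, parity, bad):
    """Rebuild one bad block by XOR-ing the parity with the surviving blocks."""
    survivors = [b for i, b in enumerate(blocks) if i != bad]
    return reduce(xor_bytes, survivors, parity)


def update_parity(parity, old, new):
    """Delta update: fold (old XOR new) into the parity without reading other blocks."""
    return xor_bytes(parity, xor_bytes(old, new))


def adler32_patch(checksum, length, index, old_byte, new_byte):
    """Adjust an Adler-32 value after one byte at `index` changes,
    without rescanning the whole buffer of `length` bytes."""
    a = checksum & 0xFFFF
    b = (checksum >> 16) & 0xFFFF
    delta = new_byte - old_byte
    a = (a + delta) % MOD_ADLER
    b = (b + (length - index) * delta) % MOD_ADLER  # byte i is counted (length - i) times in b
    return (b << 16) | a


stripe = [b"alpha---", b"beta----", b"gamma---"]
sums = [zlib.crc32(b) for b in stripe]
parity = build_parity(stripe)

damaged = list(stripe)
damaged[1] = b"beXa----"                      # simulate a scribble or media error
bad = find_corrupted(damaged, sums)           # checksum pinpoints the bad block
repaired = recover(damaged, parity, bad[0])   # parity rebuilds it
```

The delta forms (`update_parity`, `adler32_patch`) are what make in-place updates affordable: only the changed bytes are touched, rather than the whole stripe or object. Whether that work runs synchronously on the write path or is batched in the background is the trade-off Vilamb explores.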

Architecture

  • TVARAK (ISCA ‘20) from Rajat Kateja
    • calculating parities like Pangolin is too slow (may lead to 50% drops)
    • add a new HW controller beside LLC to offload computation (maintain parities)
    • simulation on zsim
  • Polymorphic Compressed Replication (SYSTOR ‘20)
    • for columnar storage models on hybrid memory
    • use compression to reduce writes to NVM as replications
  • ECP (ISCA ‘10)
    • Error-Correcting Pointers (ECP) to remap locations instead of ECC, for the ECC blocks wearing out problem
    • many other works take this approach too, like zombie memory, chipkill, etc.
  • WoLFRaM (ICCD ‘20)
    • wear-leveling + fault tolerance with a programmable address decoder (PRAD)
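
As a rough illustration of the ECP idea, the sketch below models one memory row with stuck-at cells and a small table of (pointer, replacement bit) entries. It is a software toy under my own assumptions, not the hardware design from the paper. `EcpRow` and its fields are hypothetical names, and the row width and entry count are arbitrary.

```python
class EcpRow:
    """One memory row with worn-out (stuck-at) cells and a few correction entries."""

    def __init__(self, width, max_entries, stuck=None):
        self.cells = [0] * width
        self.stuck = dict(stuck or {})  # position -> value the failed cell is stuck at
        self.entries = {}               # pointer (failed position) -> replacement bit
        self.max_entries = max_entries

    def _raw_write(self, bits):
        # A stuck cell ignores the written value and keeps its stuck value.
        for i, bit in enumerate(bits):
            self.cells[i] = self.stuck.get(i, bit)

    def write(self, bits):
        self._raw_write(bits)
        # Verify after write: point a correction entry at every cell that
        # failed to take the new value, and refresh entries that already exist.
        for i, bit in enumerate(bits):
            if self.cells[i] != bit or i in self.entries:
                if i not in self.entries and len(self.entries) >= self.max_entries:
                    raise RuntimeError("row exhausted its correction entries")
                self.entries[i] = bit

    def read(self):
        # Raw cells, patched with the replacement bits the entries point at.
        bits = list(self.cells)
        for pos, replacement in self.entries.items():
            bits[pos] = replacement
        return bits


row = EcpRow(width=8, max_entries=2, stuck={2: 1, 5: 0})
payload = [0, 0, 0, 0, 0, 1, 0, 0]
row.write(payload)
readback = row.read()
```

Here `readback` matches `payload` even though two cells are stuck at the wrong value. A third failed cell in the same row would exceed `max_entries`, which is the point where a real design has to retire or remap the row.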

questions left (or opportunities)

todo

  • LB + fault tolerance
  • fault domains level
  • real error pattern of persistent memory
  • not that ec style
  • not that optane style
