<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://www.grayxu.cn/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.grayxu.cn/" rel="alternate" type="text/html" /><updated>2026-03-17T19:56:29+08:00</updated><id>https://www.grayxu.cn/feed.xml</id><title type="html">Gray&apos;s grind</title><subtitle>gray&apos;s blog</subtitle><author><name>Gray</name></author><entry><title type="html">My 2025 AI Coding Usage Review</title><link href="https://www.grayxu.cn/2026/01/03/AI-Coding/" rel="alternate" type="text/html" title="My 2025 AI Coding Usage Review" /><published>2026-01-03T00:00:00+08:00</published><updated>2026-01-03T00:00:00+08:00</updated><id>https://www.grayxu.cn/2026/01/03/AI-Coding</id><content type="html" xml:base="https://www.grayxu.cn/2026/01/03/AI-Coding/"><![CDATA[<p>AI coding saw a huge jump in capability over 2025. As model capabilities, agent tools, and IDEs evolved up and down the stack, my usage strategy kept adjusting along with them, so this post records how my toolbox changed this year. The perspective is not professional at all; it is purely a user's experience (basically a usage review).</p>

<p>inspired by:</p>
<ul>
  <li>@xuanwo's vibe coding series: <a href="https://xuanwo.io/posts/2025-12-09-vibe-coding/">https://xuanwo.io/posts/2025-12-09-vibe-coding/</a></li>
  <li><a href="https://simonwillison.net/2025/Dec/31/the-year-in-llms/">2025: The year in LLMs</a></li>
  <li><a href="https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills">Anthropic Equipping agents for the real world with Agent Skills</a></li>
  <li><a href="https://www.anthropic.com/research/how-ai-is-transforming-work-at-anthropic?referrer=grok.com">How AI is transforming work at Anthropic</a></li>
</ul>

<p>The writing here is fairly free-form; I wrote things down as they came to mind.
(An LLM only adjusted the wording, and it raised some questions based on the content, which I wrote up as the observations section.)</p>

<h1 id="目录">目录</h1>
<ul>
  <li><a href="#目录">目录</a></li>
  <li><a href="#2025年使用的不同阶段">2025年使用的不同阶段</a>
    <ul>
      <li><a href="#第一阶段">第一阶段</a></li>
      <li><a href="#第二阶段">第二阶段</a></li>
      <li><a href="#第三阶段">第三阶段</a></li>
    </ul>
  </li>
  <li><a href="#一些随机观察">一些随机观察</a>
    <ul>
      <li><a href="#护城河">护城河</a></li>
      <li><a href="#attention-is-all-you-need">attention is all you need</a></li>
      <li><a href="#mcp没用">MCP没用？</a></li>
      <li><a href="#能力边界">能力边界</a></li>
    </ul>
  </li>
  <li><a href="#end">End</a></li>
</ul>

<h1 id="2025年使用的不同阶段">2025年使用的不同阶段</h1>

<h2 id="第一阶段">第一阶段</h2>

<p><strong>Keyword: Cursor</strong></p>

<p>Years ago I was in the first batch to get github copilot. It felt amazing at the time; I posted on WeChat Moments that someone could now pair-program with me around the clock. In reality, though, the copilot of that era was a failing half-finished product: when completing a function call it would get even the number of arguments wrong, and in a sense it was more like noise that broke my flow. After chatgpt blew up, a pile of AI coding vscode extensions emerged, like codeium/continue. Then Cursor got hot in the second half of 2024, and after trying it I simply rm'ed vscode.</p>

<p>For basically the first half of 2025, I still relied on Cursor for my AI coding editor needs. Cursor's advantage is very clear: it offers an <strong>interaction-friendly, all-in-one</strong> package, and almost everything carries over when you migrate from vscode. For me the most critical point is interaction; the experience of reviewing diffs, inline chat, and cursor prediction is all excellent.<br />
For example, when I ask it to write a chunk of code, I can see exactly what it did through inline word-level diffs, whereas the vscode+extension approach could only pull up a hard-to-read two-column comparison. At that point in time, competitors lagged far behind on these experience-side features. Also, Cursor's chat features can be hooked up to your own API to dodge the quota limits, <del>and on top of that there were unlimited refills</del>. <br />
For complex design problems, though, I still mostly relied on external chatbots to discuss and shape a plan; with a self-hosted service like openwebui, I could have several top-tier models discuss a problem in parallel.</p>

<p>At this point, the agent experience felt more like a fancy toy to me. I cannot say whether it was limited by the models' agentic ability or by the product itself, but most of the time it failed to grab the right context, talked past me, and even routinely drove itself into infinite loops. A tool with output quality that unstable is really hard to place into a workflow. So vibe coding still felt somewhat far away for me at that time.</p>

<p>So stage one roughly looked like this: my workflow was still an IDE centered on my own editing experience, assisted by a tightly integrated, easy-to-interact-with inline chatbot. For more complex tasks I went straight to an external chatbot and provided the context by hand.</p>

<h2 id="第二阶段">第二阶段</h2>

<p><strong>Keyword: CLI Agent</strong></p>

<p>The game changer at this stage was CLI products like claude code.<br />
cc was the first agent I used that could take offloaded, relatively complex or long-running tasks and finish them all-in-one. But I quickly found the drawback (<del>or rather, my own shortcoming</del>): token burn was too fast, stable third-party access to anthropic tokens has always been pricey, and there was the account-ban problem on top. So after quickly realizing I could not afford the tokens, I fell back to the Cursor+chatbot combo.</p>
<blockquote>
  <p>Come to think of it, isn't cc's subscription still pricier than competitors' even now, once converted into equivalent tokens?</p>
</blockquote>

<p>Many similar IDE/CLI tools appeared around the same time, but none felt like a qualitative leap over Cursor+chatbot, so I lacked the motivation to migrate; after all, migrating means another adaptation period. That was until gemini-cli was released.</p>

<p>Although gemini-cli had plenty of problems early on, its iteration speed was genuinely fast (probably helped by being open source). <del>Also, thanks to the mysterious $300 credit that refreshes every three months</del>, I installed gemini-cli on every machine/VM I had. That is when I discovered the beauty of CLI tools: setting aside how hard a task is, you really can offload the entire flow. Just open a terminal, ssh in, launch it inside tmux, and hand out work asynchronously, with nothing extra needed. Deployment/migration chores in particular could be offloaded entirely.</p>

<p>Around the same time, because of an internship, compliance red lines kept me from using the various closed-source model options, so I fiddled with a lot of <em>alternatives</em>, but they all fell short; there was a clear generational gap.<br />
On the IDE/editor side, options like Lxxxx/Axxx were not only stuck with weaker base models, they could not even do C/C++ IntelliSense properly. The problems of extension-based approaches were described above.<br />
On the CLI side, tools like qwen-code had quite a few issues when they first shipped. I also tried running cc through a router layer, but in my own tests the experience still clearly trailed using its native models.<br />
Of course, this was really just a matter of time; the gap has surely narrowed a lot by now, since capabilities have leveled up considerably across the board.</p>

<p>In the same period there were various higher-level agent offerings that work by spinning up VMs/ack to do the job; for some simple long-chain chores the experience was decent. For example, I used the aone agent hooked up to k2 for a lot of code-reading work, and manus for a lot of crawler work. But for me, the problem with these <em>general-purpose</em> options is that the vast majority of my tasks depend heavily on specific software and hardware environments; a virtualized generic environment cannot close the loop and sometimes cannot even compile, which hurts their practicality quite a bit.</p>

<p>The most exciting thing about stage two, for me, was that CLI tools qualitatively changed the workflow.<br />
Still, with gemini-cli I felt that for more complex coding tasks, even with a clear plan, an implementation path, and reference code, and even in a not-so-large single repo, gemini-cli struggled to implement them well, let alone one-shot them.</p>

<h2 id="第三阶段">第三阶段</h2>

<p><strong>Keyword: Codex</strong></p>

<p>When I noticed the community all discussing "Codex is awesome" and "having sailed past a thousand ships, I still return to VSCode", I felt like I was falling behind. On top of that, I had found some annoyances with Cursor:</p>
<ol>
  <li>Price. The cheap small-tier subscription is heavily rate-limited, and custom APIs don't support many of the new features.</li>
  <li>Stability. Support for some new extensions is flaky (e.g., codex diffs couldn't be viewed).</li>
  <li>Performance. Both Cursor and gemini-cli are a bit heavy and run into all sorts of problems on certain small VMs.</li>
</ol>

<p>So I copied the community's homework and started using vscode+copilot+codex plus other people's agents.md.<br />
I quickly found that Copilot had actually patched up most of the interaction-experience gaps; pretty much everything you'd expect is there. Capability-wise it was still about the same (using Opus), and the subscription quota isn't generous either.</p>

<p>Codex, on the other hand, gave me plenty of pleasant surprises:</p>
<ol>
  <li>Good subscription value for money</li>
  <li>Rust really is nice; it is so, so lightweight</li>
  <li>Codex's demonstrated task-solving ability is clearly better than gemini-cli's</li>
</ol>

<p>I haven't used cc in depth, so I don't have much standing to comment. But Codex is truly far stronger than the gemini-cli of the same period: complex tasks hashed out in plan mode have a very high chance of being solved one-shot. I started assigning Codex all kinds of tasks, and the completion quality often exceeded expectations, e.g., debugging with gdb on its own to locate a problem and then closing the loop with self-tests, or grabbing metrics with perf to do profiling by itself.<br />
On small tasks, or tasks whose design is clear and stable (i.e., won't drift later), Codex performs remarkably well. Limited by human bandwidth, there are always things I want to write but never get to; Codex is now a crazy accelerator for exactly that kind of task.</p>

<p>By raw code volume in my current development, Codex indeed produces the vast majority of it, far better and far faster than I would. But measured against the ultimate goal of 100% vibe coding, my use cases still need plenty of extra intervention. This is mainly on prototype systems with extreme performance requirements: Codex is too defensive, and it won't take many hacky implementation moves unless you insist; likewise in design, e.g., where the requirements would allow a lock-free approach, it won't squeeze everything out. (More on this in <a href="#能力边界">Capability Boundaries</a>.)</p>
<blockquote>
  <p>So agents.md still matters, especially if you frequently switch between different kinds of projects; it is how you align the context around requirements. (<del>Say, writing assembly one moment and frontend the next.</del>)</p>
</blockquote>

<p>On some other products:</p>
<ul>
  <li>I tried some tasks with GitHub's Copilot agent. It is indeed fairly intuitive to use; you can even have it open a PR from your phone. It's just a bit too slow, and its intelligence is honestly still quite limited.</li>
  <li>I've seen many worktree-style products/features for managing multiple agents. My need for them is weak; most of the time I still prefer to work linearly and try to teach the agent to do things right. That said, plenty of tasks fork more and more as you write, which also challenges bandwidth; I'll give these a try later!</li>
</ul>

<p>To sum up this stage: a soft ad for Codex.</p>

<h1 id="一些随机观察">一些随机观察</h1>

<h2 id="护城河">护城河</h2>

<p>What is the moat? That feels like a question that is hard for a user to answer well; whatever you say sounds like a hot take...<br />
Everyone says scaling laws slowed down this year and base-model gains are coming slower, but my gut feeling is that the base model still matters a lot. After migrating to Codex, I could feel some improvement with every new release. So I think it's not just about agent-capability leaderboards; there are probably many key-path metrics that need to be quantified and optimized. I buy the view that "application capability and model capability are tightly coupled" (<del>so go buy xxx calls</del>).</p>

<h2 id="attention-is-all-you-need">attention is all you need</h2>

<p>Human attention matters. The final product still needs review (<del>though from some developers' perspective, maybe source code no longer has to be reviewed</del>), so human bandwidth remains the bottleneck.<br />
Cursor's earliest delight for me was that review pressure was tiny: tab, tab, and tab again. One delight of the CLIs is that they embed seamlessly into the existing workflow.<br />
Everyone naturally prefers the option that gets the job done right, and only then worries about speed and price. Top-capability models and agents have maxed-out appeal there, and they sidestep the migration-cost question.<br />
I feel my current workflow still carries a lot of interaction/review overhead; maybe a better solution is still out there.</p>

<h2 id="mcp没用">MCP没用？</h2>

<p>Lots of people say MCP is useless. Looked at from another angle, the real question is whether your workflow was already complete inside the plain CLI shell; if so, then fine.<br />
But the CLI obviously does not hold all the world's information. Plenty of things out there expose no interface and forbid crawling, so some form of interaction still has to be provided. Especially since, for me, the coding agent is no longer just for development: it has to some degree become a CLI entry point and a de facto assistant, so I need it to have a wider input/output range, and MCP is at least a workable solution for now.</p>

<h2 id="能力边界">能力边界</h2>

<p>In more than one interview during autumn recruiting, someone brought up the capability boundaries of AI coding. Understanding and pinning down those boundaries is clearly important. Today's coding agents are obviously still far from the ideal state, so effort is still needed to close the last mile (or the first mile).</p>
<ol>
  <li>My current feel is that a coding agent is a senior developer with no context. Whenever the right context is missing or the plan is incomplete, output quality can spin out of control, and it will very confidently march into a mess.</li>
  <li>Its command of large projects is limited; unless you talk the plan through sentence by sentence, it keeps failing to fish out the right context.</li>
  <li>Some problems cannot be avoided even after discussing the plan with it. The task list looks fine to you, but there are always variables you cannot control or anticipate, and it does not know the priors in your head, so at some crossroads it picks an option misaligned with yours, and things go wrong from there. That is why I don't much like vibing out a skeleton first and optimizing later; reading such code makes my head hurt, and I would rather decompose the work and do it properly.</li>
  <li>Going a step further, you will be wrong sometimes too. LLMs still don't withstand challenges well; they are too compliant. It would be great if they pushed back on your design more.</li>
  <li>Finally, as mentioned earlier, LLMs still trail humans a lot at high-level design, let alone contributing ideas; most of the time they can produce a design that is correct but far from good (<del>then again, much of the time you don't need a good design anyway</del>).</li>
</ol>

<h1 id="end">End</h1>

<p>To wrap up: although the track is crowded with players and very red-ocean, it doesn't feel like things have converged yet. I'm looking forward to new changes next year that give us users a bit more acceleration.</p>]]></content><author><name>Gray</name></author><category term="EC" /><summary type="html"><![CDATA[AI Coding, Vibe Coding]]></summary></entry><entry><title type="html">Accelerating Erasure Coding</title><link href="https://www.grayxu.cn/2025/03/06/EC-Lib/" rel="alternate" type="text/html" title="Accelerating Erasure Coding" /><published>2025-03-06T00:00:00+08:00</published><updated>2025-03-06T00:00:00+08:00</updated><id>https://www.grayxu.cn/2025/03/06/EC-Lib</id><content type="html" xml:base="https://www.grayxu.cn/2025/03/06/EC-Lib/"><![CDATA[<p>A survey of various existing acceleration techniques for erasure coding, including recent academic work and popular open-source libraries.</p>

<h1 id="forewords">Forewords</h1>

<p>Erasure coding (EC) strategies essentially provide system-level fault tolerance by <strong>encoding</strong> k data blocks into m redundant blocks. k can be much larger than m, unlike replicas, where m is an integer multiple of k. The obvious advantage of EC is reduced storage space, but the more complex data organization introduces additional overhead, such as frequent encoding and decoding operations. Unlike XOR, EC encoding and decoding cannot achieve line speed; therefore, numerous EC acceleration libraries have been developed. Open-source and widely used libraries include Intel's ISA-L, the Jerasure series, and klauspost/reedsolomon in Go. Many companies also have their own proprietary libraries. Moreover, extensive academic research focuses on EC acceleration.</p>

<p>This blog primarily focuses on the end-to-end efficiency of Reed-Solomon (RS) codes, a specific type of systematic code, rather than the mathematical encoding problem itself.</p>

<blockquote>
  <p>The content was adapted by gemini 2.0 pro from my original notes. If the comments are too harsh, it's not me who wrote them.</p>
</blockquote>

<p>Some related background:</p>

<ul>
  <li>If you are completely unfamiliar with erasure coding, please check drxp's EC blog series for a 101 introduction:
    <ul>
      <li>Principles: <a href="https://blog.openacid.com/storage/ec-1/">https://blog.openacid.com/storage/ec-1/</a></li>
      <li>Implementation: <a href="https://blog.openacid.com/storage/ec-2/">https://blog.openacid.com/storage/ec-2/</a></li>
      <li>Optimization: <a href="https://blog.openacid.com/storage/ec-3/">https://blog.openacid.com/storage/ec-3/</a></li>
    </ul>
  </li>
  <li>Additional background: High-performance erasure codes - kkblog: <a href="https://abcdxyzk.github.io/blog/2018/04/12/isal-erase-1/">https://abcdxyzk.github.io/blog/2018/04/12/isal-erase-1/</a>
    <ul>
      <li>Parallel lookup table strategy for GF calculations</li>
      <li>Matrix operation partitioning to improve locality</li>
      <li>Cauchy encoding matrix: <a href="https://abcdxyzk.github.io/blog/2018/04/16/isal-erase-2/">https://abcdxyzk.github.io/blog/2018/04/16/isal-erase-2/</a></li>
    </ul>
  </li>
</ul>

<p>We will first divide the discussion into two categories:</p>

<ol>
  <li>
    <p><strong>XOR-based</strong>: Because finite field calculations are not instructions directly executable by the CPU, they are converted into multiple XOR operations.</p>
  </li>
  <li>
    <p><strong>Lookup Table</strong>: If the finite field is fixed (e.g., GF(2^8)), the multiplication results can be stored in a fixed table, and table lookups are used instead of calculations.</p>
  </li>
</ol>

<h1 id="xor-based">XOR-based</h1>

<h2 id="dsn-09-tc-13-efficient-encoding-schedules-for-xor-based-erasure-codes">DSN '09 TC '13 Efficient Encoding Schedules for XOR-Based Erasure Codes</h2>

<p>Jianqiang Luo, Mochan Shrestha, Lihao Xu, and <strong>James S. Plank</strong></p>

<p><img src="https://www.grayxu.cn/images/2025/03/06/2025-03-06-16-30-34.png" alt="Pasted image 20250305193332.png" /></p>

<p><img src="https://www.grayxu.cn/images/2025/03/06/2025-03-06-16-30-48.png" alt="Pasted image 20250305183609.png" /></p>

<p>As shown in the figures above, unlike the simple logic of finite field calculations, the computational logic of XOR codes can be understood as splitting each data block into sub-blocks (corresponding to packets). Each sub-parity (sub-p) block requires the XORing of multiple sub-data (sub-d) blocks across rows. Different XOR-code matrices represent different combinational logic. Therefore, much work focuses on proposing more efficient matrices with fewer XOR operations, resulting in less computation and faster speed. This paper focuses on the fact that, while fewer computational operations are important, the caching efficiency of sub-d blocks also plays a significant role, since sub-d blocks are repeatedly accessed during computation.</p>

<p>Therefore, this paper proposes several different scheduling strategies:</p>

<ul>
  <li><em>DPG (Data Packets Guided)</em>: Performs calculations in the order of packets, processing all calculations related to one packet before moving on to the next.</li>
  <li><em>DWG (Data Words Guided)</em>: Similar to DPG, but iterates by data word.</li>
</ul>

<p>This improves the locality of sub-d blocks. And because sub-p blocks are only written, never repeatedly read, their accesses are pure stores that add no further cache misses. Although this paper targets pure XOR codes like EVENODD and RDP, rather than RS codes, the idea is still quite good.</p>
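
<p>Here is a minimal sketch of the scheduling idea (my illustration, not the paper's code): encoding an XOR code amounts to XORing, for each sub-parity, the sub-data packets its schedule lists. A DPG-style loop walks data packets in the outer loop, so each packet is read once while it is cache-hot; a parity-centric loop would instead re-read every data packet once per parity.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* sched[p * ndata + d] != 0 means sub-parity p needs sub-data d */
void encode_dpg(uint8_t **parity, uint8_t **data, const int *sched,
                int nparity, int ndata, size_t pkt_size)
{
    for (int d = 0; d &lt; ndata; d++)          /* data-packets-guided order */
        for (int p = 0; p &lt; nparity; p++)
            if (sched[p * ndata + d])
                for (size_t i = 0; i &lt; pkt_size; i++)
                    parity[p][i] ^= data[d][i];   /* parities are write-only */
}
</code></pre></div></div>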

<h2 id="fast-19-fast-erasure-coding-for-data-storage-a-comprehensive-study-of-the-acceleration-techniques">FAST '19 Fast Erasure Coding for Data Storage: A Comprehensive Study of the Acceleration Techniques</h2>

<p>Work by Tianli Zhou and Chao Tian, TAMU.</p>

<p>Open source: <a href="https://github.com/zhoutl1106/zerasure">https://github.com/zhoutl1106/zerasure</a> This is further work based on Jerasure.</p>

<p>This work is very comprehensive and reading the original paper is recommended. It can be considered a small survey. Zerasure combines and optimizes several existing erasure coding acceleration techniques, including <em>coding matrix design, computational scheduling optimization, general XOR operation reduction, cache management, and vectorization</em>. It then proposes building a cost function based on the number of XOR and memcpy operations. Simulated annealing is used to choose among multiple mutually exclusive strategies, while non-mutually exclusive optimizations are directly overlapped.</p>

<ul>
  <li>Adjusting the selection still involves various matching scheduling techniques, as they are mutually exclusive.</li>
</ul>
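
<p>A toy sketch of the selection step (hypothetical names and cost model, not Zerasure's actual code): score each mutually exclusive strategy by its XOR and memcpy counts, treated as equal cost per the paper's viewpoint, and anneal over the candidates.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;math.h&gt;
#include &lt;stdlib.h&gt;

#define NUM_STRATEGIES 4

/* assumed: returns (#XORs + #memcpys) of one candidate strategy */
extern double cost_of(int strategy);

int anneal_pick(void)
{
    int cur = 0, best = 0;
    double t = 1.0, c_cur = cost_of(0), c_best = c_cur;
    for (int step = 0; step &lt; 10000; step++, t *= 0.999) {
        int cand = rand() % NUM_STRATEGIES;       /* random neighbor */
        double c = cost_of(cand);
        /* Metropolis rule: take improvements, sometimes worse moves */
        if (c &lt; c_cur || exp((c_cur - c) / t) &gt; (double)rand() / RAND_MAX) {
            cur = cand;
            c_cur = c;
        }
        if (c_cur &lt; c_best) { best = cur; c_best = c_cur; }
    }
    return best;
}
</code></pre></div></div>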

<p><img src="https://www.grayxu.cn/images/2025/03/06/2025-03-06-16-31-28.png" alt="Aspose.Words.8c4c77a8-626b-4a47-85dc-22a66dae0175.053.png" /></p>

<p>It is worth noting that this paper presents several viewpoints:</p>

<ul>
  <li>It considers the costs of memcpy and XOR to be equivalent.</li>
  <li>XOR vectorization is faster than direct GF calculation vectorization (e.g., ISA-L).</li>
  <li>Cache-related S-CO strategies do not offer significant improvements.
    <blockquote>
      <p>note: This may be because Zerasure itself is not very fast, and the access patterns of XOR codes are inherently not cache-friendly.</p>
    </blockquote>
  </li>
</ul>

<p>Performance improvements:</p>

<p><img src="https://www.grayxu.cn/images/2025/03/06/2025-03-06-16-31-51.png" alt="Pasted image 20240507234417.png" /></p>

<h2 id="sc-21-accelerating-xor-based-erasure-coding-using-program-optimization-techniques">SC '21 Accelerating XOR-based erasure coding using program optimization techniques</h2>

<p>Yuya Uezato, Dwango, Co., Ltd (Parent company of NICONICO, amazing!)</p>

<p>Open source: <a href="https://github.com/yuezato/xorslp_ec">https://github.com/yuezato/xorslp_ec</a></p>

<p>However, it is only a Rust proof-of-concept; the authors say a proper library will follow, but nothing has appeared yet. The tools used are not very systems-oriented, but the work itself is quite interesting.</p>

<p>As can be seen earlier, the process of XOR-based EC is actually logically quite simple, somewhat similar to CUDA kernels, but limited by computational resources, memory access resources, etc. So, this work directly abstracts it into SLPs (a concept from the PL field, Straight-Line Programs), and then uses various SLP optimization strategies to optimize the SLP (automated PL strategies):</p>

<ol>
  <li>Compressing: Using grammar compression algorithms to reduce the number of XORs (a toy example is sketched below).</li>
  <li>Fusing: Using the functional program optimization method deforestation to reduce memory access.
    <ol>
      <li>This reduces memory access for intermediate variables, but it seems many are done manually.</li>
    </ol>
  </li>
  <li>Using the (red-blue) pebble game from program analysis to reduce cache misses.
    <ol>
      <li>A formal objective is created, and then heuristically optimized, performing XOR rearrangement.</li>
    </ol>
  </li>
</ol>

<p>[<em>It seems that there will still be conflicts among these multiple strategies. How to make trade-offs?</em>]</p>

<p>This strategy is different from ISA-L, which directly accelerates finite field calculations using lookup tables, without converting to XOR.</p>
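
<p>To make the XOR-reduction idea concrete, here is a toy version of what grammar compression does to a straight-line program (my illustration, not the paper's algorithm): the shared sub-expression is computed once and reused, cutting four XORs per position down to three.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

/* naive SLP: p1 = d1 ^ d2 ^ d3; p2 = d2 ^ d3 ^ d4 (4 XORs per byte) */
void encode_naive(uint8_t *p1, uint8_t *p2, const uint8_t *d1,
                  const uint8_t *d2, const uint8_t *d3,
                  const uint8_t *d4, int n)
{
    for (int i = 0; i &lt; n; i++) {
        p1[i] = d1[i] ^ d2[i] ^ d3[i];
        p2[i] = d2[i] ^ d3[i] ^ d4[i];
    }
}

/* compressed SLP: t = d2 ^ d3 is named and reused (3 XORs per byte) */
void encode_compressed(uint8_t *p1, uint8_t *p2, const uint8_t *d1,
                       const uint8_t *d2, const uint8_t *d3,
                       const uint8_t *d4, int n)
{
    for (int i = 0; i &lt; n; i++) {
        uint8_t t = d2[i] ^ d3[i];   /* shared sub-expression */
        p1[i] = d1[i] ^ t;
        p2[i] = t ^ d4[i];
    }
}
</code></pre></div></div>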

<p>Results:</p>

<ul>
  <li>The optimized EC library outperforms ISA-L in RS(10,4) encoding, achieving a throughput of 8.92 GB/s.</li>
  <li>(Xor)RePair reduces XOR operations by approximately 60% on average.</li>
  <li>The combination of XOR fusion and (Xor)RePair reduces memory access by approximately 76% on average.</li>
  <li>The fusing step provides the largest improvement.</li>
</ul>

<p><img src="https://www.grayxu.cn/images/2025/03/06/2025-03-06-16-32-14.png" alt="Pasted image 20240507234459.png" /></p>

<h2 id="iccd-23-tcad-24-cerasure-fast-acceleration-strategies-for-xor-based-erasure-codes">ICCD '23 TCAD '24 Cerasure: Fast Acceleration Strategies For XOR-Based Erasure Codes</h2>

<p>Tianyang Niu, Min Lyu, Wei Wang, Qiliang Li, Yinlong Xu, ADSL Lab, USTC</p>

<p>Open Source: <a href="https://github.com/ADSL-EC/Cerasure">https://github.com/ADSL-EC/Cerasure</a></p>

<p>Challenges:</p>

<ol>
  <li>The number of 1s in the bit matrices found by existing heuristic algorithms can obviously be further reduced.</li>
  <li>Creating pointers for reading/writing data leads to high encoding latency.</li>
  <li>The trade-off between computational efficiency and spatial locality can be further improved by selecting the packet size.</li>
  <li>Wide-stripe encoding (stripes containing many data/parity blocks) leads to low cache hit rates for commonly used packet sizes.</li>
</ol>

<p>Corresponding designs:</p>

<ol>
  <li>V-search: Searches Vandermonde and Cauchy matrices and greedily reduces the number of 1s in the matrix to find a near-optimal encoding matrix.
    <ol>
      <li>The trans version adds an opt-search: iteratively replaces matrix elements in descending order of the number of 1s in the bit matrix, and excludes those that increase the number of 1s or destroy the MDS property.</li>
    </ol>
  </li>
  <li>Uses offset reuse to accelerate the construction of read and write pointers (this engineering trick seems to be due to the strong coupling with ISA-L's interface).</li>
  <li>Finds a trade-off in packet size selection (computational efficiency and cache), using the L1 cache size to calculate an optimal solution.</li>
  <li>Decompose: For wide stripes, the number of data blocks is larger, putting significant pressure on the cache. Therefore, the calculation is separated into multiple sub-encoding tasks, which are merged at the end.
    <ol>
      <li>The trans version adds smart decompose, which greedily combines subtasks during decomposition to increase similarity, so that the previous scheduling is more effective.</li>
    </ol>
  </li>
</ol>

<p>Experiments were compared with Zerasure and SLPEC, but not with their implemented baseline, ISA-L. It can be more than twice as fast as Zerasure, but Zerasure cannot outperform the default ISA-L.</p>

<p><img src="https://www.grayxu.cn/images/2025/03/06/2025-03-06-16-32-22.png" alt="Pasted image 20240507234536.png" /></p>

<h2 id="hotstorage24-rethinking-erasure-coding-libraries-in-the-age-of-optimized-machine-learning">HotStorage'24 Rethinking Erasure-Coding Libraries in the Age of Optimized Machine Learning</h2>

<p>Jiyu Hu, Jack Kosaian, <strong>K. V. Rashmi</strong>, CMU</p>

<p>As mentioned earlier, some have used SLP to automatically optimize XOR organization and scheduling. This paper is even more interesting, directly using TVM to optimize EC computation scheduling. The difference between EC matrix calculations and NN matrix calculations is that the internal subunits are performing bitmatrix XOR.</p>

<p><img src="https://www.grayxu.cn/images/2025/03/06/2025-03-06-16-33-57.png" alt="{C6F8BE23-3838-47A6-863E-44E72348FFB5}.png" /></p>

<p>No internal modifications to TVM are needed; the API is called directly. However, TVM requires the data to be contiguous, and they assume the resulting memcpy overhead is common to all libraries anyway.</p>

<blockquote>
  <p>But it doesn't feel like it?</p>
</blockquote>

<p>This idea is very interesting, but it's quite engineering-heavy to implement. Although it uses TVM, it's not using the GPU. It's still compared with CPU libraries.</p>

<p><img src="https://www.grayxu.cn/images/2025/03/06/2025-03-06-16-34-04.png" alt="{50D31BF7-5956-4665-AD40-2D27999770AF}.png" /></p>

<p>Interestingly, it can be seen that the SC'21 work can no longer outperform ISA-L when r=4. In addition to the hardware issues mentioned in the paper, the main reason is that the increase in XOR operations is not linear for XOR codes with multiple parities. It can be seen that the advantage of TVM-EC increases with a larger number of parities. This may be because the increase in the number of operations provides more optimization space for TVM. Of course, this is an optimal calculation state reached after parameter learning, which requires a warm-up-like process.</p>

<p>Introducing this system to existing systems requires a C++ runtime, and the layout needs to be adjusted. In addition, the specific memory access and calculation process becomes opaque, which is actually quite heavy.</p>

<p>By the way, there are also some works that use GPUs for EC:</p>

<ul>
  <li>ICC'15 PErasure: A parallel Cauchy Reed-Solomon coding library for GPUs</li>
  <li>TPDS'18 G-CRS: GPU Accelerated Cauchy Reed-Solomon Coding
    <ul>
      <li>Some <em>memory access efficiency optimization and control flow optimization</em> methods are made for GPUs. Because CRS is gf(2), XOR can be used directly.</li>
    </ul>
  </li>
</ul>

<h1 id="lookup-table">Lookup Table</h1>

<h2 id="jerasure">Jerasure</h2>

<p><a href="http://jerasure.org/jerasure/gf-complete/">http://jerasure.org/jerasure/gf-complete/</a></p>

<p>Jerasure uses lookup tables for finite field calculations. It is a C library that incorporates work by James S. Plank, such as his FAST '13 paper on fast Galois Field arithmetic. The precomputed multiplication tables include (a minimal multiply built on these two tables is sketched after the list):</p>

<ul>
  <li>Log Table: Records the logarithmic value of each non-zero element in the field.</li>
  <li>Exp Table: Records the field element corresponding to each logarithmic value.</li>
</ul>
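
<p>A minimal multiply built on these two tables (a sketch of the classic technique, not Jerasure's exact code), over GF(2^8) with the common polynomial 0x11d and generator 2:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

static uint8_t gf_log[256], gf_exp[512];

static void gf_init(void)
{
    uint16_t x = 1;
    for (int i = 0; i &lt; 255; i++) {
        gf_exp[i] = (uint8_t)x;
        gf_log[x] = (uint8_t)i;
        x &lt;&lt;= 1;
        if (x &amp; 0x100)
            x ^= 0x11d;              /* reduce mod the field polynomial */
    }
    for (int i = 255; i &lt; 512; i++)  /* duplicate so no "% 255" is needed */
        gf_exp[i] = gf_exp[i - 255];
}

uint8_t gf_mul(uint8_t a, uint8_t b)
{
    if (a == 0 || b == 0)
        return 0;
    return gf_exp[gf_log[a] + gf_log[b]];   /* exp(log a + log b) */
}
</code></pre></div></div>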

<p>It supports not only GF(2^8) arithmetic but also finite fields from GF(2^4) up to GF(2^128). Its vectorization acceleration is SSE-only.</p>

<p>Different optimization strategies are employed for different values of <em>w</em>:</p>

<ul>
  <li>GF(2^4): Multiplication of 128-bit data is accomplished through two table lookups using the <code class="language-plaintext highlighter-rouge">_mm_shuffle_epi8</code> instruction.</li>
  <li>GF(2^8): The 8-bit number is split into two 4-bit numbers, each undergoing table lookup, leveraging the <code class="language-plaintext highlighter-rouge">_mm_shuffle_epi8</code> instruction (see the sketch after this list).</li>
  <li>GF(2^16): The 16-bit number is split into four 4-bit numbers, utilizing eight lookup tables and the <code class="language-plaintext highlighter-rouge">_mm_shuffle_epi8</code> instruction. To fully exploit SIMD parallelism, an "Altmap" memory mapping scheme is adopted, mapping a set of every 16 words into two 128-bit variables.</li>
  <li>GF(2^32): Similar to GF(2^16), the 32-bit number is split into eight 4-bit numbers, employing 32 lookup tables and the "Altmap" memory mapping.</li>
</ul>
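
<p>A sketch of the split-table trick for GF(2^8) (a simplified stand-in for what GF-Complete does, reusing the scalar <code class="language-plaintext highlighter-rouge">gf_mul</code> sketched above): two 16-entry tables cover the low and high nibble of each byte, and <code class="language-plaintext highlighter-rouge">_mm_shuffle_epi8</code> performs 16 lookups at once.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;tmmintrin.h&gt;   /* SSSE3: _mm_shuffle_epi8 */

extern uint8_t gf_mul(uint8_t a, uint8_t b);   /* e.g., the log/exp version */

void gf8_region_mul_ssse3(uint8_t *dst, const uint8_t *src,
                          size_t len, uint8_t c)
{
    uint8_t lo[16], hi[16];
    for (int i = 0; i &lt; 16; i++) {
        lo[i] = gf_mul(c, (uint8_t)i);          /* c * low nibble  */
        hi[i] = gf_mul(c, (uint8_t)(i &lt;&lt; 4));   /* c * high nibble */
    }
    __m128i tlo  = _mm_loadu_si128((const __m128i *)lo);
    __m128i thi  = _mm_loadu_si128((const __m128i *)hi);
    __m128i mask = _mm_set1_epi8(0x0f);
    for (size_t i = 0; i + 16 &lt;= len; i += 16) {
        __m128i x = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i l = _mm_and_si128(x, mask);
        __m128i h = _mm_and_si128(_mm_srli_epi64(x, 4), mask);
        /* product = table_lo[low nibble] ^ table_hi[high nibble] */
        __m128i p = _mm_xor_si128(_mm_shuffle_epi8(tlo, l),
                                  _mm_shuffle_epi8(thi, h));
        _mm_storeu_si128((__m128i *)(dst + i), p);
    }
}
</code></pre></div></div>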

<p>A critical problem with Jerasure is its memory access pattern. Here is a pseudo-code example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">jerasure_matrix_encode</span><span class="p">(</span><span class="kt">int</span> <span class="n">k</span><span class="p">,</span> <span class="kt">int</span> <span class="n">m</span><span class="p">,</span> <span class="kt">int</span> <span class="n">w</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">matrix</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">data</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">coding</span><span class="p">,</span> <span class="kt">int</span> <span class="n">size</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">m</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// Iterate through each parity block</span>
        <span class="k">for</span> <span class="p">(</span><span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="n">k</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// Iterate through each data block</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">matrix</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="n">k</span> <span class="o">+</span> <span class="n">j</span><span class="p">]</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// If the matrix coefficient is non-zero</span>
                <span class="n">galois_region_multiply</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="n">j</span><span class="p">],</span> <span class="n">matrix</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="n">k</span> <span class="o">+</span> <span class="n">j</span><span class="p">],</span> <span class="n">size</span><span class="p">,</span> <span class="n">coding</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="p">(</span><span class="n">j</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">));</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It can be seen that it is centered around parity blocks, which leads to poor locality of data blocks, resulting in poor cache efficiency.</p>

<h2 id="isa-l">ISA-L</h2>

<p><a href="https://github.com/intel/isa-l">https://github.com/intel/isa-l</a></p>

<p>Intel's ISA-L may also be for generality (adapting to various platforms and instruction sets). It doesn't have many additional complex optimizations, whether in matrix selection or encoding strategy. It also directly uses lookup tables.</p>

<p>ISA-L also uses split multiplication tables. A complete GF(2^8) multiplication table (256x256 single-byte entries) would be 64KB, which would put too much pressure on the cache. Therefore, the 8 bits are split into the high 4 bits and the low 4 bits, creating smaller multiplication tables. The calculation then involves two table lookups and one XOR to obtain the final result. The performance advantage of ISA-L comes from relatively simple factors, mainly:</p>

<ul>
  <li>Extensive assembly unrolling, efficient use of instructions.</li>
  <li>Good memory access locality. After reading all data blocks at the same position during encoding, all parity blocks are written at once (limited by the number of registers, they only support p&lt;=6 this way; otherwise it must be split, though this could be extended). See the sketch after this list.</li>
  <li>Newer instruction set acceleration selection (e.g., AVX512, GFNI).</li>
</ul>
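
<p>A scalar sketch of that access pattern (my stand-in for ISA-L's hand-written assembly, with <code class="language-plaintext highlighter-rouge">gf_mul</code> as sketched earlier): each data word is loaded once and folded into all parity accumulators, which stay in registers and are stored exactly once per position.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

extern uint8_t gf_mul(uint8_t a, uint8_t b);

/* assumes m &lt;= 6, mirroring ISA-L's register-count limit */
void encode_accumulate(uint8_t **parity, uint8_t **data,
                       const uint8_t *coef,   /* m x k coefficients */
                       int m, int k, size_t len)
{
    for (size_t i = 0; i &lt; len; i++) {
        uint8_t acc[6] = {0};
        for (int j = 0; j &lt; k; j++) {
            uint8_t d = data[j][i];            /* one load, reused m times */
            for (int p = 0; p &lt; m; p++)
                acc[p] ^= gf_mul(coef[p * k + j], d);
        }
        for (int p = 0; p &lt; m; p++)
            parity[p][i] = acc[p];             /* one store per parity */
    }
}
</code></pre></div></div>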

<p>ISA-L's handwritten assembly here is similar to writing CUDA kernels. Interestingly, many EC implementations, even in top conferences, are built indirectly on ISA-L's standard encode interface, and these hand-rolled wrappers are not very careful about instruction-level efficiency. Writing assembly in the style of ISA-L would yield at least a 1-2x improvement (although this may not be their focus).</p>

<p>FYI, ISA-L also has some minor limitations:</p>

<blockquote>
  <p>Summary is ISA-L EC can use any encoding matrix, performs the same operation regardless of encoding matrix provided and the documentation is clear about the limitations of gf_gen_rs_matrix(). So not a bug in ISA-L.</p>

  <p>Vandermonde matrix example of encoding coefficients where high portion of matrix is identity matrix I and lower portion is constructed as 2^{i*(j-k+1)} i:{0,k-1} j:{k,m-1}. Commonly used method for choosing coefficients in erasure encoding but does not guarantee invertable for every sub matrix. For large pairs of m and k it is possible to find cases where the decode matrix chosen from sources and parity is not invertable. Users may want to adjust for certain pairs m and k. If m and k satisfy one of the following inequalities, no adjustment is required:</p>
  <ul>
    <li>k &lt;= 3</li>
    <li>k = 4, m &lt;= 25</li>
    <li>k = 5, m &lt;= 10</li>
    <li>k &lt;= 21, m-k = 4</li>
    <li>m - k &lt;= 3</li>
  </ul>
</blockquote>

<h3 id="gfni">GFNI</h3>

<p>A new feature introduced in ISA-L v2.31. The distributed interface is <code class="language-plaintext highlighter-rouge">ec_encode_data_avx512_gfni</code>. The Go EC library <a href="https://github.com/klauspost/reedsolomon">https://github.com/klauspost/reedsolomon</a> also provides a similar interface. The GFNI instruction set directly supports GF multiplication operations at the hardware level, eliminating the need for additional lookup tables, bit operations, etc. On corresponding platforms, my test results show that GFNI+AVX512 can <strong>double</strong> the performance compared to ordinary AVX512 single-threaded.</p>
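
<p>A minimal illustration of the instruction class (assuming a compiler and CPU with GFNI plus AVX-512). One caveat worth hedging on: <code class="language-plaintext highlighter-rouge">GF2P8MULB</code> multiplies modulo the AES polynomial 0x11b, while ISA-L's field uses 0x11d, which is why ISA-L's GFNI path is built on <code class="language-plaintext highlighter-rouge">GF2P8AFFINEQB</code> with a precomputed 8x8 bit matrix per coefficient; the sketch below only demonstrates the simpler multiply form.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;immintrin.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* multiply a region by constant c in GF(2^8) mod 0x11b, 64 bytes/step */
void gf_mul_region_gfni(uint8_t *dst, const uint8_t *src,
                        size_t len, uint8_t c)
{
    __m512i vc = _mm512_set1_epi8((char)c);
    for (size_t i = 0; i + 64 &lt;= len; i += 64) {
        __m512i x = _mm512_loadu_si512(src + i);
        _mm512_storeu_si512(dst + i, _mm512_gf2p8mul_epi8(x, vc));
    }
}
</code></pre></div></div>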

<h1 id="dialga">Dialga</h1>

<p>In our ICPP '25 work, <em>Dialga</em>, we discovered that in CPU-centric scenarios involving Persistent Memory (PM) or CXL-based slow memory (high access latency), memory access efficiency is the true performance bound. The core bottleneck lies in the <strong>inefficiency of hardware prefetchers</strong> (a viewpoint also shared by the ASPLOS '25 paper, <em>Melody</em>).</p>

<p>Through various tests, we identified several interesting degradation phenomena, similar to issues reported in the community <a href="https://github.com/intel/isa-l/issues/152">intel/isa-l #152</a>. We found that prefetching efficiency degrades significantly in scenarios involving wide stripes, small blocks, or high concurrency. We employed reverse engineering and profiling for attribution: for instance, in wide-stripe scenarios, the core L2 hardware stream prefetcher's Stream Entry Table—which can only track 32 unidirectional streams—becomes overwhelmed. This destroys the prefetcher's confidence, eventually causing it to disable itself.</p>

<p>For various reasons, the scope of this work remains focused on Erasure Coding (EC) acceleration. However, for emerging memory scenarios, the destruction of hardware prefetching efficiency is the "elephant in the room." Unless you are using a GPU with massive SMs, CPU-driven efficiency will always be undermined by the memory wall. At the system level, modifying black-box cache logic is difficult, and CXL performance varies across vendors. Therefore, strategies are needed to bridge this gap.</p>
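
<p>As a generic illustration of the software-prefetch side (not Dialga's actual policy): once the hardware stream prefetcher has backed off, explicit prefetches with a fixed lookahead distance can still hide slow-memory latency.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;xmmintrin.h&gt;   /* _mm_prefetch */

#define LOOKAHEAD 512    /* tune to the medium's load latency */

void xor_region_prefetch(uint8_t *dst, const uint8_t *src, size_t len)
{
    for (size_t i = 0; i &lt; len; i += 64) {           /* one cache line */
        _mm_prefetch((const char *)src + i + LOOKAHEAD, _MM_HINT_T0);
        for (size_t j = 0; j &lt; 64 &amp;&amp; i + j &lt; len; j++)
            dst[i + j] ^= src[i + j];
    }
}
</code></pre></div></div>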

<p>The design part in our paper was actually less interesting. It primarily involved implementing adaptive software/hardware prefetch scheduling strategies to address the discovered issues. Due to other ongoing commitments, I haven't fully open-sourced the code yet, but I will do so immediately after finishing my dissertation.</p>]]></content><author><name>Gray</name></author><category term="EC" /><summary type="html"><![CDATA[Accelerating Erasure Coding]]></summary></entry><entry><title type="html">Erasure Coding + Disaggregated Memory</title><link href="https://www.grayxu.cn/2024/12/05/EC-DM/" rel="alternate" type="text/html" title="Erasure Coding + Disaggregated Memory" /><published>2024-12-05T00:00:00+08:00</published><updated>2024-12-05T00:00:00+08:00</updated><id>https://www.grayxu.cn/2024/12/05/EC-DM</id><content type="html" xml:base="https://www.grayxu.cn/2024/12/05/EC-DM/"><![CDATA[<p>Disaggregated Memory is currently a hot topic in systems research, and distributed large-capacity memory clearly requires system-level reliability strategies. While replication has always been a default choice, with many related works, including recent ones like SWARM@SOSP'24, erasure coding is also an option. This article lists existing EC+DM works.</p>

<h2 id="ipdps-21-f-write-fast-rdma-supported-writes-in-erasure-coded-in-memory-clusters">IPDPS '21 F-Write: Fast RDMA-supported Writes in Erasure-coded In-memory Clusters</h2>

<p>Previous works like octopus@ATC'17 have reconstructed network I/O (like RPC) using one-sided verbs.</p>

<p>This paper focuses on the scenario of RDMA+EC, where updates are slow due to I/O amplification.</p>
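
<p>For background (my sketch, not F-Write's code), the amplification comes from parity maintenance: by linearity, a small write does not re-encode the stripe but ships a delta, p_new = p_old ^ g * (d_new ^ d_old), so every parity block must still be read and rewritten; these extra round trips are what the paper attacks.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* GF(2^8) multiply, e.g., a table-based implementation */
extern uint8_t gf_mul(uint8_t a, uint8_t b);

/* fold one data block's update into one parity block in place:
 * parity ^= g * (d_new ^ d_old), for encoding coefficient g */
void parity_delta_update(uint8_t *parity, const uint8_t *d_old,
                         const uint8_t *d_new, uint8_t g, size_t len)
{
    for (size_t i = 0; i &lt; len; i++)
        parity[i] ^= gf_mul(g, d_old[i] ^ d_new[i]);
}
</code></pre></div></div>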

<ol>
  <li>Implements a 2PC scheme for EC using one-sided writes.
    <ul>
      <li>Essentially, it's octopus's one-sided RPC.</li>
    </ul>
  </li>
  <li>Then, it builds on top of this with <em>speculative updates</em>, implementing in-flight data merging (merging multiple in-flight submissions) for EC.</li>
</ol>

<p><img src="https://www.grayxu.cn/images/2024/12/04/2024-12-04-21-52-02.png" alt="Pasted image 20241014090638.png" /></p>

<p><em>No NIC info provided.</em></p>

<h2 id="fast-22-hydra-resilient-and-highly-available-remote-memory">FAST '22 Hydra: Resilient and Highly Available Remote Memory</h2>

<p>SymbioticLab, available on arXiv since 2019.</p>

<p>Problems:</p>
<ul>
  <li><strong>High Latency:</strong> EC-based remote memory solutions cannot meet microsecond-level latency requirements due to encoding overhead, straggler issues, interrupt overhead, and data replication overhead.</li>
  <li><strong>Low Availability:</strong> Existing fault tolerance mechanisms based on replication and erasure coding can easily lead to data loss in the event of correlated failures due to the random placement of coding groups.</li>
</ul>

<p>Challenges:</p>
<ol>
  <li>Encoding overhead</li>
  <li>Splitting amplifies tail latency.</li>
  <li>Context switching overhead</li>
  <li>Copy overhead</li>
  <li>Placement strategy is not good for simultaneous errors.</li>
</ol>

<p>Design:</p>
<ul>
  <li><strong>Asynchronous encoded writes</strong> and delayed binding reads to hide latency.
    <ul>
      <li>Asynchronously Encoded Write: Fragments are not queued; similar to <em>late binding for writes</em>, the write is confirmed once the first <em>k</em> requests return.</li>
      <li>Late Binding: Basically, multi-fragment reads in an EC cache (see the sketch after this list).</li>
    </ul>
  </li>
  <li>In-Place Coding minimizes data copying. Unregisters after receiving k splits to prevent overwriting by subsequent splits. [<em>Will there be no registration performance issues?</em>].</li>
  <li>Run-to-Completion avoids context switching because the latency is very low.</li>
  <li>The CodingSets algorithm improves availability by carefully designing the placement strategy of coding groups, reducing the probability of data loss under correlated failures. [<em>A classic EC problem</em>, from <strong>CopySet</strong>].</li>
</ul>
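
<p>A hypothetical sketch of the late-binding read path (names invented for illustration; this is not Hydra's API): over-provision the reads, then decode from whichever k splits finish first, so a single straggler cannot set the tail latency.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

#define K     8
#define DELTA 2

extern void post_rdma_read(int split_id, uint8_t *buf);   /* assumed */
extern int  poll_one_completion(void);  /* returns a finished split id */
extern void ec_decode(uint8_t *bufs[], const int ids[], int k);

void late_binding_read(uint8_t *bufs[K + DELTA])
{
    int done_ids[K];
    for (int i = 0; i &lt; K + DELTA; i++)
        post_rdma_read(i, bufs[i]);       /* over-provisioned reads */
    for (int done = 0; done &lt; K; done++)
        done_ids[done] = poll_one_completion();
    ec_decode(bufs, done_ids, K);         /* the first k arrivals win */
}
</code></pre></div></div>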

<p>Open Source: <a href="https://github.com/SymbioticLab/hydra">https://github.com/SymbioticLab/hydra</a></p>

<p><em>Is it really a good idea to use late binding so extensively?</em></p>
<ul>
  <li><em>Increased number of network packets (could RDMA verb scalability be limited?).</em></li>
  <li><em>Higher computational pressure (mainly added latency; throughput is still line rate; can it be pipelined?).</em></li>
</ul>

<h2 id="osdi-22-carbink-fault-tolerant-far-memory">OSDI '22 Carbink: Fault-tolerant far memory</h2>

<p>Google</p>

<p>Follow-up work to Hydra@FAST'22. The problems are that, due to self-coding partitioning of a single object:</p>
<ol>
  <li>Multiple network I/O operations are required to reconstruct a page.
    <ol>
      <li><em>But late binding can be used, so it's okay. It seems like just a <strong>granularity</strong> trick.</em></li>
    </ol>
  </li>
  <li>Computation is centralized and cannot be offloaded to remote nodes.
    <ol>
      <li><em>But for DM, this is a false need. However, what about potential RNIC offload?</em></li>
    </ol>
  </li>
</ol>

<p>Design:</p>
<ol>
  <li>Therefore, it abstracts the concept of a span, where each span consists of multiple pages with similar object sizes. Then, cold/hot determination, grouping (clock algorithm), and eviction are performed asynchronously and transparently. [<em>Like slab</em>]. Note that the unit of processing here is clearly different from Hydra.
    <ul>
      <li>There are some system designs, but a lot of related work exists, especially in slab-related clustering.</li>
      <li><em>And from an EC perspective, one is self-coding and the other is cross-coding, making direct comparison difficult.</em></li>
    </ul>
  </li>
  <li>Asynchronous GC compaction (EC stripes).
    <ol>
      <li>Hydra does not have this issue because a set of stripes forms an object, so their lifecycles are tied together.</li>
      <li>Triggered by swap-out, and consistent completion needs to be ensured. 2PC is a naive approach.
        <ol>
          <li>EC-batch local</li>
          <li>EC-batch remote (offload parity calculation to remote nodes)</li>
        </ol>
      </li>
    </ol>
  </li>
</ol>

<blockquote>
  <p>"To reconstruct a span, a compute node only needs to contact a single memory node storing that span."</p>
</blockquote>

<h2 id="tpds-23-enabling-efficient-erasure-coding-in-disaggregated-memory-systems">TPDS '23 Enabling Efficient Erasure Coding in Disaggregated Memory Systems</h2>

<p>USTC ADSL</p>

<p>This work begins to focus on the problem from a DM perspective (i.e., purely memory nodes).</p>

<p>As one-sided RDMA latency drops to the microsecond level, encoding overhead degrades the performance of DM with EC. To enable efficient EC in DM, we thoroughly analyzed the coding stack from the perspectives of <strong>cache efficiency and RDMA transfer</strong>.</p>

<p>DM is a subset of RDMA, where local memory is more limited or only acts as a cache.
A natural approach is to use pipelining, but the challenges are:</p>
<ol>
  <li>Sub-stripe segmentation affects cache efficiency.</li>
  <li>Dedicated kernel coding reduces cache pollution.</li>
  <li>How object size impacts pipeline scheduling issues.</li>
</ol>

<p>MicroEC significantly reduces latency variation by reusing auxiliary encoding data. For example, it reduces the P99 latency of writing a 1 KB object by 27.81%. It optimizes the coding workflow and coordinates encoding and RDMA transfer through an exponential pipeline while carefully adjusting coding and transmission threads to minimize latency.</p>

<p>Note that this work only focuses on objects larger than 64KB.</p>

<p>Design:</p>
<ol>
  <li>Reuse auxiliary data.</li>
  <li>Propose efficient data structures to support the design.</li>
  <li>A non-blocking pipeline, and carefully adjust the coding and transmission threads.</li>
</ol>

<p>The sub-stripe size is a trade-off: larger sizes degrade performance (head/tail latency amplification), while smaller sizes increase network latency (but isn't it possible to overlap?).</p>

<p>This work has a more EC-centric flavor. It focuses on <strong>reusing auxiliary encoded data, using an exponential pipeline, and carefully adjusting coding and transmission threads</strong>.</p>

<p>Open Source: <a href="https://github.com/ADSL-EC/MicroEC">https://github.com/ADSL-EC/MicroEC</a></p>

<p>I don't understand why they chose to use Java's Crail-1.3 for the system. It's surprising to use a system with a built-in GC for something so sensitive. No wonder it can only handle large objects.</p>

<h2 id="sosp-24-aceso-achieving-efficient-fault-tolerance-in-memory-disaggregated-key-value-stores">SOSP '24 Aceso: Achieving Efficient Fault Tolerance in Memory-Disaggregated Key-Value Stores</h2>

<p>Pengfei Zuo, DM KVS + EC</p>

<p>Checkpointing for the index, EC for KV pairs.<br />
Key pieces: a differential checkpointing scheme, a version-based recovery method, a difference-based space reclamation mechanism, and a hierarchical recovery scheme.</p>

<p><strong>Challenges:</strong></p>
<ol>
  <li>Checkpoint network overhead, rollback leads to loss of recently submitted KV pairs.</li>
  <li>EC introduces GC and recomputation.</li>
  <li>Memory node recovery is slow due to computation (<em>pure decoding recovery issue?</em>).</li>
  <li>Checkpoint transfer can interfere with performance.</li>
</ol>

<p>Solution:</p>
<ol>
  <li><strong>Differential Checkpointing for Index:</strong> RNIC IOPS are limited. By reducing the bandwidth consumed by checkpoint transfers, Aceso reduces the performance interference of the checkpoint mechanism.
    <ol>
      <li>Calculate the index delta -&gt; LZ4 -&gt; write to MN -&gt; adjacent MN decompresses and then XOR updates. (The atomicity guarantee here comes from the fact that the index being written will not be included in this checkpoint).</li>
      <li>After rolling back the checkpoint, you need to scan to match KV pairs. Some RDMA CAS tricks are used to apply versions to slots.</li>
      <li><strong>Version-based recovery method:</strong>
        <ol>
          <li>Index Slot Versioning: The slot is extended to ensure the latest version. By reading the latest checkpoint and reprocessing recent KV pairs, Aceso ensures that the index can recover to the latest and consistent state after fault recovery using RDMA CAS.</li>
          <li>Index Versioning implements further strategies to accelerate recovery (narrowing the scan range, etc.).</li>
        </ol>
      </li>
    </ol>
  </li>
  <li><strong>Offline Erasure Coding for KV Pairs:</strong> Offline EC, leveraging the linear properties of X-code erasure codes, Aceso implements an efficient space reclamation mechanism for old KV pairs with almost no overhead.
    <ol>
      <li>Offline mainly means that the MN performs the operation in the background. First, write everything to the MN, then the MN's CPU performs encoding in the background.</li>
      <li>Metadata records the role, validity, bitmap, etc., similar to previous DM hash work. Then, it uses a slab-like management. <img src="https://www.grayxu.cn/images/2024/12/04/2024-12-04-22-05-02.png" alt="{93C569C8-3635-44DA-A5B5-7EE869B488CB}.png" /></li>
    </ol>
  </li>
  <li><strong>Hierarchical Recovery Scheme:</strong> By prioritizing the recovery of critical data (such as the index), Aceso ensures fast recovery of KV storage functionality, minimizing user disruption.
    <ol>
      <li>Metadata is directly replicated, the index is recovered to a previous version using checkpoints, and then KV pair versions are scanned.</li>
      <li>Block regions are recovered using EC, while parity is recovered in the background (delta merging occurs here).</li>
      <li>By default, it optimizes <em>pipelining</em> of RDMA reads and decoding, as well as doorbell batching.</li>
    </ol>
  </li>
</ol>

<p>CX3 cluster of CloudLab<br />
Aceso achieves significant throughput improvements in write requests (INSERT, UPDATE, DELETE). Among them, the improvement in DELETE requests is the most significant, reaching 2.67 times.</p>

<p>The baseline is the replicated FUSEE@FAST'23, but many improvements come from the significantly reduced overhead of the index after checkpointing.</p>

<p><a href="https://zhuanlan.zhihu.com/p/5100600418#:~:text=%E9%83%BD%E5%8F%AF%E4%BB%A5%E6%94%AF%E6%8C%81CXL%E3%80%82-,Aceso,-%3A%20Achieving%20Efficient%20Fault">IPADS Notes</a></p>

<hr />

<p>Random thoughts:</p>
<ul>
  <li>…</li>
</ul>]]></content><author><name>Gray</name></author><category term="EC" /><summary type="html"><![CDATA[Erasure Coding + Disaggregated Memory]]></summary></entry><entry><title type="html">Erasure Coding NIC Offload</title><link href="https://www.grayxu.cn/2024/12/04/EC-offload/" rel="alternate" type="text/html" title="Erasure Coding NIC Offload" /><published>2024-12-04T00:00:00+08:00</published><updated>2024-12-04T00:00:00+08:00</updated><id>https://www.grayxu.cn/2024/12/04/EC-offload</id><content type="html" xml:base="https://www.grayxu.cn/2024/12/04/EC-offload/"><![CDATA[<p>About offloading erasure coding to NICs.</p>

<blockquote>
  <p>This article was written a long time ago, and the logic is a bit muddled. I recently did some new research and found this draft in Obsidian. Although a little messy, it still contains some useful information, so I polished it with a large language model and am now publishing it. (Also because I suddenly realized it has been a long time since my last update, plus procrastination before a deadline.)</p>
</blockquote>

<p>High-speed networks like RDMA are rapidly developing. 800 Gbps NICs are on the horizon. Despite numerous efforts dedicated to accelerating Erasure Coding (EC), EC acceleration libraries like ISA-L haven't kept pace with the advancements in networking. Consequently, for traditional EC where the bottleneck was primarily network bandwidth, a portion of the bottleneck has shifted to computation. Furthermore, computation within EC is also suitable for offloading to processors on PCI-E, which can simultaneously save CPU resources.</p>

<blockquote>
  <p>In fact, multi-core throughput is sufficient, but yes, simple calculations like these should be offloaded to a DSA.</p>
</blockquote>

<p><a href="http://www.shihaiyang.me/"><em>Haiyang Shi</em></a> (now at ByteDance US Infrastructure System Lab), a PhD from OSU, has conducted significant research on offload encoding to NIC. <a href="https://etd.ohiolink.edu/acprod/odb_etd/etd/r/1501/10?clear=10&amp;p10_accession_num=osu160694815517547">his thesis</a></p>

<p><img src="https://www.grayxu.cn/images/2022/03/17/2022-03-17-17-27-35.png" alt="image.png" />
ps: Gibraltar is an EC library for GPUs</p>

<p>The figure qualitatively illustrates the current throughput performance of different acceleration libraries on various processors. ISA-L, due to its cache-friendly design, significantly outperforms others.</p>

<p>It's evident that while the granularity of PCI-E is 64B, the sweet spot for offload devices lies at the MB level. Therefore, for small object cases like KVS, offloading could introduce substantial latency overhead.</p>

<h2 id="hpdc19-umr-ec-a-unified-and-multi-rail-erasure-coding-library-for-high-performance-distributed-storage-systems">HPDC'19 UMR-EC: A Unified and Multi-Rail Erasure Coding Library for High-Performance Distributed Storage Systems</h2>

<p><strong>Goal:</strong> Integrate devices such as CPUs, GPUs, and network interface cards (i.e., multi-rail support) to execute erasure coding (EC) operations in parallel.<br />
<strong>Methods:</strong> A unified multi-rail EC library that can fully leverage heterogeneous EC encoders. The proposed interface is complemented by <em>asynchronous semantics, an optimized metadata-free scheme, and EC rate-aware task scheduling</em>, enabling efficient I/O pipelines.</p>

<p>This work focuses on two-level hierarchies: CPU+GPU and CPU+RNIC (note: only CX5 provides EC features).</p>

<p>(<em>Intuitively, disregarding implementation effort, the core focus is on managing the computing power of different devices and task distribution. Intensive multi-tasking is straightforward. For individual small tasks, such as degraded reads, how to distribute them to cores with different computing capabilities to avoid tail latency necessitates a predictor. However, this approach confines offloading to enhancing computing power rather than shortening paths, for instance.</em>)</p>

<p>The primary strategy aims to reduce latency by overlapping the three stages of data retrieval, coding, and data transmission, similar to a pipeline.</p>

<p><img src="https://www.grayxu.cn/images/2024/12/04/2024-12-04-20-49-03.png" alt="Aspose.Words.8c4c77a8-626b-4a47-85dc-22a66dae0175.054.png" /></p>

<p>Read operations follow a similar approach. The core idea is that by splitting each coding task into multiple subtasks and distributing them across various devices, these devices can independently and concurrently complete these subtasks without blocking communication or other processes. 
The strategy for controlling task distribution is simple: maintaining three additional queues and observing their flow rates.</p>
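
<p>A toy sketch of the chunked pipeline (my illustration with assumed asynchronous stage primitives, not UMR-EC's interface): at step s, chunk s is being fetched while chunk s-1 is encoded and chunk s-2 is sent, so the three stages overlap instead of running back to back.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* assumed asynchronous primitives: fetch/send post the transfer and
 * return immediately, so the encode overlaps with both transfers */
extern void fetch_chunk(int i);
extern void encode_chunk(int i);
extern void send_chunk(int i);

void pipelined_encode(int nchunks)
{
    for (int s = 0; s &lt; nchunks + 2; s++) {
        if (s &lt; nchunks)
            fetch_chunk(s);                /* stage 1: data retrieval */
        if (s &gt;= 1 &amp;&amp; s - 1 &lt; nchunks)
            encode_chunk(s - 1);           /* stage 2: coding */
        if (s &gt;= 2)
            send_chunk(s - 2);             /* stage 3: transmission */
    }
}
</code></pre></div></div>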

<p>Note that although the GPU performs calculations, the CPU is still responsible for packet transmission.</p>

<p>Some data and benchmarks seem not to align with their previous work in EC-Bench '18?</p>

<h2 id="sc19-triec-an-efficient-erasure-coding-nic-offload-paradigm-based-on-tripartite-graph-model">SC'19 TriEC: An Efficient Erasure Coding NIC Offload Paradigm based on Tripartite Graph Model</h2>

<p>This paper discusses offloading the EC computation process to RDMA NICs. The problem is abstracted into a tripartite graph model. Additionally, some network primitives are designed to support this offloading. It is primarily a networking-focused work.</p>

<p><em>Here, a key difference from the HPDC work is that the RNIC can handle network packet transmission?</em></p>

<p>Two types of offload NICs:</p>
<ul>
  <li><strong>Incoherent:</strong> The CPU sends data in memory to the NIC for parity calculation, and subsequently issues a command to send the parity data.</li>
  <li><strong>Coherent:</strong> The NIC calculates and stores the parity data in memory before sending it.
    <ul>
      <li>Benefits: Reduces CPU overhead and DMA operations (i.e., fewer read operations).</li>
    </ul>
  </li>
</ul>

<p>However, the above optimization strategies have limitations:</p>
<ol>
  <li>Only one NIC is used for computation, leading to poor parallelism.</li>
  <li>NIC network resources are not fully utilized.</li>
  <li>Only the encode-and-send primitive is supported, not the receive-and-decode primitive.</li>
</ol>

<p><strong>Design:</strong></p>
<ul>
  <li>If we consider the original architecture as a bipartite graph (BiEC) (where the source and NIC are one node and the destination is another), their design is a tripartite graph. The encoding process is divided into multiple subsets and sent to multiple NICs on different nodes for calculation. This distributes the computational load across the NICs. The decoding process is similar, with decoding tasks also being decomposed and distributed.  [Implementation requires designating a leader within a group to manage request distribution.]
    <ul>
      <li>Finer-grained task decomposition enables improved parallelism.</li>
    </ul>
  </li>
</ul>

<p><img src="https://www.grayxu.cn/images/2024/12/04/2024-12-04-20-52-50.png" alt="Aspose.Words.8c4c77a8-626b-4a47-85dc-22a66dae0175.049.png" />
<img src="https://www.grayxu.cn/images/2024/12/04/2024-12-04-20-53-08.png" alt="Aspose.Words.8c4c77a8-626b-4a47-85dc-22a66dae0175.050.png" /></p>

<p><em>Hence, the process transforms from a single-hop to a double-hop-like network.</em></p>

<p>With in-band repair (as opposed to out-of-band repair), the intermediate result of a subtask can already be the desired result for a particular node at no extra cost, allowing direct delivery to that node and eliminating the write-back.
Furthermore, the initialization overhead of the NIC-supported EC offload APIs is significant, necessitating buffering.</p>

<p>It's unclear how the receive-and-decode primitive is implemented. The intermediate forwarding nodes seem to still require CPU involvement.
Note that this network communication still uses two-sided verbs, not DM.</p>

<p>Random Thoughts</p>
<ul>
  <li><em>this approach of combining subtasks to satisfy a specific node's requirements is kind of a trade-off between computation and networking?</em></li>
  <li><em>Furthermore, this writing strategy inherently requires writing a portion of the data to specific nodes and then having those nodes calculate parity. This is an asynchronous process. Does this compromise reliability?</em></li>
  <li><em>Many of the choices presented here appear to extend local encoding techniques to distributed systems. Can this be extended further?</em></li>
</ul>

<h2 id="sc20-inec-fast-and-coherent-in-network-erasure-coding">SC'20 INEC: Fast and Coherent In-Network Erasure Coding</h2>

<p>This work seamlessly integrates operations like receiving data, calculating erasure codes, and sending results, reducing CPU intervention.
RDMA is extended with EC primitives within network primitives, such as <code class="language-plaintext highlighter-rouge">encode_and_send</code>, but with further expansions like <code class="language-plaintext highlighter-rouge">PPR</code> for forwarding encoding types (e.g., <code class="language-plaintext highlighter-rouge">receive_ec_send</code>).  [<strong>ec/xor-send, recv-ec/xor-send, and recv-ec/xor</strong>]
The combination of these three primitives is sufficient to express the computation and communication patterns of all five advanced erasure coding schemes shown in Figure 1.</p>

<p>This enables the construction of distributed erasure coding pipelines and the triggering of pre-submitted tasks without CPU intervention.</p>

<p><img src="https://www.grayxu.cn/images/2024/12/04/2024-12-04-20-59-02.png" alt="Aspose.Words.8c4c77a8-626b-4a47-85dc-22a66dae0175.055.png" /></p>

<p>The modified Mellanox OFED driver supports INEC primitives.</p>

<p>The implementation uses RDMA WAIT (This seems more suitable for DPUs and Bluefield. If line rate is not achieved, it can be awkward).</p>

<h2 id="refer">refer</h2>

<ol>
  <li>Shi, Haiyang, Xiaoyi Lu, and Dhabaleswar K. Panda. "EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures." International Symposium on Benchmarking, Measuring and Optimization. Springer, Cham, 2018.</li>
  <li>Shi, Haiyang, et al. "High-performance multi-rail erasure coding library over modern data center architectures: early experiences." Proceedings of the ACM Symposium on Cloud Computing. 2018.</li>
  <li>Shi, Haiyang, et al. "UMR-EC: A unified and multi-rail erasure coding library for high-performance distributed storage systems." Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing. 2019.</li>
  <li>Shi, Haiyang, and Xiaoyi Lu. "Triec: tripartite graph based erasure coding NIC offload." Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2019.</li>
  <li>Shi, Haiyang, and Xiaoyi Lu. "INEC: fast and coherent in-network erasure coding." SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020.</li>
</ol>]]></content><author><name>Gray</name></author><category term="EC" /><summary type="html"><![CDATA[Erasure Coding NIC Offload]]></summary></entry><entry><title type="html">Data Movement with DMA/DSA Offloading</title><link href="https://www.grayxu.cn/2023/10/10/DMA-DSA/" rel="alternate" type="text/html" title="Data Movement with DMA/DSA Offloading" /><published>2023-10-10T00:00:00+08:00</published><updated>2023-10-10T00:00:00+08:00</updated><id>https://www.grayxu.cn/2023/10/10/DMA-DSA</id><content type="html" xml:base="https://www.grayxu.cn/2023/10/10/DMA-DSA/"><![CDATA[<p>About offloading memory data movement to DMA or DSA engines.</p>

<p>What is DMA: <a href="https://jianyue.tech/posts/dma/">https://jianyue.tech/posts/dma/</a></p>

<p>Pros:</p>
<ul>
  <li>offloading for async ops</li>
  <li>less cache pollution</li>
  <li>fewer CPU cycles (fewer memory I/O stalls)</li>
</ul>

<p>Cons:</p>
<ul>
  <li>higher latency
    <ul>
      <li>resource management (addr translation, …)</li>
    </ul>
  </li>
  <li>bandwidth limited</li>
</ul>

<blockquote>
  <p><strong>note: partially translated by ChatGPT</strong></p>
</blockquote>

<h2 id="ipdps-07-designing-efficient-asynchronous-memory-operations-using-hardware-copy-engine-a-case-study-wi">IPDPS '07 Designing Efficient Asynchronous Memory Operations Using Hardware Copy Engine: A Case Study wi</h2>

<p>OSU's work: K. Vaidyanathan, W. Huang, L. Chai, D. K. Panda</p>

<p>DMA copy offload:</p>
<ol>
  <li>Reduction in CPU Resources and Better Performance</li>
  <li>Computation-Memory Copy Overlap</li>
  <li>Avoiding Cache Pollution Effects</li>
</ol>

<p>But there are concerns about:</p>
<ol>
  <li>a single transfer cannot span discontinuous physical pages</li>
  <li>overlapping source and destination buffers</li>
  <li>cache coherence (CC) traffic on the bus</li>
</ol>

<p>They developed a kernel-level DMA copy facility, which can also be extended for IPC, and considered issues such as alignment, buffer locking, and multiple DMA channels.</p>

<p>Some experimental results:</p>
<ul>
  <li>Setup: Intel 3.46 GHz processors and 2MB L2 cache system with SuperMicro X7DB8+ motherboards that include 64-bit 133 MHz PCI-X interfaces. The machine is connected with an Intel PRO1000Mbit adapter. We used the Linux RedHat AS 4 operating system and kernel version 2.6.9-30. <strong><em>It doesn't mention memory, but it seems relevant.</em></strong></li>
  <li>For data that is hot in the cache, CPU memcpy completely dominates.
    <ul>
      <li>In contrast, at 16KB, 4-channel DMA already beats the CPU.</li>
    </ul>
  </li>
  <li>Beyond 2MB, the CPU lags behind DMA in terms of bandwidth.</li>
  <li>The benefit of computation/copy overlap becomes evident above the KB range (an overlap ratio of roughly 0.3-0.4 at 1KB); when the size is too small, DMA's own startup overhead erases the benefit.
    <ul>
      <li><img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-14-01-30.png" alt="image.png" /></li>
    </ul>
  </li>
  <li>On pure read workloads, CPU memcpy is affected by cache pollution, resulting in a 30% drop.</li>
</ul>

<h2 id="cluster-07-efficient-asynchronous-memory-copy-operations-on-multi-core-systems-and-ioa">CLUSTER '07 Efficient Asynchronous Memory Copy Operations on Multi-Core Systems and I/OA</h2>

<p>Work from the same OSU group, by K. Vaidyanathan, L. Chai, W. Huang, and D. K. Panda</p>

<p>The previous work seemed more focused on performance, while this work provides a transparent solution for multi-core system design. The overhead of initiating DMA can be assigned to a dedicated core, enabling better overlap of memory access and computation (up to 100%); multi-core systems can likewise dedicate a core to copying. [ <em>More memory bandwidth or more cores?</em> ]<br />
The "protect" strategy is used to achieve application transparency.</p>

<p><img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-14-08-30.png" alt="image.png" /></p>

<p><img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-14-08-37.png" alt="image.png" /></p>

<h2 id="intel-spdk--dma">Intel SPDK + DMA</h2>

<p>a simple callback interface from userspace</p>

<p><img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-14-20-28.png" alt="image.png" /></p>

<table>
  <thead>
    <tr>
      <th>Function</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><a href="https://spdk.io/doc/ioat_8h.html#a784c1a69962e0964cf6988badd945b6f" title="Enumerate the I/OAT devices attached to the system and attach the userspace I/OAT driver to them if d...">spdk_ioat_probe()</a></td>
      <td>Enumerate the I/OAT devices attached to the system and attach the userspace I/OAT driver to them if desired.</td>
    </tr>
    <tr>
      <td><a href="https://spdk.io/doc/ioat_8h.html#a87ce4a1c8bdd3fb69079ac51e00f92e5" title="Get the DMA engine capabilities.">spdk_ioat_get_dma_capabilities()</a></td>
      <td>Get the DMA engine capabilities.</td>
    </tr>
    <tr>
      <td><a href="https://spdk.io/doc/ioat_8h.html#ac1de22182996edecb435f9583665008d" title="Build and submit a DMA engine memory copy request.">spdk_ioat_submit_copy()</a></td>
      <td>Build and submit a DMA engine memory copy request.</td>
    </tr>
    <tr>
      <td><a href="https://spdk.io/doc/ioat_8h.html#a6025f251c715e93ea27ee03b5ab9557c" title="Build and submit a DMA engine memory fill request.">spdk_ioat_submit_fill()</a></td>
      <td>Build and submit a DMA engine memory fill request.</td>
    </tr>
  </tbody>
</table>

<p><img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-14-20-46.png" alt="image.png" /></p>

<p><a href="https://www.intel.com/content/www/us/en/developer/articles/technical/fast-memcpy-using-spdk-and-ioat-dma-engine.html">https://www.intel.com/content/www/us/en/developer/articles/technical/fast-memcpy-using-spdk-and-ioat-dma-engine.html</a></p>
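
<p>Putting the functions in the table together, a minimal sketch of an offloaded async memcpy looks roughly like this (error handling omitted; double-check the signatures in <code class="language-plaintext highlighter-rouge">spdk/ioat.h</code> against your SPDK version):</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include "spdk/env.h"
#include "spdk/ioat.h"
#include &lt;string.h&gt;

static struct spdk_ioat_chan *g_chan;
static volatile bool g_done;

/* attach the first I/OAT channel found while probing */
static bool probe_cb(void *ctx, struct spdk_pci_device *dev) { return g_chan == NULL; }
static void attach_cb(void *ctx, struct spdk_pci_device *dev,
                      struct spdk_ioat_chan *chan) { g_chan = chan; }
static void copy_done(void *arg) { g_done = true; }

int main(void) {
    struct spdk_env_opts opts;
    spdk_env_opts_init(&amp;opts);
    spdk_env_init(&amp;opts);
    spdk_ioat_probe(NULL, probe_cb, attach_cb);

    void *src = spdk_dma_zmalloc(4096, 64, NULL);  /* DMA-safe buffers */
    void *dst = spdk_dma_zmalloc(4096, 64, NULL);
    memset(src, 0xab, 4096);

    /* submit the async copy; the engine moves data while the CPU stays free */
    spdk_ioat_submit_copy(g_chan, NULL, copy_done, dst, src, 4096);
    while (!g_done)
        spdk_ioat_process_events(g_chan);          /* poll for completion */

    spdk_ioat_detach(g_chan);
    return 0;
}
</code></pre></div></div>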

<h2 id="fast-23-revitalizing-the-forgotten-on-chip-dma-to-expedite-data-movement-in-nvm-based-storage-systems">FAST '23 Revitalizing the Forgotten On-Chip DMA to Expedite Data Movement in NVM-based Storage Systems</h2>

<p>USTC's research focuses on synchronous data movement between NVM and DRAM.</p>

<blockquote>
  <p>Large size asynchronous movements on NVM are often considered a mere trick (e.g., HeMem@SOSP'21). But if we divide requests internally, does it also weaken the concepts of sync and async? In essence, everything discussed earlier is also synchronous.</p>
</blockquote>

<p>First, DMA on NVM was profiled, evaluating parallel copies for inter and intra requests, among other aspects. Some notable differences include:</p>
<ul>
  <li>Intra: Multi-channel DMA for PM writes is not very effective, while reads are feasible (<em>limited by write bandwidth</em>).</li>
  <li>Inter: with more than 4 concurrent DMA requests, bandwidth easily saturates.</li>
  <li>NVM management in kernel space differs from DRAM as the space is contiguous, allowing for simpler management.</li>
  <li>…</li>
</ul>

<p>A breakdown of read and write, starting directly from 16KB<br />
<img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-14-13-34.png" alt="image.png" /></p>

<p>Then, they proposed a fastmove library:</p>
<ul>
  <li>Batching (pin, submit, etc.), alignment, pre-allocation…</li>
  <li>DMA-CPU cooperation.</li>
  <li>Implementation involved modifying the DMA kernel module to better serve NVM-DRAM DMA copying, with additions to kernel file systems like Nova.</li>
  <li>Further, they developed a scheduler to manage DMA-CPU cooperation based on IO size, among other factors (see the sketch below).</li>
</ul>
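
<p>A minimal sketch of the size-based dispatch idea from the last bullet (the threshold and both backend hooks are made up here; fastmove's real scheduler also weighs thread counts, DMA channel contention, etc.):</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cstddef&gt;
#include &lt;cstring&gt;

// Hypothetical stand-ins for the kernel DMA path (fastmove exposes the real
// thing inside the file system via fm_copy_to_user()/fm_copy_from_user()):
void dma_submit_copy(void *dst, const void *src, size_t n) { std::memcpy(dst, src, n); }
void dma_wait_completion() {}

constexpr size_t kDmaThreshold = 16 * 1024;  // assumed cut-over point

void hybrid_copy(void *dst, const void *src, size_t n) {
    if (n &lt; kDmaThreshold) {
        std::memcpy(dst, src, n);      // small copies: the CPU wins (DMA startup cost)
    } else {
        dma_submit_copy(dst, src, n);  // large copies: offload to the DMA engine
        dma_wait_completion();         // the movement here is synchronous
    }
}
</code></pre></div></div>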

<p>It's worth noting that this work is 15 years newer than the previous one, so it leverages many new features in the kernel to further enhance DMA performance.</p>

<p><img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-14-13-43.png" alt="image.png" /></p>

<blockquote>
  <p>I wanted to see the microbenchmark difference between the CPU and the modified DMA path at small sizes (1KB? 4KB?), but they didn't provide it; many experiments focus on end-to-end latency in the file system.<br />
Only <code class="language-plaintext highlighter-rouge">fm_copy_to_user()</code> and <code class="language-plaintext highlighter-rouge">fm_copy_from_user()</code> are supported.<br />
The claim that the new hardware is compatible with general CXL seems a bit forced; it appears to rest more on documentation than on actual implementation.</p>
</blockquote>

<h2 id="arxiv-23-asplos-24-a-quantitative-analysis-and-guideline-of-data-streaming-accelerator-in-intel-4th-gen-xeon-scalable-processors"><del>arXiv '23</del> ASPLOS '24 A Quantitative Analysis and Guideline of Data Streaming Accelerator in Intel® 4th Gen Xeon® Scalable Processors</h2>

<!-- Related: *MICRO'23 CXL ≠ NUMA: Device-specific characteristics and effective use of true CXL memory* -->

<p>What is DSA?</p>
<ul>
  <li><a href="https://zhuanlan.zhihu.com/p/518157278">https://zhuanlan.zhihu.com/p/518157278</a></li>
</ul>

<p>DSA can offload operations including memcpy and even perform streaming CRC. A significant portion of the discussion is dedicated to the specification of DSA itself.</p>

<p>The key point is that DSA enables the calling end to operate with minimal latency:</p>
<ul>
  <li>Specialized hardware is used for IOMMU, allowing DSA to directly access SVM, thus eliminating the need for pinning as discussed earlier and avoiding most of the startup overhead.
    <blockquote>
      <ul>
        <li>Meanwhile, the address translations for the completion record, source, and destination buffers are performed by interacting with the on-device address translation cache (ATC) that interacts with the IOMMU on the SoC — a key difference from previous generations. This enables support of coherent shared memory between DSA and cores — they can access shared data in CPU virtual address space and thereby eliminate the need for applications to pin memory.</li>
      </ul>
    </blockquote>
  </li>
  <li>New instructions like <code class="language-plaintext highlighter-rouge">MOVDIR64B</code> bypass the cache to submit a 64B descriptor (see the sketch after this list).</li>
  <li>On-chip features include QoS and similar mechanisms.</li>
</ul>
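
<p>For a feel of the submission path, a sketch of a DSA memmove submitted with <code class="language-plaintext highlighter-rouge">MOVDIR64B</code>, assuming a dedicated work queue already configured (e.g., via accel-config) and exposed at <code class="language-plaintext highlighter-rouge">/dev/dsa/wq0.0</code>; the descriptor and completion-record structs come from the <code class="language-plaintext highlighter-rouge">linux/idxd.h</code> uapi:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;fcntl.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;unistd.h&gt;
#include &lt;x86intrin.h&gt;     // _movdir64b, _mm_pause (compile with -mmovdir64b)
#include &lt;linux/idxd.h&gt;    // dsa_hw_desc, dsa_completion_record
#include &lt;cstdint&gt;
#include &lt;cstring&gt;

int main() {
    int fd = open("/dev/dsa/wq0.0", O_RDWR);
    void *portal = mmap(nullptr, 4096, PROT_WRITE,
                        MAP_SHARED | MAP_POPULATE, fd, 0);

    alignas(32) static dsa_completion_record comp = {};
    static char src[4096], dst[4096];
    memset(src, 0xab, sizeof src);

    dsa_hw_desc desc = {};
    desc.opcode = DSA_OPCODE_MEMMOVE;
    desc.flags = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV; // ask for a completion record
    desc.src_addr = (uintptr_t)src;    // plain virtual addresses: with SVM + ATC
    desc.dst_addr = (uintptr_t)dst;    // there is no pinning step at all
    desc.xfer_size = sizeof src;
    desc.completion_addr = (uintptr_t)&amp;comp;

    _movdir64b(portal, &amp;desc);         // one 64B store submits the descriptor
    while (*(volatile uint8_t *)&amp;comp.status == 0)
        _mm_pause();                   // spin on the completion record

    munmap(portal, 4096);
    close(fd);
    return 0;
}
</code></pre></div></div>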

<p>Much of the data presented is quite intriguing:</p>

<p>Most notably, DSA directly bypasses many of the issues previously discussed regarding DMA from the hardware level, resulting in faster performance even for small sizes, such as 256B.<br />
<img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-14-25-27.png" alt="image.png" /></p>

<p>Async batching<br />
<img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-14-25-40.png" alt="image.png" /></p>

<p>Breakdown after batching<br />
<img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-14-25-48.png" alt="image.png" /></p>

<p>Saving CPU cycles<br />
<img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-14-25-57.png" alt="image.png" /></p>

<p>Finally, numerous guidelines are provided on maximizing throughput, interactions with the cache/memory hierarchy, and the configuration of DSA hardware resources.</p>

<blockquote>
  <p>DSA + memory-intensive systems? and nontrivial <br />
DSA + EC?</p>
</blockquote>]]></content><author><name>Gray</name></author><category term="System" /><summary type="html"><![CDATA[Data Movement with DMA/DSA Offloading]]></summary></entry><entry><title type="html">SW Prefetch in System&amp;amp;DB</title><link href="https://www.grayxu.cn/2023/10/09/prefetch/" rel="alternate" type="text/html" title="SW Prefetch in System&amp;amp;DB" /><published>2023-10-09T00:00:00+08:00</published><updated>2023-10-09T00:00:00+08:00</updated><id>https://www.grayxu.cn/2023/10/09/prefetch</id><content type="html" xml:base="https://www.grayxu.cn/2023/10/09/prefetch/"><![CDATA[<p>Prefetch to hide memory access latency (CPU stall)</p>
<blockquote>
  <ol>
    <li>What to prefetch</li>
    <li>When to prefetch</li>
    <li>Where to place the prefetched data</li>
  </ol>
</blockquote>

<p>Some ref:</p>
<ul>
  <li><a href="https://zhuanlan.zhihu.com/p/443829741">Prefetching、Interleaving 和 数据库</a></li>
  <li><a href="https://zhuanlan.zhihu.com/p/51588155">In-Memory DBMS 『Peloton』技术简述</a></li>
  <li><a href="https://howardlau.me/programming/improving-ilp-using-coroutines.html">使用协程提高流水线利用率 howardlau</a></li>
  <li><a href="https://stackoverflow.com/questions/72243997/how-to-use-software-prefetch-systematically/">https://stackoverflow.com/questions/72243997/how-to-use-software-prefetch-systematically/</a></li>
</ul>

<blockquote>
  <p><strong>note: partially translated by ChatGPT</strong></p>
</blockquote>

<h2 id="taco-14-when-prefetching-works-when-it-doesnt-and-why">TACO '14 When Prefetching Works, When It Doesn’t, and Why</h2>

<p><img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-14-41-35.png" alt="image.png" /></p>

<p>Discussing HW prefetch and SW prefetch:</p>
<ul>
  <li>SW prefetch is suitable for scenarios such as short arrays, sequential and irregular reads, etc., but issuing explicit prefetch instructions adds instruction overhead.</li>
  <li>HW prefetch heavily depends on the platform, as specific patterns need to be recognized.</li>
  <li>SW prefetch can disturb the HW prefetcher's training, which might negatively impact HW prefetching performance.</li>
  <li>HW prefetchers generally prefetch to L2 or L3, as the performance gap between L1 and L2 can be tolerable for an OOO CPU when the miss rate is below 20%.</li>
</ul>

<p>For more details on this topic, refer to: <a href="https://hackmd.io/@jserv/HJtfT3icx?type=view">https://hackmd.io/@jserv/HJtfT3icx?type=view</a></p>

<ul>
  <li>T0 (Temporal data) - Prefetch data into <strong>all levels of the cache hierarchy</strong>.</li>
  <li>T1 (Data about L1 cache misses) - Prefetch data into <strong>level 2 cache and higher levels</strong>.</li>
  <li>T2 (Data about L2 cache misses) - Prefetch data into <strong>level 3 cache and higher levels</strong>, or as implementation-specific choices.</li>
  <li>NTA (Non-Temporal data across all cache levels) - Prefetch data into non-temporal cache structures and prefetch it to locations close to the processor, minimizing cache pollution.
    <ul>
      <li><code class="language-plaintext highlighter-rouge">prefetchnta</code> is only used to prefetch into the USWC memory region using line fill buffers. Otherwise, it prefetches into L1 (and L3 inclusive L3 on CPU), bypassing L2 (as stated in Intel's optimization manual). You cannot weakly order loads from WB memory; there is no way to bypass cache coherence on WB.</li>
    </ul>
  </li>
</ul>

<p>For further insights, refer to: <a href="https://stackoverflow.com/questions/46521694/what-are-mm-prefetch-locality-hints">https://stackoverflow.com/questions/46521694/what-are-mm-prefetch-locality-hints</a></p>
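
<p>These locality hints map directly onto the <code class="language-plaintext highlighter-rouge">_mm_prefetch</code> intrinsic (note the pointer argument is a <code class="language-plaintext highlighter-rouge">const char*</code>):</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;xmmintrin.h&gt;

void warm_up(const double *a, const double *b, const double *stream) {
    _mm_prefetch((const char *)a, _MM_HINT_T0);       // T0: all cache levels
    _mm_prefetch((const char *)a, _MM_HINT_T1);       // T1: L2 and higher
    _mm_prefetch((const char *)b, _MM_HINT_T2);       // T2: L3 and higher
    _mm_prefetch((const char *)stream, _MM_HINT_NTA); // NTA: minimize pollution
    // ... overlap other work here; by the time these pointers are actually
    // dereferenced, the lines should already be close to the core.
}
</code></pre></div></div>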

<h2 id="but">BUT</h2>

<p>For databases, workloads such as <em>point chasing</em> are prevalent, as seen in hash joins, where <strong>HW prefetching is ineffective</strong>.</p>
<ul>
  <li>Hash join involves a large set of keys, and the task is to perform table lookups [used to join two tables, resulting in a significant amount of random memory access].</li>
  <li>MVCC chains are another example.</li>
  <li>However, integrating operations other than hash join into this context is certainly challenging, and it might require a case-by-case approach.</li>
</ul>

<p>While some works might only discuss hash joins, the ideas are generally applicable, so distinctions regarding whether the implementations in the articles are general are not considered here.</p>

<h2 id="icde-04-improving-hash-join-performance-through-prefetching">ICDE '04 Improving hash join performance through prefetching</h2>

<p>SW prefetch for hash join. In comparison to simple SW prefetch, which prefetches all related pages before access, further proposals include <em>Group Prefetching</em> and <em>Pipelined Prefetching</em>.</p>

<p><img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-14-47-39.png" alt="image.png" /></p>

<p>The idea is straightforward: for a batch of tasks, prefetch first, then perform the subsequent computations; by the time the data is actually needed, it is already in the cache. Pipelined prefetching goes a step further than group prefetching, which naturally imposes additional constraints on the size of each loop body, the batch size, and so on.</p>
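
<p>A minimal group-prefetching sketch for a hash probe (the bucket layout, trivial hash, and group size are assumptions; real group sizes are tuned to how many outstanding misses the core can sustain):</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cstddef&gt;
#include &lt;cstdint&gt;
#include &lt;xmmintrin.h&gt;

struct Bucket { uint64_t key; uint64_t payload; };

constexpr size_t G = 16;  // group size

uint64_t probe_batch(const Bucket *table, size_t mask,   // power-of-two table
                     const uint64_t *keys, size_t n) {
    uint64_t hits = 0;
    for (size_t base = 0; base + G &lt;= n; base += G) {
        // Stage 1: issue all prefetches for the group.
        for (size_t i = 0; i &lt; G; i++) {
            size_t slot = keys[base + i] &amp; mask;          // trivial hash for brevity
            _mm_prefetch((const char *)&amp;table[slot], _MM_HINT_T0);
        }
        // Stage 2: by now the first lines have arrived; do the actual probes.
        for (size_t i = 0; i &lt; G; i++) {
            size_t slot = keys[base + i] &amp; mask;
            hits += (table[slot].key == keys[base + i]);
        }
    }
    return hits;   // leftover tail (&lt; G keys) omitted for brevity
}
</code></pre></div></div>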

<p><a href="https://ieeexplore.ieee.org/document/1319989">https://ieeexplore.ieee.org/document/1319989</a><br />
<a href="https://zhuanlan.zhihu.com/p/443829741">https://zhuanlan.zhihu.com/p/443829741</a></p>

<p><strong><em>VLDB '17 Relaxed Operator Fusion for In-Memory Databases: Making Compilation, Vectorization, and Prefetching Work Together At Last</em></strong> is an example of a DB using GP (with SIMD). There are more detailed experiments, but it seems there are no significant changes in terms of methodology (although this isn't the focus of this article).</p>

<p><a href="https://zhuanlan.zhihu.com/p/51588155">https://zhuanlan.zhihu.com/p/51588155</a></p>

<h2 id="vldb-16-asynchronous-memory-access-chaining">VLDB '16 Asynchronous Memory Access Chaining</h2>

<p>AMAC provides a way to transform chained access patterns (point chasing with many pointer dereferences) into code amenable to SW prefetching, but this requires <strong>a significant amount of manual effort</strong>, even just for probing a hashtable.</p>

<p>The key observation is that not every access chain has a fixed size. Therefore, theoretically ideal pipeline prefetching isn't practical, and there will always be instances of <strong>pipeline stalls in irregular scenarios</strong>, similar to what occurs in superscalar processors.</p>

<p>Thus, they utilize a <strong>Finite State Machine (FSM)</strong> to abstract the entire process, enabling early modifications in the code to fill the pipeline effectively.</p>

<p><img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-15-17-18.png" alt="image.png" /></p>

<p>In comparison to simple group or simple pipeline strategies, AMAC represents a more dynamic approach, accounting for different sizes, among other factors. It emphasizes observing the dependency relationship to interleave prefetch and computation.</p>
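
<p>A condensed AMAC-style sketch for chained buckets (the node layout is hypothetical): each in-flight probe carries its own little FSM, and one loop round-robins across them, issuing a prefetch and switching away at every pointer hop, so chains of different lengths don't stall each other:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cstddef&gt;
#include &lt;cstdint&gt;
#include &lt;xmmintrin.h&gt;

struct Node { uint64_t key, val; Node *next; };
struct Probe { const uint64_t *key = nullptr; Node *cur = nullptr; };

uint64_t amac_probe(Node *const *heads, size_t mask,
                    const uint64_t *keys, size_t n) {
    constexpr size_t kWays = 8;   // number of interleaved FSM instances
    Probe fsm[kWays];
    uint64_t found = 0;
    size_t next = 0, live = 0;
    do {
        for (auto &amp;s : fsm) {
            if (!s.key) {                      // FREE: launch a new probe
                if (next &gt;= n) continue;
                s.key = &amp;keys[next++];
                s.cur = heads[*s.key &amp; mask];
                live++;
            } else if (s.cur-&gt;key == *s.key) { // HIT: consume the value
                found += s.cur-&gt;val;
                s.cur = nullptr;
            } else {
                s.cur = s.cur-&gt;next;           // MISS: hop down the chain
            }
            if (s.key &amp;&amp; s.cur)                // prefetch the next hop, then
                _mm_prefetch((const char *)s.cur, _MM_HINT_T0); // switch away
            else if (s.key) { s.key = nullptr; live--; }        // DONE: retire
        }
    } while (live &gt; 0 || next &lt; n);
    return found;
}
</code></pre></div></div>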

<blockquote>
  <p>Interestingly, even if you handwrite the state machine, the compiler might still mess up your code, making it slower. <a href="https://www.youtube.com/watch?v=j9tlJAqMV7U">Watch this video</a> for more information.</p>
</blockquote>

<h2 id="vldb-17-interleaving-with-coroutines-a-practical-approach-for-robust-index-joins">VLDB '17 Interleaving with <em>Coroutines</em>: A Practical Approach for Robust Index Joins</h2>

<p>AMAC is excellent and can get arbitrarily close to the theoretical limit, but it is not practical. This work proposes using coroutine switching to replace manually interleaved execution, delegating the scheduling of interleaving to the compiler or the DB engine.</p>

<p>The advantage of coroutines lies in their low switching overhead. Unlike heavyweight threads, in the best-case scenario, the overhead can be almost equivalent to that of a single function call.</p>

<p><img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-15-20-37.png" alt="image.png" /></p>

<p>Significantly, SW prefetching is indeed sensitive to many aspects, and such interleaving can potentially impose greater pressure on address translation. The group size still needs manual adjustment.</p>

<p>Note that here, of course, coding still needs to be done in this pattern, and engine developers still need to manually adjust various details of coroutines.</p>

<h2 id="vldb-18-exploiting-coroutines-to-attack-the-killer-nanoseconds">VLDB '18 Exploiting <em>coroutines</em> to attack the "killer nanoseconds"</h2>

<p>The discussion also revolves around using coroutines in DB to reduce memory stalls in <strong>pointer-intensive data structures</strong>. They transformed hashtables, binary searches, and more complex data structures like masstree and bw-tree, conducting numerous tests.</p>

<p>There are many intricacies, but only a few minor details are listed here:</p>
<ul>
  <li>HW thread (referring to Hyper-Threading) prefetch is not as effective as coroutine prefetch (referenced in <em>eurosys22</em>).</li>
  <li>The performance of coroutines varies significantly among different compilers.</li>
  <li>…</li>
</ul>

<p>There is also a connection to the line fill buffer (since the line fill buffer is also used for non-temporal store and similar operations, it appears that there might be some competition in this context).</p>

<h2 id="vldb-20-interleaved-multi-vectorizing">VLDB '20 Interleaved Multi-Vectorizing</h2>

<p>This is the work of ECNU by Zhuhe Fang, Beilei Zheng, and Chuliang Weng. It is also about SIMD+SW prefetch, which is quite an interesting study.</p>

<ol>
  <li>The first issue pertains to SIMD, which consumes multiple data items at once, not all of which may be present in the cache, resulting in a direct slowdown.
    <ol>
      <li>An interesting experiment demonstrated that although SIMD is often claimed to be powerful, its performance rapidly degrades as the workload size increases, becoming similar to scalar operations. Cache misses take around 200 cycles, which is orders of magnitude higher than the computation cycles. [ <em>Does dense SIMD also lead to frequency reduction?</em> ]</li>
    </ol>
  </li>
  <li>The second issue concerns empty lanes within SIMD registers: at some code stages the vector might not be fully occupied, leaving hardware resources underutilized.</li>
</ol>

<p>They proposed IMV:</p>
<ol>
  <li>(Manually) interleave the execution of different SIMD computations to implement SW prefetching and reduce cache misses.</li>
  <li>Introduced residual vector states to merge with divergent vector states.</li>
</ol>

<p><img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-15-25-23.png" alt="image.png" /></p>

<p>Understanding the concept is straightforward from the diagram.</p>

<blockquote>
  <p>*it seems that at least one assumption here is that misalignment exists? If alignment is directly addressed, all cache misses or code situations would be fully aligned in 64B units, leading to complete consistency.</p>
</blockquote>

<p>Experimental results:</p>
<blockquote>
  <p>We compared the performance of IMV with various other methods on Hash join probe (HJP) and Binary tree search (BTS). As shown in Figure 6, in most cases, IMV outperforms other methods, being 2.38 times, 1.39 times, 2.22 times, 2.74 times, and 4.85 times faster than AMAC (scalar code interleaving), FVA (fully vectorized AMAC), RAV (direct vectorized AMAC), SIMD (direct SIMD coding), and Naive (basic scalar implementation), respectively. In this experiment, Intel Vtune was used to further analyze the advantages of IMV through microarchitectural indicators, and the time breakdown of its execution is shown in Figure 7. The figure explains why IMV is much faster than other methods. IMV not only reduces memory access overhead but also eliminates speculative execution errors. The results from Naive (pure scalar implementation) indicate that the execution time of HJP and BTS is mainly spent on memory access. Although AMAC optimizes memory access to improve performance, it is severely limited by speculative execution errors. Compared to Naive and AMAC, SIMD on the CPU only eliminates branch errors with little effect, as there are a large number of cache misses.</p>
</blockquote>

<p><a href="https://zhuanlan.zhihu.com/p/1466210">Link to Zhihu</a>
<a href="https://www.bilibili.com/video/BV1iJ411C7Jj">Link to Bilibili</a></p>

<h2 id="vldb-21-corobase-coroutine-oriented-main-memory-database-engine">VLDB '21 CoroBase: <em>coroutine</em>-oriented main-memory database engine</h2>

<p>Continuing from the previous work, this study also adopts a strategy of alternating coroutines to implement SW prefetch. They aimed for an implementation that is as automated as possible, which led them to C++20 coroutines. (Note that C++20 coroutines are stackless, unlike boost's stackful ones, and switch via suspension.)</p>

<p>It's essential to note that the interleaving granularity here is not a batch of threads, but a batch of <code class="language-plaintext highlighter-rouge">get()</code> operations. This distinction sets it apart from AMAC, showing better performance with a small number of threads.</p>

<p><img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-15-29-01.png" alt="image.png" /></p>
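
<p>A minimal C++20 stackless-coroutine sketch of the idea (this is not CoroBase's actual engine; their two-level design and scheduling are omitted): each <code class="language-plaintext highlighter-rouge">get()</code> suspends right after issuing a prefetch, and a flat round-robin loop resumes the whole batch so the cache misses overlap:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;coroutine&gt;
#include &lt;cstdint&gt;
#include &lt;vector&gt;
#include &lt;xmmintrin.h&gt;

struct Probe {
    struct promise_type {
        uint64_t result = 0;
        Probe get_return_object() {
            return {std::coroutine_handle&lt;promise_type&gt;::from_promise(*this)};
        }
        std::suspend_always initial_suspend() { return {}; }
        std::suspend_always final_suspend() noexcept { return {}; }
        void return_value(uint64_t v) { result = v; }
        void unhandled_exception() {}
    };
    std::coroutine_handle&lt;promise_type&gt; h;
};

struct Node { uint64_t key, val; Node *next; };

Probe get(Node *head, uint64_t key) {            // one key lookup = one coroutine
    for (Node *n = head; n; n = n-&gt;next) {
        _mm_prefetch((const char *)n, _MM_HINT_T0);
        co_await std::suspend_always{};          // yield while the line flies in
        if (n-&gt;key == key) co_return n-&gt;val;
    }
    co_return 0;
}

uint64_t run_batch(std::vector&lt;Probe&gt; &amp;batch) {  // the "batch of get()s"
    uint64_t sum = 0;
    for (bool alive = true; alive; ) {
        alive = false;
        for (auto &amp;p : batch)
            if (!p.h.done()) { p.h.resume(); alive = true; }
    }
    for (auto &amp;p : batch) { sum += p.h.promise().result; p.h.destroy(); }
    return sum;
}
</code></pre></div></div>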

<p>The major problems they encountered were as follows:</p>
<ol>
  <li>Coroutine switching overhead: The overhead is significant every time there is a suspend, so they implemented a two-level system. However, some parts still require manual unwinding.</li>
  <li>Scheduling: Fixed batch size based on profiling, such as the optimal number of CPU hardware prefetches, constrained by the number of registers, and so on.</li>
  <li>Resource management: Adjusting the timing of resource entry and reclamation.</li>
  <li>Concurrency control and DB architecture choices: Thread-local transformation.</li>
</ol>

<p>A drawback of reducing the execution granularity is that the overhead of synchronization locks increases.</p>

<p>Tianzheng Wang's presentation can be found here: <a href="https://www.bilibili.com/video/BV1dX4y1K7U1">Bilibili Link</a></p>]]></content><author><name>Gray</name></author><category term="System" /><summary type="html"><![CDATA[SW Prefetch in System&DB]]></summary></entry><entry><title type="html">Fault Tolerance of Persistent Memory</title><link href="https://www.grayxu.cn/2022/09/29/fault-tolerant-PM/" rel="alternate" type="text/html" title="Fault Tolerance of Persistent Memory" /><published>2022-09-29T00:00:00+08:00</published><updated>2022-09-29T00:00:00+08:00</updated><id>https://www.grayxu.cn/2022/09/29/fault-tolerant-PM</id><content type="html" xml:base="https://www.grayxu.cn/2022/09/29/fault-tolerant-PM/"><![CDATA[<p>In this article, we will list several papers on local NVM/PM fault tolerance.</p>

<blockquote>
  <p>note:</p>
  <ul>
    <li>the fault tolerance in some papers may refer to crash consistency, but here we mainly focus on device failures.</li>
    <li>fault tolerance across networks is not in the scope here. Related works mostly use replications, from <em>Mojim</em> (ASPLOS '15) to <em>Rowan-KV</em> (OSDI '23)</li>
  </ul>
</blockquote>

<p>changelog:</p>
<ul>
  <li>2/17 add Kamino-Tx</li>
  <li>2/24 add TENET</li>
  <li>4/10 add Pavise</li>
</ul>

<h1 id="problems">problems</h1>

<p>Define <em>data reliability</em> problems on PM:</p>
<ul>
  <li>media errors
    <ul>
      <li>cell wear out</li>
      <li>bit flip</li>
      <li>…</li>
    </ul>
  </li>
  <li>software scribbles
    <ul>
      <li>bugs in firmware level</li>
      <li>exposed addresses</li>
    </ul>
  </li>
  <li>crash inconsistency</li>
  <li>…</li>
</ul>

<p>ECC is only useful for small-scale media errors.</p>

<h1 id="existing-works">existing works</h1>

<h2 id="system">System</h2>

<blockquote>
  <p>seems like lots of works focus on <strong>transactional persistent memory</strong>, but lib details won't be mentioned below. check papers to know more</p>
</blockquote>

<h3 id="replication-style">Replication Style</h3>

<ul>
  <li><strong>libpmemobj-R</strong>
    <ul>
      <li>replication across different PM devices (pm pools)</li>
      <li><a href="https://pmem.io/blog/2015/11/an-introduction-to-replication/">more details</a></li>
    </ul>
  </li>
  <li><strong><em>Kamino-Tx</em> (EuroSys '17)</strong>
    <ul>
      <li><img src="https://www.grayxu.cn/images/2023/02/17/2023-02-17-17-43-10.png" alt="image.png" /></li>
      <li>Async backup to prevent additional data copy in the critical path of atomic ops.
        <ul>
          <li>only for write-intensive hot data to save some PM space</li>
        </ul>
      </li>
      <li>extend to chain replication (fault tolerance)
        <ul>
          <li><em>backup for crash consistency + replication for fault tolerance</em>: merge them, and only keep backup for the head of the chain</li>
          <li>to ensure the characteristic of chain replication, space cost is $(f+1+1+α)*datasize$ (note: backup and the head are in the same node)
            <ul>
              <li>1 for recovering non-head node</li>
              <li>α for backup</li>
            </ul>
          </li>
        </ul>
      </li>
    </ul>
  </li>
  <li><strong><em>Romulus</em> (SPAA '18)</strong>
    <ul>
      <li>async 2 reps for txn by only 4 fences (just like Kamino-Tx)</li>
    </ul>
  </li>
  <li><strong><em>TENET</em> (FAST '23)</strong>
    <ul>
      <li><em>TimeStone</em> (ASPLOS '20)
        <ul>
          <li>MVCC (<em>logging</em>) to scale performance: timestamp version control, non-blocking reads, etc…
            <ul>
              <li>version chain in DRAM (TLog), a compressed checkpoint version (<em>group commit</em>) in PM (CLog), Obj in PM (maybe stale).</li>
              <li>So that recovery can use small op log <em>(params to replay txn)</em> to replay txn</li>
            </ul>
          </li>
          <li>[<a href="https://wangziqi2013.github.io/paper/2020/08/24/timestone.html">more details from GHC 6023</a>]: "<em>TimeStone is essentially redo-log + DRAM Buffer + group commit + operation logging.</em>"</li>
        </ul>
      </li>
      <li>TENET builds on TimeStone to create protections for spatial safety &amp; temporal safety of memory access</li>
      <li>use local SSD replication:
        <ul>
          <li>sync replications: Clog and op log</li>
          <li>async replications: data obj</li>
        </ul>
      </li>
    </ul>
  </li>
</ul>

<h3 id="coding-style">Coding Style</h3>

<ul>
  <li><strong><em>NOVA-Fortis</em> (SOSP '17) from NVSL</strong>
    <ul>
      <li>"TickTock for NVMM data structures that combines atomic update with error detection and recovery" (just like <em>Kamino-Tx</em> rep style)</li>
      <li>CRC32 checksums to detect errors (including silent errors unlike <em>TENET</em>)</li>
      <li>replicated checksums of data</li>
      <li>RAID-4 style parity to hide parity bits from application's address space</li>
      <li>as NOVA is based on CoW, UPDATE is "<em>allocates new pages, populates them with the written data, computes the checksums and parity, and finally commits the write with an atomic log appending operation</em>"</li>
      <li><img src="https://www.grayxu.cn/images/2022/09/30/2022-09-30-16-45-58.png" alt="image.png" /></li>
      <li>eval on PMEP: 1) checking and maintaining checksums and parity for file data incurs a steep cost for both reads and writes. 2) …</li>
      <li><a href="https://nan01ab.github.io/2018/08/NOVA-Fortis.html">more details</a></li>
      <li><a href="https://github.com/NVSL/linux-nova">source codes</a></li>
    </ul>
  </li>
  <li><strong><em>Pangolin</em> (ATC '19) from NVSL</strong>
    <ul>
      <li>replicated metadata</li>
      <li>1% XOR parities for 99% objects (with checksums)</li>
      <li>in-place delta update data with <strong>replicated redo logging</strong> in PM
        <ul>
          <li><em>So Cocytus (FAST '16)…</em>
            <blockquote>
              <p>why replicated redo? the only additional protection from replicated redo that I can think of is if data and parity are both crash-inconsistent and errors are found on a redo log entry.</p>
            </blockquote>
          </li>
        </ul>
      </li>
      <li>Adler32 for incremental checksums</li>
      <li>build a lib like libpmemobj on Optane PM</li>
      <li>Concurrent updating of data is not supported, but concurrent updating of parity is supported (data in the same stripe but not the same object)
        <ul>
          <li>atomic XOR is a simple solution but cannot be vectorized, on the other hand, vectorized XOR needs range locks =&gt; hybrid approach on an 8KB threshold
            <blockquote>
              <p>but page size is 4KB?</p>
            </blockquote>
          </li>
        </ul>
      </li>
      <li><img src="https://www.grayxu.cn/images/2022/09/30/2022-09-30-16-46-51.png" alt="image.png" /></li>
      <li><a href="https://nbjl.nankai.edu.cn/2020/0306/c12124a266826/page.htm">more details</a></li>
    </ul>
  </li>
  <li><strong><em>Vilamb</em> (arXiv '20) from Rajat Kateja, Andy Pavlo.</strong> (also named <em>ANON</em> I guess)
    <ul>
      <li><em>Pangolin</em> sync-updates parities -&gt; expensive -&gt; how to loosen the guarantee?</li>
      <li>two background threads for async: one for checking parities and one for updating. pros:
        <ol>
          <li>checksums are in page granularity -&gt; read amplification. async process can merge several ops to save BW.</li>
          <li>utilize wasted "dirty" bits in the page table
            <blockquote>
              <p>finding the gap from old redundant design is cool, it reminds me of DaxVM@MICRO'22</p>
            </blockquote>
          </li>
        </ol>
      </li>
      <li>rich experiments but on emulated NVM</li>
      <li>some metadata is still volatile -&gt; needs batteries</li>
      <li><a href="https://wangziqi2013.github.io/paper/2020/01/15/vilamb.html">more details</a></li>
    </ul>
  </li>
  <li><strong><em>Pavise</em>@PACT'22</strong>
    <ul>
      <li><em>Pangolin</em>'s following work</li>
      <li>one redo log</li>
      <li>a lib with less intrusive changes to the application
        <ul>
          <li>PMDK access tracking</li>
        </ul>
      </li>
      <li><a href="https://github.com/hjjq/pavise-pact22-artifact">source codes</a></li>
      <li>…</li>
    </ul>
  </li>
</ul>

<h2 id="architecture">Architecture</h2>

<p>Arch papers on PM fault tolerance are usually about hacking ECC modules…</p>

<ul>
  <li><strong><em>TVARAK</em> (ISCA '20) from Rajat Kateja</strong>
    <ul>
      <li>calculating parities like <em>Pangolin</em> is too slow (may lead to 50% drops)</li>
      <li>add a new HW controller beside LLC to offload computation (<em>maintain parities</em>)</li>
      <li>simulation on zsim</li>
    </ul>
  </li>
  <li><strong><em>Polymorphic Compressed Replication</em> (SYSTOR '20)</strong>
    <ul>
      <li>for columnar storage models on hybrid memory</li>
      <li>use compression to reduce writes to NVM as replications</li>
    </ul>
  </li>
  <li><strong><em>ECP</em> (ISCA '10)</strong>
    <ul>
      <li>Error-Correcting Pointers (ECP) to remap locations instead of ECC, for the ECC blocks wearing out problem</li>
      <li>and so many works on this approach, like zombie memory, chipkill, etc. <a href="https://my.eng.utah.edu/~cs7810/#:~:text=Mo%2027th%20Jan%3A%20Memory%20systems%3A%20reliability%2C%20PCM">more</a></li>
    </ul>
  </li>
  <li><strong><em>WoLFRaM</em> (ICCD '20)</strong>
    <ul>
      <li>wear-leveling + fault tolerance with programming address decoder (PRAD)</li>
    </ul>
  </li>
</ul>

<h1 id="design-space">design space</h1>

<ul>
  <li>LB + fault tolerance</li>
  <li>fault domains level
    <ul>
      <li>6~8 DIMMS but with 1% parity?
        <blockquote>
          <p>the difference of error granularity</p>
        </blockquote>
      </li>
    </ul>
  </li>
  <li>real error patterns of persistent memory</li>
  <li>not very erasure-coding style?</li>
  <li>not very optane style?</li>
  <li>only txn make sense?
    <ul>
      <li>workloads related</li>
    </ul>
  </li>
  <li>…</li>
</ul>]]></content><author><name>Gray</name></author><category term="PM" /><summary type="html"><![CDATA[Fault Tolerance of Persistent Memory]]></summary></entry><entry><title type="html">QoS on Persistent Memory Systems</title><link href="https://www.grayxu.cn/2022/03/12/Qos-PM/" rel="alternate" type="text/html" title="QoS on Persistent Memory Systems" /><published>2022-03-12T00:00:00+08:00</published><updated>2022-03-12T00:00:00+08:00</updated><id>https://www.grayxu.cn/2022/03/12/Qos-PM</id><content type="html" xml:base="https://www.grayxu.cn/2022/03/12/Qos-PM/"><![CDATA[<p>QoS (LB) on persistent memory systems to avoid interference.</p>

<h1 id="problem">Problem</h1>

<p>QoS is about controlling priority among different applications, e.g., latency-critical tasks against throughput tasks. Normally the resource those tasks fight over is bandwidth, which is a simple metric and easy to monitor, so that best-effort tasks won't affect latency-critical ones. Some QoS works focused on DRAM[7].</p>

<p>Similarly, a hybrid access pattern on persistent memory incurs a dramatic performance drop. But it's trickier: some other variables will also affect the overall performance.</p>

<h2 id="interference">interference</h2>

<p>[1] found that simple cache eviction strategies (like FIFO) without too much data migration can beat complex ones.<br />
[2] found:</p>
<blockquote>
  <ol>
    <li>The interference between a process accessing DRAM and one performing random <strong>reads to PM</strong> is small.</li>
    <li>When a process accessing DRAM is concurrently executed with one performing frequent <strong>writes to PM</strong>, the performance of the former is significantly degraded but that of the latter is not.</li>
  </ol>
</blockquote>

<p>multi-fold interference source:</p>
<ol>
  <li>iMC WPQ size is designed for fast DRAM access, so too many slow writes to PM will easily fill it up and block DRAM writes</li>
  <li>DDR bus</li>
  <li>PM write amplification</li>
  <li>…</li>
</ol>

<p>A couple of recent works focus on this QoS problem, including NVMSA '20, APSys '21, FAST '22 (2) [3-6]. QoS is all about monitoring and control, so let's discuss them separately.</p>

<h1 id="monitor">Monitor</h1>

<p>QoS systems should first know when interference shows up and who causes it.</p>

<p>FairHym[3] sets up a couple of thresholds:<br />
<img src="https://www.grayxu.cn/images/2022/03/16/2022-03-16-11-02-31.png" alt="image.png" /><br />
And the exact values come from experiments, so they are coupled to the workload and HW settings.</p>

<hr />

<p>Dicio[4] only considers the situation of one best-effort task and one latency-critical task.</p>

<p>Similarly, Dicio has some rules drawn from "a priori knowledge".</p>

<p><img src="https://www.grayxu.cn/images/2022/03/16/2022-03-16-11-11-20.png" alt="image.png" /></p>

<p>$T_{DRAM}$ here is dynamic (5-30 GB/s) and depends on the access pattern on PM.<br />
estimated media-level write BW (μs level) = request-level BW * recent WA ratio(ms level)</p>

<blockquote>
  <p>note: different cases need different strengths on control?</p>
</blockquote>

<hr />

<p><em>MT^2</em>[5] is in kernel space, using Intel Memory Bandwidth Monitoring (MBM) and some toolkits to collect data (<em>Dicio[4] claims that MBM has some severe bugs for now</em>). So they can get the bandwidth of DRAM and PM (<em>a lot of effort here to implement, check paper details</em>).</p>

<p>Read latency is derived from $RPQ_O/RPQ_I$ (read-pending-queue occupancy over inserts), and write latency from periodic writes. Latency is then used to detect interference. <strong>The latency threshold differs depending on the access type</strong> (random/seq + read/write).</p>

<blockquote>
  <p>note: the correlation between latency and bandwidth is basically linear. So the detection here is equal to BW?</p>
</blockquote>

<hr />

<p>in NyxCache, "<em>if the maximum IOPS of pattern A is MaxIOPSA, then the cost of each operation of pattern A is 1/MaxIOPSA.</em>"</p>

<blockquote>
  <p>note: the implicit assumption here is that the cost is linear and ignoring cache effects. emmmmmmmm</p>
</blockquote>

<p>In contrast to the above, NyxCache[6] finds the victim application whose throttling brings the biggest performance gain for the same amount of suppression.<br />
Like the fig below: we want to choose one app between B and C to throttle so as to ensure A's perf.<br />
<img src="https://www.grayxu.cn/images/2022/03/16/2022-03-16-18-37-22.png" alt="image.png" /></p>

<blockquote>
  <p>note: you will get it after you find the author is Kan Wu, who is the author of <a href="https://www.grayxu.cn/2021/03/17/OrthusKV/">The Storage Hierarchy is Not a Hierarchy</a>.</p>
</blockquote>

<h1 id="control">Control</h1>

<p>After finding out which process should be throttled, QoS systems need to control it efficiently.</p>

<p>FairHym[3] assumes VM-style deployments where every core is used exclusively, so throttling the frequency of the target cores can reduce the BW on PM.</p>

<blockquote>
  <p>note: a weak assumption, and it wastes computing resources</p>
</blockquote>

<hr />

<p>Dicio[4] tests some methods including <strong><a href="https://www.intel.com/content/www/us/en/developer/articles/technical/introduction-to-memory-bandwidth-allocation.html">MBA</a></strong> (Intel Memory Bandwidth Allocation, basically it's delay injection in memory requests) and limit frequency:</p>

<p><img src="https://www.grayxu.cn/images/2022/03/14/2022-03-14-16-15-21.png" alt="image.png" /></p>

<p><em>*_stride means writing 64B at each 256B-aligned addr to amplify writes.</em> They claim that the old method can't handle <em>PM_write_stride</em>.</p>

<blockquote>
  <p>note: maybe just not inject enough delays? like 1%.</p>
</blockquote>

<p>Dicio controls the number of cores assigned to best-effort tasks to manage their BW, and can even go below a single core via duty cycling.</p>

<hr />

<p><em>MT^2</em>[5] tries to combine MBA with CPU resource limits. MBA only controls the ratio of delay injection, so the same throttling value may behave differently under different memory access patterns, while throttling CPU resources can reduce BW almost linearly. What's worse, MBA doesn't work on PM.
<img src="https://www.grayxu.cn/images/2022/03/14/2022-03-14-20-31-35.png" alt="image.png" /></p>

<p>Instead of changing the frequency or the number of cores, MT^2 changes the CPU quota of a thread via Linux cgroup control, which is <strong>finer</strong>-grained.</p>

<p><img src="https://www.grayxu.cn/images/2022/03/15/2022-03-15-17-42-09.png" alt="image.png" /></p>

<p>Table 2: pagerank under 10% MBA is faster than 50% CPU with lower BW.<br />
The reason is simple: 50% CPU slows every instruction instead of only memory access.</p>

<blockquote>
  <p>note: but sorting and coding are compute-intensive…</p>
</blockquote>

<p><img src="https://www.grayxu.cn/images/2022/03/15/2022-03-15-17-41-45.png" alt="image.png" /></p>

<p>so they use MBA to throttle DRAM access and CPU scheduling for NVM.</p>

<blockquote>
  <p>same question here, maybe just because MBA is designed for fast DRAM access, and the ratio of injected delays is not big enough (throttling value &lt; 10 in fig.4)</p>
</blockquote>

<hr />

<p><em>NyxCache</em>[6]</p>
<blockquote>
  <p>quote "To mimic the behavior of Intel MBA, Nyx implements simple throttling by delaying PM accesses at user-level."<br />
"Our current implementation adds delays in units of 10ns with <strong>a simple computation-based busy loop</strong>. In some cases PM operations may need to be delayed indefinitely (e.g., when a resource limit is reached); in this case, PM operations are stalled until the Nyx controller sets the delay to a finite value"</p>
</blockquote>

<p>Applications access PM through the NyxCache interface, so NyxCache can implement a user-level MBA-like mechanism with a fixed ratio. And it <strong>worked</strong>, so <em>delay injection</em> itself is not the problem.</p>
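
<p>The mechanism itself is tiny. A sketch of this style of user-level throttling (10ns delay units per the quote above; the controller logic that sets the knob is omitted):</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;atomic&gt;
#include &lt;chrono&gt;

std::atomic&lt;long&gt; g_delay_units{0};   // set by the controller thread

inline void throttle_point() {        // call in the PM read/write path,
    long units = g_delay_units.load(std::memory_order_relaxed);
    if (units &lt;= 0) return;           // before issuing the actual PM access
    auto until = std::chrono::steady_clock::now() +
                 std::chrono::nanoseconds(10 * units);
    while (std::chrono::steady_clock::now() &lt; until) {
        // busy loop: burns CPU instead of sleeping, since sub-microsecond
        // sleeps are unreliable and a context switch would cost far more
    }
}
</code></pre></div></div>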

<h1 id="experiments">experiments</h1>

<p>…</p>

<h1 id="random-thoughts">random thoughts</h1>

<ul>
  <li>QoS from the application level instead of the system level, to bypass some limits of the bottom-up view.</li>
  <li>Will injected delays waste CPU resources? Context-switching cost and CPU resources may be a trade-off here…</li>
  <li>Many networking QoS papers utilize or even create more "sensors" than all above. Can we mimic them without HW support?</li>
  <li>…</li>
</ul>

<h1 id="ref">ref</h1>

<ol>
  <li>Kassa, Hiwot Tadese, et al. "Improving Performance of Flash Based Key-Value Stores Using Storage Class Memory as a Volatile Memory Extension." 2021 USENIX Annual Technical Conference (USENIX ATC 21). 2021.</li>
  <li>Imamura, Satoshi, and Eiji Yoshida. "The analysis of inter-process interference on a hybrid memory system." Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops. 2020.</li>
  <li>Imamura, Satoshi, and Eiji Yoshida. "FairHym: Improving Inter-Process Fairness on Hybrid Memory Systems." 2020 9th Non-Volatile Memory Systems and Applications Symposium (NVMSA). IEEE, 2020.</li>
  <li>Oh, Jinyoung, and Youngjin Kwon. "Persistent memory aware performance isolation with dicio." Proceedings of the 12th ACM SIGOPS Asia-Pacific Workshop on Systems. 2021.</li>
  <li>Yi, Jifei, et al. "MT2: Memory Bandwidth Regulation on Hybrid NVM/DRAM Platforms." 20th USENIX Conference on File and Storage Technologies (FAST 22), Santa Clara, CA. 2022.</li>
  <li>Wu, Kan, et al. "NyxCache: Flexible and Efficient Multi-tenant Persistent Memory Caching." 20th USENIX Conference on File and Storage Technologies (FAST 22), Santa Clara, CA. 2022.</li>
  <li>Fried, Joshua, et al. "Caladan: Mitigating interference at microsecond timescales." 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 2020.</li>
</ol>]]></content><author><name>Gray</name></author><category term="PM" /><summary type="html"><![CDATA[PM]]></summary></entry><entry><title type="html">RDMA+NVM remote persistence</title><link href="https://www.grayxu.cn/2021/10/19/remote-persistence/" rel="alternate" type="text/html" title="RDMA+NVM remote persistence" /><published>2021-10-19T00:00:00+08:00</published><updated>2021-10-19T00:00:00+08:00</updated><id>https://www.grayxu.cn/2021/10/19/remote-persistence</id><content type="html" xml:base="https://www.grayxu.cn/2021/10/19/remote-persistence/"><![CDATA[<h1 id="problem">Problem</h1>
<p>Due to the RDMA NIC implementation, RNICs don't have a remote persistent flush primitive. One-sided writes from clients first land in the volatile cache on the RNIC, and the RNIC sends the ACK back before the data is written to PM. As a result, a power loss can easily break remote data persistence.</p>

<p>Besides, <em>one-sided commit</em>[3] is immature or suffers poor performance.</p>

<p>Some researchers place this problem on the network systems level instead of the storage system level, and so ignore it. But for now, this problem does affect system availability.</p>

<h1 id="old-methods">Old methods</h1>

<p>Two-sided RPC communication can avoid this problem, but two-sided ops can't fully exploit the RNIC's performance and also lack scalability[1].</p>

<p>For one-sided ops, a strawman implementation is sending a write request followed by a read request. But the cost of 2 RTTs is still too high.</p>

<p><img src="https://www.grayxu.cn/images/2021/10/15/2021-10-15-17-43-51.png" alt="image.png" /></p>

<h1 id="new-methods">New methods</h1>

<p>[1] uses READ after WRITE, but with <strong>outstanding request</strong>[2] + <strong>doorbell batching</strong>[8] to process persistent WRITE requests, which reduces latency from 4μs (2 RTTs) to 3μs.</p>

<ul>
  <li>outstanding request: a WR that was posted to a work queue whose completion has not been polled (like an unfinished request?)</li>
  <li>doorbell batching (just batching on RDMA)</li>
</ul>

<blockquote>
  <p>quote "<em>Specifically, outstanding request [23] allows us using the completion of READ as the completion of the WRITE, <strong>as long as the two requests are sent to the same QP</strong>. Since the READ to persist the WRITE must be post to the same QP as the WRITE (§2.3), we no longer need to wait for the first WRITE to complete. Thus, this optimization reduces the wait time of the first network roundtrip. Applying outstanding request to persistent WRITE is correct because first, later READ flushes previously WRITE [19], and RNIC processes requests from the same QP in a FIFO order [6].<br />
Based on outstanding request, doorbell batching [24] further allows us to send the READ and WRITE in one request using the more CPU and bandwidth efficient DMA, reducing the latency of posting RDMA requests. <br />
On our testbed, a single one-sided RDMA request takes 2µs. Thus, a strawman implementation of remote persistent write uses 4µs. After applying H9, one-sided remote persistent write takes 3µs latency to finish</em>"</p>
</blockquote>
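
<p>A sketch of the optimized path in plain libibverbs (QP/MR setup omitted; <code class="language-plaintext highlighter-rouge">flush_sge</code> points at a small local buffer for the READ to land in): the two WRs are chained into one <code class="language-plaintext highlighter-rouge">ibv_post_send</code> (doorbell batching), and only the READ is signaled (outstanding request):</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;infiniband/verbs.h&gt;
#include &lt;cstring&gt;

int persistent_write(ibv_qp *qp, ibv_sge *data_sge,
                     uint64_t remote_addr, uint32_t rkey, ibv_sge *flush_sge) {
    ibv_send_wr wr[2], *bad = nullptr;
    memset(wr, 0, sizeof wr);

    wr[0].opcode = IBV_WR_RDMA_WRITE;          // unsignaled WRITE
    wr[0].sg_list = data_sge;  wr[0].num_sge = 1;
    wr[0].wr.rdma.remote_addr = remote_addr;
    wr[0].wr.rdma.rkey = rkey;
    wr[0].next = &amp;wr[1];                       // chained: one doorbell for both

    wr[1].opcode = IBV_WR_RDMA_READ;           // READ flushes the prior WRITE
    wr[1].sg_list = flush_sge; wr[1].num_sge = 1;
    wr[1].wr.rdma.remote_addr = remote_addr;   // read back (a byte of) the data
    wr[1].wr.rdma.rkey = rkey;
    wr[1].send_flags = IBV_SEND_SIGNALED;      // poll only this completion

    if (ibv_post_send(qp, wr, &amp;bad)) return -1;

    ibv_wc wc;                                 // FIFO order on the same QP means
    while (ibv_poll_cq(qp-&gt;send_cq, 1, &amp;wc) == 0) {}  // this implies WRITE done
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
</code></pre></div></div>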

<p>[4] claims that for small persistent writes to remote NVMM, RPCs have comparable latency as one-sided RDMA.</p>

<blockquote>
  <p>note: the tricky part is that the first author of [4] is also the first author of <em>outstanding request</em>[2]… Maybe the reason is the different devices (CX3 and CX4+CX5)?</p>
</blockquote>

<p>[5][10][11][12] use the RDMA <em>WRITE_WITH_IMM</em> verb to achieve remote persistence. So that servers will get the completion status and make data durable immediately.</p>

<blockquote>
  <p>note: WRITE_WITH_IMM can ensure atomicity since the data need to be confirmed by the extra involved server. On the other side, this imm is only 32-bit, which can't directly address the complete space.</p>
</blockquote>

<p>Like traditional DB systems, there are some optimistic methods, like using redundancy checks. [6] does CRC when reading to check data consistency.<br />
To go a step further, [7] argues that CRC is expensive, so they use a background thread to conduct integrity verification. (but this work is based on simulation…)</p>

<p>check their brief intro in [9].</p>

<p>[9] built and tested some emulated hardware-supported RDMA primitives to support an RDMA remote flush primitive (emulated via RPC).</p>

<p>Popular RDMA RPC communication methods:<br />
<img src="https://www.grayxu.cn/images/2021/10/19/2021-10-19-19-36-36.png" alt="image.png" /><br />
Theirs:
<img src="https://www.grayxu.cn/images/2021/10/20/2021-10-20-10-38-23.png" alt="image.png" /></p>
<blockquote>
  <p>their work still relies on the existing RDMA primitives and the receiver's CPU to emulate RDMA RFlush primitives instead of programmable NIC.</p>
</blockquote>

<h1>?</h1>

<p>Mellanox, gkd (hurry up).</p>

<h1 id="refer">refer</h1>

<ol>
  <li><strong>Wei, Xingda, et al. "Characterizing and Optimizing Remote Persistent Memory with RDMA and NVM." Proceedings of the 2021 USENIX Annual Technical Conference (USENIX ATC 21). 2021.</strong></li>
  <li>Kalia, Anuj, Michael Kaminsky, and David G. Andersen. "Using RDMA efficiently for key-value services." Proceedings of the 2014 ACM Conference on SIGCOMM. 2014.</li>
  <li>Kim, Daehyeok, et al. "Hyperloop: group-based NIC-offloading to accelerate replicated transactions in multi-tenant storage systems." Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication. 2018.</li>
  <li><strong>Kalia, Anuj, David Andersen, and Michael Kaminsky. "Challenges and solutions for fast remote persistent memory access." Proceedings of the 11th ACM Symposium on Cloud Computing. 2020.</strong></li>
  <li>Lu, Youyou, et al. "Octopus: an rdma-enabled distributed persistent memory file system." 2017 USENIX Annual Technical Conference (USENIX ATC 17). 2017.</li>
  <li>Huang, Haixin, et al. "Forca: fast and atomic remote direct access to persistent memory." 2018 IEEE 36th International Conference on Computer Design (ICCD). IEEE, 2018.</li>
  <li>Du, Jingwen, et al. "Fast and Consistent Remote Direct Access to Non-volatile Memory." 50th International Conference on Parallel Processing. 2021.</li>
  <li>Kalia, Anuj, Michael Kaminsky, and David G. Andersen. "Design guidelines for high performance RDMA systems." 2016 USENIX Annual Technical Conference (USENIX ATC 16). 2016.</li>
  <li><strong>Duan, Zhuohui, et al. "Hardware-Supported Remote Persistence for Distributed Persistent Memory." SC 2021.</strong></li>
  <li>Shu, Jiwu, et al. "Th-dpms: Design and implementation of an rdma-enabled distributed persistent memory storage system." ACM Transactions on Storage (TOS) 16.4 (2020): 1-31.</li>
  <li>Liu, Xinxin, Yu Hua, and Rong Bai. "Consistent RDMA-Friendly Hashing on Remote Persistent Memory." ICCD 21.</li>
  <li>Yang, Jian, Joseph Izraelevitz, and Steven Swanson. "Orion: A distributed file system for non-volatile main memory and RDMA-capable networks." 17th USENIX Conference on File and Storage Technologies (FAST 19). 2019.</li>
</ol>]]></content><author><name>Gray</name></author><category term="PM" /><summary type="html"><![CDATA[RDMA, PM]]></summary></entry><entry><title type="html">(SC &apos;21) LogECMem: Coupling Erasure-Coded In-memory Key-Value Stores with Parity Logging</title><link href="https://www.grayxu.cn/2021/10/11/LogECMem/" rel="alternate" type="text/html" title="(SC &apos;21) LogECMem: Coupling Erasure-Coded In-memory Key-Value Stores with Parity Logging" /><published>2021-10-11T00:00:00+08:00</published><updated>2021-10-11T00:00:00+08:00</updated><id>https://www.grayxu.cn/2021/10/11/LogECMem</id><content type="html" xml:base="https://www.grayxu.cn/2021/10/11/LogECMem/"><![CDATA[<p>LogECMem uses a hybrid method of in-place update and Parity logging (PL) for parity updates.</p>

<h1 id="motivation">motivation</h1>

<p>old update policies:</p>
<ol>
  <li>direct reconstruction: read all non-updated data, and compute the new parity with old parities (huge data transfer costs)</li>
  <li>in-place update: read the old parity, and compute the parity delta from the data delta (too many reads of old parity; see the sketch below)</li>
  <li>full-stripe update: out-of-place update, and GC stale data chunks (no parity reads, but it brings high space cost)</li>
  <li>PL: logging the parity deltas (but PL is designed for disk-based systems)</li>
</ol>

<p><img src="https://www.grayxu.cn/images/2021/11/24/2021-11-24-17-26-01.png" alt="1.png" />
<img src="https://www.grayxu.cn/images/2021/11/24/2021-11-24-17-26-27.png" alt="2.png" />
<img src="https://www.grayxu.cn/images/2021/11/24/2021-11-24-17-28-18.png" alt="3.png" /></p>

<p>They claimed that:</p>
<ul>
  <li><em>for wide-stripe EC, GC in full-stripe update will consume a lot of network bandwidth.</em></li>
  <li><em>full-stripe update will take more memory space due to invalid blocks</em></li>
  <li>single-failure is the most critical (in an MTTDL model)</li>
</ul>

<p>So they built <a href="https://github.com/YuchongHu/logecmem"><em>LogECMem</em></a>, using in-place update for the XOR parity in DRAM and PL for the other parities.</p>

<h1 id="methods">methods</h1>

<h2 id="design">design</h2>

<p>Like <em>buffer logging</em> of RAMCloud[2], they use buffer logging for other parity chunks to accelerate writes:  <br />
<img src="https://www.grayxu.cn/images/2021/10/11/2021-10-11-19-26-38.png" alt="image.png" />
(the buffer here is DRAM)</p>

<blockquote>
  <p>note: PM is a good log device, and not limited by capacity.<br />
btw, the persistence in <em>buffer logging</em> of RAMCloud[2] is ensured by battery-based DRAM. So PM is appropriate and may provide fast recovery (<em>since the gap between DRAM and disk is big</em>)? But it's weird to talk about logging persistence on a storage system keeping all data in DRAM….<br />
related: <a href="https://www.grayxu.cn/2020/09/16/FlatStore/"><em>flatstore</em></a></p>
</blockquote>

<p>With this XOR parity in DRAM, systems can perform degraded read in DRAM nodes.</p>

<h2 id="op">Op</h2>

<p>update:
<img src="https://www.grayxu.cn/images/2021/10/11/2021-10-11-21-50-22.png" alt="image.png" /></p>

<p>…<em>check paper for details</em>…</p>

<p>merge-based buffer logging: a log merging trick</p>

<h2 id="multiple-chunk-failures-repair">multiple chunk failures repair</h2>

<p><em>PLR</em> (FAST '14) trades write performance for repair performance, and simple merging on PLR can only merge incoming parity deltas.<br />
So they use a lazy merging strategy (<em>parity logging with merging, PLM</em>) that first writes parity deltas to extra contiguous disk space and reads them back for merging later.</p>

<p><img src="https://www.grayxu.cn/images/2021/10/12/2021-10-12-09-35-27.png" alt="image.png" /></p>

<blockquote>
  <p>note: kind of 2-level? won't the first, non-ordered level hurt perf?<br />
also, any perf bottleneck in logging and 2nd-level log replacement (<em>GC-like</em>)? It seems like there are only <strong>overall</strong> perf tests in the experiment part.</p>
</blockquote>

<h1 id="expriments">expriments</h1>

<p>……</p>

<h1 id="ref">ref</h1>
<ol>
  <li>Cheng, et al. LogECMem: Coupling Erasure-Coded In-memory Key-Value Stores with Parity Logging, SC '21.</li>
  <li>Ousterhout, John, et al. "The RAMCloud storage system." ACM Transactions on Computer Systems (TOCS) 33.3 (2015): 1-55.</li>
</ol>]]></content><author><name>Gray</name></author><category term="EC" /><summary type="html"><![CDATA[KVS, EC]]></summary></entry></feed>