paper notes about XStore[1] from SJTU, which uses learned cache (aka learned index[2]) for distributed DRAM KVS.

backgroud

7 RTT for 100M KVS with tree index -> slow
big index tables do bad to caching
- high miss rate
- cost clients' mem

learned index: a machine learning approach to approximate the cumulative distribution function (CDF) of the keys in the tree index, which save searching cost in tree index and parameters storage cost.

XStore

process:

update, insert…: server centric -> two-side RPC
get: client centric -> O(1) with learned cache

learned cache:

like learned index
index traversal -> cal
smaller index model than index tree, which is good for client caching

learned index works for a sorted & static array. (range query)

GET steps:

key => model => a guaranteed space (value addr)
clients fetch all keys in this space, and get the true key.
another READ for the corresponding value

benefits:

1 RTT for lookup
small memory cost

BUT index trees are not static!

server side

server side maintain a hybrid structure: B+Tree + Model

virtual leaf nodes for insert
key&value are stored seperately but continuously. Read value by offset from its key
logical addr in leaf nodes are not sorted (just sorted logically)
a 2 level XModel (NN+LR) predicts a position sorted range from a key.
insertion make array unsorted: a translation table(TT) maps logical addr to the phsical, which would be cached partially in clients
model use RMI[2] to reduce re-train costs.

model

top level assign keys to sub-models
sub-models predict positions with min& max error. A sub-model owns a TT and a logical space.
due to small sub-model size and simplicity(LR), training is fast
- 8μs for retraining
- 8s for training the entrie cache

Index Tree

XTree use an HTM-based concurrent B+tree[3][4] to ensure atomicity with high perf

I konw little about HTM, so u better go to read 5.3 in the paper…

UPDATE:

traverse XTree, and update value inplace. so it will not change index.
optimaztion: update msgs carry position predicted by client's learned cache to reduce server's load

INSERT & DELETE:

normal B+tree moves kv pairs to keep them sorted.
XTree doesn't keep it within a leaf node. Since READ would read a whole leaf node, it won't effect READ.
set flags and rewrite data for DELETE to avoid changing index
when node split, increment incarnation (kind of some flags to indicate inconsistency), and retrain model

Based ops above, a stale TT can still maps correctly as long as accesses are not overlapped with split nodes.

Optimization trick:

speculative execution. Clients can try to read data from stale TT entry, since half of old data still reside in this node. If miss, then read data from right-link. Good for insert-dominated workloads.
Model expansion from XIndex[5] to handle growing kvs size.

scale out

refer to [6]. Also need to use boundary key augmentation.
Clients keep one learned cache for each server. Use a particularly partitioning func to choose server.

When a SCAN across many servers:

collecting nodes across different servers, serially?
one RDMA_READ for each server

read can also use that partitioning func? i don't get it

client side

clients need to cache 2 components

model
the table (most of space)

Operation:

GET
- use XModel to predict a position range (mostly a single leaf node with $N$ slots)
- get real addr by TT
- read keys by one RDMA_READ with RDMA doorbell batching
- scan and identify keys is valid or existent, then read value by another RDMA_READ.
- if keys or TT entries are invalid, start to update local model and TT
Scan
- look up related leaf nodes to determine the remote address
- read related leaf nodes with both keys and values by RDMA_READ

learned index use monotonic models to identify existence.
So non-existent keys would cause some problems:
　　model can't predict untrained area. e.g LR0 can't handle LR0(10)

solution: "data augmentation":
　　train models with a boundary key, so model can predict correct boundary. e.g. LR1(19)=>(18,20)

not the typical data augmentation….

eval

model size: 500K LR(14B*500K)
senstive to workload's pattern (complex↑ -> ↓)
compared with tree-index caching, cache size 600MB->150MB
if use 1% cache size -> only 20% perf loss

benefits from no server CPU bottleneck and reduced network cost. Bigger advantage when uniform workloads (hot data is easy to cache, and the real workload..

limitation

fixed key-lens
- use hash to encode variable-length keys
  
  just like fatcache
model selection: LR is not so good, though good for retraining
- complex data distribution
  
  just trade off between acc&cost

refer

Wei, Xingda, Rong Chen, and Haibo Chen. "Fast RDMA-based Ordered Key-Value Store using Remote Learned Cache." 14th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 20). 2020.
Kraska, Tim, et al. "The case for learned index structures." Proceedings of the 2018 International Conference on Management of Data. 2018.
Wang, Zhaoguo, et al. "Using restricted transactional memory to build a scalable in-memory database." Proceedings of the Ninth European Conference on Computer Systems. 2014.
硬件事务内存（Hardware Transactional Memory, HTM）

硬件事务内存系统一般是在每个处理器核心中增加事务cache，同时在CPU指令集中增加事务相关的指令。在执行事务的过程中，事务访问过的数据会缓存在事务cache中，修改过的cache一致性协议会检测各事务之间的冲突。如果产生冲突，对应的事务cache中修改过的数据全部作废，该事务重新执行；如果没有冲突产生，就在事务结束时把事务cache中的数据更新到内存中。
硬件事务内存系统完全由硬件电路实现，所以其性能相较于软件事务内存系统有很大的优势，但是由于硬件资源的限制，所以硬件事务内存无法处理大型事务（在处理大事务的过程中会产生大量的共享数据修改操作，这些修改记录的大小超过处理器内事务cache的大小）。为了保证事务的完全执行，现有的商用硬件事务内存系统会加入串行部分，也就是“锁”，对于一部分硬件事务内存实在无法完成的事务就用串行部分完成，这虽然损失了部分性能，但至少保证了功能的完整性。
在具体实现时有2个要点： 1. 使用处理器内部的事务cache来存放事务执行过程中更新的数据，这一类的经典代表：TCC 2. 使用cache一致性协议来检测事务之间共享数据访问的冲突，这一类的经典代表：LogTM.
from https://zhuanlan.zhihu.com/p/151425608
Tang, Chuzhe, et al. "XIndex: a scalable learned index for multicore data storage." Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 2020.
Ziegler, Tobias, et al. "Designing distributed tree-based index structures for fast RDMA-capable networks." Proceedings of the 2019 International Conference on Management of Data. 2019.

Gray's grind

(OSDI '20) Fast RDMA-based Ordered Key-Value Store using Remote Learned Cache