2021/03/14

# Intro

As storage devices have developed, many new technologies have reached the market, including NAND SSDs, Optane 3D XPoint, and STT-MRAM. Personal machines already have multi-level storage: SRAM in the CPU's L1/L2 caches, DRAM in the DIMM slots, an SSD to speed up the operating system, and an HDD to store large files such as media and movies.

Commercial servers differ somewhat: a bigger L3 cache on the CPU, more DIMMs to hold more DRAM, faster SSDs such as Optane SSDs, SAS HDDs, and so on. In cutting-edge research, some new devices are being brought into prototype system designs. Optane DIMM is the first commercial persistent memory and can be plugged directly into DIMM sockets. STT-MRAM can provide near-DRAM performance together with persistence.

However, much like CAP in distributed systems, storage devices trade off three properties: price, capacity, and speed.

SRAM is extremely fast and CMOS-compatible, but its density is very low. HDDs, on the other hand, reach 18 TB and even more with SMR, but the physical movement needed to locate data limits random I/O speed.

At the same time, accesses to large data sets are not evenly distributed, so hybrid storage can be a good way to combine the advantages of different devices. Hybrid storage is a classic technique used by many devices and systems; e.g., HDDs, RAID controllers, SSDs, and Optane DIMMs all have their own caches. A hybrid storage system needs to identify cold and hot data and migrate them to the appropriate storage tiers.

Note: all sentences in markdown quote format are comments.

# Identification

Identification means figuring out which data is cold or hot, which is similar to the job of cache replacement algorithms. Examples include LRU (and Clock), LFU, and adaptive variants of them such as EELRU, ARC, 2Q, and MQ, as well as LIRS and OSCA[4], which are based on reuse distance.
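
To make the link between cache replacement and hot/cold identification concrete, here is a minimal LRU sketch in Python. It is only an illustration: the `LRUTracker` class and its `capacity` parameter are hypothetical names, and the idea that an evicted key counts as "cold" is an assumption made for this example, not something taken from the cited papers.

```python
from collections import OrderedDict


class LRUTracker:
    """Tracks the most recently used keys; anything pushed out is treated as cold."""

    def __init__(self, capacity):
        self.capacity = capacity          # number of keys considered "hot"
        self.entries = OrderedDict()      # ordered from least to most recently used

    def access(self, key):
        """Record an access; return the key demoted as cold, or None."""
        if key in self.entries:
            # A hit refreshes recency: move the key to the most-recent end.
            self.entries.move_to_end(key)
            return None
        self.entries[key] = True
        if len(self.entries) > self.capacity:
            # Over capacity: the least recently used key becomes cold.
            cold_key, _ = self.entries.popitem(last=False)
            return cold_key
        return None
```

With `capacity=2`, accessing `a`, `b`, `a`, `c` in that order demotes `b`, because `a` was refreshed before `c` arrived.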

For NVM-based caches in particular, H-ARC[6] improves ARC by taking dirty and clean state into account.

Among recent works, NHC[7] focuses on adaptively bypassing requests to the lower-level capacity device so that the device's performance is fully utilized.

Still, some works[4][5] point out that on very large traces, different cache replacement algorithms perform quite similarly.

> Industry works are more persuasive here.

HashKV[1] treats non-updated data as cold data, moves it to a circular buffer, and runs normal GC as in WiscKey[2], while hot data is kept in adaptive segments. This is a simple idea with a high recall rate, but clearly HashKV can only detect “ice cold” data.

Flashield[3] is the first work to use a machine learning algorithm, SVM, as a binary classifier to predict how many times a single object will be accessed in the future (called flashiness in the paper). The most impactful features are the numbers of past reads and updates (time-related features are not helpful). The model itself is offline but is retrained periodically.

> For requests in a memcache-like format, perhaps TTL would be a good input feature?[5]

ML-WP[12] also uses a machine learning approach, here to detect whether a write request is write-only. The method acts as a learned trade-off between write-back and write-through. The input features include temporal and request-level ones (e.g., size, small-request ratio). Naive Bayes was selected as a trade-off among accuracy, recall, and time cost.

Taking HMCached[8] as an example, a number of works compute hotness from counters that record how many times each object is accessed. HMCached estimates hotness by counting the total number of GET/SET operations. For decay, the counter is divided by $2^{period_{idle}+1}$.
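
The decay rule above can be sketched as follows. This is a minimal illustration, not HMCached's code: the names `decay`, `HotnessCounter`, and `touch` are hypothetical, and the sketch assumes that time is divided into numbered sampling periods, that $period_{idle}$ is the number of whole periods without any access, and that decay is applied lazily on the next access.

```python
def decay(count, idle_periods):
    """Divide a hotness counter by 2^(idle_periods + 1)."""
    return count / 2 ** (idle_periods + 1)


class HotnessCounter:
    """Per-object GET/SET counter with lazy exponential decay (illustrative)."""

    def __init__(self):
        self.count = 0.0        # decayed number of GET/SET operations
        self.last_period = 0    # sampling period of the latest access

    def touch(self, period):
        """Record one GET or SET that happens in the given sampling period."""
        if period > self.last_period:
            # Whole periods that passed without any access to this object.
            idle = period - self.last_period - 1
            self.count = decay(self.count, idle)
            self.last_period = period
        self.count += 1
        return self.count
```

Under these assumptions, an object with 8 accesses in period 0 that is touched again in period 1 ends up at 8 / 2 + 1 = 5, and a further touch in period 4 (two idle periods) gives 5 / 8 + 1 = 1.625.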

A hard threshold clearly cannot fit dynamic workloads. HMCached therefore adjusts the threshold by comparing the migration benefit of the last two migrations, where migration benefit is the gain brought by a migration. If the benefit increased, the direction of the last threshold movement was correct.
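
This feedback loop resembles hill climbing, and a rough sketch of it might look like the code below. Everything here is an assumption for illustration: the `AdaptiveThreshold` class, the fixed `step` size, the `minimum` bound, and the choice to reverse direction whenever the benefit drops are not taken from the HMCached paper, and how the benefit itself is measured is left to the caller.

```python
class AdaptiveThreshold:
    """Moves a hotness threshold in whichever direction last improved the benefit."""

    def __init__(self, threshold, step=1.0, minimum=1.0):
        self.threshold = threshold
        self.step = step            # signed: positive raises the threshold
        self.minimum = minimum      # never let the threshold fall below this
        self.prev_benefit = None    # benefit measured after the previous migration

    def update(self, benefit):
        """Feed the benefit of the latest migration; return the new threshold."""
        if self.prev_benefit is not None and benefit < self.prev_benefit:
            # The last move made things worse, so reverse direction.
            self.step = -self.step
        self.prev_benefit = benefit
        self.threshold = max(self.minimum, self.threshold + self.step)
        return self.threshold
```

Starting from a threshold of 10 with a step of 2, benefits of 100, 120, 90, and 95 move the threshold to 12, 14, 12, and 10: the drop to 90 flips the direction, and the recovery to 95 confirms the new direction.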

Ziggurat[9] redirects an application's write directly to the capacity tier (SSD) if it predicts that the write is asynchronous and large. Its two predictors are rule-based counters, and hotness is calculated from the average modification time.
NVMFS[10] is much the same, but uses only clean and dirty lists.
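
A rule-based predictor of this kind could be sketched as below. The concrete rules are assumptions made for the example (bytes written since the last `fsync` as the synchronicity signal, a streak of large writes as the size signal), and the class name and all three thresholds are hypothetical values, not Ziggurat's actual parameters.

```python
class FileWriteState:
    """Per-file counters that decide where the next write should go (illustrative)."""

    ASYNC_BYTES_THRESHOLD = 1 << 20   # hypothetical: bytes without fsync to count as async
    LARGE_WRITE_BYTES = 1 << 16       # hypothetical: minimum size of a "large" write
    LARGE_STREAK_NEEDED = 4           # hypothetical: consecutive large writes required

    def __init__(self):
        self.bytes_since_fsync = 0
        self.large_streak = 0

    def on_fsync(self):
        """An fsync suggests the application wants durability soon: reset the counter."""
        self.bytes_since_fsync = 0

    def on_write(self, size):
        """Update both counters and return the target tier: 'disk' or 'nvm'."""
        self.bytes_since_fsync += size
        if size >= self.LARGE_WRITE_BYTES:
            self.large_streak += 1
        else:
            self.large_streak = 0
        is_async = self.bytes_since_fsync >= self.ASYNC_BYTES_THRESHOLD
        is_large = self.large_streak >= self.LARGE_STREAK_NEEDED
        # Only writes predicted to be both asynchronous and large bypass NVM.
        return "disk" if is_async and is_large else "nvm"
```

In this sketch a file receiving a stream of 256 KiB writes without `fsync` is redirected to disk from the fourth write on, while a single small write or an `fsync` sends subsequent writes back to NVM.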

Since Ziggurat is a log-structured file system, file read/write requests carry richer information for making predictions.

# Migration

HMCached moves data using a multi-queue algorithm ordered by hotness, which improves parallelism.
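
As a rough illustration of ordering migration by hotness with multiple queues, the sketch below buckets objects by the logarithm of their hotness counter, in the spirit of the MQ replacement algorithm, and drains the hottest queue first. It is not HMCached's implementation: `NUM_QUEUES`, `queue_index`, `build_queues`, and `pick_promotions` are hypothetical names, and the log2 bucketing is an assumption.

```python
from collections import deque

NUM_QUEUES = 8  # hypothetical number of hotness levels


def queue_index(hotness):
    """Map an integer hotness counter to a queue: floor(log2(hotness)), capped."""
    if hotness <= 0:
        return 0
    return min(hotness.bit_length() - 1, NUM_QUEUES - 1)


def build_queues(hotness_by_key):
    """Place every key into the queue that matches its hotness level."""
    queues = [deque() for _ in range(NUM_QUEUES)]
    for key, hotness in hotness_by_key.items():
        queues[queue_index(hotness)].append(key)
    return queues


def pick_promotions(queues, budget):
    """Take up to `budget` keys for promotion, hottest queue first."""
    picked = []
    for queue in reversed(queues):
        while queue and len(picked) < budget:
            picked.append(queue.popleft())
    return picked
```

Because each hotness level has its own queue, different levels could in principle be drained by separate workers, which is one plausible reading of the parallelism benefit mentioned above.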

Ziggurat groups writes to disk together when NVM space utilization exceeds a dynamic threshold determined by the read-write ratio. In addition, Ziggurat migrates only cold blocks in cold files.

Strata[11] is a novel multi-tier FS. The key idea behind Strata resembles an LSM-Tree: it uses logging to accelerate random writes. Logging happens in user space to avoid system call overhead, while the log digestion process is done in kernel space.

# References

1. Chan, Helen HW, et al. “HashKV: Enabling Efficient Updates in KV Storage via Hashing.” 2018 USENIX Annual Technical Conference (USENIX ATC 18). 2018.
2. Lu, Lanyue, et al. “WiscKey: Separating keys from values in SSD-conscious storage.” ACM Transactions on Storage (TOS) 13.1 (2017): 1-28.
3. Eisenman, Assaf, et al. “Flashield: a hybrid key-value cache that controls flash write amplification.” 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19). 2019.
4. Zhang, Yu, et al. “OSCA: An Online-Model Based Cache Allocation Scheme in Cloud Block Storage Systems.” 2020 USENIX Annual Technical Conference (USENIX ATC 20). 2020.
5. Yang, Juncheng, Yao Yue, and K. V. Rashmi. “A large scale analysis of hundreds of in-memory cache clusters at Twitter.” Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’20), Banff, Alberta, Canada. 2020.
6. Fan, Ziqi, David HC Du, and Doug Voigt. “H-ARC: A non-volatile memory based cache policy for solid state drives.” 2014 30th Symposium on Mass Storage Systems and Technologies (MSST). IEEE, 2014.
7. Wu, Kan, et al. “The Storage Hierarchy is Not a Hierarchy: Optimizing Caching on Modern Storage Devices with Orthus.” 19th USENIX Conference on File and Storage Technologies (FAST 21). 2021.
8. Jin, Hai, et al. “Hotspot-aware hybrid memory management for in-memory key-value stores.” IEEE Transactions on Parallel and Distributed Systems 31.4 (2019): 779-792.
9. Zheng, Shengan, Morteza Hoseinzadeh, and Steven Swanson. “Ziggurat: a tiered file system for non-volatile main memories and disks.” 17th USENIX Conference on File and Storage Technologies (FAST 19). 2019.
10. Qiu, Sheng, and AL Narasimha Reddy. “NVMFS: A hybrid file system for improving random write in NAND-flash SSD.” 2013 IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST). IEEE, 2013.
11. Kwon, Youngjin, et al. “Strata: A cross media file system.” Proceedings of the 26th Symposium on Operating Systems Principles. 2017.
12. Zhang, Yu, et al. “A machine learning based write policy for SSD cache in cloud block storage.” 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2020.