<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://www.grayxu.cn/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.grayxu.cn/" rel="alternate" type="text/html" /><updated>2026-03-17T19:56:29+08:00</updated><id>https://www.grayxu.cn/feed.xml</id><title type="html">Gray&apos;s grind</title><subtitle>gray&apos;s blog</subtitle><author><name>Gray</name></author><entry><title type="html">My 2025 AI Coding Usage Review</title><link href="https://www.grayxu.cn/2026/01/03/AI-Coding/" rel="alternate" type="text/html" title="My 2025 AI Coding Usage Review" /><published>2026-01-03T00:00:00+08:00</published><updated>2026-01-03T00:00:00+08:00</updated><id>https://www.grayxu.cn/2026/01/03/AI-Coding</id><content type="html" xml:base="https://www.grayxu.cn/2026/01/03/AI-Coding/"><![CDATA[<p>AI coding saw a huge jump in capability over 2025. As model capabilities, agent tools, and IDEs evolved up and down the stack, my usage strategy kept adjusting along with them, so this post records how my toolbox changed this year. The perspective is not professional at all; it is purely a user's experience (basically a usage review).</p>

<p>inspired by:</p>
<ul>
  <li>@xuanwo's vibe coding series: <a href="https://xuanwo.io/posts/2025-12-09-vibe-coding/">https://xuanwo.io/posts/2025-12-09-vibe-coding/</a></li>
  <li><a href="https://simonwillison.net/2025/Dec/31/the-year-in-llms/">2025: The year in LLMs</a></li>
  <li><a href="https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills">Anthropic Equipping agents for the real world with Agent Skills</a></li>
  <li><a href="https://www.anthropic.com/research/how-ai-is-transforming-work-at-anthropic?referrer=grok.com">How AI is transforming work at Anthropic</a></li>
</ul>

<p>The writing here is fairly free-form; I wrote things down as they came to mind.
(An LLM only adjusted the wording, and it raised some questions based on the content, which I wrote up as the observations section.)</p>

<h1 id="目录">目录</h1>
<ul>
  <li><a href="#目录">目录</a></li>
  <li><a href="#2025年使用的不同阶段">2025年使用的不同阶段</a>
    <ul>
      <li><a href="#第一阶段">第一阶段</a></li>
      <li><a href="#第二阶段">第二阶段</a></li>
      <li><a href="#第三阶段">第三阶段</a></li>
    </ul>
  </li>
  <li><a href="#一些随机观察">一些随机观察</a>
    <ul>
      <li><a href="#护城河">护城河</a></li>
      <li><a href="#attention-is-all-you-need">attention is all you need</a></li>
      <li><a href="#mcp没用">MCP没用？</a></li>
      <li><a href="#能力边界">能力边界</a></li>
    </ul>
  </li>
  <li><a href="#end">End</a></li>
</ul>

<h1 id="2025年使用的不同阶段">2025年使用的不同阶段</h1>

<h2 id="第一阶段">第一阶段</h2>

<p><strong>Keyword: Cursor</strong></p>

<p>Years ago I was in the first batch to get github copilot. It felt amazing at the time; I posted on WeChat Moments that someone could now pair-program with me around the clock. In reality, though, the copilot of that era was a failing half-finished product: when completing a function call it would get even the number of arguments wrong, and in a sense it was more like noise that broke my flow. After chatgpt blew up, a pile of AI coding vscode extensions emerged, like codeium/continue. Then Cursor got hot in the second half of 2024, and after trying it I simply rm'ed vscode.</p>

<p>For basically the first half of 2025, I still relied on Cursor for my AI coding editor needs. Cursor's advantage is very clear: it offers an <strong>interaction-friendly, all-in-one</strong> package, and almost everything carries over when you migrate from vscode. For me the most critical point is interaction; the experience of reviewing diffs, inline chat, and cursor prediction is all excellent.<br />
For example, when I ask it to write a chunk of code, I can see exactly what it did through inline word-level diffs, whereas the vscode+extension approach could only pull up a hard-to-read two-column comparison. At that point in time, competitors lagged far behind on these experience-side features. Also, Cursor's chat features can be hooked up to your own API to dodge the quota limits, <del>and on top of that there were unlimited refills</del>. <br />
For complex design problems, though, I still mostly relied on external chatbots to discuss and shape a plan; with a self-hosted service like openwebui, I could have several top-tier models discuss a problem in parallel.</p>

<p>At this point, the agent experience felt more like a fancy toy to me. I cannot say whether it was limited by the models' agentic ability or by the product itself, but most of the time it failed to grab the right context, talked past me, and even routinely drove itself into infinite loops. A tool with output quality that unstable is really hard to place into a workflow. So vibe coding still felt somewhat far away for me at that time.</p>

<p>So stage one roughly looked like this: my workflow was still an IDE centered on my own editing experience, assisted by a tightly integrated, easy-to-interact-with inline chatbot. For more complex tasks I went straight to an external chatbot and provided the context by hand.</p>

<h2 id="第二阶段">第二阶段</h2>

<p><strong>Keyword: CLI Agent</strong></p>

<p>The game changer at this stage was CLI products like claude code.<br />
cc was the first agent I used that could take offloaded, relatively complex or long-running tasks and finish them all-in-one. But I quickly found the drawback (<del>or rather, my own shortcoming</del>): token burn was too fast, stable third-party access to anthropic tokens has always been pricey, and there was the account-ban problem on top. So after quickly realizing I could not afford the tokens, I fell back to the Cursor+chatbot combo.</p>
<blockquote>
  <p>Come to think of it, isn't cc's subscription still pricier than competitors' even now, once converted into equivalent tokens?</p>
</blockquote>

<p>Many similar IDE/CLI tools appeared around the same time, but none felt like a qualitative leap over Cursor+chatbot, so I lacked the motivation to migrate; after all, migrating means another adaptation period. That was until gemini-cli was released.</p>

<p>Although gemini-cli had plenty of problems early on, its iteration speed was genuinely fast (probably helped by being open source). <del>Also, thanks to the mysterious $300 credit that refreshes every three months</del>, I installed gemini-cli on every machine/VM I had. That is when I discovered the beauty of CLI tools: setting aside how hard a task is, you really can offload the entire flow. Just open a terminal, ssh in, launch it inside tmux, and hand out work asynchronously, with nothing extra needed. Deployment/migration chores in particular could be offloaded entirely.</p>

<p>Around the same time, because of an internship, compliance red lines kept me from using the various closed-source model options, so I fiddled with a lot of <em>alternatives</em>, but they all fell short; there was a clear generational gap.<br />
On the IDE/editor side, options like Lxxxx/Axxx were not only stuck with weaker base models, they could not even do C/C++ IntelliSense properly. The problems of extension-based approaches were described above.<br />
On the CLI side, tools like qwen-code had quite a few issues when they first shipped. I also tried running cc through a router layer, but in my own tests the experience still clearly trailed using its native models.<br />
Of course, this was really just a matter of time; the gap has surely narrowed a lot by now, since capabilities have leveled up considerably across the board.</p>

<p>In the same period there were various higher-level agent offerings that work by spinning up VMs/ack to do the job; for some simple long-chain chores the experience was decent. For example, I used the aone agent hooked up to k2 for a lot of code-reading work, and manus for a lot of crawler work. But for me, the problem with these <em>general-purpose</em> options is that the vast majority of my tasks depend heavily on specific software and hardware environments; a virtualized generic environment cannot close the loop and sometimes cannot even compile, which hurts their practicality quite a bit.</p>

<p>The most exciting thing about stage two, for me, was that CLI tools qualitatively changed the workflow.<br />
Still, with gemini-cli I felt that for more complex coding tasks, even with a clear plan, an implementation path, and reference code, and even in a not-so-large single repo, gemini-cli struggled to implement them well, let alone one-shot them.</p>

<h2 id="第三阶段">第三阶段</h2>

<p><strong>Keyword: Codex</strong></p>

<p>When I noticed the community all discussing "Codex is awesome" and "having sailed past a thousand ships, I still return to VSCode", I felt like I was falling behind. On top of that, I had found some annoyances with Cursor:</p>
<ol>
  <li>Price. The cheap small-tier subscription is heavily rate-limited, and custom APIs don't support many of the new features.</li>
  <li>Stability. Support for some new extensions is flaky (e.g., codex diffs couldn't be viewed).</li>
  <li>Performance. Both Cursor and gemini-cli are a bit heavy and run into all sorts of problems on certain small VMs.</li>
</ol>

<p>So I copied the community's homework and started using vscode+copilot+codex plus other people's agents.md.<br />
I quickly found that Copilot had actually patched up most of the interaction-experience gaps; pretty much everything you'd expect is there. Capability-wise it was still about the same (using Opus), and the subscription quota isn't generous either.</p>

<p>Codex, on the other hand, gave me plenty of pleasant surprises:</p>
<ol>
  <li>Good subscription value for money</li>
  <li>Rust really is nice; it is so, so lightweight</li>
  <li>Codex's demonstrated task-solving ability is clearly better than gemini-cli's</li>
</ol>

<p>I haven't used cc in depth, so I don't have much standing to comment. But Codex is truly far stronger than the gemini-cli of the same period: complex tasks hashed out in plan mode have a very high chance of being solved one-shot. I started assigning Codex all kinds of tasks, and the completion quality often exceeded expectations, e.g., debugging with gdb on its own to locate a problem and then closing the loop with self-tests, or grabbing metrics with perf to do profiling by itself.<br />
On small tasks, or tasks whose design is clear and stable (i.e., won't drift later), Codex performs remarkably well. Limited by human bandwidth, there are always things I want to write but never get to; Codex is now a crazy accelerator for exactly that kind of task.</p>

<p>By raw code volume in my current development, Codex indeed produces the vast majority of it, far better and far faster than I would. But measured against the ultimate goal of 100% vibe coding, my use cases still need plenty of extra intervention. This is mainly on prototype systems with extreme performance requirements: Codex is too defensive, and it won't take many hacky implementation moves unless you insist; likewise in design, e.g., where the requirements would allow a lock-free approach, it won't squeeze everything out. (More on this in <a href="#能力边界">Capability Boundaries</a>.)</p>
<blockquote>
  <p>So agents.md still matters, especially if you frequently switch between different kinds of projects; it is how you align the context around requirements. (<del>Say, writing assembly one moment and frontend the next.</del>)</p>
</blockquote>

<p>On some other products:</p>
<ul>
  <li>I tried some tasks with GitHub's Copilot agent. It is indeed fairly intuitive to use; you can even have it open a PR from your phone. It's just a bit too slow, and its intelligence is honestly still quite limited.</li>
  <li>I've seen many worktree-style products/features for managing multiple agents. My need for them is weak; most of the time I still prefer to work linearly and try to teach the agent to do things right. That said, plenty of tasks fork more and more as you write, which also challenges bandwidth; I'll give these a try later!</li>
</ul>

<p>To sum up this stage: a soft ad for Codex.</p>

<h1 id="一些随机观察">一些随机观察</h1>

<h2 id="护城河">护城河</h2>

<p>What is the moat? That feels like a question that is hard for a user to answer well; whatever you say sounds like a hot take...<br />
Everyone says scaling laws slowed down this year and base-model gains are coming slower, but my gut feeling is that the base model still matters a lot. After migrating to Codex, I could feel some improvement with every new release. So I think it's not just about agent-capability leaderboards; there are probably many key-path metrics that need to be quantified and optimized. I buy the view that "application capability and model capability are tightly coupled" (<del>so go buy xxx calls</del>).</p>

<h2 id="attention-is-all-you-need">attention is all you need</h2>

<p>Human attention matters. The final product still needs review (<del>though from some developers' perspective, maybe source code no longer has to be reviewed</del>), so human bandwidth remains the bottleneck.<br />
Cursor's earliest delight for me was that review pressure was tiny: tab, tab, and tab again. One delight of the CLIs is that they embed seamlessly into the existing workflow.<br />
Everyone naturally prefers the option that gets the job done right, and only then worries about speed and price. Top-capability models and agents have maxed-out appeal there, and they sidestep the migration-cost question.<br />
I feel my current workflow still carries a lot of interaction/review overhead; maybe a better solution is still out there.</p>

<h2 id="mcp没用">MCP没用？</h2>

<p>Lots of people say MCP is useless. Looked at from another angle, the real question is whether your workflow was already complete inside the plain CLI shell; if so, then fine.<br />
But the CLI obviously does not hold all the world's information. Plenty of things out there expose no interface and forbid crawling, so some form of interaction still has to be provided. Especially since, for me, the coding agent is no longer just for development: it has to some degree become a CLI entry point and a de facto assistant, so I need it to have a wider input/output range, and MCP is at least a workable solution for now.</p>

<h2 id="能力边界">能力边界</h2>

<p>In more than one interview during autumn recruiting, someone brought up the capability boundaries of AI coding. Understanding and pinning down those boundaries is clearly important. Today's coding agents are obviously still far from the ideal state, so effort is still needed to close the last mile (or the first mile).</p>
<ol>
  <li>My current feel is that a coding agent is a senior developer with no context. Whenever the right context is missing or the plan is incomplete, output quality can spin out of control, and it will very confidently march into a mess.</li>
  <li>Its command of large projects is limited; unless you talk the plan through sentence by sentence, it keeps failing to fish out the right context.</li>
  <li>Some problems cannot be avoided even after discussing the plan with it. The task list looks fine to you, but there are always variables you cannot control or anticipate, and it does not know the priors in your head, so at some crossroads it picks an option misaligned with yours, and things go wrong from there. That is why I don't much like vibing out a skeleton first and optimizing later; reading such code makes my head hurt, and I would rather decompose the work and do it properly.</li>
  <li>Going a step further, you will be wrong sometimes too. LLMs still don't withstand challenges well; they are too compliant. It would be great if they pushed back on your design more.</li>
  <li>Finally, as mentioned earlier, LLMs still trail humans a lot at high-level design, let alone contributing ideas; most of the time they can produce a design that is correct but far from good (<del>then again, much of the time you don't need a good design anyway</del>).</li>
</ol>

<h1 id="end">End</h1>

<p>To wrap up: although the track is crowded with players and very red-ocean, it doesn't feel like things have converged yet. I'm looking forward to new changes next year that give us users a bit more acceleration.</p>]]></content><author><name>Gray</name></author><category term="EC" /><summary type="html"><![CDATA[AI Coding, Vibe Coding]]></summary></entry><entry><title type="html">Accelerating Erasure Coding</title><link href="https://www.grayxu.cn/2025/03/06/EC-Lib/" rel="alternate" type="text/html" title="Accelerating Erasure Coding" /><published>2025-03-06T00:00:00+08:00</published><updated>2025-03-06T00:00:00+08:00</updated><id>https://www.grayxu.cn/2025/03/06/EC-Lib</id><content type="html" xml:base="https://www.grayxu.cn/2025/03/06/EC-Lib/"><![CDATA[<p>A survey of various existing acceleration techniques for erasure coding, including recent academic work and popular open-source libraries.</p>

<h1 id="forewords">Forewords</h1>

<p>Erasure coding (EC) strategies essentially provide system-level fault tolerance by <strong>encoding</strong> k data blocks into m redundant blocks. k can be much larger than m, unlike replicas, where m is an integer multiple of k. The obvious advantage of EC is reduced storage space, but the more complex data organization introduces additional overhead, such as frequent encoding and decoding operations. Unlike XOR, EC encoding and decoding cannot achieve line speed; therefore, numerous EC acceleration libraries have been developed. Open-source and widely used libraries include Intel's ISA-L, the Jerasure series, and klauspost/reedsolomon in Go. Many companies also have their own proprietary libraries. Moreover, extensive academic research focuses on EC acceleration.</p>

<p>This blog primarily focuses on the end-to-end efficiency of Reed-Solomon (RS) codes, a specific type of systematic code, rather than the mathematical encoding problem itself.</p>

<blockquote>
  <p>The content was adapted by gemini 2.0 pro from my original notes. If the comments are too harsh, it's not me who wrote them.</p>
</blockquote>

<p>Some related background:</p>

<ul>
  <li>If you are completely unfamiliar with erasure coding, please check drxp's EC blog series for a 101 introduction:
    <ul>
      <li>Principles: <a href="https://blog.openacid.com/storage/ec-1/">https://blog.openacid.com/storage/ec-1/</a></li>
      <li>Implementation: <a href="https://blog.openacid.com/storage/ec-2/">https://blog.openacid.com/storage/ec-2/</a></li>
      <li>Optimization: <a href="https://blog.openacid.com/storage/ec-3/">https://blog.openacid.com/storage/ec-3/</a></li>
    </ul>
  </li>
  <li>Additional background: High-performance erasure codes - kkblog: <a href="https://abcdxyzk.github.io/blog/2018/04/12/isal-erase-1/">https://abcdxyzk.github.io/blog/2018/04/12/isal-erase-1/</a>
    <ul>
      <li>Parallel lookup table strategy for GF calculations</li>
      <li>Matrix operation partitioning to improve locality</li>
      <li>Cauchy encoding matrix: <a href="https://abcdxyzk.github.io/blog/2018/04/16/isal-erase-2/">https://abcdxyzk.github.io/blog/2018/04/16/isal-erase-2/</a></li>
    </ul>
  </li>
</ul>

<p>We will first divide the discussion into two categories:</p>

<ol>
  <li>
    <p><strong>XOR-based</strong>: Because finite field calculations are not instructions directly executable by the CPU, they are converted into multiple XOR operations.</p>
  </li>
  <li>
    <p><strong>Lookup Table</strong>: If the finite field is fixed (e.g., GF(2^8)), the multiplication results can be stored in a fixed table, and table lookups are used instead of calculations.</p>
  </li>
</ol>

<h1 id="xor-based">XOR-based</h1>

<h2 id="dsn-09-tc-13-efficient-encoding-schedules-for-xor-based-erasure-codes">DSN '09 TC '13 Efficient Encoding Schedules for XOR-Based Erasure Codes</h2>

<p>Jianqiang Luo, Mochan Shrestha, Lihao Xu, and <strong>James S. Plank</strong></p>

<p><img src="https://www.grayxu.cn/images/2025/03/06/2025-03-06-16-30-34.png" alt="Pasted image 20250305193332.png" /></p>

<p><img src="https://www.grayxu.cn/images/2025/03/06/2025-03-06-16-30-48.png" alt="Pasted image 20250305183609.png" /></p>

<p>As shown in the figures above, unlike the simple logic of finite field calculations, the computational logic of XOR codes can be understood as splitting each data block into sub-blocks (corresponding to packets). Each sub-parity (sub-p) block requires the XORing of multiple sub-data (sub-d) blocks across rows. Different XOR-code matrices represent different combinational logic. Therefore, much work focuses on proposing more efficient matrices with fewer XOR operations, resulting in less computation and faster speed. This paper focuses on the fact that, while fewer computational operations are important, the caching efficiency of sub-d blocks also plays a significant role, since sub-d blocks are repeatedly accessed during computation.</p>

<p>Therefore, this paper proposes several different scheduling strategies:</p>

<ul>
  <li><em>DPG (Data Packets Guided)</em>: Performs calculations in the order of packets, processing all calculations related to one packet before moving on to the next.</li>
  <li><em>DWG (Data Words Guided)</em>: Similar to DPG, but iterates by data word.</li>
</ul>

<p>This improves the locality of sub-d blocks. And because sub-p blocks are only written, never repeatedly read, their accesses are pure stores that add no further cache misses. Although this paper targets pure XOR codes like EVENODD and RDP, rather than RS codes, the idea is still quite good.</p>
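
<p>Here is a minimal sketch of the scheduling idea (my illustration, not the paper's code): encoding an XOR code amounts to XORing, for each sub-parity, the sub-data packets its schedule lists. A DPG-style loop walks data packets in the outer loop, so each packet is read once while it is cache-hot; a parity-centric loop would instead re-read every data packet once per parity.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* sched[p * ndata + d] != 0 means sub-parity p needs sub-data d */
void encode_dpg(uint8_t **parity, uint8_t **data, const int *sched,
                int nparity, int ndata, size_t pkt_size)
{
    for (int d = 0; d &lt; ndata; d++)          /* data-packets-guided order */
        for (int p = 0; p &lt; nparity; p++)
            if (sched[p * ndata + d])
                for (size_t i = 0; i &lt; pkt_size; i++)
                    parity[p][i] ^= data[d][i];   /* parities are write-only */
}
</code></pre></div></div>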

<h2 id="fast-19-fast-erasure-coding-for-data-storage-a-comprehensive-study-of-the-acceleration-techniques">FAST '19 Fast Erasure Coding for Data Storage: A Comprehensive Study of the Acceleration Techniques</h2>

<p>Work by Tianli Zhou and Chao Tian, TAMU.</p>

<p>Open source: <a href="https://github.com/zhoutl1106/zerasure">https://github.com/zhoutl1106/zerasure</a> This is further work based on Jerasure.</p>

<p>This work is very comprehensive and reading the original paper is recommended. It can be considered a small survey. Zerasure combines and optimizes several existing erasure coding acceleration techniques, including <em>coding matrix design, computational scheduling optimization, general XOR operation reduction, cache management, and vectorization</em>. It then proposes building a cost function based on the number of XOR and memcpy operations. Simulated annealing is used to choose among multiple mutually exclusive strategies, while non-mutually exclusive optimizations are directly overlapped.</p>

<ul>
  <li>Adjusting the selection still involves various matching scheduling techniques, as they are mutually exclusive.</li>
</ul>
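
<p>A toy sketch of the selection step (hypothetical names and cost model, not Zerasure's actual code): score each mutually exclusive strategy by its XOR and memcpy counts, treated as equal cost per the paper's viewpoint, and anneal over the candidates.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;math.h&gt;
#include &lt;stdlib.h&gt;

#define NUM_STRATEGIES 4

/* assumed: returns (#XORs + #memcpys) of one candidate strategy */
extern double cost_of(int strategy);

int anneal_pick(void)
{
    int cur = 0, best = 0;
    double t = 1.0, c_cur = cost_of(0), c_best = c_cur;
    for (int step = 0; step &lt; 10000; step++, t *= 0.999) {
        int cand = rand() % NUM_STRATEGIES;       /* random neighbor */
        double c = cost_of(cand);
        /* Metropolis rule: take improvements, sometimes worse moves */
        if (c &lt; c_cur || exp((c_cur - c) / t) &gt; (double)rand() / RAND_MAX) {
            cur = cand;
            c_cur = c;
        }
        if (c_cur &lt; c_best) { best = cur; c_best = c_cur; }
    }
    return best;
}
</code></pre></div></div>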

<p><img src="https://www.grayxu.cn/images/2025/03/06/2025-03-06-16-31-28.png" alt="Aspose.Words.8c4c77a8-626b-4a47-85dc-22a66dae0175.053.png" /></p>

<p>It is worth noting that this paper presents several viewpoints:</p>

<ul>
  <li>It considers the costs of memcpy and XOR to be equivalent.</li>
  <li>XOR vectorization is faster than direct GF calculation vectorization (e.g., ISA-L).</li>
  <li>Cache-related S-CO strategies do not offer significant improvements.
    <blockquote>
      <p>note: This may be because Zerasure itself is not very fast, and the access patterns of XOR codes are inherently not cache-friendly.</p>
    </blockquote>
  </li>
</ul>

<p>Performance improvements:</p>

<p><img src="https://www.grayxu.cn/images/2025/03/06/2025-03-06-16-31-51.png" alt="Pasted image 20240507234417.png" /></p>

<h2 id="sc-21-accelerating-xor-based-erasure-coding-using-program-optimization-techniques">SC '21 Accelerating XOR-based erasure coding using program optimization techniques</h2>

<p>Yuya Uezato, Dwango, Co., Ltd (Parent company of NICONICO, amazing!)</p>

<p>Open source: <a href="https://github.com/yuezato/xorslp_ec">https://github.com/yuezato/xorslp_ec</a></p>

<p>However, it is only a Rust proof-of-concept; the authors say a proper library will follow, but nothing has appeared yet. The tools used are not very systems-oriented, but the work itself is quite interesting.</p>

<p>As can be seen earlier, the process of XOR-based EC is actually logically quite simple, somewhat similar to CUDA kernels, but limited by computational resources, memory access resources, etc. So, this work directly abstracts it into SLPs (a concept from the PL field, Straight-Line Programs), and then uses various SLP optimization strategies to optimize the SLP (automated PL strategies):</p>

<ol>
  <li>Compressing: Using grammar compression algorithms to reduce the number of XORs (a toy example is sketched below).</li>
  <li>Fusing: Using the functional program optimization method deforestation to reduce memory access.
    <ol>
      <li>This reduces memory access for intermediate variables, but it seems many are done manually.</li>
    </ol>
  </li>
  <li>Using the (red-blue) pebble game from program analysis to reduce cache misses.
    <ol>
      <li>A formal objective is created, and then heuristically optimized, performing XOR rearrangement.</li>
    </ol>
  </li>
</ol>

<p>[<em>It seems that there will still be conflicts among these multiple strategies. How to make trade-offs?</em>]</p>

<p>This strategy is different from ISA-L, which directly accelerates finite field calculations using lookup tables, without converting to XOR.</p>
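
<p>To make the XOR-reduction idea concrete, here is a toy version of what grammar compression does to a straight-line program (my illustration, not the paper's algorithm): the shared sub-expression is computed once and reused, cutting four XORs per position down to three.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

/* naive SLP: p1 = d1 ^ d2 ^ d3; p2 = d2 ^ d3 ^ d4 (4 XORs per byte) */
void encode_naive(uint8_t *p1, uint8_t *p2, const uint8_t *d1,
                  const uint8_t *d2, const uint8_t *d3,
                  const uint8_t *d4, int n)
{
    for (int i = 0; i &lt; n; i++) {
        p1[i] = d1[i] ^ d2[i] ^ d3[i];
        p2[i] = d2[i] ^ d3[i] ^ d4[i];
    }
}

/* compressed SLP: t = d2 ^ d3 is named and reused (3 XORs per byte) */
void encode_compressed(uint8_t *p1, uint8_t *p2, const uint8_t *d1,
                       const uint8_t *d2, const uint8_t *d3,
                       const uint8_t *d4, int n)
{
    for (int i = 0; i &lt; n; i++) {
        uint8_t t = d2[i] ^ d3[i];   /* shared sub-expression */
        p1[i] = d1[i] ^ t;
        p2[i] = t ^ d4[i];
    }
}
</code></pre></div></div>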

<p>Results:</p>

<ul>
  <li>The optimized EC library outperforms ISA-L in RS(10,4) encoding, achieving a throughput of 8.92 GB/s.</li>
  <li>(Xor)RePair reduces XOR operations by approximately 60% on average.</li>
  <li>The combination of XOR fusion and (Xor)RePair reduces memory access by approximately 76% on average.</li>
  <li>The fusing step provides the largest improvement.</li>
</ul>

<p><img src="https://www.grayxu.cn/images/2025/03/06/2025-03-06-16-32-14.png" alt="Pasted image 20240507234459.png" /></p>

<h2 id="iccd-23-tcad-24-cerasure-fast-acceleration-strategies-for-xor-based-erasure-codes">ICCD '23 TCAD '24 Cerasure: Fast Acceleration Strategies For XOR-Based Erasure Codes</h2>

<p>Tianyang Niu, Min Lyu, Wei Wang, Qiliang Li, Yinlong Xu, ADSL Lab, USTC</p>

<p>Open Source: <a href="https://github.com/ADSL-EC/Cerasure">https://github.com/ADSL-EC/Cerasure</a></p>

<p>Challenges:</p>

<ol>
  <li>The number of 1s in the bit matrices found by existing heuristic algorithms can obviously be further reduced.</li>
  <li>Creating pointers for reading/writing data leads to high encoding latency.</li>
  <li>The trade-off between computational efficiency and spatial locality can be further improved by selecting the packet size.</li>
  <li>Wide-stripe encoding (stripes containing many data/parity blocks) leads to low cache hit rates for commonly used packet sizes.</li>
</ol>

<p>Corresponding designs:</p>

<ol>
  <li>V-search: Searches Vandermonde and Cauchy matrices and greedily reduces the number of 1s in the matrix to find a near-optimal encoding matrix.
    <ol>
      <li>The trans version adds an opt-search: iteratively replaces matrix elements in descending order of the number of 1s in the bit matrix, and excludes those that increase the number of 1s or destroy the MDS property.</li>
    </ol>
  </li>
  <li>Uses offset reuse to accelerate the construction of read and write pointers (this engineering trick seems to be due to the strong coupling with ISA-L's interface).</li>
  <li>Finds a trade-off in packet size selection (computational efficiency and cache), using the L1 cache size to calculate an optimal solution.</li>
  <li>Decompose: For wide stripes, the number of data blocks is larger, putting significant pressure on the cache. Therefore, the calculation is separated into multiple sub-encoding tasks, which are merged at the end.
    <ol>
      <li>The trans version adds smart decompose, which greedily combines subtasks during decomposition to increase similarity, so that the previous scheduling is more effective.</li>
    </ol>
  </li>
</ol>

<p>Experiments were compared with Zerasure and SLPEC, but not with their implemented baseline, ISA-L. It can be more than twice as fast as Zerasure, but Zerasure cannot outperform the default ISA-L.</p>

<p><img src="https://www.grayxu.cn/images/2025/03/06/2025-03-06-16-32-22.png" alt="Pasted image 20240507234536.png" /></p>

<h2 id="hotstorage24-rethinking-erasure-coding-libraries-in-the-age-of-optimized-machine-learning">HotStorage'24 Rethinking Erasure-Coding Libraries in the Age of Optimized Machine Learning</h2>

<p>Jiyu Hu, Jack Kosaian, <strong>K. V. Rashmi</strong>, CMU</p>

<p>As mentioned earlier, some have used SLP to automatically optimize XOR organization and scheduling. This paper is even more interesting, directly using TVM to optimize EC computation scheduling. The difference between EC matrix calculations and NN matrix calculations is that the internal subunits are performing bitmatrix XOR.</p>

<p><img src="https://www.grayxu.cn/images/2025/03/06/2025-03-06-16-33-57.png" alt="{C6F8BE23-3838-47A6-863E-44E72348FFB5}.png" /></p>

<p>No internal modifications to TVM are needed; the API is called directly. However, TVM requires the data to be contiguous, and they assume the resulting memcpy overhead is common to all libraries anyway.</p>

<blockquote>
  <p>But it doesn't feel like it?</p>
</blockquote>

<p>This idea is very interesting, but it's quite engineering-heavy to implement. Although it uses TVM, it's not using the GPU. It's still compared with CPU libraries.</p>

<p><img src="https://www.grayxu.cn/images/2025/03/06/2025-03-06-16-34-04.png" alt="{50D31BF7-5956-4665-AD40-2D27999770AF}.png" /></p>

<p>Interestingly, it can be seen that the SC'21 work can no longer outperform ISA-L when r=4. In addition to the hardware issues mentioned in the paper, the main reason is that the increase in XOR operations is not linear for XOR codes with multiple parities. It can be seen that the advantage of TVM-EC increases with a larger number of parities. This may be because the increase in the number of operations provides more optimization space for TVM. Of course, this is an optimal calculation state reached after parameter learning, which requires a warm-up-like process.</p>

<p>Introducing this system to existing systems requires a C++ runtime, and the layout needs to be adjusted. In addition, the specific memory access and calculation process becomes opaque, which is actually quite heavy.</p>

<p>By the way, there are also some works that use GPUs for EC:</p>

<ul>
  <li>ICC'15 PErasure: A parallel Cauchy Reed-Solomon coding library for GPUs</li>
  <li>TPDS'18 G-CRS: GPU Accelerated Cauchy Reed-Solomon Coding
    <ul>
      <li>Some <em>memory access efficiency optimization and control flow optimization</em> methods are made for GPUs. Because CRS is gf(2), XOR can be used directly.</li>
    </ul>
  </li>
</ul>

<h1 id="lookup-table">Lookup Table</h1>

<h2 id="jerasure">Jerasure</h2>

<p><a href="http://jerasure.org/jerasure/gf-complete/">http://jerasure.org/jerasure/gf-complete/</a></p>

<p>Jerasure uses lookup tables for finite field calculations. It is a C library that incorporates work by James S. Plank, such as his FAST '13 paper on fast Galois Field arithmetic. The precomputed multiplication tables include (a minimal multiply built on these two tables is sketched after the list):</p>

<ul>
  <li>Log Table: Records the logarithmic value of each non-zero element in the field.</li>
  <li>Exp Table: Records the field element corresponding to each logarithmic value.</li>
</ul>
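
<p>A minimal multiply built on these two tables (a sketch of the classic technique, not Jerasure's exact code), over GF(2^8) with the common polynomial 0x11d and generator 2:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

static uint8_t gf_log[256], gf_exp[512];

static void gf_init(void)
{
    uint16_t x = 1;
    for (int i = 0; i &lt; 255; i++) {
        gf_exp[i] = (uint8_t)x;
        gf_log[x] = (uint8_t)i;
        x &lt;&lt;= 1;
        if (x &amp; 0x100)
            x ^= 0x11d;              /* reduce mod the field polynomial */
    }
    for (int i = 255; i &lt; 512; i++)  /* duplicate so no "% 255" is needed */
        gf_exp[i] = gf_exp[i - 255];
}

uint8_t gf_mul(uint8_t a, uint8_t b)
{
    if (a == 0 || b == 0)
        return 0;
    return gf_exp[gf_log[a] + gf_log[b]];   /* exp(log a + log b) */
}
</code></pre></div></div>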

<p>It supports not only GF(2^8) arithmetic but also finite fields from GF(2^4) up to GF(2^128). Its vectorization acceleration is SSE-only.</p>

<p>Different optimization strategies are employed for different values of <em>w</em>:</p>

<ul>
  <li>GF(2^4): Multiplication of 128-bit data is accomplished through two table lookups using the <code class="language-plaintext highlighter-rouge">_mm_shuffle_epi8</code> instruction.</li>
  <li>GF(2^8): The 8-bit number is split into two 4-bit numbers, each undergoing table lookup, leveraging the <code class="language-plaintext highlighter-rouge">_mm_shuffle_epi8</code> instruction (see the sketch after this list).</li>
  <li>GF(2^16): The 16-bit number is split into four 4-bit numbers, utilizing eight lookup tables and the <code class="language-plaintext highlighter-rouge">_mm_shuffle_epi8</code> instruction. To fully exploit SIMD parallelism, an "Altmap" memory mapping scheme is adopted, mapping a set of every 16 words into two 128-bit variables.</li>
  <li>GF(2^32): Similar to GF(2^16), the 32-bit number is split into eight 4-bit numbers, employing 32 lookup tables and the "Altmap" memory mapping.</li>
</ul>
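
<p>A sketch of the split-table trick for GF(2^8) (a simplified stand-in for what GF-Complete does, reusing the scalar <code class="language-plaintext highlighter-rouge">gf_mul</code> sketched above): two 16-entry tables cover the low and high nibble of each byte, and <code class="language-plaintext highlighter-rouge">_mm_shuffle_epi8</code> performs 16 lookups at once.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;tmmintrin.h&gt;   /* SSSE3: _mm_shuffle_epi8 */

extern uint8_t gf_mul(uint8_t a, uint8_t b);   /* e.g., the log/exp version */

void gf8_region_mul_ssse3(uint8_t *dst, const uint8_t *src,
                          size_t len, uint8_t c)
{
    uint8_t lo[16], hi[16];
    for (int i = 0; i &lt; 16; i++) {
        lo[i] = gf_mul(c, (uint8_t)i);          /* c * low nibble  */
        hi[i] = gf_mul(c, (uint8_t)(i &lt;&lt; 4));   /* c * high nibble */
    }
    __m128i tlo  = _mm_loadu_si128((const __m128i *)lo);
    __m128i thi  = _mm_loadu_si128((const __m128i *)hi);
    __m128i mask = _mm_set1_epi8(0x0f);
    for (size_t i = 0; i + 16 &lt;= len; i += 16) {
        __m128i x = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i l = _mm_and_si128(x, mask);
        __m128i h = _mm_and_si128(_mm_srli_epi64(x, 4), mask);
        /* product = table_lo[low nibble] ^ table_hi[high nibble] */
        __m128i p = _mm_xor_si128(_mm_shuffle_epi8(tlo, l),
                                  _mm_shuffle_epi8(thi, h));
        _mm_storeu_si128((__m128i *)(dst + i), p);
    }
}
</code></pre></div></div>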

<p>A critical problem with Jerasure is its memory access pattern. Here is a pseudo-code example:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">jerasure_matrix_encode</span><span class="p">(</span><span class="kt">int</span> <span class="n">k</span><span class="p">,</span> <span class="kt">int</span> <span class="n">m</span><span class="p">,</span> <span class="kt">int</span> <span class="n">w</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">matrix</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">data</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">coding</span><span class="p">,</span> <span class="kt">int</span> <span class="n">size</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">m</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// Iterate through each parity block</span>
        <span class="k">for</span> <span class="p">(</span><span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o">&lt;</span> <span class="n">k</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// Iterate through each data block</span>
            <span class="k">if</span> <span class="p">(</span><span class="n">matrix</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="n">k</span> <span class="o">+</span> <span class="n">j</span><span class="p">]</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// If the matrix coefficient is non-zero</span>
                <span class="n">galois_region_multiply</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="n">j</span><span class="p">],</span> <span class="n">matrix</span><span class="p">[</span><span class="n">i</span> <span class="o">*</span> <span class="n">k</span> <span class="o">+</span> <span class="n">j</span><span class="p">],</span> <span class="n">size</span><span class="p">,</span> <span class="n">coding</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="p">(</span><span class="n">j</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">));</span>
            <span class="p">}</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>It can be seen that it is centered around parity blocks, which leads to poor locality of data blocks, resulting in poor cache efficiency.</p>

<h2 id="isa-l">ISA-L</h2>

<p><a href="https://github.com/intel/isa-l">https://github.com/intel/isa-l</a></p>

<p>Intel's ISA-L may also be for generality (adapting to various platforms and instruction sets). It doesn't have many additional complex optimizations, whether in matrix selection or encoding strategy. It also directly uses lookup tables.</p>

<p>ISA-L also uses split multiplication tables. A complete GF(2^8) multiplication table (256x256 single-byte entries) would be 64KB, which would put too much pressure on the cache. Therefore, the 8 bits are split into the high 4 bits and the low 4 bits, creating smaller multiplication tables. The calculation then involves two table lookups and one XOR to obtain the final result. The performance advantage of ISA-L comes from relatively simple factors, mainly:</p>

<ul>
  <li>Extensive assembly unrolling, efficient use of instructions.</li>
  <li>Good memory access locality. After reading all data blocks at the same position during encoding, all parity blocks are written at once (limited by the number of registers, they only support p&lt;=6 this way; otherwise it must be split, though this could be extended). See the sketch after this list.</li>
  <li>Newer instruction set acceleration selection (e.g., AVX512, GFNI).</li>
</ul>
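
<p>A scalar sketch of that access pattern (my stand-in for ISA-L's hand-written assembly, with <code class="language-plaintext highlighter-rouge">gf_mul</code> as sketched earlier): each data word is loaded once and folded into all parity accumulators, which stay in registers and are stored exactly once per position.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

extern uint8_t gf_mul(uint8_t a, uint8_t b);

/* assumes m &lt;= 6, mirroring ISA-L's register-count limit */
void encode_accumulate(uint8_t **parity, uint8_t **data,
                       const uint8_t *coef,   /* m x k coefficients */
                       int m, int k, size_t len)
{
    for (size_t i = 0; i &lt; len; i++) {
        uint8_t acc[6] = {0};
        for (int j = 0; j &lt; k; j++) {
            uint8_t d = data[j][i];            /* one load, reused m times */
            for (int p = 0; p &lt; m; p++)
                acc[p] ^= gf_mul(coef[p * k + j], d);
        }
        for (int p = 0; p &lt; m; p++)
            parity[p][i] = acc[p];             /* one store per parity */
    }
}
</code></pre></div></div>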

<p>ISA-L's handwritten assembly here is similar to writing CUDA kernels. Interestingly, many EC implementations, even in top conferences, are built indirectly on ISA-L's standard encode interface, and these hand-rolled wrappers are not very careful about instruction-level efficiency. Writing assembly in the style of ISA-L would yield at least a 1-2x improvement (although this may not be their focus).</p>

<p>FYI, ISA-L also has some minor limitations:</p>

<blockquote>
  <p>Summary is ISA-L EC can use any encoding matrix, performs the same operation regardless of encoding matrix provided and the documentation is clear about the limitations of gf_gen_rs_matrix(). So not a bug in ISA-L.</p>

  <p>Vandermonde matrix example of encoding coefficients where high portion of matrix is identity matrix I and lower portion is constructed as 2^{i*(j-k+1)} i:{0,k-1} j:{k,m-1}. Commonly used method for choosing coefficients in erasure encoding but does not guarantee invertable for every sub matrix. For large pairs of m and k it is possible to find cases where the decode matrix chosen from sources and parity is not invertable. Users may want to adjust for certain pairs m and k. If m and k satisfy one of the following inequalities, no adjustment is required:</p>
  <ul>
    <li>k &lt;= 3</li>
    <li>k = 4, m &lt;= 25</li>
    <li>k = 5, m &lt;= 10</li>
    <li>k &lt;= 21, m-k = 4</li>
    <li>m - k &lt;= 3</li>
  </ul>
</blockquote>

<h3 id="gfni">GFNI</h3>

<p>A new feature introduced in ISA-L v2.31. The distributed interface is <code class="language-plaintext highlighter-rouge">ec_encode_data_avx512_gfni</code>. The Go EC library <a href="https://github.com/klauspost/reedsolomon">https://github.com/klauspost/reedsolomon</a> also provides a similar interface. The GFNI instruction set directly supports GF multiplication operations at the hardware level, eliminating the need for additional lookup tables, bit operations, etc. On corresponding platforms, my test results show that GFNI+AVX512 can <strong>double</strong> the performance compared to ordinary AVX512 single-threaded.</p>
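
<p>A minimal illustration of the instruction class (assuming a compiler and CPU with GFNI plus AVX-512). One caveat worth hedging on: <code class="language-plaintext highlighter-rouge">GF2P8MULB</code> multiplies modulo the AES polynomial 0x11b, while ISA-L's field uses 0x11d, which is why ISA-L's GFNI path is built on <code class="language-plaintext highlighter-rouge">GF2P8AFFINEQB</code> with a precomputed 8x8 bit matrix per coefficient; the sketch below only demonstrates the simpler multiply form.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;immintrin.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* multiply a region by constant c in GF(2^8) mod 0x11b, 64 bytes/step */
void gf_mul_region_gfni(uint8_t *dst, const uint8_t *src,
                        size_t len, uint8_t c)
{
    __m512i vc = _mm512_set1_epi8((char)c);
    for (size_t i = 0; i + 64 &lt;= len; i += 64) {
        __m512i x = _mm512_loadu_si512(src + i);
        _mm512_storeu_si512(dst + i, _mm512_gf2p8mul_epi8(x, vc));
    }
}
</code></pre></div></div>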

<h1 id="dialga">Dialga</h1>

<p>In our ICPP '25 work, <em>Dialga</em>, we discovered that in CPU-centric scenarios involving Persistent Memory (PM) or CXL-based slow memory (high access latency), memory access efficiency is the true performance bound. The core bottleneck lies in the <strong>inefficiency of hardware prefetchers</strong> (a viewpoint also shared by the ASPLOS '25 paper, <em>Melody</em>).</p>

<p>Through various tests, we identified several interesting degradation phenomena, similar to issues reported in the community <a href="https://github.com/intel/isa-l/issues/152">intel/isa-l #152</a>. We found that prefetching efficiency degrades significantly in scenarios involving wide stripes, small blocks, or high concurrency. We employed reverse engineering and profiling for attribution: for instance, in wide-stripe scenarios, the core L2 hardware stream prefetcher's Stream Entry Table—which can only track 32 unidirectional streams—becomes overwhelmed. This destroys the prefetcher's confidence, eventually causing it to disable itself.</p>

<p>For various reasons, the scope of this work remains focused on Erasure Coding (EC) acceleration. However, for emerging memory scenarios, the destruction of hardware prefetching efficiency is the "elephant in the room." Unless you are using a GPU with massive SMs, CPU-driven efficiency will always be undermined by the memory wall. At the system level, modifying black-box cache logic is difficult, and CXL performance varies across vendors. Therefore, strategies are needed to bridge this gap.</p>
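
<p>As a generic illustration of the software-prefetch side (not Dialga's actual policy): once the hardware stream prefetcher has backed off, explicit prefetches with a fixed lookahead distance can still hide slow-memory latency.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;xmmintrin.h&gt;   /* _mm_prefetch */

#define LOOKAHEAD 512    /* tune to the medium's load latency */

void xor_region_prefetch(uint8_t *dst, const uint8_t *src, size_t len)
{
    for (size_t i = 0; i &lt; len; i += 64) {           /* one cache line */
        _mm_prefetch((const char *)src + i + LOOKAHEAD, _MM_HINT_T0);
        for (size_t j = 0; j &lt; 64 &amp;&amp; i + j &lt; len; j++)
            dst[i + j] ^= src[i + j];
    }
}
</code></pre></div></div>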

<p>The design part in our paper was actually less interesting. It primarily involved implementing adaptive software/hardware prefetch scheduling strategies to address the discovered issues. Due to other ongoing commitments, I haven't fully open-sourced the code yet, but I will do so immediately after finishing my dissertation.</p>]]></content><author><name>Gray</name></author><category term="EC" /><summary type="html"><![CDATA[Accelerating Erasure Coding]]></summary></entry><entry><title type="html">Erasure Coding + Disaggregated Memory</title><link href="https://www.grayxu.cn/2024/12/05/EC-DM/" rel="alternate" type="text/html" title="Erasure Coding + Disaggregated Memory" /><published>2024-12-05T00:00:00+08:00</published><updated>2024-12-05T00:00:00+08:00</updated><id>https://www.grayxu.cn/2024/12/05/EC-DM</id><content type="html" xml:base="https://www.grayxu.cn/2024/12/05/EC-DM/"><![CDATA[<p>Disaggregated Memory is currently a hot topic in systems research, and distributed large-capacity memory clearly requires system-level reliability strategies. While replication has always been a default choice, with many related works, including recent ones like SWARM@SOSP'24, erasure coding is also an option. This article lists existing EC+DM works.</p>

<h2 id="ipdps-21-f-write-fast-rdma-supported-writes-in-erasure-coded-in-memory-clusters">IPDPS '21 F-Write: Fast RDMA-supported Writes in Erasure-coded In-memory Clusters</h2>

<p>Previous works like octopus@ATC'17 have reconstructed network I/O (like RPC) using one-sided verbs.</p>

<p>This paper focuses on the scenario of RDMA+EC, where updates are slow due to I/O amplification.</p>
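
<p>For background (my sketch, not F-Write's code), the amplification comes from parity maintenance: by linearity, a small write does not re-encode the stripe but ships a delta, p_new = p_old ^ g * (d_new ^ d_old), so every parity block must still be read and rewritten; these extra round trips are what the paper attacks.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* GF(2^8) multiply, e.g., a table-based implementation */
extern uint8_t gf_mul(uint8_t a, uint8_t b);

/* fold one data block's update into one parity block in place:
 * parity ^= g * (d_new ^ d_old), for encoding coefficient g */
void parity_delta_update(uint8_t *parity, const uint8_t *d_old,
                         const uint8_t *d_new, uint8_t g, size_t len)
{
    for (size_t i = 0; i &lt; len; i++)
        parity[i] ^= gf_mul(g, d_old[i] ^ d_new[i]);
}
</code></pre></div></div>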

<ol>
  <li>Implements a 2PC scheme for EC using one-sided writes.
    <ul>
      <li>Essentially, it's octopus's one-sided RPC.</li>
    </ul>
  </li>
  <li>Then, it builds on top of this with <em>speculative updates</em>, implementing in-flight data merging (merging multiple in-flight submissions) for EC.</li>
</ol>

<p><img src="https://www.grayxu.cn/images/2024/12/04/2024-12-04-21-52-02.png" alt="Pasted image 20241014090638.png" /></p>

<p><em>No NIC info provided.</em></p>

<h2 id="fast-22-hydra-resilient-and-highly-available-remote-memory">FAST '22 Hydra: Resilient and Highly Available Remote Memory</h2>

<p>SymbioticLab, available on arXiv since 2019.</p>

<p>Problems:</p>
<ul>
  <li><strong>High Latency:</strong> EC-based remote memory solutions cannot meet microsecond-level latency requirements due to encoding overhead, straggler issues, interrupt overhead, and data replication overhead.</li>
  <li><strong>Low Availability:</strong> Existing fault tolerance mechanisms based on replication and erasure coding can easily lead to data loss in the event of correlated failures due to the random placement of coding groups.</li>
</ul>

<p>Challenges:</p>
<ol>
  <li>Encoding overhead</li>
  <li>Splitting amplifies tail latency.</li>
  <li>Context switching overhead</li>
  <li>Copy overhead</li>
  <li>Placement strategy is not good for simultaneous errors.</li>
</ol>

<p>Design:</p>
<ul>
  <li><strong>Asynchronous encoded writes</strong> and delayed binding reads to hide latency.
    <ul>
      <li>Asynchronously Encoded Write: Fragments are not queued; similar to <em>late binding for writes</em>, the write is confirmed once the first <em>k</em> requests return.</li>
      <li>Late Binding: Basically, multi-fragment reads in an EC cache (see the sketch after this list).</li>
    </ul>
  </li>
  <li>In-Place Coding minimizes data copying. Unregisters after receiving k splits to prevent overwriting by subsequent splits. [<em>Will there be no registration performance issues?</em>].</li>
  <li>Run-to-Completion avoids context switching because the latency is very low.</li>
  <li>The CodingSets algorithm improves availability by carefully designing the placement strategy of coding groups, reducing the probability of data loss under correlated failures. [<em>A classic EC problem</em>, from <strong>CopySet</strong>].</li>
</ul>
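
<p>A hypothetical sketch of the late-binding read path (names invented for illustration; this is not Hydra's API): over-provision the reads, then decode from whichever k splits finish first, so a single straggler cannot set the tail latency.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

#define K     8
#define DELTA 2

extern void post_rdma_read(int split_id, uint8_t *buf);   /* assumed */
extern int  poll_one_completion(void);  /* returns a finished split id */
extern void ec_decode(uint8_t *bufs[], const int ids[], int k);

void late_binding_read(uint8_t *bufs[K + DELTA])
{
    int done_ids[K];
    for (int i = 0; i &lt; K + DELTA; i++)
        post_rdma_read(i, bufs[i]);       /* over-provisioned reads */
    for (int done = 0; done &lt; K; done++)
        done_ids[done] = poll_one_completion();
    ec_decode(bufs, done_ids, K);         /* the first k arrivals win */
}
</code></pre></div></div>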

<p>Open Source: <a href="https://github.com/SymbioticLab/hydra">https://github.com/SymbioticLab/hydra</a></p>

<p><em>Is it really a good idea to use late binding so extensively?</em></p>
<ul>
  <li><em>Increased number of network packets (could RDMA verb scalability be limited?).</em></li>
  <li><em>Higher computational pressure (mainly added latency; throughput is still line rate; can it be pipelined?).</em></li>
</ul>

<h2 id="osdi-22-carbink-fault-tolerant-far-memory">OSDI '22 Carbink: Fault-tolerant far memory</h2>

<p>Google</p>

<p>Follow-up work to Hydra@FAST'22. The problems are that, due to self-coding partitioning of a single object:</p>
<ol>
  <li>Multiple network I/O operations are required to reconstruct a page.
    <ol>
      <li><em>But late binding can be used, so it's okay. It seems like just a <strong>granularity</strong> trick.</em></li>
    </ol>
  </li>
  <li>Computation is centralized and cannot be offloaded to remote nodes.
    <ol>
      <li><em>But for DM, this is a false need. However, what about potential RNIC offload?</em></li>
    </ol>
  </li>
</ol>

<p>Design:</p>
<ol>
  <li>Therefore, it abstracts the concept of a span, where each span consists of multiple pages with similar object sizes. Then, cold/hot determination, grouping (clock algorithm), and eviction are performed asynchronously and transparently. [<em>Like slab</em>]. Note that the unit of processing here is clearly different from Hydra.
    <ul>
      <li>There are some system designs, but a lot of related work exists, especially in slab-related clustering.</li>
      <li><em>And from an EC perspective, one is self-coding and the other is cross-coding, making direct comparison difficult.</em></li>
    </ul>
  </li>
  <li>Asynchronous GC compaction (EC stripes).
    <ol>
      <li>Hydra does not have this issue because a set of stripes forms an object, so their lifecycles are tied together.</li>
      <li>Triggered by swap-out, and consistent completion needs to be ensured. 2PC is a naive approach.
        <ol>
          <li>EC-batch local</li>
          <li>EC-batch remote (offload parity calculation to remote nodes)</li>
        </ol>
      </li>
    </ol>
  </li>
</ol>

<blockquote>
  <p>"To reconstruct a span, a compute node only needs to contact a single memory node storing that span."</p>
</blockquote>

<h2 id="tpds-23-enabling-efficient-erasure-coding-in-disaggregated-memory-systems">TPDS '23 Enabling Efficient Erasure Coding in Disaggregated Memory Systems</h2>

<p>USTC ADSL</p>

<p>This work begins to focus on the problem from a DM perspective (i.e., purely memory nodes).</p>

<p>As one-sided RDMA latency drops to the microsecond level, encoding overhead degrades the performance of DM with EC. To enable efficient EC in DM, we thoroughly analyzed the coding stack from the perspectives of <strong>cache efficiency and RDMA transfer</strong>.</p>

<p>DM is a subset of RDMA, where local memory is more limited or only acts as a cache.
A natural approach is to use pipelining, but the challenges are:</p>
<ol>
  <li>Sub-stripe segmentation affects cache efficiency.</li>
  <li>Dedicated kernel coding reduces cache pollution.</li>
  <li>How object size impacts pipeline scheduling issues.</li>
</ol>

<p>MicroEC significantly reduces latency variation by reusing auxiliary encoding data. For example, it reduces the P99 latency of writing a 1 KB object by 27.81%. It optimizes the coding workflow and coordinates encoding and RDMA transfer through an exponential pipeline while carefully adjusting coding and transmission threads to minimize latency.</p>

<p>Note that this work only focuses on objects larger than 64KB.</p>

<p>Design:</p>
<ol>
  <li>Reuse auxiliary data.</li>
  <li>Propose efficient data structures to support the design.</li>
  <li>A non-blocking pipeline, and carefully adjust the coding and transmission threads.</li>
</ol>

<p>The sub-stripe size is a trade-off: larger sizes degrade performance (head/tail latency amplification), while smaller sizes increase network latency (but isn't it possible to overlap?).</p>

<p>This work has a more EC-centric flavor. It focuses on <strong>reusing auxiliary encoded data, using an exponential pipeline, and carefully adjusting coding and transmission threads</strong>.</p>

<p>Open Source: <a href="https://github.com/ADSL-EC/MicroEC">https://github.com/ADSL-EC/MicroEC</a></p>

<p>I don't understand why they chose to use Java's Crail-1.3 for the system. It's surprising to use a system with a built-in GC for something so sensitive. No wonder it can only handle large objects.</p>

<h2 id="sosp-24-aceso-achieving-efficient-fault-tolerance-in-memory-disaggregated-key-value-stores">SOSP '24 Aceso: Achieving Efficient Fault Tolerance in Memory-Disaggregated Key-Value Stores</h2>

<p>Pengfei Zuo, DM KVS + EC</p>

<p>Checkpointing for the index, EC for KV pairs.<br />
Key pieces: a differential checkpointing scheme, a version-based recovery method, a difference-based space reclamation mechanism, and a hierarchical recovery scheme.</p>

<p><strong>Challenges:</strong></p>
<ol>
  <li>Checkpoint network overhead, rollback leads to loss of recently submitted KV pairs.</li>
  <li>EC introduces GC and recomputation.</li>
  <li>Memory node recovery is slow due to computation (<em>pure decoding recovery issue?</em>).</li>
  <li>Checkpoint transfer can interfere with performance.</li>
</ol>

<p>Solution:</p>
<ol>
  <li><strong>Differential Checkpointing for Index:</strong> RNIC IOPS are limited. By reducing the bandwidth consumed by checkpoint transfers, Aceso reduces the performance interference of the checkpoint mechanism.
    <ol>
      <li>Calculate the index delta -&gt; LZ4 -&gt; write to MN -&gt; adjacent MN decompresses and then XOR updates. (The atomicity guarantee here comes from the fact that the index being written will not be included in this checkpoint).</li>
      <li>After rolling back the checkpoint, you need to scan to match KV pairs. Some RDMA CAS tricks are used to apply versions to slots.</li>
      <li><strong>Version-based recovery method:</strong>
        <ol>
          <li>Index Slot Versioning: The slot is extended to ensure the latest version. By reading the latest checkpoint and reprocessing recent KV pairs, Aceso ensures that the index can recover to the latest and consistent state after fault recovery using RDMA CAS.</li>
          <li>Index Versioning implements further strategies to accelerate recovery (narrowing the scan range, etc.).</li>
        </ol>
      </li>
    </ol>
  </li>
  <li><strong>Offline Erasure Coding for KV Pairs:</strong> Offline EC, leveraging the linear properties of X-code erasure codes, Aceso implements an efficient space reclamation mechanism for old KV pairs with almost no overhead.
    <ol>
      <li>Offline mainly means that the MN performs the operation in the background. First, write everything to the MN, then the MN's CPU performs encoding in the background.</li>
      <li>Metadata records the role, validity, bitmap, etc., similar to previous DM hash work. Then, it uses a slab-like management. <img src="https://www.grayxu.cn/images/2024/12/04/2024-12-04-22-05-02.png" alt="{93C569C8-3635-44DA-A5B5-7EE869B488CB}.png" /></li>
    </ol>
  </li>
  <li><strong>Hierarchical Recovery Scheme:</strong> By prioritizing the recovery of critical data (such as the index), Aceso ensures fast recovery of KV storage functionality, minimizing user disruption.
    <ol>
      <li>Metadata is directly replicated, the index is recovered to a previous version using checkpoints, and then KV pair versions are scanned.</li>
      <li>Block regions are recovered using EC, while parity is recovered in the background (delta merging occurs here).</li>
      <li>By default, it optimizes <em>pipelining</em> of RDMA reads and decoding, as well as doorbell batching.</li>
    </ol>
  </li>
</ol>

<p>CX3 cluster of CloudLab<br />
Aceso achieves significant throughput improvements in write requests (INSERT, UPDATE, DELETE). Among them, the improvement in DELETE requests is the most significant, reaching 2.67 times.</p>

<p>The baseline is the replicated FUSEE@FAST'23, but many improvements come from the significantly reduced overhead of the index after checkpointing.</p>

<p><a href="https://zhuanlan.zhihu.com/p/5100600418#:~:text=%E9%83%BD%E5%8F%AF%E4%BB%A5%E6%94%AF%E6%8C%81CXL%E3%80%82-,Aceso,-%3A%20Achieving%20Efficient%20Fault">IPADS Notes</a></p>

<hr />

<p>Random thoughts:</p>
<ul>
  <li>…</li>
</ul>]]></content><author><name>Gray</name></author><category term="EC" /><summary type="html"><![CDATA[Erasure Coding + Disaggregated Memory]]></summary></entry><entry><title type="html">Erasure Coding NIC Offload</title><link href="https://www.grayxu.cn/2024/12/04/EC-offload/" rel="alternate" type="text/html" title="Erasure Coding NIC Offload" /><published>2024-12-04T00:00:00+08:00</published><updated>2024-12-04T00:00:00+08:00</updated><id>https://www.grayxu.cn/2024/12/04/EC-offload</id><content type="html" xml:base="https://www.grayxu.cn/2024/12/04/EC-offload/"><![CDATA[<p>About offloading erasure coding to NICs.</p>

<blockquote>
  <p>This article was written a long time ago, and the logic is a bit muddled. I recently did some new research and found this draft in Obsidian. Although a little messy, it still contains some useful information, so I polished it with a large language model and am now publishing it. (Also because I suddenly realized it has been a long time since my last update, plus procrastination before a deadline.)</p>
</blockquote>

<p>High-speed networks like RDMA are rapidly developing. 800 Gbps NICs are on the horizon. Despite numerous efforts dedicated to accelerating Erasure Coding (EC), EC acceleration libraries like ISA-L haven't kept pace with the advancements in networking. Consequently, for traditional EC where the bottleneck was primarily network bandwidth, a portion of the bottleneck has shifted to computation. Furthermore, computation within EC is also suitable for offloading to processors on PCI-E, which can simultaneously save CPU resources.</p>

<blockquote>
  <p>In fact, multi-core throughput is sufficient, but yes, simple calculations like these should be offloaded to a DSA.</p>
</blockquote>

<p><a href="http://www.shihaiyang.me/"><em>Haiyang Shi</em></a> (now at ByteDance US Infrastructure System Lab), a PhD from OSU, has conducted significant research on offload encoding to NIC. <a href="https://etd.ohiolink.edu/acprod/odb_etd/etd/r/1501/10?clear=10&amp;p10_accession_num=osu160694815517547">his thesis</a></p>

<p><img src="https://www.grayxu.cn/images/2022/03/17/2022-03-17-17-27-35.png" alt="image.png" />
ps: Gibraltar is an EC library for GPUs</p>

<p>The figure qualitatively illustrates the current throughput performance of different acceleration libraries on various processors. ISA-L, due to its cache-friendly design, significantly outperforms others.</p>

<p>It's evident that while the granularity of PCI-E is 64B, the sweet spot for offload devices lies at the MB level. Therefore, for small object cases like KVS, offloading could introduce substantial latency overhead.</p>

<h2 id="hpdc19-umr-ec-a-unified-and-multi-rail-erasure-coding-library-for-high-performance-distributed-storage-systems">HPDC'19 UMR-EC: A Unified and Multi-Rail Erasure Coding Library for High-Performance Distributed Storage Systems</h2>

<p><strong>Goal:</strong> Integrate devices such as CPUs, GPUs, and network interface cards (i.e., multi-rail support) to execute erasure coding (EC) operations in parallel.<br />
<strong>Methods:</strong> A unified multi-rail EC library that can fully leverage heterogeneous EC encoders. The proposed interface is complemented by <em>asynchronous semantics, an optimized metadata-free scheme, and EC rate-aware task scheduling</em>, enabling efficient I/O pipelines.</p>

<p>This work focuses on two-level hierarchies: CPU+GPU and CPU+RNIC (note: only CX5 provides EC features).</p>

<p>(<em>Intuitively, disregarding implementation effort, the core focus is on managing the computing power of different devices and task distribution. Intensive multi-tasking is straightforward. For individual small tasks, such as degraded reads, how to distribute them to cores with different computing capabilities to avoid tail latency necessitates a predictor. However, this approach confines offloading to enhancing computing power rather than shortening paths, for instance.</em>)</p>

<p>The primary strategy aims to reduce latency by overlapping the three stages of data retrieval, coding, and data transmission, similar to a pipeline.</p>

<p><img src="https://www.grayxu.cn/images/2024/12/04/2024-12-04-20-49-03.png" alt="Aspose.Words.8c4c77a8-626b-4a47-85dc-22a66dae0175.054.png" /></p>

<p>Read operations follow a similar approach. The core idea is that by splitting each coding task into multiple subtasks and distributing them across various devices, these devices can independently and concurrently complete these subtasks without blocking communication or other processes. 
The strategy for controlling task distribution is simple: maintaining three additional queues and observing their flow rates.</p>
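
<p>A toy sketch of the chunked pipeline (my illustration with assumed asynchronous stage primitives, not UMR-EC's interface): at step s, chunk s is being fetched while chunk s-1 is encoded and chunk s-2 is sent, so the three stages overlap instead of running back to back.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/* assumed asynchronous primitives: fetch/send post the transfer and
 * return immediately, so the encode overlaps with both transfers */
extern void fetch_chunk(int i);
extern void encode_chunk(int i);
extern void send_chunk(int i);

void pipelined_encode(int nchunks)
{
    for (int s = 0; s &lt; nchunks + 2; s++) {
        if (s &lt; nchunks)
            fetch_chunk(s);                /* stage 1: data retrieval */
        if (s &gt;= 1 &amp;&amp; s - 1 &lt; nchunks)
            encode_chunk(s - 1);           /* stage 2: coding */
        if (s &gt;= 2)
            send_chunk(s - 2);             /* stage 3: transmission */
    }
}
</code></pre></div></div>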

<p>Note that although the GPU performs calculations, the CPU is still responsible for packet transmission.</p>

<p>Some data and benchmarks seem not to align with their previous work in EC-Bench '18?</p>

<h2 id="sc19-triec-an-efficient-erasure-coding-nic-offload-paradigm-based-on-tripartite-graph-model">SC'19 TriEC: An Efficient Erasure Coding NIC Offload Paradigm based on Tripartite Graph Model</h2>

<p>This paper discusses offloading the EC computation process to RDMA NICs. The problem is abstracted into a tripartite graph model. Additionally, some network primitives are designed to support this offloading. It is primarily a networking-focused work.</p>

<p><em>Here, a key difference from the HPDC work is that the RNIC can handle network packet transmission?</em></p>

<p>Two types of offload NICs:</p>
<ul>
  <li><strong>Incoherent:</strong> The CPU sends data in memory to the NIC for parity calculation, and subsequently issues a command to send the parity data.</li>
  <li><strong>Coherent:</strong> The NIC calculates and stores the parity data in memory before sending it.
    <ul>
      <li>Benefits: Reduces CPU overhead and DMA operations (i.e., fewer read operations).</li>
    </ul>
  </li>
</ul>

<p>However, the above optimization strategies have limitations:</p>
<ol>
  <li>Only one NIC is used for computation, leading to poor parallelism.</li>
  <li>NIC network resources are not fully utilized.</li>
  <li>Only the encode-and-send primitive is supported, not the receive-and-decode primitive.</li>
</ol>

<p><strong>Design:</strong></p>
<ul>
  <li>If we consider the original architecture as a bipartite graph (BiEC) (where the source and NIC are one node and the destination is another), their design is a tripartite graph. The encoding process is divided into multiple subsets and sent to multiple NICs on different nodes for calculation. This distributes the computational load across the NICs. The decoding process is similar, with decoding tasks also being decomposed and distributed.  [Implementation requires designating a leader within a group to manage request distribution.]
    <ul>
      <li>Finer-grained task decomposition enables improved parallelism.</li>
    </ul>
  </li>
</ul>

<p><img src="https://www.grayxu.cn/images/2024/12/04/2024-12-04-20-52-50.png" alt="Aspose.Words.8c4c77a8-626b-4a47-85dc-22a66dae0175.049.png" />
<img src="https://www.grayxu.cn/images/2024/12/04/2024-12-04-20-53-08.png" alt="Aspose.Words.8c4c77a8-626b-4a47-85dc-22a66dae0175.050.png" /></p>

<p><em>Hence, the process transforms from a single-hop to a double-hop-like network.</em></p>

<p>With in-band repair (as opposed to out-of-band repair), the intermediate result of a subtask can already be the desired result for a particular node at no extra cost, allowing direct delivery to that node and eliminating the write-back.
Furthermore, the initialization overhead of the NIC-supported EC offload APIs is significant, necessitating buffering.</p>

<p>It's unclear how the receive-and-decode primitive is implemented. The intermediate forwarding nodes seem to still require CPU involvement.
Note that this network communication still uses two-sided verbs, not DM.</p>

<p>Random Thoughts</p>
<ul>
  <li><em>this approach of combining subtasks to satisfy a specific node's requirements is kind of a trade-off between computation and networking?</em></li>
  <li><em>Furthermore, this writing strategy inherently requires writing a portion of the data to specific nodes and then having those nodes calculate parity. This is an asynchronous process. Does this compromise reliability?</em></li>
  <li><em>Many of the choices presented here appear to extend local encoding techniques to distributed systems. Can this be extended further?</em></li>
</ul>

<h2 id="sc20-inec-fast-and-coherent-in-network-erasure-coding">SC'20 INEC: Fast and Coherent In-Network Erasure Coding</h2>

<p>This work seamlessly integrates operations like receiving data, calculating erasure codes, and sending results, reducing CPU intervention.
RDMA is extended with EC primitives within network primitives, such as <code class="language-plaintext highlighter-rouge">encode_and_send</code>, but with further expansions like <code class="language-plaintext highlighter-rouge">PPR</code> for forwarding encoding types (e.g., <code class="language-plaintext highlighter-rouge">receive_ec_send</code>).  [<strong>ec/xor-send, recv-ec/xor-send, and recv-ec/xor</strong>]
The combination of these three primitives is sufficient to express the computation and communication patterns of all five advanced erasure coding schemes shown in Figure 1.</p>

<p>This enables the construction of distributed erasure coding pipelines and the triggering of pre-submitted tasks without CPU intervention.</p>

<p><img src="https://www.grayxu.cn/images/2024/12/04/2024-12-04-20-59-02.png" alt="Aspose.Words.8c4c77a8-626b-4a47-85dc-22a66dae0175.055.png" /></p>

<p>The modified Mellanox OFED driver supports INEC primitives.</p>

<p>The implementation uses RDMA WAIT (This seems more suitable for DPUs and Bluefield. If line rate is not achieved, it can be awkward).</p>

<h2 id="refer">refer</h2>

<ol>
  <li>Shi, Haiyang, Xiaoyi Lu, and Dhabaleswar K. Panda. "EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures." International Symposium on Benchmarking, Measuring and Optimization. Springer, Cham, 2018.</li>
  <li>Shi, Haiyang, et al. "High-performance multi-rail erasure coding library over modern data center architectures: early experiences." Proceedings of the ACM Symposium on Cloud Computing. 2018.</li>
  <li>Shi, Haiyang, et al. "UMR-EC: A unified and multi-rail erasure coding library for high-performance distributed storage systems." Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing. 2019.</li>
  <li>Shi, Haiyang, and Xiaoyi Lu. "Triec: tripartite graph based erasure coding NIC offload." Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2019.</li>
  <li>Shi, Haiyang, and Xiaoyi Lu. "INEC: fast and coherent in-network erasure coding." SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020.</li>
</ol>]]></content><author><name>Gray</name></author><category term="EC" /><summary type="html"><![CDATA[Erasure Coding NIC Offload]]></summary></entry><entry><title type="html">Data Movement with DMA/DSA Offloading</title><link href="https://www.grayxu.cn/2023/10/10/DMA-DSA/" rel="alternate" type="text/html" title="Data Movement with DMA/DSA Offloading" /><published>2023-10-10T00:00:00+08:00</published><updated>2023-10-10T00:00:00+08:00</updated><id>https://www.grayxu.cn/2023/10/10/DMA-DSA</id><content type="html" xml:base="https://www.grayxu.cn/2023/10/10/DMA-DSA/"><![CDATA[<p>About offloading memory data movement to DMA or DSA engines.</p>

<p>What is DMA: <a href="https://jianyue.tech/posts/dma/">https://jianyue.tech/posts/dma/</a></p>

<p>Pros:</p>
<ul>
  <li>offloading for async ops</li>
  <li>less cache pollution</li>
  <li>fewer CPU cycles (fewer memory I/O stalls)</li>
</ul>

<p>Cons:</p>
<ul>
  <li>higher latency
    <ul>
      <li>resource management (addr translation, …)</li>
    </ul>
  </li>
  <li>bandwidth limited</li>
</ul>

<blockquote>
  <p><strong>note: partially translated by ChatGPT</strong></p>
</blockquote>

<h2 id="ipdps-07-designing-efficient-asynchronous-memory-operations-using-hardware-copy-engine-a-case-study-wi">IPDPS '07 Designing Efficient Asynchronous Memory Operations Using Hardware Copy Engine: A Case Study wi</h2>

<p>OSU's work: K. Vaidyanathan, W. Huang, L. Chai, D. K. Panda</p>

<p>DMA copy offload:</p>
<ol>
  <li>Reduction in CPU Resources and Better Performance</li>
  <li>Computation-Memory Copy Overlap</li>
  <li>Avoiding Cache Pollution Effects</li>
</ol>

<p>But there are concerns about:</p>
<ol>
  <li>a single transfer cannot span discontinuous physical pages</li>
  <li>overlapping source and destination buffers</li>
  <li>cache coherence (CC) traffic on the bus</li>
</ol>

<p>They developed a kernel-level DMA copy facility, which can also be extended for IPC, and considered issues such as alignment, buffer locking, and multiple DMA channels.</p>

<p>Some experimental results:</p>
<ul>
  <li>Setup: Intel 3.46 GHz processors and 2MB L2 cache system with SuperMicro X7DB8+ motherboards that include 64-bit 133 MHz PCI-X interfaces. The machine is connected with an Intel PRO1000Mbit adapter. We used the Linux RedHat AS 4 operating system and kernel version 2.6.9-30. <strong><em>It doesn't mention memory, but it seems relevant.</em></strong></li>
  <li>For data that is hot in the cache, CPU memcpy completely dominates.
    <ul>
      <li>In contrast, at 16KB, 4-channel DMA already beats the CPU.</li>
    </ul>
  </li>
  <li>Beyond 2MB, the CPU lags behind DMA in terms of bandwidth.</li>
  <li>The benefit of computation/copy overlap becomes evident above the KB range (an overlap ratio of roughly 0.3-0.4 at 1KB); when the size is too small, DMA's own startup overhead erases the benefit.
    <ul>
      <li><img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-14-01-30.png" alt="image.png" /></li>
    </ul>
  </li>
  <li>On pure read workloads, CPU memcpy is affected by cache pollution, resulting in a 30% drop.</li>
</ul>

<h2 id="cluster-07-efficient-asynchronous-memory-copy-operations-on-multi-core-systems-and-ioa">CLUSTER '07 Efficient Asynchronous Memory Copy Operations on Multi-Core Systems and I/OA</h2>

<p>Work from the same OSU group, by K. Vaidyanathan, L. Chai, W. Huang, and D. K. Panda</p>

<p>The previous work seemed more focused on performance, while this work provides a transparent solution for multi-core system design. The overhead of initiating DMA can be assigned to a dedicated core, enabling better overlap of memory access and computation (up to 100%); multi-core systems can likewise dedicate a core to copying. [ <em>More memory bandwidth or more cores?</em> ]<br />
The "protect" strategy is used to achieve application transparency.</p>

<p><img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-14-08-30.png" alt="image.png" /></p>

<p><img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-14-08-37.png" alt="image.png" /></p>

<h2 id="intel-spdk--dma">Intel SPDK + DMA</h2>

<p>a simple callback interface from userspace</p>

<p><img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-14-20-28.png" alt="image.png" /></p>

<table>
  <thead>
    <tr>
      <th>Function</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><a href="https://spdk.io/doc/ioat_8h.html#a784c1a69962e0964cf6988badd945b6f" title="Enumerate the I/OAT devices attached to the system and attach the userspace I/OAT driver to them if d...">spdk_ioat_probe()</a></td>
      <td>Enumerate the I/OAT devices attached to the system and attach the userspace I/OAT driver to them if desired.</td>
    </tr>
    <tr>
      <td><a href="https://spdk.io/doc/ioat_8h.html#a87ce4a1c8bdd3fb69079ac51e00f92e5" title="Get the DMA engine capabilities.">spdk_ioat_get_dma_capabilities()</a></td>
      <td>Get the DMA engine capabilities.</td>
    </tr>
    <tr>
      <td><a href="https://spdk.io/doc/ioat_8h.html#ac1de22182996edecb435f9583665008d" title="Build and submit a DMA engine memory copy request.">spdk_ioat_submit_copy()</a></td>
      <td>Build and submit a DMA engine memory copy request.</td>
    </tr>
    <tr>
      <td><a href="https://spdk.io/doc/ioat_8h.html#a6025f251c715e93ea27ee03b5ab9557c" title="Build and submit a DMA engine memory fill request.">spdk_ioat_submit_fill()</a></td>
      <td>Build and submit a DMA engine memory fill request.</td>
    </tr>
  </tbody>
</table>

<p><img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-14-20-46.png" alt="image.png" /></p>

<p><a href="https://www.intel.com/content/www/us/en/developer/articles/technical/fast-memcpy-using-spdk-and-ioat-dma-engine.html">https://www.intel.com/content/www/us/en/developer/articles/technical/fast-memcpy-using-spdk-and-ioat-dma-engine.html</a></p>
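
<p>Putting the functions in the table together, a minimal sketch of an offloaded async memcpy looks roughly like this (error handling omitted; double-check the signatures in <code class="language-plaintext highlighter-rouge">spdk/ioat.h</code> against your SPDK version):</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include "spdk/env.h"
#include "spdk/ioat.h"
#include &lt;string.h&gt;

static struct spdk_ioat_chan *g_chan;
static volatile bool g_done;

/* attach the first I/OAT channel found while probing */
static bool probe_cb(void *ctx, struct spdk_pci_device *dev) { return g_chan == NULL; }
static void attach_cb(void *ctx, struct spdk_pci_device *dev,
                      struct spdk_ioat_chan *chan) { g_chan = chan; }
static void copy_done(void *arg) { g_done = true; }

int main(void) {
    struct spdk_env_opts opts;
    spdk_env_opts_init(&amp;opts);
    spdk_env_init(&amp;opts);
    spdk_ioat_probe(NULL, probe_cb, attach_cb);

    void *src = spdk_dma_zmalloc(4096, 64, NULL);  /* DMA-safe buffers */
    void *dst = spdk_dma_zmalloc(4096, 64, NULL);
    memset(src, 0xab, 4096);

    /* submit the async copy; the engine moves data while the CPU stays free */
    spdk_ioat_submit_copy(g_chan, NULL, copy_done, dst, src, 4096);
    while (!g_done)
        spdk_ioat_process_events(g_chan);          /* poll for completion */

    spdk_ioat_detach(g_chan);
    return 0;
}
</code></pre></div></div>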

<h2 id="fast-23-revitalizing-the-forgotten-on-chip-dma-to-expedite-data-movement-in-nvm-based-storage-systems">FAST '23 Revitalizing the Forgotten On-Chip DMA to Expedite Data Movement in NVM-based Storage Systems</h2>

<p>USTC's research focuses on synchronous data movement between NVM and DRAM.</p>

<blockquote>
  <p>Large size asynchronous movements on NVM are often considered a mere trick (e.g., HeMem@SOSP'21). But if we divide requests internally, does it also weaken the concepts of sync and async? In essence, everything discussed earlier is also synchronous.</p>
</blockquote>

<p>First, DMA on NVM was profiled, evaluating parallel copies for inter and intra requests, among other aspects. Some notable differences include:</p>
<ul>
  <li>Intra: Multi-channel DMA for PM writes is not very effective, while reads are feasible (<em>limited by write bandwidth</em>).</li>
  <li>Inter: with more than 4 concurrent DMA requests, bandwidth easily saturates.</li>
  <li>NVM management in kernel space differs from DRAM as the space is contiguous, allowing for simpler management.</li>
  <li>…</li>
</ul>

<p>A breakdown of read and write, starting directly from 16KB<br />
<img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-14-13-34.png" alt="image.png" /></p>

<p>Then, they proposed a fastmove library:</p>
<ul>
  <li>Batching (pin, submit, etc.), alignment, pre-allocation…</li>
  <li>DMA-CPU cooperation.</li>
  <li>Implementation involved modifying the DMA kernel module to better serve NVM-DRAM DMA copying, with additions to kernel file systems like Nova.</li>
  <li>Further, they developed a scheduler to manage DMA-CPU cooperation based on IO size, among other factors (see the sketch below).</li>
</ul>
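
<p>A minimal sketch of the size-based dispatch idea from the last bullet (the threshold and both backend hooks are made up here; fastmove's real scheduler also weighs thread counts, DMA channel contention, etc.):</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cstddef&gt;
#include &lt;cstring&gt;

// Hypothetical stand-ins for the kernel DMA path (fastmove exposes the real
// thing inside the file system via fm_copy_to_user()/fm_copy_from_user()):
void dma_submit_copy(void *dst, const void *src, size_t n) { std::memcpy(dst, src, n); }
void dma_wait_completion() {}

constexpr size_t kDmaThreshold = 16 * 1024;  // assumed cut-over point

void hybrid_copy(void *dst, const void *src, size_t n) {
    if (n &lt; kDmaThreshold) {
        std::memcpy(dst, src, n);      // small copies: the CPU wins (DMA startup cost)
    } else {
        dma_submit_copy(dst, src, n);  // large copies: offload to the DMA engine
        dma_wait_completion();         // the movement here is synchronous
    }
}
</code></pre></div></div>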

<p>It's worth noting that this work is 15 years newer than the previous one, so it leverages many new features in the kernel to further enhance DMA performance.</p>

<p><img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-14-13-43.png" alt="image.png" /></p>

<blockquote>
  <p>I wanted to see the microbenchmark difference between the CPU and the modified DMA path at small sizes (1KB? 4KB?), but they didn't provide it; many experiments focus on end-to-end latency in the file system.<br />
Only <code class="language-plaintext highlighter-rouge">fm_copy_to_user()</code> and <code class="language-plaintext highlighter-rouge">fm_copy_from_user()</code> are supported.<br />
The claim that the new hardware is compatible with general CXL seems a bit forced; it appears to rest more on documentation than on actual implementation.</p>
</blockquote>

<h2 id="arxiv-23-asplos-24-a-quantitative-analysis-and-guideline-of-data-streaming-accelerator-in-intel-4th-gen-xeon-scalable-processors"><del>arXiv '23</del> ASPLOS '24 A Quantitative Analysis and Guideline of Data Streaming Accelerator in Intel® 4th Gen Xeon® Scalable Processors</h2>

<!-- Related: *MICRO'23 CXL ≠ NUMA: Device-specific characteristics and effective use of true CXL memory* -->

<p>What is DSA?</p>
<ul>
  <li><a href="https://zhuanlan.zhihu.com/p/518157278">https://zhuanlan.zhihu.com/p/518157278</a></li>
</ul>

<p>DSA can offload operations including memcpy and even perform streaming CRC. A significant portion of the discussion is dedicated to the specification of DSA itself.</p>

<p>The key point is that DSA enables the calling end to operate with minimal latency:</p>
<ul>
  <li>Specialized hardware is used for IOMMU, allowing DSA to directly access SVM, thus eliminating the need for pinning as discussed earlier and avoiding most of the startup overhead.
    <blockquote>
      <ul>
        <li>Meanwhile, the address translations for the completion record, source, and destination buffers are performed by interacting with the on-device address translation cache (ATC) that interacts with the IOMMU on the SoC — a key difference from previous generations. This enables support of coherent shared memory between DSA and cores — they can access shared data in CPU virtual address space and thereby eliminate the need for applications to pin memory.</li>
      </ul>
    </blockquote>
  </li>
  <li>New instructions like <code class="language-plaintext highlighter-rouge">MOVDIR64B</code> bypass the cache to submit a 64B descriptor (see the sketch after this list).</li>
  <li>On-chip features include QoS and similar mechanisms.</li>
</ul>
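
<p>For a feel of the submission path, a sketch of a DSA memmove submitted with <code class="language-plaintext highlighter-rouge">MOVDIR64B</code>, assuming a dedicated work queue already configured (e.g., via accel-config) and exposed at <code class="language-plaintext highlighter-rouge">/dev/dsa/wq0.0</code>; the descriptor and completion-record structs come from the <code class="language-plaintext highlighter-rouge">linux/idxd.h</code> uapi:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;fcntl.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;unistd.h&gt;
#include &lt;x86intrin.h&gt;     // _movdir64b, _mm_pause (compile with -mmovdir64b)
#include &lt;linux/idxd.h&gt;    // dsa_hw_desc, dsa_completion_record
#include &lt;cstdint&gt;
#include &lt;cstring&gt;

int main() {
    int fd = open("/dev/dsa/wq0.0", O_RDWR);
    void *portal = mmap(nullptr, 4096, PROT_WRITE,
                        MAP_SHARED | MAP_POPULATE, fd, 0);

    alignas(32) static dsa_completion_record comp = {};
    static char src[4096], dst[4096];
    memset(src, 0xab, sizeof src);

    dsa_hw_desc desc = {};
    desc.opcode = DSA_OPCODE_MEMMOVE;
    desc.flags = IDXD_OP_FLAG_RCR | IDXD_OP_FLAG_CRAV; // ask for a completion record
    desc.src_addr = (uintptr_t)src;    // plain virtual addresses: with SVM + ATC
    desc.dst_addr = (uintptr_t)dst;    // there is no pinning step at all
    desc.xfer_size = sizeof src;
    desc.completion_addr = (uintptr_t)&amp;comp;

    _movdir64b(portal, &amp;desc);         // one 64B store submits the descriptor
    while (*(volatile uint8_t *)&amp;comp.status == 0)
        _mm_pause();                   // spin on the completion record

    munmap(portal, 4096);
    close(fd);
    return 0;
}
</code></pre></div></div>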

<p>Much of the data presented is quite intriguing:</p>

<p>Most notably, DSA directly bypasses many of the issues previously discussed regarding DMA from the hardware level, resulting in faster performance even for small sizes, such as 256B.<br />
<img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-14-25-27.png" alt="image.png" /></p>

<p>Async batching<br />
<img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-14-25-40.png" alt="image.png" /></p>

<p>Breakdown after batching<br />
<img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-14-25-48.png" alt="image.png" /></p>

<p>Saving CPU cycles<br />
<img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-14-25-57.png" alt="image.png" /></p>

<p>Finally, numerous guidelines are provided on maximizing throughput, interactions with the cache/memory hierarchy, and the configuration of DSA hardware resources.</p>

<blockquote>
  <p>DSA + memory-intensive systems? and nontrivial <br />
DSA + EC?</p>
</blockquote>]]></content><author><name>Gray</name></author><category term="System" /><summary type="html"><![CDATA[Data Movement with DMA/DSA Offloading]]></summary></entry><entry><title type="html">SW Prefetch in System&amp;amp;DB</title><link href="https://www.grayxu.cn/2023/10/09/prefetch/" rel="alternate" type="text/html" title="SW Prefetch in System&amp;amp;DB" /><published>2023-10-09T00:00:00+08:00</published><updated>2023-10-09T00:00:00+08:00</updated><id>https://www.grayxu.cn/2023/10/09/prefetch</id><content type="html" xml:base="https://www.grayxu.cn/2023/10/09/prefetch/"><![CDATA[<p>Prefetch to hide memory access latency (CPU stall)</p>
<blockquote>
  <ol>
    <li>What to prefetch</li>
    <li>When to prefetch</li>
    <li>Where to place the prefetched data</li>
  </ol>
</blockquote>

<p>Some ref:</p>
<ul>
  <li><a href="https://zhuanlan.zhihu.com/p/443829741">Prefetching、Interleaving 和 数据库</a></li>
  <li><a href="https://zhuanlan.zhihu.com/p/51588155">In-Memory DBMS 『Peloton』技术简述</a></li>
  <li><a href="https://howardlau.me/programming/improving-ilp-using-coroutines.html">使用协程提高流水线利用率 howardlau</a></li>
  <li><a href="https://stackoverflow.com/questions/72243997/how-to-use-software-prefetch-systematically/">https://stackoverflow.com/questions/72243997/how-to-use-software-prefetch-systematically/</a></li>
</ul>

<blockquote>
  <p><strong>note: partially translated by ChatGPT</strong></p>
</blockquote>

<h2 id="taco-14-when-prefetching-works-when-it-doesnt-and-why">TACO '14 When Prefetching Works, When It Doesn’t, and Why</h2>

<p><img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-14-41-35.png" alt="image.png" /></p>

<p>Discussing HW prefetch and SW prefetch:</p>
<ul>
  <li>SW prefetch is suitable for scenarios such as short arrays, sequential and irregular reads, etc., but issuing explicit prefetch instructions adds instruction overhead.</li>
  <li>HW prefetch heavily depends on the platform, as specific patterns need to be recognized.</li>
  <li>SW prefetch can disturb the HW prefetcher's training, which might negatively impact HW prefetching performance.</li>
  <li>HW prefetchers generally prefetch to L2 or L3, as the performance gap between L1 and L2 can be tolerable for an OOO CPU when the miss rate is below 20%.</li>
</ul>

<p>For more details on this topic, refer to: <a href="https://hackmd.io/@jserv/HJtfT3icx?type=view">https://hackmd.io/@jserv/HJtfT3icx?type=view</a></p>

<ul>
  <li>T0 (Temporal data) - Prefetch data into <strong>all levels of the cache hierarchy</strong>.</li>
  <li>T1 (Data about L1 cache misses) - Prefetch data into <strong>level 2 cache and higher levels</strong>.</li>
  <li>T2 (Data about L2 cache misses) - Prefetch data into <strong>level 3 cache and higher levels</strong>, or as implementation-specific choices.</li>
  <li>NTA (Non-Temporal data across all cache levels) - Prefetch data into non-temporal cache structures and prefetch it to locations close to the processor, minimizing cache pollution.
    <ul>
      <li><code class="language-plaintext highlighter-rouge">prefetchnta</code> is only used to prefetch into the USWC memory region using line fill buffers. Otherwise, it prefetches into L1 (and L3 inclusive L3 on CPU), bypassing L2 (as stated in Intel's optimization manual). You cannot weakly order loads from WB memory; there is no way to bypass cache coherence on WB.</li>
    </ul>
  </li>
</ul>

<p>For further insights, refer to: <a href="https://stackoverflow.com/questions/46521694/what-are-mm-prefetch-locality-hints">https://stackoverflow.com/questions/46521694/what-are-mm-prefetch-locality-hints</a></p>
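
<p>These locality hints map directly onto the <code class="language-plaintext highlighter-rouge">_mm_prefetch</code> intrinsic (note the pointer argument is a <code class="language-plaintext highlighter-rouge">const char*</code>):</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;xmmintrin.h&gt;

void warm_up(const double *a, const double *b, const double *stream) {
    _mm_prefetch((const char *)a, _MM_HINT_T0);       // T0: all cache levels
    _mm_prefetch((const char *)a, _MM_HINT_T1);       // T1: L2 and higher
    _mm_prefetch((const char *)b, _MM_HINT_T2);       // T2: L3 and higher
    _mm_prefetch((const char *)stream, _MM_HINT_NTA); // NTA: minimize pollution
    // ... overlap other work here; by the time these pointers are actually
    // dereferenced, the lines should already be close to the core.
}
</code></pre></div></div>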

<h2 id="but">BUT</h2>

<p>For databases, workloads such as <em>point chasing</em> are prevalent, as seen in hash joins, where <strong>HW prefetching is ineffective</strong>.</p>
<ul>
  <li>Hash join involves a large set of keys, and the task is to perform table lookups [used to join two tables, resulting in a significant amount of random memory access].</li>
  <li>MVCC chains are another example.</li>
  <li>However, integrating operations other than hash join into this context is certainly challenging, and it might require a case-by-case approach.</li>
</ul>

<p>While some works might only discuss hash joins, the ideas are generally applicable, so distinctions regarding whether the implementations in the articles are general are not considered here.</p>

<h2 id="icde-04-improving-hash-join-performance-through-prefetching">ICDE '04 Improving hash join performance through prefetching</h2>

<p>SW prefetch for hash join. In comparison to simple SW prefetch, which prefetches all related pages before access, further proposals include <em>Group Prefetching</em> and <em>Pipelined Prefetching</em>.</p>

<p><img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-14-47-39.png" alt="image.png" /></p>

<p>The idea is straightforward: for a batch of tasks, prefetch first, then perform the subsequent computations; by the time the data is actually needed, it is already in the cache. Pipelined prefetching goes a step further than group prefetching, which naturally imposes additional constraints on the size of each loop body, the batch size, and so on.</p>
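
<p>A minimal group-prefetching sketch for a hash probe (the bucket layout, trivial hash, and group size are assumptions; real group sizes are tuned to how many outstanding misses the core can sustain):</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cstddef&gt;
#include &lt;cstdint&gt;
#include &lt;xmmintrin.h&gt;

struct Bucket { uint64_t key; uint64_t payload; };

constexpr size_t G = 16;  // group size

uint64_t probe_batch(const Bucket *table, size_t mask,   // power-of-two table
                     const uint64_t *keys, size_t n) {
    uint64_t hits = 0;
    for (size_t base = 0; base + G &lt;= n; base += G) {
        // Stage 1: issue all prefetches for the group.
        for (size_t i = 0; i &lt; G; i++) {
            size_t slot = keys[base + i] &amp; mask;          // trivial hash for brevity
            _mm_prefetch((const char *)&amp;table[slot], _MM_HINT_T0);
        }
        // Stage 2: by now the first lines have arrived; do the actual probes.
        for (size_t i = 0; i &lt; G; i++) {
            size_t slot = keys[base + i] &amp; mask;
            hits += (table[slot].key == keys[base + i]);
        }
    }
    return hits;   // leftover tail (&lt; G keys) omitted for brevity
}
</code></pre></div></div>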

<p><a href="https://ieeexplore.ieee.org/document/1319989">https://ieeexplore.ieee.org/document/1319989</a><br />
<a href="https://zhuanlan.zhihu.com/p/443829741">https://zhuanlan.zhihu.com/p/443829741</a></p>

<p><strong><em>VLDB '17 Relaxed Operator Fusion for In-Memory Databases: Making Compilation, Vectorization, and Prefetching Work Together At Last</em></strong> is an example of a DB using GP (with SIMD). There are more detailed experiments, but it seems there are no significant changes in terms of methodology (although this isn't the focus of this article).</p>

<p><a href="https://zhuanlan.zhihu.com/p/51588155">https://zhuanlan.zhihu.com/p/51588155</a></p>

<h2 id="vldb-16-asynchronous-memory-access-chaining">VLDB '16 Asynchronous Memory Access Chaining</h2>

<p>AMAC provides a way to transform chained access patterns (point chasing with many pointer dereferences) into code amenable to SW prefetching, but this requires <strong>a significant amount of manual effort</strong>, even just for probing a hashtable.</p>

<p>The key observation is that not every access chain has a fixed size. Therefore, theoretically ideal pipeline prefetching isn't practical, and there will always be instances of <strong>pipeline stalls in irregular scenarios</strong>, similar to what occurs in superscalar processors.</p>

<p>Thus, they utilize a <strong>Finite State Machine (FSM)</strong> to abstract the entire process, enabling early modifications in the code to fill the pipeline effectively.</p>

<p><img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-15-17-18.png" alt="image.png" /></p>

<p>In comparison to simple group or simple pipeline strategies, AMAC represents a more dynamic approach, accounting for different sizes, among other factors. It emphasizes observing the dependency relationship to interleave prefetch and computation.</p>
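
<p>A condensed AMAC-style sketch for chained buckets (the node layout is hypothetical): each in-flight probe carries its own little FSM, and one loop round-robins across them, issuing a prefetch and switching away at every pointer hop, so chains of different lengths don't stall each other:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;cstddef&gt;
#include &lt;cstdint&gt;
#include &lt;xmmintrin.h&gt;

struct Node { uint64_t key, val; Node *next; };
struct Probe { const uint64_t *key = nullptr; Node *cur = nullptr; };

uint64_t amac_probe(Node *const *heads, size_t mask,
                    const uint64_t *keys, size_t n) {
    constexpr size_t kWays = 8;   // number of interleaved FSM instances
    Probe fsm[kWays];
    uint64_t found = 0;
    size_t next = 0, live = 0;
    do {
        for (auto &amp;s : fsm) {
            if (!s.key) {                      // FREE: launch a new probe
                if (next &gt;= n) continue;
                s.key = &amp;keys[next++];
                s.cur = heads[*s.key &amp; mask];
                live++;
            } else if (s.cur-&gt;key == *s.key) { // HIT: consume the value
                found += s.cur-&gt;val;
                s.cur = nullptr;
            } else {
                s.cur = s.cur-&gt;next;           // MISS: hop down the chain
            }
            if (s.key &amp;&amp; s.cur)                // prefetch the next hop, then
                _mm_prefetch((const char *)s.cur, _MM_HINT_T0); // switch away
            else if (s.key) { s.key = nullptr; live--; }        // DONE: retire
        }
    } while (live &gt; 0 || next &lt; n);
    return found;
}
</code></pre></div></div>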

<blockquote>
  <p>Interestingly, even if you handwrite the state machine, the compiler might still mess up your code, making it slower. <a href="https://www.youtube.com/watch?v=j9tlJAqMV7U">Watch this video</a> for more information.</p>
</blockquote>

<h2 id="vldb-17-interleaving-with-coroutines-a-practical-approach-for-robust-index-joins">VLDB '17 Interleaving with <em>Coroutines</em>: A Practical Approach for Robust Index Joins</h2>

<p>AMAC is excellent and can get arbitrarily close to the theoretical limit, but it is not practical. This work proposes using coroutine switching to replace manually interleaved execution, delegating the scheduling of interleaving to the compiler or the DB engine.</p>

<p>The advantage of coroutines lies in their low switching overhead. Unlike heavyweight threads, in the best-case scenario, the overhead can be almost equivalent to that of a single function call.</p>

<p><img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-15-20-37.png" alt="image.png" /></p>

<p>Significantly, SW prefetching is indeed sensitive to many aspects, and such interleaving can potentially impose greater pressure on address translation. The group size still needs manual adjustment.</p>

<p>Note that here, of course, coding still needs to be done in this pattern, and engine developers still need to manually adjust various details of coroutines.</p>

<h2 id="vldb-18-exploiting-coroutines-to-attack-the-killer-nanoseconds">VLDB '18 Exploiting <em>coroutines</em> to attack the "killer nanoseconds"</h2>

<p>The discussion also revolves around using coroutines in DB to reduce memory stalls in <strong>pointer-intensive data structures</strong>. They transformed hashtables, binary searches, and more complex data structures like masstree and bw-tree, conducting numerous tests.</p>

<p>There are many intricacies, but only a few minor details are listed here:</p>
<ul>
  <li>HW thread (referring to Hyper-Threading) prefetch is not as effective as coroutine prefetch (referenced in <em>eurosys22</em>).</li>
  <li>The performance of coroutines varies significantly among different compilers.</li>
  <li>…</li>
</ul>

<p>There is also a connection to the line fill buffer (since the line fill buffer is also used for non-temporal store and similar operations, it appears that there might be some competition in this context).</p>

<h2 id="vldb-20-interleaved-multi-vectorizing">VLDB '20 Interleaved Multi-Vectorizing</h2>

<p>This is the work of ECNU by Zhuhe Fang, Beilei Zheng, and Chuliang Weng. It is also about SIMD+SW prefetch, which is quite an interesting study.</p>

<ol>
  <li>The first issue pertains to SIMD, which consumes multiple data items at once, not all of which may be present in the cache, resulting in a direct slowdown.
    <ol>
      <li>An interesting experiment demonstrated that although SIMD is often claimed to be powerful, its performance rapidly degrades as the workload size increases, becoming similar to scalar operations. Cache misses take around 200 cycles, which is orders of magnitude higher than the computation cycles. [ <em>Does dense SIMD also lead to frequency reduction?</em> ]</li>
    </ol>
  </li>
  <li>The second issue concerns empty lanes within SIMD registers: at some code stages the vector might not be fully occupied, leaving hardware resources underutilized.</li>
</ol>

<p>They proposed IMV:</p>
<ol>
  <li>(Manually) interleave the execution of different SIMD computations to implement SW prefetching and reduce cache misses.</li>
  <li>Introduced residual vector states to merge with divergent vector states.</li>
</ol>

<p><img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-15-25-23.png" alt="image.png" /></p>

<p>Understanding the concept is straightforward from the diagram.</p>

<blockquote>
  <p>*it seems that at least one assumption here is that misalignment exists? If alignment is directly addressed, all cache misses or code situations would be fully aligned in 64B units, leading to complete consistency.</p>
</blockquote>

<p>Experimental results:</p>
<blockquote>
  <p>We compared the performance of IMV with various other methods on Hash join probe (HJP) and Binary tree search (BTS). As shown in Figure 6, in most cases, IMV outperforms other methods, being 2.38 times, 1.39 times, 2.22 times, 2.74 times, and 4.85 times faster than AMAC (scalar code interleaving), FVA (fully vectorized AMAC), RAV (direct vectorized AMAC), SIMD (direct SIMD coding), and Naive (basic scalar implementation), respectively. In this experiment, Intel Vtune was used to further analyze the advantages of IMV through microarchitectural indicators, and the time breakdown of its execution is shown in Figure 7. The figure explains why IMV is much faster than other methods. IMV not only reduces memory access overhead but also eliminates speculative execution errors. The results from Naive (pure scalar implementation) indicate that the execution time of HJP and BTS is mainly spent on memory access. Although AMAC optimizes memory access to improve performance, it is severely limited by speculative execution errors. Compared to Naive and AMAC, SIMD on the CPU only eliminates branch errors with little effect, as there are a large number of cache misses.</p>
</blockquote>

<p><a href="https://zhuanlan.zhihu.com/p/1466210">Link to Zhihu</a>
<a href="https://www.bilibili.com/video/BV1iJ411C7Jj">Link to Bilibili</a></p>

<h2 id="vldb-21-corobase-coroutine-oriented-main-memory-database-engine">VLDB '21 CoroBase: <em>coroutine</em>-oriented main-memory database engine</h2>

<p>Continuing from the previous work, this study also adopts a strategy of alternating coroutines to implement SW prefetch. They aimed for an implementation that is as automated as possible, which led them to C++20 coroutines. (Note that C++20 coroutines are stackless, unlike boost's stackful ones, and switch via suspension.)</p>

<p>It's essential to note that the interleaving granularity here is not a batch of threads, but a batch of <code class="language-plaintext highlighter-rouge">get()</code> operations. This distinction sets it apart from AMAC, showing better performance with a small number of threads.</p>

<p><img src="https://www.grayxu.cn/images/2023/10/19/2023-10-19-15-29-01.png" alt="image.png" /></p>
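
<p>A minimal C++20 stackless-coroutine sketch of the idea (this is not CoroBase's actual engine; their two-level design and scheduling are omitted): each <code class="language-plaintext highlighter-rouge">get()</code> suspends right after issuing a prefetch, and a flat round-robin loop resumes the whole batch so the cache misses overlap:</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;coroutine&gt;
#include &lt;cstdint&gt;
#include &lt;vector&gt;
#include &lt;xmmintrin.h&gt;

struct Probe {
    struct promise_type {
        uint64_t result = 0;
        Probe get_return_object() {
            return {std::coroutine_handle&lt;promise_type&gt;::from_promise(*this)};
        }
        std::suspend_always initial_suspend() { return {}; }
        std::suspend_always final_suspend() noexcept { return {}; }
        void return_value(uint64_t v) { result = v; }
        void unhandled_exception() {}
    };
    std::coroutine_handle&lt;promise_type&gt; h;
};

struct Node { uint64_t key, val; Node *next; };

Probe get(Node *head, uint64_t key) {            // one key lookup = one coroutine
    for (Node *n = head; n; n = n-&gt;next) {
        _mm_prefetch((const char *)n, _MM_HINT_T0);
        co_await std::suspend_always{};          // yield while the line flies in
        if (n-&gt;key == key) co_return n-&gt;val;
    }
    co_return 0;
}

uint64_t run_batch(std::vector&lt;Probe&gt; &amp;batch) {  // the "batch of get()s"
    uint64_t sum = 0;
    for (bool alive = true; alive; ) {
        alive = false;
        for (auto &amp;p : batch)
            if (!p.h.done()) { p.h.resume(); alive = true; }
    }
    for (auto &amp;p : batch) { sum += p.h.promise().result; p.h.destroy(); }
    return sum;
}
</code></pre></div></div>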

<p>The major problems they encountered were as follows:</p>
<ol>
  <li>Coroutine switching overhead: The overhead is significant every time there is a suspend, so they implemented a two-level system. However, some parts still require manual unwinding.</li>
  <li>Scheduling: Fixed batch size based on profiling, such as the optimal number of CPU hardware prefetches, constrained by the number of registers, and so on.</li>
  <li>Resource management: Adjusting the timing of resource entry and reclamation.</li>
  <li>Concurrency control and DB architecture choices: Thread-local transformation.</li>
</ol>

<p>A drawback of reducing the execution granularity is that the overhead of synchronization locks increases.</p>

<p>Tianzheng Wang's presentation can be found here: <a href="https://www.bilibili.com/video/BV1dX4y1K7U1">Bilibili Link</a></p>]]></content><author><name>Gray</name></author><category term="System" /><summary type="html"><![CDATA[SW Prefetch in System&DB]]></summary></entry><entry><title type="html">Fault Tolerance of Persistent Memory</title><link href="https://www.grayxu.cn/2022/09/29/fault-tolerant-PM/" rel="alternate" type="text/html" title="Fault Tolerance of Persistent Memory" /><published>2022-09-29T00:00:00+08:00</published><updated>2022-09-29T00:00:00+08:00</updated><id>https://www.grayxu.cn/2022/09/29/fault-tolerant-PM</id><content type="html" xml:base="https://www.grayxu.cn/2022/09/29/fault-tolerant-PM/"><![CDATA[<p>In this article, we will list several papers on local NVM/PM fault tolerance.</p>

<blockquote>
  <p>note:</p>
  <ul>
    <li>the fault tolerance in some papers may refer to crash consistency, but here we mainly focus on device failures.</li>
    <li>fault tolerance across networks is not in the scope here. Related works mostly use replications, from <em>Mojim</em> (ASPLOS '15) to <em>Rowan-KV</em> (OSDI '23)</li>
  </ul>
</blockquote>

<p>changelog:</p>
<ul>
  <li>2/17 add Kamino-Tx</li>
  <li>2/24 add TENET</li>
  <li>4/10 add Pavise</li>
</ul>

<h1 id="problems">problems</h1>

<p>Define <em>data reliability</em> problems on PM:</p>
<ul>
  <li>media errors
    <ul>
      <li>cell wear out</li>
      <li>bit flip</li>
      <li>…</li>
    </ul>
  </li>
  <li>software scribbles
    <ul>
      <li>bugs in firmware level</li>
      <li>exposed addresses</li>
    </ul>
  </li>
  <li>crash inconsistency</li>
  <li>…</li>
</ul>

<p>ECC is only useful for small-scale media errors.</p>

<h1 id="existing-works">existing works</h1>

<h2 id="system">System</h2>

<blockquote>
  <p>seems like lots of works focus on <strong>transactional persistent memory</strong>, but lib details won't be mentioned below. check papers to know more</p>
</blockquote>

<h3 id="replication-style">Replication Style</h3>

<ul>
  <li><strong>libpmemobj-R</strong>
    <ul>
      <li>replication across different PM devices (pm pools)</li>
      <li><a href="https://pmem.io/blog/2015/11/an-introduction-to-replication/">more details</a></li>
    </ul>
  </li>
  <li><strong><em>Kamino-Tx</em> (EuroSys '17)</strong>
    <ul>
      <li><img src="https://www.grayxu.cn/images/2023/02/17/2023-02-17-17-43-10.png" alt="image.png" /></li>
      <li>Async backup to prevent additional data copy in the critical path of atomic ops.
        <ul>
          <li>only for write-intensive hot data to save some PM space</li>
        </ul>
      </li>
      <li>extend to chain replication (fault tolerance)
        <ul>
          <li><em>backup for crash consistency + replication for fault tolerance</em>: merge them, and only keep backup for the head of the chain</li>
          <li>to ensure the characteristic of chain replication, space cost is $(f+1+1+α)*datasize$ (note: backup and the head are in the same node)
            <ul>
              <li>1 for recovering non-head node</li>
              <li>α for backup</li>
            </ul>
          </li>
        </ul>
      </li>
    </ul>
  </li>
  <li><strong><em>Romulus</em> (SPAA '18)</strong>
    <ul>
      <li>async 2 reps for txn by only 4 fences (just like Kamino-Tx)</li>
    </ul>
  </li>
  <li><strong><em>TENET</em> (FAST '23)</strong>
    <ul>
      <li><em>TimeStone</em> (ASPLOS '20)
        <ul>
          <li>MVCC (<em>logging</em>) to scale performance: timestamp version control, non-blocking reads, etc…
            <ul>
              <li>version chain in DRAM (TLog), a compressed checkpoint version (<em>group commit</em>) in PM (CLog), Obj in PM (maybe stale).</li>
              <li>So that recovery can use small op log <em>(params to replay txn)</em> to replay txn</li>
            </ul>
          </li>
          <li>[<a href="https://wangziqi2013.github.io/paper/2020/08/24/timestone.html">more details from GHC 6023</a>]: "<em>TimeStone is essentially redo-log + DRAM Buffer + group commit + operation logging.</em>"</li>
        </ul>
      </li>
      <li>TENET builds on TimeStone to create protections for spatial safety &amp; temporal safety of memory access</li>
      <li>use local SSD replication:
        <ul>
          <li>sync replications: Clog and op log</li>
          <li>async replications: data obj</li>
        </ul>
      </li>
    </ul>
  </li>
</ul>

<h3 id="coding-style">Coding Style</h3>

<ul>
  <li><strong><em>NOVA-Fortis</em> (SOSP '17) from NVSL</strong>
    <ul>
      <li>"TickTock for NVMM data structures that combines atomic update with error detection and recovery" (just like <em>Kamino-Tx</em> rep style)</li>
      <li>CRC32 checksums to detect errors (including silent errors unlike <em>TENET</em>)</li>
      <li>replicated checksums of data</li>
      <li>RAID-4 style parity to hide parity bits from application's address space</li>
      <li>as NOVA is based on CoW, UPDATE is "<em>allocates new pages, populates them with the written data, computes the checksums and parity, and finally commits the write with an atomic log appending operation</em>"</li>
      <li><img src="https://www.grayxu.cn/images/2022/09/30/2022-09-30-16-45-58.png" alt="image.png" /></li>
      <li>eval on PMEP: 1) checking and maintaining checksums and parity for file data incurs a steep cost for both reads and writes. 2) …</li>
      <li><a href="https://nan01ab.github.io/2018/08/NOVA-Fortis.html">more details</a></li>
      <li><a href="https://github.com/NVSL/linux-nova">source codes</a></li>
    </ul>
  </li>
  <li><strong><em>Pangolin</em> (ATC '19) from NVSL</strong>
    <ul>
      <li>replicated metadata</li>
      <li>1% XOR parities for 99% objects (with checksums)</li>
      <li>in-place delta update data with <strong>replicated redo logging</strong> in PM
        <ul>
          <li><em>So Cocytus (FAST '16)…</em>
            <blockquote>
              <p>why replicated redo? the only additional protection from replicated redo that I can think of is if data and parity are both crash-inconsistent and errors are found on a redo log entry.</p>
            </blockquote>
          </li>
        </ul>
      </li>
      <li>Adler32 for incremental checksums</li>
      <li>build a lib like libpmemobj on Optane PM</li>
      <li>Concurrent updating of data is not supported, but concurrent updating of parity is supported (data in the same stripe but not the same object)
        <ul>
          <li>atomic XOR is a simple solution but cannot be vectorized, on the other hand, vectorized XOR needs range locks =&gt; hybrid approach on an 8KB threshold
            <blockquote>
              <p>but page size is 4KB?</p>
            </blockquote>
          </li>
        </ul>
      </li>
      <li><img src="https://www.grayxu.cn/images/2022/09/30/2022-09-30-16-46-51.png" alt="image.png" /></li>
      <li><a href="https://nbjl.nankai.edu.cn/2020/0306/c12124a266826/page.htm">more details</a></li>
    </ul>
  </li>
  <li><strong><em>Vilamb</em> (arXiv '20) from Rajat Kateja, Andy Pavlo.</strong> (also named <em>ANON</em> I guess)
    <ul>
      <li><em>Pangolin</em> sync-updates parities -&gt; expensive -&gt; how to loosen the guarantee?</li>
      <li>two background threads for async: one for checking parities and one for updating. pros:
        <ol>
          <li>checksums are in page granularity -&gt; read amplification. async process can merge several ops to save BW.</li>
          <li>utilize wasted "dirty" bits in the page table
            <blockquote>
              <p>finding the gap from old redundant design is cool, it reminds me of DaxVM@MICRO'22</p>
            </blockquote>
          </li>
        </ol>
      </li>
      <li>rich experiments but on emulated NVM</li>
      <li>some metadata is still volatile -&gt; needs batteries</li>
      <li><a href="https://wangziqi2013.github.io/paper/2020/01/15/vilamb.html">more details</a></li>
    </ul>
  </li>
  <li><strong><em>Pavise</em>@PACT'22</strong>
    <ul>
      <li><em>Pangolin</em>'s following work</li>
      <li>one redo log</li>
      <li>a lib with less intrusive changes to the application
        <ul>
          <li>PMDK access tracking</li>
        </ul>
      </li>
      <li><a href="https://github.com/hjjq/pavise-pact22-artifact">source codes</a></li>
      <li>…</li>
    </ul>
  </li>
</ul>

<h2 id="architecture">Architecture</h2>

<p>Arch papers on PM fault tolerance are usually about hacking ECC modules…</p>

<ul>
  <li><strong><em>TVARAK</em> (ISCA '20) from Rajat Kateja</strong>
    <ul>
      <li>calculating parities like <em>Pangolin</em> is too slow (may lead to 50% drops)</li>
      <li>add a new HW controller beside LLC to offload computation (<em>maintain parities</em>)</li>
      <li>simulation on zsim</li>
    </ul>
  </li>
  <li><strong><em>Polymorphic Compressed Replication</em> (SYSTOR '20)</strong>
    <ul>
      <li>for columnar storage models on hybrid memory</li>
      <li>use compression to reduce writes to NVM as replications</li>
    </ul>
  </li>
  <li><strong><em>ECP</em> (ISCA '10)</strong>
    <ul>
      <li>Error-Correcting Pointers (ECP) to remap locations instead of ECC, for the ECC blocks wearing out problem</li>
      <li>and so many works on this approach, like zombie memory, chipkill, etc. <a href="https://my.eng.utah.edu/~cs7810/#:~:text=Mo%2027th%20Jan%3A%20Memory%20systems%3A%20reliability%2C%20PCM">more</a></li>
    </ul>
  </li>
  <li><strong><em>WoLFRaM</em> (ICCD '20)</strong>
    <ul>
      <li>wear-leveling + fault tolerance with programming address decoder (PRAD)</li>
    </ul>
  </li>
</ul>

<h1 id="design-space">design space</h1>

<ul>
  <li>LB + fault tolerance</li>
  <li>fault domains level
    <ul>
      <li>6~8 DIMMS but with 1% parity?
        <blockquote>
          <p>the difference of error granularity</p>
        </blockquote>
      </li>
    </ul>
  </li>
  <li>real error patterns of persistent memory</li>
  <li>not very erasure-coding style?</li>
  <li>not very optane style?</li>
  <li>only txn make sense?
    <ul>
      <li>workloads related</li>
    </ul>
  </li>
  <li>…</li>
</ul>]]></content><author><name>Gray</name></author><category term="PM" /><summary type="html"><![CDATA[Fault Tolerance of Persistent Memory]]></summary></entry><entry><title type="html">QoS on Persistent Memory Systems</title><link href="https://www.grayxu.cn/2022/03/12/Qos-PM/" rel="alternate" type="text/html" title="QoS on Persistent Memory Systems" /><published>2022-03-12T00:00:00+08:00</published><updated>2022-03-12T00:00:00+08:00</updated><id>https://www.grayxu.cn/2022/03/12/Qos-PM</id><content type="html" xml:base="https://www.grayxu.cn/2022/03/12/Qos-PM/"><![CDATA[<p>QoS (LB) on persistent memory systems to avoid interference.</p>

<h1 id="problem">Problem</h1>

<p>QoS is about controlling priority among different applications, e.g., latency-critical tasks against throughput tasks. Normally the resource those tasks fight over is bandwidth, which is a simple metric and easy to monitor, so that best-effort tasks won't affect latency-critical ones. Some QoS works focused on DRAM[7].</p>

<p>Similarly, a hybrid access pattern on persistent memory incurs a dramatic performance drop. But it's trickier: some other variables will also affect the overall performance.</p>

<h2 id="interference">interference</h2>

<p>[1] found that simple cache eviction strategies (like FIFO) without too much data migration can beat complex ones.<br />
[2] found:</p>
<blockquote>
  <ol>
    <li>The interference between a process accessing DRAM and one performing random <strong>reads to PM</strong> is small.</li>
    <li>When a process accessing DRAM is concurrently executed with one performing frequent <strong>writes to PM</strong>, the performance of the former is significantly degraded but that of the latter is not.</li>
  </ol>
</blockquote>

<p>multi-fold interference source:</p>
<ol>
  <li>iMC WPQ size is designed for fast DRAM access, so too many slow writes to PM will easily fill it up and block DRAM writes</li>
  <li>DDR bus</li>
  <li>PM write amplification</li>
  <li>…</li>
</ol>

<p>A couple of recent works focus on this QoS problem, including NVMSA '20, APSys '21, FAST '22 (2) [3-6]. QoS is all about monitoring and control, so let's discuss them separately.</p>

<h1 id="monitor">Monitor</h1>

<p>QoS systems should first know when interference shows up and who causes it.</p>

<p>FairHym[3] sets up a couple of thresholds:<br />
<img src="https://www.grayxu.cn/images/2022/03/16/2022-03-16-11-02-31.png" alt="image.png" /><br />
And the exact values come from experiments, so they are coupled to the workload and HW settings.</p>

<hr />

<p>Dicio[4] only considers the situation of one best-effort task and one latency-critical task.</p>

<p>Similarly, Dicio has some rules drawn from "a priori knowledge".</p>

<p><img src="https://www.grayxu.cn/images/2022/03/16/2022-03-16-11-11-20.png" alt="image.png" /></p>

<p>$T_{DRAM}$ here is dynamic (5-30 GB/s) and depends on the access pattern on PM.<br />
estimated media-level write BW (μs level) = request-level BW * recent WA ratio(ms level)</p>

<blockquote>
  <p>note: different cases need different strengths on control?</p>
</blockquote>

<hr />

<p><em>MT^2</em>[5] is in kernel space, using Intel Memory Bandwidth Monitoring (MBM) and some toolkits to collect data (<em>Dicio[4] claims that MBM has some severe bugs for now</em>). So they can get the bandwidth of DRAM and PM (<em>a lot of effort here to implement, check paper details</em>).</p>

<p>Read latency is derived from $RPQ_O/RPQ_I$ (read-pending-queue occupancy over inserts), and write latency from periodic writes. Latency is then used to detect interference. <strong>The latency threshold differs depending on the access type</strong> (random/seq + read/write).</p>

<blockquote>
  <p>note: the correlation between latency and bandwidth is basically linear. So the detection here is equal to BW?</p>
</blockquote>

<hr />

<p>in NyxCache, "<em>if the maximum IOPS of pattern A is MaxIOPSA, then the cost of each operation of pattern A is 1/MaxIOPSA.</em>"</p>

<blockquote>
  <p>note: the implicit assumption here is that the cost is linear and ignoring cache effects. emmmmmmmm</p>
</blockquote>

<p>In contrast to the above, NyxCache[6] finds the victim application whose throttling brings the biggest performance gain for the same amount of suppression.<br />
Like the fig below: we want to choose one app between B and C to throttle so as to ensure A's perf.<br />
<img src="https://www.grayxu.cn/images/2022/03/16/2022-03-16-18-37-22.png" alt="image.png" /></p>

<blockquote>
  <p>note: you will get it after you find the author is Kan Wu, who is the author of <a href="https://www.grayxu.cn/2021/03/17/OrthusKV/">The Storage Hierarchy is Not a Hierarchy</a>.</p>
</blockquote>

<h1 id="control">Control</h1>

<p>After finding out which process should be throttled, QoS systems need to control it efficiently.</p>

<p>FairHym[3] assumes VM-style deployments where every core is used exclusively, so throttling the frequency of the target cores can reduce the BW on PM.</p>

<blockquote>
  <p>note: a weak assumption, and it wastes computing resources</p>
</blockquote>

<hr />

<p>Dicio[4] tests some methods including <strong><a href="https://www.intel.com/content/www/us/en/developer/articles/technical/introduction-to-memory-bandwidth-allocation.html">MBA</a></strong> (Intel Memory Bandwidth Allocation, basically it's delay injection in memory requests) and limit frequency:</p>

<p><img src="https://www.grayxu.cn/images/2022/03/14/2022-03-14-16-15-21.png" alt="image.png" /></p>

<p><em>*_stride means writing 64B at each 256B-aligned addr to amplify writes.</em> They claim that the old method can't handle <em>PM_write_stride</em>.</p>

<blockquote>
  <p>note: maybe just not inject enough delays? like 1%.</p>
</blockquote>

<p>Dicio controls the number of cores assigned to best-effort tasks to manage their BW, and can even go below a single core via duty cycling.</p>

<hr />

<p><em>MT^2</em>[5] tries to combine MBA with CPU resource limits. MBA only controls the ratio of delay injection, so the same throttling value may behave differently under different memory access patterns, while throttling CPU resources can reduce BW almost linearly. What's worse, MBA doesn't work on PM.
<img src="https://www.grayxu.cn/images/2022/03/14/2022-03-14-20-31-35.png" alt="image.png" /></p>

<p>Instead of changing the frequency or the number of cores, MT^2 changes the CPU quota of a thread via Linux cgroup control, which is <strong>finer</strong>-grained.</p>

<p><img src="https://www.grayxu.cn/images/2022/03/15/2022-03-15-17-42-09.png" alt="image.png" /></p>

<p>Table 2: pagerank under 10% MBA is faster than 50% CPU with lower BW.<br />
The reason is simple: 50% CPU slows every instruction instead of only memory access.</p>

<blockquote>
  <p>note: but sorting and coding are compute-intensive…</p>
</blockquote>

<p><img src="https://www.grayxu.cn/images/2022/03/15/2022-03-15-17-41-45.png" alt="image.png" /></p>

<p>so they use MBA to throttle DRAM access and CPU scheduling for NVM.</p>

<blockquote>
  <p>same question here, maybe just because MBA is designed for fast DRAM access, and the ratio of injected delays is not big enough (throttling value &lt; 10 in fig.4)</p>
</blockquote>

<hr />

<p><em>NyxCache</em>[6]</p>
<blockquote>
  <p>quote "To mimic the behavior of Intel MBA, Nyx implements simple throttling by delaying PM accesses at user-level."<br />
"Our current implementation adds delays in units of 10ns with <strong>a simple computation-based busy loop</strong>. In some cases PM operations may need to be delayed indefinitely (e.g., when a resource limit is reached); in this case, PM operations are stalled until the Nyx controller sets the delay to a finite value"</p>
</blockquote>

<p>Applications access PM through the NyxCache interface, so NyxCache can implement a user-level MBA-like mechanism with a fixed ratio. And it <strong>worked</strong>, so <em>delay injection</em> itself is not the problem.</p>
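
<p>The mechanism itself is tiny. A sketch of this style of user-level throttling (10ns delay units per the quote above; the controller logic that sets the knob is omitted):</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;atomic&gt;
#include &lt;chrono&gt;

std::atomic&lt;long&gt; g_delay_units{0};   // set by the controller thread

inline void throttle_point() {        // call in the PM read/write path,
    long units = g_delay_units.load(std::memory_order_relaxed);
    if (units &lt;= 0) return;           // before issuing the actual PM access
    auto until = std::chrono::steady_clock::now() +
                 std::chrono::nanoseconds(10 * units);
    while (std::chrono::steady_clock::now() &lt; until) {
        // busy loop: burns CPU instead of sleeping, since sub-microsecond
        // sleeps are unreliable and a context switch would cost far more
    }
}
</code></pre></div></div>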

<h1 id="experiments">experiments</h1>

<p>…</p>

<h1 id="random-thoughts">random thoughts</h1>

<ul>
  <li>QoS from the application level instead of the system level, to bypass some limits of the bottom-up view.</li>
  <li>Will injected delays waste CPU resources? Context-switching cost and CPU resources may be a trade-off here…</li>
  <li>Many networking QoS papers utilize or even create more "sensors" than all above. Can we mimic them without HW support?</li>
  <li>…</li>
</ul>

<h1 id="ref">ref</h1>

<ol>
  <li>Kassa, Hiwot Tadese, et al. "Improving Performance of Flash Based Key-Value Stores Using Storage Class Memory as a Volatile Memory Extension." 2021 USENIX Annual Technical Conference (USENIX ATC 21). 2021.</li>
  <li>Imamura, Satoshi, and Eiji Yoshida. "The analysis of inter-process interference on a hybrid memory system." Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops. 2020.</li>
  <li>Imamura, Satoshi, and Eiji Yoshida. "FairHym: Improving Inter-Process Fairness on Hybrid Memory Systems." 2020 9th Non-Volatile Memory Systems and Applications Symposium (NVMSA). IEEE, 2020.</li>
  <li>Oh, Jinyoung, and Youngjin Kwon. "Persistent memory aware performance isolation with dicio." Proceedings of the 12th ACM SIGOPS Asia-Pacific Workshop on Systems. 2021.</li>
  <li>Yi, Jifei, et al. "MT2: Memory Bandwidth Regulation on Hybrid NVM/DRAM Platforms." 20th USENIX Conference on File and Storage Technologies (FAST 22), Santa Clara, CA. 2022.</li>
  <li>Wu, Kan, et al. "NyxCache: Flexible and Efficient Multi-tenant Persistent Memory Caching." 20th USENIX Conference on File and Storage Technologies (FAST 22), Santa Clara, CA. 2022.</li>
  <li>Fried, Joshua, et al. "Caladan: Mitigating interference at microsecond timescales." 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 2020.</li>
</ol>]]></content><author><name>Gray</name></author><category term="PM" /><summary type="html"><![CDATA[PM]]></summary></entry><entry><title type="html">RDMA+NVM remote persistence</title><link href="https://www.grayxu.cn/2021/10/19/remote-persistence/" rel="alternate" type="text/html" title="RDMA+NVM remote persistence" /><published>2021-10-19T00:00:00+08:00</published><updated>2021-10-19T00:00:00+08:00</updated><id>https://www.grayxu.cn/2021/10/19/remote-persistence</id><content type="html" xml:base="https://www.grayxu.cn/2021/10/19/remote-persistence/"><![CDATA[<h1 id="problem">Problem</h1>
<p>Due to the RDMA NIC implementation, RNICs don't have a remote persistent flush primitive. One-sided writes from clients first land in the volatile cache on the RNIC, and the RNIC sends the ACK back before the data is written to PM. As a result, a power loss can easily break remote data persistence.</p>

<p>Besides, <em>one-sided commit</em>[3] is immature or suffers poor performance.</p>

<p>Some researchers place this problem on the network systems level instead of the storage system level, and so ignore it. But for now, this problem does affect system availability.</p>

<h1 id="old-methods">Old methods</h1>

<p>Two-sided RPC communication can avoid this problem, but two-sided ops can't fully exploit the RNIC's performance and also lack scalability[1].</p>

<p>For one-sided ops, a strawman implementation is sending a write request followed by a read request. But the cost of 2 RTTs is still too high.</p>

<p><img src="https://www.grayxu.cn/images/2021/10/15/2021-10-15-17-43-51.png" alt="image.png" /></p>

<h1 id="new-methods">New methods</h1>

<p>[1] uses READ after WRITE, but with <strong>outstanding request</strong>[2] + <strong>doorbell batching</strong>[8] to process persistent WRITE requests, which reduces latency from 4μs (2 RTTs) to 3μs.</p>

<ul>
  <li>outstanding request: a WR that was posted to a work queue whose completion has not been polled (like an unfinished request?)</li>
  <li>doorbell batching (just batching on RDMA)</li>
</ul>

<blockquote>
  <p>quote "<em>Specifically, outstanding request [23] allows us using the completion of READ as the completion of the WRITE, <strong>as long as the two requests are sent to the same QP</strong>. Since the READ to persist the WRITE must be post to the same QP as the WRITE (§2.3), we no longer need to wait for the first WRITE to complete. Thus, this optimization reduces the wait time of the first network roundtrip. Applying outstanding request to persistent WRITE is correct because first, later READ flushes previously WRITE [19], and RNIC processes requests from the same QP in a FIFO order [6].<br />
Based on outstanding request, doorbell batching [24] further allows us to send the READ and WRITE in one request using the more CPU and bandwidth efficient DMA, reducing the latency of posting RDMA requests. <br />
On our testbed, a single one-sided RDMA request takes 2µs. Thus, a strawman implementation of remote persistent write uses 4µs. After applying H9, one-sided remote persistent write takes 3µs latency to finish</em>"</p>
</blockquote>
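
<p>A sketch of the optimized path in plain libibverbs (QP/MR setup omitted; <code class="language-plaintext highlighter-rouge">flush_sge</code> points at a small local buffer for the READ to land in): the two WRs are chained into one <code class="language-plaintext highlighter-rouge">ibv_post_send</code> (doorbell batching), and only the READ is signaled (outstanding request):</p>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;infiniband/verbs.h&gt;
#include &lt;cstring&gt;

int persistent_write(ibv_qp *qp, ibv_sge *data_sge,
                     uint64_t remote_addr, uint32_t rkey, ibv_sge *flush_sge) {
    ibv_send_wr wr[2], *bad = nullptr;
    memset(wr, 0, sizeof wr);

    wr[0].opcode = IBV_WR_RDMA_WRITE;          // unsignaled WRITE
    wr[0].sg_list = data_sge;  wr[0].num_sge = 1;
    wr[0].wr.rdma.remote_addr = remote_addr;
    wr[0].wr.rdma.rkey = rkey;
    wr[0].next = &amp;wr[1];                       // chained: one doorbell for both

    wr[1].opcode = IBV_WR_RDMA_READ;           // READ flushes the prior WRITE
    wr[1].sg_list = flush_sge; wr[1].num_sge = 1;
    wr[1].wr.rdma.remote_addr = remote_addr;   // read back (a byte of) the data
    wr[1].wr.rdma.rkey = rkey;
    wr[1].send_flags = IBV_SEND_SIGNALED;      // poll only this completion

    if (ibv_post_send(qp, wr, &amp;bad)) return -1;

    ibv_wc wc;                                 // FIFO order on the same QP means
    while (ibv_poll_cq(qp-&gt;send_cq, 1, &amp;wc) == 0) {}  // this implies WRITE done
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
</code></pre></div></div>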

<p>[4] claims that for small persistent writes to remote NVMM, RPCs have comparable latency as one-sided RDMA.</p>

<blockquote>
  <p>note: the tricky part is that the first author of [4] is also the first author of <em>outstanding request</em>[2]… Maybe the reason is the different devices (CX3 and CX4+CX5)?</p>
</blockquote>

<p>[5][10][11][12] use the RDMA <em>WRITE_WITH_IMM</em> verb to achieve remote persistence. So that servers will get the completion status and make data durable immediately.</p>

<blockquote>
  <p>note: WRITE_WITH_IMM can ensure atomicity since the data need to be confirmed by the extra involved server. On the other side, this imm is only 32-bit, which can't directly address the complete space.</p>
</blockquote>

<p>Like traditional DB systems, there are some optimistic methods, like using redundancy checks. [6] does CRC when reading to check data consistency.<br />
To go a step further, [7] argues that CRC is expensive, so they use a background thread to conduct integrity verification. (but this work is based on simulation…)</p>

<p>check their brief intro in [9].</p>

<p>[9] built and tested some emulated hardware-supported RDMA primitives to support an RDMA remote flush primitive (emulated via RPC).</p>

<p>Popular RDMA RPC communication methods:<br />
<img src="https://www.grayxu.cn/images/2021/10/19/2021-10-19-19-36-36.png" alt="image.png" /><br />
Theirs:
<img src="https://www.grayxu.cn/images/2021/10/20/2021-10-20-10-38-23.png" alt="image.png" /></p>
<blockquote>
  <p>their work still relies on the existing RDMA primitives and the receiver's CPU to emulate RDMA RFlush primitives instead of programmable NIC.</p>
</blockquote>

<h1>?</h1>

<p>Mellanox, gkd (hurry up).</p>

<h1 id="refer">refer</h1>

<ol>
  <li><strong>Wei, Xingda, et al. "Characterizing and Optimizing Remote Persistent Memory with RDMA and NVM." Proceedings of the 2021 USENIX Annual Technical Conference (USENIX ATC 21). 2021.</strong></li>
  <li>Kalia, Anuj, Michael Kaminsky, and David G. Andersen. "Using RDMA efficiently for key-value services." Proceedings of the 2014 ACM Conference on SIGCOMM. 2014.</li>
  <li>Kim, Daehyeok, et al. "Hyperloop: group-based NIC-offloading to accelerate replicated transactions in multi-tenant storage systems." Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication. 2018.</li>
  <li><strong>Kalia, Anuj, David Andersen, and Michael Kaminsky. "Challenges and solutions for fast remote persistent memory access." Proceedings of the 11th ACM Symposium on Cloud Computing. 2020.</strong></li>
  <li>Lu, Youyou, et al. "Octopus: an rdma-enabled distributed persistent memory file system." 2017 USENIX Annual Technical Conference (USENIX ATC 17). 2017.</li>
  <li>Huang, Haixin, et al. "Forca: fast and atomic remote direct access to persistent memory." 2018 IEEE 36th International Conference on Computer Design (ICCD). IEEE, 2018.</li>
  <li>Du, Jingwen, et al. "Fast and Consistent Remote Direct Access to Non-volatile Memory." 50th International Conference on Parallel Processing. 2021.</li>
  <li>Kalia, Anuj, Michael Kaminsky, and David G. Andersen. "Design guidelines for high performance RDMA systems." 2016 USENIX Annual Technical Conference (USENIX ATC 16). 2016.</li>
  <li><strong>Duan, Zhuohui, et al. "Hardware-Supported Remote Persistence for Distributed Persistent Memory." SC 2021.</strong></li>
  <li>Shu, Jiwu, et al. "Th-dpms: Design and implementation of an rdma-enabled distributed persistent memory storage system." ACM Transactions on Storage (TOS) 16.4 (2020): 1-31.</li>
  <li>Liu, Xinxin, Yu Hua, and Rong Bai. "Consistent RDMA-Friendly Hashing on Remote Persistent Memory." ICCD 21.</li>
  <li>Yang, Jian, Joseph Izraelevitz, and Steven Swanson. "Orion: A distributed file system for non-volatile main memory and RDMA-capable networks." 17th USENIX Conference on File and Storage Technologies (FAST 19). 2019.</li>
</ol>]]></content><author><name>Gray</name></author><category term="PM" /><summary type="html"><![CDATA[RDMA, PM]]></summary></entry><entry><title type="html">(SC &apos;21) LogECMem: Coupling Erasure-Coded In-memory Key-Value Stores with Parity Logging</title><link href="https://www.grayxu.cn/2021/10/11/LogECMem/" rel="alternate" type="text/html" title="(SC &apos;21) LogECMem: Coupling Erasure-Coded In-memory Key-Value Stores with Parity Logging" /><published>2021-10-11T00:00:00+08:00</published><updated>2021-10-11T00:00:00+08:00</updated><id>https://www.grayxu.cn/2021/10/11/LogECMem</id><content type="html" xml:base="https://www.grayxu.cn/2021/10/11/LogECMem/"><![CDATA[<p>LogECMem uses a hybrid method of in-place update and Parity logging (PL) for parity updates.</p>

<h1 id="motivation">motivation</h1>

<p>old update policies:</p>
<ol>
  <li>direct reconstruction: read all non-updated data, and compute the new parity with old parities (huge data transfer costs)</li>
  <li>in-place update: read the old parity, and compute the parity delta from the data delta (too many reads of old parity; see the sketch below)</li>
  <li>full-stripe update: out-of-place update, and GC stale data chunks (no parity reads, but it brings high space cost)</li>
  <li>PL: logging the parity deltas (but PL is designed for disk-based systems)</li>
</ol>

<p><img src="https://www.grayxu.cn/images/2021/11/24/2021-11-24-17-26-01.png" alt="1.png" />
<img src="https://www.grayxu.cn/images/2021/11/24/2021-11-24-17-26-27.png" alt="2.png" />
<img src="https://www.grayxu.cn/images/2021/11/24/2021-11-24-17-28-18.png" alt="3.png" /></p>

<p>They claimed that:</p>
<ul>
  <li><em>for wide-stripe EC, GC in full-stripe update will consume a lot of network bandwidth.</em></li>
  <li><em>full-stripe update will take more memory space due to invalid blocks</em></li>
  <li>single-failure is the most critical (in an MTTDL model)</li>
</ul>

<p>So they built <a href="https://github.com/YuchongHu/logecmem"><em>LogECMem</em></a>, using in-place update for the XOR parity in DRAM and PL for the other parities.</p>

<h1 id="methods">methods</h1>

<h2 id="design">design</h2>

<p>Like <em>buffer logging</em> of RAMCloud[2], they use buffer logging for other parity chunks to accelerate writes:  <br />
<img src="https://www.grayxu.cn/images/2021/10/11/2021-10-11-19-26-38.png" alt="image.png" />
(the buffer here is DRAM)</p>

<blockquote>
  <p>note: PM is a good log device, and not limited by capacity.<br />
btw, the persistence in <em>buffer logging</em> of RAMCloud[2] is ensured by battery-based DRAM. So PM is appropriate and may provide fast recovery (<em>since the gap between DRAM and disk is big</em>)? But it's weird to talk about logging persistence on a storage system keeping all data in DRAM….<br />
related: <a href="https://www.grayxu.cn/2020/09/16/FlatStore/"><em>flatstore</em></a></p>
</blockquote>

<p>With this XOR parity in DRAM, systems can perform degraded read in DRAM nodes.</p>

<h2 id="op">Op</h2>

<p>update:
<img src="https://www.grayxu.cn/images/2021/10/11/2021-10-11-21-50-22.png" alt="image.png" /></p>

<p>…<em>check paper for details</em>…</p>

<p>merge-based buffer logging: a log merging trick</p>

<h2 id="multiple-chunk-failures-repair">multiple chunk failures repair</h2>

<p><em>PLR</em> (FAST '14) trades write performance for repair performance, and simple merging on PLR can only merge incoming parity deltas.<br />
So they use a lazy merging strategy (<em>parity logging with merging, PLM</em>) that first writes parity deltas to extra contiguous disk space and reads them back for merging later.</p>

<p><img src="https://www.grayxu.cn/images/2021/10/12/2021-10-12-09-35-27.png" alt="image.png" /></p>

<blockquote>
  <p>note: kind of 2-level? won't the first, non-ordered level hurt perf?<br />
also, any perf bottleneck in logging and 2nd-level log replacement (<em>GC-like</em>)? It seems like there are only <strong>overall</strong> perf tests in the experiment part.</p>
</blockquote>

<h1 id="expriments">expriments</h1>

<p>……</p>

<h1 id="ref">ref</h1>
<ol>
  <li>Cheng, et al. LogECMem: Coupling Erasure-Coded In-memory Key-Value Stores with Parity Logging, SC '21.</li>
  <li>Ousterhout, John, et al. "The RAMCloud storage system." ACM Transactions on Computer Systems (TOCS) 33.3 (2015): 1-55.</li>
</ol>]]></content><author><name>Gray</name></author><category term="EC" /><summary type="html"><![CDATA[KVS, EC]]></summary></entry></feed>