FoundationDB's Distributed Testing Simulator

2022-04-13 00:00:00 cluster multiple test injection simulation

Paper FullText

  
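FoundationDB (FDB) introduces a fault-injection framework for testing this distributed database. Specifically, its distributed testing simulator starts multiple FDB "instances" and hooks their network, disk, and other components, then lets the cluster of FDB "instances" run according to predefined rules under the control of a test script. A test script contains workloads, fault-injection commands, FDB configuration-update commands, and so on.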

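The key point to understand: all testing happens inside the simulator, and the FDB servers also run inside the simulator; no real FDB cluster is started separately. As the figure above shows, inside the simulator's Simulator Process each FDB "instance" is not a separate process but a set of controlled pieces of code logic (impressive: the full functionality of a database cluster is simulated inside one process as if it were a set of "functions"!). The simulator controls the order in which this logic executes, which makes bugs reproducible at the cluster level.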
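To make the "instances as pieces of code in one process" idea concrete, here is a minimal Python sketch. It is not FDB code: the names instance and run_simulation are hypothetical, and each "instance" is just a generator that hands control back to a scheduler after every step. A seeded random number generator decides which instance runs next, so the same seed always yields the same interleaving.

```python
import random

def instance(name, log):
    """One simulated 'instance' (hypothetical): plain code that yields after each step."""
    for step in range(3):
        log.append(f"{name}:step{step}")
        yield  # hand control back to the simulator

def run_simulation(seed, names=("fdb1", "fdb2", "fdb3")):
    """Run all instances in a single thread; a seeded RNG picks who runs next."""
    rng = random.Random(seed)
    log = []
    runnable = [instance(n, log) for n in names]
    while runnable:
        task = rng.choice(runnable)
        try:
            next(task)              # execute one step of the chosen instance
        except StopIteration:
            runnable.remove(task)   # this instance has finished
    return log
```

Calling run_simulation twice with the same seed returns the same log, which is the property that lets a failing interleaving be replayed.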

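What problem does it solve
Introducing a fault-injection framework into a distributed system is nothing new; what matters is using the framework well enough to meet the testing goals. Monkey testing is not good testing practice and should not be the primary means of testing. In my view, a fault-injection framework should achieve the following: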

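Effective: randomly injected faults must not push the system to the brink of collapse.
Repeatable: faults cannot always be injected at random; problems must be efficiently reproducible.
With these two requirements in mind, let's see how the FDB paper addresses them.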

deterministic simulator
FDB established this approach at the very beginning of its design:

the real database software is run, together with randomized synthetic workloads and fault injection, in a deterministic discrete-event simulation
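FDB uses a single-process, single-threaded asynchronous (?) model with no multi-threaded concurrency. The network, disk, time, random number generation, and so on can all be controlled. FDB is written in Flow, which extends C++ with async/await-like syntax. FDB's logic is encapsulated in multiple actors that the Flow runtime library schedules.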

The simulator can start multiple FDB servers and automatically hook their networks, so a single simulation run can control multiple FDB servers.
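A rough Python sketch of what a deterministic discrete-event simulation with a hooked network could look like follows. It is an illustration under assumptions, not FDB's implementation: the Simulator class, its send and schedule methods, the drop rate, and the latency range are all made up for this example. The point is that the clock, the network, and the randomness all come from the simulator, so the whole run is a function of the seed.

```python
import heapq
import random

class Simulator:
    """Hypothetical single-threaded event loop with a virtual clock and seeded RNG."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.now = 0.0        # virtual time in seconds
        self._queue = []      # heap of (time, sequence, callback)
        self._seq = 0         # unique tie-breaker keeps event order deterministic
        self.trace = []

    def schedule(self, delay, callback):
        heapq.heappush(self._queue, (self.now + delay, self._seq, callback))
        self._seq += 1

    def send(self, src, dst, payload, handler, drop_rate=0.2):
        """Simulated network: random latency, plus random loss as an injected fault."""
        if self.rng.random() < drop_rate:
            self.trace.append((round(self.now, 3), f"DROP {src}->{dst} {payload}"))
            return
        latency = self.rng.uniform(0.001, 0.050)
        self.schedule(latency, lambda: handler(src, dst, payload))

    def run(self):
        while self._queue:
            # Jump the virtual clock straight to the next event; nothing sleeps.
            self.now, _, callback = heapq.heappop(self._queue)
            callback()

def simulate(seed):
    sim = Simulator(seed)

    def on_message(src, dst, payload):
        sim.trace.append((round(sim.now, 3), f"RECV {src}->{dst} {payload}"))

    def on_timeout():
        sim.trace.append((round(sim.now, 3), "TIMEOUT fired"))

    for i in range(5):
        sim.send("node0", f"node{1 + i % 2}", f"msg{i}", on_message)
    sim.schedule(10.0, on_timeout)  # a 10 s timeout costs no wall-clock time
    sim.run()
    return sim.trace
```

Because the event loop jumps directly from one event time to the next, the 10-second timeout fires immediately in real time, and simulate(seed) returns an identical trace every time it is called with the same seed.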

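How testing is done
Use synthetic workloads and queries to check whether the database behaves as expected.
Fault injection is not completely random. It is designed carefully so that the system does not enter "a small state-space" because the failure rate is too high (most modules stop working and the test run is wasted).
Add "path coverage check" logic to the code. For example, after adding a line such as TEST( buffer.is_full() ), a report is generated: across N simulation runs, how often buffer.is_full() was true when TEST was reached. If the ratio is very low, test coverage is probably insufficient.
Run simulation tests concurrently. This increases coverage and uncovers edge-case problems.
Time acceleration. This feature is interesting. For example, suppose a query in an observer test times out after 10 s; finding a bug that causes the timeout then costs at least 10 s. To go faster, FDB speeds up simulated time so that, to exaggerate, one second of simulator run time covers a thousand years of simulated time.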
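The coverage-check item in the list above can be sketched in Python as follows. This is a loose analogue of the TEST( buffer.is_full() ) idea, not FDB's macro: probe, one_run, and coverage_report are hypothetical names, and the "workload" is a toy buffer that is randomly filled and drained.

```python
import random
from collections import Counter

def probe(name, condition, seen):
    """Rough analogue of TEST(cond): record that this run saw the condition hold."""
    if condition:
        seen.add(name)

def one_run(seed, capacity=8):
    """Toy workload: push with probability 0.6 when there is room, otherwise pop if non-empty."""
    rng = random.Random(seed)
    seen = set()
    buffer = []
    for _ in range(20):
        if rng.random() < 0.6 and len(buffer) < capacity:
            buffer.append(1)
        elif buffer:
            buffer.pop()
        probe("buffer_is_full", len(buffer) == capacity, seen)
    return seen

def coverage_report(num_runs):
    """Fraction of runs (seeds 0..num_runs-1) in which each probe fired at least once."""
    counts = Counter()
    for seed in range(num_runs):
        counts.update(one_run(seed))
    return {name: counts[name] / num_runs for name in ["buffer_is_full"]}
```

If coverage_report(200) shows a very low fraction for buffer_is_full, the workload rarely exercises the full-buffer path, and the workload or fault mix should be adjusted before trusting the test results.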
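Limitations
It cannot help with performance problems.
Faults cannot be injected into third-party libraries, so they cannot be tested.
Only code implemented in Flow can have faults injected.
Looking back
The beginning of this article named two requirements: effective and repeatable. How does FDB achieve them?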

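Effective: FDB does not rely on monkey testing. Instead, it designs its injection cases carefully and keeps the system out of high-error-rate scenarios, which keeps the test cases effective.
Repeatable: the deterministic simulator tightly controls random events and concurrency in the system, so test steps can be reproduced.
FDB's main innovation is its deterministic simulator, which is inseparable from its controlled single-threaded asynchronous model, so not everyone can copy it. The FDB paper does not disclose more details, however, and offers only the following: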

FDB is written in Flow [4], a novel syntactic extension to C++ adding async/await-like concurrency primitives. Flow provides the Actor programming model [13] that abstracts various actions of the FDB server process into a number of actors that are scheduled by the Flow runtime library. The simulator process is able to spawn multiple FDB servers that communicate with each other through a simulated network in a single discrete-event simulation.

However, its documentation (foundationdb/documentation/sphinx/source/client-testing.rst) explains this fairly well:
  

a fdbserver process can run a simulator that simulates a full fdb cluster with several machines and different configurations in one process. This simulator can run the same workloads you can run on a real cluster. It will also inject random failures like network partitions and disk failures.

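In other words, FDB provides a binary called fdbserver. It uses a single-threaded model, can simulate an entire FDB cluster inside one process, and controls communication, disk I/O, and so on within the cluster.
This is where the "deterministic" property mentioned above comes from: the execution of every actor in the cluster is simulated within one process, so of course it is tightly controlled!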

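References
Learning more about Simulation requires studying the FDB source code; see the testing description on GitHub:
foundationdb/documentation/sphinx/source/testing.rst

In addition, this chapter describes how to write tests and helps in understanding the simulation technology: foundationdb/documentation/sphinx/source/client-testing.rst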

Simulation
Simulation is a powerful tool for testing system correctness. Our simulation technology, called Simulation, is enabled by and tightly integrated with :doc:flow, our programming language for actor-based concurrency. In addition to generating efficient production code, Flow works with Simulation for simulated execution.

The major goal of Simulation is to make sure that we find and diagnose issues in simulation rather than the real world. Simulation runs tens of thousands of simulations every night, each one simulating large numbers of component failures. Based on the volume of tests that we run and the increased intensity of the failures in our scenarios, we estimate that we have run the equivalent of roughly one trillion CPU-hours of simulation on FoundationDB.

Simulation is able to conduct a deterministic simulation of an entire FoundationDB cluster within a single-threaded process. Determinism is crucial in that it allows perfect repeatability of a simulated run, facilitating controlled experiments to home in on issues. The simulation steps through time, synchronized across the system, representing a larger amount of real time in a smaller amount of simulated time. In practice, our simulations usually have about a 10-1 factor of real-to-simulated time, which is advantageous for the efficiency of testing.

We run a broad range of simulations testing various aspects of the system. For example, we run a cycle test that uses key-values pairs arranged in a ring that executes transactions to change the values in a manner designed to maintain the ring’s integrity, allowing a clear test of transactional isolation.

Simulation simulates all physical components of a FoundationDB system, beginning with the number and type of machines in the cluster. For example, Simulation models drive performance on each machine, including drive space and the possibility of the drive filling up. Simulation also models the network, allowing a small amount of code to specify delivery of packets.

We use Simulation to simulate failures modes at the network, machine, and datacenter levels, including connection failures, degradation of machine performance, machine shutdowns or reboots, machines coming back from the dead, etc. We stress-test all of these failure modes, failing machines at very short intervals, inducing unusually severe loads, and delaying communications channels.

For a while, there was an informal competition within the engineering team to design failures that found the toughest bugs and issues the most easily. After a period of one-upsmanship, the reigning champion is called “swizzle-clogging”. To swizzle-clog, you first pick a random subset of nodes in the cluster. Then, you “clog” (stop) each of their network connections one by one over a few seconds. Finally, you unclog them in a random order, again one by one, until they are all up. This pattern seems to be particularly good at finding deep issues that only happen in the rarest real-world cases.

Simulation’s success has surpassed our expectation and has been vital to our engineering team. It seems unlikely that we would have been able to build FoundationDB without this technology.
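The swizzle-clogging procedure described in the excerpt above can be sketched in a few lines of Python. This is only an illustration of the three steps (pick a random subset, clog one by one, unclog in random order). The function name swizzle_clog and the network dictionary are hypothetical, and the "over a few seconds" pacing is left out.

```python
import random

def swizzle_clog(nodes, seed, network):
    """Clog a random subset of nodes one by one, then unclog them in random order."""
    rng = random.Random(seed)
    subset = rng.sample(nodes, rng.randint(1, len(nodes)))  # random subset of the cluster
    events = []
    for node in subset:
        network[node] = "clogged"       # stop this node's network connections
        events.append(("clog", node))
    order = subset[:]
    rng.shuffle(order)                  # unclog in a different, random order
    for node in order:
        network[node] = "up"
        events.append(("unclog", node))
    return events
```

With a fixed seed the same nodes are clogged and released in the same order on every run, so a failure found by this pattern can be replayed.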
————————————————
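Copyright notice: this is an original article by CSDN blogger "maray", licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.
Original link: https://blog.csdn.net/maray/article/details/118209985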
