400 threads in 20 processes outperform 400 threads in 4 processes when performing an I/O-bound task

Problem description

Here is the experimental code. It launches a specified number of worker processes, launches a specified number of worker threads within each process, and has each thread fetch URLs:

import multiprocessing
import sys
import time
import threading
import urllib.request


def main():
    processes = int(sys.argv[1])
    threads = int(sys.argv[2])
    urls = int(sys.argv[3])

    # Start process workers.
    in_q = multiprocessing.Queue()
    process_workers = []
    for _ in range(processes):
        w = multiprocessing.Process(target=process_worker, args=(threads, in_q))
        w.start()
        process_workers.append(w)

    start_time = time.time()

    # Feed work.
    for n in range(urls):
        in_q.put('http://www.example.com/?n={}'.format(n))

    # Send sentinel for each thread worker to quit.
    for _ in range(processes * threads):
        in_q.put(None)

    # Wait for workers to terminate.
    for w in process_workers:
        w.join()

    # Print time consumed and fetch speed.
    total_time = time.time() - start_time
    fetch_speed = urls / total_time
    print('{} x {} workers => {:.3} s, {:.1f} URLs/s'
          .format(processes, threads, total_time, fetch_speed))



def process_worker(threads, in_q):
    # Start thread workers.
    thread_workers = []
    for _ in range(threads):
        w = threading.Thread(target=thread_worker, args=(in_q,))
        w.start()
        thread_workers.append(w)

    # Wait for thread workers to terminate.
    for w in thread_workers:
        w.join()


def thread_worker(in_q):
    # Each thread performs the actual work. In this case, we will assume
    # that the work is to fetch a given URL.
    while True:
        url = in_q.get()
        if url is None:
            break

        with urllib.request.urlopen(url) as u:
            pass # Do nothing
            # print('{} - {} {}'.format(url, u.getcode(), u.reason))


if __name__ == '__main__':
    main()

Here is how I run this program:

python3 foo.py <PROCESSES> <THREADS> <URLS>

For example, python3 foo.py 20 20 10000 creates 20 worker processes with 20 threads in each worker process (thus a total of 400 worker threads) and fetches 10000 URLs. At the end, the program prints how long it took to fetch the URLs and how many URLs it fetched per second on average.

Note that in all cases I am really hitting a URL in the www.example.com domain, i.e., www.example.com is not merely a placeholder. In other words, I run the above code unmodified.

I am testing this code on a Linode virtual private server that has 8 GB RAM and 4 CPUs. It is running Debian 9.

$ cat /etc/debian_version 
9.9

$ python3
Python 3.5.3 (default, Sep 27 2018, 17:25:39) 
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 

$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7987          67        7834          10          85        7734
Swap:           511           0         511

$ nproc
4

Case 1: 20 Processes x 20 Threads

Here are a few trial runs with 400 worker threads distributed between 20 worker processes (i.e., 20 worker threads in each of the 20 worker processes). In each trial, 10,000 URLs are fetched.

Results:

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.12 s, 1954.6 URLs/s

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.28 s, 1895.5 URLs/s

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.22 s, 1914.2 URLs/s

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.38 s, 1859.8 URLs/s

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.19 s, 1925.2 URLs/s

We can see that about 1900 URLs are fetched per second on average. When I monitor the CPU usage with the top command, I see that each python3 worker process consumes about 10% to 15% CPU.

My reasoning was this: I have only 4 CPUs, so even if I launch 20 worker processes, at most 4 processes can run at any point in physical time. Further, due to the global interpreter lock (GIL), only one thread in each process (thus a total of 4 threads at most) can run at any point in physical time.

Therefore, I thought that if I reduced the number of processes to 4 and increased the number of threads per process to 100, so that the total number of threads still remains 400, the performance should not deteriorate.

Case 2: 4 Processes x 100 Threads

But the test results show that 4 processes containing 100 threads each consistently perform worse than 20 processes containing 20 threads each.

$ python3 foo.py 4 100 10000
4 x 100 workers => 9.2 s, 1086.4 URLs/s

$ python3 foo.py 4 100 10000
4 x 100 workers => 10.9 s, 916.5 URLs/s

$ python3 foo.py 4 100 10000
4 x 100 workers => 7.8 s, 1282.2 URLs/s

$ python3 foo.py 4 100 10000
4 x 100 workers => 10.3 s, 972.3 URLs/s

$ python3 foo.py 4 100 10000
4 x 100 workers => 6.37 s, 1570.9 URLs/s

The CPU usage is between 40% and 60% for each python3 worker process.

Case 3: 1 Process x 400 Threads

Just for comparison, I am recording the fact that both case 1 and case 2 outperform the case where all 400 threads are in a single process. This is most certainly due to the global interpreter lock (GIL).

$ python3 foo.py 1 400 10000
1 x 400 workers => 13.5 s, 742.8 URLs/s

$ python3 foo.py 1 400 10000
1 x 400 workers => 14.3 s, 697.5 URLs/s

$ python3 foo.py 1 400 10000
1 x 400 workers => 13.1 s, 761.3 URLs/s

$ python3 foo.py 1 400 10000
1 x 400 workers => 15.6 s, 640.4 URLs/s

$ python3 foo.py 1 400 10000
1 x 400 workers => 13.1 s, 764.4 URLs/s

The CPU usage is between 120% and 125% for the single python3 worker process.

Case 4: 400 Processes x 1 Thread

Again, just for comparison, here is how the results look when there are 400 processes, each with a single thread.

$ python3 foo.py 400 1 10000
400 x 1 workers => 14.0 s, 715.0 URLs/s

$ python3 foo.py 400 1 10000
400 x 1 workers => 6.1 s, 1638.9 URLs/s

$ python3 foo.py 400 1 10000
400 x 1 workers => 7.08 s, 1413.1 URLs/s

$ python3 foo.py 400 1 10000
400 x 1 workers => 7.23 s, 1382.9 URLs/s

$ python3 foo.py 400 1 10000
400 x 1 workers => 11.3 s, 882.9 URLs/s

The CPU usage is between 1% and 3% for each python3 worker process.

Picking the median result (by elapsed time) from each case, we get this summary:

Case 1:  20 x  20 workers => 5.22 s, 1914.2 URLs/s ( 10% to  15% CPU/process)
Case 2:   4 x 100 workers => 9.20 s, 1086.4 URLs/s ( 40% to  60% CPU/process)
Case 3:   1 x 400 workers => 13.5 s,  742.8 URLs/s (120% to 125% CPU/process)
Case 4: 400 x   1 workers => 7.23 s, 1382.9 URLs/s (  1% to   3% CPU/process)
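
The medians in this summary can be re-derived from the trial timings with a few lines of Python. This is only a sketch: the `trials` dictionary and its layout are introduced here for illustration, and the numbers are copied from the runs shown above.

```python
import statistics

# Elapsed seconds for each trial, copied from the runs above.
# Keys are (processes, threads per process); this layout is an assumption
# made for the sketch, not something the benchmark script produces.
trials = {
    (20, 20): [5.12, 5.28, 5.22, 5.38, 5.19],
    (4, 100): [9.2, 10.9, 7.8, 10.3, 6.37],
    (1, 400): [13.5, 14.3, 13.1, 15.6, 13.1],
    (400, 1): [14.0, 6.1, 7.08, 7.23, 11.3],
}

# With five trials per case, the median is the third value after sorting.
medians = {case: statistics.median(times) for case, times in trials.items()}

for (processes, threads), seconds in medians.items():
    print('{:>3} x {:>3} workers => median {} s'.format(processes, threads, seconds))
```

Using the median rather than the mean keeps a single slow outlier, such as the 14.0 s run in case 4, from dominating the comparison.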

Question

Why do 20 processes x 20 threads perform better than 4 processes x 100 threads even though I have only 4 CPUs?


Solution

Your task is I/O-bound rather than CPU-bound: threads spend most of their time asleep, waiting for network data and the like, rather than using the CPU.

So adding more threads than CPUs works here as long as I/O is still the bottleneck. The effect will only subside once there are so many threads that enough of them are ready at the same time to start actively competing for CPU cycles (or when your network bandwidth is exhausted, whichever comes first).
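
To see why extra threads help when the work is mostly waiting, here is a minimal sketch that replaces the network request with `time.sleep`. The names `fake_fetch` and `timed_run` and the 50 ms delay are assumptions made for illustration; they are not part of the benchmark in the question.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(delay):
    # Stand-in for a network request: the thread sleeps instead of using the CPU.
    time.sleep(delay)
    return delay


def timed_run(workers, tasks, delay):
    # Run `tasks` simulated fetches on `workers` threads; return elapsed seconds.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_fetch, [delay] * tasks))
    return time.perf_counter() - start


if __name__ == '__main__':
    # Same amount of simulated I/O, different thread counts.
    for workers in (1, 4, 20):
        print('{:>2} threads: {:.2f} s'.format(workers, timed_run(workers, 20, 0.05)))
```

Because a sleeping thread does not hold the GIL or a CPU, the waits overlap and the elapsed time should drop roughly in proportion to the thread count, regardless of how many CPUs the machine has. Real fetches also spend some CPU time on parsing and bookkeeping, which is where the limits described above start to matter.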

As for why 20 threads per process is faster than 100 threads per process: this is most likely due to CPython's GIL. Python threads in the same process need to wait not only for I/O but for each other, too.
When dealing with I/O, the Python machinery:

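  1. Converts all Python objects involved into C objects (in many cases this can be done without physically copying the data)
  2. Releases the GIL
  3. Performs the I/O in C (which involves waiting for an arbitrary time)
  4. Reacquires the GIL
  5. Converts the result to a Python object, if applicable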

If there are enough threads in the same process, it becomes increasingly likely that another one is active when step 4 is reached, causing an additional random delay.
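
The delay at step 4 can be made visible with a rough sketch like the one below. It is not the benchmark from the question: `spin` and `average_wakeup_delay` are hypothetical helpers, and `time.sleep` stands in for a blocking I/O call, since it also releases the GIL while waiting and has to reacquire it afterwards. The busy threads exaggerate the contention that many fetching threads would cause.

```python
import threading
import time


def spin(stop):
    # Pure-Python busy loop: keeps the GIL until the interpreter forces a switch.
    while not stop.is_set():
        pass


def average_wakeup_delay(busy_threads, sleeps=20, interval=0.01):
    # Return how much longer than `interval` each sleep took, on average.
    stop = threading.Event()
    workers = [threading.Thread(target=spin, args=(stop,))
               for _ in range(busy_threads)]
    for w in workers:
        w.start()
    try:
        overshoot = 0.0
        for _ in range(sleeps):
            start = time.perf_counter()
            time.sleep(interval)  # releases the GIL, then must reacquire it
            overshoot += time.perf_counter() - start - interval
    finally:
        # Always stop the busy threads, even if the measurement fails.
        stop.set()
        for w in workers:
            w.join()
    return overshoot / sleeps


if __name__ == '__main__':
    for busy in (0, 2):
        delay_ms = average_wakeup_delay(busy) * 1000
        print('{} busy threads: average extra delay {:.2f} ms'.format(busy, delay_ms))
```

On a standard CPython build I would expect the extra delay to grow when other threads are busy, because a waking thread may have to wait for the current GIL holder to give it up (the switch interval defaults to 5 ms, see `sys.getswitchinterval()`). The exact figures depend on the machine and the Python version, so treat them as indicative only.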

Now, when it comes to lots of processes, other factors come into play, such as memory swapping (unlike threads, processes running the same code don't share memory). I'm fairly sure there are other delays that come from many processes, as opposed to threads, competing for resources, but I can't name them off the top of my head. That's why the performance becomes unstable.
