scala 范围与大型集合上的列表性能
我针对 10,000,000 个元素运行了一组性能基准测试,我发现每次实施的结果差异很大.
I ran a set of performance benchmarks for 10,000,000 elements, and I've discovered that the results vary greatly with each implementation.
谁能解释为什么创建 Range.ByOne 会产生比简单的基元数组更好的性能,但将相同的范围转换为列表会导致比最坏情况下更差的性能?
Can anybody explain why creating a Range.ByOne, results in performance that is better than a simple array of primitives, but converting that same range to a list results in even worse performance than the worse case scenario?
创建 10,000,000 个元素,并打印出以 1,000,000 为模的元素.JVM 大小始终设置为相同的最小值和最大值:-Xms?m -Xmx?m
Create 10,000,000 elements, and print out those that are modulos of 1,000,000. JVM size is always set to same min and max: -Xms?m -Xmx?m
import java.util.concurrent.TimeUnit
import java.util.concurrent.TimeUnit._
object LightAndFastRange extends App {
def chrono[A](f: => A, timeUnit: TimeUnit = MILLISECONDS): (A,Long) = {
val start = System.nanoTime()
val result: A = f
val end = System.nanoTime()
(result, timeUnit.convert(end-start, NANOSECONDS))
}
def millions(): List[Int] = (0 to 10000000).filter(_ % 1000000 == 0).toList
val results = chrono(millions())
results._1.foreach(x => println ("x: " + x))
println("Time: " + results._2);
}
JVM 大小为 27m 需要 141 毫秒
It takes 141 milliseconds with a JVM size of 27m
相比之下,转换为 List 会显着影响性能:
In comparison, converting to List affects performance dramatically:
import java.util.concurrent.TimeUnit
import java.util.concurrent.TimeUnit._
object LargeLinkedList extends App {
def chrono[A](f: => A, timeUnit: TimeUnit = MILLISECONDS): (A,Long) = {
val start = System.nanoTime()
val result: A = f
val end = System.nanoTime()
(result, timeUnit.convert(end-start, NANOSECONDS))
}
val results = chrono((0 to 10000000).toList.filter(_ % 1000000 == 0))
results._1.foreach(x => println ("x: " + x))
println("Time: " + results._2)
}
460-455 m 需要 8514-10896 ms
It takes 8514-10896 ms with 460-455 m
相比之下,这个 Java 实现使用了一个原语数组
In contrast, this Java implementation uses an array of primitives
import static java.util.concurrent.TimeUnit.*;
public class LargePrimitiveArray {
public static void main(String[] args){
long start = System.nanoTime();
int[] elements = new int[10000000];
for(int i = 0; i < 10000000; i++){
elements[i] = i;
}
for(int i = 0; i < 10000000; i++){
if(elements[i] % 1000000 == 0) {
System.out.println("x: " + elements[i]);
}
}
long end = System.nanoTime();
System.out.println("Time: " + MILLISECONDS.convert(end-start, NANOSECONDS));
}
}
JVM 大小为 59m 需要 116ms
It takes 116ms with JVM size of 59m
Java 整数列表
import java.util.List;
import java.util.ArrayList;
import static java.util.concurrent.TimeUnit.*;
public class LargeArrayList {
public static void main(String[] args){
long start = System.nanoTime();
List<Integer> elements = new ArrayList<Integer>();
for(int i = 0; i < 10000000; i++){
elements.add(i);
}
for(Integer x: elements){
if(x % 1000000 == 0) {
System.out.println("x: " + x);
}
}
long end = System.nanoTime();
System.out.println("Time: " + MILLISECONDS.convert(end-start, NANOSECONDS));
}
}
JVM 大小为 283m 需要 3993 毫秒
It takes 3993 ms with JVM size of 283m
我的问题是,为什么第一个示例如此高效,而第二个示例却受到如此严重的影响.我尝试创建视图,但未能成功再现该范围的性能优势.
My question is, why is the first example so performant, while the second is so badly affected. I tried creating views, but wasn't successful at reproducing the performance benefits of the range.
在 Mac OS X Snow Leopard 上运行的所有测试,Java 6u26 64 位服务器斯卡拉 2.9.1.final
All tests running on Mac OS X Snow Leopard, Java 6u26 64-Bit Server Scala 2.9.1.final
为了完成,这里是使用 LinkedList 的实际实现(这在空间方面比 ArrayList 更公平,因为正如正确指出的那样,scala 的列表是链接的)
for completion, here's the actual implementation using a LinkedList (which is a more fair comparison in terms of space than ArrayList, since as rightly pointed out, scala's List are linked)
import java.util.List;
import java.util.LinkedList;
import static java.util.concurrent.TimeUnit.*;
public class LargeLinkedList {
public static void main(String[] args){
LargeLinkedList test = new LargeLinkedList();
long start = System.nanoTime();
List<Integer> elements = test.createElements();
test.findElementsToPrint(elements);
long end = System.nanoTime();
System.out.println("Time: " + MILLISECONDS.convert(end-start, NANOSECONDS));
}
private List<Integer> createElements(){
List<Integer> elements = new LinkedList<Integer>();
for(int i = 0; i < 10000000; i++){
elements.add(i);
}
return elements;
}
private void findElementsToPrint(List<Integer> elements){
for(Integer x: elements){
if(x % 1000000 == 0) {
System.out.println("x: " + x);
}
}
}
}
480-460 mbs 需要 3621-6749 毫秒.这更符合第二个 scala 示例的性能.
Takes 3621-6749 ms with 480-460 mbs. That's much more in line with the performance of the second scala example.
最后,一个 LargeArrayBuffer
finally, a LargeArrayBuffer
import collection.mutable.ArrayBuffer
import java.util.concurrent.TimeUnit
import java.util.concurrent.TimeUnit._
object LargeArrayBuffer extends App {
def chrono[A](f: => A, timeUnit: TimeUnit = MILLISECONDS): (A,Long) = {
val start = System.nanoTime()
val result: A = f
val end = System.nanoTime()
(result, timeUnit.convert(end-start, NANOSECONDS))
}
def millions(): List[Int] = {
val size = 10000000
var items = new ArrayBuffer[Int](size)
(0 to size).foreach (items += _)
items.filter(_ % 1000000 == 0).toList
}
val results = chrono(millions())
results._1.foreach(x => println ("x: " + x))
println("Time: " + results._2);
}
大约需要 2145 毫秒和 375 mb
Taking about 2145 ms and 375 mb
非常感谢您的回答.
推荐答案
哦,这里发生了很多事情!!!
Oh So Many Things going on here!!!
让我们从 Java int[]
开始.Java 中的数组是唯一没有类型擦除的集合.int[]
的运行时表示不同于 Object[]
的运行时表示,因为它实际上直接使用 int
.因此,使用它时不涉及拳击.
Let's start with Java int[]
. Arrays in Java are the only collection that is not type erased. The run time representation of an int[]
is different from the run time representation of Object[]
, in that it actually uses int
directly. Because of that, there's no boxing involved in using it.
在内存方面,内存中有 40.000.000 个连续字节,每当读取或写入一个元素时,每次读取和写入 4 个字节.
In memory terms, you have 40.000.000 consecutive bytes in memory, that are read and written 4 at a time whenever an element is read or written to.
相比之下,ArrayList<Integer>
——以及几乎任何其他通用集合——由 40.000.000 或 80.000.00 个连续字节组成(在 32 位和 64 位 JVM 上)分别),加上 80.000.000 个字节以 8 个字节为一组分布在整个内存中.每次读取和写入元素都必须经过两个内存空间,而当您正在执行的实际任务如此之快时,处理所有这些内存所花费的绝对时间非常重要.
In contrast, an ArrayList<Integer>
-- as well as pretty much any other generic collection -- is composed of 40.000.000 or 80.000.00 consecutive bytes (on 32 and 64 bits JVM respectively), PLUS 80.000.000 bytes spread all around memory in groups of 8 bytes. Every read an write to an element has to go through two memory spaces, and the sheer time spent handling all that memory is significant when the actual task you are doing is so fast.
那么,回到 Scala 来处理第二个操作 List
的例子.现在,Scala 的 List
更像是 Java 的 LinkedList
,而不是名称错误的 ArrayList
.List
的每个元素都由一个名为 Cons
的对象组成,该对象有 16 个字节,带有一个指向该元素的指针和一个指向另一个列表的指针.因此,由 10.000.000 个元素组成的 List
由 160.000.000 个元素组成,这些元素以 16 个字节为一组分布在内存中,另外还有 80.000.000 个字节以 8 个字节为一组分布在内存中.所以对于 ArrayList
来说是正确的,对于 List
So, back to Scala, for the second example where you manipulate a List
. Now, Scala's List
is much more like Java's LinkedList
than the grossly misnamed ArrayList
. Each element of a List
is composed of an object called Cons
, which has 16 bytes, with a pointer to the element and a pointer to another list. So, a List
of 10.000.000 elements is composed of 160.000.000 elements spread all around memory in groups of 16 bytes, plus 80.000.000 bytes spread all around memory in groups of 8 bytes. So what was true for ArrayList
is even more so for List
最后,范围
.Range
是具有上下边界的整数序列,外加一个步长.10.000.000 个元素的 Range
为 40 个字节:三个整数(非通用)用于上下界和步长,加上一些预先计算的值(last
、numRangeElements
) 和另外两个用于 lazy val
线程安全的整数.为了清楚起见,NOT 40 乘以 10.000.000:总共 40 个字节.范围的大小完全不相关,因为它不存储单个元素.只有下限、上限和步长.
Finally, Range
. A Range
is a sequence of integers with a lower and an upper boundary, plus a step. A Range
of 10.000.000 elements is 40 bytes: three ints (not generic) for lower and upper bounds and step, plus a few pre-computed values (last
, numRangeElements
) and two other ints used for lazy val
thread safety. Just to make clear, that's NOT 40 times 10.000.000: that's 40 bytes TOTAL. The size of the range is completely irrelevant, because IT DOESN'T STORE THE INDIVIDUAL ELEMENTS. Just the lower bound, upper bound and step.
现在,因为一个 Range
是一个 Seq[Int]
,它在大多数情况下仍然需要经过装箱:一个 int
将转换成 Integer
然后再转换回 int
,这可悲的是浪费.
Now, because a Range
is a Seq[Int]
, it still has to go through boxing for most uses: an int
will be converted into an Integer
and then back into an int
again, which is sadly wasteful.
缺点大小计算
所以,这里是对 Cons 的初步计算.首先,阅读这篇文章,了解关于对象占用多少内存的一般准则.重点是:
So, here's a tentative calculation of Cons. First of all, read this article about some general guidelines on how much memory an object takes. The important points are:
- Java 为普通对象使用 8 个字节,为对象数组使用 12 个字节,用于管理"信息(该对象的类是什么等).
- 对象被分配在 8 个字节的块中.如果您的对象小于此值,则会对其进行填充以补充它.
其实我以为是16字节,不是8字节.反正Cons也比我想象的要小.它的字段是:
I actually thought it was 16 bytes, not 8. Anyway, Cons is also smaller than I thought. Its fields are:
public static final long serialVersionUID; // static, doesn't count
private java.lang.Object scala$collection$immutable$$colon$colon$$hd;
private scala.collection.immutable.List tl;
引用至少 4 个字节(在 64 位 JVM 上可能更多).所以我们有:
References are at least 4 bytes (could be more on 64 bits JVM). So we have:
8 bytes Java header
4 bytes hd
4 bytes tl
这使它只有 16 个字节长.相当不错,其实.在示例中,hd
将指向一个 Integer
对象,我假设它有 8 个字节长.至于 tl
,它指向另一个我们已经在计算的缺点.
Which makes it only 16 bytes long. Pretty good, actually. In the example, hd
will point to an Integer
object, which I assume is 8 bytes long. As for tl
, it points to another cons, which we are already counting.
我将修改估计值,并尽可能使用实际数据.
I'm going to revise the estimates, with actual data where possible.
相关文章