Stream.skip behavior with unordered terminal operations

2022-01-22 00:00:00 parallel-processing java-8 java java-stream collectors

I've already read this and this question, but I still doubt whether the observed behavior of Stream.skip was intended by the JDK authors.

Let's take a simple input of the numbers 1..20:

List<Integer> input = IntStream.rangeClosed(1, 20).boxed().collect(Collectors.toList());

Now let's create a parallel stream, combine unordered() with skip() in different ways, and collect the result:

System.out.println("skip-skip-unordered-toList: "
    + input.parallelStream().filter(x -> x > 0)
        .skip(1)
        .skip(1)
        .unordered()
        .collect(Collectors.toList()));
System.out.println("skip-unordered-skip-toList: "
    + input.parallelStream().filter(x -> x > 0)
        .skip(1)
        .unordered()
        .skip(1)
        .collect(Collectors.toList()));
System.out.println("unordered-skip-skip-toList: "
    + input.parallelStream().filter(x -> x > 0)
        .unordered()
        .skip(1)
        .skip(1)
        .collect(Collectors.toList()));

The filtering step does essentially nothing here, but it makes things harder for the stream engine: it no longer knows the exact size of the output, so some optimizations are turned off. I get the following results:

skip-skip-unordered-toList: [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20] // absent values: 1, 2
skip-unordered-skip-toList: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20] // absent values: 1, 15
unordered-skip-skip-toList: [1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20] // absent values: 7, 18

The results are completely fine; everything works as expected. In the first case I asked to skip the first two elements and then collect to a list in no particular order. In the second case I asked to skip the first element, then switch to unordered mode and skip one more element (I don't care which one). In the third case I switched to unordered mode first and then skipped two arbitrary elements.

Now let's skip one element and collect to a custom collection in unordered mode. Our custom collection will be a HashSet:

System.out.println("skip-toCollection: "
    + input.parallelStream().filter(x -> x > 0)
        .skip(1)
        .unordered()
        .collect(Collectors.toCollection(HashSet::new)));

The output is satisfactory:

skip-toCollection: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20] // 1 is skipped

So in general I expect that as long as the stream is ordered, skip() skips the first elements; otherwise it skips arbitrary ones.

However, let's use an equivalent unordered terminal operation, collect(Collectors.toSet()):

System.out.println("skip-toSet: "
    + input.parallelStream().filter(x -> x > 0)
        .skip(1)
        .unordered()
        .collect(Collectors.toSet()));

Now the output is:

skip-toSet: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 20] // 13 is skipped

The same result can be obtained with any other unordered terminal operation (such as forEach, findAny, anyMatch, etc.). Removing the unordered() step in this case changes nothing. It seems that while the unordered() step correctly makes the stream unordered starting from the current operation, an unordered terminal operation makes the whole stream unordered from the very beginning, even though this can affect the result when skip() is used. This seems completely misleading to me: I expect that using an unordered collector is the same as switching the stream to unordered mode just before the terminal operation and using the equivalent ordered collector.
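The difference between the two collectors used above can be checked from the characteristics they report. The following is only a sketch: the class CollectorOrderingCheck and its methods are names invented for this illustration. It relies on Collectors.toSet() being documented as an unordered collector, and on the assumption (true of OpenJDK, though not stated in the Javadoc) that Collectors.toCollection does not report the UNORDERED characteristic.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collector;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical helper class, not part of the original question.
public class CollectorOrderingCheck {

    // Does Collectors.toSet() declare itself unordered?
    static boolean toSetIsUnordered() {
        Collector<Integer, ?, Set<Integer>> c = Collectors.toSet();
        return c.characteristics().contains(Collector.Characteristics.UNORDERED);
    }

    // Does Collectors.toCollection(HashSet::new) declare itself unordered?
    static boolean toCollectionIsUnordered() {
        Collector<Integer, ?, Set<Integer>> c = Collectors.toCollection(HashSet::new);
        return c.characteristics().contains(Collector.Characteristics.UNORDERED);
    }

    // skip(1) on an ordered parallel stream, followed by a collector that
    // does not declare UNORDERED: the first element (1) is the one dropped.
    static Set<Integer> skipFirstIntoHashSet() {
        List<Integer> input = IntStream.rangeClosed(1, 20).boxed().collect(Collectors.toList());
        return input.parallelStream()
                .filter(x -> x > 0)
                .skip(1)
                .collect(Collectors.toCollection(HashSet::new));
    }

    public static void main(String[] args) {
        System.out.println("toSet unordered:        " + toSetIsUnordered());
        System.out.println("toCollection unordered: " + toCollectionIsUnordered());
        System.out.println("skip-toCollection:      " + skipFirstIntoHashSet());
    }
}
```

So although both calls produce a HashSet, only toSet() tells the stream engine that encounter order is irrelevant, which is consistent with the two different outputs shown above.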

So my questions are:

Is this behavior intended, or is it a bug?

If it is intended, is it documented somewhere? I've read the Stream.skip() documentation: it does not say anything about unordered terminal operations. The Characteristics.UNORDERED documentation is not very comprehensive either and does not say that ordering will be lost for the whole stream. Finally, the Ordering section in the package summary does not cover this case. Am I missing something?

If it is intended that an unordered terminal operation makes the whole stream unordered, why does the unordered() step make it unordered only from that point on? Can I rely on this behavior, or was I just lucky that my first tests worked nicely?

Recommended answer

Recall that the goal of stream flags (ORDERED, SORTED, SIZED, DISTINCT) is to enable operations to avoid doing unnecessary work. Examples of optimizations that involve stream flags are:

If we know the stream is already sorted, then sorted() is a no-op;

If we know the size of the stream, we can pre-allocate a correctly sized array in toArray(), avoiding a copy;

If we know that the input has no meaningful encounter order, we need not take extra steps to preserve encounter order.

Each stage of a pipeline has a set of stream flags. Intermediate operations can inject, preserve, or clear stream flags. For example, filtering preserves sorted-ness and distinct-ness but not sized-ness; mapping preserves sized-ness but not sorted-ness or distinct-ness. Sorting injects sorted-ness. The treatment of flags for intermediate operations is fairly straightforward, because all decisions are local.
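The flags themselves are internal, but they can be observed indirectly through the characteristics of the spliterator a pipeline exposes. The sketch below assumes OpenJDK's behavior, where those characteristics are derived from the pipeline's flags; the class StreamFlagsPeek and its methods are hypothetical names for this illustration, and the reported values are an implementation detail rather than a documented contract.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Spliterator;
import java.util.stream.Stream;

// Hypothetical helper class for peeking at pipeline characteristics.
public class StreamFlagsPeek {

    static final List<Integer> SOURCE = Arrays.asList(3, 1, 2);

    // Reports whether the pipeline's spliterator advertises a characteristic.
    static boolean has(Stream<?> stream, int characteristic) {
        return stream.spliterator().hasCharacteristics(characteristic);
    }

    // The source list knows its size.
    static boolean sourceIsSized() {
        return has(SOURCE.stream(), Spliterator.SIZED);
    }

    // filter() cannot know how many elements survive.
    static boolean filteredIsSized() {
        return has(SOURCE.stream().filter(x -> x > 0), Spliterator.SIZED);
    }

    // sorted() injects sorted-ness.
    static boolean sortedIsSorted() {
        return has(SOURCE.stream().sorted(), Spliterator.SORTED);
    }

    // filter() preserves sorted-ness.
    static boolean sortedThenFilteredIsSorted() {
        return has(SOURCE.stream().sorted().filter(x -> x > 0), Spliterator.SORTED);
    }

    // map() may reorder values, so sorted-ness is cleared.
    static boolean sortedThenMappedIsSorted() {
        return has(SOURCE.stream().sorted().map(x -> -x), Spliterator.SORTED);
    }

    // unordered() clears the ORDERED flag from that stage onward.
    static boolean unorderedIsOrdered() {
        return has(SOURCE.stream().unordered(), Spliterator.ORDERED);
    }

    public static void main(String[] args) {
        System.out.println("source SIZED:            " + sourceIsSized());
        System.out.println("filter SIZED:            " + filteredIsSized());
        System.out.println("sorted SORTED:           " + sortedIsSorted());
        System.out.println("sorted+filter SORTED:    " + sortedThenFilteredIsSorted());
        System.out.println("sorted+map SORTED:       " + sortedThenMappedIsSorted());
        System.out.println("unordered ORDERED:       " + unorderedIsOrdered());
    }
}
```

Each check looks only at the stage it was applied to, which mirrors the point that flag decisions for intermediate operations are local.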

The treatment of flags for terminal operations is more subtle. ORDERED is the most relevant flag for terminal ops, and if a terminal op is UNORDERED, then we do back-propagate the unordered-ness.

Why do we do this? Well, consider this pipeline:

set.stream()
    .sorted()
    .forEach(System.out::println);

Since forEach is not constrained to operate in order, the work of sorting the elements is completely wasted effort. So we back-propagate this information (until we hit a short-circuiting operation, such as limit), so as not to lose this optimization opportunity. Similarly, we can use an optimized implementation of distinct on unordered streams.
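The practical consequence for callers is that forEach promises nothing about order, while forEachOrdered does. A minimal sketch, assuming a shuffled list of 1..20 as input (the class ForEachOrderingDemo, its methods, and the fixed random seed are invented for this illustration):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical demo class contrasting forEach with forEachOrdered.
public class ForEachOrderingDemo {

    // The numbers 1..20 in a fixed pseudo-random order.
    static List<Integer> shuffledInput() {
        List<Integer> input = IntStream.rangeClosed(1, 20).boxed()
                .collect(Collectors.toCollection(ArrayList::new));
        Collections.shuffle(input, new Random(42));
        return input;
    }

    // forEachOrdered must respect the encounter order established by sorted(),
    // so the sort cannot be skipped and the result is ascending.
    static List<Integer> viaForEachOrdered() {
        List<Integer> out = new ArrayList<>();
        shuffledInput().parallelStream().sorted().forEachOrdered(out::add);
        return out;
    }

    // forEach may run concurrently and in any order, hence the synchronized
    // list; only the set of elements is guaranteed, not their sequence.
    static List<Integer> viaForEach() {
        List<Integer> out = Collections.synchronizedList(new ArrayList<>());
        shuffledInput().parallelStream().sorted().forEach(out::add);
        return out;
    }

    public static void main(String[] args) {
        System.out.println("forEachOrdered: " + viaForEachOrdered());
        System.out.println("forEach:        " + viaForEach());
    }
}
```

When the sorted order actually matters, forEachOrdered (or an ordered collector) is the operation that expresses it; with forEach, the engine is free to treat the preceding sort as unnecessary.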

Is this behavior intended, or is it a bug?

Yes :) The back-propagation is intended, as it is a useful optimization that should not produce incorrect results. However, the bug part is that we are propagating past a previous skip, which we shouldn't. So the back-propagation of the UNORDERED flag is overly aggressive, and that's a bug. We'll file a bug.

If it is intended, is it documented somewhere?

It should be just an implementation detail; if it were correctly implemented, you wouldn't notice (except that your streams are faster).
