Java中大集合求交集的方法比较
背景
项目中使用到List求交集,很容易想到collecion.retainAll()方法,但是在数据量比较大时,这个方法效率并不高。本文研究了几种常用的方法,以供大家参考。
方法
【首先】梳理下思路,List去重一般有几种方法。
- 『外层遍历+内层遍历』查找:
复杂度O(NM) ,一般使用contains()检查是否包含
- 『外层遍历+内层Hash』查找:
复杂度O(N),一般将内层List转化为HashSet实现
- 『外层遍历+内层bitMap』查找:
复杂度O(N),一般将内层List转化为字节映射实现
【其次】这里其实忽略了一个点,就是 『单层遍历』中,检查 元素不包含 时,需要将这个元素移除(即remove方法)。remove时,也会导致性能问题。
这里面我们使用Java8中java.util.AbstractCollection#retainAll方法来验证下我们的思路。
// Java8 中 方法:java.util.AbstractCollection#retainAll
public boolean retainAll(Collection<?> c) {
Objects.requireNonNull(c);
boolean modified = false;
Iterator<E> it = iterator();
// 1. 外层遍历
while (it.hasNext()) {
// 2. 内层查找『是否包含』
if (!c.contains(it.next())) {
// 3. 不包含时,移除外层元素
it.remove();
modified = true;
}
}
return modified;
}
这里用图总结下,求交集的流程:
实现
1.『外层遍历+内层遍历』查找:
java中常用2种遍历查找的List:ArrayList、LinkedList,在内外层中测试。
// 外层:ArrayList,内层:ArrayList
private void outArrayListInnerArrayList(List<Long> listA, List<Long> listB) {
long begin = System.currentTimeMillis();
ArrayList<Long> setA = new ArrayList<>(listA);
ArrayList<Long> setB = new ArrayList<>(listB);
setA.retainAll(setB);
long end = System.currentTimeMillis();
System.out.println("[ArrayList-ArrayList]RetainAll耗时:" + (end - begin));
}
// 外层:LinkedList,内层:ArrayList
private void outLinkedListInnerArrayList(List<Long> listA, List<Long> listB) {
long begin = System.currentTimeMillis();
LinkedList<Long> setA = new LinkedList<>(listA);
ArrayList<Long> setB = new ArrayList<>(listB);
setA.retainAll(setB);
long end = System.currentTimeMillis();
System.out.println("[LinkedList-ArrayList]RetainAll耗时:" + (end - begin));
}
// 外层:ArrayList,内层:LinkedList
private void outArrayListInnerLinkedList(List<Long> listA, List<Long> listB) {
long begin = System.currentTimeMillis();
LinkedList<Long> setA = new LinkedList<>(listA);
ArrayList<Long> setB = new ArrayList<>(listB);
setA.retainAll(setB);
long end = System.currentTimeMillis();
System.out.println("[LinkedList-ArrayList]RetainAll耗时:" + (end - begin));
}
// 外层:LinkedList,内层:LinkedList
private void outLinkedListInnerLinkedList(List<Long> listA, List<Long> listB) {
long begin = System.currentTimeMillis();
LinkedList<Long> setA = new LinkedList<>(listA);
ArrayList<Long> setB = new ArrayList<>(listB);
setA.retainAll(setB);
long end = System.currentTimeMillis();
System.out.println("[LinkedList-LinkedList]RetainAll耗时:" + (end - begin));
}
2.『外层遍历+内层Hash』查找:
java中常用HashSet,内层替换为HashSet查找。
// 外层:ArrayList,内层:HashSet
private void outArrayListInnerHashSet(List<Long> listA, List<Long> listB) {
long begin = System.currentTimeMillis();
ArrayList<Long> setA = new ArrayList<>(listA);
HashSet<Long> setB = new HashSet<>(listB);
setA.retainAll(setB);
long end = System.currentTimeMillis();
System.out.println("[ArrayList-HashSet]RetainAll耗时:" + (end - begin));
}
// 外层:LinkedList,内层:HashSet
private void outLinkedListInnerHashSet(List<Long> listA, List<Long> listB) {
long begin = System.currentTimeMillis();
LinkedList<Long> setA = new LinkedList<>(listA);
HashSet<Long> setB = new HashSet<>(listB);
setA.retainAll(setB);
long end = System.currentTimeMillis();
System.out.println("[LinkedList-HashSet]RetainAll耗时:" + (end - begin));
}
3.『外层遍历+内层bitMap』查找:
BitSet也称作BitMap,它是一种通用的快速数据结构,不幸的是它太费内存,所以通常我们使用压缩位图。RoaringBitmap是一种压缩位置,它提供更好的压缩效果,在某些情况下比其它压缩位图快好几百倍。
https://github.com/RoaringBitmap/RoaringBitmap
RoaringBitmap已经使用在很多知名的开源项目中:
- Apache Spark
- Apache Hive
- Apache Tez
- Apache Kylin
- ... ...
Roaringbitmap中在Long类型中,提供了2种实现Roaring64NavigableMap和Roaring64Bitmap。Roaring64NavigableMap基于红黑树实现,Roaring64Bitmap基于ART(The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases )数据结构实现。
那么,外层使用ArrayList、LinkedList,内层使用Roaring64NavigableMap、Roaring64Bitmap。
// 外层:ArrayList,内层:Roaring64NavigableMap
private void outArrayListInnerRoaring64NavigableMap(List<Long> listA, List<Long> listB) {
long begin = System.currentTimeMillis();
ArrayList<Long> setA = new ArrayList<>(listA);
Roaring64NavigableMap ansB = new Roaring64NavigableMap();
listB.forEach(ansB::addLong);
setA.removeIf(e -> !ansB.contains(e));
long end = System.currentTimeMillis();
System.out.println("[ArrayList-Roaring64NavigableMap]RetainAll耗时:" + (end - begin));
}
// 外层:LinkedList,内层:Roaring64NavigableMap
private void outLinkedListInnerRoaring64NavigableMap(List<Long> listA, List<Long> listB) {
long begin = System.currentTimeMillis();
LinkedList<Long> setA = new LinkedList<>(listA);
Roaring64NavigableMap ansB = new Roaring64NavigableMap();
listB.forEach(ansB::addLong);
setA.removeIf(e -> !ansB.contains(e));
long end = System.currentTimeMillis();
System.out.println("[LinkedList-Roaring64NavigableMap]RetainAll耗时:" + (end - begin));
}
// 外层:ArrayList,内层:Roaring64Bitmap
private void outArrayListInnerRoaring64Bitmap(List<Long> listA, List<Long> listB) {
long begin = System.currentTimeMillis();
ArrayList<Long> setA = new ArrayList<>(listA);
Roaring64NavigableMap ansB = new Roaring64NavigableMap();
listB.forEach(ansB::addLong);
setA.removeIf(e -> !ansB.contains(e));
long end = System.currentTimeMillis();
System.out.println("[ArrayList-Roaring64Bitmap]RetainAll耗时:" + (end - begin));
}
// 外层:LinkedList,内层:Roaring64Bitmap
private void outLinkedListInnerRoaring64Bitmap(List<Long> listA, List<Long> listB) {
long begin = System.currentTimeMillis();
LinkedList<Long> setA = new LinkedList<>(listA);
Roaring64Bitmap ansB = new Roaring64Bitmap();
listB.forEach(ansB::addLong);
setA.removeIf(e -> !ansB.contains(e));
long end = System.currentTimeMillis();
System.out.println("[LinkedList-Roaring64Bitmap]RetainAll耗时:" + (end - begin));
}
测试结果
使用Mac Pro 2021 M1 + JDK8测试,100万级的数据太慢,实行中没有太大参考意义,有兴趣可以自行测试。
说明:由于受数据,以及电脑本身负载影响,测试结果可能不一致,仅做量级参考。
| 查找方法(外层-内层) |
1万(毫秒) |
10万(毫秒) |
20万(毫秒) |
50万(毫秒) |
|
| 『外层遍历+内层遍历』查找 |
ArrayList-ArrayList |
67 |
5502 |
26280 |
264133 |
| LinkedList-ArrayList |
63 |
5418 |
27961 |
272362 |
|
| LinkedList-ArrayList |
57 |
5436 |
21330 |
260976 |
|
| LinkedList-LinkedList |
59 |
5153 |
23251 |
252472 |
|
| 『外层遍历+内层Hash』查找 |
ArrayList-HashSet |
8 |
46 |
75 |
102 |
| LinkedList-HashSet |
8 |
17 |
49 |
82 |
|
| 『外层遍历+内层bitMap』查找 |
ArrayList-Roaring64NavigableMap |
91 |
265 |
719 |
876 |
| LinkedList-Roaring64NavigableMap |
26 |
125 |
562 |
876 |
|
| ArrayList-Roaring64Bitmap |
20 |
78 |
572 |
801 |
|
| LinkedList-Roaring64Bitmap |
119 |
171 |
221 |
384 |
附录
pom.xml
<dependency>
<groupId>org.roaringbitmap</groupId>
<artifactId>RoaringBitmap</artifactId>
<version>0.9.23</version>
</dependency>
完整测试代码
import org.apache.commons.lang.math.RandomUtils;
import org.junit.Test;
import org.roaringbitmap.longlong.Roaring64Bitmap;
import org.roaringbitmap.longlong.Roaring64NavigableMap;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Random;
public class SetOperation {
/**
* 集合的运算方法用时测试
*/
@Test
public void setOperation() {
int size = 50_0000;
List<Long> listA = new ArrayList<>(size);
List<Long> listB = new ArrayList<>(size);
initData(size, listA, listB);
//『外层遍历+内层遍历』查找
System.out.println("1. 『外层遍历+内层遍历』查找");
outArrayListInnerArrayList(new ArrayList<>(listA), new ArrayList<>(listB));
outLinkedListInnerArrayList(new ArrayList<>(listA), new ArrayList<>(listB));
outArrayListInnerLinkedList(new ArrayList<>(listA), new ArrayList<>(listB));
outLinkedListInnerLinkedList(new ArrayList<>(listA), new ArrayList<>(listB));
//『外层遍历+内层Hash』查找:
System.out.println("2.『外层遍历+内层Hash』查找:");
outArrayListInnerHashSet(new ArrayList<>(listA), new ArrayList<>(listB));
outLinkedListInnerHashSet(new ArrayList<>(listA), new ArrayList<>(listB));
//『外层遍历+内层bitMap』查找
System.out.println("3.『外层遍历+内层bitMap』查找");
outArrayListInnerRoaring64NavigableMap(new ArrayList<>(listA), new ArrayList<>(listB));
outLinkedListInnerRoaring64NavigableMap(new ArrayList<>(listA), new ArrayList<>(listB));
outArrayListInnerRoaring64Bitmap(new ArrayList<>(listA), new ArrayList<>(listB));
outLinkedListInnerRoaring64Bitmap(new ArrayList<>(listA), new ArrayList<>(listB));
}
// 外层:ArrayList,内层:ArrayList
private void outArrayListInnerArrayList(List<Long> listA, List<Long> listB) {
long begin = System.currentTimeMillis();
ArrayList<Long> setA = new ArrayList<>(listA);
ArrayList<Long> setB = new ArrayList<>(listB);
setA.retainAll(setB);
long end = System.currentTimeMillis();
System.out.println("[ArrayList-ArrayList]RetainAll耗时:" + (end - begin));
}
// 外层:LinkedList,内层:ArrayList
private void outLinkedListInnerArrayList(List<Long> listA, List<Long> listB) {
long begin = System.currentTimeMillis();
LinkedList<Long> setA = new LinkedList<>(listA);
ArrayList<Long> setB = new ArrayList<>(listB);
setA.retainAll(setB);
long end = System.currentTimeMillis();
System.out.println("[LinkedList-ArrayList]RetainAll耗时:" + (end - begin));
}
// 外层:ArrayList,内层:LinkedList
private void outArrayListInnerLinkedList(List<Long> listA, List<Long> listB) {
long begin = System.currentTimeMillis();
LinkedList<Long> setA = new LinkedList<>(listA);
ArrayList<Long> setB = new ArrayList<>(listB);
setA.retainAll(setB);
long end = System.currentTimeMillis();
System.out.println("[LinkedList-ArrayList]RetainAll耗时:" + (end - begin));
}
// 外层:LinkedList,内层:LinkedList
private void outLinkedListInnerLinkedList(List<Long> listA, List<Long> listB) {
long begin = System.currentTimeMillis();
LinkedList<Long> setA = new LinkedList<>(listA);
ArrayList<Long> setB = new ArrayList<>(listB);
setA.retainAll(setB);
long end = System.currentTimeMillis();
System.out.println("[LinkedList-LinkedList]RetainAll耗时:" + (end - begin));
}
// 外层:ArrayList,内层:HashSet
private void outArrayListInnerHashSet(List<Long> listA, List<Long> listB) {
long begin = System.currentTimeMillis();
ArrayList<Long> setA = new ArrayList<>(listA);
HashSet<Long> setB = new HashSet<>(listB);
setA.retainAll(setB);
long end = System.currentTimeMillis();
System.out.println("[ArrayList-HashSet]RetainAll耗时:" + (end - begin));
}
// 外层:LinkedList,内层:HashSet
private void outLinkedListInnerHashSet(List<Long> listA, List<Long> listB) {
long begin = System.currentTimeMillis();
LinkedList<Long> setA = new LinkedList<>(listA);
HashSet<Long> setB = new HashSet<>(listB);
setA.retainAll(setB);
long end = System.currentTimeMillis();
System.out.println("[LinkedList-HashSet]RetainAll耗时:" + (end - begin));
}
// 外层:ArrayList,内层:Roaring64NavigableMap
private void outArrayListInnerRoaring64NavigableMap(List<Long> listA, List<Long> listB) {
long begin = System.currentTimeMillis();
ArrayList<Long> setA = new ArrayList<>(listA);
Roaring64NavigableMap ansB = new Roaring64NavigableMap();
listB.forEach(ansB::addLong);
setA.removeIf(e -> !ansB.contains(e));
long end = System.currentTimeMillis();
System.out.println("[ArrayList-Roaring64NavigableMap]RetainAll耗时:" + (end - begin));
}
// 外层:LinkedList,内层:Roaring64NavigableMap
private void outLinkedListInnerRoaring64NavigableMap(List<Long> listA, List<Long> listB) {
long begin = System.currentTimeMillis();
LinkedList<Long> setA = new LinkedList<>(listA);
Roaring64NavigableMap ansB = new Roaring64NavigableMap();
listB.forEach(ansB::addLong);
setA.removeIf(e -> !ansB.contains(e));
long end = System.currentTimeMillis();
System.out.println("[LinkedList-Roaring64NavigableMap]RetainAll耗时:" + (end - begin));
}
// 外层:ArrayList,内层:Roaring64Bitmap
private void outArrayListInnerRoaring64Bitmap(List<Long> listA, List<Long> listB) {
long begin = System.currentTimeMillis();
ArrayList<Long> setA = new ArrayList<>(listA);
Roaring64NavigableMap ansB = new Roaring64NavigableMap();
listB.forEach(ansB::addLong);
setA.removeIf(e -> !ansB.contains(e));
long end = System.currentTimeMillis();
System.out.println("[ArrayList-Roaring64Bitmap]RetainAll耗时:" + (end - begin));
}
// 外层:LinkedList,内层:Roaring64Bitmap
private void outLinkedListInnerRoaring64Bitmap(List<Long> listA, List<Long> listB) {
long begin = System.currentTimeMillis();
LinkedList<Long> setA = new LinkedList<>(listA);
Roaring64Bitmap ansB = new Roaring64Bitmap();
listB.forEach(ansB::addLong);
setA.removeIf(e -> !ansB.contains(e));
long end = System.currentTimeMillis();
System.out.println("[LinkedList-Roaring64Bitmap]RetainAll耗时:" + (end - begin));
}
private void initData(int size, List<Long> listA, List<Long> listB) {
Random random = new Random();
Random random2 = new Random();
random.longs(size).forEach(e -> {
listA.add(e);
if (random2.nextFloat() > 0.5) {
listB.add(e);
} else {
listB.add(RandomUtils.nextLong());
}
});
}
}