Spark SQL: Joins in Detail
(Unfinished)
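| Join keys | Join algorithm | Description | Configuration |
|---|---|---|---|
| Equi-join (join keys present) | Broadcast Hash Join (BHJ) | BHJ is usually faster than the other join algorithms, but broadcasting a table is a network-intensive operation. If the broadcast side is large, it may cause an OOM or run less efficiently than the other join algorithms. | spark.sql.autoBroadcastJoinThreshold |
| | Shuffled Hash Join (SHJ) | Used if the average size of a single partition is small enough to build a hash table. | |
| | Sort Merge Join (SMJ) | Used if the join keys are sortable. | When spark.sql.join.preferSortMergeJoin is true, SortMergeJoin is used instead of ShuffledHashJoin. |
| Non-equi join (no join keys) | Broadcast Nested Loop Join (BNLJ) | | |
| | Cartesian Product | | |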
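
The rules in the table can be read as a small decision procedure. The sketch below is a simplified illustration in plain Python, not Spark's actual planner code: `JoinInfo`, `choose_join`, and their fields are hypothetical names introduced here, and the rule for choosing between BNLJ and Cartesian Product (broadcast when one side is small enough) is an assumption that the table does not state. The 10 MB default threshold, the `-1` value that disables broadcasting, and the default of `true` for spark.sql.join.preferSortMergeJoin follow Spark's documented defaults.

```python
from dataclasses import dataclass

# Documented default of spark.sql.autoBroadcastJoinThreshold (10 MB).
DEFAULT_BROADCAST_THRESHOLD = 10 * 1024 * 1024


@dataclass
class JoinInfo:
    """Hypothetical summary of a join, for illustration only (not a Spark class)."""
    has_equi_keys: bool                      # is there an equality condition on join keys?
    smaller_side_bytes: int                  # estimated size of the smaller input
    keys_sortable: bool = True               # can the join keys be ordered?
    partition_fits_hash_table: bool = False  # is an average partition small enough to hash?


def choose_join(info, broadcast_threshold=DEFAULT_BROADCAST_THRESHOLD,
                prefer_sort_merge=True):
    """Return the name of the join algorithm suggested by the table (simplified)."""
    # A negative threshold disables broadcasting.
    can_broadcast = 0 <= info.smaller_side_bytes <= broadcast_threshold

    if info.has_equi_keys:
        if can_broadcast:
            return "BHJ"
        # Assumption: SHJ is picked when sort-merge is not preferred and a
        # partition fits in memory, or when the keys cannot be sorted at all.
        use_shj = (not prefer_sort_merge and info.partition_fits_hash_table) \
            or not info.keys_sortable
        return "SHJ" if use_shj else "SMJ"

    # Assumption: without equi-join keys, broadcast if possible, else fall
    # back to a Cartesian product.
    return "BNLJ" if can_broadcast else "CartesianProduct"
```

For example, `choose_join(JoinInfo(True, 1024))` selects BHJ because the smaller side is far below the threshold, while the same join with `broadcast_threshold=-1` falls through to SMJ.
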
HashJoin
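
As a rough illustration of the general hash join technique (a build phase followed by a probe phase), here is a minimal single-machine sketch in plain Python. It is not Spark's implementation: `hash_join` and its parameters are hypothetical names, rows are plain tuples, and only the inner-join case is covered.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Inner hash join: returns (build_row, probe_row) pairs with equal keys."""
    # Build phase: index the (ideally smaller) side by its join key.
    table = defaultdict(list)
    for row in build_rows:
        key = build_key(row)
        if key is not None:  # NULL keys never match in SQL equi-joins
            table[key].append(row)

    # Probe phase: stream the other side and look each key up in the table.
    joined = []
    for row in probe_rows:
        for match in table.get(probe_key(row), []):
            joined.append((match, row))
    return joined


users = [(1, "ann"), (2, "bob")]
orders = [(101, 1), (102, 1), (103, 3)]
pairs = hash_join(users, orders, lambda u: u[0], lambda o: o[1])
```

Here `pairs` holds the two orders placed by user 1; user 2 has no orders and order 103 has no matching user, so neither appears in the result.
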
Broadcast Hash Join
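
The idea behind a broadcast hash join can be sketched without Spark: the small side is turned into a hash table once, that table is made available to every partition of the large side, and each partition is joined locally with no shuffle of the large side. The code below is an assumption-laden toy model in plain Python; `broadcast_hash_join` and its parameters are hypothetical, partitions are modeled as nested lists, and "broadcasting" is simulated by sharing one dictionary.

```python
def broadcast_hash_join(small_rows, large_partitions, small_key, large_key):
    """Join every partition of the large side against one shared hash table."""
    # Build the hash table once from the small side; in a real cluster this
    # table is what would be shipped to every executor.
    broadcast_table = {}
    for row in small_rows:
        broadcast_table.setdefault(small_key(row), []).append(row)

    # Each partition probes the shared table independently, so the large
    # side keeps its original partitioning.
    result_partitions = []
    for partition in large_partitions:
        joined = [
            (small_row, large_row)
            for large_row in partition
            for small_row in broadcast_table.get(large_key(large_row), [])
        ]
        result_partitions.append(joined)
    return result_partitions


countries = [("cn", "China"), ("us", "United States")]
events = [[(1, "cn"), (2, "fr")], [(3, "us"), (4, "cn")]]
joined = broadcast_hash_join(countries, events, lambda c: c[0], lambda e: e[1])
```

The output keeps the two input partitions separate, which mirrors why this approach avoids a shuffle. The cost is that the whole small table must fit in memory on every worker, which is the OOM risk noted for BHJ in the table.
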
JoinSelection
References
- http://hbasefly.com/2017/03/19/sparksql-basic-join/
- https://segmentfault.com/a/1190000039135435
- https://cloud.tencent.com/developer/article/1005502
- https://www.linkedin.com/pulse/spark-sql-3-common-joins-explained-ram-ghadiyaram/
- https://medium.com/@achilleus/https-medium-com-joins-in-apache-spark-part-3-1d40c1e51e1c
- https://databricks.com/session_na20/on-improving-broadcast-joins-in-apache-spark-sql
- https://www.youtube.com/watch?v=isOuTH_49pY