Spark SQL: Joins in Detail

(Unfinished)

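| Join keys | Join algorithm | Description | Configuration |
| --- | --- | --- | --- |
| Equi-join (with join keys) | Broadcast Hash Join (BHJ) | BHJ is usually faster than the other join algorithms, but broadcasting a table is a network-intensive operation. If the broadcast side is large, it can cause an OOM or run less efficiently than the other join algorithms. | spark.sql.autoBroadcastJoinThreshold |
| Equi-join (with join keys) | Shuffled Hash Join (SHJ) | Used when the average size of a single partition is small enough to build a hash table. | |
| Equi-join (with join keys) | Sort Merge Join (SMJ) | Used when the join keys are sortable. | spark.sql.join.preferSortMergeJoin: when true, SortMergeJoin is used instead of ShuffledHashJoin. |
| Non-equi join (no join keys) | Broadcast Nested Loop Join (BNLJ) | | |
| Non-equi join (no join keys) | Cartesian Product | | |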
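
The conditions in the table can be read as a rough decision procedure. The Python sketch below only illustrates that reading and is not Spark's actual JoinSelection code. The `JoinInfo` class, the `choose_join` function, and the size estimates are hypothetical names introduced here. Treating "small enough" as "no larger than the broadcast threshold", and picking BNLJ over the Cartesian product when one side is broadcastable, are both assumptions. The sketch also ignores join types and hints.

```python
from dataclasses import dataclass

# Default value of spark.sql.autoBroadcastJoinThreshold (10 MB); -1 disables broadcasting.
AUTO_BROADCAST_JOIN_THRESHOLD = 10 * 1024 * 1024


@dataclass
class JoinInfo:
    """Hypothetical summary of a join, used only for this illustration."""
    has_equi_keys: bool       # is there an equality condition on join keys?
    keys_sortable: bool       # can the join keys be ordered?
    smaller_side_bytes: int   # estimated size of the smaller input
    avg_partition_bytes: int  # estimated average partition size of that input


def choose_join(info, threshold=AUTO_BROADCAST_JOIN_THRESHOLD,
                prefer_sort_merge_join=True):
    """Return the name of the join algorithm suggested by the table above."""
    can_broadcast = 0 <= info.smaller_side_bytes <= threshold
    # Assumption: a partition is "small enough" if it fits under the threshold.
    partition_fits = threshold >= 0 and info.avg_partition_bytes <= threshold

    if info.has_equi_keys:
        if can_broadcast:
            return "BroadcastHashJoin"
        # preferSortMergeJoin=true favors SMJ over SHJ when the keys are sortable.
        if partition_fits and not (prefer_sort_merge_join and info.keys_sortable):
            return "ShuffledHashJoin"
        if info.keys_sortable:
            return "SortMergeJoin"
    # No equi-join keys, or (as a simplification) none of the rules above applied.
    return "BroadcastNestedLoopJoin" if can_broadcast else "CartesianProduct"
```

For example, a large equi-join on sortable keys resolves to `SortMergeJoin` with the default settings, and to `ShuffledHashJoin` once `prefer_sort_merge_join` is set to `False` and the partitions are small.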

HashJoin
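
BHJ and SHJ differ in how the data reaches each task (broadcast versus shuffle), but both join by building a hash table on one side and probing it with the other. The following is a minimal pure-Python sketch of that build and probe idea for an inner equi-join. It is not Spark code: the `hash_join` function and the sample rows are made up for illustration.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Inner equi-join: build a hash table on one side, probe it with the other."""
    # Build phase: group the (ideally smaller) side by its join key.
    table = defaultdict(list)
    for row in build_rows:
        key = row[build_key]
        if key is not None:  # as in SQL, a NULL key never matches anything
            table[key].append(row)

    # Probe phase: stream the other side and look up matching rows by key.
    joined = []
    for row in probe_rows:
        key = row[probe_key]
        if key is None:
            continue
        for match in table.get(key, []):
            joined.append({**match, **row})
    return joined


# Hypothetical sample data: a small "items" table and a larger "orders" table.
items = [{"item_id": 10, "name": "pen"}, {"item_id": 30, "name": "ink"}]
orders = [
    {"order_id": 1, "item_id": 10},
    {"order_id": 2, "item_id": 20},
    {"order_id": 3, "item_id": 10},
]
result = hash_join(items, orders, "item_id", "item_id")
```

The hash table is built on `items` because it is the smaller side. This is why hash joins need the build side (the whole table for BHJ, a single partition for SHJ) to fit in memory.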

Broadcast Hash Join

JoinSelection

References

  1. http://hbasefly.com/2017/03/19/sparksql-basic-join/
  2. https://segmentfault.com/a/1190000039135435
  3. https://cloud.tencent.com/developer/article/1005502
  4. https://www.linkedin.com/pulse/spark-sql-3-common-joins-explained-ram-ghadiyaram/
  5. https://medium.com/@achilleus/https-medium-com-joins-in-apache-spark-part-3-1d40c1e51e1c
  6. https://databricks.com/session_na20/on-improving-broadcast-joins-in-apache-spark-sql
  7. https://www.youtube.com/watch?v=isOuTH_49pY