Spark SQL:GROUP语句

聚合

TODO

高级聚合

GROUPING SETS

1
GROUP BY GROUPING SETS(分组集1[, 分组集2, ...])

对关键字 GROUPING SETS后面指定的多个分组集合中每个集合都进行一次分组操作,等价于按照每个分组集合单独进行聚合,然后再将这些聚合结果进行 UNION ALL

注意:为了兼容Hive,Spark也支持GROUP BY ... GROUPING SETS (...)这样的写法。GROUP BY 表达式通常被忽略,但是如果GROUP BY 表达式中包含有GROUPING SETS表达式以外的表达式,那么这些额外的表达式将被包含进分组表达式中且其值为NULL,例如SELECT a, b, c FROM ... GROUP BY a, b, c GROUPING SETS (a, b)

GROUPING SETS 等价表示
GROUP BY GROUPING SETS((a), (b))

GROUP BY GROUPING SETS(a, b)
GROUP BY a
UNION ALL
GROUP BY b
GROUP BY a,b GROUPING SETS(a, b) SELECT a, NULL ... GROUP BY a
UNION ALL
SELECT NULL, b ... GROUP BY b

ROLLUP

1
2
3
GROUP BY 分组表达式1[, 分组表达式2, ...] WITH ROLLUP

GROUP BY ROLLUP(分组集1[, 分组集2, ...])
RULLUP 等价表示
GROUP BY a, b WITH ROLLUP

GROUP BY ROLLUP(a, b)
GROUP BY GROUPING SETS((a, b), (a), ())

CUBE

1
2
3
GROUP BY 分组表达式1[, 分组表达式2, ...] WITH CUBE

GROUP BY CUBE(分组集1[, 分组集2, ...])
CUBE 等价表示
GROUP BY a, b WITH CUBE

GROUP BY CUBE(a,b)
GROUP BY GROUPING SETS((a,b), (a), (b), ())
GROUP BY CUBE(a, b, (a,c)) GROUP BY GROUPING SETS((a,b,c), (a,b), (a,c), (b,a,c), (a), (b), (a,c), ())

参考:

  1. https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-groupby.html
  2. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+GroupBy#LanguageManualGroupBy-Multi-Group-ByInserts
  3. https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C+Grouping+and+Rollup