VectorizedReader and ORC
Spark SQL: not only SQL
1. SparkSession / DataFrame / Dataset APIs
2. Catalyst Optimization & Tungsten Execution
3. DataSource Connectors / Spark Core (RDD API)

Optimization happens as late as possible, so Spark SQL can optimize across functions and libraries.
Holistic optimization applies when libraries are used together with SQL/DataFrame code.

Workflow: run EXPLAIN, interpret the plan, tune the plan.
https://dbricks.co/2rR8vAr

Optimizer: rewrites the query plan using heuristics and cost.
column pruning
outer join elimination
predicate pushdown
constraint propagation (broadcast)
constant folding
join reordering
.....

spark.sql.autoBroa...
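
To make one of the rewrite rules listed above concrete, here is a minimal sketch of constant folding on a toy expression tree. It is not Catalyst's implementation: the classes `Literal`, `Column`, `Add` and the function `fold_constants` are hypothetical names invented for this illustration, and only integer addition is handled.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical toy expression nodes (not Spark classes).
@dataclass(frozen=True)
class Literal:
    value: int


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Literal, Column, Add]


def fold_constants(expr: Expr) -> Expr:
    """Rewrite the tree bottom-up, collapsing Add(Literal, Literal) into one Literal."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Literal) and isinstance(right, Literal):
            # Both operands are known before execution, so compute the sum now.
            return Literal(left.value + right.value)
        return Add(left, right)
    # Literals and column references are already as simple as they can be.
    return expr


# x + (1 + 2) is rewritten to x + 3 before any row is processed.
plan = Add(Column("x"), Add(Literal(1), Literal(2)))
folded = fold_constants(plan)
print(folded)
```

The point of the sketch is that the rewrite runs once on the plan, so the addition is not repeated for every row at execution time. Column references are left untouched because their values are only known when the query runs.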