生而为人

程序员的自我修养

0%

程序题

[toc]

大数据面试高频 SQL 必背真题(开窗 + 行列转换 + 倾斜优化 + 经典场景)

全是面试常考,直接背语法和套路,现场就能写。

一、开窗函数 三大必考

1. row_number /rank/dense_rank 区别

  • row_number():连续不重复 1,2,3,4
  • rank():跳跃排名 1,1,3,4
  • dense_rank():连续排名 1,1,2,3

2. 分组取 Top1(每组最新一条)

场景:每个用户取最近一条行为记录

1
2
3
4
5
6
7
8
9
SELECT * FROM (
SELECT
user_id,
event_time,
event_type,
ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY event_time DESC) AS rn
FROM user_behavior
) t
WHERE t.rn = 1;

3. 累计求和(逐月累计)

1
2
3
4
5
SELECT
month,
amount,
SUM(amount) OVER(ORDER BY month) AS total_acc
FROM sales;

4. 移动平均(近 3 日均值)

1
AVG(price) OVER(ORDER BY dt ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)

二、行列转换 面试必写

1. 行转列(多行变一行,逗号拼接)

Hive/Spark SQL

1
2
3
4
5
SELECT
user_id,
CONCAT_WS(',', COLLECT_LIST(course)) AS course_list
FROM user_course
GROUP BY user_id;

2. 列转行(一行拆多行)

1
2
3
4
5
SELECT
user_id,
course
FROM user_course
LATERAL VIEW EXPLODE(SPLIT(course_list, ',')) tmp AS course;

三、数据倾斜 SQL 写法(面试高频)

1. 空值 / 大量 NULL 倾斜 优化

1
2
3
4
5
6
7
8
-- 原写法倾斜
WHERE city IS NULL

-- 优化:加盐打散
SELECT *
FROM table
WHERE IF(city IS NULL, CONCAT('rand_', RAND()), city)
= IF(city IS NULL, CONCAT('rand_', RAND()), city);

2. Key 热点倾斜 加盐两阶段聚合

1
2
3
4
5
6
7
8
9
-- 第一层:局部聚合加盐
SELECT key, CONCAT(key,'_',CAST(RAND()*10 AS INT)) AS salt_key, COUNT(1) AS cnt
FROM log
GROUP BY key, CONCAT(key,'_',CAST(RAND()*10 AS INT));

-- 第二层:去掉盐全局聚合
SELECT REPLACE(salt_key,'_',SUBSTRING_INDEX(salt_key,'_',-1)) AS key, SUM(cnt)
FROM tmp
GROUP BY REPLACE(salt_key,'_',SUBSTRING_INDEX(salt_key,'_',-1));

3. Join 倾斜优化

  • 小表广播 Join/*+ BROADCAST(small_table) */
  • 大表倾斜 Key 单独处理,其余正常 Join,再 Union
1
2
3
4
SELECT /*+ BROADCAST(b) */ a.* 
FROM big_table a
JOIN small_table b
ON a.key = b.key;

四、经典业务 SQL 真题

1. 连续登录天数(超高频)

思路:日期减去行号,相同即为连续

1
2
3
4
5
6
7
8
9
10
SELECT user_id, MIN(dt), MAX(dt), COUNT(1) AS continue_days
FROM (
SELECT
user_id,
dt,
DATE_SUB(dt, ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY dt)) AS flag
FROM login_log
) t
GROUP BY user_id, flag
HAVING COUNT(1) >= 3;

2. 留存率计算(次日留存)

1
2
3
4
5
6
7
8
9
SELECT
t1.dt,
COUNT(DISTINCT t1.user_id) AS new_user,
COUNT(DISTINCT t2.user_id) AS retain_user
FROM new_user t1
LEFT JOIN login_log t2
ON t1.user_id = t2.user_id
AND t2.dt = DATE_ADD(t1.dt,1)
GROUP BY t1.dt;

五、大数据 SQL 通用优化口诀(面试背)

  1. 尽早过滤:先 where 后 join,减少 shuffle
  2. 分组前过滤,聚合少数据
  3. 小表广播 Join,避免 shuffle 倾斜
  4. 避免 select *,只查需要字段
  5. 分区过滤必加,禁止全表扫描
  6. 倾斜 Key:加盐打散、局部 + 全局聚合
  7. 用好开窗代替子查询,简洁高效