Born Human

Common Hive Functions

Posted on 2026-06-18 In hive

Common

Convert an 8-digit dt to the 10-character form

from_unixtime(unix_timestamp(t.dt,'yyyyMMdd'),'yyyy-MM-dd')

Subtract 1 hour from an hourly dt

dt format: 2019120100 (yyyyMMddHH)
from_unixtime(unix_timestamp(dt,'yyyyMMddHH') - 60*60, 'yyyy-MM-dd')

Date arithmetic on an 8-digit dt

regexp_replace(date_sub(from_unixtime(unix_timestamp(t.dt,'yyyyMMdd'),'yyyy-MM-dd'), 10), '-', '')
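As a quick local cross-check of the three date expressions above, the following Python sketch mirrors them with the standard library. It is not Hive code and the function names are hypothetical. Note that Python's format directives use the opposite case convention from Hive's patterns: in Python `%m` is the month and `%M` the minute, whereas in Hive `MM` is the month and `mm` the minute.

```python
from datetime import datetime, timedelta


def dt8_to_dt10(dt: str) -> str:
    """Mirror of the yyyyMMdd -> yyyy-MM-dd conversion."""
    return datetime.strptime(dt, "%Y%m%d").strftime("%Y-%m-%d")


def hourly_dt_minus_one_hour(dt: str) -> str:
    """Mirror of subtracting 60*60 seconds from a yyyyMMddHH value."""
    shifted = datetime.strptime(dt, "%Y%m%d%H") - timedelta(hours=1)
    return shifted.strftime("%Y-%m-%d")


def dt8_minus_days(dt: str, days: int) -> str:
    """Mirror of date_sub(...) followed by stripping the dashes."""
    shifted = datetime.strptime(dt, "%Y%m%d") - timedelta(days=days)
    return shifted.strftime("%Y%m%d")


print(dt8_to_dt10("20191201"))                 # 2019-12-01
print(hourly_dt_minus_one_hour("2019120100"))  # 2019-11-30
print(dt8_minus_days("20191201", 10))          # 20191121
```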

Day before the current date

Round to 2 decimal places

round(123.4567, 2)

Timestamp conversion


select hour(from_unixtime(event_timestamp/1000,'yyyy-MM-dd HH:mm:ss')) hour_time,
       count(distinct session_id)
from mart_eif_flow.bas_log_sdk_waimai_c
where dt='20201217'
group by hour(from_unixtime(event_timestamp/1000,'yyyy-MM-dd HH:mm:ss'))
order by hour_time
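To sanity-check what the query above computes, here is a small Python sketch of the same idea: convert a millisecond timestamp to an hour of day, then count distinct sessions per hour. The sample rows and the function name are made up for illustration, and the sketch assumes UTC, whereas Hive's from_unixtime uses the system time zone, so the hours may differ on a real cluster.

```python
from collections import defaultdict
from datetime import datetime, timezone


def sessions_per_hour(rows):
    """rows: iterable of (event_timestamp_ms, session_id) pairs."""
    sessions = defaultdict(set)
    for ts_ms, session_id in rows:
        # Milliseconds -> seconds, then take the hour of day (UTC assumed).
        hour = datetime.fromtimestamp(ts_ms // 1000, tz=timezone.utc).hour
        bucket = sessions[hour]
        if session_id is not None:  # count(distinct ...) skips NULLs
            bucket.add(session_id)
    return {hour: len(ids) for hour, ids in sorted(sessions.items())}


# Hypothetical sample: three events at 08:xx UTC and one at 09:00 UTC on 2020-12-17.
sample_rows = [
    (1608192000000, "a"),
    (1608192060000, "a"),
    (1608192120000, "c"),
    (1608195600000, "b"),
]
print(sessions_per_hour(sample_rows))  # {8: 2, 9: 1}
```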

Calculation

Rounding decimals

Strings

Replace

regexp_replace('abc/d', '/', '')

json

get_json_object(json_str, '$.csu_id') -- json_str can be a standard JSON string, or a string that contains escape characters

Common Hive Functions

Posted on 2026-06-18 In hive

with temp_table as (
  select '1' as f1, null as f2
  union all
  select '1', '3'
)

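count(distinct ...) counts only the distinct combinations in which every listed field is non-null.

So when you use count(distinct ...) to check for duplicates across fields, remember to replace the nulls in those fields with a value first.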

select count(*)
from temp_table
--
`count`(*)
2
    
select count(distinct f1, f2)
from temp_table
--
`count`(DISTINCT `f1`, `f2`)
1

select count(f1, f2)
from temp_table
--
`count`(`f1`, `f2`)
1
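The three results above can be reproduced with a short Python emulation, which makes the null rule explicit. This is a sketch of the semantics only, with hypothetical function names; the `fill_nulls` helper stands in for replacing nulls with a placeholder value before counting.

```python
def count_star(rows):
    """count(*): every row counts, nulls included."""
    return len(rows)


def count_columns(rows):
    """count(f1, f2): only rows where all listed fields are non-null."""
    return sum(1 for row in rows if all(v is not None for v in row))


def count_distinct(rows):
    """count(distinct f1, f2): distinct rows among those with no nulls."""
    return len({row for row in rows if all(v is not None for v in row)})


def fill_nulls(rows, placeholder="-1"):
    """Replace nulls with a placeholder so they take part in the count."""
    return [tuple(placeholder if v is None else v for v in row) for row in rows]


temp_table = [("1", None), ("1", "3")]
print(count_star(temp_table))                  # 2
print(count_distinct(temp_table))              # 1
print(count_columns(temp_table))               # 1
print(count_distinct(fill_nulls(temp_table)))  # 2
```

The placeholder must be a value that cannot occur in the real data, otherwise it would merge with genuine rows.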

Rows whose join keys contain null never match, even when both sides are null.

select a.f1, a.f2, b.f1, b.f2
from temp_table a
left join temp_table b
on a.f1 = b.f1
and a.f2 = b.f2
--
f1	f2	f1	f2
1	NULL	NULL	NULL
1	3	1	3
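The join result above follows from SQL's rule that comparing anything with null yields null rather than true. A minimal Python emulation of a left join, with hypothetical helper names, shows both the default behavior and a null-safe variant. Hive also has a null-safe equality operator `<=>`; whether it fits your join is worth checking in the Language Manual for your version.

```python
def sql_equal(a, b):
    """SQL-style '=': any comparison involving NULL yields NULL (None)."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_equal(a, b):
    """Null-safe comparison: two NULLs are treated as equal."""
    return a == b


def left_join(left, right, equal):
    """Left join on all columns, keeping unmatched left rows padded with NULLs."""
    width = len(right[0]) if right else 0
    result = []
    for l_row in left:
        matches = [
            r_row for r_row in right
            if all(equal(x, y) is True for x, y in zip(l_row, r_row))
        ]
        if matches:
            result.extend(l_row + r_row for r_row in matches)
        else:
            result.append(l_row + (None,) * width)
    return result


temp_table = [("1", None), ("1", "3")]
print(left_join(temp_table, temp_table, sql_equal))
# [('1', None, None, None), ('1', '3', '1', '3')]
print(left_join(temp_table, temp_table, null_safe_equal))
# [('1', None, '1', None), ('1', '3', '1', '3')]
```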

Nulls among the inputs to sum are ignored and do not affect the result.

select sum(ct)
from(select 1 as ct
union all
select null)a

--
`sum`(`ct`)
1
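The same rule as a tiny Python sketch (the function name is hypothetical). It also covers the edge case where every input is null, in which case SQL's sum returns null rather than 0.

```python
def sql_sum(values):
    """Emulate SQL sum(): skip NULLs; return NULL (None) if nothing is left."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) if non_null else None


print(sql_sum([1, None]))     # 1
print(sql_sum([None, None]))  # None
```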

Untitled

Posted on 2026-06-18

How to check the storage space a table occupies in Hive

Hive: Official Documentation Index

Posted on 2026-06-18 In hive

https://cwiki.apache.org/confluence/display/Hive/Home#Home-GeneralInformationaboutHive

https://cwiki.apache.org/confluence/display/Hive/LanguageManual

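Home

Language Manual

1. Command line and clients

1.1 Data types

Type conversion (Change Types)
Implicit type conversion matrix (Allowed Implicit Conversions)

1.2 Configuration parameters

Common configuration parameters

1.3 SELECT syntax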

ALL and DISTINCT clauses (UNION can achieve the same effect)
Partition Based Queries
HAVING Clause
REGEX Column Specification
GROUP BY
SORT/ORDER/CLUSTER/DISTRIBUTE BY
JOIN
UNION
TABLESAMPLE
Subqueries
Virtual Columns
Operators and UDFs (Hive functions)
LATERAL VIEW
Windowing, OVER, and Analytics
Common Table Expressions (temporary-table syntax)

Data Profiling Logic

Posted on 2026-06-18 In hive


select count(distinct seller_bu_id) as ct1,
       count(distinct seller_bu_id, management_city_id) as ct2
from mart_caterb2b.dim_seller_management_city_info

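If ct1 = ct2, seller_bu_id : management_city_id is an n:1 relationship; whether n is 1 or many needs further checking.
If ct1 < ct2, seller_bu_id : management_city_id is a 1:n relationship, where n is many.
If ct1 > ct2, management_city_id contains null values, and the relationship must be recomputed after excluding those nulls.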
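The three rules above can be tried out on toy data with the following Python sketch. The function name and sample rows are hypothetical; the two counters follow the count(distinct ...) semantics, where a combination containing a null is not counted.

```python
def profile_relationship(rows):
    """rows: (seller_bu_id, management_city_id) pairs; returns (ct1, ct2, verdict)."""
    ct1 = len({a for a, _ in rows if a is not None})
    ct2 = len({(a, b) for a, b in rows if a is not None and b is not None})
    if ct1 == ct2:
        verdict = "n:1"
    elif ct1 < ct2:
        verdict = "1:n"
    else:
        verdict = "nulls in second column"
    return ct1, ct2, verdict


print(profile_relationship([(1, "x"), (2, "x")]))   # (2, 2, 'n:1')
print(profile_relationship([(1, "x"), (1, "y")]))   # (1, 2, '1:n')
print(profile_relationship([(1, None), (2, "x")]))  # (2, 1, 'nulls in second column')
```

One caveat: nulls can also hide a 1:n relationship. For example, the rows (1, null), (2, 'x'), (2, 'y') give ct1 = ct2 = 2, so it is safer to check the null count of the second column before trusting the comparison.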

Untitled

Posted on 2026-06-18

Basic Syntax

Posted on 2026-06-18 In sql-mysql

Update

1. Single-row update

2. Batch update

UPDATE mytable 
SET myfield = CASE other_field 
WHEN 1 THEN 'value' 
WHEN 2 THEN 'value' 
WHEN 3 THEN 'value' 
END 
WHERE id IN (1,2,3)

Select

1. limit


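Statement 1: select * from student limit 9,4 -- the first argument is the offset, so results start from the row after it (the 10th row); the second argument is the number of rows returned.
Statement 2: select * from student limit 4 offset 9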
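Both statements skip 9 rows and return the next 4. In Python terms this corresponds to a list slice, as the sketch below shows; the `students` list is a made-up stand-in for an ordered result set, and `limit` is a hypothetical helper.

```python
def limit(rows, offset, count):
    """Emulate 'LIMIT offset, count' (or 'LIMIT count OFFSET offset')."""
    return rows[offset:offset + count]


students = list(range(1, 21))  # hypothetical rows numbered 1..20
print(limit(students, 9, 4))   # [10, 11, 12, 13]
```

For pagination with a fixed page size, the offset for page p (starting at 1) would be (p - 1) * page_size.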

MySQL LIMIT pagination

How MySQL Queries Work

Posted on 2026-06-18 In sql-mysql

Differences between subqueries and join queries

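Chapter 20: Stream Processing Fundamentals

Posted on 2026-06-18 In spark-reading-notes

Continuous processing vs. micro-batch processing

Continuous processing:

Advantages:

1. Low latency

Drawbacks:

1. Low throughput
2. Continuous processing systems usually have a fixed computation topology that cannot be changed at runtime without stopping the whole system, which can also lead to load-balancing problems. Open questions: what is a computation topology, and why would it cause load-balancing problems?

Micro-batch processing:

Advantages:

High throughput can be achieved on each node, because micro-batch systems can apply the same optimizations as batch systems (for example, vectorized processing). Open question: what is vectorized processing?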
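To make the trade-off concrete, here is a toy Python sketch of micro-batching. It is not how Spark implements it; the helper name and batch size are assumptions. Records are collected into small batches and each batch is handled with one call, which amortizes per-call overhead (better throughput) at the cost of each record waiting for its batch to fill (higher latency).

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Group an incoming stream into lists of at most batch_size records."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# One aggregate call per batch instead of one call per record.
batch_totals = [sum(batch) for batch in micro_batches(range(1, 8), 3)]
print(batch_totals)  # [6, 15, 7]
```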

Untitled

Posted on 2026-06-18