假設有壹個登陸表login(當天登陸記錄,只有壹個uid),和壹個用戶註冊表regusers(當天註冊用戶,字段只有壹個uid),這兩個表都包含壹個字段,uid。
in查詢
如果要查詢當天登陸的註冊用戶,需要用in查詢,hive sql如下:
select login.uid from login left outer join regusers on login.uid=regusers.uid where regusers.uid is not null
如果login表和regusers表按天分區,字段是dt,那麽查詢2013年1月1號當天登陸的註冊用戶,hive sql如下:
select login.uid from login day_login left outer join
(select uid from regusers where dt='20130101') day_regusers
on day_login.uid=day_regusers.uid where day_login.dt='20130101' and day_regusers.uid is not null
not in查詢
如果要查詢當天登陸的老用戶(這裏假設非當天註冊用戶就是老用戶),需要用not in查詢,hive sql如下:
select login.uid from login left outer join regusers on login.uid=regusers.uid where regusers.uid is null;
如果login表和regusers表按天分區,字段是dt,那麽查詢2013年1月1號當天登陸的老用戶,hive sql如下:
select login.uid from login day_login left outer join
(select uid from regusers where dt='20130101') day_regusers
on day_login.uid=day_regusers.uid where day_login.dt='20130101' and day_regusers.uid is null;
Hive join優化
========================================================
由
於 hive 與傳統關系型數據庫面對的業務場景及底層技術架構都有著很大差異,因此,傳統數據庫領域的壹些技能放到 Hive 中可能已不再適用。關於
hive 的優化與原理、應用的文章,前面也陸陸續續的介紹了壹些,但大多都偏向理論層面,本文就介紹壹個實例,從實例中壹步步加深對 hive
調優的認識與意識。
1、需求
需求我做了簡化,很簡單,兩張表做個 join,求指定城市,每天的 pv,用傳統的 RDBMS SQL 寫出來就這樣的:
1
2
3
4
5
6
7
8
9
10
11
SELECT t.statdate,
c.cname,
count(t.cookieid)
FROM tmpdb.city c
JOIN ecdata.ext_trackflow t ON (t.area1= c.cname
OR t.area2 =c.cname
OR t.area3 = c.cname)
WHERE t.statdate>='20140818' and t.statdate<='20140824'
AND platform='pc'
GROUP BY t.statdate,
c.cname;
怎麽樣?根據 SQL 看懂需求沒問題吧?
2、非等值 join 問題
然後把這條 SQL 貼到 hive 中去執行,然後妳會發現報錯了:
1
FAILED: SemanticException [Error 10019]: Line 5:32 OR not supported in JOIN currently 'cname'
這是因為 hive 受限於 MapReduce 算法模型,只支持 equi-joins(等值 join),要實現上述的非等值 join,妳可以采用笛卡兒積( full Cartesian product )來實現:
1
2
3
4
5
6
7
8
9
10
11
12
13
SELECT t.statdate,
c.cname,
count(t.cookieid)
FROM tmpdb.city c
JOIN ecdata.ext_trackflow t
WHERE t.statdate>='20140818'
AND t.statdate<='20140824'
AND platform='pc'
AND (t.area1= c.cname
OR t.area2 =c.cname
OR t.area3 = c.cname)
GROUP BY t.statdate,
c.cname;
然後再拿著這條語句執行下。
3、優化:reduce side join VS Cartesian product
如果妳真的把這條語句放到 Hive 上執行,然後恰好妳有張表還非常大,那麽恭喜妳。。。集群管理員估計會找妳的麻煩了。。。
友情提示:笛卡兒積這種語句在 Hive 下慎用,大數據場景下的 m * n 映射結果妳懂的。。。對此,Hive 特意提供了壹個環境變量:hive.mapred.mode=strict; 防止笛卡兒積的執行:
1
FAILED: SemanticException [Error 10052]: In strict mode, cartesian product is not allowed. If you really want to perform the operation, set hive.mapred.mode=nonstrict
從 2 中的觀察得知我們在 on 後面跟 join
條件,走的是 reduce side join,如果妳在 where 後跟則是走 Cartesian product,但是這裏單條 sql
又沒法實現 reduce side join,還有沒有其它辦法呢?
4、改寫非等值 join:union all
既然不允許非等值 join,那我們換壹下思路,多個子查詢 union all,然後匯總:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
SELECT dt,
name,
count(cid)
FROM
(SELECT t.statdate dt,
c.cname name,
t.cookieid cid
FROM tmpdb.city c
JOIN ecdata.ext_trackflow t ON t.area1 =c.cname
WHERE t.statdate>='20140818'
AND t.statdate<='20140824'
AND platform='pc'
UNION ALL SELECT t.statdate dt,
c.cname name,
t.cookieid cid
FROM tmpdb.city c
JOIN ecdata.ext_trackflow t ON t.area2 =c.cname
WHERE t.statdate>='20140818'
AND t.statdate<='20140824'
AND platform='pc'
UNION ALL SELECT t.statdate dt,
c.cname name,
t.cookieid cid
FROM tmpdb.city c
JOIN ecdata.ext_trackflow t ON t.area3 =c.cname
WHERE t.statdate>='20140818'
AND t.statdate<='20140824'
AND platform='pc') tmp_trackflow
GROUP BY dt,
name;
5、優化:map side join
上述語句走的是 reduce side join,從我們的需求及業務得知,tmpdb.city 是壹張字典表,數據量很小,因此我們可以試試把上述的語句改寫成 mapjoin:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
SELECT dt,
name,
count(cid)
FROM
(SELECT /*+ MAPJOIN(c) */ t.statdate dt,
c.cname name,
t.cookieid cid
FROM tmpdb.city c
JOIN ecdata.ext_trackflow t ON t.area1 =c.cname
WHERE t.statdate>='20140818'
AND t.statdate<='20140824'
AND platform='pc'
UNION ALL SELECT /*+ MAPJOIN(c) */ t.statdate dt,
c.cname name,
t.cookieid cid
FROM tmpdb.city c
JOIN ecdata.ext_trackflow t ON t.area2 =c.cname
WHERE t.statdate>='20140818'
AND t.statdate<='20140824'
AND platform='pc'
UNION ALL SELECT /*+ MAPJOIN(c) */ t.statdate dt,
c.cname name,
t.cookieid cid
FROM tmpdb.city c
JOIN ecdata.ext_trackflow t ON t.area3 =c.cname
WHERE t.statdate>='20140818'
AND t.statdate<='20140824'
AND platform='pc') tmp_trackflow
GROUP BY dt,
name;
6、優化無極限:開啟 parallel 和 控制 reduce 個數
上述語句執行時,妳可以看到執行計劃和狀態信息,以及結合妳的 union all 語句可知,三個 union 語句之間沒有依賴關系,其實是可以並行執行的:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
explain SQL...
...
STAGE DEPENDENCIES:
Stage-11 is a root stage
Stage-1 depends on stages: Stage-11
Stage-2 depends on stages: Stage-1
Stage-3 depends on stages: Stage-2, Stage-6, Stage-9
Stage-12 is a root stage
Stage-5 depends on stages: Stage-12
Stage-6 depends on stages: Stage-5
Stage-13 is a root stage
Stage-8 depends on stages: Stage-13
Stage-9 depends on stages: Stage-8
Stage-0 is a root stage
...
我們在 SQL 前加上如下環境變量選項:
1
2
set mapred.reduce.tasks=60;
set hive.exec.parallel=true;
讓執行計劃中的 Stage-11、Stage-12、Stage-13 並行執行,並控制好 reduce task 個數。
完整的語句如下:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
hive -e "
SET mapred.reduce.tasks=60;
SET hive.exec.parallel=TRUE;
SELECT dt,
name,
count(cid)
FROM
(SELECT /*+ MAPJOIN(c) */ t.statdate dt,
c.cname name,
t.cookieid cid
FROM tmpdb.city c
JOIN ecdata.ext_trackflow t ON t.area1 =c.cname
WHERE t.statdate>='20140818'
AND t.statdate<='20140824'
AND platform='pc'
UNION ALL SELECT /*+ MAPJOIN(c) */ t.statdate dt,
c.cname name,
t.cookieid cid
FROM tmpdb.city c
JOIN ecdata.ext_trackflow t ON t.area2 =c.cname
WHERE t.statdate>='20140818'
AND t.statdate<='20140824'
AND platform='pc'
UNION ALL SELECT /*+ MAPJOIN(c) */ t.statdate dt,
c.cname name,
t.cookieid cid
FROM tmpdb.city c
JOIN ecdata.ext_trackflow t ON t.area3 =c.cname
WHERE t.statdate>='20140818'
AND t.statdate<='20140824'
AND platform='pc') tmp_trackflow
GROUP BY dt,
name;
" > a1.txt
最後的優化效果是:2 中的語句三個小時沒出結果。。。5 比 4 快 8 倍左右,6 比 5 快 2 倍左右,最終 10min 出結果。