
ClickHouse data compression comparison

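ClickHouse mainly uses two data compression schemes: LZ4 and ZSTD.

LZ4 decompresses faster but has a lower compression ratio.

ZSTD decompresses more slowly but has a higher compression ratio.

A published benchmark comparing the compression algorithms available to ClickHouse found LZ4 to be the best choice: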

/blog/2016/04/13/evaluating-database-compression-methods-update

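The tests below mainly check the conclusion of that industry benchmark. The ZSTD table holds slightly more data than the LZ4 table, so the test is not fully rigorous and is for reference only.

Development (dev) environment: machines: 3, CPU: 40 cores, memory: 256G, disk: 2.0T*10

Kafka topic: cdn-log-analysis-realtime. Total consumable records: 363255827. The data was consumed into ClickHouse (ck) 4 times.

cdn_log_analysis_realtime: LZ4 compression

cdn_log_realtime: ZSTD compression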

In /etc/metrika.xml:

<compression incl="clickhouse_compression"> <!-- specifies incl -->

<case>

<min_part_size>10000000000</min_part_size> <!-- minimum size of a data part -->

<min_part_size_ratio>0.01</min_part_size_ratio> <!-- ratio of the data part size to the table size -->

<method>zstd</method> <!-- compression algorithm: zstd or lz4 -->

</case>

</compression>
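To make the meaning of the two thresholds concrete, here is a minimal Python sketch of how a single `<case>` could be evaluated. It is a model for illustration, not ClickHouse code: it assumes that a case applies only when both conditions hold and that LZ4 is used otherwise, and the names `CompressionCase` and `choose_method` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CompressionCase:
    """One <case> block from the config above (illustrative model only)."""
    min_part_size: int          # bytes
    min_part_size_ratio: float  # part size divided by table size
    method: str


def choose_method(part_bytes: int, table_bytes: int,
                  case: CompressionCase, default: str = "lz4") -> str:
    """Return the method for a data part.

    Assumption: the case matches only if BOTH thresholds are met,
    and the default codec (assumed to be lz4) applies otherwise.
    """
    ratio = part_bytes / table_bytes if table_bytes else 0.0
    if part_bytes >= case.min_part_size and ratio >= case.min_part_size_ratio:
        return case.method
    return default


case = CompressionCase(min_part_size=10_000_000_000,
                       min_part_size_ratio=0.01,
                       method="zstd")

# A 12 GB part in a 100 GB table meets both thresholds.
print(choose_method(12_000_000_000, 100_000_000_000, case))
# A 0.5 GB part in the same table is below min_part_size.
print(choose_method(500_000_000, 100_000_000_000, case))
```

Under these assumptions, parts smaller than `min_part_size` would not be written with ZSTD, which is worth keeping in mind when checking which codec a table's parts actually use.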

Run SQL:

SELECT
    table AS 表名,
    sum(rows) AS 總行數,
    formatReadableSize(sum(data_uncompressed_bytes)) AS 原始大小,
    formatReadableSize(sum(data_compressed_bytes)) AS 壓縮大小,
    round((sum(data_compressed_bytes) / sum(data_uncompressed_bytes)) * 100, 0) AS 壓縮率
FROM system.parts
WHERE (database IN ('default')) AND (table = 'cdn_log_analysis_realtime')
GROUP BY table

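Check the compression ratio on each machine separately.

Averaged across the machines: about 485 million rows, 105G of raw data, 27G after compression, for an average compression ratio of 27% (compressed size as a percentage of raw size).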

Run SQL:

SELECT
    toDateTime(intDiv(toUInt32(its), 60) * 60) AS t,
    count() AS t_c,
    avg(speed) AS t_v,
    quantile(0.99)(speed) AS t_99,
    quantile(0.90)(speed) AS t_90,
    quantile(0.75)(speed) AS t_75,
    quantile(0.50)(speed) AS t_50,
    quantile(0.25)(speed) AS t_25
FROM default.cdn_log_analysis_realtime_all
WHERE day = '2020-12-17'
GROUP BY t
ORDER BY t_v DESC

Cold data (first query)

Hot data (second query)

Run SQL:

SELECT
    table AS 表名,
    sum(rows) AS 總行數,
    formatReadableSize(sum(data_uncompressed_bytes)) AS 原始大小,
    formatReadableSize(sum(data_compressed_bytes)) AS 壓縮大小,
    round((sum(data_compressed_bytes) / sum(data_uncompressed_bytes)) * 100, 0) AS 壓縮率
FROM system.parts
WHERE (database IN ('default')) AND (table = 'cdn_log_realtime')
GROUP BY table

Check the compression ratio on each machine separately.

Run SQL:

SELECT
    toDateTime(intDiv(toUInt32(its), 60) * 60) AS t,
    count() AS t_c,
    avg(speed) AS t_v,
    quantile(0.99)(speed) AS t_99,
    quantile(0.90)(speed) AS t_90,
    quantile(0.75)(speed) AS t_75,
    quantile(0.50)(speed) AS t_50,
    quantile(0.25)(speed) AS t_25
FROM default.cdn_log_realtime
WHERE day = '2020-12-25'
GROUP BY t
ORDER BY t_v DESC

Cold data (first query)

Hot data (second query)

Run SQL:

SELECT
    'ZSTD' AS 壓縮方式,
    table AS 表名,
    sum(rows) AS 總行數,
    formatReadableSize(sum(data_uncompressed_bytes)) AS 原始大小,
    formatReadableSize(sum(data_compressed_bytes)) AS 壓縮大小,
    round((sum(data_compressed_bytes) / sum(data_uncompressed_bytes)) * 100, 0) AS 壓縮率
FROM cluster(ctyun31, system, parts)
WHERE (database IN ('default')) AND (table = 'cdn_log_realtime')
GROUP BY table
UNION ALL
SELECT
    'LZ4' AS 壓縮方式,
    table AS 表名,
    sum(rows) AS 總行數,
    formatReadableSize(sum(data_uncompressed_bytes)) AS 原始大小,
    formatReadableSize(sum(data_compressed_bytes)) AS 壓縮大小,
    round((sum(data_compressed_bytes) / sum(data_uncompressed_bytes)) * 100, 0) AS 壓縮率
FROM cluster(ctyun31, system, parts)
WHERE (database IN ('default')) AND (table = 'cdn_log_analysis_realtime')
GROUP BY table

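The test is not fully rigorous: the ZSTD table in ClickHouse holds slightly more data. This does not change the outcome, but the results are for reference only.

Compression: ZSTD compressed the data to 22% of its raw size and LZ4 to 27%. ZSTD compresses better, but the difference is not large.

Queries: on cold data the two are about the same. On hot data, ZSTD took 3.884s and LZ4 took 1.150s, so the ZSTD query took more than 3.37 times as long. LZ4 is clearly faster to query.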
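The trade-off can be put into two numbers. The short Python sketch below uses only the figures quoted above; the variable names are illustrative, and the "disk saving" is simply the relative difference between the two compression percentages.

```python
# Measured figures quoted in the text.
zstd_ratio, lz4_ratio = 0.22, 0.27      # compressed size / raw size
zstd_hot_s, lz4_hot_s = 3.884, 1.150    # hot-data query times in seconds

# How many times longer the ZSTD query took on hot data.
slowdown = zstd_hot_s / lz4_hot_s

# Fraction of LZ4's on-disk footprint that ZSTD would save.
space_saved = (lz4_ratio - zstd_ratio) / lz4_ratio

print(f"ZSTD hot-query slowdown: {slowdown:.2f}x")    # 3.38x
print(f"ZSTD disk saving vs LZ4: {space_saved:.1%}")  # 18.5%
```

In this test, ZSTD saves a little under a fifth of the disk space at the cost of a hot query that takes more than three times as long.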

In summary, LZ4 is the recommended choice.

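Projected cluster capacity: with the current LZ4 compression scheme, 3 shards and 1 replica, estimate roughly how much data 3*5.5*10*0.8 TB of disk can hold (assuming disks are filled to at most 80%).

At 10 billion rows per day:

Disk used per day: (10000000000/1453023308.0*84.98)/1024.0 = 0.57TB

Days of storage: 3*5.5*10*0.8/0.57 = 231.57 days.

At 100 billion rows per day:

231.57/10 = 23.1 days.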
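The estimate can be reproduced with a small Python sketch. All figures come from the text; the function and parameter names are illustrative, and reading 5.5 as TB per disk and 10 as disks per node is an assumption, since the text only gives the product. Note that 1453023308 is the topic's 363255827 records consumed 4 times.

```python
ROWS_INGESTED = 363_255_827 * 4   # = 1,453,023,308 rows loaded in the test
LZ4_COMPRESSED_GB = 84.98         # compressed size used in the estimate above


def daily_disk_tb(rows_per_day: float) -> float:
    """TB written per day, scaling the measured LZ4 footprint linearly."""
    return rows_per_day / ROWS_INGESTED * LZ4_COMPRESSED_GB / 1024.0


def retention_days(rows_per_day: float, shards: int = 3,
                   tb_per_disk: float = 5.5,   # assumed meaning of "5.5"
                   disks_per_node: int = 10,   # assumed meaning of "10"
                   max_fill: float = 0.8) -> float:
    """Days of data that fit when disks are filled to at most max_fill."""
    usable_tb = shards * tb_per_disk * disks_per_node * max_fill
    return usable_tb / daily_disk_tb(rows_per_day)


for rows in (10_000_000_000, 100_000_000_000):
    print(f"{rows:,} rows/day: {daily_disk_tb(rows):.2f} TB/day, "
          f"about {retention_days(rows):.0f} days")
```

Because the sketch does not round the daily figure to 0.57 before dividing, it gives about 231.1 days rather than 231.57 for 10 billion rows per day; the difference is rounding only.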
