原文:http://www.cnblogs.com/yumiko/p/6060485.html
对于ORACLE而言,CBO优化器可以根据直方图收集的列值分布信息,让选择性高(返回数据行比例少)的列值使用索引,而选择性低(返回数据行比例多)的列值不使用索引。尤其对于存在数据倾斜严重的列而言,直方图很重要。
注:数据倾斜,主要指某列上的一个数值,相较于该列其他数值,出现比例高,如:“性别”列,“男性”占到该列整体数值(男性、女性)的80%,存在明显的数据倾斜现象。
一般而言,直方图不受是否使用索引的限制,即可以用来统计索引列,也可以统计非索引列。但对于非索引列的统计,意义不大。
直方图类型的视图:DBA_TAB_COL_STATISTICS,USER_TAB_COL_STATISTICS,ALL_TAB_COL_STATISTICS
直方图具体信息的视图:DBA_TAB_HISTOGRAMS,USER_TAB_HISTOGRAMS,ALL_TAB_HISTOGRAMS
本示例中使用的数据库版本为ORACLE 11.2.0.4。
Yumiko_sunny@OA01> select distinct owner ,count(*) as col_rows, (select count(*) from test) as tab_rows, to_char(round(count(*)/(select count(*) from test)*100,2),'90.99')||'%' as data_ratio from test group by owner; OWNER COL_ROWS TAB_ROWS DATA_RATIO -------------------- --------------- ---------------- ---------- HR 476 535164 0.09% OE 1988 535164 0.37% ORDDATA 3598 535164 0.67% SCOTT 98 535164 0.02% SYS 529004 535164 98.85% |
从上图中可以看到,该列的SYS值分布占到了整体的98%,表明存在严重的倾斜。
为OWNER列创建索引,并使用ANALYZE TABLE的方法收集统计信息。
--收集统计信息 Yumiko_sunny@OA01> analyze table test compute statistics; Table analyzed. --验证最后的统计收集的时间 Yumiko_sunny@OA01> select table_name, to_char(LAST_ANALYZED,'YYYY-MM-DD HH24:MI:SS') LAST_ANALYZED from dba_tables where TABLE_NAME='TEST'; TABLE_NAME LAST_ANALYZED ------------------------------ ------------------- TEST 2016-11-13 21:23:19 --查看直方图的统计情况 Yumiko_sunny@OA01> select column_name,histogram from dba_tab_col_statistics where table_name='TEST'; COLUMN_NAME HISTOGRAM -------------------- --------------- OWNER NONE OBJECT_NAME NONE SUBOBJECT_NAME NONE OBJECT_ID NONE DATA_OBJECT_ID NONE OBJECT_TYPE NONE CREATED NONE LAST_DDL_TIME NONE TIMESTAMP NONE STATUS NONE TEMPORARY NONE COLUMN_NAME HISTOGRAM -------------------- --------------- GENERATED NONE SECONDARY NONE NAMESPACE NONE EDITION_NAME NONE |
这里先忽略,后面可以对比for all columns子句的情况再看下。
查看此时索引列执行计划的选择情况,这里以倾斜数据SYS为条件进行检索。
Yumiko_sunny@OA01> select * from test where owner='SYS'; rows selected. Execution Plan ---------------------------------------------------------- Plan hash value: 3856466897 ---------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ---------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 107K| 10M| 1799 (1)| 00:00:22 | | 1 | TABLE ACCESS BY INDEX ROWID| TEST | 107K| 10M| 1799 (1)| 00:00:22 | |* 2 | INDEX RANGE SCAN | IND_TEST | 107K| | 228 (1)| 00:00:03 | ---------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- - access("OWNER"='SYS') |
从上面返回的结果看,529K行的SYS数据,仅仅返回107K行,显然存在很大的误差。
此外,对于数据倾斜达到98%的SYS而言,显然全表扫描的效率应该更高,这里应该与错误的统计信息有关。
使用dbms_stats.gather_table_stats的方式再次收集表的统计信息。
Yumiko_sunny@OA01> exec dbms_stats.gather_table_stats('SCOTT','TEST',cascade=>true); PL/SQL procedure successfully completed. Yumiko_sunny@OA01> select table_name, to_char(LAST_ANALYZED,'YYYY-MM-DD HH24:MI:SS') LAST_ANALYZED from dba_tables where TABLE_NAME='TEST'; TABLE_NAME LAST_ANALYZED ------------------------------ ------------------- TEST 2016-11-13 21:50:01 Yumiko_sunny@OA01> select column_name,histogram from dba_tab_col_statistics where table_name='TEST'; COLUMN_NAME HISTOGRAM -------------------- --------------- OWNER FREQUENCY OBJECT_NAME NONE SUBOBJECT_NAME NONE OBJECT_ID NONE DATA_OBJECT_ID NONE OBJECT_TYPE NONE CREATED NONE LAST_DDL_TIME NONE TIMESTAMP NONE STATUS NONE TEMPORARY NONE COLUMN_NAME HISTOGRAM -------------------- --------------- GENERATED NONE SECONDARY NONE NAMESPACE NONE EDITION_NAME NONE |
从上图可以看到,此时完成了对表的最新统计,同时收集了索引列的直方图信息,且该直方图为”等频直方图“。
再次查看此时索引列的执行计划选择情况,这里分别以选择性差的倾斜数据SYS为条件,以及以选择性好的SCOTT为条件分别进行检索。
--以SYS为条件进行查询 Yumiko_sunny@OA01> select * from test where owner='SYS'; rows selected. Execution Plan ---------------------------------------------------------- Plan hash value: 1357081020 -------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | -------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 530K| 49M| 2098 (1)| 00:00:26 | |* 1 | TABLE ACCESS FULL| TEST | 530K| 49M| 2098 (1)| 00:00:26 | -------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- - filter("OWNER"='SYS') --以SCOTT为条件进行查询 Yumiko_sunny@OA01> select * from test where owner='SCOTT'; rows selected. Execution Plan ---------------------------------------------------------- Plan hash value: 3856466897 ---------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ---------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 98 | 9506 | 5 (0)| 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID| TEST | 98 | 9506 | 5 (0)| 00:00:01 | |* 2 | INDEX RANGE SCAN | IND_TEST | 98 | | 3 (0)| 00:00:01 | ---------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- - access("OWNER"='SCOTT') |
对于选择性差的SYS值,CBO优化器采用了全表扫描的方式进行数据的访问
对于选择性好的SCOTT值,CBO优化器则采用了索引扫描的方式进行数据的访问
如果采用索引的方式访问SYS相关的数据行,真实的代价会是怎样呢,这里,通过hint的方式进行一次索引扫描的访问
Yumiko_sunny@OA01> select /*+index(test,ind_test) */* from test where owner='SYS'; rows selected. Execution Plan ---------------------------------------------------------- Plan hash value: 3856466897 ---------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ---------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 529K| 49M| 8885 (1)| 00:01:47 | | 1 | TABLE ACCESS BY INDEX ROWID| TEST | 529K| 49M| 8885 (1)| 00:01:47 | |* 2 | INDEX RANGE SCAN | IND_TEST | 529K| | 1115 (1)| 00:00:14 | ---------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- - access("OWNER"='SYS') |
从上图中可以看到,在信息收集无误的情况下,若采用索引扫描,其真实开销是全表扫描的4倍。
通过上面这个执行计划,也说明了,对于CBO优化器,准确无误的统计信息对于执行计划选择的重要性。
通过DBA_TAB_HISTOGRAMS视图,查看此时直方图的详细信息
Yumiko_sunny@OA01> select TABLE_NAME,COLUMN_NAME,ENDPOINT_NUMBER, to_char(ENDPOINT_VALUE,'999999999999999999999999999999999999') as ENDPOINT_VALUE,ENDPOINT_ACTUAL_VALUE from DBA_TAB_HISTOGRAMS where TABLE_NAME='TEST' and COLUMN_NAME='OWNER'; TABLE_NAME COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_ACTUAL_VALU ---------------------------------------------------------------------------------------- TEST OWNER 19 41159093808690300000000000 TEST OWNER 61 41186001805076000000000000 TEST OWNER 62 43232584582496500000000000 TEST OWNER 5 37550853140200700000000000 TEST OWNER 5518 43277234965060400000000000 |
可以看到,虽然ENDPOINT_VALUE收集到了唯OWNER列唯一值的hash值,但真实列ENDPOINT_ACTUAL_VALUE显示为空。
后面对比analyze table for all columns操作后再看。
删除SYS值相关的数据行,观察直方图统计的变化
Yumiko_sunny@OA01> delete from test where owner='SYS'; rows deleted. Yumiko_sunny@OA01> commit; Commit complete. Yumiko_sunny@OA01> select TABLE_NAME,COLUMN_NAME,ENDPOINT_NUMBER, to_char(ENDPOINT_VALUE,'999999999999999999999999999999999999') as ENDPOINT_VALUE,ENDPOINT_ACTUAL_VALUE from DBA_TAB_HISTOGRAMS where TABLE_NAME='TEST' and COLUMN_NAME='OWNER'; TABLE_NAME COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_ACTUAL_VALU ---------------------------------------------------------------------------------------- TEST OWNER 19 41159093808690300000000000 TEST OWNER 61 41186001805076000000000000 TEST OWNER 62 43232584582496500000000000 TEST OWNER 5 37550853140200700000000000 TEST OWNER 5518 43277234965060400000000000 |
可以看到,对表数据的DML操作,直方图信息并未自动更改。
再次使用dbms_stats.gather_table_stats收集统计信息,此时直方图得到了更新,如下图:
Yumiko_sunny@OA01> exec dbms_stats.gather_table_stats('SCOTT','TEST',cascade=>true); PL/SQL procedure successfully completed. Yumiko_sunny@OA01> select TABLE_NAME,COLUMN_NAME,ENDPOINT_NUMBER, to_char(ENDPOINT_VALUE,'999999999999999999999999999999999999') as ENDPOINT_VALUE,ENDPOINT_ACTUAL_VALUE from DBA_TAB_HISTOGRAMS where TABLE_NAME='TEST' and COLUMN_NAME='OWNER'; TABLE_NAME COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_ACTUAL_VALU -------------------- ---------------------------- --------------------------------------- TEST OWNER 476 37550853140200700000000000 TEST OWNER 2464 41159093808690300000000000 TEST OWNER 6062 41186001805076000000000000 TEST OWNER 6160 43232584582496500000000000 |
上述说明了,对于直方图的信息,需要定期进行收集工作。
本示例承接上面示例内容,数据库版本一致。
在上面内容中,已经演示了analyze table table_name compute statistics无法针对表进行直方图的信息收集。
那么,对于已存在直方图的表,该操作又会有何影响呢。
首先,承接上面内容,再次执行analyze table table_name compute statistics的操作,观察直方图信息的变化。
Yumiko_sunny@OA01> analyze table test compute statistics; Table analyzed. Yumiko_sunny@OA01> select column_name,histogram from dba_tab_col_statistics where table_name='TEST'; COLUMN_NAME HISTOGRAM -------------------- --------------- OWNER NONE OBJECT_NAME NONE SUBOBJECT_NAME NONE OBJECT_ID NONE DATA_OBJECT_ID NONE OBJECT_TYPE NONE CREATED NONE LAST_DDL_TIME NONE TIMESTAMP NONE STATUS NONE TEMPORARY NONE COLUMN_NAME HISTOGRAM -------------------- --------------- GENERATED NONE SECONDARY NONE NAMESPACE NONE EDITION_NAME NONE |
可以明显的发现,OWNER列的直方图信息消失了,说明该操作会删除已存在的直方图信息。
如果这是一个生产环境,对于这样一张存在数据倾斜列的表,可能会带来不可预估的影响。
对于analyze table table_name compute statistics for all indexes的操作,这里不再演示,会在下面的总结中,直接给出对直方图影响的结论。
下面看一下analyze table table_name compute statistics for all indexes for all columns操作的影响。
Yumiko_sunny@OA01> analyze table test compute statistics for all indexes for all columns; Table analyzed. Yumiko_sunny@OA01> select column_name,histogram from dba_tab_col_statistics where table_name='TEST'; COLUMN_NAME HISTOGRAM -------------------- --------------- OWNER FREQUENCY OBJECT_NAME HEIGHT BALANCED SUBOBJECT_NAME NONE OBJECT_ID HEIGHT BALANCED DATA_OBJECT_ID HEIGHT BALANCED OBJECT_TYPE FREQUENCY CREATED FREQUENCY LAST_DDL_TIME FREQUENCY TIMESTAMP FREQUENCY STATUS FREQUENCY TEMPORARY FREQUENCY COLUMN_NAME HISTOGRAM -------------------- --------------- GENERATED FREQUENCY SECONDARY FREQUENCY NAMESPACE FREQUENCY EDITION_NAME NONE |
从上面可以看到,当执行for all columns子句的时候,不但收集了索引列的直方图信息,还收集了非索引列的直方图信息。
再看下此时DBA_TAB_HISTOGRAMS视图的详细信息
Yumiko_sunny@OA01> select TABLE_NAME,COLUMN_NAME,ENDPOINT_NUMBER, to_char(ENDPOINT_VALUE,'999999999999999999999999999999999999') as ENDPOINT_VALUE,ENDPOINT_ACTUAL_VALUE from DBA_TAB_HISTOGRAMS where TABLE_NAME='TEST' and COLUMN_NAME='OWNER'; TABLE_NAME COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_ACTUAL_VALU -------------------- ---------------------------- --------------------------------------- TEST OWNER 476 37550853140200700000000000 HR TEST OWNER 2464 41159093808690300000000000 OE TEST OWNER 6062 41186001805076000000000000 ORDDATA TEST OWNER 6160 43232584582496500000000000 SCOTT |
可以看到,此时可以看见DBA_TAB_HISTOGRAMS视图上,ENDPOINT_ACTUAL_VALUE列真实值的信息。
最后一点,起码从直方图的收集情况看:analyze table table_name compute statistics并不等价于analyze table table_name compute statistics for all indexes for all columns
谨慎使用analyze table table_name compute statistics这个操作。
重要的事情说三遍!!!
免责声明:本站发布的内容(图片、视频和文字)以原创、转载和分享为主,文章观点不代表本网站立场,如果涉及侵权请联系站长邮箱:is@yisu.com进行举报,并提供相关证据,一经查实,将立刻删除涉嫌侵权内容。