人与世界万物的互动会产生大量的时空数据。那么,当我们需要随时调用过去的数据时,改怎么办?尤其是面对各种海量、多维度的数据库,如果没有高效的搜索方法,我们只能望洋兴叹、束手无策。
别担心,本文将用详细的代码,手把手来传授高效搜索法的绝技!
对象数据分类
对象数据可分为两种类型:静态数据(相对静态,例如建筑)和动态数据(例如人的活动和物联网传感器的活动)。
按研究需求分类的索引
时空快照搜索
有些对象以相对较低的频率生成数据。例如,建筑物和道路等惰性物体可能在数年内不会发生任何变化。如果将为这些对象生成的数据写入数据库,并按时间范围查询数据(例如,查询日期为2017-07-01至2017-07-02),则可能找不到与这些对象相关的任何数据。原因很简单,在这段时间内数据库根本没有相关数据输入。
时空行为数据搜索
时空行为数据是指从人的活动等动态对象中获取数据。
例如,分析特定地区特定时间段内某一人群的特征,或者分析大学周边人群在工作日和周末构成的差异。
时空快照不属于本文的讨论范围。现在,我们看看如何搜索时空行为数据。
数据结构
时空行为数据包含三个属性:时间、空间和对象。
非结构化索引:
- create table test(
- id int8,
- crt_time timestamp, -- Time
- pos geometry, -- Location
- obj jsonb -- Object description
- );
除了应用于JSON,结构化数据还可以用于对象描述。例如:
- create table test(
- id int8,
- crt_time timestamp, -- Time
- pos geometry, -- Location
- c1 int, -- Some property examples
- c2 int,
- c3 text,
- c4 float8,
- c5 int,
- c6 date,
- c7 text,
- c8 int,
- c9 int,
- c10 int
- );
时空行为数据的SQL查询实例
- select * from test
- where
- pos <-> ? < ?
- and crt_time between ? and ?
- and ( (c1 = ? and c2 between ? and ?) or c10=?)
- ...
- ;
优化方法
考虑运用以下知识:
时间序列BRIN索引
crt_time字段是一个时间序列字段,表示生成数据的时间。在PostgreSQL堆存储中,存储和该字段的值具有很强的线性相关性。
因此,BRIN索引很合适。
使用BRIN索引来代替分区表进行TPC-H测试。大范围搜索的性能甚至优于使用分区表时的功能。
- create index idx_test_1 on test using brin(crt_time);
空间索引
显然,空间检索需要空间索引。PostgreSQL中可以使用三种方法实现空间检索。
1. 几何类型的GIST索引
- create index idx_test_2 on test using gist(pos);
该索引支持空间KNN搜索和空间位置确定等功能。
2. 几何类型的主索引
- create index idx_test_2 on test using spgist(pos);
该索引支持空间KNN搜索和空间位置确定等功能。
3. Geohash和B-tree索引(将经度和纬度转换为Geohash并为hash值创建B-tree索引)。只需使用表达式索引。
- create index idx_test_3 on test using btree( ST_GeoHash(pos,15) );
此索引支持前缀搜索(其能落实编码地理信息网格中包含的关系)。它属于有损索引,需要二次过滤。
GiST和SPGiST空间索引能够找到准确的地理位置信息,优于GEOHASH索引。但是,查询信息时需要特别注意。
GIN 索引
此索引类型的目标是对象属性字段JSONB或多个结构化对象属性字段。只需使用GIN索引。
例如:
- create extension btree_gin;
非结构化索引:
- create index idx_test_4 on test using gin( obj );
结构化索引:
- create index idx_test_4 on test using gin( c1,c2,c3,c4,c5,c6,c7,c8,c9 );
BitmapAnd和BitmapOr
在上一节中,根据数据类型和查询需求可以为不同的查询维度选择相应的索引。
但是,可以同时使用这些索引吗? PostgreSQL为多个索引提供bitmapAnd及bitmapOr接口。它们可以组合多个索引,减少需要扫描的数据库数量。
- Heap, one square = one page:
- +---------------------------------------------+
- |c____u_____X___u___X_________u___cXcc______u_|
- +---------------------------------------------+
- Rows marked c match customers pkey condition.
- Rows marked u match username condition.
- Rows marked X match both conditions.
- Bitmap scan from customers_pkey:
- +---------------------------------------------+
- |100000000001000000010000000000000111100000000| bitmap 1
- +---------------------------------------------+
- One bit per heap page, in the same order as the heap
- Bits 1 when condition matches, 0 if not
- Bitmap scan from ix_cust_username:
- +---------------------------------------------+
- |000001000001000100010000000001000010000000010| bitmap 2
- +---------------------------------------------+
- Once the bitmaps are created a bitwise AND is performed on them:
- +---------------------------------------------+
- |100000000001000000010000000000000111100000000| bitmap 1
- |000001000001000100010000000001000010000000010| bitmap 2
- &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
- |000000000001000000010000000000000010000000000| Combined bitmap
- +-----------+-------+--------------+----------+
- | | |
- v v v
- Used to scan the heap only for matching pages:
- +---------------------------------------------+
- |___________X_______X______________X__________|
- +---------------------------------------------+
- The bitmap heap scan then seeks to the start of each page and reads the page:
- +---------------------------------------------+
- |___________X_______X______________X__________|
- +---------------------------------------------+
- seek------->^seek-->^seek--------->^
- | | |
- ------------------------
- only these pages read
例如:
- select * from test where
- c1 ...
- and crt_time between ? and ?
- and test->> c1 in (?, ? ...);
根据统计数据自动使用适当的索引。如果需要,bitmapAnd和bitmapOr将在多个索引上自动执行合并扫描。跳过不需要扫描的页面,重新检查命中的页面。
堆表存储分级和分区
存储可以分为一级分区或多级分区:
1. 单一分区
例如,按时间划分。
- create table test(
- id int8,
- crt_time timestamp, -- Time
- pos geometry, -- Location
- obj jsonb -- Object description
- )
- PARTITION BY range (crt_time)
- ;
- create table test_201701 PARTITION OF test for values FROM ( 2017-01-01 ) TO ( 2017-02-01 );
- ......
2. 多层分区
例如,先按时间分区,然后按Geohash划分。
- create table test_201701 PARTITION OF test for values
- FROM ( 2017-01-01 ) TO ( 2017-02-01 ) partition by range(st_geohash(pos,15));
- ...
- create table test_201701_prefix1 PARTITION OF test for values
- FROM ( xxxx1 ) TO ( xxxx2 );
- -- Generate BOX (GRID) on a map, find corresponding boundaries and use
- -- boundaries as partitioning conditions
使用分区时,如果查询条件包括分区键(如时间和空间范围),相应的分区将自动定位,这即为需要扫描的数据量。
创建面向对象属性的GIN索引,以实现高效查询。
索引分级与分区
与数据一样,索引在不使用分区表的情况下也支持分区逻辑。
空间索引+时间分区
- create index idx_20170101
- on tbl using gist (pos)
- where crt_time between 2017-01-01 and 2017-01-02 ;
- ...
- create index idx_20170102
- on tbl using gist (pos)
- where crt_time between 2017-01-02 and 2017-01-03 ;
- ...
通过使用前述分区索引,可以在输入时间范围后快速定位目标数据,执行空间搜索。
- select * from tbl
- where crt_time between 2017-01-01 and 2017-01-02 -- Time
- and (pos <-> ?) < ? -- Distance to a point to be searched for
- and ? -- Other conditions
- order by pos <-> ? -- Sort by distance
- limit ?; -- Number of results to be returned
可以使用更多的索引分区,比如用作搜索条件和商店类型的维度(对象属性)(假设它是可枚举的或在范围相对较小的情况下)。
- create index idx_20170101_mod0 on tbl using gist (pos) where crt_time between 2017-01-01 and 2017-01-02 and dtype=0;
- ...
- create index idx_20170101_mod1 on tbl using gist (pos) where crt_time between 2017-01-01 and 2017-01-02 and dtype=1;
- ...
通过使用前面的分区索引,在输入时间范围或特定条件以执行空间搜索后,可以快速定位目标数据。
- select * from tbl
- where crt_time between 2017-01-01 and 2017-01-02 -- Time
- and (pos <-> ?) < ? -- Distance to a point to be searched for
- and dtype=0 -- Object condition
- and ? -- Other conditions
- order by pos <-> ? -- Sort by distance
- limit ?; -- Number of results to be returned
请注意,前面的SQL查询可以实现最佳性能优化。
索引组织形式(或索引结构)可以由逻辑分区重新构造,可以用上述类似的索引创建方法覆盖所有条件。
CTID相交阵列连接扫描
如前所述,BitmapAnd和BitmapOr合并扫描是在多个索引或GIN索引中自动执行的。事实上,这种扫描也可以在SQL中显式执行。
每个条件渗透对应的CTID。
使用Intersect或Union生成满足总体需求的CTID。(Intersect对应于“and”条件;union对应于“or”条件。)
生成一个ctid数组。
示例
图片来源:unsplash.com/@markusspiske
1. 创建对象提要数据表
- postgres=# create table tbl (id int, info text, crt_time timestamp, pos point, c1 int , c2 int, c3 int );
- CREATE TABLE
2. 将5000万条测试数据写入表中
- postgres=# insert into tbl select generate_series(1,50000000), md5(random()::text), clock_timestamp(), point(180-random()*180, 90-random()*90), random()*10000, random()*5000, random()*1000;
- INSERT 0 50000000
3. 创建对象索引
- postgres=# create index idx_tbl_1 on tbl using gin (info, c1, c2, c3);
- CREATE INDEX
4. 创建时间索引
- postgres=# create index idx_tbl_2 on tbl using btree (crt_time);
- CREATE INDEX
5. 创建空间索引
- postgres=# create index idx_tbl_3 on tbl using gist (pos);
- CREATE INDEX
6. 生成数据布局以方便后续查询
- postgres=# select min(crt_time),max(crt_time),count(*) from tbl;
- min | max | count
- ----------------------------+----------------------------+----------
- 2017-07-22 17:59:34.136497 | 2017-07-22 18:01:27.233688 | 50000000
- (1 row)
7. 创建一个极限KNN查询函数
- create or replace function ff(point, float8, int) returns setof tid as
- $
- declare
- v_rec record;
- v_limit int := $3;
- begin
- set local enable_seqscan=off; -- Force index that exits when scanned rows reach a specific number
- for v_rec in
- select *,
- (pos <-> $1) as dist,
- ctid
- from tbl
- order by pos <-> $1
- loop
- if v_limit <=0 then
- -- raise notice "Sufficient data obtained"
- return;
- end if;
- if v_rec.dist > $2 then
- -- raise notice "All matching points returned"
- return;
- else
- return next v_rec.ctid;
- end if;
- v_limit := v_limit -1;
- end loop;
- end;
- $
- language plpgsql strict volatile;
- postgres=# select * from ff(point (100,100) ,100,100) ;
- ff
- -------------
- (407383,11)
- (640740,9)
- (26073,51)
- (642750,34)
- ...
- (100 rows)
- Time: 1.061 ms
8. CTID合并检索
显示符合以下条件的记录
- (
- c1 in (1,2,3,4,100,200,99,88,77,66,55)
- or
- c2 < 10
- )
- and
- pos <-> point (0,0) < 5
- and
- crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 ;
首先,分别查看每个条件,找匹配一个条件的记录数量,以及在索引扫描上所花时长。
1. 54,907条记录
- postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where c1 in (1,2,3,4,100,200,99,88,77,66,55);
- QUERY PLAN
- -------------------------------------------------------------------------------------------------------------------------------
- Bitmap Heap Scan on postgres.tbl (cost=820.07..65393.94 rows=54151 width=73) (actual time=23.842..91.911 rows=54907 loops=1)
- Output: id, info, crt_time, pos, c1, c2, c3
- Recheck Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[]))
- Heap Blocks: exact=52778
- Buffers: shared hit=52866
- -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=14.264..14.264 rows=54907 loops=1)
- Index Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[]))
- Buffers: shared hit=88
- Planning time: 0.105 ms
- Execution time: 94.606 ms
- (10 rows)
2. 95,147条记录
- postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where c2<10;
- QUERY PLAN
- ---------------------------------------------------------------------------------------------------------------------------------
- Bitmap Heap Scan on postgres.tbl (cost=835.73..112379.10 rows=99785 width=73) (actual time=69.243..179.388 rows=95147 loops=1)
- Output: id, info, crt_time, pos, c1, c2, c3
- Recheck Cond: (tbl.c2 < 10)
- Heap Blocks: exact=88681
- Buffers: shared hit=88734
- -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=53.612..53.612 rows=95147 loops=1)
- Index Cond: (tbl.c2 < 10)
- Buffers: shared hit=53
- Planning time: 0.094 ms
- Execution time: 186.201 ms
- (10 rows)
3. 149930条记录(为快速获得结果,PostgreSQL使用位图进行合并扫描)
- postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where c1 in (1,2,3,4,100,200,99,88,77,66,55) or c2 <10;
- QUERY PLAN
- ------------------------------------------------------------------------------------------------------------------------------------
- Bitmap Heap Scan on postgres.tbl (cost=1694.23..166303.58 rows=153828 width=73) (actual time=98.988..266.852 rows=149930 loops=1)
- Output: id, info, crt_time, pos, c1, c2, c3
- Recheck Cond: ((tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) OR (tbl.c2 < 10))
- Heap Blocks: exact=134424
- Buffers: shared hit=134565
- -> BitmapOr (cost=1694.23..1694.23 rows=153936 width=0) (actual time=73.763..73.763 rows=0 loops=1)
- Buffers: shared hit=141
- -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=16.733..16.733 rows=54907 loops=1)
- Index Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[]))
- Buffers: shared hit=88
- -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=57.029..57.029 rows=95147 loops=1)
- Index Cond: (tbl.c2 < 10)
- Buffers: shared hit=53
- Planning time: 0.149 ms
- Execution time: 274.548 ms
- (15 rows)
4. 60,687条记录(即使运用出色的KNN性能优化,仍然需要耗费195毫秒)。
- postgres=# explain (analyze,verbose,timing,costs,buffers) select * from ff(point (0,0) ,5,1000000);
- QUERY PLAN
- ----------------------------------------------------------------------------------------------------------------------
- Function Scan on postgres.ff (cost=0.25..10.25 rows=1000 width=6) (actual time=188.563..192.114 rows=60687 loops=1)
- Output: ff
- Function Call: ff( (0,0) ::point, 5 ::double precision, 1000000)
- Buffers: shared hit=61296
- Planning time: 0.029 ms
- Execution time: 195.097 ms
- (6 rows)
让我们看看不使用KNN优化需要多长时间。
结果非常令人惊讶——极限优化性能提高了一个数量级。
5. 2,640,751条记录
使用所有索引逐个扫描数据条件,得到ctid并执行ctid扫描。
现在,让我们来分解这个过程:
首先,让我们看看时间和对象属性的合并查询,成果非常惊人。使用位图BitmapOr时,查询可以跳过大多数数据块,并且扫描时间比单索引扫描要短。
注意,在此步骤中记录的数量减少到7,847条。
- postgres=# explain (analyze,verbose,timing,costs,buffers) select ctid from tbl
- where crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40
- and (
- c1 in (1,2,3,4,100,200,99,88,77,66,55)
- or
- c2 < 10
- );
- QUERY PLAN
- -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
- Bitmap Heap Scan on postgres.tbl (cost=35025.85..44822.94 rows=7576 width=6) (actual time=205.577..214.821 rows=7847 loops=1)
- Output: ctid
- Recheck Cond: (((tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) OR (tbl.c2 < 10)) AND (tbl.crt_time >= 2017-07-22 17:59:34 ::timestamp without time zone) AND (tbl.crt_time <= 2017-07-22 17:59:40 ::timestamp without time zone))
- Heap Blocks: exact=6983
- Buffers: shared hit=14343
- -> BitmapAnd (cost=35025.85..35025.85 rows=7581 width=0) (actual time=204.048..204.048 rows=0 loops=1)
- Buffers: shared hit=7360
- -> BitmapOr (cost=1621.11..1621.11 rows=153936 width=0) (actual time=70.279..70.279 rows=0 loops=1)
- Buffers: shared hit=141
- -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=15.860..15.860 rows=54907 loops=1)
- Index Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[]))
- Buffers: shared hit=88
- -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=54.418..54.418 rows=95147 loops=1)
- Index Cond: (tbl.c2 < 10)
- Buffers: shared hit=53
- -> Bitmap Index Scan on idx_tbl_2 (cost=0.00..33402.60 rows=2462443 width=0) (actual time=127.101..127.101 rows=2640751 loops=1)
- Index Cond: ((tbl.crt_time >= 2017-07-22 17:59:34 ::timestamp without time zone) AND (tbl.crt_time <= 2017-07-22 17:59:40 ::timestamp without time zone))
- Buffers: shared hit=7219
- Planning time: 0.203 ms
- Execution time: 216.697 ms
- (20 rows)
然后,看KNN的扫描时间:
注意,60,687条记录满足KNN距离条件,所以接下来将解释CTID合并扫描与原始扫描之间的性能比较。
- postgres=# explain (analyze,verbose,timing,costs,buffers) select * from ff(point (0,0) ,5,1000000);
- QUERY PLAN
- ----------------------------------------------------------------------------------------------------------------------
- Function Scan on postgres.ff (cost=0.25..10.25 rows=1000 width=6) (actual time=188.563..192.114 rows=60687 loops=1)
- Output: ff
- Function Call: ff( (0,0) ::point, 5 ::double precision, 1000000)
- Buffers: shared hit=61296
- Planning time: 0.029 ms
- Execution time: 195.097 ms
- (6 rows)
最后,将这些片段合并到ctid中。
- select * from ff(point (0,0) ,5,1000000)
- intersect
- select ctid from tbl
- where crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40
- and (
- c1 in (1,2,3,4,100,200,99,88,77,66,55)
- or
- c2 < 10
- );
- ff
- ------------
- (1394,8)
- (3892,50)
- (6124,45)
- (7235,8)
- (7607,45)
- (11540,8)
- (13397,31)
- (14266,36)
- (18149,7)
- (19256,44)
- (24671,62)
- (26525,64)
- (30235,48)
- (13 rows)
- Time: 463.012 ms
取得最终纪录。
- select * from tbl where ctid = any
- (
- array( -- array start
- select * from ff(point (0,0) ,5,1000000) intersect select ctid from tbl
- where crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40
- and (
- c1 in (1,2,3,4,100,200,99,88,77,66,55)
- or
- c2 < 10
- )
- ) -- array end
- );
- id | info | crt_time | pos | c1 | c2 | c3
- ---------+----------------------------------+----------------------------+----------------------------------------+------+------+-----
- 104558 | c4699c933d4e2d2a10d828c4ff0b3362 | 2017-07-22 17:59:34.362508 | (4.20534582808614,2.43749532848597) | 99 | 4858 | 543
- 291950 | 1c2901689ab1eb7653d8ad972f7aa376 | 2017-07-22 17:59:34.776808 | (2.5384977646172,1.09820357523859) | 3 | 2131 | 360
- 459345 | 9e46548f29d914019ce53a589be8ebac | 2017-07-22 17:59:35.148699 | (0.715781506150961,3.1486327573657) | 1 | 1276 | 8
- 542633 | c422d6137f9111d5c2dc723b40c7023f | 2017-07-22 17:59:35.334278 | (0.0631888210773468,2.2334903664887) | 4968 | 3 | 245
- 570570 | fc57bfc6b7781d89b17c90417bd306f7 | 2017-07-22 17:59:35.39653 | (3.14926156774163,1.04107855819166) | 88 | 2560 | 561
- 865508 | 34509c7f7640afaf288a5e1d38199701 | 2017-07-22 17:59:36.052573 | (3.12869547866285,2.34822122845799) | 2 | 65 | 875
- 1004806 | afe9f88cbebf615a7ae5f41180c4b33f | 2017-07-22 17:59:36.362027 | (1.13972157239914,3.28763140831143) | 3 | 1639 | 208
- 1069986 | 6b9f27bfde993fb0bae3336ac010af7a | 2017-07-22 17:59:36.507775 | (4.51995821669698,2.08761331625283) | 2 | 200 | 355
- 1361182 | 7c4c1c208c2b2b21f00772c43955d238 | 2017-07-22 17:59:37.155127 | (1.7334086727351,2.18367457855493) | 9742 | 0 | 232
- 1444244 | 41bf6f8e4b89458c13fb408a7db05284 | 2017-07-22 17:59:37.339594 | (0.52773853763938,2.16670122463256) | 1 | 2470 | 820
- 1850387 | 6e0011c6db76075edd2aa7f81ec94129 | 2017-07-22 17:59:38.243091 | (0.0168232340365648,0.420973123982549) | 100 | 4395 | 321
- 1989439 | 6211907ac254a4a3ca54f90822a2095e | 2017-07-22 17:59:38.551637 | (0.0274275150150061,0.490507003851235) | 1850 | 5 | 74
- 2267673 | 898fdd54dcc5b14c27cf1c8b9afe2471 | 2017-07-22 17:59:39.170035 | (0.394239127635956,2.86229319870472) | 2892 | 6 | 917
- (13 rows)
- Time: 462.715 ms
过程花费462毫秒。
9. 测试原始SQL查询的性能: PostgreSQL Multi-Index BitmapAnd and BitmapOr跳过扫描
直接编写SQL查询,而不是使用多CTID扫描。
- postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl
- where
- crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40
- and (
- c1 in (1,2,3,4,100,200,99,88,77,66,55)
- or
- c2 < 10
- )
- and
- pos <-> point (0,0) < 5;
- Bitmap Heap Scan on postgres.tbl (cost=35022.06..44857.06 rows=2525 width=73) (actual time=205.542..214.547 rows=13 loops=1)
- Output: id, info, crt_time, pos, c1, c2, c3
- Recheck Cond: (((tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) OR (tbl.c2 < 10)) AND (tbl.crt_time >= 2017-07-22 17:59:34 ::timestamp without time zone) AND (tbl.crt_time <= 2017-07-22 17:59:40 ::timestamp without time zone))
- Filter: ((tbl.pos <-> (0,0) ::point) < 5 ::double precision)
- Rows Removed by Filter: 7834
- Heap Blocks: exact=6983
- Buffers: shared hit=14343
- -> BitmapAnd (cost=35022.06..35022.06 rows=7581 width=0) (actual time=203.620..203.620 rows=0 loops=1)
- Buffers: shared hit=7360
- -> BitmapOr (cost=1618.58..1618.58 rows=153936 width=0) (actual time=71.660..71.660 rows=0 loops=1)
- Buffers: shared hit=141
- -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=14.861..14.861 rows=54907 loops=1)
- Index Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[]))
- Buffers: shared hit=88
- -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=56.797..56.797 rows=95147 loops=1)
- Index Cond: (tbl.c2 < 10)
- Buffers: shared hit=53
- -> Bitmap Index Scan on idx_tbl_2 (cost=0.00..33402.60 rows=2462443 width=0) (actual time=125.255..125.255 rows=2640751 loops=1)
- Index Cond: ((tbl.crt_time >= 2017-07-22 17:59:34 ::timestamp without time zone) AND (tbl.crt_time <= 2017-07-22 17:59:40 ::timestamp without time zone))
- Buffers: shared hit=7219
- Planning time: 0.160 ms
- Execution time: 216.797 ms
- (22 rows)
性能如预期的那样好,之前解释过原因。KNN条件以外的条件已经将结果收敛到7,000条记录,因此没有必要使用包含KNN条件的索引。(即使使用KNN索引也需要195毫秒,因为有60,687条记录满足KNN条件。)
校验结果:
- select * from tbl
- where
- crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40
- and (
- c1 in (1,2,3,4,100,200,99,88,77,66,55)
- or
- c2 < 10
- )
- and
- pos <-> point (0,0) < 5;
- id | info | crt_time | pos | c1 | c2 | c3
- ---------+----------------------------------+----------------------------+----------------------------------------+------+------+-----
- 104558 | c4699c933d4e2d2a10d828c4ff0b3362 | 2017-07-22 17:59:34.362508 | (4.20534582808614,2.43749532848597) | 99 | 4858 | 543
- 291950 | 1c2901689ab1eb7653d8ad972f7aa376 | 2017-07-22 17:59:34.776808 | (2.5384977646172,1.09820357523859) | 3 | 2131 | 360
- 459345 | 9e46548f29d914019ce53a589be8ebac | 2017-07-22 17:59:35.148699 | (0.715781506150961,3.1486327573657) | 1 | 1276 | 8
- 542633 | c422d6137f9111d5c2dc723b40c7023f | 2017-07-22 17:59:35.334278 | (0.0631888210773468,2.2334903664887) | 4968 | 3 | 245
- 570570 | fc57bfc6b7781d89b17c90417bd306f7 | 2017-07-22 17:59:35.39653 | (3.14926156774163,1.04107855819166) | 88 | 2560 | 561
- 865508 | 34509c7f7640afaf288a5e1d38199701 | 2017-07-22 17:59:36.052573 | (3.12869547866285,2.34822122845799) | 2 | 65 | 875
- 1004806 | afe9f88cbebf615a7ae5f41180c4b33f | 2017-07-22 17:59:36.362027 | (1.13972157239914,3.28763140831143) | 3 | 1639 | 208
- 1069986 | 6b9f27bfde993fb0bae3336ac010af7a | 2017-07-22 17:59:36.507775 | (4.51995821669698,2.08761331625283) | 2 | 200 | 355
- 1361182 | 7c4c1c208c2b2b21f00772c43955d238 | 2017-07-22 17:59:37.155127 | (1.7334086727351,2.18367457855493) | 9742 | 0 | 232
- 1444244 | 41bf6f8e4b89458c13fb408a7db05284 | 2017-07-22 17:59:37.339594 | (0.52773853763938,2.16670122463256) | 1 | 2470 | 820
- 1850387 | 6e0011c6db76075edd2aa7f81ec94129 | 2017-07-22 17:59:38.243091 | (0.0168232340365648,0.420973123982549) | 100 | 4395 | 321
- 1989439 | 6211907ac254a4a3ca54f90822a2095e | 2017-07-22 17:59:38.551637 | (0.0274275150150061,0.490507003851235) | 1850 | 5 | 74
- 2267673 | 898fdd54dcc5b14c27cf1c8b9afe2471 | 2017-07-22 17:59:39.170035 | (0.394239127635956,2.86229319870472) | 2892 | 6 | 917
- (13 rows)
分区索引示例
假设前面的查询条件保持不变,使用分区索引来测试性能。
这是为了演示分区索引的极端效果。在实际场景中,集合级别可能没有那么高(例如按天集合或按ID散列集合)。只要集合是可能的,就可以展现出色的性能。
- postgres=# create index idx_tbl_4 on tbl using gist (pos) where crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40
- and (
- c1 in (1,2,3,4,100,200,99,88,77,66,55)
- or
- c2 < 10
- ) ;
- CREATE INDEX
- Time: 8359.330 ms (00:08.359)
重构极值KNN优化函数
- create or replace function ff(point, float8, int) returns setof record as
- $
- declare
- v_rec record;
- v_limit int := $3;
- begin
- set local enable_seqscan=off; -- Force index that exits when scanned rows reach a specific number
- for v_rec in
- select *,
- (pos <-> $1) as dist
- from tbl
- where
- crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40
- and (
- c1 in (1,2,3,4,100,200,99,88,77,66,55)
- or
- c2 < 10
- )
- order by pos <-> $1
- loop
- if v_limit <=0 then
- -- raise notice "Sufficient data obtained"
- return;
- end if;
- if v_rec.dist > $2 then
- -- raise notice "All matching points returned"
- return;
- else
- return next v_rec;
- end if;
- v_limit := v_limit -1;
- end loop;
- end;
- $
- language plpgsql strict volatile;
查询性能:
- postgres=# select * from ff(point (0,0) , 5, 10000000) as t(id int, info text, crt_time timestamp, pos point, c1 int, c2 int, c3 int, dist float8);
- id | info | crt_time | pos | c1 | c2 | c3 | dist
- ---------+----------------------------------+----------------------------+----------------------------------------+------+------+-----+-------------------
- 1850387 | 6e0011c6db76075edd2aa7f81ec94129 | 2017-07-22 17:59:38.243091 | (0.0168232340365648,0.420973123982549) | 100 | 4395 | 321 | 0.421309141034319
- 1989439 | 6211907ac254a4a3ca54f90822a2095e | 2017-07-22 17:59:38.551637 | (0.0274275150150061,0.490507003851235) | 1850 | 5 | 74 | 0.49127323294376
- 1444244 | 41bf6f8e4b89458c13fb408a7db05284 | 2017-07-22 17:59:37.339594 | (0.52773853763938,2.16670122463256) | 1 | 2470 | 820 | 2.23004532710301
- 542633 | c422d6137f9111d5c2dc723b40c7023f | 2017-07-22 17:59:35.334278 | (0.0631888210773468,2.2334903664887) | 4968 | 3 | 245 | 2.23438404136508
- 291950 | 1c2901689ab1eb7653d8ad972f7aa376 | 2017-07-22 17:59:34.776808 | (2.5384977646172,1.09820357523859) | 3 | 2131 | 360 | 2.76586731309247
- 1361182 | 7c4c1c208c2b2b21f00772c43955d238 | 2017-07-22 17:59:37.155127 | (1.7334086727351,2.18367457855493) | 9742 | 0 | 232 | 2.78803520274409
- 2267673 | 898fdd54dcc5b14c27cf1c8b9afe2471 | 2017-07-22 17:59:39.170035 | (0.394239127635956,2.86229319870472) | 2892 | 6 | 917 | 2.88931598221975
- 459345 | 9e46548f29d914019ce53a589be8ebac | 2017-07-22 17:59:35.148699 | (0.715781506150961,3.1486327573657) | 1 | 1276 | 8 | 3.22896754478952
- 570570 | fc57bfc6b7781d89b17c90417bd306f7 | 2017-07-22 17:59:35.39653 | (3.14926156774163,1.04107855819166) | 88 | 2560 | 561 | 3.31688000783581
- 1004806 | afe9f88cbebf615a7ae5f41180c4b33f | 2017-07-22 17:59:36.362027 | (1.13972157239914,3.28763140831143) | 3 | 1639 | 208 | 3.47958123047986
- 865508 | 34509c7f7640afaf288a5e1d38199701 | 2017-07-22 17:59:36.052573 | (3.12869547866285,2.34822122845799) | 2 | 65 | 875 | 3.91188935630676
- 104558 | c4699c933d4e2d2a10d828c4ff0b3362 | 2017-07-22 17:59:34.362508 | (4.20534582808614,2.43749532848597) | 99 | 4858 | 543 | 4.86069100130757
- 1069986 | 6b9f27bfde993fb0bae3336ac010af7a | 2017-07-22 17:59:36.507775 | (4.51995821669698,2.08761331625283) | 2 | 200 | 355 | 4.97877009299311
- (13 rows)
- Time: 0.592 ms
太棒了!查询时间从200毫秒减少到1毫秒以内。
优化方法综述
优化方法回顾:
1. 为不同的数据类型构建不同的索引。
例如,对空间使用GiST或SP-GiST索引,对时间使用B树或BRIN索引,对多个对象属性使用GIN索引。索引的目的是缩小数据扫描的范围。
2. 方法五提到数据分区。
数据分区的目的是有意地组织数据,这意味着有意地组织数据以满足搜索需求。例如,如果时间是必需的查询条件或公共查询条件,那么可以按时间(分区)分割数据,以减少需要扫描的数据量。
3. 方法六描述了索引分区。
目的类似于方法五。方法五和方法六的区别在于分区在索引级别使用,因此当执行索引扫描时,数据命中率会直接提高。
4.方法七中的ctid合并扫描类似于PostgreSQL中的多索引bitmapAnd或bitmapOr扫描。
bitmapAnd/bitmapOr跳过不需要扫描的块,方法七中的ctid合并扫描跳过不需要扫描的行。
合并从多个索引扫描获得的ctid。跳过不需要扫描的行数。
如果当其他条件为“AND”时,过滤条件可以显著减少ctid(记录),则没有必要使用ctid合并扫描。相反,使用FILTER作为另一个条件。(这将略微增加CPU开销。)
5. 最好的功夫总是以最大的灵活性、自由和对每一个动作的无限想象为特征。
PostgreSQL实现多索引BitmapAnd或BitmapOr扫描,显著提高了多种条件(索引)下的数据命中率。
此外,PostgreSQL具有出色的CBO估计机制,它允许PostgreSQL不总是使用位图合并扫描的所有索引。这也是为什么在“测试原始SQL查询的性能——PostgreSQL多索引BitmapAnd位图或跳过扫描”一节中描述的性能更好。
6. 如何实现极端优化
采用方法五或六,并使用可修复的条件作为分区键来分区数据或索引。
对于其他条件,可以使用PostgreSQL中的多索引BitmapAnd或BitmapOr扫描来提高多条件(索引)的数据命中率。
我们可以看到,按照时间、空间和对象属性从5,000万数据块中进行多维检索所需的时间减少到了0.592毫秒。
7. 对于空间数据,除了使用GiST索引,我们还可以使用BRIN索引,这降低了成本。有条理地组织数据后,会使滤波性能良好。