VirtualKNN2: a quick intro
|
What is the KNN (K-Nearest Neighbors) problem ? |
Imagine a set of arbitrary geometries; which may well be, a very huge dataset containing some millions of features.
Now imagine that for whatever reason you are interested in quickly identifying all features within a close spatial proximity to an arbitrary location.
This is the typical KNN problem. |
VirtualKNN2 is a complete refactoring of the previous VirtualKNN that is now deprecated.
You are warmly invited to get a preliminary quick glance to the previous documentation so to get a deeper and more detailed introduction.
The present document is essentially finalized to explain the main differences occurring between KNN and KNN2.
|
Understanding the two different approaches
|
The previous KNN implementation was fully built around the SQLite's API sqlite3_rtree_query_callback(), a very special API allowing to recursively explore the internal tree structure of the Spatial Index without accessing the main Spatial Table itself.
Exploring the branches of the tree not necessarily was a fast activity, but it allowed to identify all features nearest to a given reference point without any need to assume some predefined arbitrary distance radius.
Even in the case of a very sparse and unpredictable statistic dispersion of the points to be analyzed, this method robustly ensured that all nearest points will be surely identified with no possible exception.
|
On the other side the latest KNN2 approach is based on completely different approach, that does not require any special operation if not plain classic queries based on the Spatial Index.
- Adopting such an approach a predefined distance radius must always be specified. This introduces the unpleasant need for some arbitrary assumption, but ensures a by far faster access to data.
Some unexpected bad surprise will surely arise when the assigned distance radius is too small, because in this case many nearest features will fail to be identified just because they fall outside the expected range.
A generous exceeding distance radius will prevent such issues, but will surely vanish any expectation for a quick data processing.
Short conclusion: setting a reasonable distance radius well matching the statistical dispersion of the points to be analyzed is an absolutely critical task: a too small value will cause many failing matches, but a too big value will directly lead to sluggish performance.
- As a workaround intended to minimize such accidents KNN2 has the optional capability to automatically expand the distance radius when a number less than expected of nearest features is identified.
In this case, if the expand option has been enabled, the search for nearest neighbors will be repeated more times until the requested number of features will be identified, and on each iteration the distance radius will be doubled.
This will surely introduce some performance loss, but in the case of a very sparse and irregular statistic distribution it will quickly converge to a complete solution also when using a too small distance radius.
The secret for a good compromise between completeness and speed is in choosing a a distance radius large enough to successfully resolve the vast majority of cases, so to fire the expand recursion just in very limited exceptional cases.
|
Preliminaries
You can download a sample db-file exactly corresponding to the examples we'll use in this first tutorial.
Note: this is exactly the same sample db-file previously used by the old and nowadays obsolete VirtualKNN tutorial and was created by a previous version of SpatiaLite not yet supporting VirtualKNN2.
This dataset corresponds to a collection of about 30.000 airports scattered all over the world; the reference system is 4326 WGS84 (based on longitude and latitude angles).
|
VirtualKNN2
Every new db-file being created with a very recent version of SpatiaLite will always automatically define a KNN2 virtual table.
You can easily add KNN2 support on any already existing db-file created by older versions by manually executing the following SQL statement.
CREATE VIRTUAL TABLE knn2 USING VirtualKNN2();
Alternatively you can execute this other SQL statement with identical effects (this second solution is warmly suggested, because it will create any other missing system table).
SELECT CreateMissingSystemTables(1);
Note: VirtualKNN2 necessarily requires a corresponding library support, so any attempt to open a db-file including a VirtualKNN2 table by using some previous version of the SpatiaLite will surely raise an error condition.
A first basically simple KNN2 query
SELECT * FROM knn2
WHERE f_table_name = 'airports' AND ref_geometry = MakePoint(10, 43) AND radius = 1.0;
db_prefix | f_table_name | f_geometry_column | ref_geometry | radius |
max_items | expand | pos | fid | distance_crs | distance_m |
MAIN | airports | geom | BLOB sz=60 GEOMETRY | 1.000000 | 3 | 0 | 1 | 6299623 |
0.338817 | 33043.319520 |
MAIN | airports | geom | BLOB sz=60 GEOMETRY | 1.000000 | 3 | 0 | 2 | 6299392 |
0.683116 | 65226.573149 |
MAIN | airports | geom | BLOB sz=60 GEOMETRY | 1.000000 | 3 | 0 | 3 | 6299628 |
0.788669 | 82387.014028 |
- a VirtualKNN2 query closely resembles a VirtualKNN query.
This is not surprising, but pay close attention because there are some relevant differences.
- Any valid VirtualKNN2 query must use a form such as:
WHERE knn-column = value AND knn-column = value ...
- the input columns (the ones you can reference into a WHERE clause) are:
- db_prefix (optional)
prefix identifying the attached-DB containing the SpatialTable (or SpatialView) to be queried.
If missing or set as NULL the MAIN will be always implicitly intended.
- f_table_name (mandatory)
name of the SpatialTable (or SpatialView) containing the Geometries to be searched.
- f_geometry_column (optional)
name of the column of the above table containing the Geometries to be searched.
- If the table identified by f_table_name just contains a single Geometry column you can safely omit the f_geometry_column argument (it will be automatically set).
- If, however, the table identified by f_table_name contains two (or more) Geometry columns explicitly specifying the f_geometry_column argument is strictly required, to avoid an ambiguous definition of the search context.
- In both cases f_table_name and f_geometry_column must exactly match a properly defined Geometry column supported by a corresponding Spatial Index.
- ref_geometry (mandatory)
some POINT Geometry representing the origin of the KNN2 search.
This Geometry will always be assumed to be in the same SRID as the target Geometries to be searched.
- radius (mandatory)
the distance radius to be searched. The unit of measure of radius will always be assumed to be the natural one corresponding to the SRID of the SpatialTable (or SpatialView) being searched.
- max_items (optional)
maximum number of rows to be returned into the resultset.
The valid range is from 1 to 1024 (higher values will require a longer time to be processed).
By default only the first 3 results will be returned.
- expand (optional)
a boolean flag intended to enable / disable the expand option of KNN2. By default it will be always kept disabled, you must explicitily enable it when required.
- the output columns (the ones containing values retrieved by the KNN search) are:
- pos (INTEGER)
relative rank (sorted by distance): the closest item will be #1, the second closest item will be #2 and so on.
- fid (INTEGER)
the unique ROWID value of the corresponding feature into the searched SpatialTable (or SpatialView).
- distance_crs (DOUBLE)
the minimum distance between the corresponding feature into the searched SpatialTable (or SpatialView) and the origin point defined by ref_geometry.
- When the dataset's SRID uses a planar (aka projected) reference system, all distances will be returned in the units defined by the projection (meters, feet, chains etc.).
- When using a geographic SRID (longitude and latitude degrees), all distances will always be measured in angles.
- distance_m (DOUBLE)
a second measure of the minimum distance, may be expressed in different units of measure.
- When using a planar SRID it will be exactly the same as distance_crs.
- When using a geographic SRID all distances will always be measured in meters, with the most precise geodetic formulas being automatically applied.
Important notice
VirtualKNN2 (as any other VirtualTable) is not a plain Table as any other, it has a peculiar internal logic of its own. Faithfully respecting the expected positional order of the arguments in a WHERE clause is absolutely critical.
Example
SELECT * FROM knn2
WHERE f_table_name = 'airports' AND ref_geometry = MakePoint(10, 43) AND radius = 1.0;
This was our previous SQL query. But it could be rewritten in this following explicit form:
SELECT * FROM knn2
WHERE db_prefix = 'MAIN' AND f_table_name = 'airports' AND f_geometry_column = 'geom' AND ref_geometry = MakePoint(10, 43) AND radius = 1.0;
But we could even rewrite it in the following complete form:
SELECT * FROM knn2
WHERE db_prefix = 'MAIN' AND f_table_name = 'airports' AND f_geometry_column = 'geom' AND ref_geometry = MakePoint(10, 43) AND radius = 1.0 AND max_items = 3 AND expand = 0;
All three rewrites are one and the same with identical meaning. The unique relevant thing is to declare the clauses of the WHERE one after the other respecting the right sequence.
Examples: malformed WHERE clauses will surely misbehave and produce unexpected results, as in the following intentionally incorrect queries:
SELECT * FROM knn2
WHERE f_geometry_column = 'geom' AND f_table_name = 'airports' AND radius = 1.0 AND ref_geometry = MakePoint(10, 43);
SELECT * FROM knn2
WHERE max_items = 10 AND f_table_name = 'airports' AND ref_geometry = MakePoint(10, 43) AND radius = 1.0;
|
A second sample of a more sophisticated KNN2 query
SELECT a.pos AS rank, b.geonameid, b.name, b.country, a.distance_crs AS dist_deg, a.distance_m / 1000.0 AS dist_km
FROM knn2 AS a
JOIN airports AS b ON (b.geonameid = a.fid)
WHERE f_table_name = 'airports' AND ref_geometry = MakePoint(10, 43) AND radius = 0.5 AND max_items = 10;
rank | geonameid | name | country | dist_crs | dist_km |
1 | 6299623 | Marina Di Campo | IT | 0.338817 | 33.043320 |
This second KNN2 query identified just a single airport although we requested for the first ten nearest (AND max_items=10).
This is not surprising, because just the airport of Marina Di Campo falls within the assigned radius limit. Attempting to identify the first 10 could become an irritating task of blindly guessing the appropriate radius value.
Don't despair; it's a rather trivial task just requiring to explicitly enable the expand option of KNN2.
SELECT a.pos AS rank, b.geonameid, b.name, b.country, a.radius, a.expand, a.distance_crs AS dist_deg, a.distance_m / 1000.0 AS dist_km
FROM knn2 AS a
JOIN airports AS b ON (b.geonameid = a.fid)
WHERE f_table_name = 'airports' AND ref_geometry = MakePoint(10, 43) AND radius = 0.5 AND max_items = 10 AND expand = 1;
rank | geonameid | name | country | radius |
expand | dist_crs | dist_km |
1 | 6299623 | Marina Di Campo | IT | 2.000000 | 1 | 0.338817 | 33.043320 |
2 | 6299392 | Bastia-Poretta | FR | 2.000000 | 1 | 0.683116 | 65.226573 |
3 | 6299628 | Pisa / S. Giusto | IT | 2.000000 | 1 | 0.788669 | 82.387014 |
4 | 6299630 | Grosseto Airport | IT | 2.000000 | 1 | 1.098494 | 91.549773 |
5 | 6694495 | Corte | FR | 2.000000 | 1 | 1.073460 | 102.819778 |
6 | 7730633 | Lucca / Tassignano Airport | IT | 2.000000 | 1 | 1.010007 | 103.240396 |
7 | 6457348 | Ampugnano | IT | 2.000000 | 1 | 1.280852 | 106.014558 |
8 | 7730632 | Massa Cinquale Airport | IT | 2.000000 | 1 | 0.996287 | 110.154116 |
9 | 6299393 | Calvi - Sainte-Catherine | FR | 2.000000 | 1 | 1.294830 | 111.687431 |
10 | 6299614 | Sarzana / Luni | IT | 2.000000 | 1 | 1.085563 | 120.604277 |
As you can easily notice, expand successfully identified all the 10 nearest airports during its third iteration, when the radius was automatically expanded from 0.5 to 2.0 degrees.
Important notice: pay close attention to the above resultset. When processing geographic datasets there is no obvious conversion between angular distances measured in degrees and linear distances measured in metres, because these latest are measured using complex geodetic formulas and the actual measure is strongly influenced by the positions of the two points on the ellipsoid.
As you can directly check, the rows in the returned resultset are ordered by increasing metric distances, whilst some of the angular distances are out of order.
Final considerations and a comparative benchmark
One of the main reasons leading to the deprecation of the previous KNN module was that it was rather slow, and a complete refactoring of the new KNN2 based on a radically different strategy was expected to be noticeably faster.
It's now time to practically check how efficient is KNN2 when objectively compared to its ancestor KNN. The following timings were measured on the same hardware always using the same airports.sqlite DB-file.
The problem to be solved
For each airport contained within the dataset identify the ten nearest airports.
|
#1 - SQL Query for testing old KNN
CREATE TABLE old_knn AS
SELECT a.geonameid AS departure_id, a.name AS departure_name, a.country AS departure_country,
b.geonameid AS arrival_id, b.name AS arrival_name, b.country AS arrival_country,
k.distance / 1000.0 AS dist_km
FROM knn AS k, airports AS a
JOIN airports AS b ON (b.geonameid = k.fid AND a.geonameid <> b.geonameid)
WHERE f_table_name = 'airports' AND ref_geometry = a.geom AND max_items = 11;
#2 - SQL Query for testing KNN2 (expand disabled)
CREATE TABLE knn2_no_expand AS
SELECT a.geonameid AS departure_id, a.name AS departure_name, a.country AS departure_country,
b.geonameid AS arrival_id, b.name AS arrival_name, b.country AS arrival_country,
k.distance_m / 1000.0 AS dist_km
FROM knn2 AS k, airports AS a
JOIN airports AS b ON (b.geonameid = k.fid AND a.geonameid <> b.geonameid)
WHERE f_table_name = 'airports' AND ref_geometry = a.geom AND radius = 0.5 AND max_items = 11;
#3 - SQL Query for testing KNN2 (expand enabled)
CREATE TABLE knn2_expand AS
SELECT a.geonameid AS departure_id, a.name AS departure_name, a.country AS departure_country,
b.geonameid AS arrival_id, b.name AS arrival_name, b.country AS arrival_country,
k.distance_m / 1000.0 AS dist_km
FROM knn2 AS k, airports AS a
JOIN airports AS b ON (b.geonameid = k.fid AND a.geonameid <> b.geonameid)
WHERE f_table_name = 'airports' AND ref_geometry = a.geom AND radius = 0.5 AND max_items = 11 AND expand = 1;
Note: in all the three above queries we are actually requesting for 11 items, because the first one will obviously coincide with the airport itself (distance zero) and will be consequently discarded.
Test case | Measured time | Count of generated rows |
old KNN | 7 mins, 30 secs, 677 millis | 302,040 |
new KNN2 (expand disabled) | 17 secs 603 millis | 221,265 |
new KNN2 (expand enabled) | 24 secs 554 millis | 302,040 |
Conclusions
- The new approach adopted by KNN2 is impressively faster than the previous one. There is an astonishing performance improvement.
Requiring a predefined distance radius known in advance forces to adopt some arbitrary assumption, but it ensures an impressive speed increase.
- Looking at the number of generated rows, it's rather obvious that a too small distance radius value could easily lead to many missing matches when expand is disabled.
- However it should be noted that after enabling expand the complete solution has been successfully found, and this introduces only a marginal and well acceptable slow down.
|