Benchmark (aka "comparing apples and oranges")
Back to main pageIntroduction
Today I've casually read this blog page; many thanks to my dear friend Markus Neteler (the maintainer of GRASS GIS) for pointing my attention on this topic.The original Clipping Contest simply compared several popular Desktop GIS (ArcGis, QGIS, gvSIG, GRASS and others), but there was an obvious omission.
No Spatial DBMS was considered at all. (too bad; a really embarrassing omission)
After all a Clipping Contest sounds more like a pure data-processing problem than like a typical GIS/GUI problem, so it's probably useful introducing some Spatial DBMS into this Contest. This was my first thought while reading the benchmark report, so I duly started preparing my own testbed based on SpatiaLite (BTW I'm going to release the new version 4.0.0 in the very next days, so after all performing some further test is surely welcome).
Test configuration
- Intel Core i5 3.46 GHz
- 4 GB RAM
- Windows 7 64 bit (physical)
- Debian Squeeze 32 bit (virtual machine)
I tested both Windows and Debian on the same hardware. The measured timings are rather similar; anyway Linux performed slightly better than the native Windows system (more or less, by a 10% - 20% factor). This is not at all surprising for me, because this trend seems to be constantly verified each time I manage some comparative benchmark. Very often Linux is a best performer than Windows, even when running as a Virtual Machine on the top of a physical Windows system. Quite paradoxical, but indisputably true and objectively verified many times.
The sample dataset
I've simply downloaded the same identical sample dataset already used for all other tests reported in the Clipping Contest.Very shortly said: it's a really huge set of Contour Lines (the uncompressed Shapefile is about 1,3 GB)
Fig. 1This one is the original set of Contour Lines to be clipped. |
|
Fig. 2This is the final result of the clipping operation.
|
|
Fig. 3This is a magnified detail of the Study Area and Clipped Contour Lines.As you can easily notice there are lots of Contour Lines, and each Line has many vertexes. This one is a rather complex problem to be resolved: and it surely represents a thorough and very demanding test. |
The Spatial SQL approach
This is a pure data-processing approach; we'll use the most classical Spatial Is Not Special methodology, i.e. we'll go performing the Clipping Contest exactly as if it was a trivial ordinary SQL query performed on behalf of plain generic data. We'll not use at all any GIS-specific frill, and we'll simply use a couple of Spatial SQL functions where strictly required. (BTW it's a good opportunity to introduce few basic concepts about Spatial SQL).Let's go ...
Preparing the SQL script
-- -- SQL script performing the Clipping Contest -- Target DBMS: SQLite with SpatiaLite extension -- SELECT DateTime('now'), 'Loading the StudyArea SHP'; -- -- loading the StudyArea Shapefile into the DB -- .loadshp StudyArea1MileBuffer StudyArea1MileBuffer CP1252 2260 Geometry Id POLYGON 2d no no_spatial_index SELECT DateTime('now'), 'Loading the Contours SHP'; -- -- loading the Contour Shapefile into the DB -- .loadshp Contours20Ft Contours20Ft CP1252 2260 Geometry Id LINESTRING 2d no with_spatial_index SELECT DateTime('now'), 'Clipping ... please wait'; -- -- creating a new table containing the Clipped Contours -- CREATE TABLE Contours20FtClipped AS SELECT c.Id AS Id, c.OBJECTID AS OBJECTID, c.Contour AS Contour, c.Shape_Leng AS Shape_Leng, CastToMultiLinestring(ST_Intersection( c.Geometry, a.Geometry)) AS Geometry FROM Contours20Ft AS c, StudyArea1MileBuffer AS a WHERE c.ROWID IN ( SELECT ROWID FROM SpatialIndex WHERE f_table_name = 'Contours20Ft' AND search_frame = a.Geometry ); SELECT DateTime('now'), 'deleting NULL Geometries'; -- -- clean up -- deleting any NULL Geometry from Clipped Contours -- DELETE FROM Contours20FtClipped WHERE Geometry IS NULL; SELECT DateTime('now'), 'recovering a genuine Geometry column'; -- -- recovering a genuine Geometry Column (Clipped Contours) --- SELECT RecoverGeometryColumn('Contours20FtClipped', 'Geometry', 2260, 'MULTILINESTRING', 2); SELECT DateTime('now'), 'creating the Spatial Index'; -- -- creating a Spatial (Index supporting Clipped Contours) -- SELECT CreateSpatialIndex('Contours20FtClipped', 'Geometry'); SELECT DateTime('now'), 'all done ... quitting'; -- -- end job --Don't be fouled; complexity is more apparent than real. The following is a quick explanation of the step by step process defined by the above SQL script.
.loadshp parameters | Step #1simply a dot macro command understood by the SpatiaLite front-end tool, allowing to directly import an external Shapefile into a corresponding Spatial Table within the DB. |
CREATE TABLE Contours20FtClipped AS SELECT c.Id AS Id, c.OBJECTID AS OBJECTID, c.Contour AS Contour, c.Shape_Leng AS Shape_Leng, CastToMultiLinestring(ST_Intersection( c.Geometry, a.Geometry)) AS Geometry FROM Contours20Ft AS c, StudyArea1MileBuffer AS a WHERE c.ROWID IN ( SELECT ROWID FROM SpatialIndex WHERE f_table_name = 'Contours20Ft' AND search_frame = a.Geometry ); |
Step #2this single SQL query performs all the actual Clipping task.ST_Intersection() is a Spatial SQL function computing the intersection between the Study Area (polygon) and each Contour Line (linestring). The returned result could be:
|
DELETE FROM Contours20FtClipped WHERE Geometry IS NULL; |
Step #3just a stupid post-processing step; the previous query has surely generated many NULL geometries, and we'll now get rid of all them. |
SELECT RecoverGeometryColumn('Contours20FtClipped', 'Geometry', 2260, 'MULTILINESTRING', 2); SELECT CreateSpatialIndex('Contours20FtClipped', 'Geometry'); |
Step #4the very final post-processing step: transforming the table containing all Clipped Contour Lines into a genuine Spatial Table supported by its own Spatial Index. |
SELECT DateTime('now'), 'some message'; |
measuring the timethis simply is a weird SQL trick; we'll query the system clock immediately before and after performing each step, so to get full trace of the corresponding timings. |
Bringing things to a head: running the test
# spatialite my_test.sqlite < test.sql SpatiaLite version ..: 4.0.0-RC2 Supported Extensions: - 'VirtualShape' [direct Shapefile access] - 'VirtualDbf' [direct DBF access] - 'VirtualXL' [direct XLS access] - 'VirtualText' [direct CSV/TXT access] - 'VirtualNetwork' [Dijkstra shortest path] - 'RTree' [Spatial Index - R*Tree] - 'MbrCache' [Spatial Index - MBR cache] - 'VirtualSpatialIndex' [R*Tree metahandler] - 'VirtualFDO' [FDO-OGR interoperability] - 'SpatiaLite' [Spatial SQL - OGC] PROJ.4 version ......: Rel. 4.7.1, 23 September 2009 GEOS version ........: 3.2.0-CAPI-1.6.0 2012-11-10 00:03:52|Loading the StudyArea SHP 2012-11-10 00:03:52|Loading the Contours SHP 2012-11-10 00:04:54|Clipping ... please wait 2012-11-10 00:07:31|deleting NULL Geometries 2012-11-10 00:07:33|recovering a genuine Geometry column 2012-11-10 00:07:34|creating the Spatial Index 2012-11-10 00:07:36|all done ... quitting #
Final considerations
- The whole process took less than 4 minutes
- About 1 minute was spent loading the Shapefiles and building the supporting Spatial Index
- Less than 10 seconds were spent performing post-processing steps.
- The hard-core problem itself (generating a Clipped set of Contour Lines) took about 2 minutes and half.
Please note: this Clipping Contest really is like comparing apples and oranges.
- the CPUs used to actually run the tests are of different generations, and obviously have different intrinsic speeds.
- it's not at all clearly stated if the Desktop GIS apps permanently saved the Clipped Contour Lines somewhere or not. And it's not at all clear which kind of data storage they used (directly accessing the disk-based Shapefiles ? loading all the stuff in memory ? some other kind of properly structured disk-based storage, may be temporary and transient ?)
- the amount of available RAM is absolutely not relevant; this Contest could eventually benefit from a faster CPU, but the required memory footprint is minimal (just few hundredth MB in the SpatiaLite's case). Consuming huge amounts of RAM doesn't looks at all a wise solution, and isn't at all strictly required in this specific case. When such a condition emerges this will probably mean that some barbaric brute-force data access strategy has been badly adopted.
- adopting some kind of parallel processing will obviously introduce a noticeable performance boost (SpatiaLite is single threaded; so using a quad-core processor had absolutely no effect at all).
What are we measuring, really ?
- many GFOSS implementations (QGIS, PostGIS, SpatiaLite) are based on GEOS; so any measured time effectively represents the intrinsic efficiency of GEOS much more than the efficiency of the top-level application itself. If a difference exists between different GEOS-based apps, this is simply when the top-level apps badly introduces some further overhead thus reducing the overall efficiency.
- other GFOSS Java implementations (gvSIG, OpenJump, uDIG) are based on JTS; but GEOS simply is a transposition in C++ of the original Java-based JTS, the underlaying algorithms are exactly the same for both libraries. So, roughly speaking, all GFOSS applications (both C, C++ and Java) after all are based on the top of the same base-level algorithms. And not at all surprisingly all measured timings are quite exactly the same.
- with the remarkable exception of GRASS GIS; this one adopts an original approach of its own, being based on a conceptually different Geometry model (topology vs simple features). Anyway, in this case too timings are very similar to the others.
- IMHO it's a good demonstration that we have complessively reached the best possible efficiency for this kind of problems. And that openness combined with collaboration/competition between many different independent projects always leads to good, robust and efficient software.
- many proprietary implementations looks to be embarrassangly less efficient. A sad demonstration that a development policy based on closedness and secrecy very difficultly pays on long term periods.
Back to main page