Serving real-time, tiled, point data directly from DynamoDB

Recently, I’ve been interested in how to serve and work with geographic data with the least amount of “work” possible. Everything here can be done pretty easily with PostGIS and a large enough server, but I’m always up for the challenge of doing things in a slightly different way.

I’m working on an app that will store lots of points around the world – points that people have submitted. Users will open this app, and be presented with a map with points that have been submitted in their vicinity. Once they zoom out, I want to show clusters of points to the users, with a number showing how many points are represented in that cluster.

There are two major ways to do this:

There are pros and cons for both of these methods:

This basically boils down to two solutions: let the client process the data, or let the server process the data. Serving the whole GeoJSON document may work up to a point (I usually use something like 1-2MB as a general limit), datasets larger than that need to be tiled. I use Mapbox’s tippecanoe for this, but tippecanoe takes a GeoJSON file and outputs a static mbtiles file with precomputed tiles. Great for static data, but updating this means regenerating all the tiles in the set.

This is where the title of this post comes in. I’ve been thinking about how to serve multiple large tilesets directly from DynamoDB, and I think I have a pretty good idea of what might work. If there’s something like this that already exists, or you find something wrong with this implementation, please let me know!

Here is the basic data structure:

PKSKRaw DataTileset#{tileset ID}#Data#{first 5 characters of geohash}{full geohash}#{item ID}Pregenerated TileTileset#{tileset ID}#Tile#{z}/{x}/{y}{generated timestamp}

This is how it’s supposed to work: raw point data is stored conforming to the “Raw Data” row. Updates to raw data are processed by DynamoDB streams, and a message to update the tiles that point exists in is enqueued.

Because this potentially can get very busy (imagine a case where hundreds or thousands of points in the same tile are updated within seconds of each other), the queue would have a delay associated with it . The processor that generates tiles based on messages in the queue would take a look at the tile it’s about to generate, and compare the generated timestamp with the timestamp of the item update. If the tile timestamp is newer then the item, that item can be safely disregarded, because another process has already picked it up.

I initially thought about using tile numbers to index the raw data as well, but decided to use geohash instead. The reasoning behind this is because geohash is a Z-order grid, we can use this property to optimize DynamoDB queries spanning multiple cells. Another problem with using tile numbers is the ambiguity of points that lie on tile boundaries. In contrast, the precision of a geohash can be arbitrarily increased without affecting queries.

Is there a way to encode tile numbers in to a Z-order curve? Maybe interpolating the X and Y values bitwise? How can this account for the zoom parameter? Am I overthinking this? Can I get away with using a sufficiently high zoom parameter (like z=22 or 23)? (How about a crazier idea: can the geohash algorithm be massaged in to corresponding directly to tiles? Is it worth it?)

Anyways, this is sort of something that I’ve been thinking about for a while, and I want to get started on a proof-of-concept implementation in the near future. Let me know on Twitter if you have any input / comments, or even better, if you want to collaborate with me on making something like this.