Serving real-time, tiled, point data directly from DynamoDB

February 23, 2022

Recently, I’ve been interested in how to serve and work with geographic data with the least amount of “work” possible. Everything here can be done pretty easily with PostGIS and a large enough server, but I’m always up for the challenge of doing things in a slightly different way.

I’m working on an app that will store lots of points around the world – points that people have submitted. Users will open this app, and be presented with a map with points that have been submitted in their vicinity. Once they zoom out, I want to show clusters of points to the users, with a number showing how many points are represented in that cluster.

There are two major ways to do this:

Get the bounding box (the coordinates of what the user’s map is currently displaying), send it to the server, and the server will return a GeoJSON object containing all the points in that bounding box.
Split the data in to predefined tiles and the client will request the tiles that are required for the current map.

There are pros and cons for both of these methods:

GeoJSON would contain the raw data for all points within the bounding box. When the user is zoomed in, this is not a big problem, but when they’re zoomed out, the client will have to do a lot of work calculating clusters. Also, as the user pans around, new queries would have to be made to piece together the point data for areas they don’t have the data for. Not only can this data get pretty big, the user experience is subpar – data is normally loaded after the map pan finishes.
Tiles at the base zoom level (where all data is shown, without any clustering) can be generated on-the-fly easily, but tiles at lower zoom levels need to use all of the information in the tile bounds to generate clusters.

This basically boils down to two solutions: let the client process the data, or let the server process the data. Serving the whole GeoJSON document may work up to a point (I usually use something like 1-2MB as a general limit), datasets larger than that need to be tiled. I use Mapbox’s tippecanoe for this, but tippecanoe takes a GeoJSON file and outputs a static mbtiles file with precomputed tiles. Great for static data, but updating this means regenerating all the tiles in the set.

This is where the title of this post comes in. I’ve been thinking about how to serve multiple large tilesets directly from DynamoDB, and I think I have a pretty good idea of what might work. If there’s something like this that already exists, or you find something wrong with this implementation, please let me know!

Here is the basic data structure:

PKSKRaw DataTileset#{tileset ID}#Data#{first 5 characters of geohash}{full geohash}#{item ID}Pregenerated TileTileset#{tileset ID}#Tile#{z}/{x}/{y}{generated timestamp}

This is how it’s supposed to work: raw point data is stored conforming to the “Raw Data” row. Updates to raw data are processed by DynamoDB streams, and a message to update the tiles that point exists in is enqueued.

Because this potentially can get very busy (imagine a case where hundreds or thousands of points in the same tile are updated within seconds of each other), the queue would have a delay associated with it . The processor that generates tiles based on messages in the queue would take a look at the tile it’s about to generate, and compare the generated timestamp with the timestamp of the item update. If the tile timestamp is newer then the item, that item can be safely disregarded, because another process has already picked it up.

I initially thought about using tile numbers to index the raw data as well, but decided to use geohash instead. The reasoning behind this is because geohash is a Z-order grid, we can use this property to optimize DynamoDB queries spanning multiple cells. Another problem with using tile numbers is the ambiguity of points that lie on tile boundaries. In contrast, the precision of a geohash can be arbitrarily increased without affecting queries.

Is there a way to encode tile numbers in to a Z-order curve? Maybe interpolating the X and Y values bitwise? How can this account for the zoom parameter? Am I overthinking this? Can I get away with using a sufficiently high zoom parameter (like z=22 or 23)? (How about a crazier idea: can the geohash algorithm be massaged in to corresponding directly to tiles? Is it worth it?)

Anyways, this is sort of something that I’ve been thinking about for a while, and I want to get started on a proof-of-concept implementation in the near future. Let me know on Twitter if you have any input / comments, or even better, if you want to collaborate with me on making something like this.