
Creating Vector Tiles from Scratch using TypeScript

I currently work at a Japanese mapping startup called Geolonia. We specialize in customizing and displaying geospatial data, and a big part of this is making vector tiles. There are a lot of ways to do this, the most popular being tippecanoe from Mapbox, ST_AsMVT in PostGIS, OpenMapTiles, and tilemaker. We normally use tippecanoe to convert GeoJSON data to vector tiles and tilemaker to generate OpenStreetMap-based tiles, but there isn’t really an easy way to generate arbitrary vector tiles in plain old TypeScript / JavaScript.

Recently, I’ve been working on a proxy for converting data directly to vector tiles on the fly, and that is exactly what I need to do — generate arbitrary tiles without relying on outside processes like tippecanoe or PostGIS (this specific proxy will be running in AWS Lambda).

The Mapbox Vector Tile (MVT) spec is the de-facto standard for delivering tiled vector data, and that’s what we’ll be using. MVT tiles are Protocol Buffers-encoded files that include the drawing instructions for map renderers. The MVT spec handily includes the protobuf definition file, so we can use this to compile to JavaScript and generate the TypeScript definition files.

For the following, I’m assuming you have a git repository with the basic JavaScript and TypeScript essentials already set up.

Working with the MVT format in TypeScript

First, we create a submodule to bring in the MVT protobuf definition.

git submodule add https://github.com/mapbox/vector-tile-spec.git

This should create a new directory, vector-tile-spec, with the contents of the vector tile specification repository inside. Now, we can set up the protobuf-to-JS compiler.

My project uses ES modules, so these instructions will generate ES modules, but if you need something else, check out the protobufjs documentation.

In package.json, create a new script called build:protobuf:

"scripts": {
  ...
  "build:protobuf": "pbjs -t static-module -w es6 -o src/libs/protobuf.js vector-tile-spec/2.1/vector_tile.proto && pbts -o src/libs/protobuf.d.ts src/libs/protobuf.js"
}

Install protobufjs:

npm install --save protobufjs

Now, we’re ready to generate the code.

npm run build:protobuf

This command will generate two files: src/libs/protobuf.d.ts and src/libs/protobuf.js. Now, we’re ready to generate tiles.
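
As a quick sanity check that the generated module works, you can decode a tile you already have on hand. Here’s a minimal sketch, assuming a sample .mvt file exists (the path is a placeholder; if the tile is gzip-compressed, decompress it first):

import { readFileSync } from "node:fs";
import { vector_tile } from "../libs/protobuf"; // adjust the path to wherever the pbjs output lives

// Decode an existing vector tile and list its layer names.
const data = readFileSync("./sample.mvt"); // placeholder path
const tile = vector_tile.Tile.decode(data);
console.log(tile.layers.map((layer) => layer.name));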

The Mapbox Vector Tile format

Mapbox provides a very good guide on how data is laid out in the MVT container in their documentation. I highly recommend reading through it, but here are a few takeaways especially relevant to us:

  • Feature property values and keys are deduplicated and referred to by index.
  • Coordinates are encoded using “zigzag” encoding to save space while supporting negative coordinates. Starting from zero, the decoded values run 0, -1, 1, -2, 2, -3, 3…
  • Drawing points, lines, and polygons differs from GeoJSON in a few major ways:
    • Instead of absolute coordinates, drawing is done using a cursor and relative coordinates to the last operation.
    • Instead of declaring whether a geometry is a point, line, or polygon, MVT uses commands to imperatively draw geometry (see the helper sketch after this list).
    • Coordinates are local to the tile, where (0,0) refers to the top-left corner of the tile.
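
Before diving into a full example, here is a small sketch of how those command integers are built: the spec packs a 3-bit command ID together with a repeat count into a single integer. The helper names below are mine, not part of the spec.

// Pack a command ID (MoveTo = 1, LineTo = 2, ClosePath = 7)
// together with a repeat count into one command integer.
const command = (id: number, count: number) => (id & 0x7) | (count << 3);

// Zigzag encoding: 0, -1, 1, -2, 2, ... become 0, 1, 2, 3, 4, ...
const zigzag = (value: number) => (value << 1) ^ (value >> 31);

command(1, 1); // MoveTo with 1 coordinate pair  => 9
command(2, 3); // LineTo with 3 coordinate pairs => 26
command(7, 1); // ClosePath                      => 15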

Now that we’re ready to jump into actually building up these tiles, let’s try some code.

import { vector_tile } from "../libs/protobuf";

// zigzag encoding
const zz = (value: number) => (value << 1) ^ (value >> 31);

const tile = vector_tile.Tile.create({
  layers: [
    {
      version: 2, // This is a constant
      name: "test", // The name of the layer
      extent: 256, // The extent of the coordinate system local to this tile. 256 means that this tile has 256×256=65536 pixels, from (0,0) to (256,256).
      features: [
        {
          // id must be unique within the layer.
          id: 1,
          type: vector_tile.Tile.GeomType.POLYGON,
          geometry: [
            // We start at (0,0)
            ((1 & 0x7) | (1 << 3)), // MoveTo (1) for 1 coordinate.
              zz(5), zz(5), // Move to (5,5)
            ((2 & 0x7) | (3 << 3)), // LineTo (2) for 3 coordinates.
              zz(1),  zz(0), // Draw a line from (5,5) to (6,5)
              zz(0),  zz(1), // Draw a line from (6,5) to (6,6)
              zz(-1), zz(0), // Draw a line from (6,6) to (5,6)
            15, // Close Path (implicitly closes the line from (5,6) to (5,5))
          ],
          tags: [
            0, // test-property-key-1
            0, // value of key 1

            1, // test-property-key-2
            1, // value of key 2 and 3

            2, // test-property-key-3
            1, // value of key 2 and 3
          ],
        }
      ],
      keys: [
        "test-property-key-1",
        "test-property-key-2",
        "test-property-key-3"
      ],
      values: [
        {stringValue: "value of key 1"},
        {stringValue: "value of key 2 and 3"},
      ],
    }
  ]
});

I tried to annotate the code as much as possible. Can you figure out what shape this draws?

Now, to convert this to binary:

const buffer: Uint8Array = vector_tile.Tile.encode(tile).finish();

And that’s it!
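
Since the proxy I mentioned will be running in AWS Lambda, here is a rough sketch of what returning this buffer from a handler could look like. It assumes an API Gateway-style proxy integration, and buildTile is a hypothetical stand-in for code like the example above:

import { vector_tile } from "../libs/protobuf";

// Hypothetical: assembles a vector_tile.Tile for the requested tile.
declare function buildTile(z: number, x: number, y: number): vector_tile.Tile;

export const handler = async (event: { pathParameters: { z: string; x: string; y: string } }) => {
  const { z, x, y } = event.pathParameters;
  const tile = buildTile(Number(z), Number(x), Number(y));
  const body = vector_tile.Tile.encode(tile).finish();

  return {
    statusCode: 200,
    headers: { "Content-Type": "application/vnd.mapbox-vector-tile" },
    // Binary responses from Lambda proxy integrations are base64-encoded.
    body: Buffer.from(body).toString("base64"),
    isBase64Encoded: true,
  };
};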

Conclusion

I was initially pretty scared about creating vector tiles from scratch — I’ve found them pretty hard to work with in the past, leaning on tools like vt2geojson to first convert them to GeoJSON. I was pleasantly surprised to find out that it wasn’t as hard as I thought it was going to be. I still got stuck on a few parts — it took me a few trial and error runs to figure out that my absolute-to-relative math was off — but once everything was working, it was definitely worth it.

Let me know what you think.


Serving real-time, tiled, point data directly from DynamoDB part 2 – the plan

I previously wrote about something I wanted to do with DynamoDB and geospatial data, and got a lot of responses on Twitter.

In the end, I think I’m going to go with a hybrid approach — using Uber’s H3 geospatial indexing system to generate clusters and to take care of indexing points, and then generating vector tiles as a separate process based on that data.

Here’s a bird’s-eye view of how the data will flow.

[Diagram: data flow overview]
  • New Point (H3 index resolution 9): PK = PointData#89283470d93ffff, SK = {ULID}
  • DynamoDB trigger → Refresh H3 Aggregation (H3 index resolution n-1): PK = HexAgg#88283472b3fffff, SK = {ULID}, repeated until n=0
  • Generate Vector Tile (tile zoom n-1): PK = GenTile#z/x/y, SK = {ULID}, repeated until n=0

And here’s the plan:

  1. A point is created/updated/deleted. This point has an H3 index at resolution 9, and looks something like this. I probably would be able to get away with resolution 8, but I’m going with 9 for now because it’s a good resolution for clusters when zoomed out. [Screenshot: a highlighted H3 resolution 9 hexagon encompassing a portion of Tokyo Station; map data © OpenStreetMap contributors, hexagon grid viewer: clupasq/h3-viewer]
  2. This event is passed off to a function that calculates the parent hexagon of the point at resolution n-1 (for the first iteration, this would be 8). Then, all points within this hexagon are queried and aggregated, and written to an aggregation corresponding to resolution n-1. This function loops until we get to resolution 0 (see the sketch after this list).
  3. After the aggregation hexagons have finished calculating, we will do a similar process for generating the actual vector tiles. Some kind of mapping will have to be made between zoom levels and H3 resolutions, but this is going to be highly subjective and I plan to figure it out on the fly.
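
Here is the rough sketch referenced in step 2. It assumes the h3-js v4 API (getResolution, cellToParent), and the DynamoDB helpers are hypothetical placeholders:

import { cellToParent, getResolution } from "h3-js";

// Hypothetical data-access helpers (not designed yet).
declare function queryPointsInCell(cell: string): Promise<unknown[]>;
declare function writeAggregation(cell: string, pointCount: number): Promise<void>;

// Walk up the hierarchy from the changed point's resolution-9 cell,
// refreshing the aggregation at each coarser resolution until 0.
async function refreshAggregations(pointCell: string): Promise<void> {
  for (let res = getResolution(pointCell) - 1; res >= 0; res--) {
    const parent = cellToParent(pointCell, res);
    const points = await queryPointsInCell(parent);
    await writeAggregation(parent, points.length);
  }
}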

And that’s about it — all I have to do now is actually implement it. I have a few worries about how the end result is going to turn out: higher-level aggregations not being perfect due to hexagons not cleanly subdividing, or positions of clusters not lining up with the data (for example, if the data in a hexagon is heavily weighted to one side, showing a corresponding cluster at the center of the hexagon would not be appropriate)… These problems will become apparent when I start the implementation, and I already have seeds of ideas that I might be able to use to mitigate them.

Anyway, that’s it for today. Thank you to everyone who responded to my tweets! If you have any comments, please don’t hesitate to leave one.


Serving real-time, tiled, point data directly from DynamoDB

Recently, I’ve been interested in how to serve and work with geographic data with the least amount of “work” possible. Everything here can be done pretty easily with PostGIS and a large enough server, but I’m always up for the challenge of doing things in a slightly different way.

I’m working on an app that will store lots of points around the world — points that people have submitted. Users will open this app, and be presented with a map with points that have been submitted in their vicinity. Once they zoom out, I want to show clusters of points to the users, with a number showing how many points are represented in that cluster.

There are two major ways to do this:

  • Get the bounding box (the coordinates of what the user’s map is currently displaying), send it to the server, and the server will return a GeoJSON object containing all the points in that bounding box.
  • Split the data into predefined tiles and have the client request the tiles that are required for the current map.

There are pros and cons for both of these methods:

  • GeoJSON would contain the raw data for all points within the bounding box. When the user is zoomed in, this is not a big problem, but when they’re zoomed out, the client will have to do a lot of work calculating clusters. Also, as the user pans around, new queries would have to be made to piece together the point data for areas they don’t have the data for. Not only can this data get pretty big, but the user experience is also subpar — data is normally loaded after the map pan finishes.
  • Tiles at the base zoom level (where all data is shown, without any clustering) can be generated on-the-fly easily, but tiles at lower zoom levels need to use all of the information in the tile bounds to generate clusters.

This basically boils down to two solutions: let the client process the data, or let the server process the data. Serving the whole GeoJSON document may work up to a point (I usually use something like 1-2MB as a general limit), but datasets larger than that need to be tiled. I use Mapbox’s tippecanoe for this, but tippecanoe takes a GeoJSON file and outputs a static mbtiles file with precomputed tiles. Great for static data, but updating this means regenerating all the tiles in the set.

This is where the title of this post comes in. I’ve been thinking about how to serve multiple large tilesets directly from DynamoDB, and I think I have a pretty good idea of what might work. If there’s something like this that already exists, or you find something wrong with this implementation, please let me know!

Here is the basic data structure:

  • Raw Data: PK = Tileset#{tileset ID}#Data#{first 5 characters of geohash}, SK = {full geohash}#{item ID}
  • Pregenerated Tile: PK = Tileset#{tileset ID}#Tile#{z}/{x}/{y}, SK = {generated timestamp}

This is how it’s supposed to work: raw point data is stored conforming to the “Raw Data” row. Updates to raw data are processed by DynamoDB streams, and a message to update the tiles that the point exists in is enqueued.

Because this potentially can get very busy (imagine a case where hundreds or thousands of points in the same tile are updated within seconds of each other), the queue would have a delay associated with it. The processor that generates tiles based on messages in the queue would take a look at the tile it’s about to generate, and compare the generated timestamp with the timestamp of the item update. If the tile timestamp is newer than the item’s, that item can be safely disregarded, because another process has already picked it up.
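
Here is a sketch of that comparison; the timestamps are assumed to be ISO 8601 strings, and the names are placeholders:

// Returns true if the queued item update still needs to be processed,
// i.e. the tile was generated before the item was last updated.
function shouldRegenerateTile(
  tileGeneratedAt: string | undefined, // from the pregenerated tile row
  itemUpdatedAt: string                // from the queued message
): boolean {
  if (!tileGeneratedAt) return true; // no tile generated yet
  return tileGeneratedAt < itemUpdatedAt; // ISO 8601 strings sort lexicographically
}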

I initially thought about using tile numbers to index the raw data as well, but decided to use geohash instead. The reasoning behind this is that a geohash is a Z-order grid, so we can use this property to optimize DynamoDB queries spanning multiple cells. Another problem with using tile numbers is the ambiguity of points that lie on tile boundaries. In contrast, the precision of a geohash can be arbitrarily increased without affecting queries.
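
To make the key design a bit more concrete, here is roughly what a prefix query against the “Raw Data” rows could look like with the AWS SDK for JavaScript v3. The table name is a placeholder, and this assumes the geohash prefix is at least five characters long so it maps to a single partition key:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const docClient = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Query raw points whose geohash starts with the given prefix (length >= 5).
async function queryByGeohashPrefix(tilesetId: string, geohashPrefix: string) {
  const { Items } = await docClient.send(
    new QueryCommand({
      TableName: "tileset-table", // placeholder
      KeyConditionExpression: "PK = :pk AND begins_with(SK, :prefix)",
      ExpressionAttributeValues: {
        ":pk": `Tileset#${tilesetId}#Data#${geohashPrefix.slice(0, 5)}`,
        ":prefix": geohashPrefix,
      },
    })
  );
  return Items ?? [];
}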

Is there a way to encode tile numbers into a Z-order curve? Maybe interleaving the X and Y values bitwise? How can this account for the zoom parameter? Am I overthinking this? Can I get away with using a sufficiently high zoom parameter (like z=22 or 23)? (How about a crazier idea: can the geohash algorithm be massaged into corresponding directly to tiles? Is it worth it?)

Anyways, this is sort of something that I’ve been thinking about for a while, and I want to get started on a proof-of-concept implementation in the near future. Let me know on Twitter if you have any input / comments, or even better, if you want to collaborate with me on making something like this.