Aeropostaux - The World's Airmail Networks, A Map.
I recently sent a package overseas to a friend and was sort of wary of the whole thing. USPS says it offers international service but, interestingly, owns no planes, ships, or teleporting machines. So after mailing a package that in my mind was shipping into a black hole or a USPS trash container, I decided to do some investigating to see how this entire international mail machine works.
Airmail has definitely lost some of its magic in the internet age, but it is an incredible ad-hoc system running in nearly every country on the globe, entirely disconnected from diplomatic relations, and known for being somewhat reliable. The only other diplomatic agreement so widely embraced and available to the general public was passport standardization.
When I was trying to figure out the mechanics of airmail, I ran across a forum post that mentioned a publication by the Universal Postal Union, or UPU for short. The UPU is the third oldest international organization and represents the brave beginnings of globalization in the mid-1800s. The document I found was a publication called CN68: a 1,300-page PDF describing airmail details for nearly every country in the world. It describes connections rather than networks, so it’s not a route map - it’s a connectivity one - but I still found the data incredibly interesting. So I decided to parse the PDF and try to make a map of it. After all… my friend got her package, even though it got lost in Israel.
This project took me several weeks of work, so I’d like to share how I got from a 1,300-page French PDF to a map you can click on.
To see where a given country sends airmail, choose it in the dropdown below:
How I did it
Step 1: Getting the data out of the PDF.
Here is the (very technical) explanation of how I got the map from the PDF.
1) Figure out how to get data out of PDFs
PDFs are a great document format for many purposes but are pretty terrible for retrieving data via code. I started by looking for command line utilities that could grab the text out of a PDF, and found one called `pdftotext`, which I ran several times. The useful data in the UPU PDF is actually quite tabular, but not enough to be easy to deal with. I ended up having to find a better version of `pdftotext` from the Poppler project.

Once the text was being read properly, I tried many ways to get it to group into tables, like exporting to HTML and various command line options. The only option that really worked was `-layout`. So I ended up with a giant text file that looked like this:
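The nice thing about `-layout` output is that columns are aligned with runs of spaces, so a simple split on two-or-more whitespace characters recovers the cells. A minimal sketch, using a made-up line in the style of the UPU table (the actual file looks messier than this):

```python
import re

# A hypothetical line of `pdftotext -layout` output: columns are aligned
# with runs of spaces, while single spaces stay inside a cell's text.
line = "COLOMBO  SRI LANKA (REP. DEM. SOC.)      UL          quotidien"

# Splitting on two-or-more whitespace characters recovers the columns.
cells = re.split(r"\s{2,}", line.strip())
print(cells)
# ['COLOMBO', 'SRI LANKA (REP. DEM. SOC.)', 'UL', 'quotidien']
```

Single spaces inside a cell (like the country name) survive, which is exactly why `-layout` beat the other export options for this table.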
2) Get a list of unique strings representing cities, countries, and couriers.
As you can likely see in the sample above, the data set is almost 100% French. Not only that, but because this document is updated incrementally rather than wholesale, it makes references to countries that no longer exist, Indian city names that changed in the 1950s, countries whose perpetual wars prevent any sort of mail delivery, and French words for various European things. Don’t forget the misspellings! This document was hand-edited, and mistakes are floating around all over the place.
While I worked on the table-reading code above, I decided to build a dictionary of the unique cities and countries in the dataset. Because I didn’t keep revisions of each dictionary I built, its length varied from several hundred to several thousand words. Lowercasing and trimming spaces off the edges of strings was the best bet for keeping the dictionary small and less repetitive - it makes a pretty big difference when you have to hand-edit the broken data later.
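The dedupe logic is tiny but it pays for itself many times over. A sketch of the idea, with made-up strings:

```python
def normalize(s):
    # Lowercase and trim so "Paris ", "PARIS" and " paris" collapse
    # into a single dictionary entry.
    return s.strip().lower()

# Hypothetical raw strings as they come out of the PDF text.
raw = ["Paris ", "PARIS", " paris", "Bombay", "bombay "]

# A set keyed on the normalized form deduplicates for free.
dictionary = sorted({normalize(s) for s in raw})
print(dictionary)  # ['bombay', 'paris']
```

Every string that survives this pass is one fewer row to hand-edit in the spreadsheet later.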
3) Translate all strings in bulk
I don’t know French, nor do I care to, so I used my buddy Google Spreadsheets to translate the entire dictionary of cities and countries - plus a new data type: garbage reads from the table-parsing code above. You simply put `=GOOGLETRANSLATE(A1, "fr", "en")` in cell `B1` and drag the formula down the page. TADA, mostly English. And entirely free. Thanks Google!
4) Massive hand-cleanup operation
As you can imagine, the table-reading code above left the data a little messy. The next step after translation was to take the dirty columns in the dictionary spreadsheets and clean them up: each entry needs to be a neatly formatted city or country name with no accessory data. A lot of accessory data was attached to the cities, so it gets removed and normalized. Example below:
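A lot of this was genuinely manual, but some of it can be scripted. A minimal sketch of the kind of normalization involved - the specific rules and the sample string here are hypothetical, not the exact ones from the project:

```python
import re

def clean_city(raw):
    # Hypothetical cleanup rules: drop parenthesized notes, strip stray
    # digits and punctuation, collapse whitespace, then title-case.
    s = re.sub(r"\(.*?\)", "", raw)       # "(via aeroport)"-style notes
    s = re.sub(r"[^A-Za-z\s-]", "", s)    # leftover digits/punctuation
    return re.sub(r"\s+", " ", s).strip().title()

print(clean_city("  COLOMBO 1 (via aeroport)  "))  # Colombo
```

Rules like these knock out the easy 80%; the rest is exactly the hand-editing described above.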
5) Put all strings back into app, translating messed up ones to clean ones
Now that the bulk of the data was translated, I wanted to bring the translations into the app. So I added some code that compared each city or country name read from the PDF text against the corresponding column in the spreadsheet, and replaced it with the clean value. It’s important to note that this process needed to be repeated several times.
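Conceptually this pass is just a lookup table exported from the spreadsheet; the entries below are illustrative, not the real export:

```python
# Hypothetical map exported from the cleaned spreadsheet:
# normalized raw string -> clean English name.
translations = {
    "allemagne": "Germany",
    "sri lanka rep dem soc": "Sri Lanka",
}

def clean_name(raw, translations):
    key = raw.strip().lower()
    # Fall back to the original string so unmapped values surface,
    # ready for another round of hand-fixing in the spreadsheet.
    return translations.get(key, raw)

print(clean_name("  ALLEMAGNE ", translations))  # Germany
```

The fallback is what makes the "repeat several times" loop work: anything that comes through untranslated is a new row for the spreadsheet.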
6) Get city-country pairs
With the translations and hand-normalization complete, I modified the PDF text-parsing code to output a CSV of city-country pairs. This would let me use a Google Spreadsheets geocoder to establish mappable locations for each country’s mailing destinations.
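The export itself is straightforward - deduplicate the pairs and write a two-column CSV the geocoder can chew on. A sketch with placeholder pairs:

```python
import csv
import io

# Hypothetical parsed destinations: (city, country) tuples from the PDF.
pairs = [("Colombo", "Sri Lanka"), ("Sydney", "Australia"),
         ("Colombo", "Sri Lanka")]  # duplicates happen across countries

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["city", "country"])
writer.writerows(sorted(set(pairs)))  # dedupe, stable order
print(buf.getvalue())
```

Each unique pair only needs to be geocoded once, so deduplicating before export keeps the spreadsheet (and the geocoding quota) small.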
This process unfortunately uncovered another data issue: sometimes the cities and countries didn’t match up. It was frustrating - I thought my parser was flawed. Then I realized that the "incorrect" city-country pairs were the data’s odd notation for mail that travels to a city via the "wrong" country. This is how smaller countries access airmail without needing as much infrastructure of their own.
So I now had a new spreadsheet of city-country pairs, and I installed a Spreadsheets geocoder plugin that made easy work of geocoding the data that was ready to go. Of course, this process also revealed several hundred more strings in need of repair, so step 4 had to be repeated over & over & over again. I spent about 5-7 nights just dealing with this mundane data normalization.
7) Massive cleanup operation number 2 - make sure city-country pairs make sense.
Since, as mentioned above, some city-country combinations seemed wonky, I alphabetized the spreadsheet so I could hand-assign the proper coordinates to those pairs. I also created a Via column and made sure its data was present so the via routes can be mapped more interestingly in the future.
8) Use the city-country pairs inside the parser code
I then added the CSV of city-country pairs back to the parser, where it could assign the proper coordinates to each destination for each country in the PDF. Of course, there are still places where the data needs cleanup.
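Feeding the geocoded CSV back in amounts to building a `(city, country) -> (lat, lng)` lookup for the parser. A sketch, with an invented two-row CSV standing in for the real spreadsheet export:

```python
import csv
import io

# Hypothetical geocoded CSV exported from the spreadsheet.
geocoded_csv = """city,country,lat,lng
Colombo,Sri Lanka,6.93,79.85
Sydney,Australia,-33.87,151.21
"""

# Build a (city, country) -> (lat, lng) lookup for the parser to use
# when it emits each country's destination list.
coords = {
    (row["city"], row["country"]): (float(row["lat"]), float(row["lng"]))
    for row in csv.DictReader(io.StringIO(geocoded_csv))
}

print(coords[("Colombo", "Sri Lanka")])  # (6.93, 79.85)
```

Any destination missing from this lookup is another string that needs cleanup, which is how the remaining dirty spots keep getting found.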
9) Find the locations for the dispatching locations
In most Western or larger countries, mail isn’t dispatched from a single location; it ships from one of several. For dramatic effect, the map needs to show this data, but the dispatching-location data was a little messy and required more mundane normalization.
10) Map JSON
At this point, it was possible to generate JSON suitable for mapping. I went through and displayed the data in the browser. For most countries, the data was good to go; for some others, like Sri Lanka, there are some bugs (still) needing to be resolved. The data file was pretty enormous, so I used a lightweight "dictionary" storage format in the JSON. There are still places where the code could be cleaned up to run more efficiently.
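The "dictionary" trick here is just string interning: store each unique place name once in a string table and reference it by index, so names repeated across hundreds of routes cost a couple of bytes instead of a full string. A sketch of that encoding (the exact JSON schema used on the map is the author's; this layout is illustrative):

```python
import json

# Hypothetical destination list with repeated country names.
destinations = [("Colombo", "Sri Lanka"), ("Sydney", "Australia"),
                ("Melbourne", "Australia")]

# String table: every unique name exactly once, in a stable order.
strings = sorted({s for pair in destinations for s in pair})
index = {s: i for i, s in enumerate(strings)}

# Pairs become small integer indexes into the string table.
compact = {
    "strings": strings,
    "pairs": [[index[city], index[country]] for city, country in destinations],
}
print(json.dumps(compact))
```

Decoding is the reverse lookup: `strings[pair[0]]` gives the city back. The savings grow with every repeated name in the file.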
Step 2: Mapping
Mapping was the “easy” part of this project; the amount of work it required was minuscule. I just used Mapbox GL, Turf.js, Arc.js (good for dramatic effect), and good ol’ jQuery. React is too much hassle for tiny UX widgets like these, since they got rid of the in-browser JSX stuff. Sad sigh.
- The data for mailing frequency exists for most places, and I want to show this as different lines stemming from the selected country. Someday!
- There are bugs in the data - see Sri Lanka, USA, or Australia for examples. These are hard to fix!
- The map also could do a better job panning around on its own.