July 7th 2023

Lak Lakshmanan, Towards Data Science

In some countries, postcodes are points or routes and not areas. For example, the last three characters of a Canadian postcode correspond to the local delivery unit, which may be the houses on one side of a street or a rural route. Similarly, UK postcodes have the form “YO8 9UR”, and one could be as small as a single building in London. In a 5+4 US zipcode, the last four digits determine a postal delivery route (so, a set of addresses) and not an area. Contrary to common belief, US 5-digit zipcodes are not areas either: they are simply a collection of the 5+4 postal routes, typically served from a single post office.

France, as befits the originator of the metric system, is very logical. In France, postcodes correspond to an area. In Paris, for example, the last two digits correspond to the arrondissement, so 75008 corresponds to the 8th arrondissement of Paris and is truly an area. Mail delivery routes are probably suboptimal, though.

Because people and stores have addresses, which have associated postcodes, most consumer data is reported at a postcode level. To carry out computations such as areal coverage or market share, it is necessary to determine the areal extent of a postcode. This is easy in France, but difficult in any country where postcodes are postal routes and not areas.

Because their postcodes are mail delivery addresses, there are infinitely many polygons that can be drawn to partition the UK/Canada/US into valid postcode “regions”. This is why UK demographic data is published by the Office for National Statistics (ONS) on administrative areas (such as counties), not postcodes. The US census publishes data at a “Zip code tabulation area” (ZCTA) level, and US voting data is published at the county level. When working with UK/Canada/US data, you’ll often have a similar mixture of addresses (which are points) and spatial data collected over an area.
How do you associate these together?

To illustrate, I’ll tie together UK postcode data and census data in this article.

If you are in a hurry, you can download the results of my analysis from https://github.com/lakshmanok/lakblogs/tree/main/uk_postcode. There are a couple of CSV files there, and they contain the data you may need.

ukpopulation.csv.gz has the following columns:

ukpostcodes.csv.gz has one extra column, the polygon for each postcode in WKT format:

Please note that use of the data or code is at your own risk. It is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

In this article, I’ll step through how I created the dataset in that GitHub repo. You can follow along with me using the notebook uk_postcodes.ipynb.

We start from three sources of raw data released under the UK Open Government Licence:

My notebook downloads the data files using wget:

Reading CSV directly into Pandas is straightforward:

This gives me the centroid lat-lon of every postcode:

There are many packages that allow you to read Excel files into Pandas, but I decided to use DuckDB because I’ll be using it later in the notebook to join the three datasets using SQL:

This Excel file has 7 rows of header info that I can drop. I also rename the columns to meaningful variables:

That was the sheet named P01. Note that the P04 sheet has population density information, but it is not useful because population is not distributed evenly over the area. We’ll derive the population of each postcode.

I write this out to a CSV file so that I can easily read it from DuckDB.

Similarly, I extract the necessary columns from the UK statistics office file and write it to a CSV file:

Now, we can use DuckDB to join the three prepared datasets to get the population density at every postcode. Why DuckDB? Even though I can do the join in Pandas, I find SQL to be much more readable.
Besides, this gave me an excuse to use the new hot thing.

I join the datasets by first reading them into DuckDB using read_csv_auto. Then, I look up the ward, parish, and county that the postcode is in and find the area (parish, ward, or county) that the population density data is reported at:

Note that the spatial quantities are scalars that correspond to the whole area and not the postcode. They have to be split among the postcodes.

The all_persons, females, and males values above correspond to the whole area, not to the specific postcode. We could split them proportionally based on the area of each postcode, but there are infinitely many polygons that can fit the postcodes, and as we will see later, the areal extent of postcodes near parks and lakes is a bit iffy. So we’ll do something simple that gives us a single, unique answer: we’ll split the scalar value evenly among all the postcodes in the area! This is not as strange as it sounds. In higher density neighborhoods, there are more postcodes, so dividing equally among the postcodes is akin to distributing the scalar quantity proportional to the population density.

At this point, we have the quantity for each postcode, which is the association that we needed:

So, write it out:

For many analyses, we’ll want the postcodes to be not points but areas. Even though there are infinitely many polygons that we can use to divide the UK such that there is only one postcode centroid in each polygon, there does exist a notion of the “best” polygon. That is the Voronoi partition, which splits the area such that any point belongs to the postcode closest to it:

To compute this, we can use scipy:

I’m assuming here that the area is small enough that there isn’t much of a difference between the geodesic distance and Euclidean distance computed from the latitude and longitude. UK postcodes are small enough that this is the case.

The result is organized such that, for every point, there is a region consisting of a set of vertices.
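
As a rough illustration of that point-to-region-to-vertices layout, here is a minimal sketch using scipy.spatial.Voronoi. The five toy centroids and the helper name region_vertices are assumptions made for this example; they are not taken from the notebook, which works on the real postcode centroids.

```python
import numpy as np
from scipy.spatial import Voronoi

# Hypothetical centroids as (longitude, latitude) pairs; the real input
# would be the postcode centroids loaded earlier.
points = np.array([
    [0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5],
])

vor = Voronoi(points)

def region_vertices(vor, point_idx):
    """Return the polygon vertices for one input point, or None if the
    region is unbounded (scipy marks a missing vertex with index -1)."""
    # point_region maps an input point to its region; each region is a
    # list of indices into vor.vertices.
    region = vor.regions[vor.point_region[point_idx]]
    if len(region) == 0 or -1 in region:
        return None
    return vor.vertices[region]

center = region_vertices(vor, 4)  # interior point, so a closed polygon
corner = region_vertices(vor, 0)  # point on the convex hull, so unbounded
```

Points on the outer edge of the dataset have unbounded regions, so any code that turns regions into polygons has to decide what to do with them. Skipping them, as this sketch does, is only one option.
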
We can create a WKT polygon string for each point using:

Here’s an example result:

We can create a GeoDataFrame and plot a subset of postcodes:

Here’s Birmingham:

Note the horn at the top and the large area of blue in the middle. What’s going on? Let’s look at Birmingham in Google Maps:

Notice the park areas? The Royal Mail doesn’t have to deliver to anyone there, so there are no postcodes there. Therefore, the nearby postcodes get “extended” into those areas. This will cause problems in spatial calculations as those postcodes will appear to be much larger than they are.

To fix this, I’ll take a rather heuristic approach. I’ll grid the UK into 0.01x0.01 degree (approximately 1 sq km) grid cells and find grid cells that have no postcodes in them:

We’ll create fake postcodes in the center of such unpopulated grid cells, and assign a zero population density to those. Add these fake postcodes to the actual postcodes, and repeat the Voronoi analysis:

Now, when we plot Birmingham, we get something much nicer:

It is this dataframe that I saved as the second CSV file:

We can load the CSV file into BigQuery and do some spatial analysis with it, but it is better to have BigQuery parse the last string column as a geometry first and have the data clustered by postcode:

Now, we can easily query it. For example, we can use ST_AREA for the postcodes:

Spatial analysis often requires areal extent, not just point locations. In countries where postcodes are points/routes, there are infinitely many ways to generate a polygonal spatial extent for the postcodes. A reasonable approach is to use Voronoi regions to create polygons that contain those postcodes. However, if you do so, you will get unnaturally large polygons near lakes or parks where the post office does not deliver mail. To fix this, also grid the country and create artificial postcodes at unpopulated grid cells. In this article, I demonstrated how to do this for the UK.
The associated notebook can be adapted to other places.
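
If you are adapting the approach elsewhere, the gridding heuristic (find grid cells with no postcode and place a filler point at each cell center) might look roughly like the sketch below. The function name empty_cell_centers, the toy coordinates, and the choice to grid only the bounding box of the input points are assumptions for illustration; the notebook may do this differently.

```python
import numpy as np

def empty_cell_centers(lons, lats, res=0.01):
    """Return (lon, lat) centers of grid cells that contain no input point.

    The grid covers only the bounding box of the inputs; `res` is the
    cell size in degrees.
    """
    lons = np.asarray(lons, dtype=float)
    lats = np.asarray(lats, dtype=float)
    lon0, lat0 = lons.min(), lats.min()
    # Integer grid-cell index of every point.
    ix = np.floor((lons - lon0) / res).astype(int)
    iy = np.floor((lats - lat0) / res).astype(int)
    occupied = np.zeros((ix.max() + 1, iy.max() + 1), dtype=bool)
    occupied[ix, iy] = True
    # Every cell without a point gets a filler location at its center.
    ex, ey = np.nonzero(~occupied)
    return np.column_stack([lon0 + (ex + 0.5) * res,
                            lat0 + (ey + 0.5) * res])

# Three toy points on a 3x2 grid leave three cells empty.
fillers = empty_cell_centers([0.0, 0.025, 0.025], [0.0, 0.0, 0.015])
```

The filler points would then be appended to the real centroids, with zero population, before rerunning the Voronoi step. A bounding-box grid also creates fillers over open sea, which may or may not be what you want for a given country.
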
