How to determine what constitutes a greater urban area of a city using open street map open data


Buenos Aires City is the capital and largest city of Argentina with almost 3 million inhabitants [1]. This means less than 40% of New York City population, with almost 25% of the land area of NYC [2]. This means Buenos Aires has higher density (almost 130% of the NYC density). The Greater Buenos Aires conurbation constitutes the fourth-most populous metropolitan area in the Americas, with a population of around 13 million (almost 65% of the Greater New York Population).

The national statistics institute in Argentina (INDEC) defined in a study [3] the Buenos Aires Urgan Aglomeration (AGBA) as ‘the geographical area bounded by the “population envelope”; What is also often called “urban spot”. “Population envelope” is understood as a line that marks the limit to where the continuity of urban dwellings extends. This line moves with time and, by the way, does not respect the administrative delimitations’.

The problem

This urban area is an always changing entity. It keeps changing its borders, as well as space within also changes in functionality. In peripheral areas, an area that used to be a mixture of urban and rural space may have grown into a fully developed urban setting. An old industrial installation (factories, ports, railroads, etc.) may have changed to a residential space. This changes may not be reflected in census data until the next census comes along (this means 10 years for most countries). And we worry about census data because it still is the main source of information about cities, specially when we try to get information for small geographic units. This units are called radios in Argentine census (in the US the smallest unit is the block).

Eventually, need to see how this changes express in the census radios. Luckily, for INDEC establishes which radios are urban. But some of them are “mixed”. So, me need to figure out which portion of those radios are actually urban. Without entering in the whole debate about whether urban it’s just a function of population density, we assume them to be related so we will treat an area with several dwellings as urban. This is true for the radios in the outer edges of the AGBA, but also for some of them within. For example, a radio that contains a huge park. Only a small portion of it may contain dwelling and be actually populated. This is the problem we need to solve: which radios and portion of radios we need to consider highly dense and urban.

The data

We begin using an unofficial radios shape file. This contains the type of the radio ( urban, rural, mix). We realize that using this unofficial shapefile hurts the reproducibility of this work, but in the INDEC website there is no information about the radio’s type. Provincial authorities have a dataset with this attribute, but it has some problems. We could build our own shapefile using this and also the Buenos Aires city shape file because we can consider all of its radios to be urban (although not all of them are highly dense and populated).

But how do we know what parts are densely populated? We can use OSM data via MapZen. OSM records most of the streets in an urban area and has criteria for residential streets. We assume that any area with a series of overlapping residential streets, we can consider it an urban area. So, any radio with a rural-urban mixed type or that includes bast unpopulated areas, can be re-framed to fit this area where we found the residential streets. We didn’t use footway as it was a source of errors: we ended up with sports and clubs as residential areas.

OSM typology of residential streets

Data analysis

We first select the radios belonging to those 30 districts that the INDEC defines as being part of the AGBA. Then we focus on those that are a mix between urban or rural. But also we look for those urban radios with low density (1 quantile). We well work with those radios that either are mix in type or they have low density.

agba = gpd.GeoDataFrame.from_file(home + ‘/Downloads/paisNoOficial/rad_paisp.shp’)#retain only those radios the belong to a 'departamento' in AGBA
agba = agba.loc[agba.DEPARTAMENTO.isin(departamentos.values()),:]
#calculate density based on population
agba['densidad'] = agba.Pob / agba.AREA
agba['densidad'] = pd.qcut(agba.densidad,10,labels=False)
agba['mixYden'] = (agba.TIPO == 'M') | (agba.densidad == 0)
agbaAltaDens = agba.loc[~agba.mixYden,:]
agbaBajaDens = agba.loc[agba.mixYden,:]

Now we need to take a look at the street data from OSM. After downloading the data we remove every street segment larger than 2km (for this we need to use a projection in meters) and create a buffer around that street in order to be able to perform the overlay operation later.

calles = gpd.read_file(home + ‘/Downloads/buenos-aires_argentina.osm2pgsql-shapefiles/buenos-aires_argentina_osm_line.shp’)
calles = calles.loc[:,[‘highway’,’geometry’]]
mask = (calles.highway == 'residential') | (calles.highway == 'living_street') | (calles.highway == 'footway') | (calles.highway == 'pedestrian')
calles = calles.loc[mask,:]
#change projection in meters for POSGAR94 Argentina 5
calles = calles.to_crs(epsg=22185)calles['largo'] = x: x.length)calles = calles.loc[calles.largo <= 2000,:]
calles['buffer'] = calles.geometry.buffer(100)
calles.columns = [u'highway', u'geometry']

Here we have a OSM residential street going through 3 radios.

Using the overlay method in GeoPandas we can divide each segment into the portion that belongs to each radio.

intersec = gpd.overlay(radiosSel, calle, how=’intersection’)
fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(1,1,1)

We do this with another street, we append the resulting table to the previous one, and dissolve by the feature that identifies the radios (LINK).

intersec2 = gpd.overlay(radiosSel, calle2, how=’intersection’)
appenDF = intersec2.append(intersec)
appenDFdiss = appenDF.dissolve(by=’LINK’)

We then repeat this process for all of the lines. But previously we do a spatial join because we don’t want to intersect every street with every radio in the AGBA, but only with those radios that the street actually intersects.

primerID = calles.calleID.unique()[0]
counter = 0
for i in calles.calleID.unique():
#for each street
subset = calles.loc[calles.calleID==i,:]
unaLinea = subset.iloc[0:1,:]
conjuntoDeRadios = agbaBajaDens.loc[agbaBajaDens.LINK.isin(subset.LINK),:]
if i == primerID:
intercept = gpd.overlay(conjuntoDeRadios, unaLinea, how=’intersection’)
tmp = gpd.overlay(conjuntoDeRadios, unaLinea, how=’intersection’)
intercept = intercept.append(tmp)
counter += 1

Finally we use dissolve on the streets:

primerLINKid = intercept.LINK.unique()[0]
for i in intercept.LINK.unique():
if i == primerLINKid:
dissolve = intercept.loc[intercept.LINK==i,:].dissolve(by=’LINK’).reset_index()
tmp = intercept.loc[intercept.LINK==i,:].dissolve(by=’LINK’).reset_index()
#tomo los intercepts dentro de un radio por su LINK
tmp = intercept.loc[intercept.LINK==i,:]
#creo el primer elemento de la cascade_union
geoConError = tmp.geometry.iloc[0]
for i in range(1,tmp.shape[0]):
geoConError = shapely.ops.cascaded_union([geoConError,tmp.geometry.iloc[i]])
tmp = tmp.iloc[0:1,:].reset_index().drop([‘index’],axis=1)
tmp[‘geometry’][0] = geoConError

dissolve = dissolve.append(tmp)

A similar algorithm was used in a study [4] done by the National Secretary Of Energy, although it produced different results.

The results

Where it worked

Within the urban area, we were successful in removing some areas that contain no dwellings and therefore there is no population there. The typical situations are old industrial infrastructures (railroads, ports, factories) or recreational places (parks, sports clubs).

Factory and sports club area excluded from original radios
Railroad, stadiums and shopping malls excluded from original radios

On the borders of the urban area, we were successful in reshaping the radios to the actual area were dwellings and population are located. Small residential areas in a mixture of rural and urban settings, as well as gated communities and country clubs were clearly identified. There is a trade-off with the buffer parameter: to wide and incorporates areas without households because are close to a residential street (some areas of the stadiums are considered residential). But to narrow and it will produce wholes between streets for those blocks slightly bigger blocks.

Country clubs on the borders of the urban area

Where it didn’t

As we can see in the previous image, OSM is sensitive enough to exclude the area still undeveloped of that gated community on the west portion of the map. Maybe that is way no user uploaded street information to OSM. But in the same picture, we see some areas that should be included as populated areas, but we missed them (a bushy area in the center of the map).

The basic problem of OMS is that street data is crowdsourced and therefore location, shape, and type are given by the user. So, we can have streets missing from the OSM data and we do not detect a populated area where it there is one. Also, a street can be wrongfully assigned to a residential type where there are no actual dwellings. In this case, some area will be classified as densely populated when actually there are no households in that area (may be is a commercial or business area).

Airports and parks excluded, but with a erroneous type for the street in the middle of the park

In this picture we see that the parks and airport are correctly excluded. But a street in the middle of the park was added wrongfully as ‘residential’ for some user.

Elongated shapes

Also, one final disadvantage is that this method, while it could distinguis the populated portion in the mix radios, it also produces some elongated shapes that are not desirable.

The full set of results and the Jupyter Notebook that produced this can be found in:


[1] Censo 2010 INDEC

[2] United States Census Bureau 2010

[3] INDEC. (2003). ¿Qué es el gran buenos aires? (Tech. Rep.). Buenos Aires,

[4] Pino, F., Bellini, V., Stryjek, L., & Moroni, M. (2015). El uso de osm como herramienta para la identificación de núcleos poblados (Tech. Rep.). Buenos
Aires, Argentina: Secretarı́a de Energı́a de la Nación — Ministerio de Enerı́a
y Minerı́a de la Nación.




Urban Data Scientist Twitter: @lephcero. Github: /alephcero

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

Clustered Randomised Controlled Trials

Research Dive Review: Statistics for the SDGs

Building High Performing Data Science Teams

Year-Over-Year Vacancy Rate Increases in the City of Boston vs. Suburbs

Gencove’s new pig haplotype reference panel

Simple Machine Learning Techniques To Improve Your Marketing Strategy: Demystifying Uplift Models

Data Science vs Data Engineering vs Machine Learning Engineering

Exploring Systems Innovation in Cities

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Felipe González (@lephcero)

Felipe González (@lephcero)

Urban Data Scientist Twitter: @lephcero. Github: /alephcero

More from Medium

Swan Lake’s Back Nine

Retirement After One Month — Retired Musings

Using Outdoor Survival As A Rite Of Passage

a man backpacking up a rock-strewn hill

So, What Exactly is Volcanic Wine?

smoldering volcano