Applying Spatial Scan Statistics to London’s 1854 Cholera Epidemic
Introduction
This paper explores methods for using SaTScan, a spatial scan statistics program, alongside Geographic Information Systems (GIS) in data-poor environments. Specifically, we work with the data John Snow collected during London’s 1854 cholera epidemic to see whether we can identify high-risk clusters and calculate their respective relative risks (RR). Deploying scan statistics for cluster detection can be very useful in a public health context, and this paper provides an opportunity to work through a real-world example. We use a range of tools available in ArcMap, building upon methods devised by other researchers, to obtain an estimate of the population at risk in Snow’s original study area, a crucial figure for running scan statistics. We can then compare our findings to Snow’s and assess our methods accordingly. We expect to find at least one cluster with high relative risk at or near the Broad Street Pump. Before continuing, it is important to acknowledge two limitations of this project.
First, the scope of this project requires us to make educated estimates of several important factors, including the population at risk. Because we are working with estimates, the figures we use will be imprecise. Second, the dataset we draw from was originally published in 1855 and may itself be imprecise. John Snow, speaking to this point, cautioned that his data was incomplete, for instance not capturing the deaths of residents who contracted cholera and later fled London, dying in the countryside (Snow, 1855, p. 45). Snow (1855) argues that his data, however imprecise it may be, still conveys an important story and helps to identify the Broad Street Pump as the point source of the 1854 cholera epidemic (pp. 45-46). As we explore Snow’s data and apply cluster detection methods, we should keep this in mind. Again, the primary goal of this paper is to explore how the GIS methods described below can be combined with SaTScan, and to consider how one might use the program when key population data is unavailable.
The following section discusses the methods used in depth and gives a rationale for why they were chosen. It is followed by a discussion of what was observed, which references two maps and an accompanying table containing key SaTScan output. We will look at Snow’s original work and ask whether our observed results were expected. We then conclude with a brief discussion of our findings.
Methods
Our first task is getting the data into the format we need. We use data provided by Wilson (2013), a researcher at the University of Southampton who georeferenced all of John Snow’s 1854 outbreak data, including death and water pump locations, and made it available to the public for download via The Guardian’s DataBlog (see Rogers, 2013, March 15). When we import this data into ArcMap we see a georeferenced raster image of Snow’s map set to the OSGB 1936 British National Grid projected coordinate system. We also have point files for the 489 deaths and 8 water pumps recorded in Snow’s (1855) study area, including the Broad Street Pump.
Snow (1855) discusses his rationale for choosing his study area, observing “The most terrible outbreak of cholera which ever occurred in this kingdom, is probably that which took place in Broad Street, Golden Square, and the adjoining streets, a few weeks ago. Within two hundred and fifty yards of the spot where Cambridge Street joins Broad Street, there were upwards of five hundred fatal attacks of cholera in ten days. The mortality in this limited area probably equals any that was ever caused in this country” (p. 38). He limits his investigation of the cholera outbreak to this area because it appeared to produce alarmingly high mortality rates. He collected data for the Soho neighborhood of London for the period between August 19th and September 30th, 1854 (Snow, 1855, p. 46). This is the area we see on the map georeferenced by Wilson (2013), and it will serve as our study area as well. We refer to it throughout the rest of this paper simply as Soho. On Snow’s 1855 map, deaths are indicated by black bars, overlaid by points added by Wilson that provide a location and a count. From these points we find a total of 489 deaths from cholera. Note that households reporting no cholera deaths are not identified on Snow’s original map. We will use this data to run scan statistics, but we first need data on the population at risk.
To do this, we turn to Koch & Denike (2009), who ran into the same problem when they attempted to calculate a mortality ratio. They resolved it by overlaying maps that would have been available to Snow and that identified all of the houses in Soho, allowing them to count the houses and estimate a rough population for the area (p. 1247). For this paper we follow a similar process. One of the maps used by Koch & Denike (2009) was produced by Edmund Cooper (p. 1247), and we use it to create a composite with Snow’s 1855 map. This will allow us to estimate a population at risk and finish our analysis. It may be of interest to note that Snow’s map of cholera-related deaths was not the first map published on the 1854 epidemic. Edmund Cooper, employed by the Metropolitan Commission of Sewers, is credited with producing the first cholera map for this area during the outbreak, and his map was somewhat more detailed, showing every house in the Soho area of London (Brody, Rip, Vinten-Johansen, Paneth & Rachman, 2000, p. 66). Cooper was investigating claims that sewer lines crossing older plague burial grounds were transporting noxious air into households, bringing cholera with it, but he found no evidence of such a link (Brody et al., 2000, p. 66). Nevertheless, his map will help us obtain an estimate of the population at risk. Brody et al. (2000) remark that it is curious that Cooper and Snow, who both created detailed maps of the outbreak and had similar data, could not agree on the source of the 1854 outbreak (p. 66). If scan statistical tools had been available to them, perhaps their findings would have led them to a consensus.
We will use scan statistics to test Snow’s hypothesis that the Broad Street Pump was the source of the outbreak, and we look to Cooper’s map to supply data that is missing from Snow’s map. We save a scanned image of Cooper’s map (available to the public for download online) as a JPEG, import it into ArcMap as a raster file, and georeference it with ArcMap’s georeferencing tool. This lets us overlay Cooper’s map on top of Snow’s and give it spatial coordinates. We are careful to keep the same OSGB 1936 British National Grid projected coordinate system. Once this is done, we see that the street grid on Snow’s map matches the street grid on Cooper’s, and that houses fill in the blocks for most of the Soho neighborhood.
Our next step involves creating a new point feature class. We create a new geodatabase, calling it “Pop_At_Risk”. Then, using the editor tool, we draw a point over each household. Some areas on Snow’s map do not overlap with the Cooper map, and we estimate the number of households for these blocks based on what we have assigned to areas of similar size. Following this, we open the attribute table of our newly created point features and add fields (short integer) for Population at Risk and Deaths. We then go through the map and assign the cases provided by Wilson (2013) to the household points where case points and population points correspond. In deciding on the household size we again look to Koch & Denike (2009), who estimated the mean household size for this area of London at the time to be about 10 people per household, based on 1851 census data (p. 1247). As a rough check, Google Earth street view shows that most buildings in this area are limited to 3 or 4 stories, so a population of 10 per home seems reasonable.
We assume that everyone who resided in Soho was at risk of cholera, because area residents likely obtained drinking water from the same local sources and cholera can be transmitted through contaminated water. Again, this is imprecise but will work for our purposes. An alternative would be to create two or three different “Population at Risk” fields with varying estimates and run the analysis for each. For the purposes of this paper we limit our analysis to a single estimated population at risk, which totals 15,435. We now have a master population shapefile that contains the number of cases, the estimated population at risk and their spatial location.
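The alternative mentioned above, running the analysis under several population estimates, can be sketched in a few lines of Python. This is only an illustration: the household count and the alternative household sizes of 8 and 12 are hypothetical values chosen for the example, not figures taken from our digitized map or from Koch & Denike (2009).

```python
# Sensitivity sketch for the population at risk.
# The household count and the sizes 8 and 12 are hypothetical assumptions;
# only the size of 10 persons per household comes from the text.

def population_scenarios(household_count, persons_per_household_options):
    """Return {assumed household size: estimated population at risk}."""
    return {size: household_count * size for size in persons_per_household_options}

# Hypothetical number of digitized household points.
households = 1500

scenarios = population_scenarios(households, [8, 10, 12])
for size, population in sorted(scenarios.items()):
    print(f"{size} persons per household -> population at risk {population}")
```

Each scenario would then be written to its own “Population at Risk” field and run through SaTScan separately, so that the stability of the detected clusters can be compared across estimates.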
Our final step before running scan statistics is assigning X and Y coordinates in decimal degrees to each location in our master population file. To do this we add two new fields for X and Y, select Calculate Geometry, set the property to the corresponding X or Y coordinate, and ensure our units are in decimal degrees. We then configure SaTScan for our analysis by setting the parameters described below.
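Once the attribute table has been exported, the three SaTScan input files can be produced with a short script. The sketch below assumes the space-delimited layouts described in the SaTScan user guide (case file: location ID and case count; population file: location ID, year and population; coordinates file: location ID, latitude and longitude) and should be checked against the guide for the version in use. The function name, file names, field names and sample records are all hypothetical.

```python
import os
import tempfile

def write_satscan_inputs(records, out_dir, year="1854"):
    """Write case (.cas), population (.pop) and coordinates (.geo) files.

    Each record is a dict with hypothetical keys:
    fid, deaths, population, lat, lon.
    The column layouts are assumptions based on the SaTScan user guide.
    """
    paths = {ext: os.path.join(out_dir, f"soho.{ext}") for ext in ("cas", "pop", "geo")}
    with open(paths["cas"], "w") as cas, \
         open(paths["pop"], "w") as pop, \
         open(paths["geo"], "w") as geo:
        for r in records:
            # Households with no deaths are simply left out of the case file.
            if r["deaths"] > 0:
                cas.write(f'{r["fid"]} {r["deaths"]}\n')
            pop.write(f'{r["fid"]} {year} {r["population"]}\n')
            geo.write(f'{r["fid"]} {r["lat"]:.6f} {r["lon"]:.6f}\n')
    return paths

# Illustrative records only; these are not rows from our shapefile.
sample = [
    {"fid": 1, "deaths": 2, "population": 10, "lat": 51.5133, "lon": -0.1367},
    {"fid": 2, "deaths": 0, "population": 10, "lat": 51.5140, "lon": -0.1370},
]
output_paths = write_satscan_inputs(sample, tempfile.mkdtemp())
print(output_paths)
```

Writing latitude before longitude matters here, since the Y field from ArcMap corresponds to latitude and the X field to longitude.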
We set SaTScan to run a purely spatial analysis using a discrete Poisson probability model, scanning for areas with high or low rates. We set our maximum spatial cluster size to 2.5% of the population at risk, use a circular scanning window and run 999 replications. After running SaTScan we find 9 statistically significant clusters, identified on Map 1 as clusters 1-9, all with higher than expected rates of cholera. Residents of households falling within these clusters were at greater risk of dying from cholera than those living outside them. When viewing the statistically significant clusters in ArcMap, we observe that the Broad Street Pump, the source of the cholera epidemic, falls almost in the center of Cluster 1.
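To make the settings above more concrete, the following sketch shows the log likelihood ratio commonly given for the Poisson-based spatial scan statistic, together with the rank-based Monte Carlo p-value that 999 replications imply. It illustrates the general technique under the assumption of no covariates; it is not SaTScan’s own code, and the function names are ours.

```python
import math

def _xlog(x, denominator):
    """x * ln(x / denominator), taking the 0 * ln(0) limit as 0."""
    return 0.0 if x == 0 else x * math.log(x / denominator)

def poisson_llr(cases_in, expected_in, total_cases):
    """Log likelihood ratio for one candidate window (high or low rates).

    expected_in is the expected case count under the null hypothesis,
    e.g. total_cases * window_population / total_population.
    """
    return (_xlog(cases_in, expected_in)
            + _xlog(total_cases - cases_in, total_cases - expected_in))

def monte_carlo_p_value(observed_llr, replicate_llrs):
    """Rank-based p-value: (1 + replicates >= observed) / (1 + replicates)."""
    exceed = sum(1 for value in replicate_llrs if value >= observed_llr)
    return (exceed + 1) / (len(replicate_llrs) + 1)

# With 999 replicates the smallest attainable p-value is 1 / 1000.
print(monte_carlo_p_value(5.0, [1.0] * 999))
```

The scan evaluates this ratio for every candidate circle up to the maximum cluster size and reports the window with the largest value as the most likely cluster. The same maximum is computed for each of the 999 simulated data sets to build the reference distribution.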
Discussion
From our map and our table of key SaTScan output data above we can draw several inferences. First, cluster 1, the least likely to occur by chance, centers almost directly over the Broad Street Pump and contains 42 cholera-related deaths, with a relative risk (RR) of 9.26. Residents of houses within cluster 1 therefore had roughly 9.26 times the risk of dying from cholera (about 826% higher) compared to those residing outside the cluster. We observe an even higher relative risk in cluster 2, which reports one fewer cholera-related fatality. We then observe a drop in RR to the still high levels of 4.99 in cluster 3 and 4.84 in cluster 4. Curiously, SaTScan identified clusters 5 and 6 as having very high RR, but with very small populations.
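For readers who want to see how a relative risk figure of this kind arises, the sketch below computes the ratio of observed to expected deaths inside a cluster divided by the same ratio outside it, which is the definition we understand SaTScan to report for the Poisson model. The totals of 489 deaths and 15,435 residents come from our data; the cluster population of 150 is a hypothetical value used only for illustration, since the actual cluster populations appear in the output table.

```python
def relative_risk(cases_in, pop_in, total_cases, total_pop):
    """Relative risk of a cluster versus the rest of the study area.

    Computed as (observed / expected inside) / (observed / expected outside),
    with expected counts proportional to population (no covariates assumed).
    """
    expected_in = total_cases * pop_in / total_pop
    ratio_inside = cases_in / expected_in
    ratio_outside = (total_cases - cases_in) / (total_cases - expected_in)
    return ratio_inside / ratio_outside

# 42 deaths in the cluster; the cluster population of 150 is hypothetical.
rr = relative_risk(cases_in=42, pop_in=150, total_cases=489, total_pop=15435)
print(f"Illustrative relative risk: {rr:.2f}")
```

Algebraically this equals the death rate inside the cluster divided by the death rate outside it, which is why a cluster containing a single house with several deaths can produce a very large RR.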
In considering our final output data, it is important to recognize that while clusters 5 and 6 have extraordinarily high RR and are statistically significant, their populations are very small (one house each), and this distorts the RR figure. Upon reviewing the SaTScan output for cluster 5, we see that only one household, FID 257, is included and appears at the center of the cluster. In examining the data for cluster 6, we observe a similar phenomenon, again with only one household, FID 490, appearing in the center of the cluster. For the purposes of this exercise, we removed these two clusters from the final map to avoid confusion or misinterpretation of our findings.
In our final map, we see that cluster 1 centers over the Broad Street Pump, with several other statistically significant clusters located near this area, supporting Snow’s hypothesis. This paper demonstrates how one can use scan statistics to perform cluster detection in data-poor environments, as long as one possesses enough information to make informed population estimates. The methods described above may have similar applications in other data-poor settings, including outbreaks of vector-borne illness in refugee or internally displaced persons camps, where resources, especially data, may be limited. They provide a way to identify areas with statistically significant clusters of an outbreak. What was not demonstrated in this paper was SaTScan’s ability to also identify clusters of lower RR, which might provide crucial insights into the factors that help mitigate an outbreak. This should be an area for further investigation.
Works Cited:
Brody, H., Rip, M. R., Vinten-Johansen, P., Paneth, N., & Rachman, S. (2000). Map-making and myth-making in Broad Street: the London cholera epidemic, 1854. The Lancet, 356(9223), 64-68.
Koch, T., & Denike, K. (2009). Crediting his critics' concerns: Remaking John Snow's map of Broad Street cholera, 1854. Social Science & Medicine, 69(8), 1246-1251.
Snow, J. (1855). On the mode of communication of cholera. John Churchill.
Data Sources:
Michael (2013, April 13). DataViz history: myth-making and evolution of the ghost map. Retrieved from: http://datavizblog.com/category/henry-mayhew/
Rogers, S. (2013, March 15). John Snow’s data journalism: the cholera map that changed the world. Retrieved from: http://www.theguardian.com/news/datablog/2013/mar/15/john-snow-cholera-map