Geocoding of Laboratory Data
This is a general outline of what I use and how I have geocoded laboratory results. I have a paper in press in the Annals of Clinical Biochemistry on the geocoding of vitamin D
The main computer I use is a Mac Pro with 2 x 2.8 Quad core Intel Xeon processors. Inside its expanded a little for speed and has a 256 GB solid state drive as a boot drive and then 2 x 1 TB drives in a Striped Raid setting. This isn’t strictly necessary but I’ve found that having a decent amount of computing power is recommended when working with big data sets and getting decent performance on SQL queries. I’m sure windows computers can cope but why not start a movement for more flexible computing in the NHS and ask for a Mac!
(ps I was hoping there would be a new Mac Pro in the WWDC 2012 announcements but no joy).
I also use an iMac so you don’t have to go for the “truck”.
1st Get Your Data
The 1st step in the process is to get the data from the LIMS system. Here at Heartlands we use a telepath system (original flavour!) we have two methods for extracting data from the system.
List generation routines
A full SQL map
The key point with this step is that your lims system should have a link to the HISS/PAS as this allows you to get some more details about the patient from the linked database. They key piece of information we require is the patient postcode.
We are aiming for data in a simple .csv (comma separated value) format along the lines of:
Result,Test_Code,Date_Time_Received,Specimen_Number,Postcode,Registration_Number <- The field names NA,NVITD,2011–07–07 09:17:00,“C,11.9658974.K”,B23 7EB,2 <- The actual data
This csv data is then imported into a MySQL database
http://www.mysql.com MySQl is the “worlds most popular open source database” and by being open source its freely available for any system (windows, linux and Mac). The community server version is free http://www.mysql.com/downloads/mysql/ and easy to install.
You will also need a MySQL graphical interface if you aren’t keen on using the command line. There is one freely available from MySQL itself http://www.mysql.com/downloads/workbench/ or there will be a number of 3rd party ones also available. I use Razor SQL http://www.razorsql.com largely because it also connects to other SQL databases including Cache which is the SQL map into our telepath system.
It should be noted that on occasion it is necessary to resort to the command line for some queries (especially the ones that take a long time or return big datasets as they can cause out of memory errors for the graphical systems)
In most systems importing csv data into the MySQL database is simple as there are wizards which help the process. Heres a short video of my doing it with some sample data.
Part 2 starting the Geocoding Process
Now that we have the data in the data base we need to start the geocoding process.
Georeferencing can be achieved in a number of ways, the most common of which are post-code referencing and area referencing.
An area reference code is a code that identifies an observation with a defined geographical area(8). Super Output Areas (SOA) are a geography for the collection and publication of small area statistics. The SOA layers form a hierarchy based on aggregations of Output Areas (OAs). They are better for statistical comparison as they are of much more consistent size and each layer has a specified minimum population to avoid the risk of data disclosure. SOAs are not be subject to frequent boundary change and therefore allow statistical comparison over time. In addition, they build on the existing availability of data for OAs. Currently there are:
- 175,434 OAs in England and Wales (165,665 in England; 9,769 in Wales).
- 34,378 Lower Layer SOAs (32,482 in England; 1,896 in Wales).
- 7,193 Middle Layer SOAs (6,780 in England; 413 in Wales).
SOAs are intended as a purely statistical geography. They are based on the boundaries of the post-code based Output Area geography from which they are constructed. OAs are based on post-codes as at Census Day (29 April 2001), and are constrained by the 2003 statistical ward & parish boundaries(9) Post-codes
The post-code is a widely used and understood geo-reference and has a number of advantages because of this. It is freely available and is critical part of any patient record. There are a number of disadvantages to post-codes in that post-codes were designed to allow the efficient delivery of mail, a post-code can straddle two or more other geographies and can change or be removed. In addition, their use could have potential privacy implications as they are related to individual or groups of houses and locations.
What we need to do is convert the Postcode data we have to LSOA or similar area referencing.
Fortunately the NHS connecting for health project has a database which can convert Postcodes to this standard geography. This data is available from http://nww.connectingforhealth.nhs.uk/ods/downloads/officenatstats (Note this is an nww link and therefore you wont be able to connect to this site unless you’re connected to the N3 network i.e usually a computer connected to the NHS network). On the day I looked at this site, the file you need is:
gridall.zip, Complete Gridlink NHS Postcode File, weighing in at 39Mb
This data is supplied as a CSV file and again you will need to download and add to your MySQL database as previously. When you’ve done this, you should have something along the lines of:
Table gridall ============= UnitPostCode, UnitPostCodeVar, DateOfIntro, DateOfTerm, Easting, Northing, County, LAD, EW, PostCodeUserType, GridRefPosInd, Country, SHA, Region_GOR, HA, POBOX, PCT, 1991CensusED, 1991CensusED1, EDPosInd, 1998Ward, OA, OAI, PanSHA, LSOA, SDZ, MSOA, URI, URIS, PSHA, OLDPCT, OLDIT, Constit, CancerReg, CancerNetwork ------------- UnitPostCode varchar(255) UnitPostCodeVar varchar(255) DateOfIntro int(11) DateOfTerm int(11) Easting int(11) Northing int(11) County varchar(255) LAD varchar(255) EW varchar(255) PostCodeUserType int(11) GridRefPosInd int(11) Country varchar(255) SHA varchar(255) Region_GOR varchar(255) HA varchar(255) POBOX varchar(255) PCT int(11) 1991CensusED varchar(255) 1991CensusED1 varchar(255) EDPosInd int(11) 1998Ward varchar(255) OA varchar(255) OAI int(11) PanSHA varchar(255) LSOA varchar(255) SDZ varchar(255) MSOA varchar(255) URI int(11) URIS int(11) PSHA varchar(255) OLDPCT varchar(255) OLDIT varchar(255) Constit varchar(255) CancerReg varchar(255) CancerNetwork varchar(255)
The meaning of these fields is explained in the accompanying documentation but, the two we are interested in are:
- UnitPostCodeVar and LSOA
I’ve found I get better matches if I use the UnitPostCodeVar field rather than the UnitPostCode field. I have no idea why but its worth trying both to see which matches most.
The other tip I would give here if you are a database expert (or know one) is to get them to create some indexes for you. These indexes speed up the queries significantly. I index both postcode fields and LSOA
Linking to two datasets together.
To link the 2 data sets together (Our lab data and the Gridall data) we have to create a join query in MySQL. My preference is to get the query to create a new table at the same time.
An example query would be:
CREATE TABLE matched AS (SELECT
tvitdn_gh_plus_heft_tvitdn_results_postcodes_only.Req_Postcode = nhs_postcode_lookup.UnitPostCodeVar
We will now have a database table of all records that have been able to be matched to LSOA code and we have georeferenced our data.
We can now link this data to other georeferenced datasets e.g. Indices of Multiple Deprivation and start to do analysis based on this parameter rather than postcodes.
The next step in producing something meaningful is to create maps of the data.
This is again based on the use of the LSOA data. The data can be obtained from the newly revamped data.gov.uk site at: http://data.gov.uk/dataset/lower_layer_super_output_area_lsoa_boundaries
(previously you had to send off for a CDROM of the data)
The shape file is:
“The Esri shapefile or simply a shapefile is a popular geospatial vector data format for geographic information systems software. It is developed and regulated by Esri as a (mostly) open specification for data interoperability among Esri and other software products. Shapefiles spatially describe geometries: points, polylines, and polygons. These, for example, could represent water wells, rivers, and lakes, respectively. Each item may also have attributes that describe the items, such as the name or temperature”.
In effect these file can draw lines on your map around areas.
To be honest, this is where the process gets a little complicated!
You can use a database system such as PostgreSQL (http://www.postgresql.org) which “understands” geography as its possible to load gospatial extensions. This is extremely powerful and allows you to do a number of calculations like how many x in a particular geographical area.
I’ve had better success using a utility called ogr2ogr which is part of the GDAL set of libraries (http://www.gdal.org/index.html)
I have a table created from the geography CD in MyMSQL database of the following format:
ID int(11) NOT NULL,
LSOA04CD varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
LSOA04NM varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
ogc_geom geometry DEFAULT NULL, PRIMARY KEY (
LSOA04CD) ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
This allows linkage of the data with a subset of the geography data using an SQL query something similar to this with LSOA code and the link code.
2010_analysis_set_LOSA.LSOA FROM geocoding.
2010_analysis_set_LOSA, geocoding.lsoa_geometry lsoa_geometry WHERE
2010_analysis_set_LOSA.LSOA = lsoa_geometry.LSOA04CD
By doing this, this will speed up the map drawing process as you will only be drawing the LSOA areas where you have data.
We now have data, LSOA and geography parameters ready to plot onto a graph.
I have found it quicker to plot shapefile data rather than pulling data “live” from the database. The latter can be done though.
To produce a shape file you can use ogr2ogr
ogr2ogr outputfilename MYSQL:databasename,user=dbusername,password=password tablename
This will produce a shapefile with the geometry data and your laboratory data embedded in it.
There are a number of map drawing packages on the market, both commercial and open source. I use the open source QGIS software http://www.qgis.org which is available for all platforms.
Once installed its also easy to use and extensible with a number of 3rd party plugins which allow further functionality.
This short video outlines the processes involved:
Key points are: 1. Make sure your coordinate reference system is set to do conversions on the fly 2. Make sure you use British National Grid as the CRS for the shapefile layer 3. Make sure you use the appropriate CRS for the mapping provider you use. 4. Use the Openlayers overview and openlayers plugins to get the physical maps whereby you can overlay your shape file data.
You should end up with something like this:
- Item Tag: Geocoding, R&D, Vitamin D