Geocoding of Laboratory Data

Written by Craig Webster on Friday, 29 June 2012. Posted in Biochemistry, General

This is a general outline of what I use and how I have geocoded laboratory results. I have a paper in press in the Annals of Clinical Biochemistry on the geocoding of vitamin D

Workflow

The main computer I use is a Mac Pro with 2 x 2.8 Quad core Intel Xeon processors. Inside its expanded a little for speed and has a 256 GB solid state drive as a boot drive and then 2 x 1 TB drives in a Striped Raid setting. This isn’t strictly necessary but I’ve found that having a decent amount of computing power is recommended when working with big data sets and getting decent performance on SQL queries. I’m sure windows computers can cope but why not start a movement for more flexible computing in the NHS and ask for a Mac!

http://www.apple.com/mac/

(ps I was hoping there would be a new Mac Pro in the WWDC 2012 announcements but no joy).

I also use an iMac so you don’t have to go for the “truck”.

1st Get Your Data

The 1st step in the process is to get the data from the LIMS system. Here at Heartlands we use a telepath system (original flavour!) we have two methods for extracting data from the system.

  1. List generation routines 

Telepath list generator

  1. A full SQL map

SQL Map

The key point with this step is that your lims system should have a link to the HISS/PAS as this allows you to get some more details about the patient from the linked database. They key piece of information we require is the patient postcode.

We are aiming for data in a simple .csv (comma separated value) format along the lines of:

Result,Test_Code,Date_Time_Received,Specimen_Number,Postcode,Registration_Number <- The field names NA,NVITD,2011–07–07 09:17:00,“C,11.9658974.K”,B23 7EB,2 <- The actual data

This csv data is then imported into a MySQL database

http://www.mysql.com MySQl is the “worlds most popular open source database” and by being open source its freely available for any system (windows, linux and Mac). The community server version is free http://www.mysql.com/downloads/mysql/ and easy to install.

You will also need a MySQL graphical interface if you aren’t keen on using the command line. There is one freely available from MySQL itself http://www.mysql.com/downloads/workbench/ or there will be a number of 3rd party ones also available. I use Razor SQL http://www.razorsql.com largely because it also connects to other SQL databases including Cache which is the SQL map into our telepath system.

It should be noted that on occasion it is necessary to resort to the command line for some queries (especially the ones that take a long time or return big datasets as they can cause out of memory errors for the graphical systems)

In most systems importing csv data into the MySQL database is simple as there are wizards which help the process. Heres a short video of my doing it with some sample data.

 

Part 2 starting the Geocoding Process

Now that we have the data in the data base we need to start the geocoding process.

Georeferencing can be achieved in a number of ways, the most common of which are post-code referencing and area referencing.

Area Referencing

An area reference code is a code that identifies an observation with a defined geographical area(8). Super Output Areas (SOA) are a geography for the collection and publication of small area statistics. The SOA layers form a hierarchy based on aggregations of Output Areas (OAs). They are better for statistical comparison as they are of much more consistent size and each layer has a specified minimum population to avoid the risk of data disclosure. SOAs are not be subject to frequent boundary change and therefore allow statistical comparison over time. In addition, they build on the existing availability of data for OAs. Currently there are:

  • 175,434 OAs in England and Wales (165,665 in England; 9,769 in Wales).
  • 34,378 Lower Layer SOAs (32,482 in England; 1,896 in Wales).
  • 7,193 Middle Layer SOAs (6,780 in England; 413 in Wales).

SOAs are intended as a purely statistical geography. They are based on the boundaries of the post-code based Output Area geography from which they are constructed. OAs are based on post-codes as at Census Day (29 April 2001), and are constrained by the 2003 statistical ward & parish boundaries(9) Post-codes

The post-code is a widely used and understood geo-reference and has a number of advantages because of this. It is freely available and is critical part of any patient record. There are a number of disadvantages to post-codes in that post-codes were designed to allow the efficient delivery of mail, a post-code can straddle two or more other geographies and can change or be removed. In addition, their use could have potential privacy implications as they are related to individual or groups of houses and locations.

What we need to do is convert the Postcode data we have to LSOA or similar area referencing.

Fortunately the NHS connecting for health project has a database which can convert Postcodes to this standard geography. This data is available from http://nww.connectingforhealth.nhs.uk/ods/downloads/officenatstats (Note this is an nww link and therefore you wont be able to connect to this site unless you’re connected to the N3 network i.e usually a computer connected to the NHS network). On the day I looked at this site, the file you need is:

gridall.zip, Complete Gridlink NHS Postcode File, weighing in at 39Mb

This data is supplied as a CSV file and again you will need to download and add to your MySQL database as previously. When you’ve done this, you should have something along the lines of:

Table gridall
=============
UnitPostCode, UnitPostCodeVar, DateOfIntro, DateOfTerm, Easting, Northing, County, LAD, EW, PostCodeUserType, GridRefPosInd, Country, SHA, Region_GOR, HA, POBOX, PCT, 1991CensusED, 1991CensusED1, EDPosInd, 1998Ward, OA, OAI, PanSHA, LSOA, SDZ, MSOA, URI, URIS, PSHA, OLDPCT, OLDIT, Constit, CancerReg, CancerNetwork
-------------
UnitPostCode     varchar(255)
UnitPostCodeVar  varchar(255)
DateOfIntro      int(11)
DateOfTerm       int(11)
Easting          int(11)
Northing         int(11)
County           varchar(255)
LAD              varchar(255)
EW               varchar(255)
PostCodeUserType int(11)
GridRefPosInd    int(11)
Country          varchar(255)
SHA              varchar(255)
Region_GOR       varchar(255)
HA               varchar(255)
POBOX            varchar(255)
PCT              int(11)
1991CensusED     varchar(255)
1991CensusED1    varchar(255)
EDPosInd         int(11)
1998Ward         varchar(255)
OA               varchar(255)
OAI              int(11)
PanSHA           varchar(255)
LSOA             varchar(255)
SDZ              varchar(255)
MSOA             varchar(255)
URI              int(11)
URIS             int(11)
PSHA             varchar(255)
OLDPCT           varchar(255)
OLDIT            varchar(255)
Constit          varchar(255)
CancerReg        varchar(255)
CancerNetwork    varchar(255)

The meaning of these fields is explained in the accompanying documentation but, the two we are interested in are:

  1. UnitPostCodeVar and LSOA

I’ve found I get better matches if I use the UnitPostCodeVar field rather than the UnitPostCode field. I have no idea why but its worth trying both to see which matches most.

The other tip I would give here if you are a database expert (or know one) is to get them to create some indexes for you. These indexes speed up the queries significantly. I index both postcode fields and LSOA

Linking to two datasets together.

To link the 2 data sets together (Our lab data and the Gridall data) we have to create a join query in MySQL. My preference is to get the query to create a new table at the same time.

An example query would be:

CREATE TABLE matched AS (SELECT tvitdn_gh_plus_heft_tvitdn_results_postcodes_only.Result, tvitdn_gh_plus_heft_tvitdn_results_postcodes_only.Req_Postcode, nhs_postcode_lookup.UnitPostCodeVar, nhs_postcode_lookup.LSOA FROM geocoding.tvitdn_gh_plus_heft_tvitdn_results_postcodes_only tvitdn_gh_plus_heft_tvitdn_results_postcodes_only, geocoding.nhs_postcode_lookup nhs_postcode_lookup WHERE tvitdn_gh_plus_heft_tvitdn_results_postcodes_only.Req_Postcode = nhs_postcode_lookup.UnitPostCodeVar )

We will now have a database table of all records that have been able to be matched to LSOA code and we have georeferenced our data.

We can now link this data to other georeferenced datasets e.g. Indices of Multiple Deprivation and start to do analysis based on this parameter rather than postcodes.

Creating Maps.

The next step in producing something meaningful is to create maps of the data.

This is again based on the use of the LSOA data. The data can be obtained from the newly revamped data.gov.uk site at: http://data.gov.uk/dataset/lower_layer_super_output_area_lsoa_boundaries

(previously you had to send off for a CDROM of the data)

The shape file is:

“The Esri shapefile or simply a shapefile is a popular geospatial vector data format for geographic information systems software. It is developed and regulated by Esri as a (mostly) open specification for data interoperability among Esri and other software products.[1] Shapefiles spatially describe geometries: points, polylines, and polygons. These, for example, could represent water wells, rivers, and lakes, respectively. Each item may also have attributes that describe the items, such as the name or temperature”.

http://en.wikipedia.org/wiki/Shapefile

In effect these file can draw lines on your map around areas.

To be honest, this is where the process gets a little complicated!

You can use a database system such as PostgreSQL (http://www.postgresql.org) which “understands” geography as its possible to load gospatial extensions. This is extremely powerful and allows you to do a number of calculations like how many x in a particular geographical area.

I’ve had better success using a utility called ogr2ogr which is part of the GDAL set of libraries (http://www.gdal.org/index.html)

Doing it

I have a table created from the geography CD in MyMSQL database of the following format:

CREATE TABLE lsoa_geometry ( ID int(11) NOT NULL, LSOA04CD varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL, LSOA04NM varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL, ogc_geom geometry DEFAULT NULL, PRIMARY KEY (ID), KEY lsoa (LSOA04CD) ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

This allows linkage of the data with a subset of the geography data using an SQL query something similar to this with LSOA code and the link code.

SELECT 2010_analysis_set_LOSA.TVITD, 2010_analysis_set_LOSA.LSOA FROM geocoding.2010_analysis_set_LOSA 2010_analysis_set_LOSA, geocoding.lsoa_geometry lsoa_geometry WHERE 2010_analysis_set_LOSA.LSOA = lsoa_geometry.LSOA04CD

By doing this, this will speed up the map drawing process as you will only be drawing the LSOA areas where you have data.

We now have data, LSOA and geography parameters ready to plot onto a graph.

I have found it quicker to plot shapefile data rather than pulling data “live” from the database. The latter can be done though.

To produce a shape file you can use ogr2ogr

ogr2ogr outputfilename MYSQL:databasename,user=dbusername,password=password tablename

This will produce a shapefile with the geometry data and your laboratory data embedded in it.

Map Drawing

There are a number of map drawing packages on the market, both commercial and open source. I use the open source QGIS software http://www.qgis.org which is available for all platforms.

Once installed its also easy to use and extensible with a number of 3rd party plugins which allow further functionality.

This short video outlines the processes involved:

 

Key points are: 1. Make sure your coordinate reference system is set to do conversions on the fly 2. Make sure you use British National Grid as the CRS for the shapefile layer 3. Make sure you use the appropriate CRS for the mapping provider you use. 4. Use the Openlayers overview and openlayers plugins to get the physical maps whereby you can overlay your shape file data.

You should end up with something like this:

 

 

About the Author

Craig Webster

Craig Webster

Craig Webster

Consultant Clinical Scientist

Birmingham Heartlands Hospital

 

I'm clinical lead for Biochemistry and Blood Sciences and I am responsible for R&D in the department of Biochemistry and Immunology and I have developed a number of projects utilising LC/MSMS, including, “LCMSMS methods for catecholamine metabolites”, “measurement of drugs of abuse”, “25-OH Vitamin D”, “8-Isoprostane”, “Vitamins A and E” and “Prednisolone and other steroids”

I have a wide-ranging technical background with major interests in HPLC, mass spectrometry, capillary electrophoresis and information technology. I can get by in objective-c, ruby and html/css. You can see some of my work on github at https://github.com/cwebster/ and I have an app on the app store for calculating sigma values for clinical assays https://itunes.apple.com/gb/app/assay-specification-six-sigma/id672408581?mt=8 

I have supervised 6 students for the research component of their MSc and have gained grants for the study of glycated proteins in diabetes, hepcidin measurements in haemochromatosis, lab to lab communication and e-learning.

  • Waldron J, Webster C. Liquid chromatography-tandem mass spectrometry method for the measurement of serum mevalonic acid: a novel marker of hydroxymethylglutaryl coenzyme A reductase inhibition by statins. Ann Clin Biochem. 2011 May;48(Pt 3):223–232.
  • Marrington R, Johnston J, Knowles S, Webster C. Measurement of urinary metadrenaline and normetadrenaline by liquid chromatography tandem mass spectrometry for the diagnosis of phaeochromocytoma. Ann Clin Biochem. 2010 Sep.;47(Pt 5):467–475.
  • Sulaiman et al. Ethnic differences in umbilical cord blood vitamin D and parathyroid hormone – South Asians compared to Whites born in the UK. Journal of Maternal-Fetal and Neonatal Medicine (0) vol. 0 (0) pp. 1-3
  • Development of an ID-MS method for measurement of pentosidine, an advanced glycosylation end-product, ACB Focus 2008
  • Development of an acid hydrolysis method for the quantification of total urinary opiates by LC-MS/MS., ACB Focus 2008
  • Development of a Liquid Chromatography- Tandem Mass Spectrometry method for measuring Urinary Free Cortisol and comparison with two different cortisol immunoassays, ACB Focus 2008

Leave a comment

Please login to leave a comment.

Copyright heftpathology 2013, 2014, 2015, 2016, 2017, 2018

HTA licence number is 12366

Protection of Personal Information – Laboratory Medicine comply with the Trust Data Protection Policy and have procedures in place to allow the Directorate and it’s employees to comply with the Data Protection Act 1998 and associated best practice and guidance.

The Trust Laboratories at Heartlands Hospital, Good Hope Hospital and Solihull Hospital were awarded UKAS (United Kingdom Accreditation Service) accreditation to the internationally recognised ISO 15189 standard in May 2015. For a list of accredited tests and other information please visit the test database http://www.heftpathology.com/frontpage/test-database.html.
Tests not appearing on this scope are either under consideration or in the process of accreditation and so currently remain outside of our scope of accreditation. However, these tests have been validated to the same high standard as accredited tests and are performed by the same trained and competent staff.

For further information contact Louise Fallon, Quality Manager, 0121 424 1235

UKAS Logos