The MIMIC-III Electronic Health Record (EHR) database

In this article we will look at MIMIC-III, the largest publicly available Electronic Health Record (EHR) database for benchmarking machine learning algorithms.

Pranath Fernando


March 14, 2022

1 Introduction

In this article we will look at MIMIC-III, the largest publicly available Electronic Health Record (EHR) database for benchmarking machine learning algorithms. In particular, we will learn about the design of this relational database and the tools available to query it, extract data, and visualise descriptive analytics.

Understanding the schema and the International Classification of Diseases (ICD) coding is important for mapping research questions to data and for extracting the key clinical outcomes needed to develop clinically useful machine learning algorithms.

2 Data and EHR (Electronic Health Records) in Healthcare

Enabling a digital system of Electronic Health Records provides unique opportunities for advancing clinical decision-making systems. However, it also poses key challenges. In this section, we discuss the main dimensions of data in healthcare, including volume, variety, time resolution and quality. We then discuss how clinical decision making depends on a pathway from descriptive analytics to predictive analytics and finally to prescriptive analytics.

Currently, traditional healthcare models rely on disconnected systems and multiple sources of information. The new digital healthcare model will transition towards an inherent capability to ensure seamless information exchange across systems. This enables data mining and machine learning approaches to be applied successfully and to advance our knowledge of clinical decision-making systems.

Electronic Health Records are massively heterogeneous. They include medical images, lab tests, natural-language diagnoses from doctors, medication events and hospitalizations. Often these records are unstructured and require linkage between different sources. Healthcare records are also longitudinal: a single patient's data are spread over multiple Electronic Health Records with diverse representations over time.

A fundamental principle in medical systems is that clinical data cannot be overwritten. This is an important principle when we design databases for information retrieval. When any of these data are modified during further treatment or a subsequent hospitalization, we need to create a new extract with the new data and store it as well. A connection should then be created to link this new information with the rest of the information available for the patient.

Secondary research use of healthcare data commonly covers healthcare quality evaluation, clinical and epidemiological studies, and service management. In many cases the research focuses on a particular group of patients who satisfy distinct search criteria.

To understand how to extract value from big data in healthcare, we need to understand its dimensions. The main characteristics of big data are volume, velocity, variety, veracity and value. Big healthcare data is really big. In 2013 it was estimated that the healthcare data produced globally amounted to 153 billion gigabytes, which is equal to 153 exabytes. Projected to 2020, this number rises to 2,314 exabytes, which implies that the volume roughly doubles every two years. Velocity describes how quickly the data are being created, saved, or moved.
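As a quick check on the volume estimates quoted above (153 exabytes in 2013 and 2,314 exabytes projected for 2020), the following sketch works out the growth rate that these two figures imply. It assumes smooth compound growth between the two years, which is a simplification; the variable names are chosen purely for illustration.

```python
import math

# Figures quoted in the text (exabytes of healthcare data worldwide).
volume_2013_eb = 153.0
volume_2020_eb = 2314.0
years = 2020 - 2013

# Compound annual growth factor implied by the two estimates
# (assumes the same growth rate every year).
growth_factor = (volume_2020_eb / volume_2013_eb) ** (1 / years)
annual_growth_pct = (growth_factor - 1) * 100

# Time needed for the volume to double at that rate.
doubling_time_years = math.log(2) / math.log(growth_factor)

# For comparison: the 2020 volume if the data doubled every single year.
yearly_doubling_2020_eb = volume_2013_eb * 2 ** years

print(f"Implied annual growth: {annual_growth_pct:.1f}%")
print(f"Implied doubling time: {doubling_time_years:.2f} years")
print(f"Strict yearly doubling would give: {yearly_doubling_2020_eb:.0f} EB")
```

Under this assumption the two estimates correspond to growth of a little under 50% per year, so the volume doubles in somewhat less than two years rather than every year.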

The value of the data reflects whether we can use them to form and test useful hypotheses. It also matters whether the data allow us to predict future events so that we can intervene early. Viability is a dimension related to value; it reflects whether the data are relevant to the use case: do they include all the information needed to investigate specific questions? Metadata is data about data. It might record a file’s origin, date, time, and format, and it may also include notes or comments. In healthcare, metadata is important for verifying the veracity, and in effect the value, of the data.

We can conceptualize healthcare information retrieval processes as a pathway from descriptive analytics to diagnostic analytics, predictive analytics, and prescriptive analytics. Descriptive analytics use techniques such as data aggregation, data mining, and intuitive visualizations to provide an understanding of historic data. In essence, they retrieve information. Common examples of descriptive analytics are reports that answer questions such as: How many patients were admitted to a hospital last year? How many patients died within 30 days? How many patients caught an infection? In other words, descriptive analytics offer intuitive ways to summarize the data via histograms and graphs and to show the properties of the data distribution.

In most cases, achieving substantial insight for health delivery optimization and cost savings requires dataset linking. In other words, it is desirable to link different sources of data. In its simplest form, this means linking the information related to a patient across all the different departments in a hospital. A limitation of descriptive analytics is that it has limited ability to guide decisions, because it is based on a snapshot of the past. Although this is useful, it is not always indicative of the future. Diagnostic analytics is a form of analytics that examines data to answer the question of why something happened.
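The descriptive questions mentioned above (how many patients were admitted in a given year, how many died within 30 days) can be sketched in a few lines of Python. The records below are hypothetical toy data, not taken from MIMIC-III, and the field names `patient_id`, `admitted` and `died` are assumptions made for this illustration.

```python
from datetime import date

# Hypothetical toy admission records (NOT real MIMIC-III data).
# Each record: patient id, admission date, date of death (None if alive).
admissions = [
    {"patient_id": 1, "admitted": date(2021, 1, 10), "died": None},
    {"patient_id": 2, "admitted": date(2021, 3, 5), "died": date(2021, 3, 20)},
    {"patient_id": 3, "admitted": date(2021, 7, 1), "died": date(2021, 9, 15)},
    {"patient_id": 4, "admitted": date(2020, 11, 30), "died": None},
]


def admitted_in_year(records, year):
    """Count admissions whose admission date falls in the given year."""
    return sum(1 for r in records if r["admitted"].year == year)


def died_within_days(records, days):
    """Count admissions followed by death within `days` of admission."""
    return sum(
        1
        for r in records
        if r["died"] is not None
        and 0 <= (r["died"] - r["admitted"]).days <= days
    )


n_admitted_2021 = admitted_in_year(admissions, 2021)
n_died_30 = died_within_days(admissions, 30)

print(f"Admitted in 2021: {n_admitted_2021}")
print(f"Died within 30 days of admission: {n_died_30}")
```

Both functions are plain aggregations over historic records, which is the sense in which descriptive analytics summarizes the past without saying anything about the future.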

Diagnostic analytics could comprise correlation techniques that discover links between clinical variables, treatments, and drugs. Predictive analytics allow us to predict the outcome and likelihood of an event. We may, for example, want to predict the mortality risk of a patient, the length of hospitalization, or the risk of infection. Predictive analytics exploit historic values of the data with the aim of providing useful information about critical events in the future. Predictive analytics are in demand because healthcare providers would like evidence-based ways to predict and avoid adverse events.
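To make the correlation idea behind diagnostic analytics concrete, here is a minimal sketch that computes the Pearson correlation coefficient between two clinical variables. The values for `age_years` and `length_of_stay_days` are invented for illustration, and Pearson correlation is only one of several techniques that could be used here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for five patients (NOT real clinical data).
age_years = [40, 50, 60, 70, 80]
length_of_stay_days = [2, 4, 5, 4, 5]

r = pearson(age_years, length_of_stay_days)
print(f"Correlation between age and length of stay: {r:.2f}")
```

A coefficient close to 1 or -1 indicates a strong linear association, but on its own it does not show that one variable causes the other.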

In this way, they can reduce costs as well as avoid failures to control chronic diseases. Importantly, predictive analytics enable early intervention, which can save patients’ lives and improve their quality of life. Prescriptive analytics aim to make decisions for optimal outcomes. In other words, they use all the available information to reach an optimal decision about what action should be taken. Predictive analytics help us to understand the impact of an intervention in clinical care and to confirm whether the system is useful. Prescriptive analytics predict not only what will happen but also why it will happen. In other words, prescriptive analytics are important for turning a prediction model into a decision-making model.

The availability of big data provides several opportunities, but it also poses important challenges. The first is interoperability. With such a diverse healthcare system, which includes heterogeneous data sources and users such as healthcare providers, clinicians, government organizations, wearable technologies and so on, it is particularly challenging to maintain the high level of interoperability necessary for timely information sharing.

The problem becomes even worse because of the lack of standards in the healthcare industry. Interoperability designs should also take patient safety and privacy into consideration. Lack of interoperability could, for example, result in medical errors and endanger patient safety. For patient safety it is also important to be able to access information quickly. The conflicting needs to share patient information in real time upon appropriate request, while also keeping private patient information secure, make management of the healthcare industry especially complex. Another challenge of big data in healthcare is that the data change quickly.

Therefore it is important to know how long the data remain relevant and which historic values to include in the analysis. Vulnerability refers to the fact that we need to keep the data secure, which can involve both IT infrastructure and regular training procedures. Last but not least, the growth of the data and the lack of expertise are difficult to ignore.

Summarizing, big data in healthcare presents unique opportunities and challenges. Healthcare data is a valuable asset and is characterized by the volume, variety, velocity, veracity and value of the dataset. Clinical decision support systems exploit the information in these data via a pathway from descriptive to predictive and prescriptive analytics.

3 EHR System in the UK and USA

The US and the UK healthcare systems are known to be run very differently. The UK has the largest public-sector system and invests much less in its healthcare system. The USA, on the other hand, has the largest private-sector system and one of the largest healthcare expenditures in the world. It is interesting to compare the adoption of electronic health record systems in these two countries in order to understand the challenges.

Both the USA and the UK have succeeded in adopting electronic health records in their systems. The UK followed a top-down approach. The difficulty was that clinicians are not used to having technology decisions dictated to them. The USA, on the other hand, followed a bottom-up approach. This approach was successfully adopted by individual office-based physicians, but it was more difficult to ensure interoperability between larger facilities and hospitals. Overall, we shouldn’t underestimate the complexity of the healthcare system. In order to fully explore the potential of electronic health records, we need to sustain interoperability, security, and the privacy of patients’ information. We also need to take into account the possible usage and value of the information.

4 The MIMIC Critical Care Dataset

The MIMIC-III database links hospital data with data from patients in the intensive care unit. The database is well maintained and includes lab tests, medical diagnoses, vital signs, and medications. Researchers at the Laboratory for Computational Physiology at MIT recognized the need to generate new knowledge from existing data. Large volumes of data were captured daily during care delivery in the intensive care unit, but none of it was used for further exploration. The motivation was to provide a freely accessible, deidentified critical care dataset under a data use agreement. The dataset is available for academic and industrial research as well as for higher education. It is not only large but also spans a period of over a decade.
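The idea of linking hospital admissions with intensive care unit stays in a relational database can be sketched with Python's built-in `sqlite3` module. The two tables below are simplified toy versions whose names and columns are only loosely modeled on this kind of schema; treat them, and all the values, as assumptions for illustration rather than the actual MIMIC-III layout.

```python
import sqlite3

# In-memory database with two simplified, hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE admissions (
        hadm_id        INTEGER PRIMARY KEY,  -- hospital admission id
        subject_id     INTEGER,              -- patient id
        admission_type TEXT
    );
    CREATE TABLE icustays (
        icustay_id INTEGER PRIMARY KEY,      -- ICU stay id
        hadm_id    INTEGER REFERENCES admissions(hadm_id),
        los_days   REAL                      -- ICU length of stay in days
    );
    INSERT INTO admissions VALUES
        (100, 1, 'EMERGENCY'),
        (101, 2, 'ELECTIVE'),
        (102, 1, 'EMERGENCY');
    INSERT INTO icustays VALUES
        (1000, 100, 2.5),
        (1001, 100, 1.0),
        (1002, 101, 4.0);
    """
)

# Link each hospital admission to its ICU stays. The LEFT JOIN keeps
# admissions that never involved the ICU (here, admission 102).
rows = conn.execute(
    """
    SELECT a.subject_id,
           a.hadm_id,
           COUNT(i.icustay_id)          AS n_icu_stays,
           COALESCE(SUM(i.los_days), 0) AS total_icu_days
    FROM admissions AS a
    LEFT JOIN icustays AS i ON i.hadm_id = a.hadm_id
    GROUP BY a.subject_id, a.hadm_id
    ORDER BY a.hadm_id
    """
).fetchall()

for subject_id, hadm_id, n_icu_stays, total_icu_days in rows:
    print(subject_id, hadm_id, n_icu_stays, total_icu_days)

conn.close()
```

The shared admission identifier is what makes the link possible: one patient can have several admissions, and one admission can have several ICU stays.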

This hospital data is one of the best examples of the systematic gathering of clinical information. It is a valuable, high-quality dataset that highlights the opportunities in machine learning. Its realistic setting also reveals the challenges in processing electronic health records.

Back in 1992, there was an effort to collect multi-parameter recordings of intensive care unit patients. This created the MIMIC project, which is a collection of clinical data. MIMIC-II (Multiparameter Intelligent Monitoring in Intensive Care) was the largest database of its kind, containing physiological signals and vital-sign time series captured from patient monitors. Along with these data, there were also clinical data obtained from the hospital medical information system. Data were collected from intensive care units between 2001 and 2008. These included the medical, surgical, coronary care, and neonatal care units. With further data updates and the addition of a new monitoring system, MIMIC-II evolved into MIMIC-III, which was published in 2016.

The MIMIC project continues to have huge success. This is evident from the number of citations it has received over time. Starting in 2002 with the first release of MIMIC-II, followed in 2009 by an updated version and finally by MIMIC-III in 2016, citations have grown exponentially. MIMIC-III has had an impact in several disciplines beyond medicine and has attracted citations across the sciences. The availability of data from more than 40,000 patients had an impact in computer science and machine learning. We can also measure the influence of the database in other fields such as mathematics, engineering and physics. It has also received a large amount of attention in medical research, with articles in critical care medicine, cardiology, gerontology, pathology, neuroscience, and infectious diseases.

Not only is MIMIC impactful, but so are the papers that use it. MIMIC allowed research into deep learning models that wasn’t possible before. Sophisticated models can be developed, trained, and validated with MIMIC. Furthermore, it enables research into clinical decision support systems. The database has also shaped research in big data analytics in healthcare. The MIMIC project is also a model that other clinical databases can follow in order to deidentify free text as well as other clinical information.

Summarizing, MIMIC-III is a big dataset of healthcare data that includes both hospital data and intensive care unit data. The data have been carefully deidentified, and they can be used to facilitate the reproducibility of clinical studies and to develop new algorithms and new technologies. MIMIC-III is the first publicly available dataset of its kind.