EHR Preprocessing
A post about preprocessing EHR data for Deep Learning.
About
EHR
- The most popular public datasets are MIMIC and, more recently, eICU
FHIR
- What is it
- I chose this because it is a canonical format
Synthea
- What is it + how-tos
- I used datasets of varying sizes, all for Massachusetts (MA) for now
- Data dictionary
Labels
- Most models predict in-patient mortality, readmission, prolonged length of stay, etc., because they are trained on hospitalization or ICU datasets
- I chose to predict chronic conditions instead, for two reasons
- Synthea simulates all standard events over a patient's lifetime, not just hospitalizations
- Chronic conditions account for the majority of US healthcare costs
Cleaning
- First split by patient ids
- So as to isolate other records based on patient ids
Other cleaning
- Mostly standard stuff, standardizing column names etc.
- Dropped Encounters
- Code Tables
- To create vocabularies
- Identifying START and STOP
- Observations is a little more complex than the rest
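The cleaning steps above (standardize column names, split by patient ids, then isolate each table's records by those ids) could look roughly like the sketch below. This is only an illustration, not the code from the post: the helper names `standardize_columns`, `split_patient_ids` and `isolate_records`, the train/valid/test proportions, and the toy data are all assumptions. The `Id` and `PATIENT` column names are assumed to follow Synthea's CSV export.

```python
import random

import pandas as pd


def standardize_columns(df):
    """Return a copy with lower-case, snake_case column names."""
    out = df.copy()
    out.columns = [c.strip().lower().replace(" ", "_") for c in out.columns]
    return out


def split_patient_ids(patient_ids, valid_pct=0.2, test_pct=0.2, seed=42):
    """Shuffle unique patient ids and split them into train/valid/test sets.

    The three-way split and the percentages are assumptions for illustration.
    """
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    n_test = int(len(ids) * test_pct)
    n_valid = int(len(ids) * valid_pct)
    test = set(ids[:n_test])
    valid = set(ids[n_test:n_test + n_valid])
    train = set(ids[n_test + n_valid:])
    return train, valid, test


def isolate_records(df, ids, patient_col="patient"):
    """Keep only the rows of a record table that belong to the given patients."""
    return df[df[patient_col].isin(ids)].reset_index(drop=True)


# Toy stand-ins for two of the tables (not real Synthea output).
patients = standardize_columns(pd.DataFrame({"Id": [f"p{i}" for i in range(10)]}))
conditions = standardize_columns(
    pd.DataFrame(
        {
            "PATIENT": [f"p{i % 10}" for i in range(30)],
            "CODE": [1000 + i for i in range(30)],
        }
    )
)

train_ids, valid_ids, test_ids = split_patient_ids(patients["id"])
train_conditions = isolate_records(conditions, train_ids)
```

Splitting on patient ids first, rather than on rows, keeps all of one patient's records in the same split.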
Inserting Age
- Years and months for now (given the nature of the Synthea dataset)
- Hours or days as age are possible for more granular data (e.g. ICU or hospitalization), where we are trying to predict outcomes within, say, 24 or 48 hours after admission
- Also as we'll see in upcoming posts, age allows some flexibility
- Initially everything started at age 0
- With a little change, I am now able to get any arbitrary age span - say month 24 to month 104 or 20 to 40 years
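As a rough sketch of the idea above, age in completed months (and years) can be attached to each record by comparing the record date with the patient's birth date, and an arbitrary age span can then be selected with a simple filter. The function names `add_age` and `age_span`, the column names (`patient`, `start`, `birthdate`, `age_months`, `age_years`) and the inclusive span bounds are assumptions made for this example, not the post's actual implementation.

```python
import pandas as pd


def add_age(events, patients, date_col="start"):
    """Add age at the time of each event, in completed months and years."""
    births = patients[["id", "birthdate"]].rename(columns={"id": "patient"})
    # A left merge keeps the order of events; each patient must appear once.
    merged = events.merge(births, on="patient", how="left", validate="many_to_one")
    event_date = pd.to_datetime(merged[date_col])
    birth = pd.to_datetime(merged["birthdate"])
    months = (event_date.dt.year - birth.dt.year) * 12 + (
        event_date.dt.month - birth.dt.month
    )
    # Subtract one month when the day of the month has not been reached yet.
    months = months - (event_date.dt.day < birth.dt.day).astype(int)
    out = events.copy()
    out["age_months"] = months.to_numpy()
    out["age_years"] = out["age_months"] // 12
    return out


def age_span(df, start, stop, col="age_months"):
    """Keep records whose age lies in [start, stop], both ends inclusive."""
    return df[df[col].between(start, stop)].reset_index(drop=True)


# Toy data (not real Synthea output).
patients = pd.DataFrame({"id": ["p1"], "birthdate": ["2000-01-15"]})
events = pd.DataFrame(
    {
        "patient": ["p1", "p1", "p1", "p1"],
        "start": ["2002-01-15", "2002-01-14", "2008-09-20", "2010-03-01"],
    }
)

with_age = add_age(events, patients)
span = age_span(with_age, 24, 104)  # month 24 to month 104
```

For a span in years, the same filter can be applied to the `age_years` column, e.g. `age_span(with_age, 20, 40, col="age_years")`.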
Extracting Labels
- Extract them from the conditions df and store them in the patients df for ease of use later
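One possible way to do this with pandas is sketched below: for each label, find the patients with a matching condition and add a boolean column to the patients df. The label names, the search terms, the function name `extract_labels` and the column names (`id`, `patient`, `description`) are hypothetical choices for this example. Matching on a description substring is also an assumption, and matching on condition codes would work the same way.

```python
import pandas as pd

# Hypothetical labels: column name -> substring to look for in descriptions.
LABEL_TERMS = {"diabetes": "diabetes", "hypertension": "hypertension"}


def extract_labels(patients, conditions, label_terms):
    """Add one boolean column per label to a copy of the patients df."""
    out = patients.copy()
    descriptions = conditions["description"].str.lower()
    for label, term in label_terms.items():
        matches = descriptions.str.contains(term.lower(), regex=False, na=False)
        positive_ids = conditions.loc[matches, "patient"].unique()
        out[label] = out["id"].isin(positive_ids)
    return out


# Toy data (not real Synthea output).
patients = pd.DataFrame({"id": ["p1", "p2", "p3"]})
conditions = pd.DataFrame(
    {
        "patient": ["p1", "p2", "p1", "p3"],
        "description": [
            "Diabetes",
            "Hypertension",
            "Hypertension",
            "Acute bronchitis (disorder)",
        ],
    }
)

labeled = extract_labels(patients, conditions, LABEL_TERMS)
```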
Creating Vocab
- A note about EmbeddingBag and Embedding
- The difference
- The idea of representing a time period with EmbeddingBag (as described in the Google paper)
- Implementation - Vocab classes
- EhrVocab class
- ObsVocab class - Observations vocab is special
- Demographics vocab is different
- Vocablist class
- Tried to use fastai Vocab, but this required quite a bit of customization, so wrote another one on similar lines
- Embedding matrix dimensions: a convenience function to get the dimensions, since these are needed when creating the models
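To make the note about EmbeddingBag and Embedding concrete, here is a minimal PyTorch sketch. `nn.Embedding` returns one vector per code, while `nn.EmbeddingBag` pools all the codes in a bag into a single vector, which is one way to represent everything recorded in a time period. The vocabulary size, embedding dimension, code indices and the choice of `mode="mean"` are arbitrary assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, emb_dim = 10, 4  # arbitrary sizes for the example

emb = nn.Embedding(vocab_size, emb_dim)
bag = nn.EmbeddingBag(vocab_size, emb_dim, mode="mean")

# Share the weights so that the two outputs can be compared directly.
with torch.no_grad():
    bag.weight.copy_(emb.weight)

# Two time periods: the first has codes [1, 2, 3], the second has [7, 8].
# EmbeddingBag takes the codes flattened, plus the start offset of each bag.
codes = torch.tensor([1, 2, 3, 7, 8])
offsets = torch.tensor([0, 3])

per_code = emb(codes)             # one vector per code: shape (5, 4)
per_period = bag(codes, offsets)  # one vector per period: shape (2, 4)

# The bag output equals the mean of the individual code embeddings.
manual = torch.stack([per_code[:3].mean(dim=0), per_code[3:].mean(dim=0)])
```

Because periods can hold different numbers of codes, the flattened codes plus offsets layout avoids padding each period to a fixed length.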