First we import basics.

- This creates a `settings.yaml` file from a template in a new directory (`~/.lemonpie`).
- Then it loads the global variables needed everywhere, such as:
  - `DEVICE`, set to the GPU if one exists, else the CPU
  - default paths for data (`DATA_STORE`), logs (`LOG_STORE`), models (`MODEL_STORE`) & experiments (`EXPERIMENT_STORE`)
  - and some other variables used in pre-processing

We also import `fastai.imports` for all other required external libs.
from lemonpie.basics import *
from fastai.imports import *
DEVICE
DATA_STORE
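If you want to inspect or tweak these defaults, you can peek at the generated settings file directly. This is just a sketch; it assumes the file lands at `~/.lemonpie/settings.yaml` as described above.

import yaml
from pathlib import Path

# Assumption: the settings file generated above lives at ~/.lemonpie/settings.yaml
settings_path = Path.home() / '.lemonpie' / 'settings.yaml'
if settings_path.exists():
    with open(settings_path) as f:
        settings = yaml.safe_load(f)
    print(list(settings.keys()))  # top-level keys, e.g. store paths and pre-processing defaults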
Next, we download Synthea's 1,000-patient CSV dataset into our data store.

- For more details about the small dataset we are downloading, read the details on the Synthea website.
- The resulting directory structure must be `{DATA_STORE}/synthea/1K/raw_original`.
- We already have a global variable called `PATH_1K` for convenience.
PATH_1K # global variable
So we create the directory structure for `PATH_1K` in our `DATA_STORE`.
Path.mkdir(Path(PATH_1K), parents=True, exist_ok=True)
Next, we download the data
import requests

synthea_url = 'https://storage.googleapis.com/synthea-public/synthea_sample_data_csv_apr2020.zip'
data_file = Path(f'{PATH_1K}/data.zip')

if not data_file.exists():
    print(f'Downloading from {synthea_url}')
    data = requests.get(synthea_url)   # only fetch when the file is not already there
    with open(data_file, 'wb') as f:
        f.write(data.content)
else:
    print('File exists so skipping download')
print('Done!')
And unzip
from zipfile import ZipFile
with ZipFile(f'{PATH_1K}/data.zip', 'r') as zipObj:
    zipObj.extractall(PATH_1K)
The Synthea zip extracts into a `csv` directory, but the library requires it to be named `raw_original`, so we just rename it:
os.listdir(PATH_1K)
os.rename(f'{PATH_1K}/csv', f'{PATH_1K}/raw_original')
os.listdir(PATH_1K)
os.listdir(f'{PATH_1K}/raw_original')
- Before we pre-process the dataset, we need to decide which conditions will be populated in the pre-processed patients.
- Then, when we train the models, the labels we train on will be a subset (or the full set) of these pre-processed conditions.
- An initial set of conditions is provided in the `CONDITIONS` dictionary that was created when we imported basics and generated the initial settings file above.
CONDITIONS
Next, run pre-processing.
from lemonpie.preprocessing.transform import *
preprocess_ehr_dataset(PATH_1K, today=pd.Timestamp.today(), conditions_dict=CONDITIONS, from_raw_data=True)
By default, pre-processing generates patient data from 0 to 20 years of age; this can be changed by passing a different age span (in years or months) to this function.
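If you need a different span, the call would look something like the sketch below. The keyword names (`age_start`, `age_stop`, `age_in_months`) are assumptions here, so check the `preprocess_ehr_dataset` signature for the actual parameter names.

# Hypothetical keyword names for the age span -- verify against preprocess_ehr_dataset's signature.
preprocess_ehr_dataset(
    PATH_1K,
    today=pd.Timestamp.today(),
    conditions_dict=CONDITIONS,
    from_raw_data=True,
    age_start=0, age_stop=10, age_in_months=False,  # assumed parameters; 0-10 years as an example
)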
- Before we run the models, we need to decide which labels we want to train the models on.
- And these labels must be a subset of the conditions we used when pre-processing the dataset (as mentioned above).
- Say we pick the following subset (see the quick check after the code below):
labels = ['diabetes', 'stroke', 'alzheimers', 'coronary_heart', 'breast_cancer', 'epilepsy']
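Since the training labels must come from the pre-processed conditions, a quick sanity check is worth running. This is a plain Python sketch, not a library call.

# Every chosen label must be one of the conditions used during pre-processing.
assert set(labels) <= set(CONDITIONS.keys()), 'labels must be a subset of the pre-processed conditions'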
Next, create the data object
- This provides data management tools such as data loaders.
from lemonpie.data import *
ehr_1K_data = EHRData(PATH_1K, labels)
Load vocabs and their dimensions
- These were created in the pre-processing step above
from lemonpie.preprocessing.vocab import *
demograph_dims, rec_dims, demograph_dims_wd, rec_dims_wd = get_all_emb_dims(EhrVocabList.load(PATH_1K))
Get DataLoaders
train_dl, valid_dl, train_pos_wts, valid_pos_wts = ehr_1K_data.get_data()
Loss functions
from lemonpie.learn import *
train_loss_fn, valid_loss_fn = get_loss_fn(train_pos_wts), get_loss_fn(valid_pos_wts)
from lemonpie.models import *
model = EHR_LSTM(demograph_dims, rec_dims, demograph_dims_wd, rec_dims_wd, len(labels)).to(DEVICE)
Optimizer
optimizer = torch.optim.Adagrad(model.parameters())
Then run fit
h = RunHistory(labels)
from lemonpie.metrics import *
%time h = fit(5, h, model, train_loss_fn, valid_loss_fn, optimizer, auroc_score, \
              train_dl, valid_dl, to_chkpt_path=MODEL_STORE, from_chkpt_path=None, verbosity=1)
plot_fit_results(h, labels)
Run inference on the test set
test_dl, test_pos_wts = ehr_1K_data.get_test_data()
test_loss_fn = get_loss_fn(test_pos_wts)
h = predict(h, model, test_loss_fn, auroc_score, test_dl, chkpt_path=MODEL_STORE)
h = summarize_prediction(h, labels)
h.prediction_summary
The way to find out whether any label ends up with only a single class in a split (which would make `fit` fail) is to get prevalence counts after creating the data object. The following example uses the data object we created above.
ehr_1K_data.load_splits()
ehr_1K_data.splits.get_label_counts(list(CONDITIONS.keys()))
In this small 1K dataset, 'lung_cancer' and 'rheumatoid_arthritis' have only a single class in some splits (e.g. no lung_cancer patients in the validation set), as seen in the prevalence counts above, and would result in the failure described above when `fit` is run.
In large datasets the chance of this is very low, but it's something to watch out for.
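For this 1K dataset, one simple guard (a plain Python sketch, using only the two condition names flagged above) is to drop those conditions before choosing training labels:

# Conditions that have only a single class in some split of this small dataset
single_class = {'lung_cancer', 'rheumatoid_arthritis'}
safe_labels = [c for c in CONDITIONS.keys() if c not in single_class]
safe_labels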
# Same workflow with the CNN model
model = EHR_CNN(demograph_dims, rec_dims, demograph_dims_wd, rec_dims_wd, num_labels=len(labels)).to(DEVICE)
optimizer = torch.optim.Adagrad(model.parameters())  # fresh optimizer for the new model's parameters
h2 = RunHistory(labels)
h2 = fit(5, h2, model, train_loss_fn, valid_loss_fn, optimizer, auroc_score, \
         train_dl, valid_dl, to_chkpt_path=None, from_chkpt_path=None, verbosity=0.5)