A quick end-to-end walkthrough of running a small dataset through each step.
 

Basics

First we import basics

  • This creates a settings.yaml file from a template in a new directory (~/.lemonpie)
  • Then it loads global variables needed for everything like ..

We also import fastai.imports for all other required external libs

from lemonpie.basics import *
from fastai.imports import *
DEVICE
device(type='cuda')

Setup dataset

DATA_STORE
'/home/vinod/.lemonpie/datasets'

Next, we download Synthea's 1,000-patient CSV dataset into our data store.

  • For more details about the small dataset we are downloading, see the Synthea website.
  • The resulting directory structure must be {DATA_STORE}/synthea/1K/raw_original
  • We already have a global variable called PATH_1K for convenience
PATH_1K # global variable
'/home/vinod/.lemonpie/datasets/synthea/1K'

So, we create the directory structure for PATH_1K in our DATA_STORE

Path.mkdir(Path(PATH_1K), parents=True, exist_ok=True)

Next, we download the data

synthea_url = 'https://storage.googleapis.com/synthea-public/synthea_sample_data_csv_apr2020.zip'

import requests
data_file = Path(f'{PATH_1K}/data.zip')

if not data_file.exists():
    print(f'Downloading from {synthea_url}')
    # Request the archive only when it is not already on disk
    data = requests.get(synthea_url)
    with open(data_file, 'wb') as f:
        f.write(data.content)
else:
    print('File exists so skipping download')
print('Done!')
Downloading from https://storage.googleapis.com/synthea-public/synthea_sample_data_csv_apr2020.zip
Done!
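
The download step above can be wrapped in a small helper that makes the request only when the file is missing. This is a minimal sketch, not part of lemonpie: `download_if_missing`, `_http_fetch` and the `fetch` parameter are hypothetical names, and the demo uses a fake fetcher and a placeholder URL so that it runs without network access.

```python
import tempfile
import urllib.request
from pathlib import Path


def _http_fetch(url: str) -> bytes:
    """Default fetcher (hypothetical helper): read the whole response body."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def download_if_missing(url: str, dest: Path, fetch=_http_fetch) -> bool:
    """Write `url` to `dest` unless `dest` exists. Returns True if downloaded."""
    dest = Path(dest)
    if dest.exists():
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(fetch(url))
    return True


# Demo with a fake fetcher, so no network access is needed.
calls = []


def fake_fetch(url: str) -> bytes:
    calls.append(url)
    return b"zip-bytes"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "synthea" / "1K" / "data.zip"
    first = download_if_missing("https://example.invalid/data.zip", target, fake_fetch)
    second = download_if_missing("https://example.invalid/data.zip", target, fake_fetch)
    saved = target.read_bytes()

print(first, second, len(calls))
```

The second call returns without fetching because the file is already on disk, which mirrors the "File exists so skipping download" branch.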

And unzip

from zipfile import ZipFile
with ZipFile(f'{PATH_1K}/data.zip', 'r') as zipObj:
    zipObj.extractall(PATH_1K)

The Synthea zip extracts into a csv directory, but the library requires it to be named raw_original, so we rename it.

os.listdir(PATH_1K)
['csv', 'data.zip']
os.rename(f'{PATH_1K}/csv', f'{PATH_1K}/raw_original')
os.listdir(PATH_1K)
['data.zip', 'raw_original']
os.listdir(f'{PATH_1K}/raw_original')
['patients.csv',
 'observations.csv',
 'allergies.csv',
 'payers.csv',
 'careplans.csv',
 'medications.csv',
 'devices.csv',
 'organizations.csv',
 'imaging_studies.csv',
 'procedures.csv',
 'payer_transitions.csv',
 'supplies.csv',
 'conditions.csv',
 'providers.csv',
 'encounters.csv',
 'immunizations.csv']
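
The unzip and rename steps above can be combined into one function that is safe to re-run. This is a sketch under assumptions: `extract_and_rename` is a hypothetical helper, not a lemonpie function, and the demo builds a tiny stand-in archive in a temporary directory instead of using the real Synthea download.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_and_rename(zip_path: Path, root: Path,
                       extracted_name: str = "csv",
                       target_name: str = "raw_original") -> Path:
    """Extract `zip_path` into `root`, then rename the extracted folder."""
    root = Path(root)
    target = root / target_name
    if target.exists():  # already done on a previous run
        return target
    with zipfile.ZipFile(zip_path, "r") as zf:
        zf.extractall(root)
    (root / extracted_name).rename(target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    archive = root / "data.zip"
    # Build a small stand-in archive with the same top-level "csv" folder.
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("csv/patients.csv", "Id\n1\n")
        zf.writestr("csv/conditions.csv", "PATIENT,CODE\n1,44054006\n")
    out = extract_and_rename(archive, root)
    again = extract_and_rename(archive, root)  # second call is a no-op
    names = sorted(p.name for p in out.iterdir())
    top_level = sorted(p.name for p in root.iterdir())
    same_target = out == again

print(top_level, names)
```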

Run pre-processing

  • Before we pre-process the dataset, we need to decide which conditions will be populated in the pre-processed patients.
  • When we train the models, the labels we train them on must be a subset (or the full set) of these pre-processed conditions.
  • An initial set of conditions is provided in the CONDITIONS dictionary, which was created when we imported basics and generated the initial settings file above.
CONDITIONS
{'diabetes': '44054006',
 'stroke': '230690007',
 'alzheimers': '26929004',
 'coronary_heart': '53741008',
 'lung_cancer': '254637007',
 'breast_cancer': '254837009',
 'rheumatoid_arthritis': '69896004',
 'epilepsy': '84757009'}

Next run preprocessing

from lemonpie.preprocessing.transform import *
preprocess_ehr_dataset(PATH_1K, today=pd.Timestamp.today(), conditions_dict=CONDITIONS, from_raw_data=True)
------------------- Splitting and cleaning raw dataset -------------------
Splits:: train: 0.6, valid: 0.2, test: 0.2
Split patients into:: Train: 702, Valid: 234, Test: 235 -- Total before split: 1171
Saved train data to /home/vinod/.lemonpie/datasets/synthea/1K/raw_split/train
Saved valid data to /home/vinod/.lemonpie/datasets/synthea/1K/raw_split/valid
Saved test data to /home/vinod/.lemonpie/datasets/synthea/1K/raw_split/test
Saved cleaned "train" data to /home/vinod/.lemonpie/datasets/synthea/1K/cleaned/train
Saved vocab code tables to /home/vinod/.lemonpie/datasets/synthea/1K/cleaned/train/codes
Saved cleaned "valid" data to /home/vinod/.lemonpie/datasets/synthea/1K/cleaned/valid
Saved cleaned "test" data to /home/vinod/.lemonpie/datasets/synthea/1K/cleaned/test
------------------- Creating vocab lists -------------------
Saved vocab lists to /home/vinod/.lemonpie/datasets/synthea/1K/processed
------------------- Creating patient lists -------------------
702 total patients completed, saved patient list to /home/vinod/.lemonpie/datasets/synthea/1K/processed/years_0_to_20/train
234 total patients completed, saved patient list to /home/vinod/.lemonpie/datasets/synthea/1K/processed/years_0_to_20/valid
235 total patients completed, saved patient list to /home/vinod/.lemonpie/datasets/synthea/1K/processed/years_0_to_20/test
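
The split sizes in the log above can be reproduced with a simple shuffle-and-cut over patient ids. This is only an illustration: `split_ids` is a hypothetical function, and the truncating rounding with the test set taking the remainder is an assumption that happens to match the logged counts. The library's own splitting logic may differ.

```python
import random


def split_ids(ids, train=0.6, valid=0.2, seed=42):
    """Shuffle ids and cut them into train/valid/test lists."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n_train = int(train * len(ids))  # fractions are truncated (assumption)
    n_valid = int(valid * len(ids))
    return (ids[:n_train],
            ids[n_train:n_train + n_valid],
            ids[n_train + n_valid:])  # test takes the remainder


train_ids, valid_ids, test_ids = split_ids(range(1171))
sizes = (len(train_ids), len(valid_ids), len(test_ids))
print(sizes)
```

Splitting by patient id rather than by record keeps all of a patient's records in the same split.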

By default, pre-processing generates patient data from 0 to 20 years of age. To change this, pass a different age span (in years or months) to this function.
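
To make the idea of an age span concrete, the sketch below keeps only events that fall inside a window of the patient's age. Everything here is hypothetical: `age_in_months` and `within_age_span` are illustrative names, the half-open window (start included, stop excluded) is an assumption, and none of this is claimed to be how the library filters records.

```python
from datetime import date


def age_in_months(birth: date, on: date) -> int:
    """Completed months between `birth` and `on`."""
    months = (on.year - birth.year) * 12 + (on.month - birth.month)
    if on.day < birth.day:
        months -= 1  # the current month is not yet complete
    return months


def within_age_span(birth: date, event: date,
                    start_years: int = 0, stop_years: int = 20) -> bool:
    """True if the event happened at an age in [start_years, stop_years)."""
    months = age_in_months(birth, event)
    return start_years * 12 <= months < stop_years * 12


birth = date(2000, 5, 15)
events = [date(1999, 12, 31), date(2010, 1, 1), date(2020, 5, 14), date(2020, 5, 15)]
kept = [e for e in events if within_age_span(birth, e)]
print(kept)
```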

Run models

Assemble everything needed for training and run the models

  • Before we run the models, we need to decide which labels we want to train the models on.
  • And these labels must be a subset of the conditions we used when pre-processing the dataset (as mentioned above).
  • Say we pick the following subset
labels = ['diabetes', 'stroke', 'alzheimers', 'coronary_heart', 'breast_cancer', 'epilepsy']
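
Because the labels must be a subset of the pre-processed conditions, it can help to check this before building the data object. A minimal sketch: `validate_labels` is a hypothetical helper, `conditions` is a copy of the CONDITIONS output shown earlier, and 'asthma' is an invented label used only to trigger the error.

```python
# Copied from the CONDITIONS output shown earlier (condition name -> code).
conditions = {
    'diabetes': '44054006',
    'stroke': '230690007',
    'alzheimers': '26929004',
    'coronary_heart': '53741008',
    'lung_cancer': '254637007',
    'breast_cancer': '254837009',
    'rheumatoid_arthritis': '69896004',
    'epilepsy': '84757009',
}


def validate_labels(labels, conditions):
    """Return the labels as a list, or raise if any was not pre-processed."""
    unknown = sorted(set(labels) - set(conditions))
    if unknown:
        raise ValueError(f"Labels not in pre-processed conditions: {unknown}")
    return list(labels)


labels = ['diabetes', 'stroke', 'alzheimers', 'coronary_heart', 'breast_cancer', 'epilepsy']
checked = validate_labels(labels, conditions)

try:
    validate_labels(['diabetes', 'asthma'], conditions)  # 'asthma' is not a key
    error = None
except ValueError as exc:
    error = str(exc)

print(checked)
print(error)
```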

Next, create the data object

  • This provides data management tools like data loaders etc.
from lemonpie.data import *
ehr_1K_data = EHRData(PATH_1K, labels)

Load vocabs and their dimensions

  • These were created in the pre-processing step above
from lemonpie.preprocessing.vocab import *
demograph_dims, rec_dims, demograph_dims_wd, rec_dims_wd = get_all_emb_dims(EhrVocabList.load(PATH_1K))

Get DataLoaders

train_dl, valid_dl, train_pos_wts, valid_pos_wts = ehr_1K_data.get_data()

Loss functions

from lemonpie.learn import *
train_loss_fn, valid_loss_fn = get_loss_fn(train_pos_wts), get_loss_fn(valid_pos_wts)
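
The positive weights passed to the loss functions compensate for how rare each condition is. How the library computes them is not shown here, so the sketch below assumes the common negatives-to-positives ratio and a weighted binary cross-entropy on logits (the same form as the `pos_weight` option of PyTorch's `BCEWithLogitsLoss`). `pos_weight` and `weighted_bce_with_logits` are hypothetical helpers written in plain Python, and the diabetes count (43 of 702 training patients) comes from the prevalence counts and split log in this walkthrough.

```python
import math


def pos_weight(n_pos: int, n_total: int) -> float:
    """Ratio of negative to positive examples for one label (assumed scheme)."""
    return (n_total - n_pos) / n_pos


def weighted_bce_with_logits(logit: float, target: int, pw: float) -> float:
    """Binary cross-entropy on a raw logit, with positives scaled by `pw`.

    Not numerically hardened for very large |logit|; fine for illustration.
    """
    log_p = -math.log1p(math.exp(-logit))     # log(sigmoid(logit))
    log_not_p = -math.log1p(math.exp(logit))  # log(1 - sigmoid(logit))
    return -(pw * target * log_p + (1 - target) * log_not_p)


pw_diabetes = pos_weight(43, 702)
loss_missed_positive = weighted_bce_with_logits(0.0, 1, pw_diabetes)
loss_negative = weighted_bce_with_logits(0.0, 0, pw_diabetes)
print(round(pw_diabetes, 3), round(loss_missed_positive, 3), round(loss_negative, 3))
```

With this weighting, an uncertain prediction on a positive patient costs about fifteen times as much as the same prediction on a negative patient, which keeps the rare class from being ignored.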

LSTM

from lemonpie.models import *
model = EHR_LSTM(demograph_dims, rec_dims, demograph_dims_wd, rec_dims_wd, len(labels)).to(DEVICE)

Optimizer

optimizer = torch.optim.Adagrad(model.parameters())

Then run fit

h = RunHistory(labels)
from lemonpie.metrics import *
%time h = fit(5, h, model, train_loss_fn, valid_loss_fn, optimizer, auroc_score, \
              train_dl, valid_dl, to_chkpt_path=MODEL_STORE, from_chkpt_path=None, verbosity=1)
epoch |     train loss |     train aurocs                  valid loss |     valid aurocs    
----------------------------------------------------------------------------------------------------
    0 |          4.441 | [0.631 0.591 0.621 0.495]              1.328 | [0.666 0.716 0.972 0.861]
    1 |          1.175 | [0.740 0.818 0.936 0.637]              1.045 | [0.672 0.739 0.967 0.852]
    2 |          1.115 | [0.744 0.850 0.883 0.717]              1.065 | [0.654 0.714 0.971 0.859]
    3 |          0.973 | [0.732 0.879 0.927 0.784]              1.033 | [0.675 0.712 0.967 0.853]
    4 |          0.829 | [0.792 0.905 0.971 0.744]              1.043 | [0.646 0.727 0.953 0.860]
Checkpointed to "/home/vinod/.lemonpie/models/checkpoint.tar"
CPU times: user 12.9 s, sys: 154 ms, total: 13 s
Wall time: 13 s
plot_fit_results(h, labels)

Run inference on the test set

test_dl, test_pos_wts = ehr_1K_data.get_test_data()
test_loss_fn = get_loss_fn(test_pos_wts)
h = predict(h, model, test_loss_fn, auroc_score, test_dl, chkpt_path=MODEL_STORE)
From "/home/vinod/.lemonpie/models/checkpoint.tar", loading model ...
test loss = 0.9992311596870422
test aurocs = [0.783869 0.898945 0.928675 0.847808 0.751073 0.575107]
h = summarize_prediction(h, labels)
Prediction Summary ...
                auroc_score  optimal_threshold     auroc_95_ci
diabetes           0.783869           0.575259   (0.693, 0.86)
stroke             0.898945           0.779456    (0.8, 0.969)
alzheimers         0.928675           0.834349  (0.878, 0.972)
coronary_heart     0.847808           0.651642  (0.733, 0.938)
breast_cancer      0.751073           0.313323  (0.532, 0.953)
epilepsy           0.575107           0.526724  (0.513, 0.632)
h.prediction_summary
                auroc_score  optimal_threshold     auroc_95_ci
diabetes           0.783869           0.575259   (0.693, 0.86)
stroke             0.898945           0.779456    (0.8, 0.969)
alzheimers         0.928675           0.834349  (0.878, 0.972)
coronary_heart     0.847808           0.651642  (0.733, 0.938)
breast_cancer      0.751073           0.313323  (0.532, 0.953)
epilepsy           0.575107           0.526724  (0.513, 0.632)
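
To show what the auroc_score and auroc_95_ci columns represent, here is a plain-Python sketch of AUROC as the fraction of positive/negative pairs ranked correctly, with a percentile bootstrap for the interval. The bootstrap is an assumption: the library's actual method for the confidence interval is not stated in this walkthrough. `auroc` and `bootstrap_ci` are hypothetical helpers, and the eight toy scores are invented for illustration.

```python
import random


def auroc(y_true, y_score):
    """Probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("Only one class present in y_true.")
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(y_true, y_score, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC (assumed method)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        if len(set(yt)) < 2:
            continue  # resample lacks a class, so AUROC is undefined
        stats.append(auroc(yt, [y_score[i] for i in idx]))
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]


y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.2, 0.6]
score = auroc(y_true, y_score)
ci_low, ci_high = bootstrap_ci(y_true, y_score)
print(score)  # 0.75
```

The `ValueError` branch reflects the definition: with no positives or no negatives there are no pairs to compare, so the score is undefined.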

Note that every label must have both positive and negative examples in each split (train, valid and test). Otherwise the AUROC score cannot be calculated, and the run fails with this error: ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.

To check for this, get the prevalence counts after creating the data object. The following example uses the data object we created above.

ehr_1K_data.load_splits()
ehr_1K_data.splits.get_label_counts(list(CONDITIONS.keys()))
                      train  valid  test  total
diabetes                 43     14    19     76
stroke                   30      7    11     48
alzheimers               12      7     6     25
coronary_heart           39     11    11     61
lung_cancer              12      0     2     14
breast_cancer            11      8     2     21
rheumatoid_arthritis      2      0     0      2
epilepsy                 15      5     2     22

In this small 1K dataset, 'lung_cancer' and 'rheumatoid_arthritis' have only a single class in some splits (for example, there are no lung_cancer patients in the validation set), as the prevalence counts above show. Training on them would cause the failure above when fit is run.

In large datasets this is very unlikely, but it's something to watch out for.
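
The check described above can be automated from the prevalence counts. A minimal sketch: the counts and split sizes are copied from the outputs in this walkthrough, and `single_class_labels` is a hypothetical helper rather than a lemonpie function.

```python
# Split sizes from the pre-processing log and counts from the prevalence table.
split_sizes = {"train": 702, "valid": 234, "test": 235}
label_counts = {
    "diabetes": {"train": 43, "valid": 14, "test": 19},
    "stroke": {"train": 30, "valid": 7, "test": 11},
    "alzheimers": {"train": 12, "valid": 7, "test": 6},
    "coronary_heart": {"train": 39, "valid": 11, "test": 11},
    "lung_cancer": {"train": 12, "valid": 0, "test": 2},
    "breast_cancer": {"train": 11, "valid": 8, "test": 2},
    "rheumatoid_arthritis": {"train": 2, "valid": 0, "test": 0},
    "epilepsy": {"train": 15, "valid": 5, "test": 2},
}


def single_class_labels(counts, sizes):
    """Map each problem label to the splits where only one class is present."""
    problems = {}
    for label, per_split in counts.items():
        bad = [split for split, n_pos in per_split.items()
               if n_pos == 0 or n_pos == sizes[split]]
        if bad:
            problems[label] = bad
    return problems


problems = single_class_labels(label_counts, split_sizes)
usable = [label for label in label_counts if label not in problems]
print(problems)
print(usable)
```

The usable labels are the same six that were chosen for training earlier.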

CNN

model = EHR_CNN(demograph_dims, rec_dims, demograph_dims_wd, rec_dims_wd, num_labels=len(labels)).to(DEVICE)
# Build a new optimizer for this model (the earlier one holds the LSTM's parameters)
optimizer = torch.optim.Adagrad(model.parameters())
h2 = RunHistory(labels)
h2 = fit(5, h2, model, train_loss_fn, valid_loss_fn, optimizer, auroc_score, \
              train_dl, valid_dl, to_chkpt_path=None, from_chkpt_path=None, verbosity=0.5)
epoch |     train loss |     train aurocs                  valid loss |     valid aurocs    
----------------------------------------------------------------------------------------------------
    0 |          1.506 | [0.556 0.585 0.715 0.545]              1.435 | [0.605 0.374 0.649 0.290]
    4 |          1.754 | [0.518 0.457 0.512 0.489]              1.435 | [0.605 0.374 0.649 0.290]