Setup GPU, default paths & global variables.

Every file in the library imports this module, so any global setup that is required everywhere can be added here.

  1. Sets the device to the GPU if one is available.
  2. Defines default paths for the different stores, so that they are out of version control by default.
  3. Defines global scope variables, for convenience in other modules.

GPU

get_device

get_device()

Checks to see if GPU is available and sets device to GPU or CPU

import torch

def get_device():
    '''Checks to see if GPU is available and sets device to GPU or CPU'''
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        assert torch.backends.cudnn.enabled  # cuDNN is expected whenever CUDA is used
        torch.backends.cudnn.benchmark = True  # Enable cuDNN auto-tuner - perf benefit for convs
        device = torch.device("cuda")
    else:
        device = torch.device("cpu")
    return device

Settings File

A YAML file called settings.yaml is created (from a template) the first time the library is used.

settings_template

settings_template()

Create initial settings for library

read_settings

read_settings()

Read the settings file at "~/.lemonpie/settings.yaml"; if it doesn't exist, create it from the template
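
The create-on-first-use behaviour described above can be sketched roughly as follows. This is only an illustration and not the library's actual implementation: `ensure_settings_file`, `TEMPLATE_TEXT` and the template contents are hypothetical, and parsing the YAML (for example with PyYAML) is left out so that the sketch depends only on the standard library.

```python
from pathlib import Path

# Hypothetical template contents, for illustration only.
TEMPLATE_TEXT = "STORES:\n  DATA_STORE: ~/.lemonpie/datasets\n"

def ensure_settings_file(path, template_text=TEMPLATE_TEXT):
    """Return the settings text, creating the file from the template if it is missing."""
    path = Path(path).expanduser()
    if not path.exists():
        print("No settings file found, so creating from template ..")
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(template_text)
    return path.read_text()

# Example call (creates the file on first use):
# text = ensure_settings_file("~/.lemonpie/settings.yaml")
```

On later calls the file already exists, so the template is not written again and any edits made to the file are preserved.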

Global Scope Variables

DEVICE = get_device()
settings = read_settings()

DATA_STORE         = settings.STORES.DATA_STORE
LOG_STORE          = settings.STORES.LOG_STORE
MODEL_STORE        = settings.STORES.MODEL_STORE
EXPERIMENT_STORE   = settings.STORES.EXPERIMENT_STORE

PATH_1K   = f'{DATA_STORE}/synthea/1K'
PATH_10K  = f'{DATA_STORE}/synthea/10K'
PATH_20K  = f'{DATA_STORE}/synthea/20K'
PATH_100K = f'{DATA_STORE}/synthea/100K'

FILENAMES = settings.FILENAMES

SYNTHEA_DATAGEN_DATES = settings.SYNTHEA_DATAGEN_DATES

CONDITIONS = settings.CONDITIONS

LOG_NUMERICALIZE_EXCEP = settings.LOG_NUMERICALIZE_EXCEP
No settings file found, so creating from template ..

These are global variables with default values, used for convenience in many places in the library. They can be overridden by passing in non-default values where needed.

CONDITIONS
{'diabetes': '44054006',
 'stroke': '230690007',
 'alzheimers': '26929004',
 'coronary_heart': '53741008',
 'lung_cancer': '254637007',
 'breast_cancer': '254837009',
 'rheumatoid_arthritis': '69896004',
 'epilepsy': '84757009'}
  • These conditions defined in the CONDITIONS dictionary are used during pre-processing to identify & label patients that have these conditions
  • After pre-processing, a subset of these (some or all of them) are used as labels to train the deep learning models
  • Thus, to train on a different set of labels / conditions:
    • First pre-process the dataset using the new conditions
    • And then proceed to training the models
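
As a sketch of the second bullet above, a subset of the pre-processed conditions could be picked as training labels like this. The `select_labels` helper is hypothetical and not part of the library; the three dictionary entries are copied from the CONDITIONS output above.

```python
# Three entries copied from the CONDITIONS dictionary shown above.
CONDITIONS = {
    "diabetes": "44054006",
    "stroke": "230690007",
    "epilepsy": "84757009",
}

def select_labels(conditions, names):
    """Return the requested subset, refusing names that were never pre-processed."""
    missing = [name for name in names if name not in conditions]
    if missing:
        raise KeyError(f"Not in the pre-processed conditions: {missing}")
    return {name: conditions[name] for name in names}

labels = select_labels(CONDITIONS, ["diabetes", "stroke"])
print(labels)
```

Asking for a condition that is not in the dictionary raises an error, which mirrors the rule that new conditions must go through pre-processing first.
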
FILENAMES
['patients',
 'observations',
 'allergies',
 'careplans',
 'medications',
 'imaging_studies',
 'procedures',
 'conditions',
 'immunizations']

FILENAMES is the list of files in the dataset that this library currently runs pre-processing on.

The following 2 global variables need to be changed in the ~/.lemonpie/settings.yaml file based on your specific needs

SYNTHEA_DATAGEN_DATES
{'1K': '04-01-2021',
 '10K': '04-01-2021',
 '20K': '04-01-2021',
 '100K': '04-01-2021',
 '250K': '04-01-2021'}
  • A few sample entries are provided to serve as examples and all dates are set to the first time the library was run.
  • Please update these based on when you generate a particular dataset.
  • These dates are important to calculate patient age.
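
To illustrate why these dates matter, here is a minimal sketch of an age calculation relative to the generation date. It assumes the generation dates are in MM-DD-YYYY format and that birth dates are ISO strings (YYYY-MM-DD); `age_at_datagen` is a hypothetical helper, not the library's own function.

```python
from datetime import date, datetime

def age_at_datagen(birthdate, datagen_date):
    """Age in whole years on the day the dataset was generated."""
    born = date.fromisoformat(birthdate)                       # assumed YYYY-MM-DD
    gen = datetime.strptime(datagen_date, "%m-%d-%Y").date()   # assumed MM-DD-YYYY
    had_birthday = (gen.month, gen.day) >= (born.month, born.day)
    return gen.year - born.year - (0 if had_birthday else 1)

print(age_at_datagen("1980-04-02", "04-01-2021"))
```

If the recorded generation date is wrong, every age computed this way shifts by the same amount, which is why the entries should match the actual run dates.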

Change - Default STORE Paths

DATA_STORE, MODEL_STORE, EXPERIMENT_STORE, LOG_STORE
('/home/vinod/.lemonpie/datasets',
 '/home/vinod/.lemonpie/models',
 '/home/vinod/.lemonpie/experiments',
 '/home/vinod/.lemonpie/logs')

If desired, change these paths to defaults that suit your specific configuration.

  • All of these artifacts need to be in some form of failsafe storage, but not all need to be in version control.
  • Also, some of them are likely to get big and version control might not be the ideal location (e.g. data, logs and models).
    • Experiments, on the other hand, as designed here, tend to be small enough to be stored in GitHub or some other version control system (VCS).
    • Each Experiment will keep track of the model it runs and save it separately in the model store.
    • Given the nature of the dataset in this release of the library (synthetic / Synthea), it can be easily re-generated in case of a loss.

So it is left to the user to decide which store needs to be where; depending on your decision, change the default paths here.
The recommendation is to store experiments in some VCS and data & models in some type of failsafe storage; logs are used minimally and are not that important (at least in this release).

Setup Synthea

Set up Synthea so you can generate different types of synthetic EHR data per your need.
Synthea - Wiki has details about the project and how to get started and generate the data.

Here are condensed instructions for a basic setup of Synthea to get you up and running quickly. Synthea also offers a developer setup; instructions for it are on the same webpage.

Download Synthea

  • Download the binary (from the basic setup link above) to a local directory
    • Don't run it yet
  • Create a file called synthea.properties in the same directory, add the following lines to it, and save it
    exporter.years_of_history = 0
    exporter.fhir.export = false
    exporter.fhir.transaction_bundle = false
    exporter.hospital.fhir.export = false
    exporter.practitioner.fhir.export = false
    exporter.csv.export = true

Generate Data

  • Once Synthea is set up, the following script will generate the data.
  • It's important to record the run date (the data generation date) each time you generate a new dataset with Synthea, as mentioned above; it is needed during preprocessing.
    • Basic setup run command is: java -jar synthea-with-dependencies.jar
    • Developer setup run command is: ./run_synthea
  • Run with the -p switch to control population of patients generated as shown in examples below.

For example, to generate 10,000 patients ..

java -jar synthea-with-dependencies.jar -c synthea.properties -s 12345 -p 10000

  • run date: 03/16/2021
  • Records: total=11833, alive=10000, dead=1833
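
If you generate several datasets, a small wrapper can keep the command and the run date together. The sketch below only builds the basic-setup command shown above and a date string; `synthea_command` and `run_date_string` are hypothetical helper names, and the MM-DD-YYYY date format is an assumption based on the sample SYNTHEA_DATAGEN_DATES entries.

```python
from datetime import date

def synthea_command(population, seed=12345, config="synthea.properties",
                    jar="synthea-with-dependencies.jar"):
    """Build the basic-setup Synthea command as an argument list."""
    return ["java", "-jar", jar, "-c", config, "-s", str(seed), "-p", str(population)]

def run_date_string(day=None):
    """Format a run date as MM-DD-YYYY (assumed format); defaults to today."""
    day = day or date.today()
    return day.strftime("%m-%d-%Y")

cmd = synthea_command(10000)
print(" ".join(cmd))
print("run date:", run_date_string())
# To actually run it (requires Java and the Synthea jar in the current directory):
# import subprocess; subprocess.run(cmd, check=True)
```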

Copy Into DataStore

  • Synthea will save the generated dataset into the output directory in the same location (for basic setup).
  • Copy the csv directory to the location pointed to by the DATA_STORE global variable
    • for example ~/.lemonpie/datasets
  • Rename the csv directory to raw_original, and make sure the directory structure looks like this ..
    • for 10K data - ~/.lemonpie/datasets/synthea/10K/raw_original
    • Note - Synthea outputs all csv files in a folder called csv; after copying into the datastore, the csv files must be in the raw_original folder, where this library expects them for preprocessing.
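
The copy-and-rename step can also be scripted. This is a minimal sketch, assuming the basic-setup layout where Synthea writes its files to `output/csv`; `copy_into_datastore` is a hypothetical helper, and the paths in the usage comment are examples only.

```python
import shutil
from pathlib import Path

def copy_into_datastore(synthea_output, data_store, size_label):
    """Copy <synthea_output>/csv to <data_store>/synthea/<size_label>/raw_original."""
    src = Path(synthea_output).expanduser() / "csv"
    dest = Path(data_store).expanduser() / "synthea" / size_label / "raw_original"
    # copytree creates missing parent directories and raises FileExistsError
    # if dest already exists, so an earlier dataset is not silently overwritten.
    shutil.copytree(src, dest)
    return dest

# Example call for the 10K dataset:
# copy_into_datastore("output", "~/.lemonpie/datasets", "10K")
```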

Update settings.yaml

  • Go to your lemonpie settings file (~/.lemonpie/settings.yaml) and add an entry (or update the entry) for the dataset you just generated
  • For example for 10K data