Every file in the library imports this module, so any global setup that is required everywhere can be added here.
- Sets up device to GPU if available.
- Defines default paths for different stores - so that they are out of version control by default.
- Global scope variables - for convenience in other modules.
import torch

def get_device():
    '''Check whether a GPU is available and return the matching torch device (GPU or CPU).'''
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        assert torch.backends.cudnn.enabled
        torch.backends.cudnn.benchmark = True  # Enable cuDNN auto-tuner - perf benefit for convs
        device = torch.device("cuda")
    else:
        device = torch.device("cpu")
    return device
A YAML file called settings.yaml is created (from a template) the first time the library is used.
DEVICE = get_device()
settings = read_settings()
DATA_STORE = settings.STORES.DATA_STORE
LOG_STORE = settings.STORES.LOG_STORE
MODEL_STORE = settings.STORES.MODEL_STORE
EXPERIMENT_STORE = settings.STORES.EXPERIMENT_STORE
PATH_1K = f'{DATA_STORE}/synthea/1K'
PATH_10K = f'{DATA_STORE}/synthea/10K'
PATH_20K = f'{DATA_STORE}/synthea/20K'
PATH_100K = f'{DATA_STORE}/synthea/100K'
FILENAMES = settings.FILENAMES
SYNTHEA_DATAGEN_DATES = settings.SYNTHEA_DATAGEN_DATES
CONDITIONS = settings.CONDITIONS
LOG_NUMERICALIZE_EXCEP = settings.LOG_NUMERICALIZE_EXCEP
These are global variables with default values, used for convenience in many places in the library. They can be overridden by passing in non-default values where needed.
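The override pattern can be sketched as follows. This is only an illustration: the `DATA_STORE` value is a stand-in for whatever settings.yaml provides, and `dataset_path` is a hypothetical helper, not a function of this library.

```python
# Stand-in for the module-level default; the real value comes from settings.yaml.
DATA_STORE = "~/.lemonpie/datasets"
PATH_10K = f"{DATA_STORE}/synthea/10K"


def dataset_path(path=PATH_10K):
    """Hypothetical helper: use the global default unless the caller passes a path."""
    return f"{path}/raw_original"


default_dir = dataset_path()                       # falls back to the global default
custom_dir = dataset_path("/mnt/ehr/synthea/10K")  # overrides it for this call only
```

The global stays untouched, so an override in one call does not affect any other module that imports the default.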
CONDITIONS
- The conditions defined in the CONDITIONS dictionary are used during pre-processing to identify & label patients that have these conditions
- After pre-processing, a subset of these (some or all of them) are used as labels to train the deep learning models
- Thus to train on a different set of labels / conditions
  - First pre-process the dataset using the new conditions
  - And then proceed to training the models
FILENAMES
FILENAMES is the list of files in the dataset that this library currently runs pre-processing on.
The following 2 global variables need to be changed in the ~/.lemonpie/settings.yaml file based on your specific needs.
Change SYNTHEA_DATAGEN_DATES
SYNTHEA_DATAGEN_DATES
- A few sample entries are provided to serve as examples and all dates are set to the first time the library was run.
- Please update these based on when you generate a particular dataset.
- These dates are important to calculate patient age.
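To illustrate why the generation date matters, here is a minimal sketch of an age calculation measured against that date instead of today's date. It assumes dates are stored as MM-DD-YYYY strings; `age_in_years` is a hypothetical helper and may differ from how the library computes age.

```python
from datetime import date, datetime

# Assumed date format (MM-DD-YYYY) for entries in SYNTHEA_DATAGEN_DATES.
DATE_FORMAT = "%m-%d-%Y"


def age_in_years(birthdate: date, datagen_date: str) -> int:
    """Hypothetical helper: whole years between birth and the data generation date."""
    ref = datetime.strptime(datagen_date, DATE_FORMAT).date()
    # Subtract one year if the birthday has not yet occurred in the reference year.
    had_birthday = (ref.month, ref.day) >= (birthdate.month, birthdate.day)
    return ref.year - birthdate.year - (0 if had_birthday else 1)


age = age_in_years(date(1980, 6, 30), "12-19-2020")  # 40
```

With a wrong generation date, every patient's age shifts by the same offset, which is why the entries should be updated whenever a dataset is regenerated.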
DATA_STORE, MODEL_STORE, EXPERIMENT_STORE, LOG_STORE
Change these default paths to suit your specific configuration if desired.
- All of these artifacts need to be in some form of failsafe storage, but not all need to be in version control.
- Also, some of them are likely to get big and version control might not be the ideal location (e.g. data, logs and models).
- Experiments, on the other hand, as designed here, tend to be small enough to be stored in GitHub or some other version control system (VCS).
- Each Experiment keeps track of the model it runs and saves it separately in the model store.
- Given the nature of the dataset in this release of the library (synthetic / Synthea), it can be easily re-generated in case of a loss.
So it's left to you to decide where each store should live; change the default paths here accordingly.
The recommendation is to store experiments in some VCS and data & models in some type of failsafe storage; logs are used minimally and are not that important (at least in this release).
Set up Synthea so you can generate different types of synthetic EHR data per your need.
Synthea - Wiki has details about the project and how to get started and generate the data.
Here are condensed instructions for basic setup of Synthea for getting you up and running quickly. They also have an option for a developer setup, instructions for which are on the same webpage.
Download Synthea
- Download the binary (from the basic setup link above) to a local directory
- Don't run it yet
- Create a file in the same directory called synthea.properties, add the following lines to it and save it
      exporter.years_of_history = 0
      exporter.fhir.export = false
      exporter.fhir.transaction_bundle = false
      exporter.hospital.fhir.export = false
      exporter.practitioner.fhir.export = false
      exporter.csv.export = true
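As an alternative to creating the file by hand, the same properties can be written from Python. This is a minimal sketch; `write_synthea_properties` is a hypothetical helper and the directory you pass is whatever location holds your Synthea download.

```python
from pathlib import Path

# Settings copied from the instructions above.
SYNTHEA_PROPERTIES = """\
exporter.years_of_history = 0
exporter.fhir.export = false
exporter.fhir.transaction_bundle = false
exporter.hospital.fhir.export = false
exporter.practitioner.fhir.export = false
exporter.csv.export = true
"""


def write_synthea_properties(synthea_dir: str) -> Path:
    """Hypothetical helper: write synthea.properties into the Synthea directory."""
    target = Path(synthea_dir).expanduser() / "synthea.properties"
    target.write_text(SYNTHEA_PROPERTIES)
    return target
```

Calling `write_synthea_properties("~/synthea")` (an example path, not a required one) overwrites any existing synthea.properties in that directory.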
Generate Data
- Once Synthea is set up, the following script will generate the data.
- It's important to record the run date each time you generate a new dataset with Synthea (the data generation dates mentioned above); these dates are needed during preprocessing.
- Basic setup run command is: java -jar synthea-with-dependencies.jar
- Developer setup run command is: ./run_synthea
- Run with the -p switch to control the population of patients generated, as shown in the examples below.
For example, to generate 10,000 patients:
java -jar synthea-with-dependencies.jar -c synthea.properties -s 12345 -p 10000
- run date: 03/16/2021
- Records: total=11833, alive=10000, dead=1833
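If you generate several dataset sizes, the command above can be assembled programmatically. This is a minimal sketch for the basic setup; `synthea_command` is a hypothetical helper, and the default seed and file names are simply the values used in the example above.

```python
import shlex
from typing import List


def synthea_command(population: int, seed: int = 12345,
                    config: str = "synthea.properties") -> List[str]:
    """Hypothetical helper: build the basic-setup command for a given population."""
    return [
        "java", "-jar", "synthea-with-dependencies.jar",
        "-c", config,           # properties file created during setup
        "-s", str(seed),        # random seed
        "-p", str(population),  # population size
    ]


cmd = synthea_command(10_000)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True, cwd=...)` with `cwd` set to the directory that contains the jar and the properties file.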
Copy Into DataStore
- Synthea will save the generated dataset into the output directory in the same location (for basic setup).
- Copy the csv directory to the location pointed to by the DATA_STORE global variable
  - for example ~/.lemonpie/datasets
- Rename the csv directory to raw_original and make sure the directory structure looks like this:
  - for 10K data - ~/.lemonpie/datasets/synthea/10K/raw_original
  - Note - Synthea outputs all csv files in a folder called csv; after copying into the datastore, the csv files must be in the raw_original folder, where this library expects them for preprocessing.
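The copy and rename steps above can be sketched in one function. This is an illustration under the directory layout described in this section; `copy_into_data_store` is a hypothetical helper, not part of the library.

```python
import shutil
from pathlib import Path


def copy_into_data_store(synthea_output: str, data_store: str, size: str) -> Path:
    """Hypothetical helper: copy Synthea's csv folder to <data_store>/synthea/<size>/raw_original."""
    source = Path(synthea_output).expanduser() / "csv"
    target = Path(data_store).expanduser() / "synthea" / size / "raw_original"
    # copytree creates missing parent directories and raises if the target already
    # exists, which guards against overwriting a previously copied dataset.
    shutil.copytree(source, target)
    return target
```

For the 10K example, the arguments would be Synthea's output directory, the data store location, and "10K".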
Update settings.yaml
- Go to your lemonpie settings file (~/.lemonpie/settings.yaml) and add an entry (or update the entry) for the dataset you just generated
- For example for 10K data
  - Under SYNTHEA_DATAGEN_DATES create the following: '10K': '12-19-2020'