Split

Once the dataset is assembled, the folder looks as follows:

DATA_STORE

'/home/vinod/.lemonpie/datasets'

PATH_1K

'/home/vinod/.lemonpie/datasets/synthea/1K'

os.listdir(f'{PATH_1K}/raw_original')

['patients.csv',
 'observations.csv',
 'allergies.csv',
 'payers.csv',
 'careplans.csv',
 'medications.csv',
 'devices.csv',
 'organizations.csv',
 'imaging_studies.csv',
 'procedures.csv',
 'payer_transitions.csv',
 'supplies.csv',
 'conditions.csv',
 'providers.csv',
 'encounters.csv',
 'immunizations.csv']

dfs = read_raw_ehrdata(f'{PATH_1K}/raw_original')

patients, observations, allergies, careplans, medications, imaging_studies, procedures, conditions, immunizations = dfs

train, valid, test = split_patients(patients, .2,.1)

Splits:: train: 0.7, valid: 0.2, test: 0.1

len(patients), len(train), len(valid), len(test)

(1171, 819, 234, 118)

assert len(patients) == len(train)+len(valid)+len(test)
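The split above can be illustrated with a minimal, standard-library sketch. `split_ids` is a hypothetical helper, not lemonpie's `split_patients`; it assumes the train and valid sizes are truncated fractions of a shuffled list and the test set takes the remainder. Under that assumption it gives the same sizes as the output shown here for 1171 patients, but the real function may shuffle and round differently.

```python
import random

def split_ids(ids, valid_pct=0.2, test_pct=0.2, seed=42):
    """Shuffle ids and cut them into train/valid/test lists (hypothetical helper)."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * (1 - valid_pct - test_pct))
    n_valid = int(len(ids) * valid_pct)
    train = ids[:n_train]
    valid = ids[n_train:n_train + n_valid]
    test = ids[n_train + n_valid:]  # the remainder, so no patient is lost
    return train, valid, test

# Made-up patient ids, one per patient in the 1K dataset
ids = [f"patient-{i}" for i in range(1171)]
train_ids, valid_ids, test_ids = split_ids(ids, valid_pct=0.2, test_pct=0.1)
print(len(train_ids), len(valid_ids), len(test_ids))
```

Splitting on patient ids rather than on individual records keeps all of a patient's records in the same split.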

split_ehr_dataset(PATH_1K) #will use default values for split percents

Splits:: train: 0.6, valid: 0.2, test: 0.2
Split patients into:: Train: 702, Valid: 234, Test: 235 -- Total before split: 1171
Saved train data to /home/vinod/.lemonpie/datasets/synthea/1K/raw_split/train
Saved valid data to /home/vinod/.lemonpie/datasets/synthea/1K/raw_split/valid
Saved test data to /home/vinod/.lemonpie/datasets/synthea/1K/raw_split/test

Clean

The patients DataFrame looks like this before cleanup:

patients.head()

pt_data = cleanup_pts(patients, is_train=True, today=SYNTHEA_DATAGEN_DATES['1K'])
patients, pt_demo, pt_codes = pt_data[0], pt_data[1], pt_data[2]

Our cleanup function produces the following three DataFrames: patients, pt_demographics and pt_codes.

for df in pt_data:
    display(df.head())

The case for keeping a record of the data generation date

Note how age_now differs when it is calculated from the default (pd.Timestamp.today()) rather than from SYNTHEA_DATAGEN_DATES['1K'], which is the data generation date for this 1K dataset.

(pd.to_datetime(pd.Timestamp.today()) - patients.iloc[2])[0].days, (pd.to_datetime(SYNTHEA_DATAGEN_DATES['1K']) - patients.iloc[2])[0].days

(10530, 10513)

SYNTHEA_DATAGEN_DATES['1K'], pd.Timestamp.today()

('03-15-2021', Timestamp('2021-04-01 14:18:36.220590'))

That is, the 1K dataset's data generation date is set to March 15, 2021, while "today" (when this was run) was April 1, 2021, which accounts for the 17-day difference.
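The effect of the reference date can be reproduced with the standard library alone. This is only a sketch: `age_in_days` is a hypothetical helper, and the MM-DD-YYYY format is assumed from the value of SYNTHEA_DATAGEN_DATES['1K'] shown above. The birthdate is that of the third patient in the patients table (1992-06-02).

```python
from datetime import date, datetime

def age_in_days(birthdate, as_of):
    """Days between a birthdate and a reference date given as MM-DD-YYYY text."""
    ref = datetime.strptime(as_of, "%m-%d-%Y").date()
    return (ref - birthdate).days

birth = date(1992, 6, 2)
print(age_in_days(birth, "03-15-2021"))  # data generation date -> 10513
print(age_in_days(birth, "04-01-2021"))  # date the notebook was run -> 10530
```

Pinning the reference date keeps the computed ages stable no matter when the cleanup is re-run.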

Cleaning up observations:
Drops rows with null in the VALUE column.
Creates a new code column by concatenating code, value, units and type,
so that we can use the following logic during vocab creation for observations (further detailed in the vocab documentation).

For numeric

for 'numeric'
    get unique 'codes'
    for each unique code
        get unique 'units'
        for each unique unit
            bucketize 'values'
            create vocab entry for each 'bucket' -- code||value_bucket||units

For text

for 'text'
    get unique 'codes'
    for each unique code
        get unique 'units' #this will be null
        for each unique unit
            get unique 'values'
            create vocab entry for each -- code||value||units
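The concatenated code and the numeric branch of the logic above can be sketched in plain Python. All three functions are hypothetical illustrations, not lemonpie's implementation. In particular, the equal-width bucketing rule and the use of a bucket index in the vocab entry are assumptions; the actual bucketing is described in the vocab documentation.

```python
from collections import defaultdict

def make_obs_code(code, value, units, obs_type):
    """Join the four fields with '||', e.g. '8302-2||193.3||cm||numeric'."""
    return "||".join(str(part) for part in (code, value, units, obs_type))

def bucket_index(value, lo, hi, n_buckets):
    """Equal-width bucket number for value in [lo, hi] (an assumed bucketing rule)."""
    if hi == lo:
        return 0
    width = (hi - lo) / n_buckets
    return min(int((value - lo) / width), n_buckets - 1)

def numeric_vocab(rows, n_buckets=2):
    """Group numeric values by (code, units), then emit code||bucket||units entries."""
    groups = defaultdict(list)
    for row in rows:
        if row["value"] is None or row["type"] != "numeric":
            continue  # null values are dropped; text rows would be handled separately
        groups[(row["code"], row["units"])].append(float(row["value"]))
    vocab = set()
    for (code, units), values in groups.items():
        lo, hi = min(values), max(values)
        for v in values:
            vocab.add(f"{code}||{bucket_index(v, lo, hi, n_buckets)}||{units}")
    return sorted(vocab)

rows = [
    {"code": "8302-2", "value": 193.3, "units": "cm", "type": "numeric"},
    {"code": "8302-2", "value": 150.0, "units": "cm", "type": "numeric"},
    {"code": "8302-2", "value": None, "units": "cm", "type": "numeric"},
    {"code": "29463-7", "value": 87.8, "units": "kg", "type": "numeric"},
]
print(make_obs_code("8302-2", 193.3, "cm", "numeric"))
print(numeric_vocab(rows))
```

Grouping by code and units before bucketizing means that values measured in different units are never placed in the same bucket range.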

The 'observations' df before cleanup:

observations.head()

obs_data = cleanup_obs(observations, is_train=True)

After cleanup:

for df in obs_data:
    display(df.head())

Allergies have a start and a stop date in the same row, indicating when an allergy (identified by its code) started and, if applicable, stopped for a patient.
In the cleanup we flatten that out, meaning each stop date becomes a new row of its own.
The dataframe looks as follows before cleanup:

allergies.head()

alg_data = cleanup_algs(allergies, is_train=True)

This results in the following output after cleanup:

for df in alg_data:
    display(df.head(3))
    display(df.tail(3))
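The flattening of start and stop dates described for allergies can be sketched as follows. `flatten_start_stop` is a hypothetical helper working on plain dicts, not the library's `cleanup_algs`. The patient ids and the pairing of dates with codes are made up for illustration, and the row order of the real output may differ.

```python
def flatten_start_stop(rows):
    """Turn each START/STOP row into one event per date: code||START, code||STOP."""
    events = []
    for row in rows:
        events.append((row["patient"], row["start"], f'{row["code"]}||START'))
        if row["stop"] is not None:  # an ongoing allergy produces no STOP event
            events.append((row["patient"], row["stop"], f'{row["code"]}||STOP'))
    return events

rows = [
    {"patient": "p1", "start": "1982-10-25", "stop": None, "code": "300916003"},
    {"patient": "p2", "start": "2002-01-25", "stop": "2016-09-18", "code": "418689008"},
]
for event in flatten_start_stop(rows):
    print(event)
```

After flattening, every row is a single dated event, which is the same (date, code) shape that the other cleaned tables use.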

careplans.head()

crpl_data = cleanup_crpls(careplans, is_train=True)

for df in crpl_data:
    display(df.head(3))
    display(df.tail(3))

medications.head()

med_data = cleanup_meds(medications, is_train=True)

for df in med_data:
    display(df.head(3))
    display(df.tail(3))

imaging_studies.head()

img_data = cleanup_img(imaging_studies, is_train=True)

for df in img_data:
    display(df.head(3))

procedures.head()

proc_data = cleanup_procs(procedures, is_train=True)

for df in proc_data:
    display(df.head(3))

conditions.head()

cnd_data = cleanup_cnds(conditions, is_train=True)

for df in cnd_data:
    display(df.head(3))
    display(df.tail(3))

immunizations.head()

imm_data = cleanup_immns(immunizations, is_train=True)

for df in imm_data:
    display(df.head(3))

Clean all

data_tables, code_tables = cleanup_dataset(f'{PATH_1K}/raw_split/train', is_train=True)

patients, pt_demographics, observations, allergies, \
careplans, medications, imaging_studies, procedures, conditions, immunizations = data_tables

pt_codes, obs_codes, alg_codes, crpl_codes, med_codes, img_codes, proc_codes, cnd_codes, imm_codes = code_tables

conditions.count()

date    7666
code    7666
dtype: int64

obs_codes.count()

orig_code    173312
desc         173312
value        173312
units        173312
type         173312
dtype: int64

Extract Labels (y)

The labels we intend to predict are conditions, and each of them must be in the CONDITIONS dict.

We add them to the patients df,
along with the patient's age when the particular condition was recorded.

for key in CONDITIONS.keys():
    print(key,"::",f'{CONDITIONS[key]}||START')

diabetes :: 44054006||START
stroke :: 230690007||START
alzheimers :: 26929004||START
coronary_heart :: 53741008||START
lung_cancer :: 254637007||START
breast_cancer :: 254837009||START
rheumatoid_arthritis :: 69896004||START
epilepsy :: 84757009||START

tmp_pts = extract_ys(patients, conditions, cnd_dict=CONDITIONS)

tmp_pts.count()

birthdate                   702
diabetes                    702
diabetes_age                 43
stroke                      702
stroke_age                   30
alzheimers                  702
alzheimers_age               12
coronary_heart              702
coronary_heart_age           39
lung_cancer                 702
lung_cancer_age              12
breast_cancer               702
breast_cancer_age            11
rheumatoid_arthritis        702
rheumatoid_arthritis_age      2
epilepsy                    702
epilepsy_age                 15
dtype: int64
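The shape of the extracted labels (a flag per condition plus an age column that is only filled for patients who have the condition) can be sketched with plain Python. `extract_labels` and `years_between` are hypothetical helpers, not lemonpie's `extract_ys`. Computing the age as completed years is an assumption, the sample event is made up, and missing ages are shown as None rather than NaN.

```python
from datetime import date

def years_between(birthdate, event_date):
    """Completed years from birthdate to event_date."""
    had_birthday = (event_date.month, event_date.day) >= (birthdate.month, birthdate.day)
    return event_date.year - birthdate.year - (0 if had_birthday else 1)

def extract_labels(birthdates, condition_events, cnd_dict):
    """Per patient: a True/False flag per condition plus the age at its START event."""
    labels = {}
    for patient, birthdate in birthdates.items():
        row = {}
        for name, code in cnd_dict.items():
            dates = [d for (p, d, c) in condition_events
                     if p == patient and c == f"{code}||START"]
            row[name] = bool(dates)
            row[f"{name}_age"] = years_between(birthdate, min(dates)) if dates else None
        labels[patient] = row
    return labels

birthdates = {"p1": date(1958, 9, 8), "p2": date(1989, 5, 25)}
events = [("p1", date(2011, 1, 10), "44054006||START")]  # made-up diabetes onset
cnd_dict = {"diabetes": "44054006", "stroke": "230690007"}
labels = extract_labels(birthdates, events, cnd_dict)
print(labels["p1"])
print(labels["p2"])
```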

Insert Age

Inserting the patient's age in months and years into each record df.

This can be modified to record the patient's age in days or even hours, which might be more relevant for datasets involving hospitalizations or ER admissions.
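A minimal sketch of the age calculation for a single record, using only the standard library. `age_at` is a hypothetical helper, not lemonpie's `insert_age`, and counting whole (completed) months and years is an assumption. The dates are a birthdate and a care plan start date that appear for the same patient in the tables of this section.

```python
from datetime import date

def age_at(birthdate, record_date):
    """Return (age in whole months, age in whole years) on record_date."""
    months = (record_date.year - birthdate.year) * 12 + (record_date.month - birthdate.month)
    if record_date.day < birthdate.day:
        months -= 1  # the current month is not yet complete
    return months, months // 12

print(age_at(date(1992, 6, 2), date(2011, 5, 13)))  # -> (227, 18)
```

Switching to days would only require returning `(record_date - birthdate).days` instead.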

Do-All Functions

The actual functions that will be called from other modules

clean_raw_ehrdata(PATH_1K, 0.2, 0.2, CONDITIONS, SYNTHEA_DATAGEN_DATES['1K'])

Splits:: train: 0.6, valid: 0.2, test: 0.2
Split patients into:: Train: 702, Valid: 234, Test: 235 -- Total before split: 1171
Saved train data to /home/vinod/.lemonpie/datasets/synthea/1K/raw_split/train
Saved valid data to /home/vinod/.lemonpie/datasets/synthea/1K/raw_split/valid
Saved test data to /home/vinod/.lemonpie/datasets/synthea/1K/raw_split/test
Saved cleaned "train" data to /home/vinod/.lemonpie/datasets/synthea/1K/cleaned/train
Saved vocab code tables to /home/vinod/.lemonpie/datasets/synthea/1K/cleaned/train/codes
Saved cleaned "valid" data to /home/vinod/.lemonpie/datasets/synthea/1K/cleaned/valid
Saved cleaned "test" data to /home/vinod/.lemonpie/datasets/synthea/1K/cleaned/test

train_dfs, valid_dfs, test_dfs = load_cleaned_ehrdata(PATH_1K)
code_dfs = load_ehr_vocabcodes(PATH_1K)

thispt = train_dfs[0].iloc[20]

thispt

patient                     10134dbf-72d1-4381-b8f3-9530cca6622a
birthdate                                             1958-09-08
diabetes                                                    True
diabetes_age                                                52.0
stroke                                                      True
stroke_age                                                  60.0
alzheimers                                                 False
alzheimers_age                                               NaN
coronary_heart                                             False
coronary_heart_age                                           NaN
lung_cancer                                                False
lung_cancer_age                                              NaN
breast_cancer                                              False
breast_cancer_age                                            NaN
rheumatoid_arthritis                                       False
rheumatoid_arthritis_age                                     NaN
epilepsy                                                   False
epilepsy_age                                                 NaN
Name: 20, dtype: object

Making sure condition counts match - after extracting y for each patient

CONDITIONS

OrderedDict([('diabetes', '44054006'),
             ('stroke', '230690007'),
             ('alzheimers', '26929004'),
             ('coronary_heart', '53741008'),
             ('lung_cancer', '254637007'),
             ('breast_cancer', '254837009'),
             ('rheumatoid_arthritis', '69896004'),
             ('epilepsy', '84757009')])

patients dfs after cleaning, with y extracted

pts_train, pts_valid, pts_test = train_dfs[0], valid_dfs[0], test_dfs[0]

conditions dfs

cnd_train, cnd_valid, cnd_test = train_dfs[8], valid_dfs[8], test_dfs[8]

Counts for each condition in conditions and patients dfs in each split

for pts, cnds, split in zip([pts_train, pts_valid, pts_test],[cnd_train, cnd_valid, cnd_test], ['train','valid','test']):
    print('\n',split)
    # one line per condition: count in conditions df, count in patients df
    for name, code in CONDITIONS.items():
        print(f'{name}:: ', len(cnds[cnds['code'] == f'{code}||START']), len(pts[pts[name] == 1]))

 train
diabetes::  43 43
stroke::  30 30
alzheimers::  12 12
coronary_heart::  39 39
lung_cancer::  12 12
breast_cancer::  11 11
rheumatoid_arthritis::  2 2
epilepsy::  15 15

 valid
diabetes::  14 14
stroke::  7 7
alzheimers::  7 7
coronary_heart::  11 11
lung_cancer::  0 0
breast_cancer::  8 8
rheumatoid_arthritis::  0 0
epilepsy::  5 5

 test
diabetes::  19 19
stroke::  11 11
alzheimers::  6 6
coronary_heart::  11 11
lung_cancer::  2 2
breast_cancer::  2 2
rheumatoid_arthritis::  0 0
epilepsy::  2 2

for pts, cnds, split in zip([pts_train, pts_valid, pts_test],[cnd_train, cnd_valid, cnd_test], ['train','valid','test']):
    # the START count in the conditions df must equal the flagged patients for every condition
    for name, code in CONDITIONS.items():
        assert len(cnds[cnds['code'] == f'{code}||START']) == len(pts[pts[name] == 1]), f'error in {split} for {name}'

	Id	BIRTHDATE	DEATHDATE	SSN	DRIVERS	PASSPORT	PREFIX	FIRST	LAST	SUFFIX	...	BIRTHPLACE	ADDRESS	CITY	STATE	COUNTY	ZIP	LAT	LON	HEALTHCARE_EXPENSES	HEALTHCARE_COVERAGE
0	1d604da9-9a81-4ba9-80c2-de3375d59b40	1989-05-25	NaN	999-76-6866	S99984236	X19277260X	Mr.	José Eduardo181	Gómez206	NaN	...	Marigot Saint Andrew Parish DM	427 Balistreri Way Unit 19	Chicopee	Massachusetts	Hampden County	1013.0	42.228354	-72.562951	271227.08	1334.88
1	034e9e3b-2def-4559-bb2a-7850888ae060	1983-11-14	NaN	999-73-5361	S99962402	X88275464X	Mr.	Milo271	Feil794	NaN	...	Danvers Massachusetts US	422 Farrell Path Unit 69	Somerville	Massachusetts	Middlesex County	2143.0	42.360697	-71.126531	793946.01	3204.49
2	10339b10-3cd1-4ac3-ac13-ec26728cb592	1992-06-02	NaN	999-27-3385	S99972682	X73754411X	Mr.	Jayson808	Fadel536	NaN	...	Springfield Massachusetts US	1056 Harris Lane Suite 70	Chicopee	Massachusetts	Hampden County	1020.0	42.181642	-72.608842	574111.90	2606.40
3	8d4c4326-e9de-4f45-9a4c-f8c36bff89ae	1978-05-27	NaN	999-85-4926	S99974448	X40915583X	Mrs.	Mariana775	Rutherford999	NaN	...	Yarmouth Massachusetts US	999 Kuhn Forge	Lowell	Massachusetts	Middlesex County	1851.0	42.636143	-71.343255	935630.30	8756.19
4	f5dcd418-09fe-4a2f-baa0-3da800bd8c3a	1996-10-18	NaN	999-60-7372	S99915787	X86772962X	Mr.	Gregorio366	Auer97	NaN	...	Patras Achaea GR	1050 Lindgren Extension Apt 38	Boston	Massachusetts	Suffolk County	2135.0	42.352434	-71.028610	598763.07	3772.20

	birthdate
patient
1d604da9-9a81-4ba9-80c2-de3375d59b40	1989-05-25
034e9e3b-2def-4559-bb2a-7850888ae060	1983-11-14
10339b10-3cd1-4ac3-ac13-ec26728cb592	1992-06-02
8d4c4326-e9de-4f45-9a4c-f8c36bff89ae	1978-05-27
f5dcd418-09fe-4a2f-baa0-3da800bd8c3a	1996-10-18

	birthdate	marital	race	ethnicity	gender	birthplace	city	state	zip	age_now_days
patient
1d604da9-9a81-4ba9-80c2-de3375d59b40	1989-05-25	M	white	hispanic	M	Marigot Saint Andrew Parish DM	Chicopee	Massachusetts	1013	11617
034e9e3b-2def-4559-bb2a-7850888ae060	1983-11-14	M	white	nonhispanic	M	Danvers Massachusetts US	Somerville	Massachusetts	2143	13636
10339b10-3cd1-4ac3-ac13-ec26728cb592	1992-06-02	M	white	nonhispanic	M	Springfield Massachusetts US	Chicopee	Massachusetts	1020	10513
8d4c4326-e9de-4f45-9a4c-f8c36bff89ae	1978-05-27	M	white	nonhispanic	F	Yarmouth Massachusetts US	Lowell	Massachusetts	1851	15633
f5dcd418-09fe-4a2f-baa0-3da800bd8c3a	1996-10-18	xxxnan	white	nonhispanic	M	Patras Achaea GR	Boston	Massachusetts	2135	8914

	birthdate	marital	race	ethnicity	gender	birthplace	city	state	zip	age_now_days
0	1989-05-25	M	white	hispanic	M	Marigot Saint Andrew Parish DM	Chicopee	Massachusetts	1013	11617
1	1983-11-14	M	white	nonhispanic	M	Danvers Massachusetts US	Somerville	Massachusetts	2143	13636
2	1992-06-02	M	white	nonhispanic	M	Springfield Massachusetts US	Chicopee	Massachusetts	1020	10513
3	1978-05-27	M	white	nonhispanic	F	Yarmouth Massachusetts US	Lowell	Massachusetts	1851	15633
4	1996-10-18	xxxnan	white	nonhispanic	M	Patras Achaea GR	Boston	Massachusetts	2135	8914

	DATE	PATIENT	ENCOUNTER	CODE	DESCRIPTION	VALUE	UNITS	TYPE
0	2012-01-23T17:45:28Z	034e9e3b-2def-4559-bb2a-7850888ae060	e88bc3a9-007c-405e-aabc-792a38f4aa2b	8302-2	Body Height	193.3	cm	numeric
1	2012-01-23T17:45:28Z	034e9e3b-2def-4559-bb2a-7850888ae060	e88bc3a9-007c-405e-aabc-792a38f4aa2b	72514-3	Pain severity - 0-10 verbal numeric rating [Sc...	2.0	{score}	numeric
2	2012-01-23T17:45:28Z	034e9e3b-2def-4559-bb2a-7850888ae060	e88bc3a9-007c-405e-aabc-792a38f4aa2b	29463-7	Body Weight	87.8	kg	numeric
3	2012-01-23T17:45:28Z	034e9e3b-2def-4559-bb2a-7850888ae060	e88bc3a9-007c-405e-aabc-792a38f4aa2b	39156-5	Body Mass Index	23.5	kg/m2	numeric
4	2012-01-23T17:45:28Z	034e9e3b-2def-4559-bb2a-7850888ae060	e88bc3a9-007c-405e-aabc-792a38f4aa2b	8462-4	Diastolic Blood Pressure	82.0	mm[Hg]	numeric

Clean

Split

`read_raw_ehrdata`[source]

`split_patients`[source]

`split_ehr_dataset`[source]

Clean

`cleanup_pts`[source]

`cleanup_obs`[source]

`cleanup_algs`[source]

`cleanup_crpls`[source]

`cleanup_meds`[source]

`cleanup_img`[source]

`cleanup_procs`[source]

`cleanup_cnds`[source]

`cleanup_immns`[source]

Clean all

`cleanup_dataset`[source]

Extract Labels (y)

`extract_ys`[source]

Insert Age

`insert_age`[source]

Do-All Functions

`clean_raw_ehrdata`[source]

`load_cleaned_ehrdata`[source]

`load_ehr_vocabcodes`[source]

	date	code
patient
034e9e3b-2def-4559-bb2a-7850888ae060	2012-01-23 17:45:28	8302-2||193.3||cm||numeric
034e9e3b-2def-4559-bb2a-7850888ae060	2012-01-23 17:45:28	72514-3||2.0||{score}||numeric
034e9e3b-2def-4559-bb2a-7850888ae060	2012-01-23 17:45:28	29463-7||87.8||kg||numeric
034e9e3b-2def-4559-bb2a-7850888ae060	2012-01-23 17:45:28	39156-5||23.5||kg/m2||numeric
034e9e3b-2def-4559-bb2a-7850888ae060	2012-01-23 17:45:28	8462-4||82.0||mm[Hg]||numeric

	START	STOP	PATIENT	ENCOUNTER	CODE	DESCRIPTION
0	1982-10-25	NaN	76982e06-f8b8-4509-9ca3-65a99c8650fe	b896bf40-8b72-42b7-b205-142ee3a56b55	300916003	Latex allergy
1	1982-10-25	NaN	76982e06-f8b8-4509-9ca3-65a99c8650fe	b896bf40-8b72-42b7-b205-142ee3a56b55	300913006	Shellfish allergy
2	2002-01-25	NaN	71ba0469-f0cc-4177-ac70-ea07cb01c8b8	7be1a590-4239-4826-9872-031327f3c368	419474003	Allergy to mould
3	2002-01-25	NaN	71ba0469-f0cc-4177-ac70-ea07cb01c8b8	7be1a590-4239-4826-9872-031327f3c368	232347008	Dander (animal) allergy
4	2002-01-25	NaN	71ba0469-f0cc-4177-ac70-ea07cb01c8b8	7be1a590-4239-4826-9872-031327f3c368	418689008	Allergy to grass pollen

	date	code
patient
96942a16-75bc-4026-bd63-e985b0ca1d6d	2016-09-18	418689008||STOP
96942a16-75bc-4026-bd63-e985b0ca1d6d	2016-09-18	419263009||STOP
e6ff4bf9-09c2-4976-aa84-cca142207cf8	2016-06-25	300916003||STOP

	code	desc
0	300916003||START	Latex allergy
1	300913006||START	Shellfish allergy
2	419474003||START	Allergy to mould

	Id	START	STOP	PATIENT	ENCOUNTER	CODE	DESCRIPTION	REASONCODE	REASONDESCRIPTION
0	d2500b8c-e830-433a-8b9d-368d30741520	2010-01-23	2012-01-23	034e9e3b-2def-4559-bb2a-7850888ae060	d0c40d10-8d87-447e-836e-99d26ad52ea5	53950000	Respiratory therapy	10509002.0	Acute bronchitis (disorder)
1	07d9ddd8-dfa1-4e43-9bfe-39f63f4ace15	2011-05-13	2011-08-02	10339b10-3cd1-4ac3-ac13-ec26728cb592	e1ab4933-07a1-49f0-b4bd-05500919061d	53950000	Respiratory therapy	10509002.0	Acute bronchitis (disorder)
2	a3bb6e99-3b99-44b3-974c-e230b4511b5c	2011-12-31	2012-11-30	f5dcd418-09fe-4a2f-baa0-3da800bd8c3a	16300c56-a035-4126-a656-68c093da6dfc	53950000	Respiratory therapy	10509002.0	Acute bronchitis (disorder)
3	9f5284b7-425a-486a-b36e-ab818c018f2f	2016-12-29	2017-01-05	034e9e3b-2def-4559-bb2a-7850888ae060	3b639086-5fbc-4720-8c31-e8c8c0f1d660	53950000	Respiratory therapy	10509002.0	Acute bronchitis (disorder)
4	47ede16c-c216-4f81-a16b-0e858de9cdc3	2017-01-22	2017-02-12	10339b10-3cd1-4ac3-ac13-ec26728cb592	4ec8d55b-05fc-42a5-bfa3-1e233874a362	225358003	Wound care	284551006.0	Laceration of foot

	date	code
patient
6d048a56-edb8-4f29-891d-7a84d75a8e78	2002-11-30	53950000||STOP
fca3178e-fb68-41c3-8598-702d3ca68b96	1983-09-29	91251008||STOP
fca3178e-fb68-41c3-8598-702d3ca68b96	1984-11-22	385691007||STOP

	code	desc
0	53950000||START	Respiratory therapy
1	53950000||START	Respiratory therapy
2	53950000||START	Respiratory therapy

	code	desc
5431	53950000||STOP	Respiratory therapy
5432	91251008||STOP	Physical therapy procedure
5433	385691007||STOP	Fracture care

	START	STOP	PATIENT	PAYER	ENCOUNTER	CODE	DESCRIPTION	BASE_COST	DISPENSES	TOTALCOST	REASONCODE	REASONDESCRIPTION
0	2010-05-05T00:26:23Z	2011-04-30T00:26:23Z	8d4c4326-e9de-4f45-9a4c-f8c36bff89ae	b1c428d6-4f07-31e0-90f0-68ffa6ff8c76	1e0d6b0e-1711-4a25-99f9-b1c700c9b260	389221	Etonogestrel 68 MG Drug Implant	677.08	12	8124.96	NaN	NaN
1	2011-04-30T00:26:23Z	2012-04-24T00:26:23Z	8d4c4326-e9de-4f45-9a4c-f8c36bff89ae	b1c428d6-4f07-31e0-90f0-68ffa6ff8c76	6aa37300-d1b4-48e7-a2f8-5e0f70f48f38	389221	Etonogestrel 68 MG Drug Implant	624.09	12	7489.08	NaN	NaN
2	2012-04-24T00:26:23Z	2013-04-19T00:26:23Z	8d4c4326-e9de-4f45-9a4c-f8c36bff89ae	b1c428d6-4f07-31e0-90f0-68ffa6ff8c76	7253a9f9-6f6d-429a-926a-7b1d424eae3f	748856	Yaz 28 Day Pack	43.32	12	519.84	NaN	NaN
3	2011-05-13T12:58:08Z	2011-05-27T12:58:08Z	10339b10-3cd1-4ac3-ac13-ec26728cb592	d47b3510-2895-3b70-9897-342d681c769d	e1ab4933-07a1-49f0-b4bd-05500919061d	313782	Acetaminophen 325 MG Oral Tablet	8.14	1	8.14	10509002.0	Acute bronchitis (disorder)
4	2011-12-08T15:02:18Z	2011-12-22T15:02:18Z	1d604da9-9a81-4ba9-80c2-de3375d59b40	b1c428d6-4f07-31e0-90f0-68ffa6ff8c76	792fae81-a007-44b0-8221-46953737b089	562251	Amoxicillin 250 MG / Clavulanate 125 MG Oral T...	11.91	1	11.91	444814009.0	Viral sinusitis (disorder)

	date	code
patient
8d4c4326-e9de-4f45-9a4c-f8c36bff89ae	2010-05-05 00:26:23	389221||START
8d4c4326-e9de-4f45-9a4c-f8c36bff89ae	2011-04-30 00:26:23	389221||START
8d4c4326-e9de-4f45-9a4c-f8c36bff89ae	2012-04-24 00:26:23	748856||START

	code	desc
84080	2123111||STOP	NDA020503 200 ACTUAT Albuterol 0.09 MG/ACTUAT ...
84081	243670||STOP	Aspirin 81 MG Oral Tablet
84082	313782||STOP	Acetaminophen 325 MG Oral Tablet

	Id	DATE	PATIENT	ENCOUNTER	BODYSITE_CODE	BODYSITE_DESCRIPTION	MODALITY_CODE	MODALITY_DESCRIPTION	SOP_CODE	SOP_DESCRIPTION
0	d3e49b38-7634-4416-879d-7bc68bf3e7df	2014-07-08T15:35:36Z	b58731cc-2d8b-4c2d-b327-4cab771af3ef	3a36836d-da25-4e73-808b-972b669b7e4e	40983000	Arm	DX	Digital Radiography	1.2.840.10008.5.1.4.1.1.1.1	Digital X-Ray Image Storage
1	46baf530-4941-40ab-8219-685a08fd9086	2014-01-22T18:58:37Z	2ffe9369-24e4-414b-8973-258fad09313a	33b71e4b-0690-4fe9-897a-dc3b2ff9215c	40983000	Arm	DX	Digital Radiography	1.2.840.10008.5.1.4.1.1.1.1	Digital X-Ray Image Storage
2	b8fb8a6e-a2f5-46c9-8b3f-a35aa982efcd	2001-12-01T02:08:27Z	86b97fc7-ae8f-4e0d-8e66-db68f36e7a76	e42d1046-568d-46c2-b0a5-d910b2f3bd1d	8205005	Wrist	DX	Digital Radiography	1.2.840.10008.5.1.4.1.1.1.1	Digital X-Ray Image Storage
3	10c8a016-4504-4653-bddf-2dd3610886c8	2004-07-03T20:46:46Z	71ba0469-f0cc-4177-ac70-ea07cb01c8b8	323fca87-817f-4d58-8486-ba92ea739399	51299004	Clavicle	DX	Digital Radiography	1.2.840.10008.5.1.4.1.1.1.1	Digital X-Ray Image Storage
4	4221534c-d379-4c6b-a22e-d7eae3fa2609	2017-02-08T08:42:44Z	d49f748f-928d-40e8-92c8-73e4c5679711	cfef48b3-b769-4794-a3e7-f57f7ba8d387	344001	Ankle	DX	Digital Radiography	1.2.840.10008.5.1.4.1.1.1.1	Digital X-Ray Image Storage

	date	code
patient
b58731cc-2d8b-4c2d-b327-4cab771af3ef	2014-07-08 15:35:36	40983000
2ffe9369-24e4-414b-8973-258fad09313a	2014-01-22 18:58:37	40983000
86b97fc7-ae8f-4e0d-8e66-db68f36e7a76	2001-12-01 02:08:27	8205005

	code	desc
0	40983000	Arm
1	40983000	Arm
2	8205005	Wrist

	date	code
patient
6d048a56-edb8-4f29-891d-7a84d75a8e78	2005-12-31 17:27:52	2123111||STOP
fca3178e-fb68-41c3-8598-702d3ca68b96	1983-09-29 17:27:52	243670||STOP
fca3178e-fb68-41c3-8598-702d3ca68b96	1984-11-22 17:27:52	313782||STOP

	code	desc
0	169553002	Insertion of subcutaneous contraceptive
1	430193006	Medication Reconciliation (procedure)
2	430193006	Medication Reconciliation (procedure)

	START	STOP	PATIENT	ENCOUNTER	CODE	DESCRIPTION
0	2001-05-01	NaN	1d604da9-9a81-4ba9-80c2-de3375d59b40	8f104aa7-4ca9-4473-885a-bba2437df588	40055000	Chronic sinusitis (disorder)
1	2011-08-09	2011-08-16	8d4c4326-e9de-4f45-9a4c-f8c36bff89ae	9d35ec9f-352a-4629-92ef-38eae38437e7	444814009	Viral sinusitis (disorder)
2	2011-11-16	2011-11-26	8d4c4326-e9de-4f45-9a4c-f8c36bff89ae	ae7555a9-eaff-4c09-98a7-21bc6ed1b1fd	195662009	Acute viral pharyngitis (disorder)
3	2011-05-13	2011-05-27	10339b10-3cd1-4ac3-ac13-ec26728cb592	e1ab4933-07a1-49f0-b4bd-05500919061d	10509002	Acute bronchitis (disorder)
4	2011-02-06	2011-02-14	f5dcd418-09fe-4a2f-baa0-3da800bd8c3a	b8f76eba-7795-4dcd-a544-f27ac2ef3d46	195662009	Acute viral pharyngitis (disorder)

	date	code
patient
fca3178e-fb68-41c3-8598-702d3ca68b96	1986-03-02	43878008||STOP
fc817953-cc8b-45db-9c85-7c0ced8fa90d	2010-11-25	444814009||STOP
fc817953-cc8b-45db-9c85-7c0ced8fa90d	2012-05-14	444814009||STOP

	code	desc
0	40055000||START	Chronic sinusitis (disorder)
1	444814009||START	Viral sinusitis (disorder)
2	195662009||START	Acute viral pharyngitis (disorder)

	code	desc
12938	43878008||STOP	Streptococcal sore throat (disorder)
12939	444814009||STOP	Viral sinusitis (disorder)
12940	444814009||STOP	Viral sinusitis (disorder)

	DATE	PATIENT	ENCOUNTER	CODE	DESCRIPTION	BASE_COST
0	2010-07-27T12:58:08Z	10339b10-3cd1-4ac3-ac13-ec26728cb592	dae2b7cb-1316-4b78-954f-fa610a6c6d0e	140	Influenza seasonal injectable preservative ...	140.52
1	2010-11-20T03:04:34Z	f5dcd418-09fe-4a2f-baa0-3da800bd8c3a	7ff86631-0378-4bfc-92ce-1edd697eb18e	140	Influenza seasonal injectable preservative ...	140.52
2	2012-01-23T17:45:28Z	034e9e3b-2def-4559-bb2a-7850888ae060	e88bc3a9-007c-405e-aabc-792a38f4aa2b	140	Influenza seasonal injectable preservative ...	140.52
3	2011-11-26T03:04:34Z	f5dcd418-09fe-4a2f-baa0-3da800bd8c3a	1923c698-accd-4d70-ba09-e1938f6e96d1	140	Influenza seasonal injectable preservative ...	140.52
4	2011-07-28T15:02:18Z	1d604da9-9a81-4ba9-80c2-de3375d59b40	b85c339a-6076-43ed-b9d0-9cf013dec49d	140	Influenza seasonal injectable preservative ...	140.52

	date	code
patient
10339b10-3cd1-4ac3-ac13-ec26728cb592	2010-07-27 12:58:08	140
f5dcd418-09fe-4a2f-baa0-3da800bd8c3a	2010-11-20 03:04:34	140
034e9e3b-2def-4559-bb2a-7850888ae060	2012-01-23 17:45:28	140
