Functions to split the raw EHR dataset, clean it, and save the results for further processing and vocab creation.
 

Split

Once the dataset is assembled, the folder will look as follows:

DATA_STORE
'/home/vinod/.lemonpie/datasets'
PATH_1K
'/home/vinod/.lemonpie/datasets/synthea/1K'
os.listdir(f'{PATH_1K}/raw_original')
['patients.csv',
 'observations.csv',
 'allergies.csv',
 'payers.csv',
 'careplans.csv',
 'medications.csv',
 'devices.csv',
 'organizations.csv',
 'imaging_studies.csv',
 'procedures.csv',
 'payer_transitions.csv',
 'supplies.csv',
 'conditions.csv',
 'providers.csv',
 'encounters.csv',
 'immunizations.csv']

read_raw_ehrdata[source]

read_raw_ehrdata(path, csv_names=['patients', 'observations', 'allergies', 'careplans', 'medications', 'imaging_studies', 'procedures', 'conditions', 'immunizations'])

Read raw EHR data

dfs = read_raw_ehrdata(f'{PATH_1K}/raw_original')
patients, observations, allergies, careplans, medications, imaging_studies, procedures, conditions, immunizations = dfs

split_patients[source]

split_patients(patients, valid_pct=0.2, test_pct=0.2, random_state=1234)

Split the patients dataframe

train, valid, test = split_patients(patients, .2,.1)
Splits:: train: 0.7, valid: 0.2, test: 0.1
len(patients), len(train), len(valid), len(test)
(1171, 819, 234, 118)
assert len(patients) == len(train)+len(valid)+len(test)
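
As a rough illustration of a patient-level split, the sketch below shuffles a dataframe once and slices it into three parts. It is not the library's implementation: `split_patients_sketch` and the toy `demo` dataframe are hypothetical names, and the rounding rule (train and valid truncated, test takes the remainder) is an assumption.

```python
import pandas as pd

def split_patients_sketch(patients, valid_pct=0.2, test_pct=0.2, random_state=1234):
    """Hypothetical patient-level split; not the library's implementation."""
    train_pct = 1 - valid_pct - test_pct
    # Shuffle once so the slices are random but reproducible.
    shuffled = patients.sample(frac=1.0, random_state=random_state)
    n_train = int(len(shuffled) * train_pct)
    n_valid = int(len(shuffled) * valid_pct)
    train = shuffled.iloc[:n_train]
    valid = shuffled.iloc[n_train:n_train + n_valid]
    test = shuffled.iloc[n_train + n_valid:]  # assumption: test gets the remainder
    return train, valid, test

# Toy data, made up for this example.
demo = pd.DataFrame({"Id": [f"pt{i}" for i in range(10)]})
train_demo, valid_demo, test_demo = split_patients_sketch(demo, valid_pct=0.2, test_pct=0.1)
print(len(train_demo), len(valid_demo), len(test_demo))
```

Splitting on patients rather than on individual records keeps all of a patient's records in a single split.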

split_ehr_dataset[source]

split_ehr_dataset(path, valid_pct=0.2, test_pct=0.2, random_state=1234)

Split EHR dataset into train, valid, test and save

split_ehr_dataset(PATH_1K) #will use default values for split percents
Splits:: train: 0.6, valid: 0.2, test: 0.2
Split patients into:: Train: 702, Valid: 234, Test: 235 -- Total before split: 1171
Saved train data to /home/vinod/.lemonpie/datasets/synthea/1K/raw_split/train
Saved valid data to /home/vinod/.lemonpie/datasets/synthea/1K/raw_split/valid
Saved test data to /home/vinod/.lemonpie/datasets/synthea/1K/raw_split/test

Clean

cleanup_pts[source]

cleanup_pts(pts, is_train, today=None)

Clean patients df

The patients dataframe looks like this before cleanup:

patients.head()
Id BIRTHDATE DEATHDATE SSN DRIVERS PASSPORT PREFIX FIRST LAST SUFFIX ... BIRTHPLACE ADDRESS CITY STATE COUNTY ZIP LAT LON HEALTHCARE_EXPENSES HEALTHCARE_COVERAGE
0 1d604da9-9a81-4ba9-80c2-de3375d59b40 1989-05-25 NaN 999-76-6866 S99984236 X19277260X Mr. José Eduardo181 Gómez206 NaN ... Marigot Saint Andrew Parish DM 427 Balistreri Way Unit 19 Chicopee Massachusetts Hampden County 1013.0 42.228354 -72.562951 271227.08 1334.88
1 034e9e3b-2def-4559-bb2a-7850888ae060 1983-11-14 NaN 999-73-5361 S99962402 X88275464X Mr. Milo271 Feil794 NaN ... Danvers Massachusetts US 422 Farrell Path Unit 69 Somerville Massachusetts Middlesex County 2143.0 42.360697 -71.126531 793946.01 3204.49
2 10339b10-3cd1-4ac3-ac13-ec26728cb592 1992-06-02 NaN 999-27-3385 S99972682 X73754411X Mr. Jayson808 Fadel536 NaN ... Springfield Massachusetts US 1056 Harris Lane Suite 70 Chicopee Massachusetts Hampden County 1020.0 42.181642 -72.608842 574111.90 2606.40
3 8d4c4326-e9de-4f45-9a4c-f8c36bff89ae 1978-05-27 NaN 999-85-4926 S99974448 X40915583X Mrs. Mariana775 Rutherford999 NaN ... Yarmouth Massachusetts US 999 Kuhn Forge Lowell Massachusetts Middlesex County 1851.0 42.636143 -71.343255 935630.30 8756.19
4 f5dcd418-09fe-4a2f-baa0-3da800bd8c3a 1996-10-18 NaN 999-60-7372 S99915787 X86772962X Mr. Gregorio366 Auer97 NaN ... Patras Achaea GR 1050 Lindgren Extension Apt 38 Boston Massachusetts Suffolk County 2135.0 42.352434 -71.028610 598763.07 3772.20

5 rows × 25 columns

pt_data = cleanup_pts(patients, is_train=True, today=SYNTHEA_DATAGEN_DATES['1K'])
patients, pt_demo, pt_codes = pt_data[0], pt_data[1], pt_data[2]

Our cleanup function produces the following 3 dfs: patients, pt_demographics, and pt_codes.

for df in pt_data:
    display(df.head())
birthdate
patient
1d604da9-9a81-4ba9-80c2-de3375d59b40 1989-05-25
034e9e3b-2def-4559-bb2a-7850888ae060 1983-11-14
10339b10-3cd1-4ac3-ac13-ec26728cb592 1992-06-02
8d4c4326-e9de-4f45-9a4c-f8c36bff89ae 1978-05-27
f5dcd418-09fe-4a2f-baa0-3da800bd8c3a 1996-10-18
birthdate marital race ethnicity gender birthplace city state zip age_now_days
patient
1d604da9-9a81-4ba9-80c2-de3375d59b40 1989-05-25 M white hispanic M Marigot Saint Andrew Parish DM Chicopee Massachusetts 1013 11617
034e9e3b-2def-4559-bb2a-7850888ae060 1983-11-14 M white nonhispanic M Danvers Massachusetts US Somerville Massachusetts 2143 13636
10339b10-3cd1-4ac3-ac13-ec26728cb592 1992-06-02 M white nonhispanic M Springfield Massachusetts US Chicopee Massachusetts 1020 10513
8d4c4326-e9de-4f45-9a4c-f8c36bff89ae 1978-05-27 M white nonhispanic F Yarmouth Massachusetts US Lowell Massachusetts 1851 15633
f5dcd418-09fe-4a2f-baa0-3da800bd8c3a 1996-10-18 xxxnan white nonhispanic M Patras Achaea GR Boston Massachusetts 2135 8914
birthdate marital race ethnicity gender birthplace city state zip age_now_days
0 1989-05-25 M white hispanic M Marigot Saint Andrew Parish DM Chicopee Massachusetts 1013 11617
1 1983-11-14 M white nonhispanic M Danvers Massachusetts US Somerville Massachusetts 2143 13636
2 1992-06-02 M white nonhispanic M Springfield Massachusetts US Chicopee Massachusetts 1020 10513
3 1978-05-27 M white nonhispanic F Yarmouth Massachusetts US Lowell Massachusetts 1851 15633
4 1996-10-18 xxxnan white nonhispanic M Patras Achaea GR Boston Massachusetts 2135 8914

The case for keeping a record of the data generation date

Also note the difference in age_now_days when it is calculated from the default (pd.Timestamp.today()) versus SYNTHEA_DATAGEN_DATES['1K'], which is the data generation date for this 1K dataset.

(pd.to_datetime(pd.Timestamp.today()) - patients.iloc[2])[0].days, (pd.to_datetime(SYNTHEA_DATAGEN_DATES['1K']) - patients.iloc[2])[0].days
(10530, 10513)
SYNTHEA_DATAGEN_DATES['1K'], pd.Timestamp.today()
('03-15-2021', Timestamp('2021-04-01 14:18:36.220590'))

That is, the data generation date for the 1K dataset is set to March 15th, while today (when this example was run) is April 1st, which accounts for the 17-day difference.
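
The arithmetic can be checked in isolation with a minimal sketch. The variable names here are illustrative only; the birthdate is the third patient's from the output above, and the two reference dates are the ones printed above.

```python
import pandas as pd

birthdate = pd.Timestamp("1992-06-02")

# Fixed reference date, written in the same MM-DD-YYYY form as SYNTHEA_DATAGEN_DATES['1K'].
datagen_date = pd.to_datetime("03-15-2021", format="%m-%d-%Y")
# The date on which the example above happened to be run.
run_date = pd.Timestamp("2021-04-01")

age_at_datagen = (datagen_date - birthdate).days
age_at_run = (run_date - birthdate).days
print(age_at_datagen, age_at_run, age_at_run - age_at_datagen)
```

With a fixed reference date the computed age does not change with the day the cleanup is run.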

cleanup_obs[source]

cleanup_obs(obs, is_train)

Clean observations df

  • Drops rows with null in the VALUE column
  • Creates a new code column with a concatenation of code, value, units and type
    • so that we can use the following logic during vocab creation for observations (further detailed in the vocab documentation)

For numeric

for 'numeric'
    get unique 'codes'
    for each unique code
        get unique 'units'
            for each unique unit
                bucketize 'values'
                create vocab entry for each 'bucket' -- code||value_bucket||units

For text

for 'text'
    get unique 'codes'
    for each unique code
        get unique 'units' #this will be null
            for each unique unit
                get unique 'values'
                create vocab entry for each -- code||value||units
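
The steps above can be sketched with pandas as follows. This is an illustration under assumptions, not the library's code: the toy rows, the `MISSING` placeholder (borrowed from the marker visible in the demographics output), and the choice of 2 equal-width buckets are all made up for the example.

```python
import pandas as pd

MISSING = "xxxnan"  # assumed placeholder for null units

# Toy observations, made up for this example.
obs = pd.DataFrame({
    "CODE": ["8302-2", "8302-2", "8302-2", "72166-2"],
    "VALUE": ["193.3", "170.1", None, "Never smoker"],
    "UNITS": ["cm", "cm", "cm", None],
    "TYPE": ["numeric", "numeric", "numeric", "text"],
})

# Drop rows with a null VALUE, then concatenate code, value, units and type.
obs = obs.dropna(subset=["VALUE"]).copy()
obs["code"] = (
    obs["CODE"] + "||" + obs["VALUE"] + "||"
    + obs["UNITS"].fillna(MISSING) + "||" + obs["TYPE"]
)

# Numeric branch: bucketize values per (code, units) and emit one entry per bucket.
numeric = obs[obs["TYPE"] == "numeric"]
numeric_vocab = set()
for (code, units), grp in numeric.groupby(["CODE", "UNITS"]):
    buckets = pd.cut(grp["VALUE"].astype(float), bins=2, labels=False)
    numeric_vocab.update(f"{code}||{b}||{units}" for b in buckets)

print(obs["code"].tolist())
print(sorted(numeric_vocab))
```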

The observations df before cleanup:

observations.head()
DATE PATIENT ENCOUNTER CODE DESCRIPTION VALUE UNITS TYPE
0 2012-01-23T17:45:28Z 034e9e3b-2def-4559-bb2a-7850888ae060 e88bc3a9-007c-405e-aabc-792a38f4aa2b 8302-2 Body Height 193.3 cm numeric
1 2012-01-23T17:45:28Z 034e9e3b-2def-4559-bb2a-7850888ae060 e88bc3a9-007c-405e-aabc-792a38f4aa2b 72514-3 Pain severity - 0-10 verbal numeric rating [Sc... 2.0 {score} numeric
2 2012-01-23T17:45:28Z 034e9e3b-2def-4559-bb2a-7850888ae060 e88bc3a9-007c-405e-aabc-792a38f4aa2b 29463-7 Body Weight 87.8 kg numeric
3 2012-01-23T17:45:28Z 034e9e3b-2def-4559-bb2a-7850888ae060 e88bc3a9-007c-405e-aabc-792a38f4aa2b 39156-5 Body Mass Index 23.5 kg/m2 numeric
4 2012-01-23T17:45:28Z 034e9e3b-2def-4559-bb2a-7850888ae060 e88bc3a9-007c-405e-aabc-792a38f4aa2b 8462-4 Diastolic Blood Pressure 82.0 mm[Hg] numeric
obs_data = cleanup_obs(observations, is_train=True)

And after cleanup:

for df in obs_data:
    display(df.head())
date code
patient
034e9e3b-2def-4559-bb2a-7850888ae060 2012-01-23 17:45:28 8302-2||193.3||cm||numeric
034e9e3b-2def-4559-bb2a-7850888ae060 2012-01-23 17:45:28 72514-3||2.0||{score}||numeric
034e9e3b-2def-4559-bb2a-7850888ae060 2012-01-23 17:45:28 29463-7||87.8||kg||numeric
034e9e3b-2def-4559-bb2a-7850888ae060 2012-01-23 17:45:28 39156-5||23.5||kg/m2||numeric
034e9e3b-2def-4559-bb2a-7850888ae060 2012-01-23 17:45:28 8462-4||82.0||mm[Hg]||numeric
orig_code desc value units type
0 8302-2 Body Height 193.3 cm numeric
1 72514-3 Pain severity - 0-10 verbal numeric rating [Sc... 2.0 {score} numeric
2 29463-7 Body Weight 87.8 kg numeric
3 39156-5 Body Mass Index 23.5 kg/m2 numeric
4 8462-4 Diastolic Blood Pressure 82.0 mm[Hg] numeric

cleanup_algs[source]

cleanup_algs(allergies, is_train)

Clean allergies df

Each allergies row holds both a start and a stop date, indicating when an allergy (identified by its code) started and, if applicable, stopped for a patient.
The cleanup flattens this out: every row becomes a START record, and every row with a non-null stop date also produces a separate STOP record.
The dataframe looks as follows before cleanup:

allergies.head()
START STOP PATIENT ENCOUNTER CODE DESCRIPTION
0 1982-10-25 NaN 76982e06-f8b8-4509-9ca3-65a99c8650fe b896bf40-8b72-42b7-b205-142ee3a56b55 300916003 Latex allergy
1 1982-10-25 NaN 76982e06-f8b8-4509-9ca3-65a99c8650fe b896bf40-8b72-42b7-b205-142ee3a56b55 300913006 Shellfish allergy
2 2002-01-25 NaN 71ba0469-f0cc-4177-ac70-ea07cb01c8b8 7be1a590-4239-4826-9872-031327f3c368 419474003 Allergy to mould
3 2002-01-25 NaN 71ba0469-f0cc-4177-ac70-ea07cb01c8b8 7be1a590-4239-4826-9872-031327f3c368 232347008 Dander (animal) allergy
4 2002-01-25 NaN 71ba0469-f0cc-4177-ac70-ea07cb01c8b8 7be1a590-4239-4826-9872-031327f3c368 418689008 Allergy to grass pollen
alg_data = cleanup_algs(allergies, is_train=True)

This results in the following output after cleanup:

for df in alg_data:
    display(df.head(3))
    display(df.tail(3))
date code
patient
76982e06-f8b8-4509-9ca3-65a99c8650fe 1982-10-25 300916003||START
76982e06-f8b8-4509-9ca3-65a99c8650fe 1982-10-25 300913006||START
71ba0469-f0cc-4177-ac70-ea07cb01c8b8 2002-01-25 419474003||START
date code
patient
96942a16-75bc-4026-bd63-e985b0ca1d6d 2016-09-18 418689008||STOP
96942a16-75bc-4026-bd63-e985b0ca1d6d 2016-09-18 419263009||STOP
e6ff4bf9-09c2-4976-aa84-cca142207cf8 2016-06-25 300916003||STOP
code desc
0 300916003||START Latex allergy
1 300913006||START Shellfish allergy
2 419474003||START Allergy to mould
code desc
658 418689008||STOP Allergy to grass pollen
659 419263009||STOP Allergy to tree pollen
660 300916003||STOP Latex allergy
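
The flattening of start and stop dates into separate rows can be sketched as below. This is a minimal illustration, not the library's implementation: `flatten_start_stop` and the toy rows are hypothetical, and only the columns needed for the idea are kept.

```python
import pandas as pd

# Toy allergies rows, made up for this example.
allergies_demo = pd.DataFrame({
    "START": ["1982-10-25", "2002-01-25"],
    "STOP": [None, "2016-09-18"],
    "PATIENT": ["pt-a", "pt-b"],
    "CODE": ["300916003", "418689008"],
})

def flatten_start_stop(df):
    """Hypothetical helper: one row per START event and one per non-null STOP event."""
    starts = df[["PATIENT", "START", "CODE"]].rename(columns={"START": "date"})
    starts = starts.assign(code=starts["CODE"] + "||START")
    # Only rows that actually have a stop date yield a STOP event.
    stops = df.dropna(subset=["STOP"])[["PATIENT", "STOP", "CODE"]].rename(columns={"STOP": "date"})
    stops = stops.assign(code=stops["CODE"] + "||STOP")
    out = pd.concat([starts, stops], ignore_index=True).drop(columns="CODE")
    out["date"] = pd.to_datetime(out["date"])
    return out.rename(columns={"PATIENT": "patient"}).set_index("patient")

flat = flatten_start_stop(allergies_demo)
print(flat)
```

The same pattern would apply to any table with START and STOP columns, such as careplans, medications and conditions.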

cleanup_crpls[source]

cleanup_crpls(careplans, is_train)

Clean careplans df

careplans.head()
Id START STOP PATIENT ENCOUNTER CODE DESCRIPTION REASONCODE REASONDESCRIPTION
0 d2500b8c-e830-433a-8b9d-368d30741520 2010-01-23 2012-01-23 034e9e3b-2def-4559-bb2a-7850888ae060 d0c40d10-8d87-447e-836e-99d26ad52ea5 53950000 Respiratory therapy 10509002.0 Acute bronchitis (disorder)
1 07d9ddd8-dfa1-4e43-9bfe-39f63f4ace15 2011-05-13 2011-08-02 10339b10-3cd1-4ac3-ac13-ec26728cb592 e1ab4933-07a1-49f0-b4bd-05500919061d 53950000 Respiratory therapy 10509002.0 Acute bronchitis (disorder)
2 a3bb6e99-3b99-44b3-974c-e230b4511b5c 2011-12-31 2012-11-30 f5dcd418-09fe-4a2f-baa0-3da800bd8c3a 16300c56-a035-4126-a656-68c093da6dfc 53950000 Respiratory therapy 10509002.0 Acute bronchitis (disorder)
3 9f5284b7-425a-486a-b36e-ab818c018f2f 2016-12-29 2017-01-05 034e9e3b-2def-4559-bb2a-7850888ae060 3b639086-5fbc-4720-8c31-e8c8c0f1d660 53950000 Respiratory therapy 10509002.0 Acute bronchitis (disorder)
4 47ede16c-c216-4f81-a16b-0e858de9cdc3 2017-01-22 2017-02-12 10339b10-3cd1-4ac3-ac13-ec26728cb592 4ec8d55b-05fc-42a5-bfa3-1e233874a362 225358003 Wound care 284551006.0 Laceration of foot
crpl_data = cleanup_crpls(careplans, is_train=True)
for df in crpl_data:
    display(df.head(3))
    display(df.tail(3))
date code
patient
034e9e3b-2def-4559-bb2a-7850888ae060 2010-01-23 53950000||START
10339b10-3cd1-4ac3-ac13-ec26728cb592 2011-05-13 53950000||START
f5dcd418-09fe-4a2f-baa0-3da800bd8c3a 2011-12-31 53950000||START
date code
patient
6d048a56-edb8-4f29-891d-7a84d75a8e78 2002-11-30 53950000||STOP
fca3178e-fb68-41c3-8598-702d3ca68b96 1983-09-29 91251008||STOP
fca3178e-fb68-41c3-8598-702d3ca68b96 1984-11-22 385691007||STOP
code desc
0 53950000||START Respiratory therapy
1 53950000||START Respiratory therapy
2 53950000||START Respiratory therapy
code desc
5431 53950000||STOP Respiratory therapy
5432 91251008||STOP Physical therapy procedure
5433 385691007||STOP Fracture care

cleanup_meds[source]

cleanup_meds(medications, is_train)

Clean medications df

medications.head()
START STOP PATIENT PAYER ENCOUNTER CODE DESCRIPTION BASE_COST PAYER_COVERAGE DISPENSES TOTALCOST REASONCODE REASONDESCRIPTION
0 2010-05-05T00:26:23Z 2011-04-30T00:26:23Z 8d4c4326-e9de-4f45-9a4c-f8c36bff89ae b1c428d6-4f07-31e0-90f0-68ffa6ff8c76 1e0d6b0e-1711-4a25-99f9-b1c700c9b260 389221 Etonogestrel 68 MG Drug Implant 677.08 0.0 12 8124.96 NaN NaN
1 2011-04-30T00:26:23Z 2012-04-24T00:26:23Z 8d4c4326-e9de-4f45-9a4c-f8c36bff89ae b1c428d6-4f07-31e0-90f0-68ffa6ff8c76 6aa37300-d1b4-48e7-a2f8-5e0f70f48f38 389221 Etonogestrel 68 MG Drug Implant 624.09 0.0 12 7489.08 NaN NaN
2 2012-04-24T00:26:23Z 2013-04-19T00:26:23Z 8d4c4326-e9de-4f45-9a4c-f8c36bff89ae b1c428d6-4f07-31e0-90f0-68ffa6ff8c76 7253a9f9-6f6d-429a-926a-7b1d424eae3f 748856 Yaz 28 Day Pack 43.32 0.0 12 519.84 NaN NaN
3 2011-05-13T12:58:08Z 2011-05-27T12:58:08Z 10339b10-3cd1-4ac3-ac13-ec26728cb592 d47b3510-2895-3b70-9897-342d681c769d e1ab4933-07a1-49f0-b4bd-05500919061d 313782 Acetaminophen 325 MG Oral Tablet 8.14 0.0 1 8.14 10509002.0 Acute bronchitis (disorder)
4 2011-12-08T15:02:18Z 2011-12-22T15:02:18Z 1d604da9-9a81-4ba9-80c2-de3375d59b40 b1c428d6-4f07-31e0-90f0-68ffa6ff8c76 792fae81-a007-44b0-8221-46953737b089 562251 Amoxicillin 250 MG / Clavulanate 125 MG Oral T... 11.91 0.0 1 11.91 444814009.0 Viral sinusitis (disorder)
med_data = cleanup_meds(medications, is_train=True)
for df in med_data:
    display(df.head(3))
    display(df.tail(3))
date code
patient
8d4c4326-e9de-4f45-9a4c-f8c36bff89ae 2010-05-05 00:26:23 389221||START
8d4c4326-e9de-4f45-9a4c-f8c36bff89ae 2011-04-30 00:26:23 389221||START
8d4c4326-e9de-4f45-9a4c-f8c36bff89ae 2012-04-24 00:26:23 748856||START
date code
patient
6d048a56-edb8-4f29-891d-7a84d75a8e78 2005-12-31 17:27:52 2123111||STOP
fca3178e-fb68-41c3-8598-702d3ca68b96 1983-09-29 17:27:52 243670||STOP
fca3178e-fb68-41c3-8598-702d3ca68b96 1984-11-22 17:27:52 313782||STOP
code desc
0 389221||START Etonogestrel 68 MG Drug Implant
1 389221||START Etonogestrel 68 MG Drug Implant
2 748856||START Yaz 28 Day Pack
code desc
84080 2123111||STOP NDA020503 200 ACTUAT Albuterol 0.09 MG/ACTUAT ...
84081 243670||STOP Aspirin 81 MG Oral Tablet
84082 313782||STOP Acetaminophen 325 MG Oral Tablet

cleanup_img[source]

cleanup_img(imaging_studies, is_train)

Clean imaging df

imaging_studies.head()
Id DATE PATIENT ENCOUNTER BODYSITE_CODE BODYSITE_DESCRIPTION MODALITY_CODE MODALITY_DESCRIPTION SOP_CODE SOP_DESCRIPTION
0 d3e49b38-7634-4416-879d-7bc68bf3e7df 2014-07-08T15:35:36Z b58731cc-2d8b-4c2d-b327-4cab771af3ef 3a36836d-da25-4e73-808b-972b669b7e4e 40983000 Arm DX Digital Radiography 1.2.840.10008.5.1.4.1.1.1.1 Digital X-Ray Image Storage
1 46baf530-4941-40ab-8219-685a08fd9086 2014-01-22T18:58:37Z 2ffe9369-24e4-414b-8973-258fad09313a 33b71e4b-0690-4fe9-897a-dc3b2ff9215c 40983000 Arm DX Digital Radiography 1.2.840.10008.5.1.4.1.1.1.1 Digital X-Ray Image Storage
2 b8fb8a6e-a2f5-46c9-8b3f-a35aa982efcd 2001-12-01T02:08:27Z 86b97fc7-ae8f-4e0d-8e66-db68f36e7a76 e42d1046-568d-46c2-b0a5-d910b2f3bd1d 8205005 Wrist DX Digital Radiography 1.2.840.10008.5.1.4.1.1.1.1 Digital X-Ray Image Storage
3 10c8a016-4504-4653-bddf-2dd3610886c8 2004-07-03T20:46:46Z 71ba0469-f0cc-4177-ac70-ea07cb01c8b8 323fca87-817f-4d58-8486-ba92ea739399 51299004 Clavicle DX Digital Radiography 1.2.840.10008.5.1.4.1.1.1.1 Digital X-Ray Image Storage
4 4221534c-d379-4c6b-a22e-d7eae3fa2609 2017-02-08T08:42:44Z d49f748f-928d-40e8-92c8-73e4c5679711 cfef48b3-b769-4794-a3e7-f57f7ba8d387 344001 Ankle DX Digital Radiography 1.2.840.10008.5.1.4.1.1.1.1 Digital X-Ray Image Storage
img_data = cleanup_img(imaging_studies, is_train=True)
for df in img_data:
    display(df.head(3))
date code
patient
b58731cc-2d8b-4c2d-b327-4cab771af3ef 2014-07-08 15:35:36 40983000
2ffe9369-24e4-414b-8973-258fad09313a 2014-01-22 18:58:37 40983000
86b97fc7-ae8f-4e0d-8e66-db68f36e7a76 2001-12-01 02:08:27 8205005
code desc
0 40983000 Arm
1 40983000 Arm
2 8205005 Wrist

cleanup_procs[source]

cleanup_procs(procedures, is_train)

Clean procedures df

procedures.head()
DATE PATIENT ENCOUNTER CODE DESCRIPTION BASE_COST REASONCODE REASONDESCRIPTION
0 2011-04-30T00:26:23Z 8d4c4326-e9de-4f45-9a4c-f8c36bff89ae 6aa37300-d1b4-48e7-a2f8-5e0f70f48f38 169553002 Insertion of subcutaneous contraceptive 14896.56 NaN NaN
1 2010-07-27T12:58:08Z 10339b10-3cd1-4ac3-ac13-ec26728cb592 dae2b7cb-1316-4b78-954f-fa610a6c6d0e 430193006 Medication Reconciliation (procedure) 726.51 NaN NaN
2 2010-11-20T03:04:34Z f5dcd418-09fe-4a2f-baa0-3da800bd8c3a 7ff86631-0378-4bfc-92ce-1edd697eb18e 430193006 Medication Reconciliation (procedure) 788.50 NaN NaN
3 2011-02-07T03:04:34Z f5dcd418-09fe-4a2f-baa0-3da800bd8c3a b8f76eba-7795-4dcd-a544-f27ac2ef3d46 117015009 Throat culture (procedure) 2070.44 195662009.0 Acute viral pharyngitis (disorder)
4 2011-04-19T03:04:34Z f5dcd418-09fe-4a2f-baa0-3da800bd8c3a 640837d9-845a-433c-9fad-47426664a69d 117015009 Throat culture (procedure) 2479.39 195662009.0 Acute viral pharyngitis (disorder)
proc_data = cleanup_procs(procedures, is_train=True)
for df in proc_data:
    display(df.head(3))
date code
patient
8d4c4326-e9de-4f45-9a4c-f8c36bff89ae 2011-04-30 00:26:23 169553002
10339b10-3cd1-4ac3-ac13-ec26728cb592 2010-07-27 12:58:08 430193006
f5dcd418-09fe-4a2f-baa0-3da800bd8c3a 2010-11-20 03:04:34 430193006
code desc
0 169553002 Insertion of subcutaneous contraceptive
1 430193006 Medication Reconciliation (procedure)
2 430193006 Medication Reconciliation (procedure)

cleanup_cnds[source]

cleanup_cnds(conditions, is_train)

Clean conditions df

conditions.head()
START STOP PATIENT ENCOUNTER CODE DESCRIPTION
0 2001-05-01 NaN 1d604da9-9a81-4ba9-80c2-de3375d59b40 8f104aa7-4ca9-4473-885a-bba2437df588 40055000 Chronic sinusitis (disorder)
1 2011-08-09 2011-08-16 8d4c4326-e9de-4f45-9a4c-f8c36bff89ae 9d35ec9f-352a-4629-92ef-38eae38437e7 444814009 Viral sinusitis (disorder)
2 2011-11-16 2011-11-26 8d4c4326-e9de-4f45-9a4c-f8c36bff89ae ae7555a9-eaff-4c09-98a7-21bc6ed1b1fd 195662009 Acute viral pharyngitis (disorder)
3 2011-05-13 2011-05-27 10339b10-3cd1-4ac3-ac13-ec26728cb592 e1ab4933-07a1-49f0-b4bd-05500919061d 10509002 Acute bronchitis (disorder)
4 2011-02-06 2011-02-14 f5dcd418-09fe-4a2f-baa0-3da800bd8c3a b8f76eba-7795-4dcd-a544-f27ac2ef3d46 195662009 Acute viral pharyngitis (disorder)
cnd_data = cleanup_cnds(conditions, is_train=True)
for df in cnd_data:
    display(df.head(3))
    display(df.tail(3))
date code
patient
1d604da9-9a81-4ba9-80c2-de3375d59b40 2001-05-01 40055000||START
8d4c4326-e9de-4f45-9a4c-f8c36bff89ae 2011-08-09 444814009||START
8d4c4326-e9de-4f45-9a4c-f8c36bff89ae 2011-11-16 195662009||START
date code
patient
fca3178e-fb68-41c3-8598-702d3ca68b96 1986-03-02 43878008||STOP
fc817953-cc8b-45db-9c85-7c0ced8fa90d 2010-11-25 444814009||STOP
fc817953-cc8b-45db-9c85-7c0ced8fa90d 2012-05-14 444814009||STOP
code desc
0 40055000||START Chronic sinusitis (disorder)
1 444814009||START Viral sinusitis (disorder)
2 195662009||START Acute viral pharyngitis (disorder)
code desc
12938 43878008||STOP Streptococcal sore throat (disorder)
12939 444814009||STOP Viral sinusitis (disorder)
12940 444814009||STOP Viral sinusitis (disorder)

cleanup_immns[source]

cleanup_immns(immunizations, is_train)

Clean immunizations df

immunizations.head()
DATE PATIENT ENCOUNTER CODE DESCRIPTION BASE_COST
0 2010-07-27T12:58:08Z 10339b10-3cd1-4ac3-ac13-ec26728cb592 dae2b7cb-1316-4b78-954f-fa610a6c6d0e 140 Influenza seasonal injectable preservative ... 140.52
1 2010-11-20T03:04:34Z f5dcd418-09fe-4a2f-baa0-3da800bd8c3a 7ff86631-0378-4bfc-92ce-1edd697eb18e 140 Influenza seasonal injectable preservative ... 140.52
2 2012-01-23T17:45:28Z 034e9e3b-2def-4559-bb2a-7850888ae060 e88bc3a9-007c-405e-aabc-792a38f4aa2b 140 Influenza seasonal injectable preservative ... 140.52
3 2011-11-26T03:04:34Z f5dcd418-09fe-4a2f-baa0-3da800bd8c3a 1923c698-accd-4d70-ba09-e1938f6e96d1 140 Influenza seasonal injectable preservative ... 140.52
4 2011-07-28T15:02:18Z 1d604da9-9a81-4ba9-80c2-de3375d59b40 b85c339a-6076-43ed-b9d0-9cf013dec49d 140 Influenza seasonal injectable preservative ... 140.52
imm_data = cleanup_immns(immunizations, is_train=True)
for df in imm_data:
    display(df.head(3))
date code
patient
10339b10-3cd1-4ac3-ac13-ec26728cb592 2010-07-27 12:58:08 140
f5dcd418-09fe-4a2f-baa0-3da800bd8c3a 2010-11-20 03:04:34 140
034e9e3b-2def-4559-bb2a-7850888ae060 2012-01-23 17:45:28 140
code desc
0 140 Influenza seasonal injectable preservative ...
1 140 Influenza seasonal injectable preservative ...
2 140 Influenza seasonal injectable preservative ...

Clean all

cleanup_dataset[source]

cleanup_dataset(path, is_train, today=None)

Clean all dfs in a split

data_tables, code_tables = cleanup_dataset(f'{PATH_1K}/raw_split/train', is_train=True)

patients, pt_demographics, observations, allergies, \
careplans, medications, imaging_studies, procedures, conditions, immunizations = data_tables

pt_codes, obs_codes, alg_codes, crpl_codes, med_codes, img_codes, proc_codes, cnd_codes, imm_codes = code_tables
conditions.count()
date    7666
code    7666
dtype: int64
obs_codes.count()
orig_code    173312
desc         173312
value        173312
units        173312
type         173312
dtype: int64

Extract Labels (y)

The labels we intend to predict are conditions, and each must be in the CONDITIONS dict. Extracting them means:

  • Adding them to the patients df
  • Adding the patient's age when the particular condition was recorded
for key in CONDITIONS.keys():
    print(key,"::",f'{CONDITIONS[key]}||START')
diabetes :: 44054006||START
stroke :: 230690007||START
alzheimers :: 26929004||START
coronary_heart :: 53741008||START
lung_cancer :: 254637007||START
breast_cancer :: 254837009||START
rheumatoid_arthritis :: 69896004||START
epilepsy :: 84757009||START

extract_ys[source]

extract_ys(patients, conditions, cnd_dict)

Extract labels from conditions df and add them to patients df with age

tmp_pts = extract_ys(patients, conditions, cnd_dict=CONDITIONS)
tmp_pts.count()
birthdate                   702
diabetes                    702
diabetes_age                 43
stroke                      702
stroke_age                   30
alzheimers                  702
alzheimers_age               12
coronary_heart              702
coronary_heart_age           39
lung_cancer                 702
lung_cancer_age              12
breast_cancer               702
breast_cancer_age            11
rheumatoid_arthritis        702
rheumatoid_arthritis_age      2
epilepsy                    702
epilepsy_age                 15
dtype: int64
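
One way such label extraction could work is sketched below. It is not the library's implementation: `extract_ys_sketch`, the toy dataframes and `conditions_demo_dict` are hypothetical, and computing age in whole years as days // 365 is an assumption.

```python
import pandas as pd

conditions_demo_dict = {"diabetes": "44054006", "stroke": "230690007"}

# Toy cleaned dataframes indexed by patient, made up for this example.
patients_demo = pd.DataFrame(
    {"birthdate": pd.to_datetime(["1958-09-08", "1990-01-01"])},
    index=pd.Index(["pt-a", "pt-b"], name="patient"),
)
cnds_demo = pd.DataFrame(
    {"date": pd.to_datetime(["2010-10-01"]), "code": ["44054006||START"]},
    index=pd.Index(["pt-a"], name="patient"),
)

def extract_ys_sketch(patients, conditions, cnd_dict):
    """Hypothetical: add a boolean label and an age-at-onset column per condition."""
    out = patients.copy()
    for name, code in cnd_dict.items():
        onset = conditions.loc[conditions["code"] == f"{code}||START", "date"].sort_values()
        onset = onset[~onset.index.duplicated(keep="first")]  # earliest onset per patient
        out[name] = out.index.isin(onset.index)
        # Patients without the condition get NaN for the age column.
        out[f"{name}_age"] = (onset.reindex(out.index) - out["birthdate"]).dt.days // 365
    return out

labelled = extract_ys_sketch(patients_demo, cnds_demo, conditions_demo_dict)
print(labelled)
```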

Insert Age

Inserting the patient's age in months and years into each record df

  • this can be modified to record the patient's age in days or even hours, which might be more relevant for datasets involving hospitalizations or ER admissions

insert_age[source]

insert_age(df, pts_df)

Insert age in years and months into each of the rec dfs
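
A minimal sketch of the idea, assuming both dataframes are indexed by patient and that records carry a `date` column as in the cleaned outputs above. `insert_age_sketch` and the output column names `age` and `age_months` are assumptions, not the library's actual names.

```python
import pandas as pd

def insert_age_sketch(df, pts_df):
    """Hypothetical: add the patient's age in completed months and years at each record date."""
    out = df.join(pts_df["birthdate"])  # align on the shared patient index
    months = (
        (out["date"].dt.year - out["birthdate"].dt.year) * 12
        + (out["date"].dt.month - out["birthdate"].dt.month)
    )
    # Subtract one month if the day of the month has not been reached yet.
    months = months - (out["date"].dt.day < out["birthdate"].dt.day).astype(int)
    out["age_months"] = months
    out["age"] = months // 12
    return out.drop(columns="birthdate")

# Toy data, made up for this example.
pts_demo = pd.DataFrame(
    {"birthdate": pd.to_datetime(["1992-06-02"])},
    index=pd.Index(["pt-a"], name="patient"),
)
recs_demo = pd.DataFrame(
    {"date": pd.to_datetime(["2011-05-13", "2012-06-01"]), "code": ["x", "y"]},
    index=pd.Index(["pt-a", "pt-a"], name="patient"),
)
aged = insert_age_sketch(recs_demo, pts_demo)
print(aged)
```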

Do-All Functions

The actual functions that will be called from other modules

clean_raw_ehrdata[source]

clean_raw_ehrdata(path, valid_pct, test_pct, conditions_dict, today=None)

Split, clean, preprocess & save raw EHR data

load_cleaned_ehrdata[source]

load_cleaned_ehrdata(path)

Load cleaned, age-filtered EHR data

load_ehr_vocabcodes[source]

load_ehr_vocabcodes(path)

Load codes for vocabs

clean_raw_ehrdata(PATH_1K, 0.2, 0.2, CONDITIONS, SYNTHEA_DATAGEN_DATES['1K'])
Splits:: train: 0.6, valid: 0.2, test: 0.2
Split patients into:: Train: 702, Valid: 234, Test: 235 -- Total before split: 1171
Saved train data to /home/vinod/.lemonpie/datasets/synthea/1K/raw_split/train
Saved valid data to /home/vinod/.lemonpie/datasets/synthea/1K/raw_split/valid
Saved test data to /home/vinod/.lemonpie/datasets/synthea/1K/raw_split/test
Saved cleaned "train" data to /home/vinod/.lemonpie/datasets/synthea/1K/cleaned/train
Saved vocab code tables to /home/vinod/.lemonpie/datasets/synthea/1K/cleaned/train/codes
Saved cleaned "valid" data to /home/vinod/.lemonpie/datasets/synthea/1K/cleaned/valid
Saved cleaned "test" data to /home/vinod/.lemonpie/datasets/synthea/1K/cleaned/test
train_dfs, valid_dfs, test_dfs = load_cleaned_ehrdata(PATH_1K)
code_dfs = load_ehr_vocabcodes(PATH_1K)
thispt = train_dfs[0].iloc[20]
thispt
patient                     10134dbf-72d1-4381-b8f3-9530cca6622a
birthdate                                             1958-09-08
diabetes                                                    True
diabetes_age                                                52.0
stroke                                                      True
stroke_age                                                  60.0
alzheimers                                                 False
alzheimers_age                                               NaN
coronary_heart                                             False
coronary_heart_age                                           NaN
lung_cancer                                                False
lung_cancer_age                                              NaN
breast_cancer                                              False
breast_cancer_age                                            NaN
rheumatoid_arthritis                                       False
rheumatoid_arthritis_age                                     NaN
epilepsy                                                   False
epilepsy_age                                                 NaN
Name: 20, dtype: object

Making sure condition counts match - after extracting y for each patient

CONDITIONS
OrderedDict([('diabetes', '44054006'),
             ('stroke', '230690007'),
             ('alzheimers', '26929004'),
             ('coronary_heart', '53741008'),
             ('lung_cancer', '254637007'),
             ('breast_cancer', '254837009'),
             ('rheumatoid_arthritis', '69896004'),
             ('epilepsy', '84757009')])

patients dfs after cleaning, with y extracted

pts_train, pts_valid, pts_test = train_dfs[0], valid_dfs[0], test_dfs[0]

conditions dfs

cnd_train, cnd_valid, cnd_test = train_dfs[8], valid_dfs[8], test_dfs[8]

Counts for each condition in conditions and patients dfs in each split

for pts, cnds, split in zip([pts_train, pts_valid, pts_test],[cnd_train, cnd_valid, cnd_test], ['train','valid','test']):
    print('\n',split)
    # For each condition: number of START rows in conditions vs. number of positive labels in patients
    for name, code in CONDITIONS.items():
        print(f'{name}:: ', len(cnds[cnds['code'] == f'{code}||START']), len(pts[pts[name] == 1]))
 train
diabetes::  43 43
stroke::  30 30
alzheimers::  12 12
coronary_heart::  39 39
lung_cancer::  12 12
breast_cancer::  11 11
rheumatoid_arthritis::  2 2
epilepsy::  15 15

 valid
diabetes::  14 14
stroke::  7 7
alzheimers::  7 7
coronary_heart::  11 11
lung_cancer::  0 0
breast_cancer::  8 8
rheumatoid_arthritis::  0 0
epilepsy::  5 5

 test
diabetes::  19 19
stroke::  11 11
alzheimers::  6 6
coronary_heart::  11 11
lung_cancer::  2 2
breast_cancer::  2 2
rheumatoid_arthritis::  0 0
epilepsy::  2 2
for pts, cnds, split in zip([pts_train, pts_valid, pts_test],[cnd_train, cnd_valid, cnd_test], ['train','valid','test']):
    # The two counts must agree for every condition in every split
    for name, code in CONDITIONS.items():
        assert len(cnds[cnds['code'] == f'{code}||START']) == len(pts[pts[name] == 1]), f'error in {split} for {name}'