Classes and functions to create vocabs from cleaned EHR data.
 

nn.Embedding and nn.EmbeddingBag

nn.Embedding

import torch
import torch.nn as nn
emb1 = nn.Embedding(5,3)
emb1(torch.LongTensor([[0,1,2,3,4]]))
tensor([[[-1.0022, -0.7335, -0.5919],
         [ 1.0421, -0.1873,  0.3169],
         [ 1.5152,  0.7996, -0.4156],
         [ 0.5407,  0.5479,  0.0524],
         [ 0.5911,  0.2444, -1.1184]]], grad_fn=<EmbeddingBackward>)

An embedding matrix is a lookup table:

  1. emb1 above has 5 rows, one row per element in the vocabulary
  2. looking up an element returns the 3-dimensional vector stored in that element's row

Given this embedding matrix, looking up elements 1, 2, and 4 looks like this:

input = torch.LongTensor([[1,2,4]])
emb1(input)
tensor([[[ 1.0421, -0.1873,  0.3169],
         [ 1.5152,  0.7996, -0.4156],
         [ 0.5911,  0.2444, -1.1184]]], grad_fn=<EmbeddingBackward>)
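
Under the hood this lookup is just row indexing into the weight matrix; a quick check (added here, not from the original notebook), reusing emb1 from above:

rows = emb1.weight[torch.LongTensor([1,2,4])]  # direct row indexing into the 5x3 weight matrix
assert torch.equal(rows, emb1(torch.LongTensor([[1,2,4]]))[0])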

A batch of inputs is also possible (in this case a batch of 2, with 3 elements looked up in each).

  • Note that all inputs in a batch must look up the same number of elements
input = torch.LongTensor([[1,2,4],[0,3,2]])
# input = torch.LongTensor([[1,2,4],[0,3,2,1]]) # this will fail
emb1(input)
tensor([[[ 1.0421, -0.1873,  0.3169],
         [ 1.5152,  0.7996, -0.4156],
         [ 0.5911,  0.2444, -1.1184]],

        [[-1.0022, -0.7335, -0.5919],
         [ 0.5407,  0.5479,  0.0524],
         [ 1.5152,  0.7996, -0.4156]]], grad_fn=<EmbeddingBackward>)

nn.EmbeddingBag

embg1 = nn.EmbeddingBag(5,3)

Exactly the same input as for nn.Embedding above (a batch of 2), but

  • the result is averaged across the 3 elements in each bag (nn.EmbeddingBag defaults to mode='mean'; 'sum' and 'max' are also available)
  • resulting in an output of 2 vectors, not 6 as above
input = torch.LongTensor([[1,2,4],[0,3,2]]) # exactly same as above, but o/p is avg'd now
embg1(input)
tensor([[-1.3234,  0.5683,  0.5218],
        [-0.6582,  1.0977,  0.2710]], grad_fn=<EmbeddingBagBackward>)

Another way to do this is to pass in offsets rather than separating the input into 2 (or n) separate lists.

input = torch.LongTensor([1,2,4,0,3,2]) #same as above - 2 of same length 3
offsets = torch.LongTensor([0,3]) # output will be avg'd by default
embg1(input, offsets)
tensor([[-1.3234,  0.5683,  0.5218],
        [-0.6582,  1.0977,  0.2710]], grad_fn=<EmbeddingBagBackward>)
input = torch.LongTensor([1,2,4,2,0,3,3,2]) # again a batch of 2 inputs, but of length 4 each
offsets = torch.LongTensor([0,4])
embg1(input, offsets) #avg'd 2 outputs one for each input batch i.e. avg'd across 4 in each batch
tensor([[-1.3720,  0.7772,  0.7163],
        [-0.2795,  1.0111,  0.0585]], grad_fn=<EmbeddingBagBackward>)

Different Sizes

Offsets allow the bags in a batch to have different lengths.

input = torch.LongTensor([1,2,4,2,0,3,3,2]) #same input as above but .. 
offsets = torch.LongTensor([0,3,5]) # indicates 3 bags of different lengths: (0,1,2), (3,4), (5,6,7)
embg1(input, offsets)
tensor([[-1.3234,  0.5683,  0.5218],
        [-1.4155,  1.2710,  0.6960],
        [ 0.0649,  0.9687,  0.0473]], grad_fn=<EmbeddingBagBackward>)
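
For reference (an added check, not from the original notebook): since nn.EmbeddingBag defaults to mode='mean', each output row is just the mean of the corresponding rows of embg1.weight. Reusing input and offsets from the cell above:

out = embg1(input, offsets)
manual = torch.stack([embg1.weight[input[0:3]].mean(0),   # bag 1: positions 0..2
                      embg1.weight[input[3:5]].mean(0),   # bag 2: positions 3..4
                      embg1.weight[input[5:8]].mean(0)])  # bag 3: positions 5..7
assert torch.allclose(out, manual)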

Application to EHR Data

TODO - Details

itoc, ctoi, ctod, numericalize, textify

These have the same meaning as in fastai v1:

  • itoc = index to code
  • ctoi = code to index, the reverse of itoc
  • ctod = code to description, a new addition, used to show descriptions when they exist for some types of EHR codes
  • numericalize() = returns numericalized ids (via ctoi) for a set of codes
  • textify() = the reverse of numericalize(), returns the codes (via itoc) for a set of numericalized ids, along with descriptions if they exist (via ctod)

I tried extending the fastai vocabs, but found it easier to write these from scratch, given how unique EHR data is.
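
A minimal sketch of the idea (hypothetical code, not the library's implementation): itoc is a list, ctoi is a dict built from it, and numericalize/textify are lookups in opposite directions.

# hypothetical stripped-down vocab illustrating itoc / ctoi / ctod
itoc = ['xxnone', 'xxunk', '8302-2', '29463-7']             # index -> code
ctoi = {c: i for i, c in enumerate(itoc)}                   # code -> index
ctod = {'8302-2': 'Body Height', '29463-7': 'Body Weight'}  # code -> description

def numericalize(codes): return [ctoi.get(str(c), ctoi['xxunk']) for c in codes]  # unknowns -> xxunk
def textify(indxs):      return [(itoc[i], ctod.get(itoc[i], '')) for i in indxs]

numericalize(['8302-2', 'blah'])  # [2, 1]
textify([2, 3])                   # [('8302-2', 'Body Height'), ('29463-7', 'Body Weight')]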

Vocabs

 
code_dfs = load_ehr_vocabcodes(PATH_1K)
pt_codes, obs_codes, alg_codes, crpl_codes, med_codes, img_codes, proc_codes, cnd_codes, immn_codes = code_dfs

class EhrVocab[source]

EhrVocab(itoc, ctoi, ctod=None)

Vocab class for most EHR datatypes

EhrVocab.create[source]

EhrVocab.create(codes_df)

Create vocab object (itoc, ctoi and maybe ctod) from the codes df

EhrVocab.numericalize[source]

EhrVocab.numericalize(codes, log_excep=True, log_dir='default_log_store')

Lookup and return indices for codes

EhrVocab.textify[source]

EhrVocab.textify(indxs)

Look up and return codes and descriptions for the given indices

EhrVocab.get_emb_dims[source]

EhrVocab.get_emb_dims(αd=0.5736)

Get embedding dimensions
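
The dimensions returned throughout this page are consistent with the common "6 · n^0.25" embedding-size rule of thumb scaled by αd; a hypothetical reconstruction (an inference from the outputs below, not the library source):

# assumed formula: dim = round(αd * 6 * vocab_size ** 0.25)
def emb_dims(vocab_size, αd=0.5736):
    return (vocab_size, round(αd * 6 * vocab_size ** 0.25))

emb_dims(536)  # (536, 17) - matches obs_vocab below
emb_dims(33)   # (33, 8)   - matches the first demographics vocab below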

class ObsVocab[source]

ObsVocab(vocab_df) :: EhrVocab

Special Vocab class for Observation codes

ObsVocab.create[source]

ObsVocab.create(obs_codes, num_buckets=5)

Create vocab object from observation codes
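
The bucketed Body Height values shown further down (45.1, 83.5, 121.9, 160.3, ...) are evenly spaced, which suggests equal-width binning of each numeric code's value range into num_buckets buckets. A plausible sketch (an assumption, not the library source):

# plausible equal-width bucketing of one numeric observation code (assumed approach)
import pandas as pd
values = pd.Series([25.9, 60.0, 100.0, 150.0, 217.9])  # made-up heights in cm
buckets = pd.cut(values, bins=5)                       # 5 equal-width intervals
bucket_values = buckets.apply(lambda iv: iv.mid)       # represent each bucket by its midpoint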

ObsVocab.numericalize[source]

ObsVocab.numericalize(codes, log_excep=True, log_dir='default_log_store')

Numericalize observation codes (return indices for codes)

  • split the incoming concatenated code||value||units||type string
  • get a result_df matching everything except value
  • then argsort() the value column to find the bucket with the closest value (sketched below)
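
A rough sketch of that closest-value step (illustrative only; vocab_df and its column names are assumptions based on the obs_codes layout below):

# illustrative: pick the vocab row whose bucketed value is closest to the incoming value
code, value, units, typ = '8302-2||200.3||cm||numeric'.split('||')
result_df = vocab_df[(vocab_df.orig_code == code) & (vocab_df.units == units) & (vocab_df.type == typ)]
closest = (result_df.value.astype(float) - float(value)).abs().values.argsort()[0]
indx = result_df.index[closest]  # vocab index of the closest bucket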

ObsVocab.textify[source]

ObsVocab.textify(indxs)

Textify observation codes (returns codes and descriptions)

Note about logging numericalize errors
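
Codes that cannot be found are numericalized to xxunk (index 1), as the 'blah-2' example below shows; with log_excep=True the failing codes are also logged under log_dir ('default_log_store' by default) so they can be inspected later.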

obs_codes.head()
     orig_code                                               desc  value    units     type
indx
0       8302-2                                        Body Height  169.6       cm  numeric
1      72514-3  Pain severity - 0-10 verbal numeric rating [Sc...    4.0  {score}  numeric
2      29463-7                                        Body Weight   63.8       kg  numeric
3      39156-5                                    Body Mass Index   22.2    kg/m2  numeric
4      59576-9  Body mass index (BMI) [Percentile] Per age and...   81.9        %  numeric
obs_vocab_obj = ObsVocab.create(obs_codes)
obs_vocab_obj.numericalize(['8302-2||200.3||cm||numeric', \
                            '72514-3||4||{score}||numeric', '33756-8||21.7||mm||numeric','29463-7||181.8||kg||numeric'])
[6, 9, 341, 16]
obs_vocab_obj.numericalize(['blah-2||200.3||cm||numeric', \
                            '72514-3||4||{score}||numeric', '33756-8||21.7||mm||numeric','29463-7||181.8||kg||numeric'])
[1, 9, 341, 16]
obs_vocab_obj.textify([5, 8, 200, 15])
[('8302-2||160.29999999999998||cm||numeric', 'Body Height'),
 ('72514-3||2.5||{score}||numeric',
  'Pain severity - 0-10 verbal numeric rating [Score] - Reported'),
 ('20565-8||26.75||mmol/L||numeric', 'Carbon Dioxide'),
 ('29463-7||98.09999999999998||kg||numeric', 'Body Weight')]
obs_vocab_obj.numericalize(['32465-7||Normal size prostate||{nominal}||text',"80271-0||Positive Murphy's Sign||xxxnan||text",\
                          'xxnone'])
[522, 1, 0]
obs_vocab_obj.vocab_size
536
obs_vocab_obj.textify([522, 523, 0])
[('32465-7||Normal size prostate||{nominal}||text',
  'Physical findings of Prostate'),
 ('32465-7||Prostate enlarged on PR||{nominal}||text',
  'Physical findings of Prostate'),
 ('xxnone', 'Nothing recorded')]
obs_vocab_obj.numericalize(['xxnone','xxunk','72166-2||Never smoker||xxxnan||text'])
[0, 1, 467]
obs_vocab_obj.textify([0, 1, 2, 3, 467, 497])
[('xxnone', 'Nothing recorded'),
 ('xxunk||xxunk||xxunk||xxunk', 'Unknown'),
 ('8302-2||45.1||cm||numeric', 'Body Height'),
 ('8302-2||83.5||cm||numeric', 'Body Height'),
 ('72166-2||Never smoker||xxxnan||text', 'Tobacco smoking status NHIS'),
 ('88040-1||Improving (qualifier value)||xxxnan||text',
  'Response to cancer treatment')]

VocabList

class EhrVocabList[source]

EhrVocabList(demographics_vocabs, records_vocabs, age_mean, age_std, path)

Class to create and hold all vocab objects for an entire dataset

EhrVocabList.create[source]

EhrVocabList.create(path, num_buckets=5)

Read all code dfs from the dataset path and create all vocab objects

EhrVocabList.save[source]

EhrVocabList.save()

Save vocablist (containing all vocab objects for the dataset)

EhrVocabList.load[source]

EhrVocabList.load(path)

Load previously created vocablist object (containing all vocab objects for the dataset)

vocab_list_1K = EhrVocabList.create(PATH_1K)
vocab_list_1K.save()
Saved vocab lists to /home/vinod/.lemonpie/datasets/synthea/1K/processed

Tests

vl_1K = EhrVocabList.load(PATH_1K)
obs_vocab, alg_vocab, crpl_vocab, med_vocab, img_vocab, proc_vocab, cnd_vocab, imm_vocab = vl_1K.records_vocabs
bday, bmonth, byear, marital, race, ethnicity, gender, birthplace, city, state, zipcode  = vl_1K.demographics_vocabs

records_vocabs

obs_vocab.vocab_size
536
proc_vocab.numericalize(['xxnone','65200003','428191000124101'])
[0, 67, 1]
img_vocab.numericalize(['xxnone',344001])
[0, 8]
proc_vocab.numericalize(['65200003']), proc_vocab.numericalize([65200003])
([67], [67])
img_vocab.textify([0,1,2,3,4,5])
[('xxnone', 'Nothing recorded'),
 ('xxunk', 'Unknown'),
 ('40983000', {'Arm'}),
 ('51299004', {'Clavicle'}),
 ('8205005', {'Wrist'}),
 ('72696002', {'Knee'})]
img_vocab.numericalize(['xxnone','xxunk', 51299004,51185008,12921003]) 
[0, 1, 3, 7, 9]
obs_vocab.textify([0,1,2,3,4,5])
[('xxnone', 'Nothing recorded'),
 ('xxunk||xxunk||xxunk||xxunk', 'Unknown'),
 ('8302-2||45.1||cm||numeric', 'Body Height'),
 ('8302-2||83.5||cm||numeric', 'Body Height'),
 ('8302-2||121.9||cm||numeric', 'Body Height'),
 ('8302-2||160.29999999999998||cm||numeric', 'Body Height')]
obs_vocab.textify([200])
[('20565-8||26.75||mmol/L||numeric', 'Carbon Dioxide')]
obs_vocab.numericalize(['8302-2||200.3||cm||numeric', \
                            '72514-3||4||{score}||numeric', '10834-0||3.7||g/dL||numeric','29463-7||181.8||kg||numeric'])
[6, 9, 1, 16]
obs_vocab.textify([50,150,250,300])
[('8310-5||38.724999999999994||Cel||numeric', 'Body temperature'),
 ('2085-9||65.05||mg/dL||numeric', 'High Density Lipoprotein Cholesterol'),
 ('6248-9||56.175000000000004||kU/L||numeric', 'Soybean IgE Ab in Serum'),
 ('33914-3||120.875||mL/min/{1.73_m2}||numeric',
  'Estimated Glomerular Filtration Rate')]
med_vocab.textify([0,1,2,3,4])
[('xxnone', 'Nothing recorded'),
 ('xxunk', 'Unknown'),
 ('313782||START', {'Acetaminophen 325 MG Oral Tablet'}),
 ('748856||START', {'Yaz 28 Day Pack'}),
 ('1534809||START',
  {'168 HR Ethinyl Estradiol 0.00146 MG/HR / norelgestromin 0.00625 MG/HR Transdermal System'})]
med_vocab.itoc[:5]
['xxnone', 'xxunk', '313782||START', '748856||START', '1534809||START']
med_vocab.numericalize(['xxnone', 'xxunk', '834061||START','282464||START', '313782||START', '749882||START'])
[0, 1, 24, 1, 2, 66]
med_vocab.numericalize(['834061||START'])
[24]

demographics_vocabs

for vocab in vl_1K.demographics_vocabs:
    print(vocab.get_emb_dims())
(33, 8)
(14, 7)
(124, 11)
(5, 5)
(7, 6)
(4, 5)
(4, 5)
(243, 14)
(208, 13)
(3, 5)
(181, 13)
bday.numericalize(['xxnone','xxunk', 1,10,31])
[0, 1, 2, 11, 32]
bday.textify([0, 1, 2, 11, 32])
['xxnone', 'xxunk', '1', '10', '31']
bmonth.textify([13])
['12']
byear.numericalize(['1942', 1947])
[44, 49]
byear.numericalize([1948])
[50]
marital.vocab_size
5
marital.ctoi
{'xxnone': 0, 'xxunk': 1, 'M': 2, 'xxxnan': 3, 'S': 4}
marital.textify([0,1,2,3,4])
['xxnone', 'xxunk', 'M', 'xxxnan', 'S']
race.vocab_size
7
race.textify([0,1,2,3,4])
['xxnone', 'xxunk', 'black', 'white', 'asian']
vl_1K.age_mean, vl_1K.age_std
(16312.340455840456, 9600.296817631992)
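
These statistics appear to be in days (16312 days ≈ 44.7 years) and would presumably be used to normalize patient ages, e.g.:

# hypothetical: normalizing an age-in-days with the stored statistics
age_days = 40 * 365.25
age_norm = (age_days - vl_1K.age_mean) / vl_1K.age_std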

Get All Embedding Dimensions

get_all_emb_dims[source]

get_all_emb_dims(EhrVocabList, αd=0.5736)

Get embedding dimensions for all vocab objects of the dataset

demographics_dims, recs_dims, demographics_dims_width, recs_dims_width = get_all_emb_dims(EhrVocabList.load(PATH_1K))
demographics_dims
[(33, 8),
 (14, 7),
 (124, 11),
 (5, 5),
 (7, 6),
 (4, 5),
 (4, 5),
 (243, 14),
 (208, 13),
 (3, 5),
 (181, 13)]
recs_dims
[(536, 17),
 (26, 8),
 (50, 9),
 (226, 13),
 (11, 6),
 (137, 12),
 (184, 13),
 (20, 7)]
demographics_dims_width, recs_dims_width
(92, 85)
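
The two widths are simply the sums of the individual embedding dimensions, i.e. the total width the concatenated embedded features occupy:

sum(d for _, d in demographics_dims), sum(d for _, d in recs_dims)
# (92, 85)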
demographics_dims, recs_dims, demographics_dims_width, recs_dims_width = get_all_emb_dims(EhrVocabList.load(PATH_1K), αd=10)
demographics_dims
[(33, 144),
 (14, 116),
 (124, 200),
 (5, 90),
 (7, 98),
 (4, 85),
 (4, 85),
 (243, 237),
 (208, 228),
 (3, 79),
 (181, 220)]
recs_dims
[(536, 289),
 (26, 135),
 (50, 160),
 (226, 233),
 (11, 109),
 (137, 205),
 (184, 221),
 (20, 127)]
demographics_dims_width, recs_dims_width
(1582, 1479)