Classes and functions to create vocabs from cleaned EHR data.
 

nn.Embedding and nn.EmbeddingBag

nn.Embedding

import torch
import torch.nn as nn
emb1 = nn.Embedding(5,3)
emb1(torch.LongTensor([[0,1,2,3,4]]))
tensor([[[-1.0022, -0.7335, -0.5919],
         [ 1.0421, -0.1873,  0.3169],
         [ 1.5152,  0.7996, -0.4156],
         [ 0.5407,  0.5479,  0.0524],
         [ 0.5911,  0.2444, -1.1184]]], grad_fn=<EmbeddingBackward>)

An embedding matrix is a lookup table:

  1. emb1 above has 5 rows, one row per element in the vocabulary
  2. looking up an element returns the 3-dimensional vector stored in that element's row

Given this embedding matrix, looking up elements 1, 2, and 4 looks like this:

input = torch.LongTensor([[1,2,4]])
emb1(input)
tensor([[[ 1.0421, -0.1873,  0.3169],
         [ 1.5152,  0.7996, -0.4156],
         [ 0.5911,  0.2444, -1.1184]]], grad_fn=<EmbeddingBackward>)
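
Under the hood this lookup is just row indexing into the weight matrix; a quick check (added here, not from the original notebook), reusing emb1 from above:

rows = emb1.weight[torch.LongTensor([1,2,4])]  # direct row indexing into the 5x3 weight matrix
assert torch.equal(rows, emb1(torch.LongTensor([[1,2,4]]))[0])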

A batch of inputs is also possible (in this case a batch of 2, with 3 elements looked up in each).

  • Note that all inputs in a batch must look up the same number of elements
input = torch.LongTensor([[1,2,4],[0,3,2]])
# input = torch.LongTensor([[1,2,4],[0,3,2,1]]) # this will fail
emb1(input)
tensor([[[ 1.0421, -0.1873,  0.3169],
         [ 1.5152,  0.7996, -0.4156],
         [ 0.5911,  0.2444, -1.1184]],

        [[-1.0022, -0.7335, -0.5919],
         [ 0.5407,  0.5479,  0.0524],
         [ 1.5152,  0.7996, -0.4156]]], grad_fn=<EmbeddingBackward>)

nn.EmbeddingBag

embg1 = nn.EmbeddingBag(5,3)

Exactly the same input as for nn.Embedding above (a batch of 2), but

  • the result is averaged across the 3 elements in each bag (nn.EmbeddingBag defaults to mode='mean'; 'sum' and 'max' are also available)
  • resulting in an output of 2 vectors, not 6 as above
input = torch.LongTensor([[1,2,4],[0,3,2]]) # exactly same as above, but o/p is avg'd now
embg1(input)
tensor([[-1.3234,  0.5683,  0.5218],
        [-0.6582,  1.0977,  0.2710]], grad_fn=<EmbeddingBagBackward>)

Another way to do this is to pass in offsets rather than separating the input into 2 (or n) separate lists.

input = torch.LongTensor([1,2,4,0,3,2]) #same as above - 2 of same length 3
offsets = torch.LongTensor([0,3]) # output will be avg'd by default
embg1(input, offsets)
tensor([[-1.3234,  0.5683,  0.5218],
        [-0.6582,  1.0977,  0.2710]], grad_fn=<EmbeddingBagBackward>)
input = torch.LongTensor([1,2,4,2,0,3,3,2]) # again a batch of 2 inputs, but of length 4 each
offsets = torch.LongTensor([0,4])
embg1(input, offsets) #avg'd 2 outputs one for each input batch i.e. avg'd across 4 in each batch
tensor([[-1.3720,  0.7772,  0.7163],
        [-0.2795,  1.0111,  0.0585]], grad_fn=<EmbeddingBagBackward>)

Different Sizes

Offsets allow the bags in a batch to have different lengths.

input = torch.LongTensor([1,2,4,2,0,3,3,2]) #same input as above but .. 
offsets = torch.LongTensor([0,3,5]) # indicates 3 bags of different lengths: (0,1,2), (3,4), (5,6,7)
embg1(input, offsets)
tensor([[-1.3234,  0.5683,  0.5218],
        [-1.4155,  1.2710,  0.6960],
        [ 0.0649,  0.9687,  0.0473]], grad_fn=<EmbeddingBagBackward>)
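
For reference (an added check, not from the original notebook): since nn.EmbeddingBag defaults to mode='mean', each output row is just the mean of the corresponding rows of embg1.weight. Reusing input and offsets from the cell above:

out = embg1(input, offsets)
manual = torch.stack([embg1.weight[input[0:3]].mean(0),   # bag 1: positions 0..2
                      embg1.weight[input[3:5]].mean(0),   # bag 2: positions 3..4
                      embg1.weight[input[5:8]].mean(0)])  # bag 3: positions 5..7
assert torch.allclose(out, manual)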

Application to EHR Data

TODO - Details

itoc, ctoi, ctod, numericalize, textify

These have the same meaning as in fastai v1:

  • itoc = index to code
  • ctoi = code to index, the reverse of itoc
  • ctod = code to description, a new addition, used to show descriptions when they exist for some types of EHR codes
  • numericalize() = returns numericalized ids (via ctoi) for a set of codes
  • textify() = the reverse of numericalize(), returns the codes (via itoc) for a set of numericalized ids, along with descriptions if they exist (via ctod)

I tried extending the fastai vocabs, but found it easier to write these from scratch, given how unique EHR data is.
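
A minimal sketch of the idea (hypothetical code, not the library's implementation): itoc is a list, ctoi is a dict built from it, and numericalize/textify are lookups in opposite directions.

# hypothetical stripped-down vocab illustrating itoc / ctoi / ctod
itoc = ['xxnone', 'xxunk', '8302-2', '29463-7']             # index -> code
ctoi = {c: i for i, c in enumerate(itoc)}                   # code -> index
ctod = {'8302-2': 'Body Height', '29463-7': 'Body Weight'}  # code -> description

def numericalize(codes): return [ctoi.get(str(c), ctoi['xxunk']) for c in codes]  # unknowns -> xxunk
def textify(indxs):      return [(itoc[i], ctod.get(itoc[i], '')) for i in indxs]

numericalize(['8302-2', 'blah'])  # [2, 1]
textify([2, 3])                   # [('8302-2', 'Body Height'), ('29463-7', 'Body Weight')]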

Vocabs

 
code_dfs = load_ehr_vocabcodes(PATH_1K)
pt_codes, obs_codes, alg_codes, crpl_codes, med_codes, img_codes, proc_codes, cnd_codes, immn_codes = code_dfs

class EhrVocab[source]

EhrVocab(itoc, ctoi, ctod=None)

Vocab class for most EHR datatypes

EhrVocab.create[source]

EhrVocab.create(codes_df)

Create vocab object (itoc, ctoi and maybe ctod) from the codes df

EhrVocab.numericalize[source]

EhrVocab.numericalize(codes, log_excep=True, log_dir='default_log_store')

Lookup and return indices for codes

EhrVocab.textify[source]

EhrVocab.textify(indxs)

Look up and return codes and descriptions for the given indices

EhrVocab.get_emb_dims[source]

EhrVocab.get_emb_dims(αd=0.5736)

Get embedding dimensions
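
The dimensions returned throughout this page are consistent with the common "6 · n^0.25" embedding-size rule of thumb scaled by αd; a hypothetical reconstruction (an inference from the outputs below, not the library source):

# assumed formula: dim = round(αd * 6 * vocab_size ** 0.25)
def emb_dims(vocab_size, αd=0.5736):
    return (vocab_size, round(αd * 6 * vocab_size ** 0.25))

emb_dims(536)  # (536, 17) - matches obs_vocab below
emb_dims(33)   # (33, 8)   - matches the first demographics vocab below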

class ObsVocab[source]

ObsVocab(vocab_df) :: EhrVocab

Special Vocab class for Observation codes

ObsVocab.create[source]

ObsVocab.create(obs_codes, num_buckets=5)

Create vocab object from observation codes
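
The bucketed Body Height values shown further down (45.1, 83.5, 121.9, 160.3, ...) are evenly spaced, which suggests equal-width binning of each numeric code's value range into num_buckets buckets. A plausible sketch (an assumption, not the library source):

# plausible equal-width bucketing of one numeric observation code (assumed approach)
import pandas as pd
values = pd.Series([25.9, 60.0, 100.0, 150.0, 217.9])  # made-up heights in cm
buckets = pd.cut(values, bins=5)                       # 5 equal-width intervals
bucket_values = buckets.apply(lambda iv: iv.mid)       # represent each bucket by its midpoint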

ObsVocab.numericalize[source]

ObsVocab.numericalize(codes, log_excep=True, log_dir='default_log_store')

Numericalize observation codes (return indices for codes)

  • split the incoming concatenated code||value||units||type string
  • get a result_df matching everything except value
  • then argsort() the value column to find the bucket with the closest value (sketched below)
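
A rough sketch of that closest-value step (illustrative only; vocab_df and its column names are assumptions based on the obs_codes layout below):

# illustrative: pick the vocab row whose bucketed value is closest to the incoming value
code, value, units, typ = '8302-2||200.3||cm||numeric'.split('||')
result_df = vocab_df[(vocab_df.orig_code == code) & (vocab_df.units == units) & (vocab_df.type == typ)]
closest = (result_df.value.astype(float) - float(value)).abs().values.argsort()[0]
indx = result_df.index[closest]  # vocab index of the closest bucket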

ObsVocab.textify[source]

ObsVocab.textify(indxs)

Textify observation codes (returns codes and descriptions)

Note about logging numericalize errors
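
Codes that cannot be found are numericalized to xxunk (index 1), as the 'blah-2' example below shows; with log_excep=True the failing codes are also logged under log_dir ('default_log_store' by default) so they can be inspected later.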

obs_codes.head()
     orig_code                                               desc  value    units     type
indx
0       8302-2                                        Body Height  169.6       cm  numeric
1      72514-3  Pain severity - 0-10 verbal numeric rating [Sc...    4.0  {score}  numeric
2      29463-7                                        Body Weight   63.8       kg  numeric
3      39156-5                                    Body Mass Index   22.2    kg/m2  numeric
4      59576-9  Body mass index (BMI) [Percentile] Per age and...   81.9        %  numeric
obs_vocab_obj = ObsVocab.create(obs_codes)
obs_vocab_obj.numericalize(['8302-2||200.3||cm||numeric', \
                            '72514-3||4||{score}||numeric', '33756-8||21.7||mm||numeric','29463-7||181.8||kg||numeric'])
[6, 9, 341, 16]
obs_vocab_obj.numericalize(['blah-2||200.3||cm||numeric', \
                            '72514-3||4||{score}||numeric', '33756-8||21.7||mm||numeric','29463-7||181.8||kg||numeric'])
[1, 9, 341, 16]
obs_vocab_obj.textify([5, 8, 200, 15])
[('8302-2||160.29999999999998||cm||numeric', 'Body Height'),
 ('72514-3||2.5||{score}||numeric',
  'Pain severity - 0-10 verbal numeric rating [Score] - Reported'),
 ('20565-8||26.75||mmol/L||numeric', 'Carbon Dioxide'),
 ('29463-7||98.09999999999998||kg||numeric', 'Body Weight')]
obs_vocab_obj.numericalize(['32465-7||Normal size prostate||{nominal}||text',"80271-0||Positive Murphy's Sign||xxxnan||text",\
                          'xxnone'])
[522, 1, 0]
obs_vocab_obj.vocab_size
536
obs_vocab_obj.textify([522, 523, 0])
[('32465-7||Normal size prostate||{nominal}||text',
  'Physical findings of Prostate'),
 ('32465-7||Prostate enlarged on PR||{nominal}||text',
  'Physical findings of Prostate'),
 ('xxnone', 'Nothing recorded')]
obs_vocab_obj.numericalize(['xxnone','xxunk','72166-2||Never smoker||xxxnan||text'])
[0, 1, 467]
obs_vocab_obj.textify([0, 1, 2, 3, 467, 497])
[('xxnone', 'Nothing recorded'),
 ('xxunk||xxunk||xxunk||xxunk', 'Unknown'),
 ('8302-2||45.1||cm||numeric', 'Body Height'),
 ('8302-2||83.5||cm||numeric', 'Body Height'),
 ('72166-2||Never smoker||xxxnan||text', 'Tobacco smoking status NHIS'),
 ('88040-1||Improving (qualifier value)||xxxnan||text',
  'Response to cancer treatment')]

VocabList

class EhrVocabList[source]

EhrVocabList(demographics_vocabs, records_vocabs, age_mean, age_std, path)

Class to create and hold all vocab objects for an entire dataset

EhrVocabList.create[source]

EhrVocabList.create(path, num_buckets=5)

Read all code dfs from the dataset path and create all vocab objects

EhrVocabList.save[source]

EhrVocabList.save()

Save vocablist (containing all vocab objects for the dataset)

EhrVocabList.load[source]

EhrVocabList.load(path)

Load previously created vocablist object (containing all vocab objects for the dataset)

vocab_list_1K = EhrVocabList.create(PATH_1K)
vocab_list_1K.save()
Saved vocab lists to /home/vinod/.lemonpie/datasets/synthea/1K/processed

Tests

vl_1K = EhrVocabList.load(PATH_1K)
obs_vocab, alg_vocab, crpl_vocab, med_vocab, img_vocab, proc_vocab, cnd_vocab, imm_vocab = vl_1K.records_vocabs
bday, bmonth, byear, marital, race, ethnicity, gender, birthplace, city, state, zipcode  = vl_1K.demographics_vocabs

records_vocabs

obs_vocab.vocab_size
536
proc_vocab.numericalize(['xxnone','65200003','428191000124101'])
[0, 67, 1]
img_vocab.numericalize(['xxnone',344001])
[0, 8]
proc_vocab.numericalize(['65200003']), proc_vocab.numericalize([65200003])
([67], [67])
img_vocab.textify([0,1,2,3,4,5])
[('xxnone', 'Nothing recorded'),
 ('xxunk', 'Unknown'),
 ('40983000', {'Arm'}),
 ('51299004', {'Clavicle'}),
 ('8205005', {'Wrist'}),
 ('72696002', {'Knee'})]
img_vocab.numericalize(['xxnone','xxunk', 51299004,51185008,12921003]) 
[0, 1, 3, 7, 9]
obs_vocab.textify([0,1,2,3,4,5])
[('xxnone', 'Nothing recorded'),
 ('xxunk||xxunk||xxunk||xxunk', 'Unknown'),
 ('8302-2||45.1||cm||numeric', 'Body Height'),
 ('8302-2||83.5||cm||numeric', 'Body Height'),
 ('8302-2||121.9||cm||numeric', 'Body Height'),
 ('8302-2||160.29999999999998||cm||numeric', 'Body Height')]
obs_vocab.textify([200])
[('20565-8||26.75||mmol/L||numeric', 'Carbon Dioxide')]
obs_vocab.numericalize(['8302-2||200.3||cm||numeric', \
                            '72514-3||4||{score}||numeric', '10834-0||3.7||g/dL||numeric','29463-7||181.8||kg||numeric'])
[6, 9, 1, 16]
obs_vocab.textify([50,150,250,300])
[('8310-5||38.724999999999994||Cel||numeric', 'Body temperature'),
 ('2085-9||65.05||mg/dL||numeric', 'High Density Lipoprotein Cholesterol'),
 ('6248-9||56.175000000000004||kU/L||numeric', 'Soybean IgE Ab in Serum'),
 ('33914-3||120.875||mL/min/{1.73_m2}||numeric',
  'Estimated Glomerular Filtration Rate')]
med_vocab.textify([0,1,2,3,4])
[('xxnone', 'Nothing recorded'),
 ('xxunk', 'Unknown'),
 ('313782||START', {'Acetaminophen 325 MG Oral Tablet'}),
 ('748856||START', {'Yaz 28 Day Pack'}),
 ('1534809||START',
  {'168 HR Ethinyl Estradiol 0.00146 MG/HR / norelgestromin 0.00625 MG/HR Transdermal System'})]
med_vocab.itoc[:5]
['xxnone', 'xxunk', '313782||START', '748856||START', '1534809||START']
med_vocab.numericalize(['xxnone', 'xxunk', '834061||START','282464||START', '313782||START', '749882||START'])
[0, 1, 24, 1, 2, 66]
med_vocab.numericalize(['834061||START'])
[24]

demographics_vocabs

for vocab in vl_1K.demographics_vocabs:
    print(vocab.get_emb_dims())
(33, 8)
(14, 7)
(124, 11)
(5, 5)
(7, 6)
(4, 5)
(4, 5)
(243, 14)
(208, 13)
(3, 5)
(181, 13)
bday.numericalize(['xxnone','xxunk', 1,10,31])
[0, 1, 2, 11, 32]
bday.textify([0, 1, 2, 11, 32])
['xxnone', 'xxunk', '1', '10', '31']
bmonth.textify([13])
['12']
byear.numericalize(['1942', 1947])
[44, 49]
byear.numericalize([1948])
[50]
marital.vocab_size
5
marital.ctoi
{'xxnone': 0, 'xxunk': 1, 'M': 2, 'xxxnan': 3, 'S': 4}
marital.textify([0,1,2,3,4])
['xxnone', 'xxunk', 'M', 'xxxnan', 'S']
race.vocab_size
7
race.textify([0,1,2,3,4])
['xxnone', 'xxunk', 'black', 'white', 'asian']
vl_1K.age_mean, vl_1K.age_std
(16312.340455840456, 9600.296817631992)
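
These statistics appear to be in days (16312 days ≈ 44.7 years) and would presumably be used to normalize patient ages, e.g.:

# hypothetical: normalizing an age-in-days with the stored statistics
age_days = 40 * 365.25
age_norm = (age_days - vl_1K.age_mean) / vl_1K.age_std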

Get All Embedding Dimensions

get_all_emb_dims[source]

get_all_emb_dims(EhrVocabList, αd=0.5736)

Get embedding dimensions for all vocab objects of the dataset

demographics_dims, recs_dims, demographics_dims_width, recs_dims_width = get_all_emb_dims(EhrVocabList.load(PATH_1K))
demographics_dims
[(33, 8),
 (14, 7),
 (124, 11),
 (5, 5),
 (7, 6),
 (4, 5),
 (4, 5),
 (243, 14),
 (208, 13),
 (3, 5),
 (181, 13)]
recs_dims
[(536, 17),
 (26, 8),
 (50, 9),
 (226, 13),
 (11, 6),
 (137, 12),
 (184, 13),
 (20, 7)]
demographics_dims_width, recs_dims_width
(92, 85)
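
The two widths are simply the sums of the individual embedding dimensions, i.e. the total width the concatenated embedded features occupy:

sum(d for _, d in demographics_dims), sum(d for _, d in recs_dims)
# (92, 85)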
demographics_dims, recs_dims, demographics_dims_width, recs_dims_width = get_all_emb_dims(EhrVocabList.load(PATH_1K), αd=10)
demographics_dims
[(33, 144),
 (14, 116),
 (124, 200),
 (5, 90),
 (7, 98),
 (4, 85),
 (4, 85),
 (243, 237),
 (208, 228),
 (3, 79),
 (181, 220)]
recs_dims
[(536, 289),
 (26, 135),
 (50, 160),
 (226, 233),
 (11, 109),
 (137, 205),
 (184, 221),
 (20, 127)]
demographics_dims_width, recs_dims_width
(1582, 1479)