nn.Embedding
import torch
import torch.nn as nn

emb1 = nn.Embedding(5,3)
emb1(torch.LongTensor([[0,1,2,3,4]]))
An embedding matrix is a lookup table.
`emb1` above has 5 rows, that is, 5 elements - but looking up an element returns the 3-dimensional vector for that element.
 
Given this embedding matrix, looking up elements 1, 2, and 4 looks like this:
input = torch.LongTensor([[1,2,4]])
emb1(input)
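Since the embedding is just a lookup table, indexing the weight matrix directly returns the same vector - a quick check in plain PyTorch (nothing specific to this notebook):

# looking up element 1 is the same as reading row 1 of the weight matrix
torch.equal(emb1(torch.LongTensor([1])), emb1.weight[1:2])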
A batch of inputs is also possible (in this case a batch of 2, each with 3 elements being looked up)
- Note that the inputs (# of elements being looked up) in a batch have to be of the same size
 
input = torch.LongTensor([[1,2,4],[0,3,2]])
# input = torch.LongTensor([[1,2,4],[0,3,2,1]]) # this will fail
emb1(input)
nn.EmbeddingBag
embg1 = nn.EmbeddingBag(5,3)
Exactly the same input as for `nn.Embedding` above (a batch of 2)
- but the result will be averaged across the 3 elements in each bag
 - resulting in an output of 2 vectors, not 6 as above
 
input = torch.LongTensor([[1,2,4],[0,3,2]]) # exactly the same as above, but the output is averaged now
embg1(input)
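As a sanity check (plain PyTorch; `mode='mean'` is the default), an `nn.EmbeddingBag` built from `emb1`'s weights returns exactly the mean of the individual lookups:

# share emb1's weights so the two modules are comparable
embg_check = nn.EmbeddingBag.from_pretrained(emb1.weight, mode='mean')
torch.allclose(embg_check(input), emb1(input).mean(dim=1))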
Another way to do this is to pass in offsets rather than separating the input into 2 (or however many) lists
input = torch.LongTensor([1,2,4,0,3,2]) # same as above - 2 bags, each of length 3
offsets = torch.LongTensor([0,3]) # output will be averaged by default
embg1(input, offsets)
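The two call styles are equivalent - the 1D input + offsets form gives the same result as the 2D batch form above:

torch.allclose(embg1(torch.LongTensor([[1,2,4],[0,3,2]])), embg1(input, offsets))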
input = torch.LongTensor([1,2,4,2,0,3,3,2]) # as above, a batch of 2 inputs, but of length 4 each
offsets = torch.LongTensor([0,4])
embg1(input, offsets) # 2 averaged outputs, one per bag, each averaged across its 4 elements
Different Sizes
Offsets allow us to have input batches of different lengths
input = torch.LongTensor([1,2,4,2,0,3,3,2]) # same input as above, but ..
offsets = torch.LongTensor([0,3,5]) # this indicates 3 bags of different lengths: positions (0,1,2), (3,4), (5,6,7)
embg1(input, offsets)
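Each offset marks where a bag starts, and the last bag runs to the end of the input - verifiable against the weight matrix directly (plain PyTorch):

# bag 2 covers input positions 5..7, i.e. elements 3, 3, 2
torch.allclose(embg1(input, offsets)[2], embg1.weight[[3,3,2]].mean(dim=0))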
Application to EHR Data
TODO - Details
These have the same meaning as in fastai v1:
- `itoc` = index to code
- `ctoi` = the reverse of `itoc`
- `ctod` = a new addition - to show descriptions, in case descriptions exist in some types of EHR codes
- `numericalize()` = returns numericalized ids (`ctoi`) for a set of codes
- `textify()` = the reverse of `numericalize()` - returns the codes (`itoc`) for a set of numericalized ids and, if descriptions exist, returns them too (`ctod`)
I tried to extend fastai vocabs, but found it easier to write from scratch, given EHR data is quite unique.
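A minimal sketch of the idea (illustration only - `MiniVocab` is hypothetical, not this library's implementation; the special tokens `xxnone`/`xxunk` mirror the ones used below, and their ordering here is an assumption):

class MiniVocab:
    "Illustration only - maps codes to ids and back, like the vocabs below."
    def __init__(self, codes):
        self.itoc = ['xxnone', 'xxunk'] + sorted(set(codes)) # special tokens first (assumed ordering)
        self.ctoi = {c: i for i, c in enumerate(self.itoc)}
    def numericalize(self, codes):
        return [self.ctoi.get(c, self.ctoi['xxunk']) for c in codes] # unknown codes map to xxunk
    def textify(self, ids):
        return [self.itoc[i] for i in ids]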
 
code_dfs = load_ehr_vocabcodes(PATH_1K)
pt_codes, obs_codes, alg_codes, crpl_codes, med_codes, img_codes, proc_codes, cnd_codes, immn_codes = code_dfs
- split the incoming concatenated `code||value||units||type` string - get a `result_df` based on everything except the value
- then do an `argsort()` on the value column to determine the closest value - based on an example given in the pandas docs cookbook
 - the cookbook example that uses `loc` doesn't work; `iloc` works instead
 - `argsort()` returns the indices that would sort the array; `[:1]` on that returns the one row with the closest match, and its index is what we want (see the sketch just below)
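The pandas cookbook pattern in question, on a hypothetical frame (the frame and its values are illustrative):

import pandas as pd

# hypothetical result_df: candidate rows for one code, differing only in value
result_df = pd.DataFrame({'code': ['29463-7']*3, 'value': [70.0, 180.5, 182.0]})
target = 181.8
# argsort returns positional indices, so iloc (not loc) must be used;
# [:1] keeps the single row whose value is closest to the target
result_df.iloc[(result_df['value'] - target).abs().argsort()[:1]]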
 
Note about logging numericalize errors
obs_codes.head()
obs_vocab_obj = ObsVocab.create(obs_codes)
obs_vocab_obj.numericalize(['8302-2||200.3||cm||numeric', \
                            '72514-3||4||{score}||numeric', '33756-8||21.7||mm||numeric','29463-7||181.8||kg||numeric'])
obs_vocab_obj.numericalize(['blah-2||200.3||cm||numeric', \
                            '72514-3||4||{score}||numeric', '33756-8||21.7||mm||numeric','29463-7||181.8||kg||numeric'])
obs_vocab_obj.textify([5, 8, 200, 15])
obs_vocab_obj.numericalize(['32465-7||Normal size prostate||{nominal}||text',"80271-0||Positive Murphy's Sign||xxxnan||text",\
                          'xxnone'])
obs_vocab_obj.vocab_size
obs_vocab_obj.textify([522, 523, 0])
obs_vocab_obj.numericalize(['xxnone','xxunk','72166-2||Never smoker||xxxnan||text'])
obs_vocab_obj.textify([0, 1, 2, 3, 467, 497])
vocab_list_1K = EhrVocabList.create(PATH_1K)
vocab_list_1K.save()
Tests
vl_1K = EhrVocabList.load(PATH_1K)
obs_vocab, alg_vocab, crpl_vocab, med_vocab, img_vocab, proc_vocab, cnd_vocab, imm_vocab = vl_1K.records_vocabs
bday, bmonth, byear, marital, race, ethnicity, gender, birthplace, city, state, zipcode  = vl_1K.demographics_vocabs
records_vocabs
obs_vocab.vocab_size
proc_vocab.numericalize(['xxnone','65200003','428191000124101'])
img_vocab.numericalize(['xxnone',344001])
proc_vocab.numericalize(['65200003']), proc_vocab.numericalize([65200003])
img_vocab.textify([0,1,2,3,4,5])
img_vocab.numericalize(['xxnone','xxunk', 51299004,51185008,12921003]) 
obs_vocab.textify([0,1,2,3,4,5])
obs_vocab.textify([200])
obs_vocab.numericalize(['8302-2||200.3||cm||numeric', \
                            '72514-3||4||{score}||numeric', '10834-0||3.7||g/dL||numeric','29463-7||181.8||kg||numeric'])
obs_vocab.textify([50,150,250,300])
med_vocab.textify([0,1,2,3,4])
med_vocab.itoc[:5]
med_vocab.numericalize(['xxnone', 'xxunk', '834061||START','282464||START', '313782||START', '749882||START'])
med_vocab.numericalize(['834061||START'])
demographics_vocabs
for vocab in vl_1K.demographics_vocabs:
    print(vocab.get_emb_dims())
bday.numericalize(['xxnone','xxunk', 1,10,31])
bday.textify([0, 1, 2, 11, 32])
bmonth.textify([13])
byear.numericalize(['1942',1947,])
byear.numericalize([1948])
marital.vocab_size
marital.ctoi
marital.textify([0,1,2,3,4])
race.vocab_size
race.textify([0,1,2,3,4])
vl_1K.age_mean, vl_1K.age_std
demographics_dims, recs_dims, demographics_dims_width, recs_dims_width = get_all_emb_dims(EhrVocabList.load(PATH_1K))
demographics_dims
recs_dims
demographics_dims_width, recs_dims_width
demographics_dims, recs_dims, demographics_dims_width, recs_dims_width = get_all_emb_dims(EhrVocabList.load(PATH_1K), αd=10)
demographics_dims
recs_dims
demographics_dims_width, recs_dims_width