nn.Embedding
import torch
import torch.nn as nn

emb1 = nn.Embedding(5,3)
emb1(torch.LongTensor([[0,1,2,3,4]]))
An embedding matrix is a lookup table. emb1 above has 5 rows, that is 5 elements - but looking up an element returns the vector for that element.
Given this embedding matrix, looking up elements 1, 2, 4 will look like this ..
input = torch.LongTensor([[1,2,4]])
emb1(input)
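A quick sanity check of the "lookup table" framing - the lookup just returns rows of the weight matrix:
torch.equal(emb1(torch.LongTensor([1,2,4])), emb1.weight[[1,2,4]]) # True - lookup is row selection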
A batch of inputs is also possible (in this case a batch of 2, each with 3 elements being looked up).
- Note that all inputs in a batch (the number of elements being looked up) have to be of the same size
input = torch.LongTensor([[1,2,4],[0,3,2]])
# input = torch.LongTensor([[1,2,4],[0,3,2,1]]) # this will fail
emb1(input)
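The output shape mirrors the input shape with the embedding dimension appended - which is why the batch above yields 2 x 3 vectors:
emb1(input).shape # torch.Size([2, 3, 3]) - batch of 2, 3 lookups each, 3-dim vectors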
nn.EmbeddingBag
embg1 = nn.EmbeddingBag(5,3)
Exactly the same input as in the nn.Embedding case above (batch of 2)
- but the result will be averaged across the 3 elements in each bag
- resulting in an output of 2 vectors, not 6 like above
input = torch.LongTensor([[1,2,4],[0,3,2]]) # exactly same as above, but o/p is avg'd now
embg1(input)
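To see the averaging concretely, here is a quick check using a bag that shares emb1's weights via from_pretrained (embg_check is just a name for this check; mode='mean' is the default):
embg_check = nn.EmbeddingBag.from_pretrained(emb1.weight, mode='mean') # same weights as emb1
out_bag = embg_check(torch.LongTensor([[1,2,4],[0,3,2]]))
out_emb = emb1(torch.LongTensor([[1,2,4],[0,3,2]])).mean(dim=1) # average the 3 looked-up vectors per bag
torch.allclose(out_bag, out_emb) # True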
Another way to do this is to send in offsets, rather than separating the inputs into 2 (or however many) lists.
input = torch.LongTensor([1,2,4,0,3,2]) # same as above - 2 bags, each of length 3
offsets = torch.LongTensor([0,3]) # output will be avg'd by default
embg1(input, offsets)
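A quick check that the flat input + offsets form matches the 2-D batch form above:
torch.allclose(
    embg1(torch.LongTensor([[1,2,4],[0,3,2]])),
    embg1(torch.LongTensor([1,2,4,0,3,2]), torch.LongTensor([0,3]))
) # True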
input = torch.LongTensor([1,2,4,2,0,3,3,2]) # again a batch of 2 bags, but of length 4 each
offsets = torch.LongTensor([0,4])
embg1(input, offsets) # 2 averaged outputs, one per bag, i.e. averaged across the 4 elements in each bag
Different Sizes
Offsets allow us to have input batches of different lengths
input = torch.LongTensor([1,2,4,2,0,3,3,2]) # same input as above but ..
offsets = torch.LongTensor([0,3,5]) # this indicates 3 bags of different lengths: positions (0,1,2), (3,4), (5,6,7)
embg1(input, offsets)
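Even with bags of different lengths, the output is still one averaged vector per bag:
embg1(input, offsets).shape # torch.Size([3, 3]) - 3 bags, one 3-dim vector each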
Application to EHR Data
TODO - Details
These have the same meaning as in fastai v1:
- itoc = index to code
- ctoi = the reverse of itoc
- ctod = a new addition, to show descriptions in case descriptions exist in some types of EHR codes
- numericalize() = returns numericalized ids (ctoi) for a set of codes
- textify() = reverse of numericalize() - returns the codes (itoc) for a set of numericalized ids, and if descriptions exist returns them too (ctod)
I tried to extend fastai vocabs, but found it easier to write from scratch, given EHR data is quite unique.
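Purely as an illustration of the interface above (SimpleEhrVocab is a made-up name, not one of the actual classes - the real vocabs below also handle special tokens, value matching and error logging), a minimal version could look roughly like this:

class SimpleEhrVocab:
    # illustrative sketch only - not the actual implementation
    def __init__(self, codes, descriptions=None):
        self.itoc = ['xxnone', 'xxunk'] + list(codes)        # index -> code
        self.ctoi = {c: i for i, c in enumerate(self.itoc)}  # code -> index
        self.ctod = descriptions or {}                       # code -> description
    def numericalize(self, codes):
        return [self.ctoi.get(c, self.ctoi['xxunk']) for c in codes]  # unknown codes map to xxunk
    def textify(self, ids):
        return [(self.itoc[i], self.ctod.get(self.itoc[i], '')) for i in ids]  # (code, description) pairs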
code_dfs = load_ehr_vocabcodes(PATH_1K)
pt_codes, obs_codes, alg_codes, crpl_codes, med_codes, img_codes, proc_codes, cnd_codes, immn_codes = code_dfs
- split the incoming concatenated code||value||units||type string - get a result_df based on everything except value
- then do an argsort() on the value column to determine the closest value - based on the example given in the pandas docs cookbook
- the cookbook example that uses loc doesn't work; iloc works instead
- argsort() returns the indices that would sort the array; [:1] on that returns the one row with the closest match, and the index of that row is what we want (see the sketch below)
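A small sketch of that closest-value lookup (result_df here is a tiny stand-in - in the real code it is built from the vocab after matching on code, units and type):

import pandas as pd
result_df = pd.DataFrame({'value': [1.2, 3.4, 180.0, 200.5, 210.0]}) # stand-in, value column only
target = 200.3
closest = result_df.iloc[(result_df['value'] - target).abs().argsort()[:1]] # iloc works here, loc doesn't
closest.index[0] # index of the row with the closest value -> 3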
Note about logging numericalize errors
obs_codes.head()
obs_vocab_obj = ObsVocab.create(obs_codes)
obs_vocab_obj.numericalize(['8302-2||200.3||cm||numeric', \
'72514-3||4||{score}||numeric', '33756-8||21.7||mm||numeric','29463-7||181.8||kg||numeric'])
obs_vocab_obj.numericalize(['blah-2||200.3||cm||numeric', \
'72514-3||4||{score}||numeric', '33756-8||21.7||mm||numeric','29463-7||181.8||kg||numeric'])
obs_vocab_obj.textify([5, 8, 200, 15])
obs_vocab_obj.numericalize(['32465-7||Normal size prostate||{nominal}||text',"80271-0||Positive Murphy's Sign||xxxnan||text",\
'xxnone'])
obs_vocab_obj.vocab_size
obs_vocab_obj.textify([522, 523, 0])
obs_vocab_obj.numericalize(['xxnone','xxunk','72166-2||Never smoker||xxxnan||text'])
obs_vocab_obj.textify([0, 1, 2, 3, 467, 497])
vocab_list_1K = EhrVocabList.create(PATH_1K)
vocab_list_1K.save()
Tests
vl_1K = EhrVocabList.load(PATH_1K)
obs_vocab, alg_vocab, crpl_vocab, med_vocab, img_vocab, proc_vocab, cnd_vocab, imm_vocab = vl_1K.records_vocabs
bday, bmonth, byear, marital, race, ethnicity, gender, birthplace, city, state, zipcode = vl_1K.demographics_vocabs
records_vocabs
obs_vocab.vocab_size
proc_vocab.numericalize(['xxnone','65200003','428191000124101'])
img_vocab.numericalize(['xxnone',344001])
proc_vocab.numericalize(['65200003']), proc_vocab.numericalize([65200003])
img_vocab.textify([0,1,2,3,4,5])
img_vocab.numericalize(['xxnone','xxunk', 51299004,51185008,12921003])
obs_vocab.textify([0,1,2,3,4,5])
obs_vocab.textify([200])
obs_vocab.numericalize(['8302-2||200.3||cm||numeric', \
'72514-3||4||{score}||numeric', '10834-0||3.7||g/dL||numeric','29463-7||181.8||kg||numeric'])
obs_vocab.textify([50,150,250,300])
med_vocab.textify([0,1,2,3,4])
med_vocab.itoc[:5]
med_vocab.numericalize(['xxnone', 'xxunk', '834061||START','282464||START', '313782||START', '749882||START'])
med_vocab.numericalize(['834061||START'])
demographics_vocabs
for vocab in vl_1K.demographics_vocabs:
print(vocab.get_emb_dims())
bday.numericalize(['xxnone','xxunk', 1,10,31])
bday.textify([0, 1, 2, 11, 32])
bmonth.textify([13])
byear.numericalize(['1942',1947,])
byear.numericalize([1948])
marital.vocab_size
marital.ctoi
marital.textify([0,1,2,3,4])
race.vocab_size
race.textify([0,1,2,3,4])
vl_1K.age_mean, vl_1K.age_std
demographics_dims, recs_dims, demographics_dims_width, recs_dims_width = get_all_emb_dims(EhrVocabList.load(PATH_1K))
demographics_dims
recs_dims
demographics_dims_width, recs_dims_width
demographics_dims, recs_dims, demographics_dims_width, recs_dims_width = get_all_emb_dims(EhrVocabList.load(PATH_1K), αd=10)
demographics_dims
recs_dims
demographics_dims_width, recs_dims_width