
Does the model vocabulary come from the train data and the test data? #21

Open
JDwangmo opened this issue Sep 3, 2016 · 0 comments

Comments


JDwangmo commented Sep 3, 2016

Hi Kim, thank you for sharing your code, but I have a question about your model. In your implementation (shown below), the embedding matrix contains all words from both the train set and the test set. However, I think it should contain only words from the train set, because in a real-world scenario you cannot see the test data in advance, and the test data may contain out-of-vocabulary (OOV) words that never appear in the train set. In static mode (CNN-static) this is not a problem, but in non-static mode (CNN-non-static), how do you handle the OOV problem, i.e., how do you update the embedding parameters of words that are not present in the model vocabulary? In short: for words that appear in the word2vec model but not in the original model vocabulary, how do you handle them? Sorry, my English is poor, so my wording may be unclear. Thank you.

import numpy as np

def get_W(word_vecs, k=300):
    """
    Get word matrix. W[i] is the vector for the word indexed by i.
    """
    vocab_size = len(word_vecs)
    word_idx_map = dict()
    # Row 0 is reserved for padding and stays all-zero.
    W = np.zeros(shape=(vocab_size + 1, k), dtype='float32')
    i = 1
    for word in word_vecs:
        W[i] = word_vecs[word]
        word_idx_map[word] = i
        i += 1
    return W, word_idx_map
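To make the question concrete, here is a minimal sketch of the alternative I have in mind: build the embedding matrix from training-set words only, and map any unseen test-time word to a single `<UNK>` row. The function names (`get_W_train_only`, `words_to_indices`) and the random-initialized `<UNK>` row are my own assumptions for illustration, not part of the original code.

```python
import numpy as np

def get_W_train_only(word_vecs, train_vocab, k=300, unk_scale=0.25):
    """Hypothetical variant of get_W: only training-set words get rows.

    Index 0 stays the all-zero padding row; one extra <UNK> row with
    small random values absorbs every word not seen during training.
    """
    vocab = [w for w in word_vecs if w in train_vocab]
    word_idx_map = {w: i for i, w in enumerate(vocab, start=1)}
    unk_idx = len(vocab) + 1
    W = np.zeros((len(vocab) + 2, k), dtype='float32')
    for w, i in word_idx_map.items():
        W[i] = word_vecs[w]
    # Assumed OOV handling: a single shared, randomly initialized row.
    W[unk_idx] = np.random.uniform(-unk_scale, unk_scale, k).astype('float32')
    return W, word_idx_map, unk_idx

def words_to_indices(words, word_idx_map, unk_idx):
    """Map a sentence to indices; OOV words fall back to the <UNK> row."""
    return [word_idx_map.get(w, unk_idx) for w in words]
```

With this setup, in non-static mode only the `<UNK>` row (not a per-word row) would be updated for unseen words, which is exactly the trade-off I am asking about.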