
Does the model vocabulary come from the train data and the test data? #21

Open
JDwangmo opened this issue Sep 3, 2016 · 0 comments

Comments


JDwangmo commented Sep 3, 2016

Hi Kim, thank you for sharing your code, but I have a question about your model. In your implementation (shown below), the embedding matrix contains all words from both the train set and the test set. However, I think it should contain only words from the train set, because in a real-world scenario you cannot see the test data in advance, and the test data may contain out-of-vocabulary (OOV) words that never appear in the train set. In static mode (CNN-static) this is not a problem, but in non-static mode (CNN-non-static), how do you handle the OOV problem, i.e., how do you update the embedding parameters of words that are not present in the model vocabulary? In short: for words that appear in the word2vec model but not in the original model vocabulary, how do you handle them? Sorry, my English is poor, so my wording may be unclear. Thank you.

import numpy as np

def get_W(word_vecs, k=300):
    """
    Get word matrix. W[i] is the vector for the word indexed by i.
    """
    vocab_size = len(word_vecs)
    word_idx_map = dict()
    # Row 0 is reserved for padding and stays all-zero.
    W = np.zeros(shape=(vocab_size + 1, k), dtype='float32')
    i = 1
    for word in word_vecs:
        W[i] = word_vecs[word]
        word_idx_map[word] = i
        i += 1
    return W, word_idx_map
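To make the question concrete, here is a minimal sketch of the alternative I have in mind: build the embedding matrix from training-set words only, and map any unseen test-time word to a single `<UNK>` row. The function names (`get_W_train_only`, `words_to_indices`) and the random-initialized `<UNK>` row are my own assumptions for illustration, not part of the original code.

```python
import numpy as np

def get_W_train_only(word_vecs, train_vocab, k=300, unk_scale=0.25):
    """Hypothetical variant of get_W: only training-set words get rows.

    Index 0 stays the all-zero padding row; one extra <UNK> row with
    small random values absorbs every word not seen during training.
    """
    vocab = [w for w in word_vecs if w in train_vocab]
    word_idx_map = {w: i for i, w in enumerate(vocab, start=1)}
    unk_idx = len(vocab) + 1
    W = np.zeros((len(vocab) + 2, k), dtype='float32')
    for w, i in word_idx_map.items():
        W[i] = word_vecs[w]
    # Assumed OOV handling: a single shared, randomly initialized row.
    W[unk_idx] = np.random.uniform(-unk_scale, unk_scale, k).astype('float32')
    return W, word_idx_map, unk_idx

def words_to_indices(words, word_idx_map, unk_idx):
    """Map a sentence to indices; OOV words fall back to the <UNK> row."""
    return [word_idx_map.get(w, unk_idx) for w in words]
```

With this setup, in non-static mode only the `<UNK>` row (not a per-word row) would be updated for unseen words, which is exactly the trade-off I am asking about.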