Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix FastText model reading of unsupported modes from Facebook's FastText #3222

Closed
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions gensim/models/fasttext.py
Original file line number Diff line number Diff line change
Expand Up @@ -807,6 +807,13 @@ def _load_fasttext_format(model_file, encoding='utf-8', full_model=True):
with utils.open(model_file, 'rb') as fin:
m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model)

# Here we are checking for unsupported FB FT Modes
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Here we are checking for unsupported FB FT Modes
# Check for unsupported Facebook FastText modes.

if m.loss != 1 and m.loss != 2:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if m.loss != 1 and m.loss != 2:
if m.loss not in (1, 2):

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to #3223, negative sampling is loss=0, not loss=2. Can you please double check? What's the correct value?

Copy link

@SagarDollin SagarDollin Apr 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @piskvorky,
I checked it. In the parameters docstring in the models/_fasttext_bin, that I have copied here:

"""Holds data loaded from the Facebook binary.
Parameters

.
.
neg : int
    If non-zero, indicates that the model uses negative sampling.
loss : int
    If equal to 1, indicates that the model uses hierarchical sampling.
model : int
    If equal to 2, indicates that the model uses skip-grams.
"""

This suggests that m.neg==0 means no negative sampling, and m.neg == n (where n is a positive integer) means negative sampling is being used.
So I think if we are checking for negative sampling, the condition should be if m.neg == 0 or if m.neg != 0 based on what we want to do. From what I have read from the #3223 I gather that negative sampling and hierarchical softmax are supported in loss. so we shouldn't really need to check for negative sampling i.e, if m.neg != 0 (not really required)

I have made a list of what's supported and what's not based on #3223 . Please let me know if they are right, ill make changes as per that:

Supported:
a. Loss

1. Negative Sampling
2. Hierarchical Softmax

b. Model

1. CBOW
2. Skip-gram

Unsupported:
a. Loss:

  1. OneVsAll_Loss

b. Model:

  1. Supervised
    Let me know if there are any inconsistencies in my understanding. I too want to help you complete this asap.

Copy link
Owner

@piskvorky piskvorky Apr 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't really understand that list. If the check is == 0 or != 0, why would we check for == 1 and == 2?

I don't know the fasttext parameters, so don't ask me :) Can you please check the fasttext docs or code (not gensim) for what the parameters actually are? And then make changes here to match that. Thanks.

Copy link

@SagarDollin SagarDollin Apr 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't really understand that list. If the check is == 0 or != 0, why would we check for == 1 and == 2?

Hi yes you are right. We shouldn't be checking ==1 or ==2 in this case, also if we are checking for negative sampling we must not, use m.loss at all we must check for m.neg. (Just to clarify the above ==1 and ==2 was not done by me, this PR was created by @Tewatia5355, Im just trying to figure out what are the losses and models not supported in fasttext. )

I don't know the fasttext parameters, so don't ask me :) Can you please check the fasttext docs or code (not gensim) for what the parameters actually are? And then make changes here to match that. Thanks.

Hi, yes sure I can investigate that and ill create a PR accordingly. Also if you can provide the link for fasttext docs that would be great. (just to be sure that I'm referring the right docs). Thanks!

Copy link
Owner

@piskvorky piskvorky Apr 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, please do. This PR seems incorrect so I'm removing the "Next release" tag – we need to fix it first, make the code correct.

Fasttext code is at https://github.com/facebookresearch/fastText.

Thanks!

raise ValueError("Loss paramter value can be either 1 (for Hierarchical Softmax) or 2 (for Negative Sampling)")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raise ValueError("Loss paramter value can be either 1 (for Hierarchical Softmax) or 2 (for Negative Sampling)")
raise ValueError("The fasttext `loss` parameter must be either 1 (for Hierarchical Softmax) or 2 (for Negative Sampling)")

elif m.model != 1 and m.model != 2:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
elif m.model != 1 and m.model != 2:
elif m.model not in (1, 2):

raise ValueError(
"Model parameter value can be either 1 (for Continous Bag of Words model) or 2 (for Skip-gram model)")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Model parameter value can be either 1 (for Continous Bag of Words model) or 2 (for Skip-gram model)")
"The fasttext `model` parameter must be either 1 (for Continuous Bag of Words model) or 2 (for Skip-gram model)")


model = FastText(
vector_size=m.dim,
window=m.ws,
Expand Down
Loading