Strange statistics #1

Open
larsmans opened this issue May 26, 2015 · 4 comments

Comments

@larsmans
Contributor

Reported by @dodijk: training on NL and inputting "UvA" produces

[{
        "target":"Universiteit van Amsterdam",
        "ngramcount":22737,
        "linkcount":0,
        "commonness":1,
        "senseprob":0.0032546070281919337,
        "offset":0,
        "length":3
}]

Old semanticizer produced

{
        text: "UvA",
        links: [{
                linkOccCount: 79,
                text: "UvA",
                linkProbability: 0.2608695652173913,
                linkDocCount: 72,
                occCount: 375,
                id: "14815",
                senseProbability: 0.2608695652173913,
                senseOccCount: 79,
                title: "Universiteit van Amsterdam",
                url: "http://nl.wikipedia.org/wiki/Universiteit%20van%20Amsterdam",
                label: "UvA",
                senseDocCount: 72,
                priorProbability: 1,
                docCount: 276
        }],
        request_id: "9268bbee-e81e-463e-b155-3a8db256d171"
}

The ngramcount is probably too high, the sense probability too low.

@larsmans
Contributor Author

... and the link count should not be zero, or we shouldn't get that result at all.

Trivially fixed in dcf9186.

@larsmans
Contributor Author

Since the sense probability is derived from the ngramcount, the latter is probably off. Might be collisions in the count-min sketch.
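
To illustrate the collision hypothesis: a count-min sketch can only overestimate a count, because every counter is shared by all keys that hash to it. The following is a minimal Python sketch of that effect, not the semanticizer's actual implementation. The class `CountMinSketch`, the BLAKE2-based hashing, the table sizes, and the synthetic n-grams are all assumptions made for illustration; only the figure 375 (`occCount` in the old output) comes from this thread.

```python
import hashlib


class CountMinSketch:
    """Toy count-min sketch: `rows` independent hash rows of `cols` counters."""

    def __init__(self, rows, cols):
        self.rows = rows
        self.cols = cols
        self.table = [[0] * cols for _ in range(rows)]

    def _index(self, row, key):
        # Salt the hash with the row number to get one hash function per row.
        digest = hashlib.blake2b(
            key.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.cols

    def add(self, key, count=1):
        for r in range(self.rows):
            self.table[r][self._index(r, key)] += count

    def get(self, key):
        # The minimum over rows is the least-contaminated counter,
        # but it still includes whatever collided with the key in that row.
        return min(self.table[r][self._index(r, key)] for r in range(self.rows))


def fill(sketch, true_count):
    # One n-gram of interest plus many unrelated (synthetic) n-grams.
    sketch.add("UvA", true_count)
    for i in range(5000):
        sketch.add(f"ngram-{i}")


true_count = 375  # occCount reported by the old semanticizer

small = CountMinSketch(rows=2, cols=16)
large = CountMinSketch(rows=2, cols=1 << 16)
fill(small, true_count)
fill(large, true_count)

small_est = small.get("UvA")
large_est = large.get("UvA")
print("true count:", true_count)
print("estimate, 16 columns:", small_est)
print("estimate, 65536 columns:", large_est)
```

The estimate never falls below the true count, and a wider table brings it closer. If the inflated ngramcount were caused only by collisions, enlarging the sketch should therefore pull it down substantially.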

@larsmans
Contributor Author

larsmans commented Nov 5, 2015

Branch fix-collision gives

[
  {
    "target": "Universiteit van Amsterdam",
    "ngramcount": 16244,
    "linkcount": 69,
    "commonness": 1,
    "senseprob": 0.004247722235902487,
    "offset": 0,
    "length": 3
  }
]

Still too high...
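
As a sanity check on the numbers above, the reported values are consistent with `senseprob` being `linkcount / ngramcount`, while the old semanticizer's `senseProbability` matches `senseDocCount / docCount`. That reading is inferred from the arithmetic only, not confirmed from the code, and the dictionary names below are just for illustration:

```python
import math

# Values copied from the fix-collision output in this thread.
new = {"ngramcount": 16244, "linkcount": 69, "senseprob": 0.004247722235902487}

# Values copied from the old semanticizer output in this thread.
old = {
    "senseDocCount": 72,
    "docCount": 276,
    "senseProbability": 0.2608695652173913,
}

ratio_new = new["linkcount"] / new["ngramcount"]
ratio_old = old["senseDocCount"] / old["docCount"]

print("new linkcount / ngramcount:", ratio_new)
print("old senseDocCount / docCount:", ratio_old)

# N-gram count that would be needed for the new link count to reproduce
# the old sense probability (assumes the two ratios are comparable at all).
implied_ngramcount = new["linkcount"] / old["senseProbability"]
print("implied ngramcount:", implied_ngramcount)
```

Under that assumption, an ngramcount in the low hundreds would be needed to match the old probability, so 16244 is off by well over an order of magnitude. Note that the old figure is based on document counts, so the two ratios may not be directly comparable.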

@larsmans
Contributor Author

larsmans commented Nov 5, 2015

The ngramcount is 15496 with a larger count-min sketch...
