Dates are not parsed correctly (nob) ( #10

Open
albbas opened this issue Sep 27, 2017 · 6 comments
Labels
enhancement New feature or request low priority

Comments

@albbas
Contributor

albbas commented Sep 27, 2017

This issue was created automatically with bugzilla2github

Bugzilla Bug 2427

Date: 2017-09-27T14:10:44+02:00
From: Børre Gaup <<borre.gaup>>
To: Trond Trosterud <<trond.trosterud>>
CC: ciprian.gerstenberger, lene.antonsen

Blocker for: #2405
Last updated: 2017-10-04T16:44:47+02:00

@albbas
Contributor Author

albbas commented Sep 27, 2017

Comment 12611

Date: 2017-09-27 14:10:44 +0200
From: Børre Gaup <<borre.gaup>>

nob:
echo "har den 4.6.2014 oppnevnt følgende" |hfst-tokenise --segment --print-all $GTHOME/langs/nob/tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
har

den
4
.
6
.
2014
oppnevnt

følgende

sme:
echo "har den 4.6.2014 oppnevnt følgende" |hfst-tokenise --print-all --segment $GTHOME/langs/sme/tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
har

den

4.6.2014

oppnevnt

følgende

@albbas
Contributor Author

albbas commented Oct 4, 2017

Comment 12629

Date: 2017-10-04 10:56:53 +0200
From: Trond Trosterud <<trond.trosterud>>

It works for "4.6. 2014" but not for "4.6.2014."

I will look at the FST.

4.6.2014
""
"den" Det Dem Sg MF <W:0.0000000000>
"den" Pron Pers Sg3 <W:0.0000000000>
: 4
"<.>"
"." CLB <W:0.0000000000>
:6
"<.>"
"." CLB <W:0.0000000000>
:2014

"<23.3.>"
"23.3" A"+Ord" <W:0.0000000000>
"<>"
V Imp <W:0.0000000000>
V Inf <W:0.0000000000>
"<" PUNCT LEFT <W:0.0000000000>
">" PUNCT LEFT <W:0.0000000000>
"«" PUNCT LEFT <W:0.0000000000>
"»" PUNCT RIGHT <W:0.0000000000>
: 1995

@albbas
Contributor Author

albbas commented Oct 4, 2017

Comment 12630

Date: 2017-10-04 11:36:54 +0200
From: Trond Trosterud <<trond.trosterud>>

I cannot reproduce this with --segment --print-all.
With those parameters, sma and nob behave identically:

tf-hsl-m0016:nob ttr000$ e "2.2.1234."|hfst-tokenise --segment --print-all tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
2.2.1234.

tf-hsl-m0016:sma ttr000$ e "2.2.1234."|hfst-tokenise --segment --print-all tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
2.2.1234.

There is, however, a difference with --giella-cg:

tf-hsl-m0016:sma ttr000$ e "2.2.1234."|hfst-tokenise --giella-cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
"<2.2.1234.>"
"2.2.1234" A Ord Attr <W:0.0000000000>
:\n
tf-hsl-m0016:sma ttr000$ cd ../nob
tf-hsl-m0016:nob ttr000$ e "2.2.1234."|hfst-tokenise --giella-cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
"<2.2.1234.>"
"2.2.1234" A"+Ord" <W:0.0000000000>
"<>"
V Imp <W:0.0000000000>
V Inf <W:0.0000000000>
"<" PUNCT LEFT <W:0.0000000000>
">" PUNCT LEFT <W:0.0000000000>
"«" PUNCT LEFT <W:0.0000000000>
"»" PUNCT RIGHT <W:0.0000000000>
:\n
"<>"
V Imp <W:0.0000000000>
V Inf <W:0.0000000000>
"<" PUNCT LEFT <W:0.0000000000>
">" PUNCT LEFT <W:0.0000000000>
"«" PUNCT LEFT <W:0.0000000000>
"»" PUNCT RIGHT <W:0.0000000000>

I can change nob so that it behaves like sma (all digit-period-digit combinations are accepted), but the problem with --segment remains unreproducible on my side.
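
To make the intended rule concrete, here is a minimal Python sketch of "digit-period-digit combinations stay together". It is only an illustration: the regular expression and the `tokenise` helper are hypothetical and are not the lexc/FST source, which handles far more than this.

```python
import re

# Hypothetical token pattern, tried left to right:
#   1. digit groups joined by single periods, e.g. "4.6.2014"
#   2. any other run of word characters
#   3. a single punctuation character
TOKEN = re.compile(r"\d+(?:\.\d+)*|\w+|[^\w\s]")


def tokenise(text):
    """Split text into tokens, keeping digit-period-digit runs in one token."""
    return TOKEN.findall(text)


print(tokenise("har den 4.6.2014 oppnevnt følgende"))
print(tokenise("2.2.1234."))
```

In this sketch a period is attached to a number only when more digits follow it, so a sentence-final period after a date becomes its own token.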

@albbas
Contributor Author

albbas commented Oct 4, 2017

Comment 12631

Date: 2017-10-04 15:36:29 +0200
From: Børre Gaup <<borre.gaup>>

(In reply to Trond Trosterud from comment #2)

I cannot reproduce this with --segment --print-all.
With those parameters, sma and nob behave identically:

tf-hsl-m0016:nob ttr000$ e "2.2.1234."|hfst-tokenise --segment --print-all tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
2.2.1234.

I get the same result as you with the same input, but … if you remove the period after the date, this happens:

nob $ echo "2.2.1234"|hfst-tokenise --segment --print-all tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
2
.
2
.
1234

@albbas
Contributor Author

albbas commented Oct 4, 2017

Comment 12633

Date: 2017-10-04 16:02:08 +0200
From: Trond Trosterud <<trond.trosterud>>

Aha, thanks. I get the same result as you. ==> I will change the nob FST so that it behaves like sma.

@albbas
Contributor Author

albbas commented Oct 4, 2017

Comment 12639

Date: 2017-10-04 16:44:47 +0200
From: Trond Trosterud <<trond.trosterud>>

tf-hsl-m0016:nob ttr000$ svn ci -m "Oppdatering for bug #2427: Same handsaming av numeralia som for sørsamisk" src/morphology/stems/numerals.lexc

tf-hsl-m0016:nob ttr000$ e "2.2.1234"|hfst-tokenise --segment --print-all tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
2.2.1234
tf-hsl-m0016:nob ttr000$ e "2.2.1234."|hfst-tokenise --segment --print-all tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
2.2.1234
.

tf-hsl-m0016:nob ttr000$ e "2.2.1234"|hfst-tokenise --segment --print-all ../sma/tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
2.2.1234
tf-hsl-m0016:nob ttr000$ e "2.2.1234."|hfst-tokenise --segment --print-all ../sma/tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
2.2.1234.

Let's see if this helps.
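
For anyone who wants to recheck this later, the following Python sketch tests whether a date survives as a single token in captured `--segment --print-all` output. It assumes the output has one token per line, with blank lines where the input had spaces, as in the pasted sessions in this thread. Both helper names are made up for illustration.

```python
def tokens_from_segment_output(output):
    """Return the non-empty lines of the tokeniser output as a list of tokens."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def date_is_single_token(output, date):
    """True if the date appears as one whole token in the output."""
    return date in tokens_from_segment_output(output)


# Output copied from the sessions above.
before_fix = "2\n.\n2\n.\n1234\n"
after_fix = "2.2.1234\n.\n"

print(date_is_single_token(before_fix, "2.2.1234"))
print(date_is_single_token(after_fix, "2.2.1234"))
```

The check looks for an exact token match, so the sma output `2.2.1234.` (period included in the token) would not count as a match for `2.2.1234`.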
