Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse fails when files have multiple timestamps for single line of text #40

Open
1hitsong opened this issue Mar 7, 2023 · 35 comments
Open
Labels
bug Something isn't working

Comments

@1hitsong
Copy link

1hitsong commented Mar 7, 2023

When attempting to parse a lyric file that has the following line, the parser failed.

[00:51.00][01:29.99][01:48.29][02:31.00][02:41.99]You gotta fight !

The expected outcome was for the parser to use the text "You gotta fight !" at each of the five time stamps.

@andy840119
Copy link
Member

andy840119 commented Mar 8, 2023

Thanks for report the issue!
This issue will be fixed in the #41 and make a new release soon 🎉 .

@andy840119 andy840119 added the bug Something isn't working label Mar 8, 2023
@1hitsong
Copy link
Author

1hitsong commented Mar 8, 2023

Awesome! Thanks for hopping on this so quickly

Looking at the PR for the fix, would this result in the output being option 1 or option 2?

Option 1
[00:51.00]You gotta fight !

or Option 2
[00:51.00]You gotta fight !
[01:29.99]You gotta fight !
[01:48.29]You gotta fight !
[02:31.00]You gotta fight !
[02:41.99]You gotta fight !

@andy840119
Copy link
Member

The result will be the option 1.

@andy840119
Copy link
Member

And note that new NuGet package has been released!

see:
https://www.nuget.org/packages/LrcParser

@1hitsong
Copy link
Author

1hitsong commented Mar 8, 2023

The result will be the option 1.

Cool. Just making sure I'm following.

I've seen a lot of sites that use the multiple timestamps in a way that means option 2, but If that's not to the LRC standard, then option 1 makes the most sense.

@andy840119
Copy link
Member

Guess this issue is fixed 🚀

@Shadowghost
Copy link

Shadowghost commented May 3, 2024

@andy840119 in my opinion this is not the correct fix for the issue.
Implementing Option 2 mentioned above would make the lib adhere to the basic LRC format.
Multiple timestamps on the same line should result in that line appearing on those timestamp. Dropping all except the first timestamp removes quite any chorus lyrics from songs after the first occurrence.

@andy840119
Copy link
Member

andy840119 commented May 3, 2024

https://en.wikipedia.org/wiki/LRC_(file_format)
Seems there's two usage for .lrc format

One is for sceeen text:

[00:21.10][00:45.10]Repeating lyrics (e.g. chorus)

Another one is for karaoke format:

[mm:ss.xx] <mm:ss.xx> line 1 word 1 <mm:ss.xx> line 1 word 2 <mm:ss.xx> ... line 1 last word <mm:ss.xx>

But (as i remember) some software might export the .lrc with those format as well:

[mm:ss.xx] [mm:ss.xx] line 1 word 1 [mm:ss.xx] line 1 word 2 [mm:ss.xx] ... line 1 last word [mm:ss.xx]

Also, did you have idea for handling those cases?
And will you willing to make the PR for that because currently I'm focusing on another project😢
Might have no enough time to solve this issue.

https://suwa.pupu.jp/RhythmicaLyrics.html
This is the software I made the .lrc file before.

@andy840119 andy840119 reopened this May 3, 2024
@Shadowghost
Copy link

I already took a look at the codebase and it likely requires some API changes. I'll try to come up with a PR over the weekend.

@andy840119
Copy link
Member

andy840119 commented May 3, 2024

I already took a look at the codebase and it likely requires some API changes. I'll try to come up with a PR over the weekend.

I think in this case it should not repeat:

[mm:ss.xx][mm:ss.xx] line 1 word 1 [mm:ss.xx] line 1 word 2 [mm:ss.xx] ... line 1 last word [mm:ss.xx]

Did you have any idea for handle this case?
Or can you describe how will you do for this case in order to reduce the review time?

Api change is OK for me, but would be better to explain why 👌🏼

@Shadowghost
Copy link

This is taking abit longer than expected, but what I did up until now:

  • Move the timestamp into TextIndex since the current structure does not allow multipletimestamps for the same starting index (we are using a dict)
  • Add handling for multi occurrence.

Looking at your post above I've got a question:
According to wikipedia karaoke format is like this (duration infront of word)

[mm:ss.xx] <mm:ss.xx> line 1 word 1 <mm:ss.xx> line 1 word 2 <mm:ss.xx> ... line 1 last word

But according to your example it can also be this:

[mm:ss.xx] <mm:ss.xx> line 1 word 1 <mm:ss.xx> line 1 word 2 <mm:ss.xx> ... line 1 last word <mm:ss.xx>

What are the timestamps in this case for?

@Chaphasilor
Copy link

Just wanted to chime it that I had a user whose files contained lines in the format of [ms:ss.xxx]text, so with three decimals for milliseconds and no space between timestamp and text. They were not parsed correctly with Jellyfin 10.9, and getting rid of the third decimal place fixed the issue.
I'm guessing this is also an issue with this library, and could easily be fixed as part of these improvements :)

@andy840119
Copy link
Member

andy840119 commented May 9, 2024

https://github.com/karaoke-dev/sample-beatmap/blob/master/v1/%E6%9E%AF%E3%82%8C%E3%81%AA%E3%81%84%E8%8A%B1/lyric.lrc
@Shadowghost this is the sample .lrc created by RhythmicaLyrics, a tool to create karaoke format .lrc file.

I' not sure if official .lrc format allows [mm:ss.xx] at the end of the text, but you can see the japanese karoake song and last time at the end of the lyric might means it's the time blue area covers the whole lyric.

@Shadowghost
Copy link

@Chaphasilor already took care of that by extending the regex.

@andy840119 seems like we need to differentiate between these layouts:

  1. Core LRC: Single timestamp at the beginning
[mm:ss.xx] text
  1. Core LRC: Multiple timestamps at the beginning (resulting in multiple appearances of the line)
[mm:ss.xx][mm:ss.xx][mm:ss.xx] text
  1. Enhanced LRC (A2): Single timestamp at the beginning, followed by others denoted by <...> - within them is the DURATION, therefore NO need for a timestamp at the end
[mm:ss.xx] <mm:ss.xx> line 1 word 1 <mm:ss.xx> line 1 word 2 <mm:ss.xx> ... line 1 last word
  1. Enhanced LRC (RythmicaLyrics): Single timestamp at the beginning, followed by others denoted by [...] - within them are ABSOLUTE timestamps, therfore we need an end timestamp to know where the displaying of the last character ends
[mm:ss.xx]line 1 word 1 [mm:ss.xx] line 1 word 2 [mm:ss.xx] ... line 1 last word [mm:ss.xx]

My plan is to convert all of these into the following object:

TextIndex:
    public int Index { get; }
    public int? Time { get; }
    public IndexState State { get; }

IndexState:
    Start
    End

Always using the full timestamp for time and IndexState Start, for cases 1 and 2 we have no entry with IndexState End, for cases 3 and 4 we have

@Shadowghost
Copy link

Shadowghost commented May 9, 2024

Logic for knowing which of the above we are dealing with (based on regexing one line a time):

  • Regex only returns one result -> case 1
  • Regex returns multiple results and their concatenated length is the same as the index of the end of the last match -> case 2
  • Regex returns multiple results and the above does not apply -> case 4
  • Case 1 applies and we check the resulting text for <> timestamps -> case 3

@andy840119
Copy link
Member

@Shadowghost

image
Finally I got some times to check what's happening in the RythmicaLyrics
This tool have two way to generate the file:

  1. save as(A)
  2. export(O)

image
If press the save button, it can support save as different type of format, but the contents are the same:

[mm:ss.xx]line 1 word 1 [mm:ss.xx] line 1 word 2 [mm:ss.xx] ... line 1 last word [mm:ss.xx]

image
And the save format match to the export as text2ass with .kar extension.

The conclusion:

  1. RythmicaLyrics does not actually support export the .lrc format,they only support another file format (.kar)
  2. I think should be better to rename the origin LrcParser into KarParser, and make new parser for the Core LRC and Enhanced LRC (A2)

@andy840119
Copy link
Member

Maybe it's a good start:

  1. Create a first PR to create the KarParser(just copy the conten/tests in the LrcParser)
  2. Create another PR to adjust the LrcParser, let it fully matched the Core LRC and Enhanced LRC (A2) spec.

@Shadowghost
Copy link

Fine for me, will do so.
One last question: is the ruby tag stuff also KOR specific?

@andy840119
Copy link
Member

There's not much documentation about .kar file format and maybe it's not as popular as .lrc file

I think is's OK to treat it as official spec
And ruby tag might not supported in the .lrc file

@Reapor-Yurnero
Copy link

Maybe it's a good start:

  1. Create a first PR to create the KarParser(just copy the conten/tests in the LrcParser)
  2. Create another PR to adjust the LrcParser, let it fully matched the Core LRC and Enhanced LRC (A2) spec.

Since quite a few other projects are relying on this project to parse LRC (as its name indicates), it makes sense to maintain the functionality of LRC parsing in this lib and move the kar part to a seperate one like .

BTW I was not able to find a PR on this created by @Shadowghost and wondering what's the current status of it. A service I'm using is relying on the above discussed items to be incorporated and would love to help!

@andy840119
Copy link
Member

hmmmm....
Guess i need to reserve some time on this library.

@Shadowghost are you still interested on fixing this issue? I'll take over this issue if you have no time to fix it ;_;

@Shadowghost
Copy link

Sorry, I've been busy over the last weeks, I can push what I have if you want to take over.

@andy840119
Copy link
Member

Sorry, I've been busy over the last weeks, I can push what I have if you want to take over.

Nice, you can give me the branch and i can cherry-pick if possible to use 🙏🏼

@Reapor-Yurnero
Copy link

Reapor-Yurnero commented Jul 15, 2024

Great! Being able to handle A2 extensions (word by word timestamps) and support millesecond timestamps [mm:ss.xxx] or <mm:ss.xxx> in A2 (as mentioned also by @Chaphasilor) is the most desired feature from the downstream projects. Appreciate your help.

@Shadowghost
Copy link

@andy840119 there you go: https://github.com/Shadowghost/LrcParser/tree/extend
Sadly it is very WIP, lmk if you have questions.

andy840119 added a commit to andy840119/LrcParser that referenced this issue Jul 21, 2024
Because we need to adjust the lrc file format

see:
karaoke-dev#40
@andy840119
Copy link
Member

@1hitsong @Shadowghost @Reapor-Yurnero @Chaphasilor

Here's the new release follow the .lrc format in the wiki 🎉
https://www.nuget.org/packages/LrcParser

Only few changes and should be easy to adjust
https://github.com/karaoke-dev/LrcParser/blob/main/CHANGELOG.md

It's time to give it a try, report feedback is welcome ❤️

@Shadowghost
Copy link

@andy840119 thanks for the update, working fine to far on my local tests. I do have one question though: What was the reason for you to not re-use the TimeTags for multiple occurrences?

@andy840119
Copy link
Member

What was the reason for you to not re-use the TimeTags for multiple occurrences?

What does that means?
Using TimeTag class in the LrcLyric
Or Allow have multiple time-tags between lyric text?

@Shadowghost
Copy link

Shadowghost commented Jul 28, 2024

I think I misunderstood what TimeTags is for.
Why I'm mentioning it is this: Before it was possible to get LRC metadata tags like these:

[ti:Fortnight]
[ar:Taylor Swift/Post Malone]
[al:THE TORTURED POETS DEPARTMENT (Explicit)]
[offset:0]

by decoding the LRC (result would have all lines in the Lyrics object), ckecking the lines for TimeTags.Count == 0 and then parsing their text content. Now these lines are completely omitted and TimeTags.Count is always 0 if the LRC has timestamps. Only way to get them would be to parse the full LRC text.

We don't actually need them but it would be nice for completeness to have them available somewhere

@andy840119
Copy link
Member

andy840119 commented Jul 28, 2024

[ti:Fortnight]
[ar:Taylor Swift/Post Malone]
[al:THE TORTURED POETS DEPARTMENT (Explicit)]
[offset:0]

That's called metadata, and i guess should make another parser for it (e.g. LrcMatadataParser in the LrcParser/Parser/Lrc/Lines/)

ckecking the lines for TimeTags.Count == 0 and then parsing their text content.

I'm not sure what it's means (sorry for my bad english understanding), but TimeTags in the Lyric.cs still needed if lyric contains the A2 extension formant

e.g.

[00:13.34] <00:14.32> Don't <00:14.73> you <00:15.14> want <00:15.57> somebody <00:16.09> to <00:16.46> love

Lyric's start time will be [00:13.34]
And Lyric's first time-tag will be [00:13.34] + <00:14.32> in 2024.728.1

[update]
Wait, i found a bug in this version.
<mm:ss.xx> seems should be absolute time, not relative time Orz.

[update]
https://www.nuget.org/packages/LrcParser New version released (2024.728.2)

@Shadowghost
Copy link

I'm not sure what it's means (sorry for my bad english understanding), but TimeTags in the Lyric.cs still needed if lyric contains the A2 extension formant

Yes, that's what I mean with I misunderstood their usage 😅

That's called metadata, and i guess should make another parser for it (e.g. LrcMatadataParser in the LrcParser/Parser/Lrc/Lines/)

Alternatively you could parse it in the existing parser and add a Metadata object to the parser result in addition to Lyrics - would preent us from reading the file two times if we want both.

@Reapor-Yurnero
Copy link

Personally I don't think it's very necessary to parse metadata in the lyrics. We can show them in the lyrics text but we don't need to parse it. Metadata should in the end be embedded in the music file itself rather than being double embedded in the lyrics.

@andy840119
Copy link
Member

Personally I don't think it's very necessary to parse metadata in the lyrics. We can show them in the lyrics text but we don't need to parse it.

I think the next step might let LrcParser able to prese metadata, for prevent treating metadata as lyric text.
As your mention, song info is able to get from the music file itself, so there's no need to show in the lyric again.

@Reapor-Yurnero
Copy link

In fact, i think it does have its merit to have some metadata shown in the lyrics such as composer, lyrics writer etc. which are typically not part of the music metadata itself.

We may just skip these none timestamp brackets and do not parse them at all.

@Shadowghost
Copy link

Personally I don't think it's very necessary to parse metadata in the lyrics. We can show them in the lyrics text but we don't need to parse it. Metadata should in the end be embedded in the music file itself rather than being double embedded in the lyrics.

As your mention, song info is able to get from the music file itself, so there's no need to show in the lyric again.

Usually lyrics are not shared with the associated media file due to copyrights. Adding the metadata to the music file is therefore not really a valid argument.
I'd even argue that lyrics without such metadata are kind of useless if different edits of a song exist.

We may just skip these none timestamp brackets and do not parse them at all.

That is already the case for synced lyrics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants