-
-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix parsing of link destinations that look like code
or <html>
#137
Conversation
6384ae9
to
96ef383
Compare
7ccb68c
to
f8b163e
Compare
@jgm other than the merge conflict in regression.md (a simple rebase), is there anything blocking this fix? |
This probably just arrived at a busy time; I will look at. It would help if you could rebase. |
f8b163e
to
abfc152
Compare
Fixes jgm#136 This works by re-parsing the tokens that come after the link, but only when the end delimiter isn't on a chunk boundary (since that's the only way this problem can happen). Re-parsing a specific chunk won't work, because the part that needs re-interpreted can span more than one chunk. For example, we can draw the bounds of the erroneous code chunk in this example: [x](`) <a href="`"> ^-----------^ If we re-parse the underlined part in isolation, we'll fix the first link, but won't find the HTML (since the closing angle bracket is in the next chunk). On the other hand, parsing links, code, and HTML in a single pass would make writing extensions more complicated. For example, LaTeX math is supposed to have the same binding strength as code spans: $first[$](about) ^------^ this is a math span, not a link [first]($)$5/8$ ^-^ this is an analogue of the original bug it shouldn't be a math span, but looks like one
abfc152
to
fe3bee4
Compare
Okay, it’s rebased. |
Any feedback on this PR? |
Sorry, I was on vacation when this posted and it got ignored. I'll try to have a look in the near future! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to do the job, but the code is pretty complex and it's hard to survey. I suggest adding some comments to explain what is going on, and maybe using intermediate definitions to make the code clearer. That will make it easier to maintain this code in the future.
commonmark/src/Commonmark/Inlines.hs
Outdated
let go chunks toks' bottoms = do | ||
chunks' <- {-# SCC parseChunks #-} evalStateT | ||
(parseChunks bracketedSpecs formattingSpecs ilParsers | ||
attrParser rm toks') defaultEnders | ||
case chunks' of | ||
Left err -> return $ Left err | ||
Right chunks'' -> | ||
case (processBrackets bracketedSpecs rm (chunks ++ chunks'') bottoms) of | ||
Left st -> go ((reverse . befores . rightCursor) st) (mconcat (map chunkToks $ (maybeToList . center $ rightCursor st) ++ (afters $ rightCursor st))) (stackBottoms st) | ||
Right chunks''' -> return $ Right chunks''' | ||
res <- go [] ((dropWhile iswhite . reverse . dropWhile iswhite . reverse) toks) mempty | ||
return $! | ||
case res of | ||
Left err -> Left err | ||
Right chunks -> | ||
(Right . | ||
unChunks . | ||
processEmphasis . | ||
processBrackets bracketedSpecs rm) chunks | ||
case res of | ||
Left err -> Left err | ||
Right chunks -> | ||
(Right . unChunks . processEmphasis) chunks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has gotten pretty complex, and I have a hard time following what is going on. I'm afraid I'll be at a loss if I ever need to modify this part of the code again. Can it be made a bit more user-friendly, with some comments and simplifying definitions?
processBrackets :: IsInline a | ||
=> [BracketedSpec a] -> ReferenceMap -> [Chunk a] -> [Chunk a] | ||
processBrackets bracketedSpecs rm xs = | ||
=> [BracketedSpec a] -> ReferenceMap -> [Chunk a] -> M.Map Text SourcePos -> Either (DState a) [Chunk a] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type has gotten more complex; it would be worth explaining in a comment what the significance of Left and Right results is, and maybe why the bottoms map is needed.
=> [BracketedSpec a] -> DState a -> [Chunk a] | ||
=> [BracketedSpec a] -> DState a -> Either (DState a) [Chunk a] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth a comment at the beginning explaining what Left and Right results mean here.
Do the new comments and variable names help? |
Yes, this looks great, thanks! |
stack bench
output:Before
After
Fixes #136
This works by re-parsing the tokens that come after the link, but only when the end delimiter isn't on a chunk boundary (since that's the only way this problem can happen).
Re-parsing a specific chunk won't work, because the part that needs re-interpreted can span more than one chunk. For example, we can draw the bounds of the erroneous code chunk in this example:
If we re-parse the underlined part in isolation, we'll fix the first link, but won't find the HTML (since the closing angle bracket is in the next chunk).
On the other hand, parsing links, code, and HTML in a single pass would make writing extensions more complicated. For example, LaTeX math is supposed to have the same binding strength as code spans: