DOCX reader discards figure caption (regression) #9610

Open
frederik opened this issue Mar 27, 2024 · 3 comments
@frederik

Problem description

The latest pandoc release, 3.1.12.3 (and possibly 3.1.12.2), seems to drop figure captions when reading DOCX files.

pandoc captions.docx -o test.json

Latest known version where it works: 3.1.12.1

Reproduction

I am attaching a docx and two json outputs from: pandoc 3.1.12.1 and 3.1.12.3

Here's the diff between the two outputs (only the caption paragraph is missing):

115,151d114
<         },
<         {
<             "t": "Para",
<             "c": [
<                 {
<                     "t": "Str",
<                     "c": "Figure"
<                 },
<                 {
<                     "t": "Space"
<                 },
<                 {
<                     "t": "Str",
<                     "c": "1"
<                 },
<                 {
<                     "t": "Space"
<                 },
<                 {
<                     "t": "Str",
<                     "c": "A"
<                 },
<                 {
<                     "t": "Space"
<                 },
<                 {
<                     "t": "Str",
<                     "c": "figure"
<                 },
<                 {
<                     "t": "Space"
<                 },
<                 {
<                     "t": "Str",
<                     "c": "caption"
<                 }
<             ]

caption.docx
test-latest.json
test-3-1-11.json
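
For anyone who wants to check their own pandoc version without diffing by hand, here is a minimal sketch that looks for a caption paragraph in pandoc's JSON output. The helper names (`inline_text`, `paragraphs`, `has_paragraph`) are made up for illustration, and it only handles the `Str`/`Space`/`SoftBreak` inlines that appear in the diff above, so treat it as a rough check rather than a complete AST walker.

```python
import json

def inline_text(node):
    """Recursively collect plain text from a fragment of pandoc's JSON AST."""
    if isinstance(node, list):
        return "".join(inline_text(n) for n in node)
    if isinstance(node, dict):
        tag = node.get("t")
        if tag == "Str":
            return node["c"]
        if tag in ("Space", "SoftBreak"):
            return " "
        # Other elements: descend into their contents, if any.
        return inline_text(node.get("c", []))
    return ""  # bare strings/numbers (attributes, URLs) are ignored

def paragraphs(doc):
    """Yield the text of each top-level Para block."""
    for block in doc["blocks"]:
        if block["t"] == "Para":
            yield inline_text(block["c"])

def has_paragraph(doc, text):
    """True if some top-level paragraph has exactly this text."""
    return any(p == text for p in paragraphs(doc))

# Usage (assumes test.json was produced by the pandoc command above):
#   with open("test.json") as f:
#       print(has_paragraph(json.load(f), "Figure 1 A figure caption"))
```

With the two attached outputs, this should report the caption as present for the older version and absent for the newer one, if the diff above is the whole difference.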

@frederik added the bug label Mar 27, 2024
@jgm
Owner

jgm commented Mar 27, 2024

Probably due to one of these items from the 3.1.12.2 changelog:

  * Docx reader:

    + Ensure that table captions are counted (#9518).
    + Detect caption by style name not id (#9518).
      The styleId can change depending on the localization.
    + Avoid emitting empty paragraph where caption was.

@jgm
Owner

jgm commented Mar 27, 2024

Here's what I'm seeing:

% pandoc ~/Downloads/caption.docx -t native
[ Para
    [ Image
        ( ""
        , []
        , [ ( "width" , "6.268055555555556in" )
          , ( "height" , "2.5944444444444446in" )
          ]
        )
        [ Str "A"
        , Space
        , Str "screenshot"
        , Space
        , Str "of"
        , Space
        , Str "a"
        , Space
        , Str "web"
        , Space
        , Str "page"
        , SoftBreak
        , Str "Description"
        , Space
        , Str "automatically"
        , Space
        , Str "generated"
        ]
        ( "media/image1.png" , "" )
    ]
]

This is not being parsed as a Figure at all, which is a separate issue perhaps.

@jgm
Owner

jgm commented Mar 27, 2024

caption.docx has

<w:pStyle w:val="Caption"/>

then in doc.styles:

<w:style w:type="paragraph" w:styleId="Caption"><w:name w:val="caption"/>

so the style name is "caption" (while the styleId is "Caption").

Note that changing the styleId to "ImageCaption" allows the caption to be parsed as a regular paragraph.
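
To inspect the styleId/name pairing in other documents, a small sketch along these lines may help. It assumes the style definitions live in `word/styles.xml` inside the .docx archive (the usual location); the function name `paragraph_style_names` is made up for illustration and is not part of pandoc.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in the snippets above).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def paragraph_style_names(docx):
    """Map each paragraph styleId to the value of its w:name element.

    `docx` is a path or a file-like object. Assumes styles are stored
    in word/styles.xml inside the archive.
    """
    with zipfile.ZipFile(docx) as z:
        root = ET.fromstring(z.read("word/styles.xml"))
    names = {}
    for style in root.iter(f"{{{W}}}style"):
        if style.get(f"{{{W}}}type") != "paragraph":
            continue  # skip character, table and numbering styles
        name = style.find(f"{{{W}}}name")
        if name is not None:
            names[style.get(f"{{{W}}}styleId")] = name.get(f"{{{W}}}val")
    return names

# Usage: print(paragraph_style_names("caption.docx"))
```

For the attached file, the mapping should include an entry from "Caption" to "caption", matching the XML quoted above.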

So, I think what is going on is this: Pandoc identifies paragraphs with style 'caption' as table captions. They are not emitted as regular paragraphs, but because we do not at this point have special handling for figures with captions, the result is that a figure's caption gets dropped altogether.
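
The behaviour described here can be mimicked with a toy model. This is not pandoc's actual Haskell code; the tuple layout, the `toy_reader` name, and the rule that a caption attaches only to the next table are all simplifying assumptions made for illustration.

```python
def toy_reader(blocks):
    """Toy model of the caption handling described above (not pandoc's code).

    `blocks` is a list of (kind, style_name, text) tuples, where kind is
    "para", "table" or "image". Returns a list of (element, text) tuples.
    """
    out = []
    pending = None  # caption text held back, waiting for a table to claim it
    for kind, style, text in blocks:
        if kind == "para" and style.lower() == "caption":
            # Treated as a table caption: withheld from normal output.
            pending = text
        elif kind == "table":
            out.append(("Table", pending))
            pending = None
        else:
            out.append((kind.capitalize(), text))
            pending = None  # no table claimed the caption, so it is lost
    return out

# An image followed by a caption-styled paragraph: the caption vanishes.
print(toy_reader([("image", "", "media/image1.png"),
                  ("para", "caption", "Figure 1 A figure caption")]))
```

In this model the only element that ever consumes a held-back caption is a table, which is why a caption next to an image never reaches the output.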

Obviously not a great situation, but the fix would involve proper support for captioned images as Figure elements, which we've never had.
