[WIP] Parse content-line as text instead of just .* #27

schoettl · 2021-05-13T08:19:45Z

The goal of this PR is to enable the EBNF parser to parse details in the content after a headline, e.g. links, timestamps, …

Closes #26
Closes #30

Update parser tests
Update transformers to return :raw along with :ast
Update transformer tests
Update core tests

schoettl · 2021-05-13T11:17:19Z

Hi @branch14 ,

I think we have to update your transformers.

The transformed parse result should be something like this, right?

4e4dc58

  [[:headline {:level 1, :title "hello world"}]
   [:content
     [:raw "this is the first section\n\n"]
     [:parsed
       [:content-line [:text [:text-normal "this is the first section"]]]
       [:empty-line]]
   [:headline {:level 2, :title "and this"}]
   [:content
    [:raw "\nis another section\n"]
    [:parsed
      [:empty-line]
      [:content-line [:text [:text-normal "is another section"]]]]

https://github.com/schoettl/org-parser/blob/feat/parse-text/test/org_parser/transform_test.cljc#L40

Of course, the elements inside :parsed can and should also be transformed later. But the most important change IMO is to make the basic transformation having raw and parsed description. And maybe also make the :content part of the :headline as content always belongs to a headline?

I'm yet lost in the transformer code. So if you have time to update the transformer a little bit, it would be appreciated!

PS: How can I embed a line of code in comments like @munen did it here?

PPS: Regarding :raw: The original string can be extracted as documented here.

resources/org.ebnf

branch14 · 2021-05-14T20:30:28Z

@schoettl Yes, the transforms currently in place are just a humble beginning. They need to be extended!

I like the idea to have :raw as well as the full AST in the final result. And I would prefer calling it :ast over :parsed.

:content and :headline should actually both be part of a :section, as the headline introduces a new section in org lingo.

Using the metadata to extract the raw string is a nifty hack! I'll give it a try.

Do you think we'll need all tags in the AST of text, or can we use Instaparse's "Hide tag" feature for some early thinning?

Somewhat related: It seems to me it might make sense to refactor the tests (and transformations) to be organized by features. That way we can later have (1) the input org, (2) the resulting AST, (3) the transformed result, and (4) the output org closer together, which will greatly reduce cognitive load while working on a feature. What do you think?

schoettl · 2021-05-14T21:07:01Z

I like the idea to have :raw as well as the full AST in the final result. And I would prefer calling it :ast over :parsed.

I'm good with :ast, too. And that node would hold the parsed syntax tree nodes which will, step by step, be replaced with the transformed syntax tree nodes. Not just the raw parsed syntax tree, forever. Are we on the same page here?

:content and :headline should actually both be part of a :section, as the headline introduces a new section in org lingo.

I'm not sure about this naming. In #31 , I quote the org spec which seems to distinguish between headline and section (section beeing the content below the headline).

Do you think we'll need all tags in the AST of text, or can we use Instaparse's "Hide tag" feature for some early thinning?

You mean the <hidden-tag> syntax of instaparse? I would currently not change the EBNF in that way. Because for later transformations, e.g. of transformations of timestamps, we really need most of them to reason about the all timestamp properties...

Somewhat related: It seems to me it might make sense to refactor the tests (and transformations) to be organized by features. That way we can later have (1) the input org, (2) the resulting AST, (3) the transformed result, and (4) the output org closer together, which will greatly reduce cognitive load while working on a feature. What do you think?

I think it makes sense to sort tests by org feature. But I'm not sure about putting low-level parser tests and the tests of transformed AST in the same place. Because for some org features I have dozens of parser tests and edge cases. And many of them may not be relevant for the tests of transformations.

But having tests for parsing+transformation and org code generation close together (with sample input org to compare with generated org) is probably a good thing!

branch14 · 2021-05-14T21:35:50Z

I'm good with :ast, too. And that node would hold the parsed syntax tree nodes which will, step by step, be replaced with the transformed syntax tree nodes. Not just the raw parsed syntax tree, forever. Are we on the same page here?

Absolutely.

I'm not sure about this naming. In #31 , I quote the org spec which seems to distinguish between headline and section (section beeing the content below the headline).

Then I have to study the spec again. I thought something like this would be idiomatic

[:section
      {:level 1
       :title "hello world"
       :content
       {:raw ["this is the first section" "this line has *bold text*"]
        :ast [[:text [:text-normal "this is the first section"]]
              [:text
               [:text-normal "this line has "]
               [:text-styled
                [:text-sty-bold [:text-inside-sty-normal "bold text"]]]]]}}]

I like to stick to the wording of the spec where possible, but I also won't to provide a meaningful and easy to work with data structure. And I will likely not want to sacrifice the later for the former.

We could even drop the whole :section and just provide a vector of sections, right? Apart from content before the first section, everything in org is in a section by definition.

Do you think we'll need all tags in the AST of text, or can we use Instaparse's "Hide tag" feature for some early thinning?

You mean the <hidden-tag> syntax of instaparse? I would currently not change the EBNF in that way. Because for later transformations, e.g. of transformations of timestamps, we really need most of them to reason about the all timestamp properties...

Yes <hidden-tag>. My question was specifically targeted towards :text. I think [:text-sty-bold "bold text"] would do, instead of
[:text-sty-bold [:text-inside-sty-normal "bold text"]]. Or not?

I think it makes sense to sort tests by org feature. But I'm not sure about putting low-level parser tests and the tests of transformed AST in the same place. Because for some org features I have dozens of parser tests and edge cases. And many of them may not be relevant for the tests of transformations.

I concur. I'm sure we'll find a structure that makes sense here.

I'm working on this. And I'll provide a PR. But I'll need some time.

schoettl · 2021-05-14T21:50:11Z

I like your proposed structure. But how can we represent the first section without headline? Omitting the :title tag and setting :level to 0? Hm…

Regarding the hidden tags. Yeah, your example makes sense. Originally, I planned to recursively parse styled markup (i.e. markup can contain markup, timestamps, and links), but that will probably not work. Currently, only very simple markup will be parsed, so it's no problem to hide that tag.

Great! Looking forward to your PR. Did you fork or extend this PR?

branch14 · 2021-05-15T10:19:45Z

I like your proposed structure. But how can we represent the first section without headline? Omitting the :title tag and setting :level to 0? Hm…

The list of sections will anyway be embedded in some buffer context, to store In-Buffer Settings. I'd expect it to look somewhat like this

{:settings ... ;; in-buffer settings
 :preamble ... ;; content before first headline
 :sections [section-0 section-1 ...]} ;; flat list of headlines & content

and sections will be just a map

{:title "some title"
 :level 1
 :content {:raw ...
           :ast ...}
 ...}

we could even drop :content, and just have

{:title "some title"
 :level 1
 :raw ...
 :ast ...
 ...}

Great! Looking forward to your PR. Did you fork or extend this PR?

I pulled your branch and I'm working on that. My PR will merge into yours, which we'll merge to master once its feature complete. Hope that works for you.

schoettl · 2021-05-15T11:07:19Z

and sections will be just a map
{:title "some title"
 :level 1
 :content {:raw ...
           :ast ...}
 ...}
we could even drop :content, and just have
{:title "some title"
 :level 1
 :raw ...
 :ast ...
 ...}

I agree. But I would keep the :content map to make more explicit what belongs to headline and what to content.

For example, we later also need :raw-title and :tags, :priority, … that all belongs more to the headline than to the content. So I would keep content in a separate map.

Great! Looking forward to your PR. Did you fork or extend this PR?

I pulled your branch and I'm working on that. My PR will merge into yours, which we'll merge to master once its feature complete. Hope that works for you.

👍

branch14 · 2021-05-15T15:56:11Z

I agree. But I would keep the :content map to make more explicit what belongs to headline and what to content.

For example, we later also need :raw-title and :tags, :priority, … that all belongs more to the headline than to the content. So I would keep content in a separate map.

Then it should probably be

:sections
[{:headline {:title ...
             :level ...
             ...}
  :content {:raw ...
            :ast ...
            ...}}
 ...

schoettl · 2021-05-15T16:01:50Z

Yes. It's a bit verbose but I think it's the cleanest solution! It will help to quickly understand the structure.

I remember that Organice's parse structure was a bit confusing for me because there is no such separation.

munen · 2021-05-17T10:00:45Z

@schoettl branch14 and me just worked ok a PR as discussed here. The PR refactors transform, uses raw, introduces integration testing and implements the suggestion from the question "How should the final transformed parse structure look like?" from #31.

I just saw that you're using a fork of org-parser at schoettl/org-parser. Hence, the PR is https://github.com/schoettl/org-parser/pull/1

Do you have a reason for using a fork? I presume it's because you still use the same repository from when you were not an org-parser maintainer(; If it works for you, I'd suggest to use the upstream repository directly. This will make at least these things easier that are now a bit complicated:

Now we have multiple repositories with Pull Requests. At 200ok-ch/org-parser, it's not visible anymore that PRs are open and which merge into which.
On your repository, there's no CI, so we cannot see at https://github.com/schoettl/org-parser/pull/1 if the changes are green. I mean, we can by pushing the changes redundantly into the upstream repository, but then the code is in two repositories and the PR still doesn't show checkmarks^^
When merging the PR, the upstream repo will not show the changes as a Pull Request at the end. So we lose some history that might prove valuable at some point in the future.

schoettl · 2021-05-17T10:41:16Z

@munen Oh, I wasn't aware of that I branched or pushed branches still to my fork repo. At least my local master has already this repo as upstream. Probably it's best, I remove my forked repo so that I cannot accidentally push to it anymore.

Can you just push the branch you worked on to 200ok-ch/org-parser?

schoettl · 2021-05-17T10:59:00Z

Or should I merge your PR at schoettl#1 into this, now?

branch14 · 2021-05-17T13:02:11Z

Or should I merge your PR at schoettl#1 into this, now?

I think the easiest is to open a new PR here, which will include #27 (which would otherwise merge https://github.com/schoettl/org-parser/tree/feat/parse-text). I'll do that now. Closing this PR in favour of #36

schoettl added 5 commits May 13, 2021 10:18

Parse content-line as text instead of just .*

25d2c28

Add comment, refactor test

f1b3c73

Update tests for parsing text instead of content-line

d279f4e

Bugfixes: parse text line-wise, stop at EOL; fix inside-style regex

fbb5e77

Simplify parsing of styles: no recursion

1a5ee18

schoettl commented May 13, 2021

View reviewed changes

resources/org.ebnf Outdated Show resolved Hide resolved

Sketch for transformed parse tree

4e4dc58

This comment has been minimized.

Sign in to view

This was referenced May 13, 2021

Link parse failure in org file #30

Closed

How should the final transformed parse structure look like? #31

Open

Simplify regex and add test

495ba87

branch14 mentioned this pull request May 17, 2021

[WIP] Parse content-line as text instead of just .* #36

Merged

4 tasks

branch14 closed this May 17, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Parse content-line as text instead of just .* #27

[WIP] Parse content-line as text instead of just .* #27

schoettl commented May 13, 2021 •

edited

Loading

schoettl commented May 13, 2021 •

edited

Loading

This comment has been minimized.

branch14 commented May 14, 2021

schoettl commented May 14, 2021 •

edited

Loading

branch14 commented May 14, 2021

schoettl commented May 14, 2021

branch14 commented May 15, 2021 •

edited

Loading

schoettl commented May 15, 2021

branch14 commented May 15, 2021 •

edited

Loading

schoettl commented May 15, 2021

munen commented May 17, 2021 •

edited

Loading

schoettl commented May 17, 2021

schoettl commented May 17, 2021

branch14 commented May 17, 2021

[WIP] Parse content-line as text instead of just .* #27

[WIP] Parse content-line as text instead of just .* #27

Conversation

schoettl commented May 13, 2021 • edited Loading

schoettl commented May 13, 2021 • edited Loading

This comment has been minimized.

branch14 commented May 14, 2021

schoettl commented May 14, 2021 • edited Loading

branch14 commented May 14, 2021

schoettl commented May 14, 2021

branch14 commented May 15, 2021 • edited Loading

schoettl commented May 15, 2021

branch14 commented May 15, 2021 • edited Loading

schoettl commented May 15, 2021

munen commented May 17, 2021 • edited Loading

schoettl commented May 17, 2021

schoettl commented May 17, 2021

branch14 commented May 17, 2021

schoettl commented May 13, 2021 •

edited

Loading

schoettl commented May 13, 2021 •

edited

Loading

schoettl commented May 14, 2021 •

edited

Loading

branch14 commented May 15, 2021 •

edited

Loading

branch14 commented May 15, 2021 •

edited

Loading

munen commented May 17, 2021 •

edited

Loading