Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add initial parser. #10

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

Add initial parser. #10

wants to merge 6 commits into from

Conversation

fernape
Copy link
Contributor

@fernape fernape commented Jul 8, 2022

Initial parser for sqlair DSL. Top-down parser with precedence for
disambiguation.

Address most suggestions by @benhoyt from this PR

  • Use beginChildren and endChildren.
  • Remove PERIOD and COMMA from input/output/groupedcolumns expressions.
  • Rename confusing variable names (os, l, etc).
  • Avoid self for receiver names.
  • Use strings.Builder for efficient printing.
  • Add function to pretty print the AST (needs go test -v).
  • Rework comments.
  • Properly format files.

Note:

  • Not using PassthroughExpression yet. IdentityExpression seems to
    fit the same role.
  • Input and Output expressions might be collapsed into one. The
    structure seems identical except for the initial marker.

Initial parser for sqlair DSL. Top-down parser with precedence for
disambiguation.

Address most suggestions by @benhoyt from [this PR](canonical/sqlair#8)

 * Use `beginChildren` and `endChildren`.
 * Remove PERIOD and COMMA from input/output/groupedcolumns expressions.
 * Rename confusing variable names (`os`, `l`, etc).
 * Avoid `self` for receiver names.
 * Use `strings.Builder` for efficient printing.
 * Add function to pretty print the AST (needs `go test -v`).
 * Rework comments.
 * Properly format files.

Note:
 * Not using `PassthroughExpression` yet. `IdentityExpression` seems to
   fit the same role.
 * Input and Output expressions might be collapsed into one. The
   structure seems identical except for the initial marker.
parse/ast.go Outdated Show resolved Hide resolved
parse/ast.go Outdated Show resolved Hide resolved
parse/ast.go Outdated Show resolved Hide resolved
parse/parser.go Outdated Show resolved Hide resolved
parse/parser.go Outdated Show resolved Hide resolved
parse/parser.go Outdated Show resolved Hide resolved
parse/parser.go Outdated Show resolved Hide resolved
parse/parser.go Outdated Show resolved Hide resolved
parse/parser_test.go Outdated Show resolved Hide resolved
assert.Equal(t, expected, r)
}

func TestParserSimpleInputSource(t *testing.T) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I'm a bit worried about here is whether we've covered all the cases, or got edge cases or termination conditions incorrect, and so on. I think we probably need a more thorough approach to testing, for example: run a large swath of valid and invalid SQL through it. Not sure where you'd get such a body: we could create our own and build on it, or maybe pull in a list from another library's tests, that sort of thing.

We could also use Go's coverage to ensure we've covered all the cases, and fuzzing to ensure we don't have unintended infinite loops or panics. Go has very good tooling for both of these, so they're easy to set up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, ideally we would have a big corpus of SQL queries with a comprehensive mix of input sources, output targets, etc.

@benhoyt
Copy link

benhoyt commented Jul 10, 2022

Thanks for the updates. Looks like a good start!

fernape added 5 commits July 12, 2022 09:24
Initial parser for sqlair DSL. Top-down parser with precedence for
disambiguation.

Address most suggestions by @benhoyt from [this PR](canonical/sqlair#8)

 * Use `beginChildren` and `endChildren`.
 * Remove PERIOD and COMMA from input/output/groupedcolumns expressions.
 * Rename confusing variable names (`os`, `l`, etc).
 * Avoid `self` for receiver names.
 * Use `strings.Builder` for efficient printing.
 * Add function to pretty print the AST (needs `go test -v`).
 * Rework comments.
 * Properly format files.
 * Use pre-check for list separators.
 * s/infixfn/infixFn and s/prefixfn/prefixFn
 * Make {In/Pre}fixFunc private
 * Make NextToken and Precedence private.
 * Make Lexer private.
 * Add more tests.
   * Check for (,,,,)
   * Check for ()
   * Check for (a, b...
   * Test for empty statement.
   * Test for missing PERIOD in output target.
 * Add makeToken helper to make tests more readable.

Note:
 * Not using `PassthroughExpression` yet. `IdentityExpression` seems to
   fit the same role.
 * Input and Output expressions might be collapsed into one. The
   structure seems identical except for the initial marker.
Unbreak parser after changes in pull request canonical#10.
Copy link
Collaborator

@manadart manadart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean about PassThroughExpression being unused. Trying to use it makes the parsing a little more complicated with this operator-precedence approach, so we can go with what we've got here for now.

I'll do a series of follow-ups to this with some sundry things.

Nice job.

Copy link

@niemeyer niemeyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't merge this yet. Let's please do a review pass on it together, Fernando. There are a few details that can be improved, and that will help settling a standard for future work.

Copy link

@niemeyer niemeyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are a few early points for us to brainstorm a bit on.

)

const (
LOWEST = iota

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per standard Go conventions, Go follows a camel-case style, and we typically see some indication of what the constant is about in its name.

// This parser implements a top-down parsing strategy with operator
// precedence (Pratt's parser)
// https://en.wikipedia.org/wiki/Operator-precedence_parser#Pratt_parsing
func NewParser(l *lexer) *Parser {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was the Lexer made private? This particular line shows a design problem in doing so: how can we have a public function that takes a private type?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was my suggestion that we reduce the number of exported things (and the lexer is really an implementation detail). However, @niemeyer is right that we can't have an exported function taking a non-exported type. I think we should just have the parser take a []byte or string directly, and it can instantiate the lexer (implementation detail). This would also make usage (including tests) simpler.

Then again, now that these packages are internal it's much less concerning to me to have a larger exported surface area.

// Fill prefix functions according to token type.
// Token types that has prefixFunc do not care about the
// left part of the expression.
p.prefixFn = map[TokenType]prefixFunc{

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefixFunc or prefixFn?

lex: l,
}

// Fill prefix functions according to token type.
// Token types that has prefixFunc do not care about the

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have some more clarity about what this means here, perhaps with some examples. Also, why are these maps instantiated as part of the parser for every instance? Do we plan to change them per-parser, or are they dynamic for some other reason?

Also, "has" => "have", but perhaps irrelevant if docs will change.

// Fill infix functions according to token type.
// Tokens that has infixFunc care about the left part
// of the expression.
p.infixFn = map[TokenType]infixFunc{

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same notes as above.

return nil
}
p.nextToken()
field := p.parseExpression(p.precedence(p.currentToken)).(*IdentityExpression)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar concern as earlier above.

func (p *Parser) parseInputSource() Expression {
marker := p.currentToken
p.nextToken()
name := p.parseExpression(p.precedence(p.currentToken)).(*IdentityExpression)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as earlier here and below. I won't repeat the comment, but given how widespread this is it might be worth talking about it to see if there's an easy pattern we can adopt across the code.


func (p *Parser) parseGroup() Expression {
// Skip left parenthesis
p.nextToken()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar issue here as earlier. There's apparently a pattern established about functions trusting external call sites to define their own internal state. This seems problematic, because it means for every function there's an implied and unspecified rule about what state the world is in so that we can call it, otherwise the parser will panic. This kind of approach does not bode well to future work on this code base, when someone else will be writing code and not know what all the expectations were. Let's please talk about this as well.

if p.currentToken.Type == RPAREN {
// Empty group.
p.errors = append(p.errors, fmt.Sprintf(
"Line: %d, column: %d: unexpected ')', expecting LITERAL",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as earlier note on the error message. There's more below, but I'll stop repeating it. Let's please just do a full pass on the error messages.

// in this statement.
for _, c := range r.Expressions() {
t.Log(reflect.TypeOf(c), reflect.TypeOf(&IdentityExpression{}))
assert.Equal(t, reflect.TypeOf(c), reflect.TypeOf(&IdentityExpression{}))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to organize these tests as a table so that it's much easier to see what's happening, and also much easier to repeat similar cases, and then improve the number of cases tested. Again, let's please chat about it and I'll present some analogous code as inspiration.

jack-w-shaw pushed a commit to jack-w-shaw/sqlair that referenced this pull request Jun 20, 2023
This adds a complete parser that parses any statement as only BypassParts.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants