Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Column named timestamp escapes backticks #2155

Closed
neelance opened this issue Mar 13, 2023 · 30 comments
Closed

Column named timestamp escapes backticks #2155

neelance opened this issue Mar 13, 2023 · 30 comments
Labels
bug Invalid compiler output or panic priority

Comments

@neelance
Copy link

What's up?

Version 0.6.0 seems to be more restrictive regarding column names. I have two examples that stopped working:

from test
select [type]

gives error

function call expected one of !, (, +, -, ==, [, ], an identifier or keyword case, but found keyword type

and

from test
select [timestamp]

gives error

function std.select, param `columns` expected type `column`, but found type `set`

Is there an easy way to fix the PRQL code? If yes, then maybe the error message should include a hint for the fix, since the column names type and timestamp are probably common?

@max-sixty
Copy link
Member

Thanks a lot for the report. Agree that the error messages should be much better.

Does surrounding the name with backticks work? (Can look more tomorrow)

@neelance
Copy link
Author

For type backticks seem to work, for timestamp they don't.

@max-sixty max-sixty added the bug Invalid compiler output or panic label Mar 13, 2023
@max-sixty
Copy link
Member

Great, confirmed on timestamp. For the moment, we can use an s-string, as a standard escape hatch:

from test
select [s"timestamp"]

But it's important for consistency that the backticks always work; they should take priority over any keyword.

@max-sixty
Copy link
Member

Re the solution — this is happening because after parsing, we match on keywords:

"timestamp" => TyLit::Timestamp,

Possibly we should be matching for keyword during parsing, because that's when we know whether a term is escaped with backticks or not.

If you have thoughts @aljazerzen ...

@max-sixty max-sixty changed the title Trouble with column names Column named timestamp escapes backticks Mar 13, 2023
@max-sixty
Copy link
Member

And possible https://prql-lang.org/book/syntax.html#keywords is not accurate, and type should be there? Or this is a bug?
image

@aljazerzen
Copy link
Member

Yes I added type keyword. Forgot to update the docs.

Re the timestamp: the dot prefix from #1619 would make this unambitious here: .timestamp would mean the column, timestamp would mean std.timestamp.

Also, you can use test.timestamp instead of just timestamp to refer to your column.

@priithaamer
Copy link

Another thing that broke in 0.6.0 and looks related to this

from test
derive [
  date = '2023-01-01'
]
select [
  date
]

Gives an error

Error: 
   ╭─[:6:3]
   │
 6 │   date
   ·   ──┬──  
   ·     ╰──── Ambiguous name. Could be from any of _frame.date, date
───╯

Same goes for column named time.

Is this just a bug that gets fixed or shall we escape all these fields going forward? I'd say that date and time are just too common words for column names. Would be super nice if there was no need to escape these.

Thanks for the confirmation on type keyword. Also managed to run into this.

@aljazerzen
Copy link
Member

I wouldn't promise that we will just fix this, but we do need to find a way for adding things to std without breaking existing queries.

@max-sixty
Copy link
Member

I wouldn't promise that we will just fix this

WDYM by this? (not disagreeing at all, just trying to understand the literal meaning)

@aljazerzen
Copy link
Member

I meant that we want to have std.timestamp declared and we want it to be available without using std. prefix. But there is also a possibility that we introduce some new name resolution rules or structure module std differently so name timestamp can unambiguously be used for columns.

As I wrote, the overarching problem we will have to solve is the fact that currently any new declarations in std. are a breaking change. And I don't have any good ideas on how to tackle that.

@max-sixty
Copy link
Member

As I wrote, the overarching problem we will have to solve is the fact that currently any new declarations in std. are a breaking change. And I don't have any good ideas on how to tackle that.

I'm less worried about this. I think for the moment they'll be small breaking changes, and then when we get to 1.0, we'll have versioning; same as rust... It depends on how widespread this is — if we're adding lots of keywords like date then that will be disruptive. If we add mod once, that's probably fine.


Also

  • I think we can avoid error-ing when the keyword is in backticks — do you agree?
  • Do we actually need to error when these are used as column names? This is an area of the code I still don't understand well, I can look more.

(to some extent these priorities are opposing — the first is easier if we parse keywords in the parser, the second is easier if we wait to resolve keywords at resolve time. Though atm we get neither correct :) )


Re the timestamp: the dot prefix from #1619 would make this unambitious here: .timestamp would mean the column, timestamp would mean std.timestamp.

FWIW I agree that this is one of the costs of using bare words like we do. It has benefits (I'd argue large ones, on net), but this is a cost.

@aljazerzen
Copy link
Member

As I wrote, the overarching problem we will have to solve is the fact that currently any new declarations in std. are a breaking change. And I don't have any good ideas on how to tackle that.

I'm less worried about this. I think for the moment they'll be small breaking changes, and then when we get to 1.0, we'll have versioning; same as rust... It depends on how widespread this is — if we're adding lots of keywords like date then that will be disruptive. If we add mod once, that's probably fine.

Rust does not make breaking changes. That's why all releases are 1.x and not 2.0, 3.0 and so on. If we also want that (I strongly feel that we do), we won't be able to add, for example, std.string_starts_with or std.corr. Ever.

That's my concern.


* I think we can avoid error-ing when the keyword is in backticks — do you agree?

I agree, but this only works for type. Backticks essentially say "don't treat this as a keyword, but as an ident". But timestamp is an ident already, not a keyword. The problem is that when that ident is resolved, it may match either with std.timestamp of with test.timestamp (if test is the name of your relation).

@max-sixty
Copy link
Member

I agree, but this only works for type. Backticks essentially say "don't treat this as a keyword, but as an ident". But timestamp is an ident already, not a keyword.

OK yes, I hadn't understood that distinction; I do now.

One possibility is to resolve timestamp to a type only when can be a type. But then we have more complicated scoping rules... I will think more.
Somehow we have to let date be a column, yet I'm also fairly convinced that SQL's use of bare words is a big advantage in approachability & usability, even if it's a bit less robust.


Rust does not make breaking changes. That's why all releases are 1.x and not 2.0, 3.0 and so on. If we also want that (I strongly feel that we do), we won't be able to add, for example, std.string_starts_with or std.corr. Ever.

It does add new keywords though. I get there's a subtle difference from adding things to std, but it has a similar effect — upgrading edition means a previously allowed word is no longer allowed.

@aljazerzen
Copy link
Member

It does add new keywords though. I get there's a subtle difference from adding things to std, but it has a similar effect — upgrading edition means a previously allowed word is no longer allowed.

That's true, but editions are not the same as versions. A new version or rustc must support all editions from 1.0 onwards - i'm not sure exactly how we will handle versioning post 1.0, so I guess we don't have to bother with this problem now.

@max-sixty
Copy link
Member

max-sixty commented Apr 28, 2023

How do we think most SQL compilers handle this? Because we can write:

(edit: updated from below)

SELECT 
  date,
  CAST(date_int AS DATE)
FROM foo

...i.e. it works in both places (and with any caps).

Maybe they resolve the idents later in the compilation pipeline? (Not saying this is better or worse, just trying to understand the options)

@aljazerzen
Copy link
Member

You probably mean:

SELECT 
  date,
  CAST(date_int AS DATE)
FROM foo

or

SELECT 
  date,
  date_int::DATE -- postgres cast syntax
FROM foo

The answer is that SQL has context-dependent name resolution rules - the thing that I try to avoid. I advocate that ident date cannot mean both "the column named date" and "the type named date", depending on syntactic structure it is used in.

@max-sixty
Copy link
Member

The answer is that SQL has context-dependent name resolution rules - the thing that I try to avoid.

Yeah, I def agree with this impulse...

This issue boosts the status of #1619, or something similar to it.

Possibly it's a trilemma:

  • Bare words — i.e. refer to columns as date not "date"
  • Reasonable parsing & resolving — i.e. don't resolve whether something is a type or an ident at the end
  • Flexible naming — i.e. we can name columns date!

..pick two

@max-sixty
Copy link
Member

I'm coming back to this. I wonder whether we can handle `date` as a column — i.e. backticks doesn't just mean "ident which might be a column or a type" but instead "definitely a column".

From the perspective of the language, I think this would work, and would provide a way of forcing an item to be interpreted as a column.

From the perspective of the compiler, I think It would require having a new PL type.


An alternative would be some form of #1619. I found the semantics in the original proposal quite confusing. I do find the jq approach (outlined at #1619 (comment)) quite reasonable. It would be another huge change though.

@aljazerzen
Copy link
Member

Would this also apply to other things in std? Like sum? Or select? Right now, from perspective of name resolution, they are equivalent to date.

This would be similar to how Rust also has special name resolution rules for handling types (i32 instead of std::i32, Result instead of std::Result).

Alternatively, we could formalize the _frame namespace, rename it to relation or something and make it a part of the language.

@max-sixty
Copy link
Member

Would this also apply to other things in std? Like sum? Or select? Right now, from perspective of name resolution, they are equivalent to date.

No, it would be tagged as a column.

Same as my comments in #1619 — the difference between a) columns vs. b) function&types is the difficulty, rather than between stdlib & others.

@aljazerzen
Copy link
Member

Ok, so let me get this cleared up. This is what you propose:

from x
derive {date = 1, derive = 2, sum = 3}
derive {`date` + 1, `derive` + 2, `sum` + 3}
aggregate {std.sum `date`, std.sum `sum`}

I think it would work. A bit unconventional to tangle name resolution with quoting, but it would work.

What if I have:

let `my variable` = 5

from x
select {y + `my variable`}

@max-sixty
Copy link
Member

let my variable = 5

from x
select {y + my variable}

These are good cases...

I'm thinking about `foo` as referring to a column / field in the same way that jq uses . (though jq requires it).

jq also has variables, which it doesn't prefix with .. So this would replicate that — `foo` always refers to a column, never to a variable.

let `my variable` = 5 would be an error; that's not a valid PRQL variable.

derive `x 2` = f"{x}_{x}" would be valid, since it's defining a column.


The cases are good, I'm slowing my enthusiasm a tiny bit on this, particularly when I think about name resolution. Maybe this is the price we pay for the life we choose (trilemma above).

But I'm hesitant to jump all the way to using .s ref:

I do find the jq approach (outlined at #1619 (comment)) quite reasonable. It would be another huge change though.

@aljazerzen
Copy link
Member

let `my variable` = 5 would be an error; that's not a valid PRQL variable.

That's very limiting... imagine someone is using PRQL as a target for their UI, which allows defining tables. You would not be able to define tables with spaces in their names.


How do you feel about this?

from x
derive date = created_at
select {date}
        ----
         \____ Error: ambiguous name. Did you mean std.date or relation.date?

@max-sixty
Copy link
Member

That's very limiting... imagine someone is using PRQL as a target for their UI, which allows defining tables. You would not be able to define tables with spaces in their names.

Yes, fair point. I've been thinking about it more as the query lang — same as jq — but you're right it's limiting once we expand from there.


How do you feel about this?

from x
derive date = created_at
select {date}
        ----
         \____ Error: ambiguous name. Did you mean std.date or relation.date?

I think it's a nice error message! It's maybe better than the `date` concept, given the holes in that.

It's still not great. But possibly that's the price we pay for the life we choose!


Do you have thoughts on the modified .foo idea (here)? I'm not convinced, partly because of the size of the change, but it does maaaybe synthesize the tradeoffs...

@aljazerzen
Copy link
Member

Do you have thoughts on the modified .foo idea (#1619 (comment))? I'm not convinced, partly because of the size of the change, but it does maaaybe synthesize the tradeoffs...

Hmmm. So when referring to types and functions, there is no leading ., but when referring to data (i.e. everything else) there is. This would clash with our idea that function decls are no different than variable decls.

I don't like that very much, on top of the reasons for which we abolished #1619 in the first place.


It's still not great.

I do feel the same way here - it's quite verbose.


All in all, I'm:

@max-sixty
Copy link
Member

Hmmm. So when referring to types and functions, there is no leading ., but when referring to data (i.e. everything else) there is. This would clash with our idea that function decls are no different than variable decls.

I see what you mean about treating these differently from the perspective of the compiler.

I do think it's an simple mental model for users, though — that data goes into a namespace which is queried with . (it would be the default_db namespace, or the _frame / relation namespace for columns). That was the primary issue I had with the previous proposal — that the mental model for users was awkward, even if easy for the compiler.

(Possibly I should either really propose it or drop it — rather than being on the fence but continuing discussion...)

@aljazerzen
Copy link
Member

I do think it's an simple mental model for users, though

Apparently we have a bit different mental models and the feeling of what people find "simple". My reasoning may be influenced by compiler's perspective, but this is how I think of our semantics - and I'm sure I'm not the only one.

But this does show that such feature would divide PRQL users depending on how they think about variables. And we don't want that.

@aljazerzen
Copy link
Member

Sooo... relation.timestamp?

@max-sixty
Copy link
Member

Apparently we have a bit different mental models and the feeling of what people find "simple". My reasoning may be influenced by compiler's perspective, but this is how I think of our semantics - and I'm sure I'm not the only one.

I hear that. Arguably jq is some precedent though!

Sooo... relation.timestamp?

Yes I like it!

😁

@aljazerzen
Copy link
Member

This is done. Timestamp can now be referenced with this.timestamp.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Invalid compiler output or panic priority
Projects
None yet
Development

No branches or pull requests

4 participants