Relative names #1619
Comments
I quite like the idea of a leading . for columns. I don't really know why yet but it feels like it would bring additional consistency. It also reminds me of JDOT (https://github.com/saulpw/jdot). TBH, I did not understand the name resolution explanation yet but I will try again in the morning (it's close to midnight now). For example why is it Another possible benefit could be that it might disambiguate a column named "from" from the keyword |
Perhaps the origin is jq? I think jq is a very popular language for querying JSON. |
Sorry to take a while to respond. I think I'm understanding 85% of this, so forgive me if I'm slow. I can see two points here:
Re the discriminating: how easy do you think it is to explain when to use a period vs. not to? I worry it's not easy! (But possibly we could make it easier.) Re the periods: I don't have a strong secular objection to it. It would be a big change, and I'm not sure it gets us that much apart from the discrimination. But it is an effective way of allowing columns to be clearly different from functions. To what extent do you think it's accurate to describe ephemeral variables as just having a scope that's limited to that line? |
It's just one point here: use The rule for when to use the dot is simple: columns start with a dot.
That's pretty accurate. But it may be confusing because even though the scope is limited to the current function, an almost identical scope could be created for the next function in the pipeline. |
Totally, but is there an easy way to explain ephemeral variables to beginners? |
I'm saying that for beginners, ephemeral variables can be equivalent to columns. So the whole rule is columns start with a dot. And we don't even mention ephemeral variables. That's because we don't have anything other than relations that we'd want to have references into. Maybe in the future, we could add support for referencing properties of JSON objects or structs. |
Yes OK, that is complete in the examples above. How about when it's a variable; for example:
func add a b -> a + b
# or
func add a b -> .a + .b
# or
func add .a .b -> .a + .b
Thanks for bearing with me... |
Oh, params are scoped variables so they don't need a leading dot. So like this:
func add a b -> a + b
func latest n rel -> (rel | sort [-.changed_at] | take n)
# rel and n are params -> scoped -> no dot
# .changed_at is a column (reference "into" rel) -> ephemeral variable -> dot |
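As a rough illustration of the rule described in this comment (params are scoped and written bare, columns carry a leading dot), here is a toy resolver in Python. It is only a sketch and not how prql-compiler resolves names; the function `resolve` and the two name sets are hypothetical stand-ins.

```python
def resolve(name, params, columns):
    """Resolve a name under the proposed leading-dot rule.

    Names with a leading dot are looked up among the columns of the
    current relation; bare names are looked up among scoped variables
    (here: function params). Both collections are hypothetical.
    """
    if name.startswith("."):
        column = name[1:]
        if column not in columns:
            raise NameError(f"unknown column: {name}")
        return ("column", column)
    if name not in params:
        raise NameError(f"unknown variable: {name}")
    return ("param", name)


# Mirrors `func latest n rel -> (rel | sort [-.changed_at] | take n)`.
params = {"n", "rel"}
columns = {"changed_at", "title"}

print(resolve("n", params, columns))            # ('param', 'n')
print(resolve(".changed_at", params, columns))  # ('column', 'changed_at')
```

In this sketch a bare `changed_at` raises `NameError`, because without the dot the name is never looked up among the columns.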
OK great, I see, thanks. I think it's tractable. I don't think it's that friendly, and it's much more alien for those who are used to SQL. Do others share a concern that represents hierarchies inconsistently? For example I think this is insightful, and maybe we should discuss it more in our docs...
...I've heard this referred to as "bare words". I find it a great advantage of PRQL over something like Python. It makes sense that we promote columns to not require quotes, since columns are so important in tabular data; they're almost like variables to us. As @eitsupi points out, So my current view is:
How important do you think it is for the development of the lang? Can we instead have a hierarchy of scopes (like many langs do), and resolve ephemeral variables first, and scoped variables after that? |
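The alternative raised here, a hierarchy of scopes with no dot, can be sketched in Python as below. This is a toy model under the assumption that columns (ephemeral) are tried first and scoped variables second; `resolve_hierarchical` and the name sets are hypothetical, not taken from prql-compiler.

```python
def resolve_hierarchical(name, columns, scoped):
    """Look a bare name up among columns first, then among scoped variables."""
    if name in columns:
        return ("column", name)
    if name in scoped:
        return ("scoped", name)
    raise NameError(f"unknown name: {name}")


scoped = {"n", "rel"}

# With no clash, the param `n` resolves as a scoped variable.
print(resolve_hierarchical("n", {"changed_at"}, scoped))       # ('scoped', 'n')

# If the relation happens to have a column called `n`, the column wins.
print(resolve_hierarchical("n", {"changed_at", "n"}, scoped))  # ('column', 'n')
```

The second call shows the trade-off: the syntax stays concise, but what a bare name means depends on which columns the relation happens to have.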
I recall that in dplyr, it is sometimes difficult to distinguish between variables outside the data frame and column names in the data frame, making the behavior confusing.
cyl <- 10
mtcars |>
  dplyr::mutate(new = cyl * 10)
It can be specified explicitly by
cyl <- 10
mtcars |>
  dplyr::mutate(new = .data$cyl * 10)
I think it is a good balance of clarity and ease of writing to always start column names with a dot. |
I've implemented the proposal and converted the tests in prql-compiler. Here are a few examples:
Here are my findings:
Possible alternatives:
|
Thanks for the list of findings, that's v helpful to anchor around.
Is this still the same for the full path of columns? Or does I think the
One lens to view this is what we'd write in the Changelog — I'm not sure what we'd write that I'd feel great about... |
Actually, this is the confusion that this issue is trying to avoid. It separates these two cases: References to things in global scope don't have a leading dot:
References into the subject of the current pipeline have a leading dot:
So if you are able to refer to |
The implementation complexity hasn't changed enough to weigh into the decision here. And the sharp corners that you mention are intentional: a syntactic spotlight on the semantics. So they are actually the main benefit. Think of it as the borrow checker in Rust. But all that said, this change goes strongly against the concise nature of the language we've been able to maintain. So my vote is -0.5. |
Thanks for trying this out @aljazerzen . Reading through your examples in #1619 (comment) I'm also struck by how there is this inconsistency between rvalues and the lvalues in Overall, I'm still unclear on the ephemeral vs scoped variables. I was seeing the |
Yes, and it would be quite easy to do actually. I'll take the liberty to interpret @snth's comment as a vote of +0. Total tally is -1.5, which means that we will not be adding this feature. We can revisit it when there are new features that would work well with this. |
Great, thanks for the productive discussion and exploration effort. |
I've been working with
So for example, the case above would be:
-from albums
+from .albums
select .albums.title
I think the |
As discussed on the call, I'm not sure my example was correct — instead
|
Great work |
Reopening as this is under consideration again. Possibly we start a new issue synthesizing where we're at, given the amount of history though. |
Abstract:
I propose to change references to columns from column to .column.
Reasoning:
I'll try to explain how the resolver works and how I think about the semantics of names and variables in PRQL.
During resolving, there is a major distinction between scoped and ephemeral variables:
std.sum and std.select are global, so they exist indefinitely, and function parameters exist only within the function body.
select, all columns of the relation exist as variables during resolution of the first argument.
It is beneficial to distinguish these two mechanisms, because of their subtle differences. For example take this query:
Here, a relation is constructed with from, and within the relation the name alb is assigned all columns from the table albums. Note that alb is not a "real" value, it's just a namespace for the columns. When this relation is passed to my_transform, it is stored in the rel parameter. rel is now a scoped variable, while alb.title is a reference to one of its columns.
If I compare this behavior with, say, Python and a dataframe library, scoped variables are all normal idents, while ephemeral variables would be represented with strings. This is a bit more verbose and cannot provide good errors, typing or autocomplete. (This is a feature of PRQL that dataframe libraries cannot copy. Only a custom language for relations can construct custom rules for name resolution.)
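To make the Python comparison concrete, here is a minimal sketch that uses plain lists of dicts instead of a real dataframe library. The rows are made up for illustration, and `my_transform` only borrows its name from the text above; it is not the transform from the original query.

```python
def my_transform(rel):
    # `rel` is a scoped variable: a normal Python identifier that the
    # interpreter, linters and editors all understand.
    # "title" is the ephemeral name: just a string, so a typo such as
    # "titel" is only noticed when the lookup actually runs.
    return [row["title"] for row in rel]


# Illustrative rows standing in for the albums table.
albums = [{"title": "First"}, {"title": "Second"}]

print(my_transform(albums))  # ['First', 'Second']
```

Nothing in the language connects the string "title" to the shape of `rel`, which is the gap in errors, typing and autocomplete that the paragraph above describes.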
So because there is a distinction in resolving, I suggest we add a distinction in syntax:
Pros:
Cons: