sqawk #6

Open
danmbox opened this issue Apr 24, 2015 · 16 comments
Comments

@danmbox

danmbox commented Apr 24, 2015

Have you noticed https://github.com/dbohdan/sqawk? There's a comparison out there, thechangelog/ping#132; it might make sense to adopt some features, like a column that equals the entire line (unsplit) and using regexes as field / column separators.

@tobimensch
Owner

hi,

yes I noticed it as well as a few other alternatives.
I'm definitely looking forward to "stealing" useful features from those alternatives. There are also some features which I had already planned for termsql before I saw them anywhere else, so it's really "multiple invention" rather than stealing I guess. :-)

Btw. termsql should be able to perform table joins from multiple files, but it does involve an extra step: (1) write the first table to a database file with the -o option; (2) write the second table under a different name (-t) to the same -o database and perform the join. I'm looking to simplify this.
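To make the two-step idea concrete, here is a minimal sketch in Python's standard sqlite3 module rather than termsql itself. It assumes two hypothetical inputs (the `users_lines` and `logins_lines` lists stand in for two files), loads each into its own table in one shared database, and then joins them. The table names and sample data are made up for illustration.

```python
import sqlite3

# Hypothetical sample data standing in for two whitespace-separated input files.
users_lines = ["1 alice", "2 bob"]
logins_lines = ["1 2015-04-24", "2 2015-04-25", "1 2015-04-26"]


def load(conn, table, lines):
    """Split each line on whitespace and store it as (col0, col1)."""
    conn.execute(f"CREATE TABLE {table} (col0 TEXT, col1 TEXT)")
    conn.executemany(
        f"INSERT INTO {table} VALUES (?, ?)",
        (line.split() for line in lines),
    )


# An in-memory database here; a file path would play the role of the shared database.
conn = sqlite3.connect(":memory:")
load(conn, "users", users_lines)    # step 1: first table
load(conn, "logins", logins_lines)  # step 2: second table, different name

rows = conn.execute(
    "SELECT users.col1, COUNT(*) FROM users "
    "JOIN logins ON users.col0 = logins.col0 "
    "GROUP BY users.col1 ORDER BY users.col1"
).fetchall()
print(rows)
```

Both tables have to live in the same database before the join can run, which is why the extra step exists.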

Allowing regex and other options for splitting the input is also a useful feature that will probably end up in termsql eventually.

As for keeping the original line in the table. ... Ok, I see how this makes sense when you name your tool sqawk and you want to emulate awk, but for what use case might this actually be useful? (Please someone give me an example) This feature could also be added to termsql, but I'd first like to know why and what for.

Next up I plan to add some further simplifications. For example, I'm thinking about changing the default names col0, col1 to c0, c1, simply so that people need to type less. There are other nice simplifications in the roadmap or that I have in mind.

Btw. if you think you can contribute (ideas or code), you're definitely welcome.

@danmbox
Author

danmbox commented Apr 24, 2015

I'd suggest a1, a2, a3 instead of c0, c1, c2, for convergence with sqawk :)

As for a0 (= entire line), it would be useful for sort, uniq, wc -l, cut -cM-N and similar... E.g.
select count (distinct substr(a0, 10, 3)) from a where a0 ilike 'WARNING: %'
to count 3-letter warning codes following WARNING... See also the examples on sqawk frontpage involving a0.
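A rough way to try the whole-line idea without either tool is Python's sqlite3 module. The sketch below assumes a table `a` with a single column `a0` holding each unsplit line; the log lines are invented. It uses plain `LIKE`, since SQLite has no `ILIKE` and its `LIKE` is already case-insensitive for ASCII by default.

```python
import sqlite3

# Invented log lines; each one is stored unsplit in column a0.
log_lines = [
    "WARNING: E01 disk almost full",
    "INFO: started",
    "WARNING: E02 fan speed low",
    "WARNING: E01 disk almost full",
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (a0 TEXT)")
conn.executemany("INSERT INTO a VALUES (?)", ((line,) for line in log_lines))

# "WARNING: " is 9 characters long, so the code starts at position 10.
(count,) = conn.execute(
    "SELECT count(DISTINCT substr(a0, 10, 3)) FROM a WHERE a0 LIKE 'WARNING: %'"
).fetchone()
print(count)  # two distinct codes: E01 and E02
```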

@danmbox
Author

danmbox commented Apr 24, 2015

BTW, if you want to add multiple files/tables, you would also need a more sophisticated naming convention (like sqawk's a1, a2, b1, b2 etc)

@dbohdan

dbohdan commented Apr 24, 2015

Hey, everyone. I noticed this issue referenced at thechangelog/ping#132 and thought I'd drop by. :-)

@tobimensch, if you are looking for more projects to "steal" features from, you may find my list useful. I am thinking of taking --merge from termsql myself. :-)

@tobimensch
Owner

@dbohdan

If you "steal" --merge, then at least do it right. It's not merging the n last columns, it's merging all columns from the nth column to the last. The background being that filenames sometimes have spaces in them, and so it's unpredictable how many columns are created, but it is predictable in what column they start. After merging you should have the correct filenames in the table, see the example in the termsql manual.
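The behavior described here (merge everything from the nth column to the last) can be sketched in a few lines of Python. This is not termsql's code: `merge_from` is a hypothetical helper, the 0-based index is an assumption, and the sample line only imitates `ls -l` output.

```python
def merge_from(fields, start, sep=" "):
    """Join fields[start:] into one field; start is a 0-based column index."""
    if start >= len(fields):
        return list(fields)  # nothing to merge
    return fields[:start] + [sep.join(fields[start:])]


# Imitation of an `ls -l` line whose filename contains spaces.
line = "-rw-r--r-- 1 user group 42 Apr 24 2015 my holiday photo.jpg"
merged = merge_from(line.split(), 8)
print(merged[-1])   # my holiday photo.jpg
print(len(merged))  # 9 columns, however many spaces the filename has
```

Note that rejoining with a single space does not preserve runs of several spaces inside the original filename.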

@dbohdan

dbohdan commented Apr 25, 2015

@tobimensch Right, that is how I would implement it. I did notice that my description at thechangelog/ping#132 (comment) was wrong, however; I have corrected it.

I think --merge can be improved a bit by letting the user specify a range of columns to merge, e.g., 3-5 or 8-. In the latter case it would merge all the columns from the eighth to the last (similar to how arguments to cut(1) work on *nix).
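A possible reading of these cut(1)-style ranges, sketched in Python. The helper names `parse_range` and `merge_range` are hypothetical and do not come from either tool. Column numbers are taken as 1-based and inclusive, as in cut, and a spec with nothing before the dash is read as starting at column 1.

```python
def parse_range(spec, ncols):
    """Turn '3-5' or '8-' into a 0-based (start, stop) slice pair."""
    first, dash, last = spec.partition("-")
    if not dash:
        raise ValueError(f"expected a range such as 3-5 or 8-, got {spec!r}")
    start = int(first) - 1 if first else 0
    stop = int(last) if last else ncols  # open-ended: run to the last column
    return start, min(stop, ncols)


def merge_range(fields, spec, sep=" "):
    """Replace the columns covered by spec with a single joined column."""
    start, stop = parse_range(spec, len(fields))
    if start >= stop:
        return list(fields)
    return fields[:start] + [sep.join(fields[start:stop])] + fields[stop:]


fields = "a b c d e f g h i j".split()
print(merge_range(fields, "3-5"))  # ['a', 'b', 'c d e', 'f', 'g', 'h', 'i', 'j']
print(merge_range(fields, "8-"))   # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h i j']
```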

@danmbox
Author

danmbox commented Apr 25, 2015

@dbohdan if we're getting fancy, the user might want to merge all but the first 5 and last 2 columns. I remember having this problem with cut. But can't this be solved by a filter prior to the sqawk command?

@dbohdan

dbohdan commented Apr 25, 2015

@danmbox Good idea. Tcl's list range procedure lets you get that subrange of elements from a list with lrange $list 5 end-2; one could adopt the end-n notation for merge ranges. It may actually be better to integrate such a filter into the program itself since it would be specific to its field splitting mechanism.

@danmbox
Author

danmbox commented Apr 25, 2015

... or you might want NF, in keeping with your AWK theme :)

@tobimensch
Owner

Meanwhile I stole the split by regex feature. Not updating the manual yet as I consider it still a little experimental, but from what little testing I have done it seems to work.

The fancier --merge syntax is probably a good idea; at least the 3-5 or -4 type syntax makes sense, although I'd still like to see some concrete use cases (if only so I can update my examples list). I think I'll keep 8- as the default when the user enters just 8 without the -, so that a plain 8 keeps working as before. It could also support a comma-separated list of merges, but that's really getting a little complicated... like -r '-2,4-6,9'

@danmbox
Author

danmbox commented Apr 25, 2015

Thanks, regex is really useful! Without it you can't even distinguish fixed-column width and single-space-delimited formats for example.
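A small Python illustration of the difference, using only the standard `re` module and an invented input line. Splitting on one literal space keeps an empty field wherever two spaces meet, which is right for single-space-delimited data. Splitting on the regex ` +` treats a run of padding spaces as one separator, which is what column-aligned output needs.

```python
import re

line = "alice  42 x"  # two spaces after "alice"

by_single_space = line.split(" ")      # every space is a separator
by_space_runs = re.split(r" +", line)  # a run of spaces is one separator

print(by_single_space)  # ['alice', '', '42', 'x']
print(by_space_runs)    # ['alice', '42', 'x']
```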

It's always possible to leave enhancements for later, when somebody actually requires them. I remember having this problem with cut (need all but last N fields) but I can't remember why.

@dbohdan

dbohdan commented Apr 25, 2015

@danmbox Good idea about NF. I've implemented range merging in Sqawk, albeit only for number-number ranges for now.

@tobimensch Myself, I've decided to support two syntaxes: merge=1-2,3-4,5-6 and merge=1 2 3 4 5 6. The latter is the natural list format in Tcl, so the former is transformed into it if detected.
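The normalization step can be sketched in Python as follows. Sqawk itself is written in Tcl, so this is only an illustration. The helper names are hypothetical, and reading the flat list as consecutive (start, end) pairs is an assumption based on the two forms being described as equivalent.

```python
def normalize_merge(value):
    """Return a flat list of ints from '1-2,3-4,5-6' or '1 2 3 4 5 6'."""
    if "-" in value or "," in value:
        # Range form detected: rewrite it into the space-separated list form.
        value = value.replace(",", " ").replace("-", " ")
    numbers = [int(token) for token in value.split()]
    if len(numbers) % 2:
        raise ValueError("each merge range needs a start and an end")
    return numbers


def as_pairs(numbers):
    """Group a flat list into (start, end) pairs (assumed interpretation)."""
    return list(zip(numbers[::2], numbers[1::2]))


print(normalize_merge("1-2,3-4,5-6"))            # [1, 2, 3, 4, 5, 6]
print(as_pairs(normalize_merge("1 2 3 4 5 6")))  # [(1, 2), (3, 4), (5, 6)]
```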

@tobimensch
Owner

@dbohdan
Will you keep that list updated? I realize this is just a blog post, but people might end up referencing this list in the future. A wiki would be an ideal place for something like that.

@dbohdan

dbohdan commented Apr 27, 2015

@tobimensch

A wiki would be an ideal place for something like that.

I completely agree. I made a GitHub wiki for it at https://github.com/dbohdan/structured-text-tools/wiki with the content in the post plus an update on Sqawk and termsql. You should be able to edit the wiki as long as you have a GitHub account.

@tobimensch
Owner

Nice :-)

@tobimensch
Owner

@danmbox
I implemented the "entire line" feature.

Comparing with sqawk examples:

sqawk -1 -OFS ' -- ' 'select a0, count(*) from a group by a0 having count(*) > 1' < file
termsql -R 'select raw,count(*) group by raw having count(*) > 1' < file
sqawk "select count (distinct substr(a0, 10, 3)) from a where a0 like 'WARNING: %'"
termsql -R "select count (distinct substr(raw, 10, 3)) where raw like 'WARNING: %'"

By the way, you could always have used the --line-as-colums feature to achieve the same thing:

termsql -l1 'select col0,count(*) group by col0 having count(*) > 1' < file

This is actually closer to sqawk -1, because it doesn't split stuff into fields, while -R/--raw is closer to the default mode of sqawk.


3 participants