Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update #599

Merged
merged 30 commits into from
Jun 10, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
30 commits
Select commit Hold shift + click to select a range
bf690e8
For pkgdown syntax highlighting (https://github.com/r-lib/pkgdown/iss…
SebKrantz Jun 1, 2024
5d20bb5
Merge pull request #588 from SebKrantz/development
SebKrantz Jun 1, 2024
8aa4148
Putting syntax highlighting code on new line.
SebKrantz Jun 1, 2024
b2de3b0
Merge pull request #589 from SebKrantz/development
SebKrantz Jun 1, 2024
2f2902d
Control colours in search.
SebKrantz Jun 1, 2024
ec76ea9
Also tertiary colour.
SebKrantz Jun 1, 2024
a196fc6
Merge pull request #590 from SebKrantz/development
SebKrantz Jun 1, 2024
1d4ae9d
Make valid R code.
SebKrantz Jun 1, 2024
4d63816
Merge pull request #591 from SebKrantz/development
SebKrantz Jun 1, 2024
d9387ab
Fix bug.
SebKrantz Jun 4, 2024
ef64fcc
Better verbosity for multi-joins.
SebKrantz Jun 4, 2024
29965e2
Update join docs.
SebKrantz Jun 4, 2024
bba6611
NEWS for enhanced join() verbosity.
SebKrantz Jun 4, 2024
607ce1f
Typo.
SebKrantz Jun 4, 2024
85c71fd
Also pre-rendering collapse and sf vignette.
SebKrantz Jun 4, 2024
26db0ca
Add vignettes figure and cache subdirectories.
SebKrantz Jun 4, 2024
bb79133
Merge pull request #592 from SebKrantz/development
SebKrantz Jun 4, 2024
c377177
Only flip if multiple.
SebKrantz Jun 5, 2024
52cb00d
Merge pull request #593 from SebKrantz/development
SebKrantz Jun 5, 2024
6a6eb4a
Always provide join order statistics so far feasible.
SebKrantz Jun 5, 2024
a9ccd89
Merge pull request #594 from SebKrantz/development
SebKrantz Jun 5, 2024
1423fe1
Use round here instead of signif.
SebKrantz Jun 5, 2024
1c2a60f
Merge pull request #595 from SebKrantz/development
SebKrantz Jun 5, 2024
4ec0e7f
In collap(): creating "Function" column as factor.
SebKrantz Jun 6, 2024
7678ada
Use 2 digits instead.
SebKrantz Jun 6, 2024
e40c617
Spelling.
SebKrantz Jun 6, 2024
ecfe96b
Merge pull request #596 from SebKrantz/development
SebKrantz Jun 6, 2024
09c4c24
Remove indent.
SebKrantz Jun 9, 2024
61d7ed6
Fix docs.
SebKrantz Jun 10, 2024
0e4454b
Merge pull request #598 from SebKrantz/development
SebKrantz Jun 10, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions .Rbuildignore
Original file line number Diff line number Diff line change
Expand Up @@ -29,5 +29,8 @@ _snaps
^\.DS_Store$
^revdep$
\.orig$
vignettes/figure
vignettes/cache



49 changes: 36 additions & 13 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,26 +2,49 @@

* `pivot()` has new arguments `FUN = "last"` and `FUN.args = NULL`, allowing wide and recast pivots with aggregation (default last value as before). `FUN` currently supports a single function returning a scalar value. *Fast Statistical Functions* receive vectorized execution. `FUN.args` can be used to supply a list of function arguments, including data-length arguments such as weights. There are also a couple of internal functions callable using function strings: `"first"`, `"last"`, `"count"`, `"sum"`, `"mean"`, `"min"`, or `"max"`. These are built into the reshaping C-code and thus extremely fast. Thanks @AdrianAntico for the request (#582).

* `join()` now provides enhanced verbosity, indicating the average order of the join between the two tables, e.g.
``` r
join(data.frame(id = c(1, 2, 2, 4)), data.frame(id = c(rep(1,4), 2:3)))
#> left join: x[id] 3/4 (75%) <1.5:1st> y[id] 2/6 (33.3%)
#> id
#> 1 1
#> 2 2
#> 3 2
#> 4 4
join(data.frame(id = c(1, 2, 2, 4)), data.frame(id = c(rep(1,4), 2:3)), multiple = TRUE)
#> left join: x[id] 3/4 (75%) <1.5:2.5> y[id] 5/6 (83.3%)
#> id
#> 1 1
#> 2 1
#> 3 1
#> 4 1
#> 5 2
#> 6 2
#> 7 4
```

* In `collap()`, with multiple functions passed to `FUN` or `catFUN` and `return = "long"`, the `"Function"` column is now generated as a factor variable instead of character (which is more efficient).

# collapse 2.0.14

* Updated '*collapse* and *sf*' vignette to reflect the recent support for *units* objects, and added a few more examples.

* Fixed a bug in `join()` where a full join silently became a left join if there are no matches between the tables (#574). Thanks @D3SL for reporting.

* Added function `group_by_vars()`: A standard evaluation version of `fgroup_by()` that is slimmer and safer for programming, e.g. `data |> group_by_vars(ind1) |> collapg(custom = list(fmean = ind2, fsum = ind3))`. Or, using *magrittr*:
```r
library(magrittr)
set_collapse(mask = "manip") # for fgroup_vars -> group_vars

data %>%
group_by_vars(ind1) %>% {
add_vars(
group_vars(., "unique"),
get_vars(., ind2) %>% fmean(keep.g = FALSE) %>% add_stub("mean_"),
get_vars(., ind3) %>% fsum(keep.g = FALSE) %>% add_stub("sum_")
)
}
```
```r
library(magrittr)
set_collapse(mask = "manip") # for fgroup_vars -> group_vars
data %>%
group_by_vars(ind1) %>% {
add_vars(
group_vars(., "unique"),
get_vars(., ind2) %>% fmean(keep.g = FALSE) %>% add_stub("mean_"),
get_vars(., ind3) %>% fsum(keep.g = FALSE) %>% add_stub("sum_")
)
}
```

* Added function `as_integer_factor()` to turn factors/factor columns into integer vectors. `as_numeric_factor()` already exists, but is memory inefficient for most factors where levels can be integers.

Expand Down
18 changes: 12 additions & 6 deletions R/collap.R
Original file line number Diff line number Diff line change
Expand Up @@ -79,6 +79,14 @@ applyfuns_internal <- function(data, by, FUN, fFUN, parallel, cores, ...) {
return(list(BY.data.frame(data, by, FUN, ..., use.g.names = FALSE, reorder = FALSE, return = "data.frame"))) # return(list(lapply(data, copysplaplfun, by, FUN, ...)))
}

rbindlist_factor <- function(l, idcol = "Function") {
nam <- names(l)
names(l) <- NULL
res <- .Call(C_rbindlist, l, TRUE, TRUE, idcol)
attr(res[[1L]], "levels") <- if (length(nam)) nam else as.character(seq_along(l))
oldClass(res[[1L]]) <- "factor"
res
}


# NOTE: CUSTOM SEPARATOR doesn't work because of unlist() !
Expand Down Expand Up @@ -284,8 +292,7 @@ collap <- function(X, by, FUN = fmean, catFUN = fmode, cols = NULL, w = NULL, wF
condalcSA(c(res[[1L]], e), ax, DTl) }))
} else {
if(return != 4L) {
res <- if(!keep.by) .Call(C_rbindlist, res, TRUE, TRUE, "Function") else # data.table:::Crbindlist
.Call(C_rbindlist, lapply(res[-1L], function(e) c(res[[1L]], e)), TRUE, TRUE, "Function")
if(keep.by) res <- lapply(res[-1L], function(e) c(res[[1L]], e))
} else {
if(!ncustoml || !(nul && nnul)) stop("long_dupl is only meaningful for aggregations with both numeric and categorical data, and multiple functions used for only one of the two data types!")
mFUN <- length(FUN) > 1L
Expand All @@ -295,8 +302,8 @@ collap <- function(X, by, FUN = fmean, catFUN = fmode, cols = NULL, w = NULL, wF
lapply(res[-nid], function(e) c(res[[nid]], e))
} else res <- if(mFUN) lapply(res[-c(nid, 1L)], function(e) c(res[[1L]], e, res[[nid]])) else
lapply(res[-c(nid, 1L)], function(e) c(res[[1L]], res[[nid]], e))
res <- .Call(C_rbindlist, res, FALSE, FALSE, "Function")
}
res <- rbindlist_factor(res)
if(keep.col.order) o <- if(ncustoml) forder.int(c(0L, if(!keep.by) NULL else if(!bycalll) rep(0L,length(numby)) else numby, nu, nnu)) else c(1L, o + 1L)
}
# } else message("return options other than 'wide' are only meaningful if multiple functions are used!")
Expand Down Expand Up @@ -475,8 +482,7 @@ collapv <- function(X, by, FUN = fmean, catFUN = fmode, cols = NULL, w = NULL, w
condalcSA(c(res[[1L]], e), ax, DTl) }))
} else {
if(return != 4L) {
res <- if(!keep.by) .Call(C_rbindlist, res, TRUE, TRUE, "Function") else # data.table:::Crbindlist
.Call(C_rbindlist, lapply(res[-1L], function(e) c(res[[1L]], e)), TRUE, TRUE, "Function")
if(keep.by) res <- lapply(res[-1L], function(e) c(res[[1L]], e))
} else {
if(!ncustoml || !(nul && nnul)) stop("long_dupl is only meaningful for aggregations with both numeric and categorical data, and multiple functions used for only one of the two data types!")
mFUN <- length(FUN) > 1L
Expand All @@ -486,8 +492,8 @@ collapv <- function(X, by, FUN = fmean, catFUN = fmode, cols = NULL, w = NULL, w
lapply(res[-nid], function(e) c(res[[nid]], e))
} else res <- if(mFUN) lapply(res[-c(nid, 1L)], function(e) c(res[[1L]], e, res[[nid]])) else
lapply(res[-c(nid, 1L)], function(e) c(res[[1L]], res[[nid]], e))
res <- .Call(C_rbindlist, res, FALSE, FALSE, "Function")
}
res <- rbindlist_factor(res)
if(keep.col.order) o <- if(ncustoml) forder.int(c(0L, if(!keep.by) NULL else numby, nu, nnu)) else c(1L, o + 1L)
}
# } else message("return options other than 'wide' are only meaningful if multiple functions are used!")
Expand Down
24 changes: 20 additions & 4 deletions R/join.R
Original file line number Diff line number Diff line change
Expand Up @@ -103,9 +103,11 @@ join <- function(x, y,

if(multiple) {
g <- group(if(rjoin) x[ixon] else y[iyon], group.sizes = TRUE)
if(verbose) mi <- m
m <- multi_match(m, g)
if(is.list(m)) {
multiple <- 2L
# TODO: Optimize if drop.dup.cols
if(rjoin) y <- .Call(C_subsetDT, y, m[[1L]], seq_along(y), FALSE)
else x <- .Call(C_subsetDT, x, m[[1L]], seq_along(x), FALSE)
m <- m[[2L]]
Expand All @@ -114,7 +116,8 @@ join <- function(x, y,
}

if(verbose) {
nx <- length(m) - attr(m, "N.nomatch")
Nx <- length(if(multiple) mi else m)
nx <- Nx - attr(m, "N.nomatch")
ny <- attr(m, "N.distinct")
Ny <- attr(m, "N.groups")
if(verbose == 2L) {
Expand All @@ -124,16 +127,29 @@ join <- function(x, y,
cin_x <- xon
cin_y <- on
}
xstat <- paste0(nx, "/", length(m), " (", signif(nx/length(m)*100, 3), "%)")
xstat <- paste0(nx, "/", Nx, " (", signif(nx/Nx*100, 3), "%)")
ystat <- paste0(ny, "/", Ny, " (", signif(ny/Ny*100, 3), "%)")
if(rjoin) {
tmp <- ystat
ystat <- xstat
xstat <- tmp
}
if(multiple) {
validate <- switch(validate,
"1:1" = "1:1",
"1:m" = paste0("1:", round((if(rjoin) nx else ny) / attr(mi, "N.distinct"), 2)),
"m:1" = paste0(round((if(rjoin) ny else nx) / attr(mi, "N.distinct"), 2), ":1"),
"m:m" = paste(round(c(nx, ny)[if(rjoin) 2:1 else 1:2] / attr(mi, "N.distinct"), 2), collapse = ":"))
} else {
validate <- switch(validate,
"1:1" = "1:1",
"1:m" = paste0("1:", if(rjoin) round(nx / ny, 2) else "1st"),
"m:1" = paste0(if(rjoin) "1st" else round(nx / ny, 2), ":1"),
"m:m" = paste(c(round(nx / ny, 2), "1st")[if(rjoin) 2:1 else 1:2], collapse = ":"))
}
cat(how, " join: ",
x_name, "[", paste(cin_x, collapse = ", "), "] ",
xstat, " <", validate ,"> ",
xstat, " <", validate , "> ",
y_name, "[", paste(cin_y, collapse = ", "), "] ",
ystat, "\n", sep = "")
}
Expand Down Expand Up @@ -269,7 +285,7 @@ join <- function(x, y,
} else matched <- "matched"
# TODO: better?
# matched <- paste0(y_name, "_", y_name)
mc <- switch(how,
mc <- switch(how, left_setrn =,
left = structure(is.na(m) + 1L, levels = c(matched, x_name), class = c("factor", "na.included")),
right = structure(is.na(m) + 1L, levels = c(matched, y_name), class = c("factor", "na.included")),
full = structure(vec(list(is.na(m) + 1L, alloc(3L, fnrow(res)-length(m)))), levels = c(matched, x_name, y_name), class = c("factor", "na.included")),
Expand Down
4 changes: 2 additions & 2 deletions man/collapse-renamed.Rd
Original file line number Diff line number Diff line change
Expand Up @@ -47,7 +47,7 @@ Renamed Functions
\description{
These functions were renamed (mostly during v1.6.0 update) to make the namespace more consistent. Except for the S3 generics of \code{fNobs}, \code{fNdistinct}, \code{fHDbetween} and \code{fHDwithin}, and functions \code{replace_NA} and \code{replace_Inf}, I intend to remove all of these functions by end of 2023. %The S3 generics and the other functions will be depreciated in 2023 for the earliest. These all now give a message reminding you not to use them in fresh code.
}
\section{Renaming}{\preformatted{
\section{Renaming}{\if{html}{\out{<div class="sourceCode r">}}\preformatted{
fNobs -> fnobs
fNdistinct -> fndistinct
pwNobs -> pwnobs
Expand All @@ -66,5 +66,5 @@ Date_vars -> date_vars
`Date_vars<-` -> `date_vars<-`
replace_NA -> replace_na
replace_Inf -> replace_inf
}
}\if{html}{\out{</div>}}
}
31 changes: 15 additions & 16 deletions man/fast-statistical-functions.Rd
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@
\description{
With \code{\link{fsum}}, \code{\link{fprod}}, \code{\link{fmean}}, \code{\link{fmedian}}, \code{\link{fmode}}, \code{\link{fvar}}, \code{\link{fsd}}, \code{\link{fmin}}, \code{\link{fmax}}, \code{\link{fnth}}, \code{\link{ffirst}}, \code{\link{flast}}, \code{\link{fnobs}} and \code{\link{fndistinct}}, \emph{collapse} presents a coherent set of extremely fast and flexible statistical functions (S3 generics) to perform column-wise, grouped and weighted computations on vectors, matrices and data frames, with special support for grouped data frames / tibbles (\emph{dplyr}) and \emph{data.table}'s.
}
\section{Usage}{\preformatted{
\section{Usage}{\if{html}{\out{<div class="sourceCode r">}}\preformatted{
## All functions (FUN) follow a common syntax in 4 methods:
FUN(x, ...)

Expand All @@ -27,7 +27,7 @@ FUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
FUN(x, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
use.g.names = FALSE, keep.group_vars = TRUE,
[keep.w = TRUE,] [stub = TRUE,] [nthreads = 1L,] ...)
}
}\if{html}{\out{</div>}}
}
\section{Arguments}{
\tabular{lll}{
Expand Down Expand Up @@ -75,7 +75,7 @@ Please see the documentation of individual functions.
\seealso{
\link[=collapse-documentation]{Collapse Overview}, \link[=data-transformations]{Data Transformations}, \link[=time-series-panel-series]{Time Series and Panel Series}
}
\section{Examples}{\preformatted{
\section{Examples}{\if{html}{\out{<div class="sourceCode r">}}\preformatted{
## default vector method
mpg <- mtcars$mpg
fsum(mpg) # Simple sum
Expand Down Expand Up @@ -106,18 +106,17 @@ fmode(wlddev, wlddev$income) # Grouped statistical modes ..
m <- qM(mtcars)
fsum(m)
fsum(m, g) # ..
\donttest{ % The tidyverse regularly causes havoc to CRAN tests in other packages, therefore this is not tested

## method for grouped data frames - created with dplyr::group_by or fgroup_by
library(dplyr)
mtcars |> group_by(cyl,vs,am) |> select(mpg,carb) |> fsum()
mtcars |> fgroup_by(cyl,vs,am) |> fselect(mpg,carb) |> fsum() # equivalent and faster !!
mtcars |> fgroup_by(cyl,vs,am) |> fsum(TRA = "\%")
mtcars |> fgroup_by(cyl,vs,am) |> fmean(hp) # weighted grouped mean, save sum of weights
mtcars |> fgroup_by(cyl,vs,am) |> fmean(hp, keep.group_vars = FALSE)
}\if{html}{\out{</div>}}
}
}
}
\section{Benchmark}{\preformatted{
\section{Benchmark}{\if{html}{\out{<div class="sourceCode r">}}\preformatted{
## This compares fsum with data.table (2 threads) and base::rowsum
# Starting with small data
mtcDT <- qDT(mtcars)
Expand All @@ -128,10 +127,10 @@ microbenchmark(mtcDT[, lapply(.SD, sum), by = f],
rowsum(mtcDT, f, reorder = FALSE),
fsum(mtcDT, f, na.rm = FALSE), unit = "relative")

expr min lq mean median uq max neval cld
mtcDT[, lapply(.SD, sum), by = f] 145.436928 123.542134 88.681111 98.336378 71.880479 85.217726 100 c
rowsum(mtcDT, f, reorder = FALSE) 2.833333 2.798203 2.489064 2.937889 2.425724 2.181173 100 b
fsum(mtcDT, f, na.rm = FALSE) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100 a
# expr min lq mean median uq max neval cld
# mtcDT[, lapply(.SD, sum), by = f] 145.436928 123.542134 88.681111 98.336378 71.880479 85.217726 100 c
# rowsum(mtcDT, f, reorder = FALSE) 2.833333 2.798203 2.489064 2.937889 2.425724 2.181173 100 b
# fsum(mtcDT, f, na.rm = FALSE) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100 a

# Now larger data
tdata <- qDT(replicate(100, rnorm(1e5), simplify = FALSE)) # 100 columns with 100.000 obs
Expand All @@ -141,11 +140,11 @@ microbenchmark(tdata[, lapply(.SD, sum), by = f],
rowsum(tdata, f, reorder = FALSE),
fsum(tdata, f, na.rm = FALSE), unit = "relative")

expr min lq mean median uq max neval cld
tdata[, lapply(.SD, sum), by = f] 2.646992 2.975489 2.834771 3.081313 3.120070 1.2766475 100 c
rowsum(tdata, f, reorder = FALSE) 1.747567 1.753313 1.629036 1.758043 1.839348 0.2720937 100 b
fsum(tdata, f, na.rm = FALSE) 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 100 a
}
# expr min lq mean median uq max neval cld
# tdata[, lapply(.SD, sum), by = f] 2.646992 2.975489 2.834771 3.081313 3.120070 1.2766475 100 c
# rowsum(tdata, f, reorder = FALSE) 1.747567 1.753313 1.629036 1.758043 1.839348 0.2720937 100 b
# fsum(tdata, f, na.rm = FALSE) 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 100 a
}\if{html}{\out{</div>}}
}

\keyword{univar}
Expand Down
5 changes: 2 additions & 3 deletions man/flm.Rd
Original file line number Diff line number Diff line change
Expand Up @@ -4,20 +4,19 @@
\alias{flm.formula}

\title{
Fast (Weighted) Linear Model Fitting % Sparse,
Fast (Weighted) Linear Model Fitting
}
\description{
\code{flm} is a fast linear model command that (by default) only returns a coefficient matrix. 6 different efficient fitting methods are implemented: 4 using base R linear algebra, and 2 utilizing the \emph{RcppArmadillo} and \emph{RcppEigen} packages. The function itself only has an overhead of 5-10 microseconds, and is thus well suited as a bootstrap workhorse.
}
\usage{
flm(...) # Internal method dispatch: default if is.atomic(..1)

\method{flm}{default}(y, X, w = NULL, add.icpt = FALSE, return.raw = FALSE, % sparse = FALSE
\method{flm}{default}(y, X, w = NULL, add.icpt = FALSE, return.raw = FALSE,
method = c("lm", "solve", "qr", "arma", "chol", "eigen"),
eigen.method = 3L, ...)

\method{flm}{formula}(formula, data = NULL, weights = NULL, add.icpt = TRUE, ...)

}
%- maybe also 'usage' for other objects documented here.
\arguments{
Expand Down
Loading
Loading