
Tracking Issue: Full R Support #1543

Open
4 of 10 tasks
roaldarbol opened this issue Jun 26, 2024 · 19 comments
Labels
✨ enhancement Feature request

Comments

@roaldarbol

roaldarbol commented Jun 26, 2024

Problem description

This is a list of issues that need to be solved to achieve complete R support. Not all are issues within pixi itself; some are obstacles to a smooth pixi-based workflow.

Packaging

  • Ability to generate recipes from R-Universe JSONs
  • Mass packaging of R packages from R-Universe to conda-forge or r-forge
    • Most R packages are already available on conda-forge with the r- prefix (e.g. r-dplyr), so much of the ecosystem is ready to use. However, not all packages are there, and the recipes are not easy to create or maintain. rattler-build is the new alternative to conda-build, based on a new recipe format, and quite a bit of work has gone into covering edge cases. Packages uploaded with rattler-build are currently placed in r-forge. R-Universe offers a really nice API that's easy to parse, so the hope is to automate packaging of all the packages available on R-Universe. Once/if this can be done, the idea is to turn it into an actual conda channel.
    • Once automated recipe generation and updating is implemented, maybe publish a coverage metric (the percentage of R-Universe packages that build successfully in r-forge).
  • Install R package from Github #2187 - ability to generate a recipe from a GitHub repository (needs a DESCRIPTION parser)

GUI

Workflow

Docs

@roaldarbol roaldarbol added the ✨ enhancement Feature request label Jun 26, 2024
@tdejager
Contributor

This looks good! Thanks for the issue :)

@andreiprodan

Would be awesome! Any chance it could also support installing R packages from (public) GitHub repos (i.e. not on CRAN or Bioconductor)?

@roaldarbol
Author

At least initially, no. But it would include everything on R-Universe, where you can find the vast majority of packages - and creating a universe for your packages is quite simple.

I guess @wolfv would know how much more complicated it is to install directly from a GitHub source, and whether it's feasible. Currently it uses the JSON from the R-Universe API (e.g. https://stan-dev.r-universe.dev/api/packages/). I think that to support GitHub packages, a new parser would need to be written for DESCRIPTION files. 😊
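For anyone curious, here's a minimal sketch of what consuming that API could look like from R. The column names are assumptions based on the JSON mirroring DESCRIPTION fields, so check them against the live endpoint:

library(jsonlite)

## Fetch all package metadata for one universe (here: stan-dev)
pkgs <- fromJSON("https://stan-dev.r-universe.dev/api/packages/")

## fromJSON() flattens the array of package objects into a data frame;
## "Package" and "Version" are assumed column names
head(pkgs[, c("Package", "Version")])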

I hope to make a write-up of all the R goodies quite soon!

@wolfv
Member

wolfv commented Jul 5, 2024

Yeah, there is no technical limitation why it wouldn't work directly from a GitHub package. Right now we just use the easy-to-parse JSON, but you could also write a little parser for the R native file - or write the recipe yourself! :)

@notPlancha
Contributor

Just to offer some insight, the DESCRIPTION file is just a Debian Control File, except it's encoded in ASCII and does not support comments. In theory it should be as easy as using an existing parser for those, such as debian-control. All info about the file (and R packages) can be found in the manual
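In fact, base R already ships a DCF reader, read.dcf(), so a quick experiment doesn't even need an external parser. A minimal sketch, assuming a DESCRIPTION file in the working directory:

## Parse a package's DESCRIPTION with base R's DCF reader
desc <- read.dcf("DESCRIPTION",
                 fields = c("Package", "Version", "Imports", "SystemRequirements"))

## Fields come back as a one-row character matrix; split the
## comma-separated Imports field into individual package names
imports <- strsplit(desc[1, "Imports"], ",[[:space:]]*")[[1]]
imports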

@jdblischak
Contributor

This is awesome. Thanks for all the hard work to support R users.

A few questions:

Mass packaging of R packages from R-Universe to conda-forge or r-forge

What is r-forge? Are you planning to create a new channel on anaconda.org? I confirmed that there is currently no channel with that name: https://anaconda.org/r-forge

Also, heads up that the name could cause some potential confusion in the R community. R-Forge has already existed for years (it's an R-specific source control system). It's not as popular now that GitHub exists, but saying "download the package from r-forge" could be ambiguous.

For the mass packaging, have you coordinated with the conda-forge R maintainers (@conda-forge/r)? They do a lot of work to maintain thousands of R packages. It's a lot of work to keep up with all the conda-forge migrations.

@roaldarbol
Author

roaldarbol commented Jul 15, 2024

@jdblischak Thanks, really good question! Most of this was implicit knowledge, but seeing as this issue is getting a bit of traction, I've updated that section now. I haven't been in touch with the conda-forge R maintainers except for trying to create recipes for packages - and although all my attempts failed, the maintainers were super helpful; overall I agree, it's hard work. It also seems that the packages are not updated very regularly (https://anaconda.org/r/repo).

For reference, see the edit of the initial post. With rattler-build and the R-Universe API, it's been super easy to create package recipes, and I think it'll be feasible to have automated packaging without too many edge cases (but @wolfv will know much better than me whether that's the case - I'm a dreamer 😉). We also talked about automatic cron jobs to check whether a package has been updated. R-Universe does this once every hour - I don't know if it'd need to be that often; possibly once a day would be reasonable.

I didn't know about the other R-Forge - good to know. I think @wolfv just chose the name as a way to have language-specific forges (see also rust-forge).

@wolfv
Member

wolfv commented Jul 15, 2024

Hey @jdblischak, indeed, thanks! I've been working on more automatic recipe generation for R recipes from inside rattler-build. The r-forge stuff is currently here: https://github.com/wolfv/r-forge and follows a common pattern (I've also tried a julia-forge and a rust-forge, but DISCLAIMER: this is all just prototypey).

I think for certain ecosystems (such as potentially R) it makes sense to maintain them in a more centralized manner than conda-forge. Since most recipes can be auto-generated and updated I think a mono-repo approach for most R packages could be potentially useful. But of course that's something to discuss with conda-forge and even more importantly the current maintainers of R packages in conda-forge.

The "forges" use rattler-build's --recipe-dir . --skip-existing=all functionality to find all packages, build them in the correct order, and skip everything that has already been built.

As far as recipe-generation goes I would like to come up with a generalized "patching" functionality so that the bulk of the recipe is generated, and then enhanced with patches (e.g. to add system-dependencies).

@roaldarbol
Author

roaldarbol commented Jul 15, 2024

@wolfv I also think it might be worth getting in touch with the conda-forge R maintainers at some point, but I reckon it's probably better you than me - I simply don't know enough about the packaging process.

@jdblischak
Contributor

Most of this was implicit knowledge, but seeing as this issue is getting a bit of traction, I've updated that section now.

@roaldarbol Thanks! It is much clearer now.

It also seems that the packages are not updated very regularly (https://anaconda.org/r/repo).

That is the "r" channel, part of the "defaults" channel provided by the Anaconda developers. It has nothing to do with the community channel conda-forge.

I think for certain ecosystems (such as potentially R) it makes sense to maintain them in a more centralized manner than conda-forge. Since most recipes can be auto-generated and updated I think a mono-repo approach for most R packages could be potentially useful. But of course that's something to discuss with conda-forge and even more importantly the current maintainers of R packages in conda-forge.

@wolfv I agree there would be advantages to a centralized mono-repo approach. This has been discussed before, e.g. in bgruening/conda_r_skeleton_helper#48. But this would be a huge change, both technically and socially. On the technical side, we'd have to figure out how to apply the conda-forge migrations to this monorepo. On the social side, we'd have to document that R users no longer submit new recipes to staged-recipes or open an issue on an individual feedstock, but instead must direct all their activities to the new mono-repo.

The "forges" use rattler-build's --recipe-dir . --skip-existing=all functionality to find all packages, build them in the correct order, and skip everything that has already been built.

I also worry about duplication of effort. In addition to the existing CRAN skeleton for conda-build, grayskull now also supports R recipes. With the addition of rattler-build, there are now at least 3 different ways to generate an R recipe.

Here's an old PR to the CRAN skeleton that attempted to directly parse the SystemRequirements field (conda/conda-build#3826). A better approach today would be to use an existing database, such as https://github.com/rstudio/r-system-requirements, but even this would require mapping the linux package names to the correct corresponding conda package name.
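To make the mapping problem concrete, here's a hypothetical sketch of the rule-based approach in R. The regex patterns imitate the style of r-system-requirements rules, and the conda package names are illustrative assumptions, not entries from that database:

## Toy rule table: free-text SystemRequirements pattern -> conda package
rules <- data.frame(
  pattern = c("\\blibcurl\\b", "\\blibxml-?2\\b", "\\bpandoc\\b"),
  conda   = c("libcurl", "libxml2", "pandoc")
)

## Return the conda packages whose pattern matches the free-text field
map_sysreqs <- function(sysreq_text) {
  hits <- vapply(rules$pattern, grepl, logical(1),
                 x = sysreq_text, ignore.case = TRUE)
  rules$conda[hits]
}

map_sysreqs("libcurl: libcurl-devel (rpm) or libcurl4-openssl-dev (deb)")
## [1] "libcurl"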

As far as recipe-generation goes I would like to come up with a generalized "patching" functionality so that the bulk of the recipe is generated, and then enhanced with patches (e.g. to add system-dependencies).

That's how NixOS builds its R packages. The recipes are auto-generated from a script, and then system requirements and other patches are added afterwards:

https://github.com/NixOS/nixpkgs/blob/master/pkgs/development/r-modules/default.nix

@roaldarbol
Author

That is the "r" channel, part of the "defaults" channel provided by the Anaconda developers. It has nothing to do with the community channel conda-forge.

Ah, my bad! No problem on that front then. 😊

I also worry about duplication of effort. In addition to the existing CRAN skeleton for conda-build, grayskull now also supports R recipes. With the addition of rattler-build, there are now at least 3 different ways to generate an R recipe.

grayskull has some shortcomings when it comes to packaging R packages, and conda_r_skeleton_helper is still the preferred way. I saw that you've been following these developments for quite some time (bgruening/conda_r_skeleton_helper#58), so you'll know more about the history and shortcomings of the current methods than me.

@wolfv Is the plan for rattler-build to replace the need for grayskull when creating recipes? I noticed that there's an issue about adding the new recipe format to grayskull.

@jdblischak
Contributor

I think for certain ecosystems (such as potentially R) it makes sense to maintain them in a more centralized manner than conda-forge. Since most recipes can be auto-generated and updated I think a mono-repo approach for most R packages could be potentially useful. But of course that's something to discuss with conda-forge and even more importantly the current maintainers of R packages in conda-forge.

@wolfv I've been thinking about this more, and I think the key is your use of most. Some R packages require much more maintenance. I'm thinking of packages like r-arrow and r-tiledb that require careful pinning to the correct version of their corresponding C++ library. Maintainers will want to be notified when their package has an update PR, and they will not want to give up their write-access (full disclosure: I am a maintainer of r-tiledb). But with the mono-repo approach, there is no way (that I am aware of) to only receive notifications when certain files are touched by a PR, or only grant write-access to specific files within a PR.

As a first pass, I would recommend starting the mono-repo with all the conda-forge R feedstocks that only have conda-forge/r as the sole maintainer. After that is working, you could open issues on the remaining feedstocks offering to take over maintenance by adding the recipe to the mono-repo and archiving the feedstock.

@roaldarbol
Author

I'll just try to parse out which separate issues I see in this conversation so we can create separate issues for them in the appropriate location:

  1. Application of patches. Seems like a rattler-build issue.
    • Lots of good info here though, thanks for the links @jdblischak!
    • @jdblischak I'm not familiar with the intricacies of packaging complex packages, so I hope it's okay if I ask some stupid questions. I also tried reading up a bit in the Writing R Extensions manual, and learned a bit, but it's a mammoth document... Do I understand correctly that SystemRequirements is a quite unstructured field compared to Depends, License, etc.? And that https://github.com/rstudio/r-system-requirements is then a set of rules for attempting to parse the relevant system requirements and their versions? Is it mostly build-time requirements that are needed? Do you have an estimate of the proportion of packages that need manual patching?
  2. General packaging strategy. Could possibly be an issue on r-forge, as that's currently the prototype/playground for the automated packaging.
    • For me, the intriguing aspect is the automation. Currently, users are encouraged to contribute a recipe to conda-forge if it's not present and to accept responsibility for maintaining it - a big ask for your average R user, and unlike Python devs, R devs don't tend to think of contributing to conda. (I guess it's possible, although not mentioned, to request a package - but that has only happened a single time, not counting mfansler.) The automated pull from the R-Universe API would bridge that gap; most packages don't have tricky packaging needs and don't need manual maintenance. In the short/intermediate term, both conda-forge and r-forge could then be used as channels with great coverage - and since pixi has strict channel priority, conda-forge would always take priority if the user wants it to (see the sketch after this list).
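For illustration, a minimal pixi.toml along those lines - r-forge listed as a fallback channel behind conda-forge (the r-forge channel is aspirational here, per the discussion above):

[project]
name = "my-r-analysis"
# With pixi's strict channel priority, conda-forge wins whenever a
# package exists in both channels
channels = ["conda-forge", "r-forge"]
platforms = ["linux-64"]

[dependencies]
r-base = "*"
r-dplyr = "*"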

Can I just say what I'm realising: Packaging is hard. Y'all are doing an amazing job.

@phue

phue commented Jul 17, 2024

Instead of figuring out all these packaging problems, I wonder if it would be a feasible alternative to support renv instead, which already implements lock files, git remotes, and prebuilt packages via p3m. Something similar to how PyPI dependencies are deferred to uv, if that makes sense?

@andreiprodan

My 2 cents as a daily renv user: it's IMHO the best current solution for managing R packages. The downsides of renv (it cannot provide R itself or system libraries, unlike conda/mamba) would be nicely addressed by pixi.

@jdblischak
Contributor

Do I understand correctly that SystemRequirements is a quite unstructured field compared to Depends, License, etc.?

Correct. It is purely to inform end users. It is completely optional/voluntary, and R itself never parses it.

And that https://github.com/rstudio/r-system-requirements is then a set of rules for attempting to parse the relevant system requirements and their versions?

Correct. They used to maintain an explicit database with the system requirements mappings, sysreqsdb, but from the README of r-system-requirements, apparently that manual approach was too cumbersome.

Is it mostly build-time requirements that are needed? Do you have an estimate of the proportion of packages that need manual patching?

Hard to say, especially build-time versus run-time. Looking at the manual for the R package {pak}, which uses r-system-requirements, it explicitly states that it doesn't attempt to distinguish between build-time and run-time. I attempted to do a quick analysis with their function sysreqs_db_list(), but I couldn't figure it out. I did a spot check of packages with obvious build-time ({curl} requires libcurl4-openssl-dev, {xml2} requires libxml2-dev) and run-time ({rmarkdown} requires pandoc) dependencies, but these were all empty. Presumably I am doing something wrong, since there is clearly a pandoc rule in r-system-requirements.

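## Query pak's sysreqs database for Ubuntu 22.04, then spot-check packages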
sysreqs <- pak::sysreqs_db_list(sysreqs_platform = "ubuntu-22.04")
str(subset(sysreqs, name == "curl"))
## Classes ‘tbl’ and 'data.frame':	0 obs. of  5 variables:
##  $ name        : chr
##  $ patterns    : list()
##  $ packages    : list()
##  $ pre_install : list()
##  $ post_install: list()
str(subset(sysreqs, name == "xml2"))
## Classes ‘tbl’ and 'data.frame':	0 obs. of  5 variables:
##  $ name        : chr
##  $ patterns    : list()
##  $ packages    : list()
##  $ pre_install : list()
##  $ post_install: list()
str(subset(sysreqs, name == "rmarkdown"))
## Classes ‘tbl’ and 'data.frame':	0 obs. of  5 variables:
##  $ name        : chr
##  $ patterns    : list()
##  $ packages    : list()
##  $ pre_install : list()
##  $ post_install: list()
str(subset(sysreqs, name == "chrome"))
## Classes ‘tbl’ and 'data.frame':	1 obs. of  5 variables:
##  $ name        : chr "chrome"
##  $ patterns    :List of 1
##   ..$ : chr "\\bchrome\\b"
##  $ packages    :List of 1
##   ..$ : NULL
##  $ pre_install :List of 1
##   ..$ : chr  "[ $(which google-chrome) ] || apt-get install -y gnupg curl" "[ $(which google-chrome) ] || curl -fsSL -o /tmp/google-chrome.deb https://dl.google.com/linux/direct/google-ch"| __truncated__ "[ $(which google-chrome) ] || DEBIAN_FRONTEND='noninteractive' apt-get install -y /tmp/google-chrome.deb"
##  $ post_install:List of 1
## ..$ : chr "rm -f /tmp/google-chrome.deb"

Anyways, one useful metric is how many packages require compilation. This will give you a sense of how many are trivial to build binaries for. You'll also want to investigate any packages with restrictive licenses.

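## Tally CRAN packages by compilation need and by restrictive license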
x <- as.data.frame(available.packages())
table(x$NeedsCompilation)
## 
##    no   yes 
## 16267  4760 
table(x$License_restricts_use == "yes")
## 
## FALSE  TRUE 
##     9     3 

You can also look at how many R packages NixOS patches, for a rough estimate of the number of packages with system requirements, as well as those they have marked as broken (both live in the default.nix linked above).

@GitHunter0

Just wanna say that pixi was the best thing that happened to python in a very long time.

It is solving all python package dependency issues that were a nightmare for developers.

I really hope pixi keeps improving and expanding to other languages (especially R and Julia).

I'd like to thank all the contributors of this project, pixi is really incredible.

@andystopia

andystopia commented Dec 9, 2024

Not sure it'll be useful, but it can hopefully serve at least as a reference: I wrote a DESCRIPTION file parser at andystopia/cran-work. A crate in the repo tests the parser against the most recent version of every package's DESCRIPTION file (21821 files). Result: 0 errors.

@andystopia

andystopia commented Dec 11, 2024

I've worked on this a little more, and my repo, andystopia/cran-work, can now generate rattler-build files from CRAN & Bioconductor DESCRIPTION files directly, including historical versions of packages!

The following command will generate an r-matrix directory with a build YAML inside, and should be sufficient to build the latest Matrix package from CRAN.

cargo run --release -p description-to-rattler -- cran recipe Matrix --export

You can leave off --export if you just want the definition printed to stdout.
