Only download pages matching the user's languages #345

niklasmohrin · 2023-10-20T23:15:47Z

Closes #335, although there are some follow ups in the comments that we should probably open new issues for.

Also, I opened tldr-pages/tldr#11121 in the process.

dbrgn

The code is pretty nice. However, I'm not yet convinced about the usability tradeoffs of this feature.

Advantages of Per-Language Downloads

First of all, the advantages: I think the only advantage is that we don't download all languages at once and thus speed up the update in case of a slow internet connection.

Right now the complete tldr.zip download is 7.7 MiB in size. The English pages only are 1.8 MiB. German pages are 259 KiB.

(Edit: I looked at the original issue, and disk space was mentioned an issue as well. Right now the pages use 76 MiB. 17 of those could be saved by deduplicating the now duplicated pages and pages.en directories. However, both 59 or 76 MiB seem like a non-issue on today's hardware, where 1 TiB of storage costs you around 40$. People that want to save disk space should ideally pick a tldr client that fetches the pages on-demand, without a cache. Or a wrapper script could be used that removes unneeded languages after a tealdeer update.)

Disadvantages of Per-Language Downloads

I gave it a quick try, and noticed a few issues:

Messages in case of a cache miss are not clear. If I run tldr --update, only the English pages are downloaded. If I now run tldr -L fr tar, I get this error message:
```
 Warning: Page `tar` not found in cache.
 Try updating with `tldr --update`, or submit a pull request to:
 https://github.com/tldr-pages/tldr
```
...however, running tldr --update will not help, it will simply re-download the English pages. Tealdeer has no way of knowing whether a French version of that page exists (it does), so UX will always be non-ideal.

One workaround would be to suggest running tldr -L <current-language> --update if the language isn't the default language (according to the config, with fallback to English). Adds complexity though.
The feature has a very ugly interaction with auto-update. If I run tldr -L de tar with a stale cache, the old cache will be nuked and replaced with a German cache. Next time I run tldr git with the default language set to English, it will complain that the page wasn't found in the cache. However, because the cache isn't stale (it was just fetched), it will not be auto-updated either.
Depending on how many languages the user has configured, given a fast internet connection, cache updates may be slower than before (because we sequentially download and unzip the language archives).
Finally, the handling of all those archives, in combination with the update logic, adds considerable complexity to the codebase. I'm fairly sure there are other UX issues that we may have missed.

All in all, this seems a nice feature at first, but I'm not sure the (in my opinon tiny) advantages outweigh the disadvantages. What's your opinion, @niklasmohrin

dbrgn · 2023-10-21T22:24:11Z

src/main.rs

@@ -61,7 +61,7 @@ const APP_INFO: AppInfo = AppInfo {
 name: NAME,
 author: NAME,
 };
-const ARCHIVE_URL: &str = "https://tldr.sh/assets/tldr.zip";
+const ARCHIVES_URL: &str = "https://tldr.sh/assets/";


Maybe this should be called ARCHIVE_URL_PREFIX instead? (If yes, the cache update method parameter needs a rename as well.)

dbrgn · 2023-10-21T22:40:37Z

src/cache.rs

@@ -134,7 +134,7 @@ impl Cache {
 }

 /// Download the archive from the specified URL.


Maybe document the return type?

Suggested change

/// Download the archive from the specified URL.

/// Download the archive from the specified URL.

/// Return `None` if the URL results in a HTTP 404.

niklasmohrin · 2023-10-24T20:18:31Z

The main motivation for me is honestly that it just feels like the right thing to do. Downloading ~40 languages when most people only need one or two is excessive and extra work which is not in the spirit of tealdeer as I perceive it. I have to admit, this is a tough argument to bring, given that I previously argued against git in favor of simplicity, but I think that the situation is different here since the language-specific zips are only marginally more complicated than the one big zip.

None of this solves the problems you pointed out though, so here is what I think about each of those:

Agree, but could be solved very easily by adding a short "and check that you have the specified language installed"
Agree, and this one is tougher. I think the solution would be to distinguish between "cache languages" (or "update languages") and "query languages". The former is the set of all languages mentioned by the user, i.e. cli arg + config file + env, while the latter is what we currently have in the languages variable in main. (This implies that we should not release this feature until we also figured out the config file part for this feature; doesn't sound like a problem to me)
Not sure, I think we would have to measure to make any conclusion here. If this really is a problem, we could try to spawn some threads, but I doubt that any of this is noticable at all
Not sure again, I think the added complexity is contained in Cache where we reproduce the layout as we currently have it. I agree that there might be other cases where missing languages would lead to unfortunate UX, but I think most of these would be caught by adding the "make sure you configured the languages" sentence to the "page not found" text, which would result in a worst case scenario of users opening issues with suggestions for more inclusions in the "cache languages" set I described in 2.

So all in all, I agree that there are problems, but I am still convinced that language specific downloads are the way to go and that I would have implemented it like this, if I were to write tealdeer from scratch today. What do you think?

dbrgn · 2024-02-11T15:13:56Z

Agree, but could be solved very easily by adding a short "and check that you have the specified language installed"

Such a message would need to be conditional on whether or not a custom language is enabled in the settings, or via CLI parameters. Otherwise, it might be quite confusing, if I search for a page that doesn't exist in the tldr repository, and get the message that I need to "install some language".

Agree, and this one is tougher. I think the solution would be to distinguish between "cache languages" (or "update languages") and "query languages".

That might be possible, but I think that would add a lot of additional complexity in the codebase that we don't need right now. It's harder to contribute to the code without fully understanding related concepts, and bugs are more likely to happen in certain uncommon combinations / configurations.

All in all, I still think we're better off in the long run if we don't add support for this.

The few people that have such little disk space that they can't store all TLDR pages are better off with a non-caching client.
The very few people that have such little disk space and absolutely need an offline client could use the node client which implements this feature.
The very very few people that have such little disk space and absolutely need an offline client and totally want to use tealdeer for this use case could turn off auto updates and manually download and unpack the pages in their desired language to the cache directory.

I like to develop a client implementation that is fast, works for the majority of use cases, has a clean codebase and is easy to maintain. I fear that if we start adding more complexity to cover rare use cases, then the codebase becomes unnecessarily complex and the project will start to feel like a day job. And at that point, I'd lose the motivation to keep working on it 🙂

@niklasmohrin would it be OK for you if we closed this PR?

niklasmohrin · 2024-03-01T16:48:17Z

I don't agree. I think the complexity related to languages is mostly already there, we already have to parse language strings, find the correct directory name for a given language and respect the order. I agree that adding all needed changes to support the features from the issue and from what we discussed here will add complexity, but I think this is just incidental complexity that occurs because we are adding this after the fact. I think that if we sat down and maybe refactored the Cache interface we could come up with something that is harder to use wrong (than this PR) and still supports all desired features.

I suppose we could close the PR if you are unhappy with the trend the code is taking here, but I would not abandon the feature - I still think it is worthwhile to have and a behavior I would expect "good software" to have. After all, we have a lot of code for configuring the style of the output, something that I would consider less of a priority than language specific downloads if I were to write a tldr client from scratch today.

As far as vision goes, I agree that the project should stay maintainable and fast. Still, I see tealdeer as "the sane choice for tldr on the command line" and from that perspective I think that we should not refrain from implementing measures to reduce disk space and internet usage for minimalist users.

I don't know, for some reason I draw the line somewhere between this PR and git downloads :^)

EphemeralDev · 2024-03-09T03:16:10Z

On windows 11, installed tealdeer via scoop. I used the timeit command of nushell and ran tldr --update. It took 6min 2sec 768ms 912µs 800ns to update the cache. Other details, this was done over WiFi and the laptop in question does have a slower HDD.

dbrgn · 2024-03-09T21:15:34Z

I agree that adding all needed changes to support the features from the issue and from what we discussed here will add complexity, but I think this is just incidental complexity that occurs because we are adding this after the fact. I think that if we sat down and maybe refactored the Cache interface we could come up with something that is harder to use wrong (than this PR) and still supports all desired features.

To be honest, I doubt that. But if you could pull that off, I wouldn't be opposed to merging it. My main worry is really the added complexity and the increased maintenance burden.

How should we proceed?

niklasmohrin · 2024-04-09T16:30:27Z

Okay, so I will sit down and do a big refactor when I find the time for it. Let's keep this draft open for visibility

niklasmohrin · 2024-04-24T22:53:29Z

Okay, what do you think about the following:

// new_cache.rs

use std::{path::Path, time::Duration};

use crate::{cache::PageLookupResult, types::PlatformType};

use anyhow::Result;

pub struct Language<'a>(&'a str);

pub struct CacheConfig<'a> {
    pub pages_directory: &'a Path,
    pub custom_pages_directory: Option<&'a Path>,
    pub platforms: &'a [PlatformType],
    pub languages: &'a [Language<'a>],
}

/// The directory backing this cache is checked to be populated at construction.
pub struct Cache<'a> {
    config: CacheConfig<'a>,
}

impl<'a> Cache<'a> {
    pub fn open(config: CacheConfig<'a>) -> Result<Option<Self>> {
        todo!()
    }

    pub fn open_or_create(config: CacheConfig<'a>) -> Result<Self> {
        todo!()
    }

    pub fn age(&self) -> Result<Duration> {
        todo!()
    }

    pub fn list_pages(&self) -> impl IntoIterator<Item = String> {
        []
    }

    pub fn find_page(&self, command: &str) -> Option<PageLookupResult> {
        todo!()
    }

    pub fn clear(self) -> Result<()> {
        todo!()
    }

    pub fn update(&mut self, archive_url: &str) -> Result<()> {
        todo!()
    }
}

In main, we would use it like so:

let cache_config = new_cache::CacheConfig {
    pages_directory: &config.directories.cache_dir.path.join("tldr-main"),
    custom_pages_directory: config
        .directories
        .custom_pages_dir
        .as_ref()
        .map(PathWithSource::path),
    platforms: todo!(),
    languages: todo!(),
};

let cache = if args.update {
    new_cache::Cache::open_or_create(cache_config)?
} else {
    new_cache::Cache::open(cache_config)?.context("Cache not found, run `tldr --update`.")?
};

if args.clear_cache {
    cache.clear()?;

    clear_deprecated_cache(&config)?;

    return Ok(());
}

if should_update_cache(todo!(), &args, &config) {
    cache.update(todo!())?;

    clear_deprecated_cache(&config)?;
}

if args.list {
    for page in cache.list_pages() {
        println!("{}", page);
    }
    return Ok(());
}

ensure!(
    !args.command.is_empty(),
    "CLI parsing should have not allowed this!"
);

// Note: According to the TLDR client spec, page names must be transparently
// lowercased before lookup:
// https://github.com/tldr-pages/tldr/blob/main/CLIENT-SPECIFICATION.md#page-names
let command = args.command.join("-").to_lowercase();

let Some(lookup_result) = cache.find_page(&command) else {
    if !args.quiet {
        print_warning(
            enable_styles,
            &format!(
                "Page `{}` not found in cache.\n\
                     Try updating with `tldr --update`, or submit a pull request to:\n\
                     https://github.com/tldr-pages/tldr",
                &command
            ),
        );
    }

    bail!("Page not found")
};

if let Err(ref e) = print_page(&lookup_result, args.raw, enable_styles, args.pager, &config) {
    print_error(enable_styles, e);

    bail!(e)
}

Ok(())

One thing that I paid attention to is that the cache should not care about outputting message to the user - no enable_styles anywhere, only unexpected errors with Err or expected errors with Option. I think I had already written something like that at some point, but I think that we should in general refactor our user output with some kind of OutputManager or so. We could then manage what kind of messages are printed in which way in a separate module. Maybe we can even be fancy and do something similar to anyhow::Context so that we have something like

enum DomainError {
    CacheNotFound,
    PageNotFound(&str),
    // ...
}

Cache::open(cache_config)?.or_domain_error(&output_manager, DomainError::CacheNotFound)?

... but that is a topic for another time.

Another thing: I returned early from main here if for example the cache is cleared. I think it is reasonable, I don't see why tldr --clear-cache --update or tldr --clear-cache tar or tldr --list tar should be allowed. I think this makes more sense and it also makes the code easier.

Note that I already put it clear_deprecated_cache in reasonable places - the implementation is very simple:

fn clear_deprecated_cache(config: &Config) -> anyhow::Result<()> {
    let deprecated_config = new_cache::CacheConfig {
        pages_directory: &config.directories.cache_dir.path().join("tldr-master"),
        custom_pages_directory: None,
        platforms: &[],
        languages: &[],
    };

    let Some(cache) = new_cache::Cache::open(deprecated_config)? else {
        return Ok(());
    };

    cache.clear()?;

    Ok(())
}

So, what do you think? @dbrgn

Only download pages matching the user's languages

723cadf

niklasmohrin requested a review from dbrgn October 20, 2023 23:16

dbrgn reviewed Oct 21, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Only download pages matching the user's languages #345

Only download pages matching the user's languages #345

niklasmohrin commented Oct 20, 2023

dbrgn left a comment •

edited

dbrgn Oct 21, 2023

dbrgn Oct 21, 2023

niklasmohrin commented Oct 24, 2023

dbrgn commented Feb 11, 2024

niklasmohrin commented Mar 1, 2024

EphemeralDev commented Mar 9, 2024 •

edited

dbrgn commented Mar 9, 2024

niklasmohrin commented Apr 9, 2024

niklasmohrin commented Apr 24, 2024 •

edited

		@@ -134,7 +134,7 @@ impl Cache {
		}

		/// Download the archive from the specified URL.

	/// Download the archive from the specified URL.
	/// Download the archive from the specified URL.
	/// Return `None` if the URL results in a HTTP 404.

Only download pages matching the user's languages #345

Are you sure you want to change the base?

Only download pages matching the user's languages #345

Conversation

niklasmohrin commented Oct 20, 2023

dbrgn left a comment • edited

Choose a reason for hiding this comment

dbrgn Oct 21, 2023

Choose a reason for hiding this comment

dbrgn Oct 21, 2023

Choose a reason for hiding this comment

niklasmohrin commented Oct 24, 2023

dbrgn commented Feb 11, 2024

niklasmohrin commented Mar 1, 2024

EphemeralDev commented Mar 9, 2024 • edited

dbrgn commented Mar 9, 2024

niklasmohrin commented Apr 9, 2024

niklasmohrin commented Apr 24, 2024 • edited

dbrgn left a comment •

edited

EphemeralDev commented Mar 9, 2024 •

edited

niklasmohrin commented Apr 24, 2024 •

edited