Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only download pages matching the user's languages #345

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

niklasmohrin
Copy link
Collaborator

Closes #335, although there are some follow ups in the comments that we should probably open new issues for.

Also, I opened tldr-pages/tldr#11121 in the process.

Copy link
Owner

@dbrgn dbrgn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code is pretty nice. However, I'm not yet convinced about the usability tradeoffs of this feature.

Advantages of Per-Language Downloads

First of all, the advantages: I think the only advantage is that we don't download all languages at once and thus speed up the update in case of a slow internet connection.

Right now the complete tldr.zip download is 7.7 MiB in size. The English pages only are 1.8 MiB. German pages are 259 KiB.

(Edit: I looked at the original issue, and disk space was mentioned an issue as well. Right now the pages use 76 MiB. 17 of those could be saved by deduplicating the now duplicated pages and pages.en directories. However, both 59 or 76 MiB seem like a non-issue on today's hardware, where 1 TiB of storage costs you around 40$. People that want to save disk space should ideally pick a tldr client that fetches the pages on-demand, without a cache. Or a wrapper script could be used that removes unneeded languages after a tealdeer update.)

Disadvantages of Per-Language Downloads

I gave it a quick try, and noticed a few issues:

  1. Messages in case of a cache miss are not clear. If I run tldr --update, only the English pages are downloaded. If I now run tldr -L fr tar, I get this error message:

     Warning: Page `tar` not found in cache.
     Try updating with `tldr --update`, or submit a pull request to:
     https://github.com/tldr-pages/tldr
    

    ...however, running tldr --update will not help, it will simply re-download the English pages. Tealdeer has no way of knowing whether a French version of that page exists (it does), so UX will always be non-ideal.

    One workaround would be to suggest running tldr -L <current-language> --update if the language isn't the default language (according to the config, with fallback to English). Adds complexity though.

  2. The feature has a very ugly interaction with auto-update. If I run tldr -L de tar with a stale cache, the old cache will be nuked and replaced with a German cache. Next time I run tldr git with the default language set to English, it will complain that the page wasn't found in the cache. However, because the cache isn't stale (it was just fetched), it will not be auto-updated either.

  3. Depending on how many languages the user has configured, given a fast internet connection, cache updates may be slower than before (because we sequentially download and unzip the language archives).

  4. Finally, the handling of all those archives, in combination with the update logic, adds considerable complexity to the codebase. I'm fairly sure there are other UX issues that we may have missed.


All in all, this seems a nice feature at first, but I'm not sure the (in my opinon tiny) advantages outweigh the disadvantages. What's your opinion, @niklasmohrin

@@ -61,7 +61,7 @@ const APP_INFO: AppInfo = AppInfo {
name: NAME,
author: NAME,
};
const ARCHIVE_URL: &str = "https://tldr.sh/assets/tldr.zip";
const ARCHIVES_URL: &str = "https://tldr.sh/assets/";
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should be called ARCHIVE_URL_PREFIX instead? (If yes, the cache update method parameter needs a rename as well.)

@@ -134,7 +134,7 @@ impl Cache {
}

/// Download the archive from the specified URL.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe document the return type?

Suggested change
/// Download the archive from the specified URL.
/// Download the archive from the specified URL.
/// Return `None` if the URL results in a HTTP 404.

@niklasmohrin
Copy link
Collaborator Author

The main motivation for me is honestly that it just feels like the right thing to do. Downloading ~40 languages when most people only need one or two is excessive and extra work which is not in the spirit of tealdeer as I perceive it. I have to admit, this is a tough argument to bring, given that I previously argued against git in favor of simplicity, but I think that the situation is different here since the language-specific zips are only marginally more complicated than the one big zip.

None of this solves the problems you pointed out though, so here is what I think about each of those:

  1. Agree, but could be solved very easily by adding a short "and check that you have the specified language installed"
  2. Agree, and this one is tougher. I think the solution would be to distinguish between "cache languages" (or "update languages") and "query languages". The former is the set of all languages mentioned by the user, i.e. cli arg + config file + env, while the latter is what we currently have in the languages variable in main. (This implies that we should not release this feature until we also figured out the config file part for this feature; doesn't sound like a problem to me)
  3. Not sure, I think we would have to measure to make any conclusion here. If this really is a problem, we could try to spawn some threads, but I doubt that any of this is noticable at all
  4. Not sure again, I think the added complexity is contained in Cache where we reproduce the layout as we currently have it. I agree that there might be other cases where missing languages would lead to unfortunate UX, but I think most of these would be caught by adding the "make sure you configured the languages" sentence to the "page not found" text, which would result in a worst case scenario of users opening issues with suggestions for more inclusions in the "cache languages" set I described in 2.

So all in all, I agree that there are problems, but I am still convinced that language specific downloads are the way to go and that I would have implemented it like this, if I were to write tealdeer from scratch today. What do you think?

@dbrgn
Copy link
Owner

dbrgn commented Feb 11, 2024

Agree, but could be solved very easily by adding a short "and check that you have the specified language installed"

Such a message would need to be conditional on whether or not a custom language is enabled in the settings, or via CLI parameters. Otherwise, it might be quite confusing, if I search for a page that doesn't exist in the tldr repository, and get the message that I need to "install some language".

Agree, and this one is tougher. I think the solution would be to distinguish between "cache languages" (or "update languages") and "query languages".

That might be possible, but I think that would add a lot of additional complexity in the codebase that we don't need right now. It's harder to contribute to the code without fully understanding related concepts, and bugs are more likely to happen in certain uncommon combinations / configurations.

All in all, I still think we're better off in the long run if we don't add support for this.

  • The few people that have such little disk space that they can't store all TLDR pages are better off with a non-caching client.
  • The very few people that have such little disk space and absolutely need an offline client could use the node client which implements this feature.
  • The very very few people that have such little disk space and absolutely need an offline client and totally want to use tealdeer for this use case could turn off auto updates and manually download and unpack the pages in their desired language to the cache directory.

I like to develop a client implementation that is fast, works for the majority of use cases, has a clean codebase and is easy to maintain. I fear that if we start adding more complexity to cover rare use cases, then the codebase becomes unnecessarily complex and the project will start to feel like a day job. And at that point, I'd lose the motivation to keep working on it 🙂

@niklasmohrin would it be OK for you if we closed this PR?

@niklasmohrin
Copy link
Collaborator Author

I don't agree. I think the complexity related to languages is mostly already there, we already have to parse language strings, find the correct directory name for a given language and respect the order. I agree that adding all needed changes to support the features from the issue and from what we discussed here will add complexity, but I think this is just incidental complexity that occurs because we are adding this after the fact. I think that if we sat down and maybe refactored the Cache interface we could come up with something that is harder to use wrong (than this PR) and still supports all desired features.

I suppose we could close the PR if you are unhappy with the trend the code is taking here, but I would not abandon the feature - I still think it is worthwhile to have and a behavior I would expect "good software" to have. After all, we have a lot of code for configuring the style of the output, something that I would consider less of a priority than language specific downloads if I were to write a tldr client from scratch today.

As far as vision goes, I agree that the project should stay maintainable and fast. Still, I see tealdeer as "the sane choice for tldr on the command line" and from that perspective I think that we should not refrain from implementing measures to reduce disk space and internet usage for minimalist users.

I don't know, for some reason I draw the line somewhere between this PR and git downloads :^)

@EphemeralDev
Copy link

EphemeralDev commented Mar 9, 2024

On windows 11, installed tealdeer via scoop. I used the timeit command of nushell and ran tldr --update. It took 6min 2sec 768ms 912µs 800ns to update the cache. Other details, this was done over WiFi and the laptop in question does have a slower HDD.

@dbrgn
Copy link
Owner

dbrgn commented Mar 9, 2024

I agree that adding all needed changes to support the features from the issue and from what we discussed here will add complexity, but I think this is just incidental complexity that occurs because we are adding this after the fact. I think that if we sat down and maybe refactored the Cache interface we could come up with something that is harder to use wrong (than this PR) and still supports all desired features.

To be honest, I doubt that. But if you could pull that off, I wouldn't be opposed to merging it. My main worry is really the added complexity and the increased maintenance burden.

How should we proceed?

@niklasmohrin
Copy link
Collaborator Author

Okay, so I will sit down and do a big refactor when I find the time for it. Let's keep this draft open for visibility

@niklasmohrin
Copy link
Collaborator Author

niklasmohrin commented Apr 24, 2024

Okay, what do you think about the following:

// new_cache.rs

use std::{path::Path, time::Duration};

use crate::{cache::PageLookupResult, types::PlatformType};

use anyhow::Result;

pub struct Language<'a>(&'a str);

pub struct CacheConfig<'a> {
    pub pages_directory: &'a Path,
    pub custom_pages_directory: Option<&'a Path>,
    pub platforms: &'a [PlatformType],
    pub languages: &'a [Language<'a>],
}

/// The directory backing this cache is checked to be populated at construction.
pub struct Cache<'a> {
    config: CacheConfig<'a>,
}

impl<'a> Cache<'a> {
    pub fn open(config: CacheConfig<'a>) -> Result<Option<Self>> {
        todo!()
    }

    pub fn open_or_create(config: CacheConfig<'a>) -> Result<Self> {
        todo!()
    }

    pub fn age(&self) -> Result<Duration> {
        todo!()
    }

    pub fn list_pages(&self) -> impl IntoIterator<Item = String> {
        []
    }

    pub fn find_page(&self, command: &str) -> Option<PageLookupResult> {
        todo!()
    }

    pub fn clear(self) -> Result<()> {
        todo!()
    }

    pub fn update(&mut self, archive_url: &str) -> Result<()> {
        todo!()
    }
}

In main, we would use it like so:

let cache_config = new_cache::CacheConfig {
    pages_directory: &config.directories.cache_dir.path.join("tldr-main"),
    custom_pages_directory: config
        .directories
        .custom_pages_dir
        .as_ref()
        .map(PathWithSource::path),
    platforms: todo!(),
    languages: todo!(),
};

let cache = if args.update {
    new_cache::Cache::open_or_create(cache_config)?
} else {
    new_cache::Cache::open(cache_config)?.context("Cache not found, run `tldr --update`.")?
};

if args.clear_cache {
    cache.clear()?;

    clear_deprecated_cache(&config)?;

    return Ok(());
}

if should_update_cache(todo!(), &args, &config) {
    cache.update(todo!())?;

    clear_deprecated_cache(&config)?;
}

if args.list {
    for page in cache.list_pages() {
        println!("{}", page);
    }
    return Ok(());
}

ensure!(
    !args.command.is_empty(),
    "CLI parsing should have not allowed this!"
);

// Note: According to the TLDR client spec, page names must be transparently
// lowercased before lookup:
// https://github.com/tldr-pages/tldr/blob/main/CLIENT-SPECIFICATION.md#page-names
let command = args.command.join("-").to_lowercase();

let Some(lookup_result) = cache.find_page(&command) else {
    if !args.quiet {
        print_warning(
            enable_styles,
            &format!(
                "Page `{}` not found in cache.\n\
                     Try updating with `tldr --update`, or submit a pull request to:\n\
                     https://github.com/tldr-pages/tldr",
                &command
            ),
        );
    }

    bail!("Page not found")
};

if let Err(ref e) = print_page(&lookup_result, args.raw, enable_styles, args.pager, &config) {
    print_error(enable_styles, e);

    bail!(e)
}

Ok(())

One thing that I paid attention to is that the cache should not care about outputting message to the user - no enable_styles anywhere, only unexpected errors with Err or expected errors with Option. I think I had already written something like that at some point, but I think that we should in general refactor our user output with some kind of OutputManager or so. We could then manage what kind of messages are printed in which way in a separate module. Maybe we can even be fancy and do something similar to anyhow::Context so that we have something like

enum DomainError {
    CacheNotFound,
    PageNotFound(&str),
    // ...
}

Cache::open(cache_config)?.or_domain_error(&output_manager, DomainError::CacheNotFound)?

... but that is a topic for another time.

Another thing: I returned early from main here if for example the cache is cleared. I think it is reasonable, I don't see why tldr --clear-cache --update or tldr --clear-cache tar or tldr --list tar should be allowed. I think this makes more sense and it also makes the code easier.

Note that I already put it clear_deprecated_cache in reasonable places - the implementation is very simple:

fn clear_deprecated_cache(config: &Config) -> anyhow::Result<()> {
    let deprecated_config = new_cache::CacheConfig {
        pages_directory: &config.directories.cache_dir.path().join("tldr-master"),
        custom_pages_directory: None,
        platforms: &[],
        languages: &[],
    };

    let Some(cache) = new_cache::Cache::open(deprecated_config)? else {
        return Ok(());
    };

    cache.clear()?;

    Ok(())
}

So, what do you think? @dbrgn

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a language option when updating the cache
3 participants