
Use compression on input/output netcdfs #95

Open
pmav99 opened this issue Mar 25, 2022 · 5 comments

pmav99 commented Mar 25, 2022

We should try to use compression in our input/output netcdfs.

pmav99 self-assigned this Mar 25, 2022

pmav99 commented Mar 25, 2022

These are some notes from a different project, but they are relevant to the task at hand:


Creating archives

When saving data to disk you usually want to use compression.
The type and level of compression to use are something you should benchmark with your own data and application.
Nevertheless, the current rules of thumb are:

  1. Any compression is usually better than no compression.
  2. If possible, prefer newer compression algorithms (e.g. zstd or lz4) over older ones (e.g. gzip and bz2).
  3. bz2 gives the best compression ratios but is rather slow. Use it only for long-term storage and only if minimizing disk space is crucial.
  4. Unless you run benchmarks, use compression level 1 (a toy benchmark sketch follows this list).
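
As a rough illustration of points 1, 3 and 4, here is a toy benchmark sketch using only the standard library (zstd and lz4 need third-party packages, so only zlib and bz2 are shown; the payload is made up and real data will behave differently):

import bz2
import time
import zlib

# Highly compressible, made-up payload.
payload = b"some fairly repetitive payload " * 1_000_000

for name, compress in (("zlib", zlib.compress), ("bz2", bz2.compress)):
    for level in (1, 9):
        t0 = time.perf_counter()
        compressed = compress(payload, level)
        elapsed = time.perf_counter() - t0
        print(f"{name} level {level}: ratio {len(compressed) / len(payload):.3f} in {elapsed:.2f}s")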

When creating new archives (netcdf/zarr/whatever) xarray does not compress data by default.
In order to ask it to use compression you must pass the encoding argument to the relevant functions (e.g. Dataset.to_netcdf(), Dataset.to_zarr() etc.).

encoding is expected to be a dictionary. The exact keys depend on the type of the archive. E.g. for netcdf they would be:

encoding = dict(
    SID={"zlib": True, "complevel": 1},
    time={"zlib": True, "complevel": 1},
    lat={"zlib": True, "complevel": 1},
    lon={"zlib": True, "complevel": 1},
)

ds.to_netcdf(path="/path/to/foo.nc", encoding=encoding)
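
When a dataset has many variables, the same encoding can be generated with a dict comprehension instead of listing each variable by hand (a sketch; ds stands for whatever Dataset is being written):

encoding = {name: {"zlib": True, "complevel": 1} for name in ds.data_vars}
ds.to_netcdf(path="/path/to/foo.nc", encoding=encoding)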


brey commented Oct 3, 2022

@pmav99 We do have the option to use compression in the to_thalassa function. The question is: should we also move to zarr or parquet for the Thalassa input?


pmav99 commented Oct 3, 2022

I would keep on using netcdfs for now.

AFAIK it should be possible to use zarr to store meshes, too, because zarr is pretty much a container format like XML and you can store anything you want inside it, but you would have to come up with a suitable mesh representation on your own.
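
For example, a minimal sketch of one possible mesh representation written to zarr with xarray (the variable names node_x/node_y/face_nodes and the random data are made up for illustration, not an existing convention):

import numpy as np
import xarray as xr

n_nodes, n_faces = 1000, 1800
mesh = xr.Dataset(
    {
        "node_x": ("node", np.random.rand(n_nodes)),
        "node_y": ("node", np.random.rand(n_nodes)),
        # triangle connectivity: for each face, the indices of its 3 nodes
        "face_nodes": (("face", "vertex"), np.random.randint(0, n_nodes, (n_faces, 3))),
    }
)
mesh.to_zarr("mesh.zarr", mode="w")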

A couple of relevant links:


pmav99 commented May 15, 2023

I tested this patch:

diff --git a/pyposeidon/schism.py b/pyposeidon/schism.py
index 0a491ae..62368cc 100644
--- a/pyposeidon/schism.py
+++ b/pyposeidon/schism.py
@@ -382,7 +382,9 @@ class Schism:
 
         filename = kwargs.get("filename", "sflux/sflux_air_{}.0001.nc".format(m_index))
 
-        sout.to_netcdf(path + filename)
+        encoding = {name: {"zlib": True, "complevel": 1} for name in sout.data_vars}
+        sout.to_netcdf(path + filename, encoding=encoding)
+        logger.info("Finished writing meteo files ..\n")
 
     # ============================================================================================
     # DEM

with this script:

import logging
 
import pyposeidon.meteo as pmeteo
  
logging.basicConfig(
    level=10,
    style="{",
    datefmt="%I:%M:%S",
    format="{asctime:s}; {name:<25s} {funcName:<15s} {lineno:4d}; {message:s}",
)
 
meteo = pmeteo.Meteo(
    meteo_source="20220725.00.tropical_cyclone.grib",
)
 
meteo.to_output(solver_name='schism', rpath='./test_c1/')

The initial size of the GRIB is 5.4GB. The uncompressed netcdf is 25GB and the compressed netcdf (with complevel=1) is 4.1GB. Compression increased the runtime on my laptop by ~150 seconds.

If schism can read files created with engine=h5netcdf, then we could test different compression algorithms, which might result in better runtimes.
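
A rough sketch of such a benchmark, only varying the zlib level with the default engine (the path is the uncompressed file from above; trying other algorithms through engine="h5netcdf" would additionally need the relevant HDF5 filter plugins installed):

import time

import xarray as xr

ds = xr.open_dataset("test/sflux/sflux_air_1.0001.nc")

for complevel in (1, 4):
    encoding = {name: {"zlib": True, "complevel": complevel} for name in ds.data_vars}
    t0 = time.perf_counter()
    ds.to_netcdf(f"bench_zlib_{complevel}.nc", encoding=encoding)
    print(f"zlib complevel={complevel}: {time.perf_counter() - t0:.1f}s")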


pmav99 commented May 16, 2023

I tried compressing the uncompressed netcdf file with nccopy to see if it was faster, but performance is practically identical, which I guess is to be expected.

$ time nccopy -u -s -d 1 test/sflux/sflux_air_1.0001.nc test/sflux/nccopy_1.nc
147.87s user 6.49s system 99% cpu 2:34.43 total
