From 8c205f9642a3bc0d63a7017e2c85cee6051025ab Mon Sep 17 00:00:00 2001
From: Elmo Moilanen <49366097+elmomoilanen@users.noreply.github.com>
Date: Sun, 2 Jul 2023 15:19:40 +0300
Subject: [PATCH] docs: modify language to be more clear

---
 README.md | 28 +++++++++++++---------------
 1 file changed, 13 insertions(+), 15 deletions(-)

diff --git a/README.md b/README.md
index 5391643..a5ceb9e 100644
--- a/README.md
+++ b/README.md
@@ -2,9 +2,9 @@
 
 [![main](https://github.com/elmomoilanen/Bootstrap-sampling-distribution/actions/workflows/main.yml/badge.svg)](https://github.com/elmomoilanen/Bootstrap-sampling-distribution/actions/workflows/main.yml)
 
-Library that employs the statistical resampling method bootstrap to estimate a sampling distribution of a specific statistic from the provided sample of data. Estimates of standard error and confidence interval of the statistic can subsequently be determined from the obtained distribution, confidence interval being adjusted for both bias and skewness.
+Library that employs the statistical resampling method bootstrap to estimate a sampling distribution of a specific statistic from the provided sample of data. Estimates of standard error and confidence interval of the statistic can subsequently be determined from the obtained distribution, confidence interval being adjusted for both bias and skewness. Notice that the bootstrap resampling method is primarily based on frequentist principles.
 
-Generally speaking, in statistical inference the primary interest is to quantify an effect size of a measurement and as a secondary but yet important thing would be to evaluate uncertainty of the measurement. This library provides computational tools for the latter with an assumption that the given data sample is a representative sample of the unknown population. This enables to use the sample to generate new samples by the bootstrap resampling method.
+Generally speaking, in statistical inference the primary interest is to quantify the effect size of a measurement, and a secondary yet important task is to evaluate the uncertainty of the measurement. This library provides computational tools for the latter, assuming that the given data sample is representative of the unknown population. This enables the use of the sample to generate new samples using the bootstrap resampling method.
 
 Sampling distribution obtained by the bootstrap resampling process makes it possible to compute the standard error and confidence interval for the statistic, both quantifying statistical accuracy of the measurement. Standard error is the standard deviation of obtained values of the statistic whereas confidence intervals are constructed using the bias-corrected and accelerated bootstrap approach (BCa) which makes adjustments for bias and skewness.
 
@@ -12,15 +12,15 @@ At the moment, SciPy's API *stats.bootstrap* somewhat resembles this library but
 
 ## Install ##
 
-Poetry is the recommended tool for installation and the following short guide uses it.
+Poetry is the recommended tool for installation.
 
-After cloning and navigating to the target folder, running the following command creates a virtual environment within this project directory and installs non-development dependencies inside it
+After cloning and navigating to the target folder, running the following command creates a virtual environment within this project directory and installs the default dependencies inside it
 
 ```bash
-poetry install --without dev
+poetry install
 ```
 
-In-project virtual environment setup is controlled in *poetry.toml*. As the *--without dev* option skips installation of the development dependencies, do not include it in the command above if e.g. you want to be able to run the unit tests (pytest is needed for that).
+In-project virtual environment setup is controlled by *poetry.toml*. The default dependencies are not enough to run the unit tests, for example: pytest is required for that, and it is only included in the optional `dev` dependency group, which can be installed by adding `--with dev` to the installation command above.
 
 For the plotting to work correctly it might be required to set the backend for Matplotlib. One way to do this is to set the MPLBACKEND environment variable (overrides any matplotlibrc configuration) for the current shell.
 
@@ -36,13 +36,13 @@ MPLBACKEND= poetry run python
 
 with a proper backend (e.g. macosx or qt5agg) after the equal sign. If the backend has been set correctly earlier, just drop this setting.
 
-Let's consider first a case, where we assume X to be a numerical data with shape n x p (n observations, p attributes) and 10-quantile to be the statistic of interest. Let's further assume that the number of attributes p is equal to or larger than three.
+First, let's consider a case where X is numerical data of shape n x p (n observations, p attributes) and the 0.1 quantile (10th percentile) is the statistic of interest. Let's further assume that the number of attributes, p, is equal to or greater than three.
 
 ```python
 import numpy as np
 from sampdist import SampDist
 
-# One-dimensional statistics must be defined with axis=1
+# One-dimensional statistics should be defined with axis=1
 def quantile(x): return np.quantile(x, q=0.1, axis=1)
 
 # Override default alpha and add random noise to bootstrap samples
@@ -54,7 +54,7 @@ samp.estimate(X[:, [0,2]])
 # Sampling distribution of the quantile for both columns
 samp.b_stats
 
-# Standard error (samp.se) and BCa confidence interval (samp.ci) also available
+# Standard error (samp.se) and BCa confidence interval (samp.ci) are also available
 
 # Plot the sampling distribution for the first column
 samp.plot(column=0)
@@ -62,11 +62,11 @@ samp.plot(column=0)
 
 After the necessary module imports in the code snippet above, a custom quantile function was defined which calls NumPy's own quantile routine with the axis parameter equal to one. After an object of the *SampDist* class was instantiated, its estimate method was called in order to compute the sampling distribution, standard error and BCa confidence interval. Data slice of shape n x 2 was passed to the estimate method and as the quantile statistic is one-dimensional (maps n x 1 input to a single result and n x p input to p results) it ran the estimation simultaneously for both of the two attributes (columns 0 and 2 in X).
 
-Following figure represents a possible result of the plot call. In addition to the histogram it shows the observed value (value of the statistic in original data sample) pointed to by the black arrow, standard error and BCa confidence interval pointed to by red arrows on x-axis.
+The following figure represents a possible result of the plot call. In addition to the histogram it shows the observed value (value of the statistic in original data sample) pointed to by the black arrow, standard error and BCa confidence interval pointed to by red arrows on x-axis.
 
 ![](docs/boostrap_distribution_quantile.png)
 
-For the second example, let's consider the sampling distribution estimation process for a multidimensional statistic, e.g. Pearson's linear correlation. Keeping the mentioned assumptions regarding data X, following code estimates the sampling distribution and in the final row of the snippet, renders a histogram plot similarly to the figure above. Compared to the previous example, notice the difference in estimation process of the chosen statistic. Here the multidimensional statistic, Pearson's correlation, requires two attributes (columns) of the data X as input (a data slice of shape n x 2) and produces a single output which is the value of correlation.
+For the second example, let's consider the estimation process of the sampling distribution for a multidimensional statistic, e.g. Pearson's linear correlation. Keeping the mentioned assumptions regarding data X, the following code estimates the sampling distribution and, in the final row of the snippet, renders a histogram plot similar to the figure above. Compared to the previous example, notice the difference in the estimation process of the chosen statistic. Here the multidimensional statistic, Pearson's correlation, requires two attributes (columns) of the data X as input (a data slice of shape n x 2) and produces a single output, which is the value of the correlation.
 
 ```python
 from sampdist import SampDist
@@ -84,16 +84,14 @@ samp.plot()
 
 ![](docs/bootstrap_distribution_corr.png)
 
-Notice that validity of the statistic is checked when calling the estimate method. If this check fails, a *StatisticError* exception will be raised. Furthermore, if the estimated sampling distribution is degenerate (e.g. data almost identical), a *BcaError* exception gets raised (in this case you may try to use True for the smooth_bootstrap parameter). Both exceptions inherit from class *SampDistError* which can be imported directly from the sampdist namespace.
+Notice that the validity of the statistic is checked when calling the estimate method. If this check fails, a *StatisticError* exception will be raised. Furthermore, if the estimated sampling distribution is degenerate (e.g. data almost identical), a *BcaError* exception gets raised (in this case you may try to use True for the smooth_bootstrap parameter). Both exceptions inherit from class *SampDistError* which can be imported directly from the sampdist namespace.
 
 ## Docs ##
 
-Make sure that you included the *docs* dependency group in the installation step.
-
 Render the documentation as HTML with the following command
 
 ```bash
-sphinx-build -b html docs/source/ docs/build/html
+poetry run sphinx-build -b html docs/source/ docs/build/html
 ```
 
 and open the starting page docs/build/html/index.html in a browser.
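
Note for readers of this patch: the README text touched above describes the standard error as the standard deviation of the bootstrapped values of the statistic. The following is a minimal sketch of that idea in plain NumPy, shown only for illustration. It is not the sampdist implementation and does not compute BCa intervals; the function name `bootstrap_se` and its parameters are hypothetical.

```python
import numpy as np


def bootstrap_se(x, statistic, n_resamples=2000, seed=0):
    """Plain bootstrap standard error of `statistic` for 1-D data `x`.

    Hypothetical helper for illustration; not part of the sampdist API.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    n = x.shape[0]
    # Draw indices with replacement: one row of n indices per bootstrap sample
    idx = rng.integers(0, n, size=(n_resamples, n))
    # Evaluate the statistic on every resample (axis=1 runs over observations)
    b_stats = statistic(x[idx], axis=1)
    # Standard error is the standard deviation of the bootstrapped values
    return b_stats.std(ddof=1), b_stats


# Synthetic data, used only to exercise the sketch
rng = np.random.default_rng(1)
data = rng.normal(loc=0.0, scale=1.0, size=200)
se, b_stats = bootstrap_se(data, np.mean)
```

For the sample mean, the resulting `se` should be close to the textbook estimate `data.std(ddof=1) / np.sqrt(data.size)`, which gives a quick sanity check on the resampling logic.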