Feature docs frameworks (#234)
* Added version for documents

* Working example documentation
canimus authored May 18, 2024
1 parent e1df445 commit 9ad3212
Showing 20 changed files with 943 additions and 5,141 deletions.
6 changes: 3 additions & 3 deletions cuallee/__init__.py
@@ -284,7 +284,7 @@ def add_rule(self, method: str, *arg):
Args:
method (str): Check name
- args (list): Parameters of the check
+ arg (list): Parameters of the check
"""
return operator.methodcaller(method, *arg)(self)
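
The dispatch in `add_rule` relies on `operator.methodcaller`, which calls a method by name. A minimal sketch of that mechanism follows; the `MiniCheck` class and its single rule are hypothetical stand-ins for illustration, not cuallee's actual implementation.

```python
import operator


class MiniCheck:
    """Hypothetical stand-in for a check object; not cuallee's real class."""

    def __init__(self):
        self.rules = []

    def is_complete(self, column, pct=1.0):
        # Record the rule and return self so that calls can be chained
        self.rules.append(("is_complete", column, pct))
        return self

    def add_rule(self, method, *arg):
        # methodcaller(name, *args)(obj) is equivalent to obj.name(*args)
        return operator.methodcaller(method, *arg)(self)


check = MiniCheck().add_rule("is_complete", "taxi_id", 0.9)
print(check.rules)
```

Looking the method up by name lets rules be added from configuration, such as a list of rule names and parameters, without hard-coding each call.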

@@ -534,14 +534,14 @@ def is_legit(self, column: str, pct: float = 1.0):
Validation for string columns giving wrong signal about completeness due to empty strings.
Useful for reading CSV files and preventing empty strings being reported as valid records.
- This is an `alias` implementation of the `has_pattern` rule using `^\S+$` as the pattern
+ This is an `alias` implementation of the `has_pattern` rule using `not black space` as the pattern
Which validates the presence of non-empty characters between the beginning and end of a string.
Args:
column (str): Column name in dataframe
pct (float): The threshold percentage required to pass
"""
- Rule("has_pattern", column, "^\S+$", CheckDataType.STRING, pct) >> self._rule
+ Rule("has_pattern", column, r"^\S+$", CheckDataType.STRING, pct) >> self._rule
return self

def has_min(self, column: str, value: float):
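
To illustrate what the `^\S+$` pattern in `is_legit` accepts, here is a small sketch using Python's `re` module. It only approximates how a backend would evaluate the rule; the sample values are made up for illustration. The raw-string prefix added in this hunk matters because `\S` in a normal string literal is an invalid escape sequence, which recent Python versions warn about.

```python
import re

# Same pattern as in the diff, written as a raw string
PATTERN = re.compile(r"^\S+$")

# Hypothetical sample values: valid tokens, an empty string, and whitespace cases
samples = ["abc", "", " ", "a b", "42"]
results = {value: bool(PATTERN.match(value)) for value in samples}

for value, ok in results.items():
    print(repr(value), "valid" if ok else "invalid")
```

Note that, under this pattern, a value with internal whitespace such as `"a b"` is rejected along with empty and blank strings.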
154 changes: 0 additions & 154 deletions docs/advanced.md

This file was deleted.

120 changes: 120 additions & 0 deletions docs/bigquery/index.md
@@ -0,0 +1,120 @@
# :simple-googlebigquery: BigQuery
To follow these examples, make sure your installation includes the `bigquery` extra.

!!! success "Install"
``` sh
pip install cuallee
pip install cuallee[bigquery]
```

## Pre-Requisites

You will need an active Google Cloud account with the BigQuery API enabled to follow these examples.<br/>
Once BigQuery is enabled for your account, export a `service account` credential file in `json` format.

`cuallee` reads the environment variable `GOOGLE_APPLICATION_CREDENTIALS`, which should contain the path of the file that holds your `service account credentials`.

!!! warning "Cost Associated"
Be aware that running `cuallee` checks in `bigquery` incurs cloud costs.


The process inside `cuallee` to handle the credentials is as follows:


!!! tip "Credentials Handling"
``` py
import os
from google.cloud import bigquery

credentials = os.getenv('GOOGLE_APPLICATION_CREDENTIALS')
client = bigquery.Client(project="GOOGLE_CLOUD_PROJECT_IDENTIFIER", credentials=credentials)

```
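
Before running any checks, it can help to confirm that the environment variable points to a readable key file. The sketch below uses only the standard library; the helper name `describe_credentials` is hypothetical, and the fields it reads (`type`, `project_id`, `client_email`) are assumed to be present, as they normally are in a service account key file.

```python
import json
import os


def describe_credentials(env_var="GOOGLE_APPLICATION_CREDENTIALS"):
    """Hypothetical helper: summarize the key file referenced by env_var."""
    path = os.getenv(env_var)
    if not path:
        raise RuntimeError(f"{env_var} is not set")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"{env_var} points to a missing file: {path}")
    with open(path, encoding="utf-8") as handle:
        info = json.load(handle)
    # Return only non-secret fields; the private key is never exposed
    return {key: info.get(key) for key in ("type", "project_id", "client_email")}


if __name__ == "__main__":
    try:
        print(describe_credentials())
    except (RuntimeError, FileNotFoundError, ValueError) as error:
        print(f"Credentials not ready: {error}")
```

A `type` other than `service_account` suggests the file is not the kind of credential this page describes.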


### is_complete

It validates the _completeness_ attribute of a data set by confirming that a column does not contain `null` values.


???+ example "is_complete"

=== ":material-checkbox-marked-circle:{ .ok } PASS"

In this example, we validate that the column `taxi_id` does not have any missing values.

``` py
from google.cloud import bigquery
from cuallee import Check
# Public dataset in BigQuery
df = bigquery.dataset.Table("bigquery-public-data.chicago_taxi_trips.taxi_trips")
check = Check()
check.is_complete("taxi_id")

# Validate
check.validate(df)
```

:material-export: __output:__

``` markdown
    timestamp            check          level    column   rule         value  rows       violations  pass_rate  pass_threshold  status
id
1   2024-05-18 21:24:15  cuallee.check  WARNING  taxi_id  is_complete  N/A    102589284  0           1.0        1.0             PASS
```

=== ":material-close-circle:{ .ko } FAIL"

In this example, we check the column `trip_end_timestamp`, which contains `null` values in the public dataset, and that produces a `FAIL` result.

``` py
from google.cloud import bigquery
from cuallee import Check
# Public dataset in BigQuery
df = bigquery.dataset.Table("bigquery-public-data.chicago_taxi_trips.taxi_trips")
check = Check()
check.is_complete("trip_end_timestamp")

# Validate
check.validate(df)
```

:material-export: __output:__

``` markdown
    timestamp            check          level    column              rule         value  rows       violations  pass_rate  pass_threshold  status
id
1   2024-05-18 21:24:15  cuallee.check  WARNING  trip_end_timestamp  is_complete  N/A    102589284  1589        0.999985   1.0             FAIL
```

=== ":material-alert-circle:{ .kk } THRESHOLD"

In this example, we reuse the column with `null` values from the previous example, but set our tolerance to `0.9` via the `pct` parameter of the `is_complete` rule. The check now produces a `PASS` result, despite the `1589` `null` values present.

``` py
from google.cloud import bigquery
from cuallee import Check
# Public dataset in BigQuery
df = bigquery.dataset.Table("bigquery-public-data.chicago_taxi_trips.taxi_trips")
check = Check()
check.is_complete("trip_end_timestamp", pct=0.9)

# Validate
check.validate(df)
```

:material-export: __output:__

``` markdown
    timestamp            check          level    column              rule         value  rows       violations  pass_rate  pass_threshold  status
id
1   2024-05-18 21:24:15  cuallee.check  WARNING  trip_end_timestamp  is_complete  N/A    102589284  1589        0.999985   0.9             PASS
```
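
The relationship between `rows`, `violations`, `pass_rate`, `pass_threshold` and `status` in the outputs above can be sketched as follows. The helper `evaluate` is hypothetical, and the formula and comparison are assumptions inferred from the reported numbers rather than taken from cuallee's source.

```python
def evaluate(rows, violations, pct=1.0):
    """Hypothetical helper mirroring the pass_rate and status columns."""
    # Assumed formula: share of rows that do not violate the rule
    pass_rate = 1 - violations / rows
    # Assumed comparison: the check passes when the rate meets the threshold
    status = "PASS" if pass_rate >= pct else "FAIL"
    return round(pass_rate, 6), status


rows, violations = 102589284, 1589
strict = evaluate(rows, violations)             # default threshold of 1.0
tolerant = evaluate(rows, violations, pct=0.9)  # tolerance of 0.9
print(strict, tolerant)
```

With these inputs the sketch matches the tables above: the same pass rate fails under the default threshold and passes once `pct` is lowered to `0.9`.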

