gibbsbravo/DataDelta


A Python package for comparing two pandas dataframes

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage Examples
  4. Contributing
  5. Example HTML Report Output
  6. License
  7. Contact

About The Project

DataDelta is a Python package for comparing two pandas dataframes, for use in data analysis, data engineering, and tracking how tables change over time.

DataDelta runs a series of tests (each of which can also be run individually) and generates a report, as both a Python dict and an HTML file, that summarizes the key changes between two dataframes. The Python dict is intended for use in a DevOps / DataOps pipeline, where tests can check that table changes are expected.
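
As a rough illustration of the pipeline use case, the sketch below shows a pytest-style check that fails when a table changes in an unexpected way. It deliberately uses plain pandas rather than DataDelta itself, because the keys of the DataDelta report dict are not described here; `summarize_changes` and the keys it returns are hypothetical stand-ins for such a report, not part of the DataDelta API.

```python
import pandas as pd


def summarize_changes(old_df, new_df, primary_key):
    """Hypothetical stand-in for a change report: a plain dict of simple facts."""
    old_keys = set(old_df[primary_key])
    new_keys = set(new_df[primary_key])
    return {
        "columns_added": sorted(set(new_df.columns) - set(old_df.columns)),
        "columns_removed": sorted(set(old_df.columns) - set(new_df.columns)),
        "records_added": len(new_keys - old_keys),
        "records_removed": len(old_keys - new_keys),
    }


def test_table_changes_are_expected():
    # Toy data standing in for two snapshots of the same table.
    old_df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})
    new_df = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["x", "y", "z", "w"]})

    report = summarize_changes(old_df, new_df, primary_key="A")

    # This example pipeline tolerates new rows, but not dropped rows
    # or any change to the set of columns.
    assert report["records_removed"] == 0
    assert report["columns_added"] == []
    assert report["columns_removed"] == []
```

Which changes count as "expected" is a per-table decision; the assertions above are only one possible policy.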

Getting Started

Install DataDelta through pip, or clone the repository locally if you want to make changes.

Dependencies

DataDelta has very few dependencies:

  • pandas: a fast, powerful, flexible and easy to use open source data analysis and manipulation tool - the library DataDelta is built on for comparing dataframes
  • numpy: the fundamental package for scientific computing with Python - used for transformations and calculations
  • jinja2: a fast, expressive, extensible templating engine - used to generate the HTML report
  • pytest (optional): a mature full-featured Python testing tool that helps you write better programs - used for testing

Installation

  • Install using pip from PyPI:
    pip install datadelta

OR

  • Clone the repo locally:
    git clone https://github.com/gibbsbravo/DataDelta.git

Usage Examples

  • Quick-start code to generate a summary report of the changes between two dataframes:

    import pandas as pd
    import datadelta as delta
    
    old_df = pd.read_csv('MainTestData_old_df.csv') # Add your old dataframe here
    new_df = pd.read_csv('MainTestData_new_df.csv') # Add your new dataframe here
    primary_key = 'A' # Set the primary key
    column_subset = None # Specify the subset of columns of interest or leave None to compare all columns
    
    # The consolidated_report dictionary will contain the summary changes
    consolidated_report, record_changes_comparison_df = delta.create_consolidated_report(
        old_df, new_df, primary_key, column_subset)
    
    # This will create a report named datadelta_html_report.html in the current working directory containing the summary changes
    delta.export_html_report(consolidated_report, record_changes_comparison_df,
                          export_file_name='datadelta_html_report.html',
                          overwrite_existing_file=False)

  • Get dataframe summary:

    import pandas as pd
    import datadelta as delta
    
    new_df = pd.read_csv('MainTestData_new_df.csv') # Add your new dataframe here
    primary_key = 'A' # Set the primary key
    column_subset = None # Specify the subset of columns of interest or leave None to use all columns
    
    # Returns a report summarizing the key attributes and values of a dataframe
    summary_report = delta.get_df_summary(
        input_df=new_df, primary_key=primary_key, column_subset=column_subset, max_cols=15)

  • Get record count changes report:

    import pandas as pd
    import datadelta as delta
    
    old_df = pd.read_csv('MainTestData_old_df.csv') # Add your old dataframe here
    new_df = pd.read_csv('MainTestData_new_df.csv') # Add your new dataframe here
    primary_key = 'A' # Set the primary key
    
    # Returns a report summarizing any changes to the number of records (and composition) between two dataframes
    record_count_change_report = delta.check_record_count(
        old_df, new_df, primary_key)
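
To give a feel for what a record count comparison involves, here is a minimal plain-pandas sketch. It is not DataDelta's implementation and its output does not mirror the structure of the `check_record_count` report; the function name `record_count_sketch` and the returned keys are made up for illustration. It assumes the primary key values are unique within each dataframe.

```python
import pandas as pd


def record_count_sketch(old_df, new_df, primary_key):
    """Plain-pandas sketch (not DataDelta's code): count and classify records by key.

    Assumes primary_key values are unique within each dataframe.
    """
    # An outer merge on the key with indicator=True adds a "_merge" column
    # that labels each key as "left_only", "right_only", or "both".
    merged = old_df[[primary_key]].merge(
        new_df[[primary_key]], on=primary_key, how="outer", indicator=True
    )
    status = merged["_merge"]
    return {
        "old_count": len(old_df),
        "new_count": len(new_df),
        "in_both": sorted(merged.loc[status == "both", primary_key]),
        "only_in_old": sorted(merged.loc[status == "left_only", primary_key]),
        "only_in_new": sorted(merged.loc[status == "right_only", primary_key]),
    }


# Toy data: key 1 is dropped, keys 4 and 5 are new.
old_df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})
new_df = pd.DataFrame({"A": [2, 3, 4, 5], "B": [20, 30, 40, 50]})

result = record_count_sketch(old_df, new_df, primary_key="A")
print(result)
```

Looking at key membership as well as raw counts matters because two tables can have the same number of rows while containing different records.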

Other functions include:

  • check_column_names: Returns a report summarizing any changes to column names between two dataframes
  • check_datatypes: Returns a report summarizing any columns with different datatypes
  • check_chg_in_values: Returns a report summarizing any records with changes in values
  • get_records_in_both_tables: Returns the records found in both dataframes
  • get_record_changes_comparison_df: Returns a dataframe comparing any records with changes in values by column
  • export_html_report: Exports an HTML report of the differences between two dataframes
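
The kinds of checks listed above can be approximated in plain pandas, which may help when deciding which ones a pipeline needs. The sketch below is an assumption-laden illustration, not DataDelta's code: the helper names `column_name_changes`, `datatype_changes`, and `value_changes` are hypothetical, and their return values do not reflect the format of DataDelta's reports.

```python
import pandas as pd


def column_name_changes(old_df, new_df):
    """Columns present in only one of the two dataframes."""
    old_cols, new_cols = set(old_df.columns), set(new_df.columns)
    return {"added": sorted(new_cols - old_cols),
            "removed": sorted(old_cols - new_cols)}


def datatype_changes(old_df, new_df):
    """Shared columns whose dtype differs, mapped to (old dtype, new dtype)."""
    shared = [c for c in old_df.columns if c in new_df.columns]
    return {c: (str(old_df[c].dtype), str(new_df[c].dtype))
            for c in shared if old_df[c].dtype != new_df[c].dtype}


def value_changes(old_df, new_df, primary_key):
    """Cell-level differences for records and columns found in both dataframes.

    Assumes primary_key values are unique within each dataframe.
    """
    shared_cols = [c for c in old_df.columns
                   if c in new_df.columns and c != primary_key]
    old_idx = old_df.set_index(primary_key)
    new_idx = new_df.set_index(primary_key)
    shared_keys = old_idx.index.intersection(new_idx.index)
    # DataFrame.compare needs identically labeled frames, so both sides are
    # restricted to the same keys and the same columns, in the same order.
    old_common = old_idx.loc[shared_keys, shared_cols]
    new_common = new_idx.loc[shared_keys, shared_cols]
    return old_common.compare(new_common)


# Toy data: column D is new, column C changes from float to int,
# and the record with key 3 changes its value in column B.
old_df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"], "C": [1.0, 2.0, 3.0]})
new_df = pd.DataFrame({"A": [2, 3, 4], "B": ["y", "q", "w"], "C": [2, 3, 4],
                       "D": [True, False, True]})

names = column_name_changes(old_df, new_df)
dtypes = datatype_changes(old_df, new_df)
changes = value_changes(old_df, new_df, primary_key="A")
print(names, dtypes, changes, sep="\n")
```

In the `compare` output, the `self` columns hold the old values and the `other` columns hold the new ones; rows and columns with no differences are dropped by default.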

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Example HTML Report Output

[Image: screenshot of the example HTML report]

License

Distributed under the GNU General Public License v3 (GPLv3). See LICENSE.txt for more information.

Contact

Andrew Gibbs-Bravo - [email protected]

Project Link: https://github.com/gibbsbravo/DataDelta