inputChecker
Activity-based travel models rely on data from a variety of sources (zonal data, highway networks, transit networks, synthetic population, etc). A problem in any of these inputs can affect the accuracy of model outputs and/or can result in run time error(s) during the model run. It is very important that the analyst carefully prepare and review all inputs prior to running the model. However, even with the best of efforts, sometimes errors in input data remain undetected. In order to aid the analyst in the input checking process, an automated Input Checker Tool was developed for use with the ABM. The following sections describe the setup and application of this tool.
The Input Checker Tool (inputChecker) was implemented in Python and makes heavy use of the pandas and numpy packages. The main inputs to inputChecker are a list of ABM input tables, a list of QA/QC checks to be performed on these input tables and the actual ABM inputs in CSV format. All CSV inputs are read as pandas DataFrames (2-dimensional data tables). The input checks are specified by the user as pandas expressions which are solved by the inputChecker on the input pandas DataFrames. The inputChecker generates a LOG file summarizing the results of all of the input checks.
The inputChecker setup is described in the table below:
Directory/File | Description |
---|---|
config directory | Contains list of inputs, list of checks and a settings file |
inputs directory | All the inputs specified in the inputs list are exported or copied to this directory |
logs directory | The Log and summary files from different runs are outputted to this directory |
scripts directory | Contains the main inputChecker Python script |
RunInputChecker.bat | The batch file to run inputChecker |
The RunInputChecker.bat DOS batch file is called by the RunModel.bat DOS batch file to run the inputChecker at the beginning of each ABM run. The user can also launch the inputChecker independently by simply double-clicking the RunInputChecker.bat DOS batch file. However, the inputChecker working directory must be inside the ABM working directory to read inputs from the appropriate input sub-directories.
inputChecker executes the following steps:
First, inputChecker reads all the inputs specified in the list of inputs and copies them to the inputChecker/inputs
directory. After assembling all inputs in the inputChecker/inputs
directory, all the inputs are loaded as pandas DataFrames.
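A minimal sketch of this load step, assuming hypothetical table and file names (the actual list comes from config/inputs_list.csv, and the files sit in the inputChecker/inputs directory rather than in memory):

```python
import io
import pandas as pd

# Sketch of the load step: each staged CSV is read into a dict of
# DataFrames keyed by table name. Table names and columns here are
# hypothetical stand-ins for the actual SOABM inputs.
def load_inputs(csv_files):
    """csv_files maps a table name to a readable CSV source."""
    return {name: pd.read_csv(src) for name, src in csv_files.items()}

# Toy usage with in-memory CSVs standing in for inputChecker/inputs/*.csv
sources = {
    "households": io.StringIO("hhid,np\n1,2\n2,4\n"),
    "persons": io.StringIO("perid,occp\n1,3\n2,999\n"),
}
tables = load_inputs(sources)
print(tables["households"].shape)  # (2, 2)
```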
Next, the list of input checks is read. inputChecker loops through the list of input checks and evaluates each check. The result of each check is sent to the logging module. The user must specify the severity level of each check as Fatal, Logical or Warning.
Besides the checks specified by the user, inputChecker also performs self-diagnostics to check for missing values in inputs. The severity level for the automated missing value checks is set via the config/settings.csv
file.
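A minimal sketch of such a missing-value diagnostic (the table and column names below are hypothetical):

```python
import pandas as pd

# Sketch of the automated missing-value self-diagnostic: count NaNs per
# column and report only the columns that have any. The severity applied
# to these findings comes from config/settings.csv (not shown here).
def missing_value_report(df):
    counts = df.isnull().sum()
    return counts[counts > 0].to_dict()

maz = pd.DataFrame({"MAZ": [1, 2, 3], "EMP_TOTAL": [10, None, 5]})
print(missing_value_report(maz))  # {'EMP_TOTAL': 1}
```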
The final step is to generate the inputChecker log file. The inputChecker log includes results of all checks. The checks that failed are moved up in order of the severity-level specified for the test. A summary of inputChecker results is also generated to be read by the RunModel.bat DOS batch file to generate a reminder message for the user at the end of the SOABM run. An appropriate exit code is returned depending on the outcome of the inputChecker run. The table below describes the various outcomes and the associated exit codes:
inputChecker End State | Exit Code |
---|---|
inputChecker ran successfully with no fatal check failures | 0 |
inputChecker did not run successfully due to errors | 1 |
inputChecker ran successfully with at least one fatal check failure | 2 |
With a return code of 0, the RunModel.bat DOS batch file resumes the SOABM run, and a reminder message is generated at the end to check the inputChecker log file. If inputChecker errors out, the model run is aborted. If inputChecker completes with at least one fatal check failure, the RunModel.bat DOS batch file aborts the SOABM run and the user is directed to check the inputChecker log file.
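The exit-code logic in the table can be sketched as follows (a simplified illustration, not the actual inputChecker code):

```python
# Sketch of the exit-code convention from the table above:
# 0 = clean run, 1 = inputChecker error, 2 = at least one fatal failure.
def exit_code(ran_ok, num_fatal_failures):
    if not ran_ok:
        return 1
    return 2 if num_fatal_failures > 0 else 0

print(exit_code(True, 0))   # 0: RunModel.bat resumes the SOABM run
print(exit_code(False, 0))  # 1: model run aborted
print(exit_code(True, 2))   # 2: run aborted, check the log
```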
Configuring inputChecker involves specifying both the inputs and the checks to be performed on them. This section describes the two configuration files - config/inputs_list.csv and config/checks_list.csv.
Inputs on which QA/QC checks are to be performed are specified in the config/inputs_list.csv
file. Each row in inputs_list.csv
represents an ABM input. The attributes that the user must specify for each input are described in the table below:
Attribute | Description |
---|---|
Table | The name of the input table. The inputs are loaded into inputChecker memory as DataFrames under this name. For CSV inputs, this must match the CSV file name. |
Directory | The location of the CSV input file - SOABM inputs directory or SOABM uec directory |
Visum_Object | The name of the Visum Object whose attributes must be exported. Must be specified as 'NA' for CSV inputs |
Input_ID_Column | The name of the unique ID column. inputChecker creates an ID column with the specified name if the column is missing from the input table |
Fields | The list of attributes to be exported from the Visum network object. All the fields are read for CSV inputs |
Column_Map | A column name can be specified if some columns must be renamed for easy reference |
Input_Description | The description of the input file. |
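As an illustration, rows in inputs_list.csv might look like the following. The column order, table names and attribute values here are hypothetical and must match the actual SOABM setup:

```
Table,Directory,Visum_Object,Input_ID_Column,Fields,Column_Map,Input_Description
households,input,NA,hhid,NA,NA,Synthetic population household file
linkTable,NA,Visum.Net.Links,ID,"No,FromNodeNo,ToNodeNo",NA,Highway network links
```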
All the inputs must be in CSV format. Some ABM inputs may not be available in CSV format. Specifically, network related inputs are usually embedded in a transportation modeling software database. For the Visum-based SOABM, the Visum version file, SOABM.ver
, contains the zone system geography, all zonal attributes, the highway network, and the transit network. The export_csv
module of inputChecker loads the model version file and exports attributes of the specified Visum network objects to the inputChecker/inputs
directory in CSV format. The inputChecker assumes that the model version file exists within the input
sub-directory of the SOABM working directory. The name of the version file is specified in the inputChecker/config/settings.csv file next to the input_version_file
token. The user must specify each input either as a Visum object (e.g., Visum.Net.Links) or a csv file in the inputs
or uec
sub-directories. The CSV inputs are copied from the specified sub-directory to the inputChecker/inputs
directory. Columns are renamed as per user specification and an ID column is generated if not specified.
The user has an option to comment out inputs that should not be loaded. To comment out a line in inputs_list.csv, add a "#" in front of the table name. All inputs whose table name starts with a "#" are ignored by inputChecker.
The QA/QC checks to be performed on the ABM inputs are specified in the config/checks_list.csv
file. Each row in checks_list.csv
represents a specific operation to be performed on a specific input listed in inputs_list.csv
. The operations are evaluated in the same order as they are listed in checks_list.csv
. Each operation can be classified as a Test
or Calculation
. For Test
operations, the pandas expression is evaluated and the result is sent to the logging module of inputChecker for logging. For Calculation
operations, the pandas expression is evaluated and the result is stored as a Python object to be referenced by subsequent operations. The table below describes the various tokens that the user must specify for each Test
or Calculation
operation:
Attribute | Description |
---|---|
Test | The name of the QA/QC check. The check results are referenced using this name in the log file. For calculation operations, this becomes the name of the resulting object |
Input_Table | The name of the input table on which the check is to be performed. This name must match the name specified under the Table token in inputs_list
|
ID_Column | The unique ID column name. This must match the name specified under the Input_ID_Column token in inputs_list
|
Severity | The severity level of the test - Fatal, Logical or Warning |
Type | The type of operation - Test or Calculation
|
Expression | The pandas expression to be evaluated |
Test_Vals | A list of values on which the test needs to be repeated. List must be comma separated. Test for each value is logged separately |
Report_Statistic | Any additional statistic from the test that must be reported to the log file |
Test_Description | The description of the check that is being performed |
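For illustration, a Test row in checks_list.csv might look like the following. The column order and the check name are hypothetical; the Expression is the household size check discussed later on this page:

```
Test,Input_Table,ID_Column,Severity,Type,Expression,Test_Vals,Report_Statistic,Test_Description
hh_size_positive,households,hhid,Fatal,Test,households.np>0,,,Household size must be greater than zero for each household
```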
An important step in specifying checks is assigning a severity level to each check. inputChecker allows the user to specify three severity levels for each QA/QC check - Fatal, Logical, Warning. Careful thought must be given when assigning a severity level to each check. Some general principles to help decide the severity level of a check are described below:

- Fatal: If inputChecker fails a fatal check, it returns an exit code of 2 to the main ABM procedure, causing the ABM run to halt. Therefore, the analyst should set a severity level of Fatal only for checks that must pass in order to proceed with a model run.
- Logical: The failure of these checks indicates logical inconsistencies in the inputs. With logical errors in inputs, the ABM outputs may not be meaningful.
- Warning: The failure of these checks indicates issues in the input data that are not significant enough to cause a run-time error or affect model outputs. However, they might reveal other problems related to data processing or data quality.
At the heart of an input data check is the pandas expression that is evaluated on an input data table. Each Test
expression must evaluate to a single logical value (TRUE
or FALSE
) or a vector of logical values. Therefore, the Test
expression must be a logical test. For most applications, this involves creating logical relationships such as equalities, inequalities and ranges using standard logical operators (AND
, OR
, EQUAL
, GREATER THAN
, LESS THAN
, IN
, etc.). The length of the result vector must be equal to the length of the input on which the check was performed. The result of a Calculation
expression can be any Python data type to be used by a subsequent expression.
The success or failure of a check is decided based on the test result. In case of a single value result, the check fails if the result is FALSE
. In case of a vector result, the test is declared as failed if any value in the vector is FALSE
. Therefore, the expression must be designed to evaluate to TRUE
if there are no problems in the input data.
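The pass/fail logic described above can be sketched as follows (a simplified illustration, assuming a pandas Series for vector results):

```python
import pandas as pd

# Sketch of how a check result could be reduced to PASSED/FAILED: a vector
# (Series) result fails if any element is False; a scalar result fails if
# it is False.
def check_passed(result):
    if isinstance(result, pd.Series):
        return bool(result.all())
    return bool(result)

households = pd.DataFrame({"hhid": [1, 2, 3], "np": [2, 0, 4]})
print(check_passed(households.np > 0))    # False: household 2 fails
print(check_passed(len(households) > 0))  # True: scalar result
```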
Rules and conventions for writing inputChecker expressions are summarized below:
- Expressions must be valid Python/pandas expressions
- Expressions must be designed to evaluate to FALSE to indicate any errors in the data
- Each expression must evaluate to logical value(s)
- Each expression must be applied to a valid input table specified in inputs_list.csv or make use of intermediate tables created by preceding Calculation expressions
- Expressions must use the same table names as specified in inputs_list.csv or the Test name of the Calculation object
- Expressions must use the same field names as specified in inputs_list.csv. If a column map was specified, then the new names must be used
- Expressions can be looped over a list of Test_Vals to reduce the number of expressions
- The Report_Statistic must also be a valid Python/pandas expression and must evaluate to a single numeric value
- Expressions can be commented out by adding a "#" in front of the Test name. All checks whose test name starts with a "#" are ignored by inputChecker
Below are some example expressions for different types of checks
Check if household income field exists in the input synthetic population
For performing this check for multiple fields, write the expression as follows and specify the list of field names under Test_Vals
token (separated by comma):
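As a sketch, a column-existence check can be written as a single logical pandas expression. The field name 'hhincome' below is hypothetical, and the exact placeholder syntax for looping over Test_Vals depends on the inputChecker implementation:

```python
import pandas as pd

# 'hhincome' is a hypothetical field name; the expression evaluates to a
# single True/False value, as required of a Test expression.
households = pd.DataFrame({"hhid": [1, 2], "hhincome": [52000, 31000]})
print('hhincome' in households.columns)   # True
print('hhworkers' in households.columns)  # False
```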
Check if household size ('np') is greater than zero for each household
households.np>0
Check if each person's occupation code ('occp') matches the pre-defined occupation codes
persons.occp.apply(lambda x: True if x in [1,2,3,4,5,6,999] else False)
It is possible that all person records pass the above test but one of the occupation codes may not have a single person record. To check for such cases, the following expression can be used:
set(persons.occp)=={1,2,3,4,5,6,999}
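The two occupation-code checks above can be demonstrated on a toy persons table:

```python
import pandas as pd

# Toy persons table: every record has a valid code, but codes 2, 3, 4 and 6
# have no person records, so the coverage check fails.
persons = pd.DataFrame({"perid": [1, 2, 3], "occp": [1, 5, 999]})

record_check = persons.occp.apply(lambda x: True if x in [1, 2, 3, 4, 5, 6, 999] else False)
print(record_check.all())                             # True: all records valid
print(set(persons.occp) == {1, 2, 3, 4, 5, 6, 999})   # False: some codes unused
```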
Check if total employment across occupation categories sum to total employment for each MAZ. Since this may result in a complex expression, this can be done in two steps. First, employment across all occupation types are summed using a Calculation
expression:
maz_data[[col for col in maz_data if (col.startswith('EMP')) and not (col.endswith('TOTAL'))]].sum(axis=1)
The result of the above expression is a MAZ-level vector, maz_total_employment. Next, the total employment field can be compared against maz_total_employment:
maz_data.EMP_TOTAL==maz_total_employment
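The two-step employment check can be demonstrated on a toy MAZ table. The EMP_* column names below are hypothetical stand-ins for the actual SOABM fields:

```python
import pandas as pd

# Toy MAZ table: sum the EMP_* occupation columns (excluding the *_TOTAL
# column) in a Calculation step, then compare in a Test step.
maz_data = pd.DataFrame({
    "MAZ": [1, 2],
    "EMP_RETAIL": [10, 0],
    "EMP_OFFICE": [5, 7],
    "EMP_TOTAL": [15, 7],
})

# Calculation step: maz_total_employment
maz_total_employment = maz_data[[col for col in maz_data
                                 if col.startswith('EMP') and not col.endswith('TOTAL')]].sum(axis=1)

# Test step
print((maz_data.EMP_TOTAL == maz_total_employment).all())  # True
```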
Check if household IDs start from 1 and are sequential
(min(households.hhid)==1) & (max(households.hhid)==len(set(households.hhid)))
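A toy demonstration of the sequential-ID check, including a failing case with a gap in the IDs:

```python
import pandas as pd

# IDs must start at 1 and contain no gaps or duplicates: with min == 1,
# max equals the count of distinct IDs only for the set {1, ..., n}.
households = pd.DataFrame({"hhid": [1, 2, 3, 4]})
print((min(households.hhid) == 1) & (max(households.hhid) == len(set(households.hhid))))  # True

households_bad = pd.DataFrame({"hhid": [1, 2, 5]})  # gap: 3 and 4 missing
print((min(households_bad.hhid) == 1) & (max(households_bad.hhid) == len(set(households_bad.hhid))))  # False
```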
To ensure that ABM outputs are meaningful, it is important to perform logical checks on input data. One such check is to compare the number of workers against available jobs in each industry. While they may not match exactly, the difference must not exceed 10%. For this check, first the number of workers and jobs by industry type must be calculated. This can be achieved by a series of Calculation operations.
Next, the check can be performed for each industry type separately
It can be noted in the above example that the indexing between the two arrays is off by one. This is because the maz_occ_jobs
array is indexed on array position (starting from 0) whereas the person_occ_workers
array is indexed on occupation type code, which goes from 1 to 6. Consistent indexing should be used wherever possible to avoid coding errors.
In addition to the result of this test, an analyst might be interested in knowing the actual ratio of jobs to workers. Therefore, a Report_Statistic
can be specified for this test as maz_occ_jobs[0]/person_occ_workers[1].
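The check expression itself is not reproduced above; under the stated indexing assumptions, a hedged sketch for occupation type 1 might look like the following, where maz_occ_jobs and person_occ_workers stand for the results of the preceding Calculation operations (toy values here):

```python
# Hedged sketch: jobs and workers for occupation type 1 must agree within
# 10%. Note the off-by-one indexing described above: maz_occ_jobs is
# indexed from 0 (array position), person_occ_workers by occupation code.
maz_occ_jobs = [105.0, 50.0]              # indexed from 0
person_occ_workers = {1: 100.0, 2: 48.0}  # indexed by occupation code

check = abs(maz_occ_jobs[0] / person_occ_workers[1] - 1) <= 0.10
print(check)  # True: 105 jobs vs 100 workers is a 5% difference
```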
While most of the above checks apply to link and node level attributes, some checks might be unique to some other network objects such as transit routes. In Visum, the transit line route names must be unique. This requires performing a check on transit line route data as follows:
len(set(lineroute_data.NAME)) == len(lineroute_data.NAME)
The design of network level checks will depend on the transportation modeling software being used.
inputChecker is launched by the RunModel.bat DOS batch file. The user also has an option to run inputChecker independent of the ABM run. In order to run inputChecker by itself, run the inputChecker/RunInputChecker.bat file.
The final output from inputChecker is a log file which is outputted to the inputChecker/logs
directory. The log file is named inputCheckerLog[RUN_DATE].LOG
. The log file can be opened using any text editor. The results of all checks are summarized in this log file. The following sections describe the organization and details of the log file.
The log file summarizes results from all checks. However, the order in which they are presented depends upon the severity level and the output of the check. inputChecker organizes the check results under the following headings:
- IMMEDIATE ACTION REQUIRED: All failed FATAL checks are logged under this heading
- ACTION REQUIRED: All failed LOGICAL checks are logged under this heading
- WARNINGS: All failed WARNING checks are logged under this heading
- LOG OF ALL PASSED CHECKS: A complete LOG of all passed checks
- MISSING VALUE DIAGNOSTICS ON ALL INPUTS: All failed missing value self-diagnostics tests are logged under this section
A standard check log is generated for each check. The table below shows the elements of a check LOG:
Attribute | Description |
---|---|
Input File Name | The name of the input file on which the check was evaluated |
Input File Location | Path to the location of the input file |
Visum Object | The name of the Visum object, if applicable |
Input Description | The description of the input as specified in inputs_list.csv
|
Test Name | The name of the test as specified in checks_list.csv
|
Test Description | The description of the test |
Test Severity | The severity level of the test |
TEST RESULT | The result of the test - PASSED or FAILED |
TEST results for Test_Vals | The test result for each value in Test_Vals on which the test was repeated |
Test Statistics | The value of the expression specified under the Report_Statistic token of checks_list.csv. The first 25 values are printed in case of a vector result |
ID Column | The name of the unique ID column of the input data table |
List of failed IDs | The first 25 IDs for which the test failed. This is generated in case of a vector result |
Number of failures | The total number of failures in case of a vector result |
In addition to the log file, inputChecker also produces a text file (inputCheckerSummary.txt) containing a summary of the number of inputChecker failures by severity level. This file is read by the main ABM batch script to present the summary at the end of the model run.