bostjanv76/featConstr

Explainable Feature Construction (EFC)

We present a novel method for efficient constructive induction, applicable to both classification and regression problems. The method significantly speeds up the learning of new, powerful features for predictive modelling: it reduces the search space and introduces a novel approach that uses instance explanations for feature construction (FC), thereby improving prediction performance. The developed feature construction method contributes to more successful and more comprehensible prediction models, which are becoming an important part of scientific, industrial, and societal processes.

The developed EFC [1] method consists of the following four steps:

  1. Explanation of the model's predictions for individual instances.
  2. Identification of groups of attributes that commonly appear together in explanations.
  3. Efficient creation of constructs from the identified groups.
  4. Evaluation of constructs and selection of the best ones as new features.

(Figure: overview of the EFC methodology)

Using EFC

Basic use

Place the desired dataset(s) in the demo folder (datasets/demo) and start the program. To test the method on one of the experimental dataset collections (toy, artificial, UCI, real), simply uncomment the line containing the desired collection and comment out the line with the folder containing the demonstration dataset(s). EFC automatically recognises whether the dataset is a classification or a regression problem.
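
The automatic recognition can be understood from the ARFF header: a nominal class attribute (values listed in braces) implies classification, a numeric one regression. A minimal illustrative sketch of that rule (the class `ArffTypeSniffer` and its logic are our own simplification, not the project's code):

```java
import java.util.List;

public class ArffTypeSniffer {
    /** Returns true if the last @attribute declaration before @data is nominal
        (values in braces), i.e. classification; a numeric class means regression. */
    static boolean isClassification(List<String> arffHeader) {
        String lastAttr = null;
        for (String line : arffHeader) {
            String t = line.trim().toLowerCase();
            if (t.startsWith("@attribute")) lastAttr = t;
            if (t.startsWith("@data")) break;
        }
        if (lastAttr == null) throw new IllegalArgumentException("no @attribute lines");
        return lastAttr.contains("{");   // nominal values are listed in braces
    }

    public static void main(String[] args) {
        List<String> clsHeader = List.of(
            "@relation toy",
            "@attribute a1 numeric",
            "@attribute class {yes,no}",
            "@data");
        List<String> regHeader = List.of(
            "@relation toy",
            "@attribute a1 numeric",
            "@attribute target numeric",
            "@data");
        System.out.println(isClassification(clsHeader)); // true
        System.out.println(isClassification(regHeader)); // false
    }
}
```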

    //classification datasets
    /*****demo datasets*****/
    folder = new File("datasets/demo");
    /*****toy datasets*****/       
    //folder = new File("datasets/toy");    
    /*****artificial datasets*****/ 
    //folder = new File("datasets/artificial");  
    /*****artificial datasets (only unique instances)*****/
    //folder = new File("datasets/artificial/uniqueInst");
    /*****UCI datasets*****/
    //folder = new File("datasets/uci");
    /*****real dataset - credit score*****/       
    //folder = new File("datasets/real");    
    //regression datasets    
    /*****artificial datasets*****/       
    //folder = new File("datasets/regr");

The default settings are as follows:

  • EFC is enabled when the flag variable efc is set to true and the variables othrBase and exhaustive are set to false.
  • Black-box prediction algorithm: XGBoost for classification problems, random forest (RF) for regression problems
  • XGBoost parameters:
    • number of decision trees is 100 (numOfRounds=100)
    • size of decision trees is 3 (maxDepth=3)
    • shrinkage is 0.3 (eta=0.3)
    • pseudo-regularization hyperparameter is 1 (gamma=1)
  • Explanation method: Tree SHAP for classification problems and IME for regression problems
  • Only instances from the minority class are explained (explAllClasses=false).
  • Types of features: logical operator features, decision rules features, threshold features
    • logical operators: EQU, XOR, IMPL
  • Testing classifiers: decision trees (j48), Naïve Bayes (NB), support vector machines (SVM), k-nearest neighbours (kNN), decision rules (FURIA), and random forest (RF)
  • Prediction model evaluation: 10-fold CV

Results are printed and saved in the logs/efc folder.

  • impGroups-"time-date" – the file that stores groups of attributes that co-occur in explanations
  • report-"time-date" – the file that stores ACC and learning time of all classifiers for all method settings
  • attrImpListMDL-"time-date" – the file that stores MDL [2] scores of attributes/features; attribute/feature evaluation step
  • attrImpListReliefF-"time-date" – the file that stores ReliefF [3] scores of attributes/features; attribute/feature evaluation step
  • discretizationIntervals-"time-date" – the file that stores discretization intervals of numerical attributes for calculating logical features
  • params-"time-date" – the file that stores the best parameters used in the FS setting of the method

Each feature type is activated/deactivated by its flag (logFeat, decRuleFeat, thrFeat, relatFeat, numerFeat, cartFeat).

Advanced use

Choosing explanation method

We can choose between three explanation methods (IME [4], LIME [5], SHAP [6]) for classification problems; the default is SHAP. For regression problems, the IME method is used.

    explM=ExplMeth.SHAP;   //selected explanation method; choose between {IME, LIME, SHAP}

Additional settings

When the IME explanation method is chosen, we can choose between different sampling methods (equalSampling, adaptiveSamplingSS*, adaptiveSamplingAE**, aproxErrSampling). The list of included classifiers for the prediction and visualisation model (if IME is chosen) is: rf, mp, svmLin, svmPoly, svmRBF, nb, j48, and furia.

    method=IMEver.adaptiveSamplingSS;   //sampling method
    predictionModel=rf;                 //model (based on the chosen classifier) for explanations; use of the IME method
    visualModel=rf;                     //model (based on the chosen classifier) for visualisations; use of the IME method

*The stopping criterion is the sum of samples.

**The stopping criterion is the approximation error over all attributes.
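
For intuition, the sampling idea behind IME [4] can be sketched as follows: an attribute's contribution is estimated by averaging, over random perturbations drawn from background data, the change in the model's output when the attribute keeps its value versus when it is replaced. This is a simplified equal-sampling sketch with an invented toy model, not the project's implementation:

```java
import java.util.Random;
import java.util.function.Function;

public class ImeSketch {
    /** Estimates the contribution of attribute `attr` for instance x: other
        attributes are randomly taken from x or from a background instance, and
        we average f(with attr from x) - f(with attr from background). */
    static double contribution(Function<double[], Double> f, double[] x,
                               double[][] background, int attr, int nSamples, Random rnd) {
        int d = x.length;
        double sum = 0.0;
        for (int s = 0; s < nSamples; s++) {
            double[] bg = background[rnd.nextInt(background.length)];
            double[] withAttr = new double[d], withoutAttr = new double[d];
            for (int j = 0; j < d; j++) {
                double v = rnd.nextBoolean() ? x[j] : bg[j]; // random subset of the others
                withAttr[j] = v;
                withoutAttr[j] = v;
            }
            withAttr[attr] = x[attr];       // attribute present
            withoutAttr[attr] = bg[attr];   // attribute replaced by a background value
            sum += f.apply(withAttr) - f.apply(withoutAttr);
        }
        return sum / nSamples;
    }

    public static void main(String[] args) {
        // toy linear model f(x) = 2*x0 + x1; exact contribution of x0 is 2*(x0 - E[x0])
        Function<double[], Double> f = v -> 2 * v[0] + v[1];
        double[][] bg = {{0, 0}, {1, 1}, {2, 2}, {3, 3}};   // background sample, E[x0] = 1.5
        double c0 = contribution(f, new double[]{3, 0}, bg, 0, 20000, new Random(42));
        System.out.printf("contribution of x0 ~ %.2f (exact 3.00)%n", c0);
    }
}
```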

When selecting feature types, at least one feature type must be activated (true). Different operators can be used for different feature types. The full set of implemented logical operators is {AND, OR, EQU, XOR, IMPL}, relational operators {LESSTHAN, DIFF}, and numerical operators {ADD, SUBTRACT, DIVIDE, MULTIPLY, ABSDIFF}. The depth of feature construction is controlled only for conjunctions and disjunctions; for example, depth (featDepth) 3 means that constructs of depths 2 and 3 are generated.
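
Taken truth-functionally, the less common logical operators can be read as: EQU is equivalence, XOR exclusive or, IMPL implication. A minimal sketch of these assumed semantics (the class and method names are ours, not the project's):

```java
public class LogicalOps {
    static boolean equ(boolean a, boolean b)  { return a == b; }   // equivalence
    static boolean xor(boolean a, boolean b)  { return a != b; }   // exclusive or
    static boolean impl(boolean a, boolean b) { return !a || b; }  // implication a => b

    public static void main(String[] args) {
        boolean a1 = true, a2 = false, a3 = true;
        System.out.println(equ(a1, a2));        // false
        System.out.println(xor(a1, a2));        // true
        System.out.println(impl(a1, a2));       // false
        // a depth-3 construct such as (a1 AND a2) OR a3 combines a depth-2 part with a3
        System.out.println((a1 && a2) || a3);   // true
    }
}
```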

    logFeat=true;                           //enable/disable generation of logical operators features
    decRuleFeat=true;                       //enable/disable generation of decision rules features
    thrFeat=true;                           //enable/disable generation of threshold features
    relatFeat=true;                         //enable/disable generation of relational features
    cartFeat=true;                          //enable/disable generation of Cartesian product features
    numerFeat=true;                         //enable/disable generation of numerical features
    operationLogUse={"AND","OR"};           //choose logical operators   
    operationRelUse={"LESSTHAN","DIFF"};    //choose relational operators                
    operationNumUse={"ADD","SUBTRACT"};     //choose numerical operators

Other EFC parameters

Groups of attributes that commonly appear together in explanations can be regulated with a noise threshold (noiseThr). The noise threshold determines the minimal required empirical support for candidate groups (for FC), i.e. the minimal required frequency to accept the attribute group as important.
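
Using the formula for noiseThr given in the parameter listing (noiseThr=(numInst*NOISE)/100.0), group filtering can be sketched as counting how often each attribute group occurs across instance explanations and discarding groups below the threshold. A simplified illustration (`GroupFilter` is our own name; EFC's actual bookkeeping may differ):

```java
import java.util.*;

public class GroupFilter {
    /** Counts how often each attribute group appears across instance explanations
        and keeps only groups whose frequency reaches the noise threshold. */
    static Map<Set<String>, Integer> importantGroups(List<Set<String>> groupPerInstance,
                                                     double noisePct) {
        double noiseThr = groupPerInstance.size() * noisePct / 100.0;
        Map<Set<String>, Integer> counts = new HashMap<>();
        for (Set<String> g : groupPerInstance)
            counts.merge(g, 1, Integer::sum);
        counts.values().removeIf(c -> c < noiseThr);
        return counts;
    }

    public static void main(String[] args) {
        List<Set<String>> groups = List.of(
            Set.of("a1", "a2"), Set.of("a1", "a2"), Set.of("a1", "a2"),
            Set.of("a3", "a4"));
        // NOISE=50 on 4 instances -> threshold = 4*50/100 = 2 occurrences
        Map<Set<String>, Integer> kept = importantGroups(groups, 50);
        System.out.println(kept.containsKey(Set.of("a1", "a2"))); // true
        System.out.println(kept.containsKey(Set.of("a3", "a4"))); // false
    }
}
```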

    FeatEvalMeth.ReliefF    //feature evaluation measure; choose between MDL and ReliefF
    explAllClasses=true;    //explain all (true) classes or just minority class (false)
    explAllData=true;       //explain all (true) instances from the dataset when explaining (minority) class 
    thrL=0.1;               //lower weight threshold 
    thrU=0.8;               //upper weight threshold
    step=0.1;               //step for traversing all thresholds from thrL to thrU
    NOISE=1;                //noiseThr=(numInst*NOISE)/100.0; NOISE=0 (we take all groups of attributes)
    evalFeatDuringFC=false; //enable/disable feature evaluation during FC process
    featThr=0.05;           //evaluation threshold (use of MDL); useful only when evalFeatDuringFC is enabled
    folds=10;               //evaluation of models, folds=1 means no CV and using split in ratio listed below
    splitTrain=5;           //5 ... 80%:20%, 4 ... 75%:25%, 3 ... 66%:33%; useful only when folds=1
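
The splitTrain codes above are consistent with reading splitTrain=k as "one k-th of the data is held out for testing". A sketch under that assumption (not the project's code):

```java
public class SplitRatio {
    /** Interprets splitTrain=k as "1/k of the data is the test set",
        matching the listed codes: 5 -> 80%:20%, 4 -> 75%:25%, 3 -> 66%:33%. */
    static double trainPercent(int splitTrain) {
        return 100.0 - 100.0 / splitTrain;
    }

    public static void main(String[] args) {
        System.out.println(trainPercent(5)); // 80.0
        System.out.println(trainPercent(4)); // 75.0
        System.out.println(trainPercent(3)); // ~66.7
    }
}
```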

----- Parameters for regression problems

Parameters for adjustments to regression problems.

    numOfBins=2;           //number of bins for discretisation
    pctExplReg=50;         //percent of explained instances

----- Baseline methods

To activate baseline methods, the othrBase flag variable must be set to true and the efc and exhaustive flags must be set to false.

    BaselineMeth.rndBase;   //selecting baseline method {rndBase, globalEval}
    groupSize=3;            //the size of the group
    numOfGroups=9;          //the number of groups
    nMostInfAttr=10;        //the number of most informative attributes
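
The rndBase baseline presumably draws numOfGroups distinct random attribute groups of size groupSize; a hypothetical sketch of such sampling (`RandomBaseline` is our own name, not the project's class):

```java
import java.util.*;

public class RandomBaseline {
    /** Draws numOfGroups distinct random attribute groups of size groupSize. */
    static List<Set<String>> randomGroups(List<String> attrs, int groupSize,
                                          int numOfGroups, Random rnd) {
        Set<Set<String>> groups = new LinkedHashSet<>();   // dedups repeated draws
        while (groups.size() < numOfGroups) {
            List<String> shuffled = new ArrayList<>(attrs);
            Collections.shuffle(shuffled, rnd);
            groups.add(new TreeSet<>(shuffled.subList(0, groupSize)));
        }
        return new ArrayList<>(groups);
    }

    public static void main(String[] args) {
        List<String> attrs = List.of("a1","a2","a3","a4","a5","a6","a7","a8","a9","a10");
        List<Set<String>> g = randomGroups(attrs, 3, 9, new Random(7));
        System.out.println(g.size());        // 9 groups
        System.out.println(g.get(0).size()); // each of size 3
    }
}
```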

Flag variables that enable additional options (knowledge discovery, visualisation, FC based on exhaustive search, FC based on interaction information) must be set to false. Additional flag variables are groupsByThrStat and writeAccByFoldsInFile. The first enables statistics counting the groups of attributes identified by EFC in each fold for a given threshold; the results are stored in the groupsStat-"time-date" file in the logs/efc folder. The second enables storing the ACC of each fold for each prediction algorithm; the results are stored in algorithmName-byFolds-"time-date".

    justExplain=false;
    visualisation=false;
    exhaustive=false;
    jakulin=false;

----- Exhaustive search (generate all possible combinations between attributes)

Flag variable exhaustive must be set to true and jakulin, justExplain and visualisation to false. Results are printed and saved in the logs/exhaustive folder.

    justExplain=false;
    visualisation=false;
    exhaustive=true;
    jakulin=false;

----- FC based on interaction information

The flag variables jakulin and exhaustive must be set to true and the flags justExplain and visualisation to false. This mode calculates interaction information [7] between all combinations of attributes and constructs the corresponding features. Results are printed and saved in the logs/jakulin folder.

    justExplain=false;
    visualisation=false;
    exhaustive=true;
    jakulin=true;

----- Knowledge Discovery (construct features from the whole dataset and evaluate them)

To activate Knowledge Discovery (KD), the flag variable justExplain must be set to true and visualisation to false. The new constructs of FC are evaluated by MDL scores. The results are printed and saved in the subfolder kd (logs/kd). To save the new constructs together with the original attributes, the flag variable saveConstructs must be activated (saveConstructs=true); the file "dataset name"-origPlusXLFeat-"time-date".arff is then created, where X in the file name indicates the feature level {1,2}. If the flag variable renameGenFeat is activated, the constructed features are renamed†† and saved in another file, "dataset name"-origPlusRenXLFeat-"time-date".arff; this dataset serves for the next††† construction level.

  • impGroups-"time-date" – the file that stores groups of attributes that co-occur in explanations
  • attrImpListMDL-"time-date" – the file that stores MDL scores of attributes/features; attribute/feature evaluation step
  • attrImpListReliefF-"time-date" – the file that stores ReliefF scores of attributes/features; attribute/feature evaluation step
  • discretizationIntervals-"time-date" – the file that stores discretization intervals of numerical attributes for calculating logical features

    justExplain=true;
    visualisation=false;

First-level features are generated from attributes; second-level features are generated from attributes and first-level features.

††Features are renamed in the form FSLX, where S is the serial number of the feature and X is the level of feature construction; the renamed features are explained in the names-X-level-feat-"time-date".dat file.
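
Reading FSLX as F&lt;S&gt;L&lt;X&gt; (e.g. F2L1 for the second feature of level 1), the renaming can be sketched as follows (`FeatureRenamer` is a hypothetical helper, not the project's code):

```java
import java.util.*;

public class FeatureRenamer {
    /** Renames constructed features to F<S>L<X>: S = serial number, X = level. */
    static List<String> rename(List<String> constructs, int level) {
        List<String> out = new ArrayList<>();
        for (int s = 1; s <= constructs.size(); s++)
            out.add("F" + s + "L" + level);  // old->new mapping would go to the names-X-level-feat file
        return out;
    }

    public static void main(String[] args) {
        System.out.println(rename(List.of("(a1 AND a2)", "(a3 OR a4)"), 1)); // [F1L1, F2L1]
    }
}
```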

†††For the construction of second level features, the dataset "dataset name"-origPlusRen1LFeat-"time-date".arff must be used.

----- Visualisation

To visualise explanations of instances from visFrom to visTo, the flag variable visualisation must be set to true. In addition, attribute importance is visualised. The default prediction algorithm is random forest, and the IME method is used for the explanations. For each explained instance, only the topHigh (default: 6) attributes with the highest absolute contributions are shown. All images are saved in the visualisation folder. For attribute importance [8], which is based on the instance explanations, we draw (at most) 20 of the most important attributes.
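
Selecting the topHigh attributes by absolute contribution amounts to a simple sort; a minimal sketch (`TopContributions` is our own illustrative name):

```java
import java.util.*;

public class TopContributions {
    /** Returns the topHigh attribute names with the largest absolute contributions. */
    static List<String> topHigh(Map<String, Double> contributions, int topHigh) {
        return contributions.entrySet().stream()
            .sorted((a, b) -> Double.compare(Math.abs(b.getValue()), Math.abs(a.getValue())))
            .limit(topHigh)
            .map(Map.Entry::getKey)
            .toList();
    }

    public static void main(String[] args) {
        // negative contributions count via their absolute value
        Map<String, Double> expl = Map.of("a1", 0.05, "a2", -0.40, "a3", 0.30, "a4", -0.01);
        System.out.println(topHigh(expl, 2)); // [a2, a3]
    }
}
```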

The folder consists of two subfolders (beforeFC and afterFC); visualisation can be performed before (justExplain=false) or after (justExplain=true) FC.

    justExplain=false;      //false - visualisation of original dataset, true - visualisation after FC
    visualisation=true;     //visualisation of explanations using IME method
    visFrom=1, visTo=10;    //visualise instances from visFrom to visTo
    drawLimit=20;           //draw (at most) drawLimit most important attributes (attribute importance visualisation)
    topHigh=10;             //visualise features with highest contributions (instance explanation visualisation)
    RESOLUTION=100;         //density for model visualisation
    N_SAMPLES=100;          //if we use equalSampling ... number of samples  
    pdfPng=true;            //besides eps, print also pdf and png format

Requirements

Java

  • JDK
  • NetBeans

R

  • R with the packages CORElearn and RWeka must be installed on the system; the MDL evaluation measure is computed via the attrEval function.

Python

  • Python with the libraries LIME, pandas, XGBoost, and scikit-learn.

📝 Note: the Visual C++ Redistributable must also be installed (required by xgboost4j.jar), and the library gsdll64.dll must be placed in the system32 folder (for converting EPS to PDF).

Authors

EFC was created by Boštjan Vouk, Marko Robnik-Šikonja and Matej Guid.

Footnotes

  1. Vouk, B., Guid, M., & Robnik-Šikonja, M. (2023). Feature construction using explanations of individual predictions. Engineering Applications of Artificial Intelligence, 120, 105823. https://doi.org/jtnn

  2. Kononenko, I. (1995). On biases in estimating multi-valued attributes. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (I), 1034–1040. https://bit.ly/3HkEGhT

  3. Robnik-Šikonja, M., & Kononenko, I. (2003). Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning, 53, 23–69. https://doi.org/d63s9s

  4. Štrumbelj, E., & Kononenko, I. (2010). An efficient explanation of individual classifications using game theory. Journal of Machine Learning Research, 11, 1–18. https://bit.ly/48E9SpC

  5. Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 1135–1144). https://bit.ly/3v20jTq

  6. Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems 30 (NIPS 2017) (pp. 4765–4774). Curran Associates, Inc. https://bit.ly/3zhk5Is

  7. Jakulin, A. (2005). Machine learning based on attribute interactions [Doctoral dissertation, University of Ljubljana]. ePrints.FRI. https://bit.ly/3eiJ18x

  8. Štrumbelj, E., & Kononenko, I. (2014). Explaining prediction models and individual predictions with feature contributions. Knowledge and Information Systems, 41(3), 647–665. https://doi.org/f6pnsr
