shap.prep.stack.data uses internal ID as grouping variable for hierarchical clustering in addition to shap values #31

kransom14 · 2022-06-02T20:43:29Z

To understand how shap.prep.stack.data() forms its groups, I tried to reproduce the grouping myself with stats::hclust() and cutree(). Applied to the SHAP values from a model, hclust() and cutree() put a different number of samples in each group than shap.prep.stack.data() does. The per-group counts match only when a sequential row ID column is added to the SHAP values and included as a clustering variable. Please see the reproducible example below.

library(SHAPforxgboost)
library(xgboost)
library(DALEX)
library(caret)

# use apartments data from DALEX
data("apartments")
head(apartments)
dummy <- dummyVars(" ~ .", data=apartments)
final_df <- data.frame(predict(dummy, newdata=apartments))
head(final_df)
X1 = as.matrix(final_df[,-1])
mod1 = xgboost::xgboost(
  data = X1, label = apartments$m2.price, gamma = 0, eta = 1,
  lambda = 0, nrounds = 1, verbose = FALSE)

shap_values <- shap.values(xgb_model = mod1, X_train = X1)
shap_values$mean_shap_score
shap_values_appts <- shap_values$shap_score

plot_data <- shap.prep.stack.data(shap_contrib = shap_values_appts,
                                  n_groups = 4)
summary(as.factor(plot_data$group))
#1   2   3   4 
#606  92 215  87 

# calculate clusters with hclust() as is done internally to shap.prep.stack.data
# include the scaling that shap.prep.stack.data performs
h <- hclust(dist(scale(shap_values_appts)), method = "ward.D")
groups <- cutree(h, 4)
summary(as.factor(groups))
#  1   2   3   4 
#307 336 270   8

# add a row ID column to the SHAP values data frame and recalculate
# the number of samples in each group is now reproduced (group identities are just shuffled)
shap_values_appts_id <- shap_values_appts
shap_values_appts_id$ID <- seq(1, nrow(shap_values_appts_id))

h2 <- hclust(dist(scale(shap_values_appts_id)), method = "ward.D")
groups2 <- cutree(h2, 4)
summary(as.factor(groups2))
#1   2   3   4 
#215 606  87  92
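
The effect can also be checked without xgboost or DALEX. The following is a minimal base R sketch on synthetic data; `shap_like` and its columns `f1`, `f2`, `f3` are hypothetical stand-ins for a SHAP matrix, not values from the example above. It shows that after scale() a sequential ID column carries the same weight in the distance matrix as any real feature, and it cross-tabulates the groupings with and without the ID.

```r
# Synthetic stand-in for a SHAP matrix (hypothetical data, not from the model above)
set.seed(42)
n <- 200
shap_like <- data.frame(f1 = rnorm(n), f2 = rnorm(n), f3 = rnorm(n))

# Cluster on the feature columns only
groups_plain <- cutree(hclust(dist(scale(shap_like)), method = "ward.D"), k = 4)

# Cluster again after appending a sequential row ID
shap_like_id <- shap_like
shap_like_id$ID <- seq_len(n)
groups_id <- cutree(hclust(dist(scale(shap_like_id)), method = "ward.D"), k = 4)

# After scale(), every column including ID has unit standard deviation,
# so ID contributes to the Euclidean distance like any other variable
col_sd <- apply(scale(shap_like_id), 2, sd)
print(col_sd)

# Cross-tabulate the two groupings; a pure relabeling would leave
# exactly one non-zero cell in each row
print(table(plain = groups_plain, with_id = groups_id))
```

If the cross-tabulation has more than one non-zero cell per row, the ID column has changed which samples are grouped together, not just the group labels.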
