
How can we use OmniXAI for pyspark models? Anyone tried it out? #79

Open
akshat-suwalka-dream11 opened this issue Apr 5, 2023 · 10 comments

Comments

@akshat-suwalka-dream11

I have a PySpark model, a pretrained RandomForest, and I don't want to retrain it.

@yangwenz
Collaborator

yangwenz commented Apr 5, 2023

You can simply implement a prediction function using the trained PySpark model and pass it to the "model" parameter when initializing a "TabularExplainer". Some examples:

How to use TabularExplainer:
https://github.com/salesforce/OmniXAI/blob/main/tutorials/tabular_classification.ipynb
How to define a prediction function:
https://github.com/salesforce/OmniXAI/blob/main/tutorials/tabular/shap.ipynb

@akshat-suwalka-dream11
Author

akshat-suwalka-dream11 commented Apr 5, 2023

@yangwenz Do you use any parallelization technique while solving the problem? For example, if I have 1M rows, will the various workers of the cluster be used, or only the driver?

@yangwenz
Collaborator

yangwenz commented Apr 5, 2023

If you want to generate explanations for 1M examples, please also use PySpark to run the explainer, e.g. distribute the workload across multiple workers.

@akshat-suwalka-dream11
Author

explainer = ShapTabular(
    training_data=tabular_data,
    predict_function=predict_function,
    nsamples=100
)

I will use the PySpark model inside the predict function for inference, and my tabular data will also be a PySpark DataFrame.
Will explainer.explain work in a parallel/distributed fashion?
@yangwenz

@yangwenz
Collaborator

yangwenz commented Apr 5, 2023

"tabular_data" is only used for initializing the explainers, so there is no need to use the whole dataset. The lib provides a function for extracting a subset of the whole dataset: https://github.com/salesforce/OmniXAI/blob/main/omnixai/sampler/tabular.py. If your data is a PySpark DataFrame, you can convert a partition into a pandas DataFrame that fits in memory.

@akshat-suwalka-dream11
Author

akshat-suwalka-dream11 commented Apr 6, 2023

@yangwenz Can you please explain "please also use pyspark to run the explainer, e.g. distribute workloads into multiple workers" that you mentioned above?
As you mentioned, I will put the data in pandas format and pass it as tabular_data. My model in predict_function is a PySpark model, so won't it be unable to infer because the data is in pandas format, not PySpark?

@yangwenz
Collaborator

yangwenz commented Apr 6, 2023 via email

@akshat-suwalka-dream11
Author

@yangwenz I am unable to resolve the error below; can you please help with it?

from pyspark.ml.feature import VectorAssembler
import pandas as pd

def get_features(df_in, columns_input):
    assembler = VectorAssembler(inputCols=columns_input, outputCol="features")
    df_out = assembler.transform(df_in).cache()
    return df_out

def pyspark_data(pd_data):
    pd_data = pd.DataFrame(pd_data)
    py_data = spark.createDataFrame(pd_data)
    columns_input_1 = py_data.columns
    data_features_1 = get_features(df_in=py_data, columns_input=columns_input_1)
    return model_object_1.predictProbability(data_features_1.head().features)


pd_fe = features_df_1.toPandas()

predict_function=lambda z: pyspark_data(transformer.transform(z))

tabular_data = Tabular(data=pd_fe.drop(['date_trans','userid'], axis=1))
transformer = TabularTransform().fit(tabular_data)
class_names = transformer.class_names
x = transformer.transform(tabular_data)
explainer = ShapTabular(
    training_data=tabular_data,
    predict_function=predict_function,
    nsamples=100
)

Below is the error raised while running the explainer code:
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File <command>:1
----> 1 explainer = ShapTabular(
      2     training_data=tabular_data,
      3     predict_function=predict_function,
      4     nsamples=100
      5 )

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/omnixai/explainers/tabular/agnostic/shap.py:63, in ShapTabular.__init__(self, training_data, predict_function, mode, ignored_features, **kwargs)
     60 self.valid_indices = [i for i, f in enumerate(self.feature_columns) if f not in self.ignored_features]
     62 self.background_data = shap.sample(self.data, nsamples=kwargs.get("nsamples", 100))
---> 63 self.explainer = shap.KernelExplainer(self.predict_fn, self.background_data, link=self.link, **kwargs)

File /databricks/python/lib/python3.9/site-packages/shap/explainers/_kernel.py:95, in Kernel.__init__(self, model, data, link, **kwargs)
     93 if safe_isinstance(model_null, "tensorflow.python.framework.ops.EagerTensor"):
     94     model_null = model_null.numpy()
---> 95 self.fnull = np.sum((model_null.T * self.data.weights).T, 0)
     96 self.expected_value = self.linkfv(self.fnull)
     98 # see if we have a vector output

ValueError: operands could not be broadcast together with shapes (2,) (100,)
```
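A minimal NumPy reproduction of this broadcast error (no Spark needed), under the assumption that the predict function is returning the probabilities of only one row: shap's KernelExplainer multiplies the model's output on the 100 background samples by per-row weights, so a function that returns a single row's probabilities, shape (2,), instead of one probability row per sample, shape (100, 2), fails exactly as in the traceback.

```python
import numpy as np

weights = np.full(100, 1.0 / 100)            # shap's background-sample weights

bad_output = np.array([0.3, 0.7])            # probabilities for one row only
try:
    np.sum((bad_output.T * weights).T, 0)    # same line as _kernel.py:95
except ValueError as e:
    err = str(e)                             # "(2,) (100,)" broadcast failure

good_output = np.tile([0.3, 0.7], (100, 1))  # one probability row per sample
fnull = np.sum((good_output.T * weights).T, 0)
```

This suggests the predict function above should return probabilities for every row (not just `data_features_1.head()`), stacked into an (n_samples, n_classes) array.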

@akshat-suwalka-dream11
Author

@yangwenz

@yangwenz
Collaborator

Hi, please check your input data. It is probably not a problem coming from the lib.
