
How can we use OmniXAI for pyspark models? Anyone tried it out? #79

Open
akshat-suwalka-dream11 opened this issue Apr 5, 2023 · 10 comments

Comments

@akshat-suwalka-dream11

I have a PySpark model, a pretrained RandomForest, and I don't want to retrain it.

@yangwenz
Collaborator

yangwenz commented Apr 5, 2023

You can simply implement a prediction function using the trained PySpark model and pass it to the "model" parameter when initializing a "TabularExplainer". Some examples:

How to use TabularExplainer:
https://github.com/salesforce/OmniXAI/blob/main/tutorials/tabular_classification.ipynb
How to define a prediction function:
https://github.com/salesforce/OmniXAI/blob/main/tutorials/tabular/shap.ipynb

@akshat-suwalka-dream11
Author

akshat-suwalka-dream11 commented Apr 5, 2023

@yangwenz Do you use any parallelization technique while solving the problem? For example, if I have 1M rows, will the various workers of the cluster be used, or only the driver?

@yangwenz
Collaborator

yangwenz commented Apr 5, 2023

If you want to generate explanations for 1M examples, please also use PySpark to run the explainer, e.g. distribute the workload across multiple workers.

@akshat-suwalka-dream11
Author

explainer = ShapTabular(
    training_data=tabular_data,
    predict_function=predict_function,
    nsamples=100
)

I will use the PySpark model inside the predict function for inference, and my tabular data will also be a PySpark DataFrame.
Will explainer.explain work in a parallel/distributed fashion?
@yangwenz

@yangwenz
Collaborator

yangwenz commented Apr 5, 2023

"tabular_data" is only used for initializing the explainers, so there is no need to use the whole dataset. The lib provides a function for extracting a subset of the whole dataset: https://github.com/salesforce/OmniXAI/blob/main/omnixai/sampler/tabular.py. If your data is a PySpark DataFrame, you can convert a partition into a pandas DataFrame that fits in memory.

@akshat-suwalka-dream11
Author

akshat-suwalka-dream11 commented Apr 6, 2023

@yangwenz Can you please explain "please also use pyspark to run the explainer, e.g. distribute workloads into multiple workers" that you mentioned above?
As you mentioned, I will put the data in pandas format and pass it as tabular_data. My model in predict_function is a PySpark model, so won't it be unable to infer because the data is in pandas format, not PySpark?

@yangwenz
Collaborator

yangwenz commented Apr 6, 2023 via email

@akshat-suwalka-dream11
Author

@yangwenz I am unable to resolve the error below; can you please help with it?

from pyspark.ml.feature import VectorAssembler
import pandas as pd

def get_features(df_in, columns_input):
    assembler = VectorAssembler(inputCols=columns_input, outputCol="features")
    df_out = assembler.transform(df_in).cache()
    return df_out

def pyspark_data(pd_data):
    pd_data = pd.DataFrame(pd_data)
    py_data = spark.createDataFrame(pd_data)
    columns_input_1 = py_data.columns
    data_features_1 = get_features(df_in=py_data, columns_input=columns_input_1)
    return model_object_1.predictProbability(data_features_1.head().features)


pd_fe = features_df_1.toPandas()

predict_function=lambda z: pyspark_data(transformer.transform(z))

tabular_data = Tabular(data=pd_fe.drop(['date_trans','userid'], axis=1))
transformer = TabularTransform().fit(tabular_data)
class_names = transformer.class_names
x = transformer.transform(tabular_data)
explainer = ShapTabular(
    training_data=tabular_data,
    predict_function=predict_function,
    nsamples=100
)

Below is the error raised while running the explainer code:
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File <command>:1
----> 1 explainer = ShapTabular(
      2     training_data=tabular_data,
      3     predict_function=predict_function,
      4     nsamples=100
      5 )

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/omnixai/explainers/tabular/agnostic/shap.py:63, in ShapTabular.__init__(self, training_data, predict_function, mode, ignored_features, **kwargs)
     60 self.valid_indices = [i for i, f in enumerate(self.feature_columns) if f not in self.ignored_features]
     62 self.background_data = shap.sample(self.data, nsamples=kwargs.get("nsamples", 100))
---> 63 self.explainer = shap.KernelExplainer(self.predict_fn, self.background_data, link=self.link, **kwargs)

File /databricks/python/lib/python3.9/site-packages/shap/explainers/_kernel.py:95, in Kernel.__init__(self, model, data, link, **kwargs)
     93 if safe_isinstance(model_null, "tensorflow.python.framework.ops.EagerTensor"):
     94     model_null = model_null.numpy()
---> 95 self.fnull = np.sum((model_null.T * self.data.weights).T, 0)
     96 self.expected_value = self.linkfv(self.fnull)
     98 # see if we have a vector output

ValueError: operands could not be broadcast together with shapes (2,) (100,)
```
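A minimal NumPy reproduction of this broadcast error (no Spark needed), under the assumption that the predict function is returning the probabilities of only one row: shap's KernelExplainer multiplies the model's output on the 100 background samples by per-row weights, so a function that returns a single row's probabilities, shape (2,), instead of one probability row per sample, shape (100, 2), fails exactly as in the traceback.

```python
import numpy as np

weights = np.full(100, 1.0 / 100)            # shap's background-sample weights

bad_output = np.array([0.3, 0.7])            # probabilities for one row only
try:
    np.sum((bad_output.T * weights).T, 0)    # same line as _kernel.py:95
except ValueError as e:
    err = str(e)                             # "(2,) (100,)" broadcast failure

good_output = np.tile([0.3, 0.7], (100, 1))  # one probability row per sample
fnull = np.sum((good_output.T * weights).T, 0)
```

This suggests the predict function above should return probabilities for every row (not just `data_features_1.head()`), stacked into an (n_samples, n_classes) array.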

@akshat-suwalka-dream11
Author

@yangwenz

@yangwenz
Collaborator

Hi, please check your input data. It is probably not a problem coming from the lib.
