Usage of Glue Data Catalog with sagemaker_pyspark #109

Open
mattiamatrix opened this issue Mar 11, 2020 · 16 comments
@mattiamatrix

System Information

  • Spark or PySpark: PySpark
  • SDK Version: v1.2.8
  • Spark Version: v2.3.2
  • Algorithm (e.g. KMeans): n/a

Describe the problem

I'm following the instructions proposed HERE to connect a local Spark session running in a SageMaker notebook to the Glue Data Catalog of my account.

I know this is doable via EMR, but I'd like to do the same using a SageMaker notebook (or any other kind of separate Spark installation).

Minimal repro / logs

Below is the code that currently runs in the notebook, but it doesn't actually work.

import sagemaker_pyspark
from pyspark.sql import SparkSession

# Put the sagemaker_pyspark jars on the driver classpath
classpath = ":".join(sagemaker_pyspark.classpath_jars())

# Point the Hive metastore client at the Glue Data Catalog
spark = SparkSession.builder \
    .config("spark.driver.extraClassPath", classpath) \
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .config("hive.metastore.schema.verification", "false") \
    .enableHiveSupport() \
    .getOrCreate()
@nadiaya
Contributor

nadiaya commented Apr 1, 2020

Can you post the error message you got?

Also, the currently supported Spark version is 2.2.

@mattiamatrix
Author

Hi,
I don't get any specific error, but Spark uses a default local catalog instead of the Glue Data Catalog. Basically, those configurations don't have any effect.
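
A quick way to check which catalog the session is actually using (a minimal sketch, assuming the spark session from the snippet above): if only the local default database comes back, the Glue configuration has not taken effect.

# List the databases visible to this session; Glue databases should
# appear here once the catalog integration works
spark.sql("SHOW DATABASES").show()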

@laurenyu
Contributor

Sorry for the slow reply here. It looks like the code you're referencing is more about PySpark and Glue than about this sagemaker-pyspark library, so apologies if some of my questions/suggestions seem too basic.

What kind of log messages are showing you that it's not using your configuration?

I did some Googling and found https://forums.aws.amazon.com/thread.jspa?threadID=263860. When I compare your code to the last reply in that thread, I notice that your code doesn't have parentheses with builder. Perhaps you need to invoke it with builder() rather than just builder?

@krishanunandy

Hi @laurenyu,

I'm having the same issue as @mattiamatrix above, where instructing Spark to use the Glue catalog as a metastore doesn't throw any errors but also does not appear to have any effect at all, with Spark defaulting to using the local catalog.

I looked at the reference you suggested from the AWS forums, but I believe that example is in Scala (or maybe Java?), and adding the parentheses to builder yields the following error -

TypeError: 'Builder' object is not callable

Happy to provide any additional information if that's helpful.
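
For context, that TypeError is expected in PySpark: SparkSession.builder is a Builder object exposed as a class attribute, not a method, so it is used without parentheses. The Scala API, which the forum example appears to use, does define builder() as a method.

from pyspark.sql import SparkSession

# `builder` is a Builder instance attached to the SparkSession class,
# so it is accessed directly; calling builder() raises the TypeError above
spark = SparkSession.builder.appName("example").getOrCreate()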

@metrizable

Hi @mattiamatrix and @krishanunandy, thanks for the reply. I'm not exactly sure of your setup, but I noticed from the original post that you were attempting to follow the cited guide. As noted there, "this is doable via EMR": enabling "Use AWS Glue Data Catalog for table metadata" at cluster launch ensures the necessary jar is available on the cluster instances and on the classpath.

However, when using a notebook launched from the AWS SageMaker console, the necessary jar is not part of the classpath. Launching a notebook instance with, say, the conda_py3 kernel and running code similar to the original post reveals that the Glue catalog metastore classes are not available:

import sagemaker_pyspark

# Search each sagemaker_pyspark jar for the Glue client factory class;
# a zero count means the class is absent from that jar
for jar in sagemaker_pyspark.classpath_jars():
    !jar -tvf {jar} | grep AWSGlueDataCatalogHiveClientFactory | wc

      0       0       0
      0       0       0
      0       0       0
      ... (a zero count for every jar)

Can you provide more details on your setup?
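
The same check can also be run without the JDK's jar tool, using only the Python standard library (a sketch equivalent to the shell loop above):

import zipfile

import sagemaker_pyspark

# Count the class entries mentioning the Glue client factory in each jar;
# zero hits everywhere confirms the class is missing from the classpath
for jar in sagemaker_pyspark.classpath_jars():
    with zipfile.ZipFile(jar) as zf:
        hits = [name for name in zf.namelist()
                if "AWSGlueDataCatalogHiveClientFactory" in name]
    print(jar.rsplit("/", 1)[-1], len(hits))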

@krishanunandy

Hi @metrizable!

Thanks for following up! I ran the code snippet you posted on my SageMaker instance that's running the conda_python3 kernel, and I get output identical to the one you posted, so I think you may be on to something with the missing jar file.

At the top of my code I create a SparkSession using the snippet below, but if the relevant jar file is missing, I'm presuming this won't solve the issue I'm having.

import sagemaker_pyspark
from pyspark.sql import SparkSession

classpath = ":".join(sagemaker_pyspark.classpath_jars())
spark = SparkSession.builder.config("spark.driver.extraClassPath", classpath).getOrCreate()

Do you know where I can find the jar file? I'm optimistically presuming that once I have the jar, something like this -

from pyspark.conf import SparkConf

conf = SparkConf()
conf.set("spark.jars", "<path_to_jar>")

and adding .config(conf=conf) to the SparkSession builder configuration should solve the issue?
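
For reference, the pattern being described would look roughly like the sketch below; <path_to_jar> remains a placeholder for wherever the Glue client jar ends up:

from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

# <path_to_jar> is a placeholder; point it at the Glue client jar
conf = SparkConf()
conf.set("spark.jars", "<path_to_jar>")

# Merge the SparkConf into the builder alongside the Hive settings
spark = (
    SparkSession.builder
    .config(conf=conf)
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()
    .getOrCreate()
)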

@laurenyu
Contributor

Sorry for the delayed response. I talked to @metrizable, and it looks like https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore probably contains the right class.

The README has instructions for building, but there's also an open PR to correct which release to check out. After that, I ran into a few errors along the way and found this issue comment to be helpful.

I found https://github.com/tinyclues/spark-glue-data-catalog, which looks to be an unofficial build that contains AWSGlueDataCatalogHiveClientFactory:

$ for x in $(ls); do jar -tvf $x | grep AWSGlueDataCatalogHiveClientFactory; done
  1193 Thu Apr 30 13:30:30 UTC 2020 com/amazonaws/glue/catalog/metastore/AWSGlueDataCatalogHiveClientFactory.class
  1193 Thu Apr 30 13:30:26 UTC 2020 com/amazonaws/glue/catalog/metastore/AWSGlueDataCatalogHiveClientFactory.class

Does that help?
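
Once a jar containing that class is available locally, wiring it into the session from the original post might look like this sketch (the jar path here is hypothetical):

import sagemaker_pyspark
from pyspark.sql import SparkSession

# Hypothetical path to the Glue client jar built or downloaded above
glue_jar = "/home/ec2-user/SageMaker/jars/aws-glue-datacatalog-spark-client.jar"

# Put the Glue client jar on the driver classpath next to the
# sagemaker_pyspark jars, then point Hive at the Glue Data Catalog
classpath = ":".join(sagemaker_pyspark.classpath_jars() + [glue_jar])

spark = (
    SparkSession.builder
    .config("spark.driver.extraClassPath", classpath)
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()
    .getOrCreate()
)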

@krishanunandy

We ended up using an EMR backend for running Spark on SageMaker as a workaround, but I'll try your solution and report back. Appreciate the follow-up!

@mattiamatrix
Author

Hello,

Since this issue is still open: did anyone find/confirm a solution for using the Glue Data Catalog from SageMaker without EMR?

Thanks

@davdonin

I am also interested in seeing a solution for using the Glue Data Catalog from SageMaker without EMR.

@devonkinghorn

Is there any way we can bump the priority on this? It would be really nice to use the Glue Data Catalog from SageMaker notebooks.

@RajarshiBhadra

Is this available as a feature now?

@joaopcm1996

For visibility: you can now run Glue interactive sessions directly from a SageMaker Studio notebook.
https://aws.amazon.com/blogs/machine-learning/prepare-data-at-scale-in-amazon-sagemaker-studio-using-serverless-aws-glue-interactive-sessions/

@hisuraj-amazon

@joaopcm1996 Can we run Glue interactive sessions from SageMaker notebooks without using SageMaker Studio? Or, as per the original request, is there a way to read Glue catalog data from a SageMaker notebook? I see there was a missing-jar problem above. Was anyone able to get this to work?

@csotomon

Hi, can we configure a SageMaker PySparkProcessor to use the Glue Data Catalog as the metastore for Hive, or can we use Glue interactive sessions with this processor?
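
For the PySparkProcessor case, the SageMaker Python SDK accepts EMR-style configuration classifications, so a sketch of pointing Hive at the Glue catalog might look like the following (the role ARN, instance settings, and preprocess.py script are placeholders):

from sagemaker.spark.processing import PySparkProcessor

# Placeholder role and instance settings
processor = PySparkProcessor(
    base_job_name="glue-catalog-job",
    framework_version="3.1",
    role="<sagemaker-execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# "spark-hive-site" mirrors the EMR classification for hive-site.xml;
# preprocess.py stands in for the actual Spark script
processor.run(
    submit_app="preprocess.py",
    configuration=[
        {
            "Classification": "spark-hive-site",
            "Properties": {
                "hive.metastore.client.factory.class":
                    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
            },
        }
    ],
)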

@ArtemioPadilla

Did anybody manage to make a SageMaker instance work with PySpark and the Glue Data Catalog?

Send help.
