Usage of Glue Data Catalog with sagemaker_pyspark #109

Open
mattiamatrix opened this issue Mar 11, 2020 · 16 comments
@mattiamatrix

System Information

  • Spark or PySpark: PySpark
  • SDK Version: v1.2.8
  • Spark Version: v2.3.2
  • Algorithm (e.g. KMeans): n/a

Describe the problem

I'm following the instructions proposed HERE to connect a local Spark session running in a SageMaker notebook to the Glue Data Catalog of my account.

I know this is doable via EMR, but I'd like to do the same using a SageMaker notebook (or any other kind of separate Spark installation).

Minimal repro / logs

Below is the code that currently runs in the notebook, but it doesn't actually work.

import sagemaker_pyspark
from pyspark.sql import SparkSession

# Put the sagemaker_pyspark jars on the driver classpath
classpath = ":".join(sagemaker_pyspark.classpath_jars())

# Point the Hive metastore client at the Glue Data Catalog
spark = SparkSession.builder \
    .config("spark.driver.extraClassPath", classpath) \
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .config("hive.metastore.schema.verification", "false") \
    .enableHiveSupport() \
    .getOrCreate()
@nadiaya
Contributor

nadiaya commented Apr 1, 2020

Can you post the error message you got?

Also, the currently supported Spark version is 2.2.

@mattiamatrix
Author

Hi,
I don't get any specific error, but Spark uses a default local catalog instead of the Glue Data Catalog. Basically, those configurations don't have any effect.
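
A quick way to check which catalog the session is actually using (a minimal sketch, assuming the spark session from the snippet above): if only the local default database comes back, the Glue configuration has not taken effect.

# List the databases visible to this session; Glue databases should
# appear here once the catalog integration works
spark.sql("SHOW DATABASES").show()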

@laurenyu
Contributor

Sorry for the slow reply here. It looks like the code you're referencing is more about PySpark and Glue than about this sagemaker-pyspark library, so apologies if some of my questions/suggestions seem too basic.

What kind of log messages are showing you that it's not using your configuration?

I did some Googling and found https://forums.aws.amazon.com/thread.jspa?threadID=263860. When I compare your code to the last reply in that thread, I notice that your code doesn't have parentheses with builder. Perhaps you need to invoke it with builder() rather than just builder?

@krishanunandy

Hi @laurenyu,

I'm having the same issue as @mattiamatrix above, where instructing Spark to use the Glue catalog as a metastore doesn't throw any errors but also does not appear to have any effect at all, with Spark defaulting to using the local catalog.

I looked at the reference you suggested from the AWS forums, but I believe that example is in Scala (or maybe Java?), and adding the parentheses to builder yields the following error -

TypeError: 'Builder' object is not callable

Happy to provide any additional information if that's helpful.
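
For context, that TypeError is expected in PySpark: SparkSession.builder is a Builder object exposed as a class attribute, not a method, so it is used without parentheses. The Scala API, which the forum example appears to use, does define builder() as a method.

from pyspark.sql import SparkSession

# `builder` is a Builder instance attached to the SparkSession class,
# so it is accessed directly; calling builder() raises the TypeError above
spark = SparkSession.builder.appName("example").getOrCreate()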

@metrizable

Hi @mattiamatrix and @krishanunandy, thanks for the reply. I'm not exactly sure of your setup, but I noticed from the original post that you were attempting to follow the cited guide. As noted there, "this is doable via EMR": enabling "Use AWS Glue Data Catalog for table metadata" at cluster launch ensures the necessary jar is available on the cluster instances and on the classpath.

However, when using a notebook launched from the AWS SageMaker console, the necessary jar is not part of the classpath. Launching a notebook instance with, say, the conda_py3 kernel and running code similar to the original post reveals that the Glue catalog metastore classes are not available:

import sagemaker_pyspark

# Search each sagemaker_pyspark jar for the Glue client factory class;
# a zero count means the class is absent from that jar
for jar in sagemaker_pyspark.classpath_jars():
    !jar -tvf {jar} | grep AWSGlueDataCatalogHiveClientFactory | wc

      0       0       0
      0       0       0
      0       0       0
      ... (a zero count for every jar)

Can you provide more details on your setup?
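
The same check can also be run without the JDK's jar tool, using only the Python standard library (a sketch equivalent to the shell loop above):

import zipfile

import sagemaker_pyspark

# Count the class entries mentioning the Glue client factory in each jar;
# zero hits everywhere confirms the class is missing from the classpath
for jar in sagemaker_pyspark.classpath_jars():
    with zipfile.ZipFile(jar) as zf:
        hits = [name for name in zf.namelist()
                if "AWSGlueDataCatalogHiveClientFactory" in name]
    print(jar.rsplit("/", 1)[-1], len(hits))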

@krishanunandy

Hi @metrizable!

Thanks for following up! I ran the code snippet you posted on my SageMaker instance that's running the conda_python3 kernel, and I get output identical to the one you posted, so I think you may be on to something with the missing jar file.

At the top of my code I create a SparkSession using the snippet below, but if the relevant jar file is missing, I'm presuming this won't solve the issue I'm having.

import sagemaker_pyspark
from pyspark.sql import SparkSession

classpath = ":".join(sagemaker_pyspark.classpath_jars())
spark = SparkSession.builder.config("spark.driver.extraClassPath", classpath).getOrCreate()

Do you know where I can find the jar file? I'm optimistically presuming that once I have the jar, something like this -

from pyspark.conf import SparkConf

conf = SparkConf()
conf.set("spark.jars", "<path_to_jar>")

and adding .config(conf=conf) to the SparkSession builder configuration should solve the issue?
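
For reference, the pattern being described would look roughly like the sketch below; <path_to_jar> remains a placeholder for wherever the Glue client jar ends up:

from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

# <path_to_jar> is a placeholder; point it at the Glue client jar
conf = SparkConf()
conf.set("spark.jars", "<path_to_jar>")

# Merge the SparkConf into the builder alongside the Hive settings
spark = (
    SparkSession.builder
    .config(conf=conf)
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()
    .getOrCreate()
)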

@laurenyu
Contributor

Sorry for the delayed response. I talked to @metrizable, and it looks like https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore probably contains the right class.

The README has instructions for building, but there's also an open PR to correct which release to check out. After that, I ran into a few errors along the way and found this issue comment to be helpful.

I found https://github.com/tinyclues/spark-glue-data-catalog, which looks to be an unofficial build that contains AWSGlueDataCatalogHiveClientFactory:

$ for x in $(ls); do jar -tvf $x | grep AWSGlueDataCatalogHiveClientFactory; done
  1193 Thu Apr 30 13:30:30 UTC 2020 com/amazonaws/glue/catalog/metastore/AWSGlueDataCatalogHiveClientFactory.class
  1193 Thu Apr 30 13:30:26 UTC 2020 com/amazonaws/glue/catalog/metastore/AWSGlueDataCatalogHiveClientFactory.class

Does that help?
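
Once a jar containing that class is available locally, wiring it into the session from the original post might look like this sketch (the jar path here is hypothetical):

import sagemaker_pyspark
from pyspark.sql import SparkSession

# Hypothetical path to the Glue client jar built or downloaded above
glue_jar = "/home/ec2-user/SageMaker/jars/aws-glue-datacatalog-spark-client.jar"

# Put the Glue client jar on the driver classpath next to the
# sagemaker_pyspark jars, then point Hive at the Glue Data Catalog
classpath = ":".join(sagemaker_pyspark.classpath_jars() + [glue_jar])

spark = (
    SparkSession.builder
    .config("spark.driver.extraClassPath", classpath)
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()
    .getOrCreate()
)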

@krishanunandy

We ended up using an EMR backend for running Spark on SageMaker as a workaround, but I'll try your solution and report back. Appreciate the follow-up!

@mattiamatrix
Author

Hello,

Since this issue is still open: did anyone find/confirm a solution for using the Glue Data Catalog from SageMaker without EMR?

Thanks

@davdonin

I am also interested in seeing a solution for using the Glue Data Catalog from SageMaker without EMR.

@devonkinghorn

Is there any way we can bump the priority on this? It would be really nice to use the Glue Data Catalog from SageMaker notebooks.

@RajarshiBhadra

Is this available as a feature now?

@joaopcm1996

For visibility: you can now run Glue interactive sessions directly from a SageMaker Studio notebook.
https://aws.amazon.com/blogs/machine-learning/prepare-data-at-scale-in-amazon-sagemaker-studio-using-serverless-aws-glue-interactive-sessions/

@hisuraj-amazon

@joaopcm1996 Can we run Glue interactive sessions from SageMaker notebooks without using SageMaker Studio? Or, as per the original request, is there a way to read Glue catalog data from a SageMaker notebook? I see there was a missing-jar problem above. Was anyone able to get this to work?

@csotomon

Hi, can we configure a SageMaker PySparkProcessor to use the Glue Data Catalog as the metastore for Hive, or can we use Glue interactive sessions with this processor?
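
For the PySparkProcessor case, the SageMaker Python SDK accepts EMR-style configuration classifications, so a sketch of pointing Hive at the Glue catalog might look like the following (the role ARN, instance settings, and preprocess.py script are placeholders):

from sagemaker.spark.processing import PySparkProcessor

# Placeholder role and instance settings
processor = PySparkProcessor(
    base_job_name="glue-catalog-job",
    framework_version="3.1",
    role="<sagemaker-execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# "spark-hive-site" mirrors the EMR classification for hive-site.xml;
# preprocess.py stands in for the actual Spark script
processor.run(
    submit_app="preprocess.py",
    configuration=[
        {
            "Classification": "spark-hive-site",
            "Properties": {
                "hive.metastore.client.factory.class":
                    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
            },
        }
    ],
)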

@ArtemioPadilla

Did anybody manage to make a SageMaker instance work with PySpark and the Glue Data Catalog?

Send help.
