Questions tagged [aws-glue]

AWS Glue is a fully managed ETL (extract, transform, and load) service that can categorize your data, clean it, enrich it, and move it between various data stores. AWS Glue consists of a central data repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python code, and a scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there's no infrastructure to manage.

47 votes
8 answers
37k views

AWS Glue Crawler Not Creating Table

I have a crawler I created in AWS Glue that does not create a table in the Data Catalog after it successfully completes. The crawler takes roughly 20 seconds to run and the logs show it successfully ...
Vince's user avatar
  • 611
43 votes
7 answers
69k views

How do I write messages to the output log on AWS Glue?

AWS Glue jobs log output and errors to two different CloudWatch logs, /aws-glue/jobs/error and /aws-glue/jobs/output by default. When I include print() statements in my scripts for debugging, they get ...
Jesse Clark's user avatar
  • 1,190
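
A minimal sketch of one way to emit debug messages with a timestamp and level instead of bare print() calls. It uses only the standard logging module and writes to stdout. Which of the two CloudWatch log groups a given stream ends up in is not confirmed here and is treated as an assumption. The names build_logger and glue_job are illustrative, not part of any Glue API.

```python
import logging
import sys


def build_logger(name="glue_job", stream=sys.stdout):
    """Return a logger that writes INFO and above to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Do not hand records to the root logger, to avoid duplicate lines.
    logger.propagate = False
    # Attach the handler only once, even if this function is called again.
    if not logger.handlers:
        handler = logging.StreamHandler(stream)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = build_logger()
logger.info("row count after filter: %d", 42)
```

Passing the stream explicitly makes it easy to switch between stdout and stderr and see where each one is captured.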
43 votes
5 answers
18k views

How to create AWS Glue table where partitions have different columns? ('HIVE_PARTITION_SCHEMA_MISMATCH')

As per this AWS Forum Thread, does anyone know how to use AWS Glue to create an AWS Athena table whose partitions contain different schemas (in this case different subsets of columns from the table ...
rjmurt's user avatar
  • 1,195
43 votes
9 answers
33k views

Can I test AWS Glue code locally?

After reading Amazon docs, my understanding is that the only way to run/test a Glue script is to deploy it to a dev endpoint and debug remotely if necessary. At the same time, if the (Python) code ...
lfk's user avatar
  • 2,523
42 votes
4 answers
48k views

DynamicFrame vs DataFrame

What is the difference? I know that DynamicFrame was created for AWS Glue, but AWS Glue also supports DataFrame. When should DynamicFrame be used in AWS Glue?
Alex Oh's user avatar
  • 461
35 votes
5 answers
32k views

Is AWS Lambda preferred over AWS Glue Job?

In an AWS Glue job, we can write a script and execute it via the job. In AWS Lambda, too, we can write the same script and execute the same logic as in the above job. So, my query is not what's ...
john's user avatar
  • 1,045
34 votes
3 answers
23k views

What is transformation_ctx used for in aws glue?

Many methods in the API accept this parameter with a default value of "". Is it just a string marker, and if so, what is its purpose?
Cherry's user avatar
  • 32.4k
31 votes
6 answers
36k views

AWS Glue to Redshift: Is it possible to replace, update or delete data?

Here are some bullet points in terms of how I have things set up: I have CSV files uploaded to S3 and a Glue crawler set up to create the table and schema. I have a Glue job set up that writes the data ...
krchun's user avatar
  • 1,014
27 votes
6 answers
21k views

Could not find S3 endpoint or NAT gateway for subnetId

I am unable to connect AWS Glue with RDS. VPC S3 endpoint validation failed for SubnetId: subnet-7e8a2. VPC: vpc-4d2d25. Reason: Could not find S3 endpoint or NAT gateway for subnetId: subnet-7ea32 in ...
user11448446's user avatar
26 votes
3 answers
33k views

Overwrite parquet files from dynamic frame in AWS Glue

I use dynamic frames to write a parquet file in S3, but if a file already exists my program appends a new file instead of replacing it. The statement that I use is this: glueContext.write_dynamic_frame....
Mateo Rod's user avatar
  • 594
26 votes
2 answers
60k views

AWS Glue Job Input Parameters

I am relatively new to AWS and this may be a somewhat less technical question, but at present AWS Glue notes a maximum of 25 jobs permitted to be created. We are loading in a series of tables that each ...
Sauron's user avatar
  • 6,519
26 votes
3 answers
21k views

What actions does job.commit perform in aws glue?

Every job script should end with job.commit(), but what exactly does this function do? Is it just a job end marker or not? Can it be called twice during one job (if yes, in what cases)? Is it ...
Cherry's user avatar
  • 32.4k
25 votes
4 answers
43k views

At least one security group must open all ingress ports. AWS Glue connecting to RDS

I am still starting out with AWS Glue and I am trying to connect it to my publicly accessible MySQL database hosted on RDS Aurora to get its data. So I start by creating a crawler and in the data ...
Naguib Ihab's user avatar
  • 4,378
25 votes
5 answers
36k views

AWS Glue: How to handle nested JSON with varying schemas

Objective: We're hoping to use the AWS Glue Data Catalog to create a single table for JSON data residing in an S3 bucket, which we would then query and parse via Redshift Spectrum. Background: The ...
ehelander's user avatar
  • 253
23 votes
6 answers
26k views

Can we consider AWS Glue as a replacement for EMR?

Just a quick question for the masters to clarify: since AWS Glue, as an ETL tool, can provide companies with benefits such as minimal or no server maintenance, cost savings by avoiding over-provisioning ...
Yuva's user avatar
  • 2,999
22 votes
6 answers
45k views

AWS Glue executor memory limit

I found that AWS Glue sets up the executor instances with a memory limit of 5 GB (--conf spark.executor.memory=5g), and sometimes, on big datasets, it fails with java.lang.OutOfMemoryError. The same is for ...
Alexey Bakulin's user avatar
22 votes
4 answers
16k views

AWS Glue pricing against AWS EMR

I am doing a pricing comparison between AWS Glue and AWS EMR so as to choose between EMR & Glue. I have considered 6 DPUs (4 vCPUs + 16 GB Memory) with ETL Job running for 10 minutes for ...
Yuva's user avatar
  • 2,999
22 votes
1 answer
26k views

Spark dynamic frame show method yields nothing

So I am using AWS Glue auto-generated code to read a CSV file from S3 and write it to a table over a JDBC connection. Seems simple: the job runs successfully with no errors, but it writes nothing. When I ...
PyRaider's user avatar
  • 639
21 votes
8 answers
26k views

Optional job parameter in AWS Glue?

How can I implement an optional parameter for an AWS Glue Job? I have created a job that currently has a string parameter (an ISO 8601 date string) as an input that is used in the ETL job. I would ...
matsev's user avatar
  • 33.1k
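
A hedged sketch of one way to treat a job argument as optional, assuming job parameters reach the script in sys.argv as `--NAME value` pairs. It deliberately uses only the standard library's argparse rather than any Glue helper. The function name resolve_options and the parameter name run_date are hypothetical.

```python
import argparse
import sys


def resolve_options(argv, required, optional):
    """Parse `--NAME value` pairs.

    `required` is a list of names that must be present.
    `optional` maps names to the default used when the argument is absent.
    Arguments that are not listed in either are ignored.
    """
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    for name in required:
        parser.add_argument(f"--{name}", required=True)
    for name, default in optional.items():
        parser.add_argument(f"--{name}", default=default)
    # parse_known_args tolerates extra arguments passed by the runtime.
    known, _unknown = parser.parse_known_args(argv)
    return vars(known)


# In a job script this would typically be called with sys.argv[1:].
args = resolve_options(
    ["--JOB_NAME", "demo", "--unrelated", "x"],
    required=["JOB_NAME"],
    optional={"run_date": None},
)
print(args)
```

When run_date comes back as None, the script can fall back to a computed value such as the current date.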
21 votes
4 answers
38k views

Problems when writing parquet with timestamps prior to 1900 in AWS Glue 3.0

When switching from Glue 2.0 to 3.0, which means also switching from Spark 2.4 to 3.1.1, my jobs start to fail when processing timestamps prior to 1900 with this error: An error occurred while calling ...
Robert Kossendey's user avatar
21 votes
1 answer
16k views

AWS Athena concurrency limits: Number of submitted queries VS number of running queries

According to AWS Athena limitations you can submit up to 20 queries of the same type at a time, but it is a soft limit and can be increased on request. I use boto3 to interact with Athena and my ...
Ilya Kisil's user avatar
  • 2,568
20 votes
4 answers
23k views

Add a partition on glue table via API on AWS?

I have an S3 bucket which is constantly being filled with new data. I am using Athena and Glue to query that data. The thing is, if Glue doesn't know that a new partition has been created, it doesn't search ...
Gudzo's user avatar
  • 639
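
A rough sketch of registering a single partition through the Glue API, assuming a boto3 Glue client and its get_table and create_partition calls. It copies the table's storage descriptor and only swaps the location. The function name add_partition and all database, table, and bucket names are hypothetical, and error handling (for example, a partition that already exists) is left out.

```python
import copy


def add_partition(glue, database, table, values, location):
    """Register one partition, reusing the table's storage descriptor.

    `glue` is expected to behave like boto3.client("glue").
    `values` are the partition key values, in the table's key order.
    """
    table_def = glue.get_table(DatabaseName=database, Name=table)["Table"]
    # Copy so the table definition returned by the client is not mutated.
    descriptor = copy.deepcopy(table_def["StorageDescriptor"])
    descriptor["Location"] = location
    glue.create_partition(
        DatabaseName=database,
        TableName=table,
        PartitionInput={"Values": list(values), "StorageDescriptor": descriptor},
    )
    return descriptor


# Hypothetical usage (requires boto3 and AWS credentials):
# import boto3
# add_partition(
#     boto3.client("glue"),
#     "my_db",
#     "my_table",
#     ["2020", "01"],
#     "s3://my-bucket/my-table/year=2020/month=01/",
# )
```

Taking the client as a parameter keeps the function easy to exercise with a stand-in object before pointing it at a real catalog.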
20 votes
2 answers
29k views

AWS Glue issue with double quote and commas

I have this CSV file: reference,address V7T452F4H9,"12410 W 62TH ST, AA D" The following options are being used in the table definition ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2....
ln9187's user avatar
  • 740
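
As an illustration of how the quoted field in that excerpt is expected to parse, here is a small sketch using Python's csv module. It assumes the file is two lines (a header and one record) and that double quotes are the quote character. It does not configure a Glue table or SerDe; it only shows the target result. The name parse_rows is illustrative.

```python
import csv
import io

# The sample from the excerpt, assumed to be a header line plus one record.
sample = 'reference,address\nV7T452F4H9,"12410 W 62TH ST, AA D"\n'


def parse_rows(text):
    """Parse CSV text, treating double-quoted commas as part of the field."""
    reader = csv.DictReader(io.StringIO(text), delimiter=",", quotechar='"')
    return list(reader)


rows = parse_rows(sample)
print(rows[0]["address"])  # 12410 W 62TH ST, AA D
```

The comma inside the quotes stays in the address column, and the quotes themselves are dropped from the value.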
19 votes
3 answers
59k views

convert spark dataframe to aws glue dynamic frame

I tried converting my Spark DataFrames to dynamic frames to output as glueparquet files, but I'm getting the error 'DataFrame' object has no attribute 'fromDF'. My code relies heavily on Spark DataFrames. Is ...
user3476463's user avatar
  • 4,335
19 votes
3 answers
27k views

How to use extra files for AWS glue job

I have an ETL job written in Python, which consists of multiple scripts with the following directory structure: my_etl_job | |--services | | | |-- __init__.py | |-- dynamoDB_service.py | |-- ...
Anum Sheraz's user avatar
  • 2,549
17 votes
4 answers
22k views

How to list all databases and tables in AWS Glue Catalog?

I created a Development Endpoint in the AWS Glue console and now I have access to SparkContext and SQLContext in gluepyspark console. How can I access the catalog and list all databases and tables? ...
Jiří Mauritz's user avatar
16 votes
9 answers
22k views

AWS Athena Returning Zero Records from Tables Created from GLUE Crawler input csv from S3

Part One: I tried running a Glue crawler on a dummy CSV loaded in S3. It created a table, but when I try to view the table in Athena and query it, it shows zero records returned. But the demo data of ELB in ...
Kush Vyas's user avatar
  • 5,939
16 votes
13 answers
43k views

Use AWS Glue Python with NumPy and Pandas Python Packages

What is the easiest way to use packages such as NumPy and Pandas within the new ETL tool on AWS called Glue? I have a completed script within Python I would like to run in AWS Glue that utilizes NumPy ...
jumpman23's user avatar
  • 385
16 votes
5 answers
26k views

How to Convert Many CSV files to Parquet using AWS Glue

I'm using AWS S3, Glue, and Athena with the following setup: S3 --> Glue --> Athena My raw data is stored on S3 as CSV files. I'm using Glue for ETL, and I'm using Athena to query the data. Since ...
mark s.'s user avatar
  • 656
16 votes
1 answer
10k views

How do I set multiple --conf table parameters in AWS Glue?

Multiple answers on Stack Overflow for AWS Glue say to set the --conf table parameter. However, sometimes we'll need to set multiple --conf key-value pairs in one job. I've tried the following ...
Zambonilli's user avatar
  • 4,489
16 votes
4 answers
34k views

How can I use an external python library in AWS Glue?

First Stack Overflow question here. Hope I do this correctly: I need to use an external Python library in AWS Glue. "Openpyxl" is the name of the library. I follow these directions: https://docs.aws....
Marlon Holland's user avatar
16 votes
2 answers
8k views

AWS Glue vs EMR Serverless

Recently, AWS announced Amazon EMR Serverless (Preview) https://aws.amazon.com/blogs/big-data/announcing-amazon-emr-serverless-preview-run-big-data-applications-without-managing-servers/ - new very ...
alexanoid's user avatar
  • 25k
15 votes
3 answers
15k views

AWS Glue takes a long time to finish

I just run a very simple job as follows glueContext = GlueContext(SparkContext.getOrCreate()) l_table = glueContext.create_dynamic_frame.from_catalog( database="gluecatalog", ...
Shawn's user avatar
  • 5,200
15 votes
4 answers
19k views

AWS Glue cannot create database from crawler: permission denied

I am trying to use an AWS Glue crawler on an S3 bucket to populate a Glue database. I run the Create Crawler wizard, select my datasource (the S3 bucket with the avro files), have it create the IAM ...
mhamrah's user avatar
  • 9,188
15 votes
2 answers
22k views

AWS Glue output file name

I am using AWS to transform some JSON files. I have added the files to Glue from S3. The job I have set up reads the files in ok, the job runs successfully, there is a file added to the correct S3 ...
Ewan Peters's user avatar
15 votes
6 answers
10k views

AWS Glue Crawler adding tables for every partition?

I have several thousand files in an S3 bucket in this form: ├── bucket │ ├── somedata │ │   ├── year=2016 │ │   ├── year=2017 │ │   │   ├── month=11 │ │   | │   ├── sometype-2017-11-01....
chazzmoney's user avatar
15 votes
5 answers
12k views

How set name for crawled table?

The AWS crawler has a prefix property for adding new tables. So if I leave the prefix empty and start the crawler on s3://my-bucket/some-table-backup, it creates a table named some-table-backup. Is there a way to ...
Cherry's user avatar
  • 32.4k
15 votes
1 answer
3k views

Exception with Table identified via AWS Glue Crawler and stored in Data Catalog

I'm working to build the company's new data lake and am trying to find the best and most recent option to work with here. So, I found a pretty nice solution to work with EMR + S3 + Athena + Glue. ...
Thiago Baldim's user avatar
15 votes
0 answers
10k views

Issue with AWS Glue Data Catalog as Metastore for Spark SQL on EMR

I have an AWS EMR cluster (v5.11.1) with Spark (v2.2.1) and am trying to use AWS Glue Data Catalog as its metastore. As per guidelines provided in official AWS documentation (reference link below), I ...
Sridher's user avatar
  • 211
14 votes
4 answers
26k views

AWS Glue job consuming data from external REST API

I'm trying to create a workflow where an AWS Glue ETL job will pull JSON data from an external REST API instead of S3 or any other AWS-internal sources. Is that even possible? Has anyone done it? Please ...
deorst's user avatar
  • 189
14 votes
4 answers
18k views

How to overcome Spark "No Space left on the device" error in AWS Glue Job

I used an AWS Glue job with PySpark to read data from S3 parquet files totaling more than 10 TB, but the job was failing during the execution of the Spark SQL query with the error ...
Vigneshwaran's user avatar
14 votes
2 answers
15k views

AWS Athena - GENERIC_INTERNAL_ERROR: Number of partition values does not match number of filters

I'm querying a table in Athena that is giving the error: GENERIC_INTERNAL_ERROR: Number of partition values does not match number of filters I was able to query it earlier, but added another ...
Neil Galloway's user avatar
14 votes
1 answer
2k views

using AWS Glue with Apache Avro on schema changes

I am new to AWS Glue and am having difficulty fully understanding the AWS docs, but am struggling through the following use case: We have an s3 bucket with a number of Avro files. We have decided to ...
CharStar's user avatar
  • 427
14 votes
1 answer
2k views

Glue Dynamic Frame is way slower than regular Spark

In the image below we have the same Glue job run with three different configurations in terms of how we write to S3: (1) we used a dynamic frame to write to S3; (2) we used a pure Spark frame to write to S3 ...
justHelloWorld's user avatar
13 votes
3 answers
14k views

glue job for redshift connection: "Unable to find suitable security group"

I'm trying to set up an AWS Glue job and make a connection to Redshift. I'm getting an error when I set the connection type to Redshift: "Unable to find a suitable security group. Change connection ...
user3871's user avatar
  • 12.6k
13 votes
4 answers
23k views

Event based trigger of AWS Glue Crawler after a file is uploaded into a S3 Bucket?

Is it possible to trigger an AWS Glue crawler on new files that get uploaded into an S3 bucket, given that the crawler is "pointed" to that bucket? In other words: a file upload generates an event, ...
BoIde's user avatar
  • 316
13 votes
2 answers
22k views

How to solve this HIVE_PARTITION_SCHEMA_MISMATCH?

I have partitioned data in CSV files on S3: s3://bucket/dataset/p=1/*.csv (partition #1) ... s3://bucket/dataset/p=100/*.csv (partition #100) I run a classifier over s3://bucket/dataset/ and the ...
Raffael's user avatar
  • 19.8k
