Error: Container is running beyond physical memory limits #19
Comments
Hi @clifff - there are a couple of things you can try here.
I would recommend trying the first option as that will inherently give you more memory to work with.
Thanks for the tip @dacort! Didn't realize worker type was configurable like that. I upped to
Which matches what CloudWatch is showing: It seems promising it didn't hit a memory usage of
Confirmed, timed out in about the same amount of time with the
OK, thanks for trying that @clifff - looks like building up the list of those 13M files is taking up quite a lot of resources. Give me a few days to see if I can reproduce this in my own environment and work out what options there might be. These scripts definitely still need more testing at that scale.
Sounds good - thanks for looking into this @dacort! Happy to tweak settings/code and retry whenever.
@dacort - sorry to bump, but any update on this? Totally understand if not - I may have a go at loading these up on an EC2 instance with lots of RAM and attempting to dig out what I want w/ unix tools.
Hey @clifff - Unfortunately haven't been able to take a look much deeper. How high did you bump A couple other options:
There is some more detail on debugging OOM issues here as well: https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-debug-oom-abnormalities.html#monitor-profile-debug-oom-driver

Edit: I think you can specify the file grouping as `additional_options={"groupFiles": "inPartition"}`
No worries! I actually was successful loading the logs onto an EC2 instance. Turns out the bucket inventory size was way off and it was more like 60 GB of logs... but the good news is I was able to filter it down to ~100 MB of relevant lines using Will go ahead and close this for now, but feel free to re-open if you want to track the issue further.
👍 Sounds good, thanks!
I didn't realize you were just trying to do a one-time query. For future reference, this library creates two tables - one for the "raw" unconverted data and another for the "optimized" Parquet data. This appears to have been failing during the conversion process, but you still could have queried the raw data. But
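A minimal sketch of what the file-grouping suggestion could look like in a Glue script, assuming the data is read through `create_dynamic_frame.from_catalog`. The helper names `grouped_read_options` and `read_grouped` are hypothetical and not part of this library, and whether grouping applies depends on the input format, so treat this as a starting point rather than a confirmed fix.

```python
def grouped_read_options(group_size_bytes=None):
    """Build the additional_options dict for a grouped S3 read (hypothetical helper)."""
    options = {"groupFiles": "inPartition"}
    if group_size_bytes is not None:
        # Glue takes groupSize as a string holding a byte count.
        options["groupSize"] = str(group_size_bytes)
    return options


def read_grouped(glue_context, database, table_name, group_size_bytes=None):
    """Read a catalog table with file grouping enabled.

    glue_context is assumed to be an awsglue.context.GlueContext, which is
    only available inside a Glue job, so this function is not called here.
    """
    return glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
        additional_options=grouped_read_options(group_size_bytes),
    )
```

Grouping lets each task read many small files together, which may matter for a bucket with millions of small log objects.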
Hi, I've tried with:
all with many combinations of memory settings, to no avail. I instead went to the code and skipped the repartition stage: https://github.com/awslabs/athena-glue-service-logs/blob/master/athena_glue_service_logs/converter.py#L66 The job then succeeded with 100 'standard' workers, although it took 4 hours. For me at least, it is better to have the jobs succeed than to have no log data at all. Also, for our use case, the Athena query performance will be good enough. Q: Would it make sense to make the repartitioning configurable? EDIT: added this as a separate issue instead: #21
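One way the "configurable repartitioning" idea could be sketched, assuming the conversion step holds a Spark DataFrame and a list of partition column names. The function `maybe_repartition` and its `enabled` flag are hypothetical and do not reflect how `converter.py` is actually written.

```python
def maybe_repartition(data_frame, partition_columns=None, enabled=True):
    """Repartition by the given columns, or return the frame unchanged.

    data_frame is assumed to be a pyspark.sql.DataFrame; only its
    repartition(*cols) method is used. `enabled` is a hypothetical switch
    that a job argument could control.
    """
    if not enabled or not partition_columns:
        # Skip the shuffle entirely and write with the existing partitioning.
        return data_frame
    return data_frame.repartition(*partition_columns)
```

Skipping the repartition avoids a full shuffle, at the cost of possibly writing more and smaller output files per partition.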
I'm trying to run some analysis on a collection of S3 Access Logs, and set up a Glue job using the steps in the README to do so. The set of logs is about 14 GB over 12.8 million files. Whenever I kick off the job, it runs for about 13 minutes and then fails with a `Command failed with exit code 1` message. Looking at the logs, I see this line that seems important:

This is corroborated by CloudWatch metrics, which show the driver memory usage steadily climbing and the executor staying low.

Based on the `athena_glue_service_logs` blog post here, it seems like my volume of data is well within the expected limits. I retried the job after adding the `--conf` parameter set to `spark.yarn.executor.memoryOverhead=1G`, but it failed in the same way.

Any advice for getting this to work is appreciated - otherwise I'll follow the Glue documentation suggestion of writing a script to do the conversion using DynamicFrames.
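For reference, a sketch of how the `--conf` job parameter described above could be passed when starting a run through boto3's `start_job_run`. The helper `build_job_run_request` is hypothetical, the job name is a placeholder, and the optional worker settings are included only because worker type comes up in the comments. Note that this setting targets executor overhead, while the metrics above point at the driver.

```python
def build_job_run_request(job_name, memory_overhead="1G",
                          worker_type=None, number_of_workers=None):
    """Build keyword arguments for the Glue StartJobRun call (hypothetical helper)."""
    request = {
        "JobName": job_name,
        "Arguments": {
            # Same key/value pair as the job parameter described in the issue.
            "--conf": f"spark.yarn.executor.memoryOverhead={memory_overhead}",
        },
    }
    if worker_type is not None:
        request["WorkerType"] = worker_type
    if number_of_workers is not None:
        request["NumberOfWorkers"] = number_of_workers
    return request


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   boto3.client("glue").start_job_run(**build_job_run_request("my-log-job"))
```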