Hi, by default MMS prints memory utilization to the log, which is great. The problem I have is that after each request to MMS, memory utilization increases a little; after several requests it reaches 100% and the worker dies.
I don't think this is the expected behavior, right?
I tried calling gc.collect() in the _handle function, but it doesn't help (there is no GPU available on this machine).
I wonder if anyone can help me out here.
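For reference, this is roughly where the gc.collect() call goes. The handler below is only a minimal sketch of the usual MMS custom-service shape; the model and the per-request logic are placeholders, not my actual code:

```python
import gc

class ExampleHandler:
    # Minimal stand-in for a custom MMS handler; everything here is illustrative.
    def __init__(self):
        self.model = None
        self.initialized = False

    def initialize(self, context):
        # One-time, heavy allocations belong here: they should show up as a
        # single jump in MemoryUtilization when the worker starts, not as
        # growth on every request.
        self.model = object()  # stand-in for real model loading
        self.initialized = True

    def _handle(self, data):
        # Per-request work; anything referenced only in this scope should be
        # freed automatically when the function returns.
        result = [len(str(item)) for item in data]
        # gc.collect() only reclaims objects that are already unreachable.
        # It cannot free data still referenced from globals, caches, or lists
        # that keep growing across requests, which is the usual cause of
        # steady per-request memory growth.
        gc.collect()
        return result

_service = ExampleHandler()

def handle(data, context):
    # MMS calls a module-level handle(data, context) and keeps the worker
    # process (and therefore this module) alive between requests.
    if not _service.initialized:
        _service.initialize(context)
    if data is None:
        return None
    return _service._handle(data)
```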
Here is an example.

Right after the server starts, the log shows:
2021-10-31 18:22:25,881 [INFO ] pool-2-thread-1 MMS_METRICS - MemoryUtilization.Percent:5.1|#Level:Host|#hostname:cebbb237ccfc,timestamp:1635704545

After the first request:
mms_1 | 2021-10-31 18:24:25,742 [INFO ] pool-2-thread-1 MMS_METRICS - MemoryUtilization.Percent:26.2|#Level:Host|#hostname:cebbb237ccfc,timestamp:1635704665

After the second request:
mms_1 | 2021-10-31 18:26:25,601 [INFO ] pool-2-thread-1 MMS_METRICS - MemoryUtilization.Percent:39.7|#Level:Host|#hostname:cebbb237ccfc,timestamp:1635704785

After the third request:
mms_1 | 2021-10-31 18:30:25,323 [INFO ] pool-2-thread-1 MMS_METRICS - MemoryUtilization.Percent:58.5|#Level:Host|#hostname:cebbb237ccfc,timestamp:1635705025

After the fourth request:
mms_1 | 2021-10-31 18:32:25,187 [INFO ] pool-2-thread-1 MMS_METRICS - MemoryUtilization.Percent:81.6|#Level:Host|#hostname:cebbb237ccfc,timestamp:1635705145
After the fifth request, the OOM appears and the worker dies:
mms_1 | 2021-10-31 18:35:41,402 [INFO ] epollEventLoopGroup-4-7 com.amazonaws.ml.mms.wlm.WorkerThread - 9000-96795301 Worker disconnected. WORKER_MODEL_LOADED
mms_1 | 2021-10-31 18:35:41,528 [DEBUG] W-9000-video_segmentation_v1 com.amazonaws.ml.mms.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
mms_1 | java.lang.InterruptedException
mms_1 | at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
mms_1 | at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
mms_1 | at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
mms_1 | at com.amazonaws.ml.mms.wlm.WorkerThread.runWorker(WorkerThread.java:148)
mms_1 | at com.amazonaws.ml.mms.wlm.WorkerThread.run(WorkerThread.java:211)
mms_1 | at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
mms_1 | at java.util.concurrent.FutureTask.run(FutureTask.java:266)
mms_1 | at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
mms_1 | at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
mms_1 | at java.lang.Thread.run(Thread.java:748)
n0thing233 changed the title from "worker died and restart, memory issue" to "memory utilization increment after every request, worker died, memory issue" on Oct 31, 2021.
Commenting to follow - at first I suspected this was related to #942, but I tested with that PR and saw no change in behavior compared to the current released version (1.1.4). @n0thing233 - are you doing any large memory allocation from inside the predict function, or is it all in the model load?
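If it helps narrow that down, the standard-library tracemalloc can show which Python allocations grow from one request to the next. A rough sketch (MMS-agnostic, all names here are illustrative), which you could call at the end of your _handle:

```python
import tracemalloc

tracemalloc.start()
_previous_snapshot = None

def log_memory_growth(top_n=5):
    """Print the allocation sites that grew the most since the last call.

    Call this once per request; the first call only records a baseline.
    """
    global _previous_snapshot
    snapshot = tracemalloc.take_snapshot()
    if _previous_snapshot is not None:
        for stat in snapshot.compare_to(_previous_snapshot, "lineno")[:top_n]:
            print(stat)
    _previous_snapshot = snapshot
```

If the top entries point at the predict path, something there is keeping references alive between requests; if nothing grows on the Python side, the growth is likely in native allocations made by the framework.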