This repository has been archived by the owner on Feb 8, 2023. It is now read-only.
Describe the bug
When a job fails, possibly because of an out-of-memory (OOM) condition, pymars shows an ambiguous error message and an incorrect progress bar.
To Reproduce
To help us reproduce this bug, please provide the information below:
Your Python version: 3.8.5
The version of Mars you use: latest
Versions of crucial packages, such as numpy, scipy and pandas: follow pymars
Full stack of the error.
Traceback (most recent call last):
File "/opt/mars/benchmarks/tpch/run_queries.py", line 1063, in <module>
main()
File "/opt/mars/benchmarks/tpch/run_queries.py", line 1056, in main
run_queries(folder, use_arrow_dtype=use_arrow_dtype)
File "/opt/mars/benchmarks/tpch/run_queries.py", line 986, in run_queries
mars.execute([lineitem, orders, customer, nation, region, supplier, part, partsupp])
File "/opt/mars/mars/deploy/oscar/session.py", line 1890, in execute
return session.execute(
File "/opt/mars/mars/deploy/oscar/session.py", line 1684, in execute
execution_info: ExecutionInfo = fut.result(
File "/opt/conda/lib/python3.8/concurrent/futures/_base.py", line 439, in result
return self.__get_result()
File "/opt/conda/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
raise self._exception
File "/opt/mars/mars/deploy/oscar/session.py", line 1870, in _execute
await execution_info
File "/opt/mars/mars/deploy/oscar/session.py", line 105, in wait
return await self._aio_task
File "/opt/mars/mars/deploy/oscar/session.py", line 953, in _run_in_background
raise task_result.error.with_traceback(task_result.traceback)
File "/opt/mars/mars/services/task/supervisor/processor.py", line 369, in run
await self._process_stage_chunk_graph(*stage_args)
File "/opt/mars/mars/services/task/supervisor/processor.py", line 247, in _process_stage_chunk_graph
chunk_to_result = await self._executor.execute_subtask_graph(
File "/opt/mars/mars/services/task/execution/mars/executor.py", line 196, in execute_subtask_graph
return await stage_processor.run()
File "/opt/mars/mars/services/task/execution/mars/stage.py", line 240, in run
raise self.result.error.with_traceback(self.result.traceback)
File "/opt/mars/mars/services/scheduling/worker/execution.py", line 392, in internal_run_subtask
subtask_info.result = await self._retry_run_subtask(
File "/opt/mars/mars/services/scheduling/worker/execution.py", line 497, in _retry_run_subtask
return await _retry_run(subtask, subtask_info, _run_subtask_once)
File "/opt/mars/mars/services/scheduling/worker/execution.py", line 91, in _retry_run
raise ex
File "/opt/mars/mars/services/scheduling/worker/execution.py", line 69, in _retry_run
return await target_async_func(*args)
File "/opt/mars/mars/services/scheduling/worker/execution.py", line 475, in _run_subtask_once
raise ex
File "/opt/mars/mars/services/scheduling/worker/execution.py", line 439, in _run_subtask_once
return await asyncio.shield(aiotask)
File "/opt/mars/mars/services/subtask/api.py", line 68, in run_subtask_in_slot
return await ref.run_subtask.options(profiling_context=profiling_context).send(
File "/opt/mars/mars/oscar/backends/context.py", line 195, in send
result = await self._wait(future, actor_ref.address, message)
File "/opt/mars/mars/oscar/backends/context.py", line 89, in _wait
return await future
File "/opt/mars/mars/oscar/backends/context.py", line 80, in _wait
await asyncio.shield(future)
File "/opt/mars/mars/oscar/backends/core.py", line 68, in _listen
raise ServerClosed(
mars.oscar.errors.ServerClosed: Remote server unixsocket:///917504 closed
To Reproduce
Run this TPC-H query in an environment with limited memory to trigger the error.
My Azure VM has 2 cores and 8 GB of memory. The error occurs when running query 1 on 1 GB of TPC-H data.
The Mars UI shows 100% progress even though the job failed.
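The `ServerClosed` error in the traceback only says that the remote endpoint went away. It does not say why. As a rough illustration of the kind of message that would be less ambiguous, here is a minimal sketch, not Mars code, of how a parent process on a POSIX system can tell that a child was killed by a signal (the kernel OOM killer uses SIGKILL). The names `describe_exit` and `run_fake_worker` are hypothetical, and the assumption that the worker here died from an OOM kill is unconfirmed.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Turn a child's return code into a readable explanation (POSIX only)."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # subprocess reports death-by-signal as the negated signal number.
        sig = signal.Signals(-returncode)
        # SIGKILL is what the OOM killer sends, but other senders use it too.
        hint = " (possibly the kernel OOM killer)" if sig == signal.SIGKILL else ""
        return f"killed by {sig.name}{hint}"
    return f"exited with status {returncode}"


def run_fake_worker() -> int:
    """Start a stand-in worker that kills itself with SIGKILL."""
    # Hypothetical stand-in for a worker process; it does no real work.
    child_code = "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"
    proc = subprocess.run([sys.executable, "-c", child_code])
    return proc.returncode


if __name__ == "__main__":
    print("worker", describe_exit(run_fake_worker()))
```

On Linux the script should report that the stand-in worker was killed by SIGKILL. A SIGKILL alone does not prove an OOM kill, so the kernel log would still need to be checked to confirm it.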