This repository has been archived by the owner on Feb 8, 2023. It is now read-only.

[BUG] Pymars shows ambiguous error messages and incorrect progress bar #10

Open
ChengjieLi28 opened this issue Aug 26, 2022 · 0 comments

Comments

@ChengjieLi28

Describe the bug
When a job fails, possibly because of an out-of-memory (OOM) condition, pymars shows an ambiguous error message and an incorrect progress bar.

To Reproduce
To help us reproduce this bug, please provide the information below:

  1. Your Python version: 3.8.5
  2. The version of Mars you use: latest
  3. Versions of crucial packages, such as numpy, scipy and pandas: whatever pymars installs
  4. Full stack of the error.
Traceback (most recent call last):
  File "/opt/mars/benchmarks/tpch/run_queries.py", line 1063, in <module>
    main()
  File "/opt/mars/benchmarks/tpch/run_queries.py", line 1056, in main
    run_queries(folder, use_arrow_dtype=use_arrow_dtype)
  File "/opt/mars/benchmarks/tpch/run_queries.py", line 986, in run_queries
    mars.execute([lineitem, orders, customer, nation, region, supplier, part, partsupp])
  File "/opt/mars/mars/deploy/oscar/session.py", line 1890, in execute
    return session.execute(
  File "/opt/mars/mars/deploy/oscar/session.py", line 1684, in execute
    execution_info: ExecutionInfo = fut.result(
  File "/opt/conda/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/opt/conda/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
  File "/opt/mars/mars/deploy/oscar/session.py", line 1870, in _execute
    await execution_info
  File "/opt/mars/mars/deploy/oscar/session.py", line 105, in wait
    return await self._aio_task
  File "/opt/mars/mars/deploy/oscar/session.py", line 953, in _run_in_background
    raise task_result.error.with_traceback(task_result.traceback)
  File "/opt/mars/mars/services/task/supervisor/processor.py", line 369, in run
    await self._process_stage_chunk_graph(*stage_args)
  File "/opt/mars/mars/services/task/supervisor/processor.py", line 247, in _process_stage_chunk_graph
    chunk_to_result = await self._executor.execute_subtask_graph(
  File "/opt/mars/mars/services/task/execution/mars/executor.py", line 196, in execute_subtask_graph
    return await stage_processor.run()
  File "/opt/mars/mars/services/task/execution/mars/stage.py", line 240, in run
    raise self.result.error.with_traceback(self.result.traceback)
  File "/opt/mars/mars/services/scheduling/worker/execution.py", line 392, in internal_run_subtask
    subtask_info.result = await self._retry_run_subtask(
  File "/opt/mars/mars/services/scheduling/worker/execution.py", line 497, in _retry_run_subtask
    return await _retry_run(subtask, subtask_info, _run_subtask_once)
  File "/opt/mars/mars/services/scheduling/worker/execution.py", line 91, in _retry_run
    raise ex
  File "/opt/mars/mars/services/scheduling/worker/execution.py", line 69, in _retry_run
    return await target_async_func(*args)
  File "/opt/mars/mars/services/scheduling/worker/execution.py", line 475, in _run_subtask_once
    raise ex
  File "/opt/mars/mars/services/scheduling/worker/execution.py", line 439, in _run_subtask_once
    return await asyncio.shield(aiotask)
  File "/opt/mars/mars/services/subtask/api.py", line 68, in run_subtask_in_slot
    return await ref.run_subtask.options(profiling_context=profiling_context).send(
  File "/opt/mars/mars/oscar/backends/context.py", line 195, in send
    result = await self._wait(future, actor_ref.address, message)
  File "/opt/mars/mars/oscar/backends/context.py", line 89, in _wait
    return await future
  File "/opt/mars/mars/oscar/backends/context.py", line 80, in _wait
    await asyncio.shield(future)
  File "/opt/mars/mars/oscar/backends/core.py", line 68, in _listen
    raise ServerClosed(
mars.oscar.errors.ServerClosed: Remote server unixsocket:///917504 closed
  5. Minimized code to reproduce the error.
 python mars/benchmarks/tpch/run_queries.py --folder /opt/data-1G/ --query 1 --endpoint http://10.0.0.4:8001

This TPC-H query should be run in an environment with limited memory to trigger the error.

My Azure VM has 2 cores and 8 GB of memory. Running query 1 against the 1 GB TPC-H dataset triggers this error.
The Mars UI shows 100% progress even though the job failed.
(Screenshot: Mars UI showing 100% progress for the failed job.)
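
One possible direction for a less ambiguous message, as a minimal sketch only: translate the low-level `ServerClosed` error into one that hints at the likely cause. This is not how Mars currently behaves. `ServerClosed` below is a local stand-in class rather than the real `mars.oscar.errors.ServerClosed`, and `WorkerLostError` and `explain_failure` are hypothetical names introduced for illustration.

```python
class ServerClosed(Exception):
    """Local stand-in for mars.oscar.errors.ServerClosed (not imported from Mars)."""


class WorkerLostError(RuntimeError):
    """Hypothetical error type that carries a more actionable message."""


def explain_failure(exc: Exception) -> Exception:
    """Return a clearer exception for a lost remote process, else the original."""
    if isinstance(exc, ServerClosed):
        clearer = WorkerLostError(
            f"{exc}. The remote process exited unexpectedly; one common cause "
            "is the operating system killing it after it ran out of memory."
        )
        # Keep the original error attached so the full traceback is not lost.
        clearer.__cause__ = exc
        return clearer
    return exc


if __name__ == "__main__":
    original = ServerClosed("Remote server unixsocket:///917504 closed")
    print(explain_failure(original))
```

The hint is phrased as a possibility because a closed server does not prove an OOM kill; the original exception stays attached as `__cause__` for debugging.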
