-
Notifications
You must be signed in to change notification settings - Fork 518
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Advanced DAG Workflow. #4319
Draft
cblmemo
wants to merge
12
commits into
master
Choose a base branch
from
advanced-dag
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
+1,470
−322
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…t Flow Definition (#4067) * provide an example, edited from pipeline.yml * more focus on dependencies for user dag lib * more powerful user interface * load and dump new yaml format * fix * fix: reversed logic in add_edge * [docs] Unroll k8s internal load balancer docs (#4083) unroll load balancer docs * rename * refactor due to reviewer's comments * generate task.name if not given * [docs] `sky status --kubernetes` docs (#4064) * observability docs * comments * [UX] Show log after failure and fix the color issue with narrow window (#4084) * fix narrow window and show log path during exception * format * format * [k8s] `sky status --k8s` refactor (#4079) * refactor * lint * refactor, dataclass * refactor, dataclass * refactor * lint * add comments for add_edge * add `print_exception_no_traceback` when raise * make `Dag.tasks` a property * print dependencies for `__repr__` * move `get_unique_task_name` to common_utils * [Performance] Use new GCP custom images (#4027) * [Performance] Use new custom image to create GCP GPU VMs * update image tags for both CPU and GPU * always generate .sky/python_path --------- Co-authored-by: Yika Luo <[email protected]> * [GCP] Add H100 mega (#4099) * Add H100 mega support on GCP * fix for some other regions * format * fix resource type * fix catalog fetching * [GCP] Add gVNIC support (#4095) * add gvnic support through config.yaml * lint * docs * [Lambda] Lambda Cloud SkyPilot provisioner (#3865) * feat: lambda cloud new provisioner * feat: address cblmemo reviews and other reviews + make multi-node work again * fix: quotes * fix: address some reviews * chore: rm unused option * chore: update typedef * feat: use lists directly * fix: formatting * chore: address reviews * fix: formatting * chore: rm query ports since default impl per review * feat: add back query ports * fix: formatting * chore: add newline at eof * feat: try removing query ports again * [Docs] GKE Nvidia Driver installation instructions update (#4106) * docs * docs * docs * [Performance] Use new AWS custom images (#4091) * rename methods to use downstream/edge terminology * [Performance] Add Packer image generation scripts for GCP and AWS (#4068) * [Performance] Add Packer image generation scripts for GCP and AWS * Add docker install and tests * solve nvidia container issue * Install cuDNN * [Performance] Scripts to copy/delete AWS images for all regions and add cloud deps (#4073) * [Performance] Add AWS script to copy images for all regions * script to delete all AWS images across regions * Add cloud dependencies to image --------- Co-authored-by: Yika Luo <[email protected]> * Disable AWS images.csv refreshing (#4116) * [Docs] .skyignore doc (#4114) * [Docs] .skyignore doc * Correct typos Co-authored-by: Zongheng Yang <[email protected]> --------- Co-authored-by: Zongheng Yang <[email protected]> * [Core] Raise error for none existing cluster when endpoint is called (#4117) raise error for none existing cluster * Refresh local aws images.csv when image not found (#4127) Refresh local aws images.csv by pulling from github catalog when image tag not found * [Docs] News revamps. (#4126) * News revamps. updates updates updates updates updates updates updates updates * Apply suggestions from code review Co-authored-by: Zhanghao Wu <[email protected]> --------- Co-authored-by: Zhanghao Wu <[email protected]> * [Serve] Support manually terminating a replica and with purge option (#4032) * define replica id param in cli * create endpoint on controller * call controller endpoint to scale down replica * add classmethod decorator * add handler methods for readability in cli * update docstr and error msg, and inline in cli * update log and return err msg * add docstr, catch and reraise err, add stopped and nonexistent message * inline constant to avoid circular import * fix error statement and return encoded str * add purge feature * add purge replica usage in docstr * use .get to handle unexpected packages * fix: diff terminate replica when failed/purging or not * fix: stay up to date for `is_controller_accessible` * revert * up to date with current APIs * error handling * when purged remove record in the main loop * refactor due to reviewer's suggestions * combine functions * fix: terminate the healthy replica even with purge option * remove abbr * Update sky/serve/core.py Co-authored-by: Tian Xia <[email protected]> * Update sky/serve/core.py Co-authored-by: Tian Xia <[email protected]> * Update sky/serve/controller.py Co-authored-by: Tian Xia <[email protected]> * Update sky/serve/controller.py Co-authored-by: Tian Xia <[email protected]> * Update sky/cli.py Co-authored-by: Tian Xia <[email protected]> * got services hint * check if not yes in the outside if branch * fix some output messages * Update sky/serve/core.py Co-authored-by: Tian Xia <[email protected]> * set conflict status code for already scheduled termination * combine purge and normal terminating down branch together * bump version * global exception handler to render a json response with error messages * fix: use responses.JSONResponse for dict serialize * error messages for old controller * fix: check version mismatch in generated code * revert mistakenly change update_service * refine already in terminating message * fix: branch code workaround in cls.build * wording Co-authored-by: Tian Xia <[email protected]> * refactor due to reviewer's comments * fix use ux_utils Co-authored-by: Tian Xia <[email protected]> * add changelog as comments * fix messages * edit the message for mismatch error Co-authored-by: Tian Xia <[email protected]> * no traceback when raising in `terminate_replica` * messages decode * Apply suggestions from code review Co-authored-by: Tian Xia <[email protected]> * format * forma * Empty commit --------- Co-authored-by: David Tran <[email protected]> Co-authored-by: David Tran <[email protected]> Co-authored-by: Tian Xia <[email protected]> * [Provisioner] Support docker in Lambda Cloud and TPU (#4115) * [Provisioner] Support docker in Lambda Cloud * fix permission issue * merge with check docker installed * add tpu support & test * patch lambda cloud * add comment * Apply suggestions from code review Co-authored-by: Tian Xia <[email protected]> * change wording all to up/downstream style * Add unique suffix to task names, fallback to timestamp if unnamed * Unify handling of single and multiple tasks without dependencies * Refactor tasks initialization: use list comprehension and fail fast * Fix remove task dependency description: upstream, not downstream Co-authored-by: Tian Xia <[email protected]> * Remove duplicated `self.edges`, use nx api instead * [Serve] Add `ux_utils.print_exception_no_traceback()` for cleaner error output (#4111) * add `ux_utils.print_exception_no_traceback()` for cleaner error output * Empty commit * remove unnecessary with block * Partially revert: Remove unnecessary `ux_utils.print_exception_no_traceback()` wrappers (#4130) fix unnecessary with block for returning * Revert "Add unique suffix to task names, fallback to timestamp if unnamed" Otherwise, users can not refer to the task by name in the DAG. This reverts commit 8486352. * comment the checking used as upstream logic * [examples] Deepspeed fixes + k8s support (#4124) deepspeed kubernetes fixes * Empty commit * [OCI] Support more OS types in addition to ubuntu (#4080) * Bug fix for sky config file path resolution. * format * [OCI] Bug fix for image_id in Task YAML * [OCI]: Support more OS types (esp. oraclelinux) in addition to ubuntu. * format * Disable system firewall * Bug fix for validation of the Marketplace images * Update sky/clouds/oci.py Co-authored-by: Zhanghao Wu <[email protected]> * Update sky/clouds/oci.py Co-authored-by: Zhanghao Wu <[email protected]> * variable/function naming * address review comments: not to change the service_catalog api. call oci_catalog directly for get os type for a image. * Update sky/clouds/oci.py Co-authored-by: Zhanghao Wu <[email protected]> * Update sky/clouds/oci.py Co-authored-by: Zhanghao Wu <[email protected]> * Update sky/clouds/oci.py Co-authored-by: Zhanghao Wu <[email protected]> * address review comments --------- Co-authored-by: Zhanghao Wu <[email protected]> * Apply suggestions from code review Co-authored-by: Tian Xia <[email protected]> * fix: typing.cast * add TODOs for future function migration * remove dependencies wording to reduce ambiguity * temporarily add github actions --------- Co-authored-by: Romil Bhardwaj <[email protected]> Co-authored-by: Zhanghao Wu <[email protected]> Co-authored-by: yika-luo <[email protected]> Co-authored-by: Yika Luo <[email protected]> Co-authored-by: Kote Mushegiani <[email protected]> Co-authored-by: Zongheng Yang <[email protected]> Co-authored-by: David Tran <[email protected]> Co-authored-by: David Tran <[email protected]> Co-authored-by: Tian Xia <[email protected]> Co-authored-by: Hysun He <[email protected]>
…ck (#4186) * feat: check is dag * fix: also ensure it's connected * Apply suggestions from code review Co-authored-by: Tian Xia <[email protected]> * add a comment for `nx.is_weakly_connected` --------- Co-authored-by: Tian Xia <[email protected]>
* feat: support `with_data` for `dag` * feat: implement `from_yaml` and `to_yaml` for new data api * feat(dag): add TaskData class with source/target paths for data transfer between tasks
* feat: visualize with/without jupyter * style: import the module instead Co-authored-by: Tian Xia <[email protected]> * feat(dag): add format option to DAG visualization --------- Co-authored-by: Tian Xia <[email protected]>
* provide an example, edited from pipeline.yml * more focus on dependencies for user dag lib * more powerful user interface * load and dump new yaml format * fix * fix: reversed logic in add_edge * rename * refactor due to reviewer's comments * generate task.name if not given * add comments for add_edge * add `print_exception_no_traceback` when raise * make `Dag.tasks` a property * print dependencies for `__repr__` * move `get_unique_task_name` to common_utils * rename methods to use downstream/edge terminology * Add dependencies feature for task dependency management (#4067) provide an example, edited from pipeline.yml more focus on dependencies for user dag lib more powerful user interface load and dump new yaml format fix fix: reversed logic in add_edge rename refactor due to reviewer's comments generate task.name if not given add comments for add_edge add `print_exception_no_traceback` when raise make `Dag.tasks` a property print dependencies for `__repr__` move `get_unique_task_name` to common_utils rename methods to use downstream/edge terminology * fix(jobs): type errors * refactor: `_update_failed_task_state` for unified error handling * refactor: separate finally block for a meaningful name * feat: simple parallel execution support * Apply suggestions from code review Co-authored-by: Tian Xia <[email protected]> * change wording all to up/downstream style * Add unique suffix to task names, fallback to timestamp if unnamed * Unify handling of single and multiple tasks without dependencies * Refactor tasks initialization: use list comprehension and fail fast * Fix remove task dependency description: upstream, not downstream Co-authored-by: Tian Xia <[email protected]> * Remove duplicated `self.edges`, use nx api instead * Revert "Add unique suffix to task names, fallback to timestamp if unnamed" Otherwise, users can not refer to the task by name in the DAG. This reverts commit 8486352. * comment the checking used as upstream logic * remove is_chain restriction in jobs launch * Change Static layered parallelism to dynamic queue parallelism * fix: add blocked_tasks set * refactor and update canceled tasks in database * format: some types are unsubscriptable * add some logging * Fooled again by the silly finally block, mistakenly thought it only runs on errors! * feat: redirect logging for each thread to a separate file to prevent interleaving output * fix due to reviwer's suggestions and some nits * chore: remove some debugging info * partially revert "update canceled tasks in database", view a task as a whole * add some comments to inform the future of thread-level redirector * cancell all tasks when a task failed * combine 3 sets to 1 dict * add some comments to illustrate `_try_add_successors_to_queue` only queue each node once * Cancel all non-running tasks when cancelling a job. Left a TODO for future cancellation policy discussion. * make those logging files and dir * refactor: stream_logs_by_id * refactor and format * provide a cli to print run.log of a certain subtask * format * add some comments and checks * clearer log * hint users to see the log * add some comments explaining why skip convetional code path * add with exception_no_traceback * fix os.makedirs, add expandusr first * fix: diamond named diamond * fix: use managed_job_id as dirname instead of runtimestamp * fix: make strategy_executor local to prevent race conditions * feat: implement stream logs for a task command * fix: forgot to add parentheses around the addition * chore: log indent * fix: early checking and better comments * refactor: reuse `stream_logs_by_id` for 2nd part run log tailing * fix: deal with tasks already finished * refactor: a generalized `follow_logs` and implement output checking on its basis * format * add return type annotations Co-authored-by: Tian Xia <[email protected]> * add all annotations and rename `task_queue` to `ready_queue` * Apply suggestions from code review Co-authored-by: Tian Xia <[email protected]> * revert one-liner conditional for better readibility * use tuple unpacking instead of `[1]` Co-authored-by: Tian Xia <[email protected]> * Apply suggestions from code review Co-authored-by: Tian Xia <[email protected]> * chore: fix more type annotations * refactor: clearer responsibility split between `_ThreadAwareOutput` and `RedirectOutputForThread` * feat: ensures `open` returns a `TextIO` * fix: unbounded `is_dag_chain` * fix: typing * revert type subscription, stuck by pylint * fix: use default false `cancel_all` argument to avoid breaking existing calls * fix: add explicit write/flush methods to _ThreadAwareOutput for logging redirection --------- Co-authored-by: Tian Xia <[email protected]>
fix: corresponding example for new data API
* feat: global optimization * style: add some comments
…zation (#4320) * Add function _add_dummy_storage_nodes * Set outputs for upstream tasks and buckets * Remove unused edges * Reformat and add todo * Delete useless logs * Remove tmp figures * Add pylint: disable=pointless-statement * Use with_data instead of set_outputs to trigger egress plan * Remove set_outputs in _add_dummy_storage_nodes * Replace dummy with storage nodes * Remove useless logs * Reformat * Correct get_edge_data * Add data=True when unpacking edges * Remove storage nodes after optimization * Remove storage nodes after optimization * Change nbytes into n_gigabytes * Reformat * Add storage mount for data transfer * Run with data * Add user hash to bucket name * Remove logging * Remove underline in bucket name * Add uuid to bucket name * Delete plot * Apply make_cluster_name to bucket name * [DAG] Run global optimization on controller for task placement (#4364) * feat: global optimization * style: add some comments * Use global optimization * Use from_cloud for storage type and only allow four types of storage * Pass str instead of cloud to from_cloud * Remove special case in egress_plan * Remove mkdir for infer_A * Remove plot * Add empty line * Remove node parameter for egress cost * Add sleep for data processing after echoing * Add comments for _add_storage_nodes_for_data_transfer * Remove getattr * Check enabled cloud before optimization * Remove r2 in enabled clouds * Import package instead of function or class * Add docstr for set_storage_mounts_for_data_transfer * Add todo for checking the region of storage is none * Polish up the doc string * Reformat * Use TaskEdge as input type * Add todo * Use the value of 'edge' in data * Fix import --------- Co-authored-by: Andy Lee <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
A PR to stash our progress for advanced. This is in experimental stage and we need to discuss more on the API & UX issue.
Credit to @andylizf for the amazing contributions!
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh