Sdk tests with papermill #2448
Signed-off-by: Yehudit Kerido <[email protected]>
Check out this pull request on ReviewNB. See visual diffs & provide feedback on Jupyter Notebooks.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/rerun-all
@yehudit1987 Can you please fix these CI errors?
@yehudit1987 Can you sign your commits with
FYI, you can check this reference: https://github.com/kubeflow/katib/pull/2448/checks?check_run_id=32215445282
Signed-off-by: Yehudit Kerido <[email protected]>
… trigger it Signed-off-by: Yehudit Kerido <[email protected]>
… trigger it 2 Signed-off-by: Yehudit Kerido <[email protected]>
963d367 to 6633aa5
/rerun-all
/rerun-all
Sorry for the late review @yehudit1987. I was very busy recently.
I left some comments for you. Thanks for your great contributions!
"from kubernetes import client, config\n",
"\n",
"# Initialize KatibClient\n",
"kclient = KatibClient(namespace=namespace)\n",
"\n",
"# Load kubeconfig\n",
"config.load_kube_config()\n",
"\n",
"# Kubernetes API for managing namespaces\n",
"core_v1_api = client.CoreV1Api()\n",
"\n",
"# Function to add label to namespace if it doesn't have the required one\n",
"def add_katib_label_to_namespace(namespace):\n",
"    ns = core_v1_api.read_namespace(namespace)\n",
"    labels = ns.metadata.labels or {}\n",
"    if labels.get(\"katib.kubeflow.org/metrics-collector-injection\") != \"enabled\":\n",
"        print(f\"Adding label to namespace {namespace}...\")\n",
"        labels[\"katib.kubeflow.org/metrics-collector-injection\"] = \"enabled\"\n",
"        body = {\"metadata\": {\"labels\": labels}}\n",
"        core_v1_api.patch_namespace(namespace, body)\n",
"        print(f\"Label added to namespace {namespace}.\")\n",
"    else:\n",
"        print(f\"Namespace {namespace} already has the required label.\")\n",
"\n",
"# Add the required label to the namespace\n",
"add_katib_label_to_namespace(namespace)\n",
Why do we need to add these lines?
papermill-args-yaml:
  description: 'Additional arguments to pass to Papermill in yaml format'
  required: false
  default: ""
It seems that we don't pass any parameters to these two notebook examples, so I'm not sure this input meets our requirements.
@@ -65,10 +65,12 @@ echo "Deploying Katib"
cd ../../../../../ && WITH_DATABASE_TYPE=$WITH_DATABASE_TYPE make deploy && cd -

# Wait until all Katib pods is running.
TIMEOUT=120s
TIMEOUT=180s
Why do we need to change `TIMEOUT` to `180s`?
kubectl wait --for=condition=ContainersReady=True --timeout=${TIMEOUT} -l "katib.kubeflow.org/component in ($WITH_DATABASE_TYPE,controller,db-manager,ui)" -n kubeflow pod ||
  (kubectl get pods -n kubeflow && kubectl describe pods -n kubeflow && exit 1)
echo "Waiting for pods to be ready for $TIMEOUT seconds..."
sleep $TIMEOUT
Is this necessary, since we already have the `kubectl wait` instruction? Please let me know your thoughts.
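To illustrate the difference: a condition-based wait returns as soon as the condition holds and only fails once the timeout expires, whereas a fixed `sleep` always costs the full duration. Below is a minimal Python sketch of that pattern. `wait_until` and its arguments are hypothetical names used only for illustration; they are not part of the Katib scripts.

```python
import time
from typing import Callable


def wait_until(is_ready: Callable[[], bool], timeout: float, interval: float = 1.0) -> bool:
    """Poll is_ready() until it returns True or `timeout` seconds elapse.

    Hypothetical helper: returns True as soon as the condition holds,
    False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        time.sleep(min(interval, remaining))
```

With this shape, an extra unconditional sleep after the wait adds latency without adding any guarantee.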
if ! kubectl get namespaces | grep -q "kubeflow-user-example-com"; then
  kubectl create namespace kubeflow-user-example-com
fi
Maybe we can use the `default` namespace instead of creating a new one :)
- name: Install dependencies
  shell: bash
  run: |
    python -m pip install --upgrade pip
    pip install papermill kubeflow-katib jupyter ipykernel
    python -m ipykernel install --user --name python3 --display-name "Python 3"

- name: Setup Minikube Cluster
  shell: bash
  run: ./test/e2e/v1beta1/scripts/gh-actions/setup-minikube.sh true true "" "" "cmaes"
Don't we need to create the minikube cluster between these two steps?
katib/.github/workflows/template-setup-e2e-test/action.yaml
Lines 39 to 48 in 2b41ae6
- name: Setup Minikube Cluster
  uses: medyagh/[email protected]
  with:
    network-plugin: cni
    cni: flannel
    driver: none
    kubernetes-version: ${{ inputs.kubernetes-version }}
    minikube-version: 1.31.1
    start-args: --wait-timeout=120s
function check_minikube() {
  if minikube status >/dev/null 2>&1; then
    echo "Minikube is already running."
  else
    echo "Minikube is not running. Starting Minikube..."
    minikube start
  fi
}

echo "Checking Minikube Kubernetes Cluster"
check_minikube
I guess we do not check the status of the minikube cluster here.
It would be better if you did not delete the original content.
I also think it's necessary to rewrite this example, since `TFJobClient()` is outdated in the newest training-operator SDK. WDYT 👀 @kubeflow/wg-automl-leads
I didn't make many changes to that notebook besides, as you mentioned, replacing TFJobClient with TrainingClient and specifying the job type. For some reason it looks like I rewrote the whole example. Anyway, I fixed the "corrupted" notebook, and I also fixed the other issues you pointed out.
Regarding the default namespace: creating experiments failed without specifying the namespace.
Thanks @yehudit1987!
I know some code lines were generated by PyCharm or VS Code. By "original content" I meant the outputs and images in the notebook, not those code lines.
As for the namespace, could we specify the `default` namespace for the experiment, like `namespace=default`? It would also be better if we could specify the namespace for papermill, as in kubeflow/training-operator#2274.
Signed-off-by: Yehudit Kerido <[email protected]>
A lot of effort, @yehudit1987! Thanks for your contribution.
I left some comments for you. cc👀 @kubeflow/wg-automl-leads
@@ -671,4 +671,4 @@
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
}
}
}
kubectl wait --for=condition=ContainersReady=True --timeout=${TIMEOUT} -l "katib.kubeflow.org/component in ($WITH_DATABASE_TYPE,controller,db-manager,ui)" -n kubeflow pod ||
  (kubectl get pods -n kubeflow && kubectl describe pods -n kubeflow && exit 1)
kubectl wait --for=condition=ContainersReady=True --timeout=${TIMEOUT} -l "katib.kubeflow.org/component in ($WITH_DATABASE_TYPE,controller,db-manager,ui)" -n kubeflow pod || (kubectl get pods -n kubeflow && kubectl describe pods -n kubeflow && exit 1)
I think it would be better if we could adjust the format of this line.
(Restore it to its original state.)
"metadata": {
  "pycharm": {
    "name": "#%%\n"
  }
},
"outputs": [],
"source": [
  "# Experiment name and namespace.\n",
  "namespace = \"kubeflow-user-example-com\"\n",
  "namespace = \"kubeflow\"\n",
  "experiment_name = \"cmaes-example\"\n",
  "\n",
  "metadata = V1ObjectMeta(\n",
Can we add a `parameters` tag in the cell `metadata` and allow the papermill args to overwrite these values, as in kubeflow/training-operator#2274?
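For reference, papermill looks for a cell tagged `parameters` and injects the overriding values in a new cell right after it. A rough sketch of adding that tag to the notebook JSON is shown below. `tag_parameters_cell` and the `marker` argument are hypothetical names for illustration; in practice the tag can also be set from the JupyterLab UI.

```python
from typing import Any, Dict


def tag_parameters_cell(notebook: Dict[str, Any], marker: str) -> bool:
    """Add the "parameters" tag to the first code cell whose source contains `marker`.

    Hypothetical helper. Returns True if a matching cell was found
    (and is now tagged), False otherwise.
    """
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # nbformat stores source either as one string or as a list of lines.
        text = source if isinstance(source, str) else "".join(source)
        if marker in text:
            tags = cell.setdefault("metadata", {}).setdefault("tags", [])
            if "parameters" not in tags:
                tags.append("parameters")
            return True
    return False
```

Once the cell that assigns `namespace` carries the tag, an invocation such as `papermill in.ipynb out.ipynb -p namespace default` should be able to override it.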
@@ -314,7 +342,8 @@
"\n",
"# Start the Katib Experiment.\n",
"exp_name = \"tune-mnist\"\n",
"katib_client = katib.KatibClient()\n",
"namespace=\"kubeflow\"\n",
Like above
"import time\n",
"time.sleep(120)\n",
"status = katib_client.is_experiment_succeeded(exp_name, namespace=namespace)\n",
Can we replace the fixed-time sleep with `wait_for_experiment_condition()`?
katib/sdk/python/v1beta1/kubeflow/katib/api/katib_client.py
Lines 1002 to 1010 in 2b41ae6
def wait_for_experiment_condition(
    self,
    name: str,
    namespace: Optional[str] = None,
    expected_condition: str = constants.EXPERIMENT_CONDITION_SUCCEEDED,
    timeout: int = 600,
    polling_interval: int = 15,
    apiserver_timeout: int = constants.DEFAULT_TIMEOUT,
):
Yes, I think we can use this API here.
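A possible shape for the notebook cell, based only on the signature quoted above. `wait_for_success` is a hypothetical wrapper name. The exact return value and the errors raised on timeout or failure are not shown in this thread, so they should be checked against the SDK version in use.

```python
def wait_for_success(katib_client, exp_name: str, namespace: str,
                     timeout: int = 600, polling_interval: int = 15):
    """Block until the Experiment reaches the expected condition.

    Hypothetical wrapper replacing `time.sleep(120)` followed by
    `is_experiment_succeeded(...)`. It relies on the default
    `expected_condition` (constants.EXPERIMENT_CONDITION_SUCCEEDED)
    from the quoted signature.
    """
    return katib_client.wait_for_experiment_condition(
        name=exp_name,
        namespace=namespace,
        timeout=timeout,
        polling_interval=polling_interval,
    )
```

In the notebook this would be called as `wait_for_success(katib_client, exp_name, namespace)`, so the cell returns as soon as the Experiment finishes rather than after a fixed two minutes.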
/rerun-all
Thank you for this effort and for updating the broken Notebooks in Katib, @yehudit1987!
Let's finalize this PR once we design the testing script in the Training Operator.
- name: Run Jupyter Notebook with Papermill
  shell: bash
  run: |
    IFS=',' read -r -a NOTEBOOK_ARRAY <<< "${{ inputs.notebook-input }}"
    # Loop through each notebook path
    for NOTEBOOK in "${NOTEBOOK_ARRAY[@]}"; do
      OUTPUT_FILE="${NOTEBOOK%.ipynb}_output.ipynb"
      echo "Running notebook: $NOTEBOOK"
      papermill "$NOTEBOOK" "$OUTPUT_FILE" --log-output --kernel python3 || {
        echo "Papermill failed for notebook: $NOTEBOOK"
        exit 1
      }
    done
As we discussed with @saileshd1402 in the Training Operator PR: kubeflow/training-operator#2274 (comment), we might want to create a script to run those Notebooks with papermill rather than adding the script to the GitHub action directly.
I think, once we finalize it, we can use the same approach for Katib tests.
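As a starting point for that discussion, here is one way such a script might look. It mirrors the loop from the workflow step above and adds optional `-p` overrides. The function names and the `parameters` argument are assumptions for illustration, not the design agreed in the Training Operator PR.

```python
import subprocess
import sys
from typing import Dict, List, Optional


def output_path_for(notebook: str) -> str:
    """Mirror the action's naming: foo.ipynb -> foo_output.ipynb."""
    stem = notebook[: -len(".ipynb")] if notebook.endswith(".ipynb") else notebook
    return stem + "_output.ipynb"


def build_papermill_command(notebook: str,
                            parameters: Optional[Dict[str, str]] = None) -> List[str]:
    """Build the papermill CLI call from the workflow step, plus optional -p overrides."""
    cmd = ["papermill", notebook, output_path_for(notebook),
           "--log-output", "--kernel", "python3"]
    for name, value in (parameters or {}).items():
        cmd += ["-p", name, str(value)]
    return cmd


def run_notebooks(notebooks: List[str],
                  parameters: Optional[Dict[str, str]] = None) -> int:
    """Run each notebook in order; stop at the first failure and return its exit code."""
    for notebook in notebooks:
        print(f"Running notebook: {notebook}")
        result = subprocess.run(build_papermill_command(notebook, parameters))
        if result.returncode != 0:
            print(f"Papermill failed for notebook: {notebook}", file=sys.stderr)
            return result.returncode
    return 0
```

Keeping the command construction in a separate function makes it testable without a cluster, and the GitHub action would then only need to call the script.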
Hi @andreyvelich, I have now addressed all the previous comments.
One of the notebooks again seems to be rewritten, even though I used JupyterLab as you suggested.
Anyway, I will fix those notebooks once we decide whether or not to use the script.
Please keep me updated on that.
"pycharm": {
  "name": "#%% md\n"
}
I would suggest editing these Notebooks using JupyterLab directly.
In that case, the JSON format will be rendered correctly in every IDE.
E.g. you can just run JupyterLab locally to edit them:
pip install jupyterlab
jupyter lab
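To see which cells still carry IDE-specific metadata such as the `pycharm` key in the snippet above, a small check along these lines might help. `cells_with_ide_metadata` is a hypothetical helper and the default key list is an assumption.

```python
from typing import Any, Dict, Iterable, List


def cells_with_ide_metadata(notebook: Dict[str, Any],
                            keys: Iterable[str] = ("pycharm",)) -> List[int]:
    """Return the indexes of cells whose metadata contains any of the given keys."""
    wanted = set(keys)
    return [
        index
        for index, cell in enumerate(notebook.get("cells", []))
        if wanted & set(cell.get("metadata", {}))
    ]
```

It takes the notebook as a dict, e.g. the result of `json.load()` on the `.ipynb` file.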
Signed-off-by: Yehudit Kerido <[email protected]>
/rerun-all
What this PR does / why we need it:
This PR adds E2E tests that run the Katib examples with papermill.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2417
Checklist: