[GSoC] Add e2e test for tune API with LLM hyperparameter optimization #2420

base: master
New file: Free-Up Disk Space composite action

@@ -0,0 +1,49 @@
```yaml
name: Free-Up Disk Space
description: Remove Non-Essential Tools And Move Docker Data Directory to /mnt/docker

runs:
  using: composite
  steps:
    # This step is a workaround to avoid the "No space left on device" error.
    # ref: https://github.com/actions/runner-images/issues/2840
    - name: Remove unnecessary files
      shell: bash
      run: |
        echo "Disk usage before cleanup:"
        df -hT

        sudo rm -rf /usr/share/dotnet
        sudo rm -rf /opt/ghc
        sudo rm -rf /usr/local/share/boost
        sudo rm -rf "$AGENT_TOOLSDIRECTORY"
        sudo rm -rf /usr/local/lib/android
        sudo rm -rf /usr/local/share/powershell
        sudo rm -rf /usr/share/swift

        echo "Disk usage after cleanup:"
        df -hT

    - name: Prune docker images
      shell: bash
      run: |
        docker image prune -a -f
        docker system df
        df -hT

    - name: Move docker data directory
      shell: bash
      run: |
        echo "Stopping docker service ..."
        sudo systemctl stop docker
        DOCKER_DEFAULT_ROOT_DIR=/var/lib/docker
        DOCKER_ROOT_DIR=/mnt/docker
        echo "Moving ${DOCKER_DEFAULT_ROOT_DIR} -> ${DOCKER_ROOT_DIR}"
        sudo mv ${DOCKER_DEFAULT_ROOT_DIR} ${DOCKER_ROOT_DIR}
        echo "Creating symlink ${DOCKER_DEFAULT_ROOT_DIR} -> ${DOCKER_ROOT_DIR}"
        sudo ln -s ${DOCKER_ROOT_DIR} ${DOCKER_DEFAULT_ROOT_DIR}
        sudo ls -l ${DOCKER_DEFAULT_ROOT_DIR}
        echo "Starting docker service ..."
        sudo systemctl daemon-reload
        sudo systemctl start docker
        echo "Docker service status:"
        sudo systemctl --no-pager -l -o short status docker
```
Updated file: e2e test for the tune API

```diff
@@ -1,8 +1,16 @@
 import argparse
 import logging
 
+import kubeflow.katib as katib
+import transformers
 from kubeflow.katib import KatibClient, search
+from kubeflow.storage_initializer.hugging_face import (
+    HuggingFaceDatasetParams,
+    HuggingFaceModelParams,
+    HuggingFaceTrainerParams,
+)
 from kubernetes import client
+from peft import LoraConfig
 from verify import verify_experiment_results
 
 # Experiment timeout is 40 min.
```
```diff
@@ -11,8 +19,8 @@
 # The default logging config.
 logging.basicConfig(level=logging.INFO)
 
 
-def run_e2e_experiment_create_by_tune(
+# Test for Experiment created with custom objective function.
+def run_e2e_experiment_create_by_tune_with_custom_objective(
     katib_client: KatibClient,
     exp_name: str,
     exp_namespace: str,
```
```diff
@@ -57,6 +65,75 @@ def objective(parameters):
     logging.debug(katib_client.get_experiment(exp_name, exp_namespace))
     logging.debug(katib_client.get_suggestion(exp_name, exp_namespace))
 
 
+# Test for Experiment created with external models and datasets.
+def run_e2e_experiment_create_by_tune_with_llm_optimization(
+    katib_client: KatibClient,
+    exp_name: str,
+    exp_namespace: str,
+):
+    # Create Katib Experiment and wait until it is finished.
+    logging.debug("Creating Experiment: {}/{}".format(exp_namespace, exp_name))
+
+    # Use the test case from the fine-tuning API tutorial.
+    # https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/
```
> **Review comment:** Should we link an updated guide for Katib LLM Optimization?
>
> **Reply:** Since the Katib LLM Optimization guide is still under review, should I link to the file in its current state for now? Additionally, the example in the Katib LLM Optimization guide uses a different model and dataset than this one. The guide uses the LLaMA model, which requires access tokens. I've already applied for the access token and am awaiting approval. Once I receive it, I will test the example to see whether it works.
>
> **Reply:** I tried running the above example, but I ran into some unexpected errors in the […]. If we aim to include this in Katib 0.18-rc.0 this week, we might need to stick with the current example. Otherwise, I'll work on fixing it before RC.1.
>
> **Reply:** I think it is fine to include it in RC.1 since it is a bug fix.
>
> **Reply:** We can keep the URL for the Kubeflow Training docs for now.
```diff
+    # Create Katib Experiment.
+    # And wait until Experiment reaches Succeeded condition.
+    katib_client.tune(
+        name=exp_name,
+        namespace=exp_namespace,
+        # BERT model URI and type of Transformer to train it.
+        model_provider_parameters=HuggingFaceModelParams(
+            model_uri="hf://google-bert/bert-base-cased",
+            transformer_type=transformers.AutoModelForSequenceClassification,
+            num_labels=5,
+        ),
+        # In order to save test time, use 8 samples from the Yelp dataset.
+        dataset_provider_parameters=HuggingFaceDatasetParams(
+            repo_id="yelp_review_full",
+            split="train[:8]",
+        ),
+        # Specify HuggingFace Trainer parameters.
+        trainer_parameters=HuggingFaceTrainerParams(
+            training_parameters=transformers.TrainingArguments(
+                output_dir="test_tune_api",
+                save_strategy="no",
+                learning_rate=search.double(min=1e-05, max=5e-05),
+                num_train_epochs=1,
+            ),
+            # Set LoRA config to reduce the number of trainable model parameters.
+            lora_config=LoraConfig(
+                r=search.int(min=8, max=32),
+                lora_alpha=8,
+                lora_dropout=0.1,
+                bias="none",
+            ),
+        ),
+        objective_metric_name="train_loss",
+        objective_type="minimize",
+        algorithm_name="random",
+        max_trial_count=1,
+        parallel_trial_count=1,
+        resources_per_trial=katib.TrainerResources(
+            num_workers=1,
+            num_procs_per_worker=1,
+            resources_per_worker={"cpu": "2", "memory": "10G"},
+        ),
+        storage_config={
+            "size": "10Gi",
+            "access_modes": ["ReadWriteOnce"],
+        },
+        retain_trials=True,
+    )
+    experiment = katib_client.wait_for_experiment_condition(
+        exp_name, exp_namespace, timeout=EXPERIMENT_TIMEOUT
+    )
+
+    # Verify the Experiment results.
+    verify_experiment_results(katib_client, experiment, exp_name, exp_namespace)
+
+    # Print the Experiment and Suggestion.
+    logging.debug(katib_client.get_experiment(exp_name, exp_namespace))
+    logging.debug(katib_client.get_suggestion(exp_name, exp_namespace))
```
> **Review comment:** I would suggest using a prettifier to format the result of the test success or failure here, for example pprint. WDYT?
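A minimal sketch of what the reviewer's pprint suggestion could look like; using `pprint.pformat` on the objects returned by `get_experiment`/`get_suggestion` is an assumption for illustration, not code from this PR:

```python
import logging
from pprint import pformat

# Hypothetical illustration of the reviewer's suggestion: pretty-print the
# Experiment and Suggestion objects instead of logging their raw repr.
logging.debug("Experiment:\n%s", pformat(katib_client.get_experiment(exp_name, exp_namespace)))
logging.debug("Suggestion:\n%s", pformat(katib_client.get_suggestion(exp_name, exp_namespace)))
```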
```diff
 if __name__ == "__main__":
     parser = argparse.ArgumentParser()
 
@@ -79,18 +156,33 @@ def objective(parameters):
     client.CoreV1Api().patch_namespace(args.namespace, {'metadata': {'labels': namespace_labels}})
 
-    # Test with run_e2e_experiment_create_by_tune
-    exp_name = "tune-example"
+    exp_name_custom_objective = "tune-example-1"
+    exp_name_llm_optimization = "tune-example-2"
```
> **Review comment:** I would suggest a more meaningful name for the test; while I was looking at the results of the tests, it was not easy for me to tell the difference between tune-example-1 and tune-example-2. How about […]?
```diff
     exp_namespace = args.namespace
     try:
-        run_e2e_experiment_create_by_tune(katib_client, exp_name, exp_namespace)
+        run_e2e_experiment_create_by_tune_with_custom_objective(katib_client, exp_name_custom_objective, exp_namespace)
         logging.info("---------------------------------------------------------------")
+        logging.info(f"E2E is succeeded for Experiment created by tune: {exp_namespace}/{exp_name_custom_objective}")
     except Exception as e:
         logging.info("---------------------------------------------------------------")
+        logging.info(f"E2E is failed for Experiment created by tune: {exp_namespace}/{exp_name_custom_objective}")
         raise e
     finally:
         # Delete the Experiment.
         logging.info("---------------------------------------------------------------")
         logging.info("---------------------------------------------------------------")
+        katib_client.delete_experiment(exp_name_custom_objective, exp_namespace)
+
+    try:
```
> **Review comment:** I would suggest iterating over a data structure, as in the unit tests, for example:
>
> ```python
> test_tune_data = [
>     (
>         "tune_with_custom_objective",
>         run_e2e_experiment_create_by_tune_with_custom_objective,
>     ),
> ]
> ```
>
> WDYT?
```diff
+        run_e2e_experiment_create_by_tune_with_llm_optimization(katib_client, exp_name_llm_optimization, exp_namespace)
         logging.info("---------------------------------------------------------------")
-        logging.info(f"E2E is succeeded for Experiment created by tune: {exp_namespace}/{exp_name}")
+        logging.info(f"E2E is succeeded for Experiment created by tune: {exp_namespace}/{exp_name_llm_optimization}")
     except Exception as e:
         logging.info("---------------------------------------------------------------")
-        logging.info(f"E2E is failed for Experiment created by tune: {exp_namespace}/{exp_name}")
+        logging.info(f"E2E is failed for Experiment created by tune: {exp_namespace}/{exp_name_llm_optimization}")
         raise e
     finally:
         # Delete the Experiment.
         logging.info("---------------------------------------------------------------")
         logging.info("---------------------------------------------------------------")
-        katib_client.delete_experiment(exp_name, exp_namespace)
+        katib_client.delete_experiment(exp_name_llm_optimization, exp_namespace)
```
> **Review comment:** I would suggest importing each e2e test's specific requirements inside its own function, for example: […] In this way, the scope of each test is more clearly delimited. WDYT?
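A minimal sketch of the function-local import pattern the reviewer describes; which imports would move is an assumption based on this PR's top-level imports:

```python
def run_e2e_experiment_create_by_tune_with_llm_optimization(
    katib_client: KatibClient,
    exp_name: str,
    exp_namespace: str,
):
    # Hypothetical sketch: pull in the LLM-specific dependencies only when
    # this test runs, so the other e2e tests do not require them.
    import transformers
    from kubeflow.storage_initializer.hugging_face import (
        HuggingFaceDatasetParams,
        HuggingFaceModelParams,
        HuggingFaceTrainerParams,
    )
    from peft import LoraConfig
    ...
```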