Merge branch 'main' into tgi_params
Jacobsolawetz authored Aug 20, 2024
2 parents 7cd9226 + e5634ff commit b85316a
Showing 5 changed files with 853 additions and 0 deletions.
1 change: 1 addition & 0 deletions arcee/config.py
@@ -39,6 +39,7 @@ def write_configuration_value(key: str, value: str) -> None:
except json.JSONDecodeError:
pass
else:
conf_path.parent.mkdir(parents=True, exist_ok=True)
conf_path.touch()

config[key] = value
253 changes: 253 additions & 0 deletions notebooks/model_alignment_dpo.ipynb
@@ -0,0 +1,253 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Aligning a model on Arcee Cloud with Direct Preference Optimization (DPO)\n",
"\n",
"In this notebook, you will learn how to align a model with DPO on Arcee Cloud.\n",
"\n",
"In order to run this demo, you need a Starter account on Arcee Cloud. Please see our [pricing](https://www.arcee.ai/pricing) page for details.\n",
"\n",
"The Arcee documentation is available at [docs.arcee.ai](https://docs.arcee.ai/deployment/start-deployment)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"Please [sign up](https://app.arcee.ai/account/signup) to Arcee Cloud and create an [API key](https://docs.arcee.ai/getting-arcee-api-key/getting-arcee-api-key).\n",
"\n",
"Then, please update the cell below with your API key. Remember to keep this key safe, and **DON'T COMMIT IT to one of your repositories**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%env ARCEE_API_KEY=YOUR_API_KEY"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a new Python environment (optional but recommended) and install [arcee-python](https://github.com/arcee-ai/arcee-python)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Uncomment the next three lines to create a virtual environment\n",
"#!pip install -q virtualenv\n",
"#!virtualenv -q arcee-cloud\n",
"#!source arcee-cloud/bin/activate\n",
"\n",
"%pip install -qU arcee-py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import arcee\n",
"import pprint"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Aligning the model\n",
"\n",
"At the moment, the DPO dataset is not configurable. We use the [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) dataset, which consists of 64k prompts, 256k responses from different LLMs, and 380k high-quality feedback annotations provided by GPT-4.\n",
"\n",
"Here, we will run DPO on the [Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) model we tuned for instruction following in the Supervised Fine-Tuning (SFT) notebook. You may remember that we used the [reasoning-share-gpt](https://huggingface.co/datasets/arcee-ai/reasoning-sharegpt) dataset.\n",
"\n",
"We could pick any model available on the Hugging Face hub, or a model we've already worked with on Arcee Cloud.\n",
"\n",
"Let's launch the alignment job with the `start_alignment()` API. It should last between 2 and 2.5 hours."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"help(arcee.start_alignment)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"alignment_name = \"llama-3-8B-reasoning-share-gpt-dpo\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response = arcee.start_alignment(\n",
"    alignment_name=alignment_name,\n",
"    # hf_model=\"meta-llama/Meta-Llama-3-8B\",\n",
"    alignment_model=\"llama-3-8B-reasoning-share-gpt\",\n",
"    alignment_type=\"dpo\",\n",
"    full_or_peft=\"peft\"\n",
")\n",
"print(response)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from time import sleep\n",
"\n",
"while True:\n",
" response = arcee.alignment_status(alignment_name)\n",
" if response[\"processing_state\"] == \"processing\":\n",
" print(\"Alignment is in progress. Waiting 15 minutes before checking again.\")\n",
" sleep(900)\n",
" else:\n",
" print(response)\n",
" break"
]
},
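{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same polling pattern comes up again below for deployment. If you prefer, it can be factored into a small helper (a sketch, not part of the Arcee SDK; it only assumes that the status APIs used in this notebook return a dict with the given state key):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def wait_until_done(status_fn, name, state_key, busy_value, interval_seconds):\n",
"    # Poll status_fn(name) until state_key is no longer busy_value\n",
"    while True:\n",
"        response = status_fn(name)\n",
"        if response[state_key] == busy_value:\n",
"            print(f\"Still {busy_value}. Waiting {interval_seconds} seconds before checking again.\")\n",
"            sleep(interval_seconds)\n",
"        else:\n",
"            return response\n",
"\n",
"# Equivalent to the loop above:\n",
"# wait_until_done(arcee.alignment_status, alignment_name, \"processing_state\", \"processing\", 900)"
]
},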
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deploying our aligned model\n",
"\n",
"Once alignment is complete, we can deploy and test the aligned model. As part of the Arcee Cloud free tier, this is free of charge and the endpoint will be automatically shut down after 2 hours.\n",
"\n",
"Deployment should take 5-7 minutes. Please see the model deployment sample notebook for details."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"deployment_name = alignment_name"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response = arcee.start_deployment(deployment_name=deployment_name, alignment=alignment_name)\n",
"\n",
"while True:\n",
" response = arcee.deployment_status(deployment_name)\n",
" if response[\"deployment_processing_state\"] == \"pending\":\n",
" print(\"Deployment is in progress. Waiting 60 seconds before checking again.\")\n",
" sleep(60)\n",
" else:\n",
" print(response)\n",
" break"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once the model endpoint is up and running, we can prompt the model with a domain-specific question."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#query = \"Is Pluto a planet? Use markdown.\"\n",
"query = \"I was supposed to fly to NYC but my connecting flight was cancelled. I'm now stuck in Omaha, Nebraska and it's 8PM. I have a meeting in Manhattan tomorrow at 10AM. What is my best option? Use markdown.\"\n",
"\n",
"response = arcee.generate(deployment_name=deployment_name, query=query)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from IPython.display import display, Markdown\n",
"\n",
"display(Markdown(response[\"text\"]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Stopping our deployment\n",
"\n",
"When we're done working with our model, we should stop the deployment to save resources and avoid unwanted charges.\n",
"\n",
"The `stop_deployment()` API only requires the deployment name."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"arcee.stop_deployment(deployment_name=deployment_name)\n",
"arcee.deployment_status(deployment_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This concludes the model alignment demonstration. Thank you for your time!\n",
"\n",
"If you'd like to know more about using Arcee Cloud in your organization, please visit the [Arcee website](https://www.arcee.ai), or contact [sales@arcee.ai](mailto:sales@arcee.ai).\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
217 changes: 217 additions & 0 deletions notebooks/model_cli.ipynb
@@ -0,0 +1,217 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Downloading models with the Arcee Command Line Interface\n",
"\n",
"In this notebook, you will learn how to download model weights with the Arcee Command Line Interface (CLI).\n",
"\n",
"The Arcee documentation is available at [docs.arcee.ai](https://docs.arcee.ai/deployment/start-deployment)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"Please [sign up](https://app.arcee.ai/account/signup) to Arcee Cloud and create an [API key](https://docs.arcee.ai/getting-arcee-api-key/getting-arcee-api-key).\n",
"\n",
"Remember to keep this key safe, and **DON'T COMMIT IT to one of your repositories**."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a new Python environment (optional but recommended) and install [arcee-python](https://github.com/arcee-ai/arcee-python)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Uncomment the next three lines to create a virtual environment\n",
"#!pip install -q virtualenv\n",
"#!virtualenv -q arcee-cloud\n",
"#!source arcee-cloud/bin/activate\n",
"\n",
"%pip install -q arcee-py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now use the `arcee` command-line interface (CLI) tool."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sh\n",
"arcee "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Storing our API key\n",
"\n",
"The first step is to configure the CLI and provide your API key."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```bash\n",
"$ arcee configure\n",
"Current API URL: https://app.arcee.ai/api\n",
"API key: not in config (file or env)\n",
"\n",
"Enter your Arcee API key 🔒\n",
"Hit enter to leave it as is.\n",
"See https://docs.arcee.ai/getting-arcee-api-key/getting-arcee-api-key for more details.\n",
"You can also pass this at runtime with the ARCEE_API_KEY environment variable.\n",
": [MY_API_KEY]\n",
"Setting API key\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The key is now stored locally in a configuration file named `config.json`. The default location is platform-dependent, and you can print the path by running the cell below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import typer\n",
"import pprint\n",
"\n",
"pprint.pprint(typer.get_app_dir(\"arcee\"))"
]
},
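{
"cell_type": "markdown",
"metadata": {},
"source": [
"To double-check what the CLI stored, you can read the file directly (a sketch; it assumes the default `config.json` location printed above and that the file contains plain JSON):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import os\n",
"\n",
"config_path = os.path.join(typer.get_app_dir(\"arcee\"), \"config.json\")\n",
"if os.path.exists(config_path):\n",
"    with open(config_path) as f:\n",
"        config = json.load(f)\n",
"    # Avoid printing the API key itself; just list the stored keys\n",
"    print(list(config.keys()))\n",
"else:\n",
"    print(f\"No configuration file found at {config_path}\")"
]
},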
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If this path doesn't work for you, you can move the configuration file you just created to another location and set its new location with the `ARCEE_CONFIG_LOCATION` environment variable, e.g.:\n",
"\n",
"```bash\n",
"mv \"/Users/julien/Library/Application Support/arcee\" ~\n",
"export ARCEE_CONFIG_LOCATION=/Users/julien/arcee\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once you've configured the CLI, you can quickly check that it's working by printing your default Arcee organization:\n",
"\n",
"```bash\n",
"$ arcee org\n",
"Current org: juliens-test-organization\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Downloading model weights\n",
"\n",
"The CLI allows you to download model weights for models hosted on Arcee Cloud. We just need to pass the type of model (continuous pretrained, merged, or aligned) and the model name.\n",
"\n",
"```bash\n",
"$ arcee {cpt, merging, sft} download --name [MODEL_NAME]\n",
"```\n",
"\n",
"For example, we can download the weights of the model we aligned in the model alignment notebook:\n",
"\n",
"```bash\n",
"$ arcee sft download --name llama-3-8B-reasoning-share-gpt\n",
"Downloading alignment model weights for llama-3-8B-reasoning-share-gpt to /Users/julien/llama-3-8B-reasoning-share-gpt.tar.gz\n",
"Downloading llama-3-8B-reasoning-share-gpt weights... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0:27:32 0.0/12.7 GB 0:00:14 7.7 MB/s\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once the model weights have been downloaded, we can extract them locally. You can use `gzip` or `pigz` (a faster, parallel option) for decompression:\n",
"\n",
"```\n",
"$ mkdir my_llama3\n",
"$ pigz -dc llama-3-8B-reasoning-share-gpt.tar.gz | tar xvf - -C my_llama3\n",
"```"
]
},
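{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you prefer to stay in Python, the standard library's `tarfile` module can extract the archive as well (a sketch, assuming the archive name shown above):\n",
"\n",
"```python\n",
"import tarfile\n",
"\n",
"with tarfile.open(\"llama-3-8B-reasoning-share-gpt.tar.gz\", \"r:gz\") as archive:\n",
"    archive.extractall(\"my_llama3\")\n",
"```"
]
},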
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Loading the model with the transformers library\n",
"\n",
"Finally, we can load the model with the Hugging Face transformers library.\n",
"\n",
"```python\n",
"from transformers import AutoTokenizer, AutoModelForCausalLM\n",
"\n",
"model_dir = \"my_llama3\"\n",
"tokenizer = AutoTokenizer.from_pretrained(model_dir)\n",
"model = AutoModelForCausalLM.from_pretrained(model_dir)\n",
"```\n",
"```bash\n",
"Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:19<00:00, 5.00s/it]\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This concludes the CLI demonstration. Thank you for your time!\n",
"\n",
"If you'd like to know more about using Arcee Cloud in your organization, please visit the [Arcee website](https://www.arcee.ai), or contact [sales@arcee.ai](mailto:sales@arcee.ai).\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
382 changes: 382 additions & 0 deletions notebooks/model_pretraining.ipynb
@@ -0,0 +1,382 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pretraining a model on Arcee Cloud\n",
"\n",
"In this notebook, you will learn how to run continuous pretraining of a model on Arcee Cloud. In this example, we'll train a Llama3-8B model on the Energy domain.\n",
"\n",
"In order to run this demo, you need a Starter account on Arcee Cloud. Please see our [pricing](https://www.arcee.ai/pricing) page for details.\n",
"\n",
"The Arcee documentation is available at [docs.arcee.ai](https://docs.arcee.ai/deployment/start-deployment)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"Please [sign up](https://app.arcee.ai/account/signup) to Arcee Cloud and create an [API key](https://docs.arcee.ai/getting-arcee-api-key/getting-arcee-api-key).\n",
"\n",
"Then, please update the cell below with your API key. Remember to keep this key safe, and **DON'T COMMIT IT to one of your repositories**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%env ARCEE_API_KEY=YOUR_API_KEY"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a new Python environment (optional but recommended) and install [arcee-python](https://github.com/arcee-ai/arcee-python)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Uncomment the next three lines to create a virtual environment\n",
"#!pip install -q virtualenv\n",
"#!virtualenv -q arcee-cloud\n",
"#!source arcee-cloud/bin/activate\n",
"\n",
"%pip install -q arcee-py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import arcee\n",
"from IPython.display import Image"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preparing our dataset\n",
"\n",
"We need a dataset that holds the appropriate domain knowledge on the Energy domain. Arcee Cloud can ingest data in a variety of formats, like PDF, JSON, XML, TXT, HTML, and CSV. Please check the [documentation](https://docs.arcee.ai/continuous-pretraining/upload-pretraining-data) for an up-to-date list of supported formats.\n",
"\n",
"\n",
"We assembled a collection of about 300 PDF reports from the [International Energy Agency](https://www.iea.org/analysis?type=report) and the [Energy Reports](https://www.sciencedirect.com/journal/energy-reports) journal. The total size of the dataset is 1.5GB, or about 16 million tokens. Please note that this is probably too small for efficient pretraining. For real-life applications, we recommend using at least 100 million tokens.\n",
"\n",
"For convenience, we have stored the dataset in this Google Drive [folder](https://drive.google.com/drive/folders/1DX5hIuVfykHqz2gwLTu4MR9R6TTAxiEO?usp=sharing). However, please note that Arcee Cloud requires training datasets to be stored in Amazon S3, so we also uploaded the dataset to a \"customer\" bucket defined below. You will be able to use this bucket to run the rest of this notebook, but you won't be able to list its contents. In real life, you would of course use your own S3 bucket."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dataset_bucket_name = \"juliensimon-datasets\"\n",
"dataset_name = \"energy-pdf\"\n",
"dataset_s3_uri=f\"s3://{dataset_bucket_name}/{dataset_name}\"\n",
"print(f\"Dataset S3 URI: {dataset_s3_uri}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The training code in Arcee Cloud runs in one of Arcee's AWS accounts. \n",
"\n",
"We need to allow this account to access the data stored in the bucket above (which is attached to a different AWS account). \n",
"\n",
"This setup is called \"cross-account access\" and it requires adding a policy to the bucket, allowing the Arcee account to read the data it stores. \n",
"\n",
"You'll find more information about cross-account access and bucket policies in the [AWS documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-walkthroughs-managing-access-example2.html). \n",
"\n",
"If you're unfamiliar with the process, or don't have the AWS permissions required, please contact your AWS administrator."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is the bucket policy applied to the \"customer\" bucket. \n",
"\n",
"It gives Arcee's AWS account `812782781539` read and list permissions on the \"customer\" bucket. Working with your own bucket, you would update the `Resource` section with your bucket name and prefixes, then apply the policy using either the AWS console or one of the AWS SDKs.\n",
"\n",
"```python\n",
"import boto3\n",
"import json\n",
"\n",
"bucket_policy = {\n",
"    \"Version\": \"2012-10-17\",\n",
"    \"Statement\": [\n",
"        {\n",
"            \"Effect\": \"Allow\",\n",
"            \"Principal\": {\"AWS\": \"arn:aws:iam::812782781539:root\"},\n",
"            \"Action\": [\n",
"                \"s3:GetBucketLocation\",\n",
"                \"s3:ListBucket\",\n",
"                \"s3:GetObject\",\n",
"                \"s3:GetObjectAttributes\",\n",
"                \"s3:GetObjectTagging\"\n",
"            ],\n",
"            \"Resource\": [\n",
"                \"arn:aws:s3:::juliensimon-datasets\",\n",
"                \"arn:aws:s3:::juliensimon-datasets/*\"\n",
"            ]\n",
"        }\n",
"    ]\n",
"}\n",
"\n",
"policy_string = json.dumps(bucket_policy)\n",
"\n",
"boto3.client('s3').put_bucket_policy(Bucket=\"juliensimon-datasets\", Policy=policy_string)\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Uploading our dataset\n",
"\n",
"Now that Arcee Cloud can read the training dataset, let's upload it with the `upload_corpus_folder()` API."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"help(arcee.upload_corpus_folder)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model_name = \"meta-llama/Meta-Llama-3-8B\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response = arcee.upload_corpus_folder(\n",
" corpus=dataset_name,\n",
" s3_folder_url=dataset_s3_uri,\n",
" tokenizer_name=model_name,\n",
" block_size=8192 # see max_position_embeddings in https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/config.json\n",
")"
]
},
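{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `block_size` above should match the base model's maximum context length. As an optional sanity check, you can read it from the model configuration with the `transformers` library (this assumes `transformers` is installed and that you have access to the gated Llama 3 repository):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: requires transformers and access to the gated model repo\n",
"from transformers import AutoConfig\n",
"\n",
"config = AutoConfig.from_pretrained(model_name)\n",
"print(config.max_position_embeddings)  # 8192 for Meta-Llama-3-8B"
]
},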
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from time import sleep\n",
"\n",
"while True:\n",
" response = arcee.corpus_status(dataset_name)\n",
" if response[\"processing_state\"] == \"processing\":\n",
" print(\"Upload is in progress. Waiting 30 seconds before checking again.\")\n",
" sleep(30)\n",
" else:\n",
" print(response)\n",
" break\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pretraining our model\n",
"\n",
"Once the dataset has been uploaded, we can launch training with the `start_pretraining()` API."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"help(arcee.start_pretraining)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pretraining_name=f\"{model_name}-{dataset_name}\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response = arcee.start_pretraining(\n",
" pretraining_name=pretraining_name,\n",
" corpus=dataset_name,\n",
" base_model=model_name\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the Arcee Cloud console, we can see the training job has started. After a few minutes, you should see the training loss decreasing, indicating that the model is learning how to correctly predict the tokens present in your dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Image(\"model_pretraining_01.png\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deploying our trained model\n",
"\n",
"Once training is complete, we can deploy and test the pretrained model. The model hasn't been aligned, so chances are it's not going to generate anything really useful. However, we should still check that the model is able to generate properly.\n",
"\n",
"As part of the Arcee Cloud free tier, model deployment is free of charge and the endpoint will be automatically shut down after 2 hours.\n",
"\n",
"Deployment should take 5-7 minutes. Please see the model deployment sample notebook for details."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"deployment_name = f\"{model_name}-{dataset_name}\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response = arcee.start_deployment(deployment_name=deployment_name, pretraining=pretraining_name)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"while True:\n",
" response = arcee.deployment_status(deployment_name)\n",
" if response[\"deployment_processing_state\"] == \"pending\":\n",
" print(\"Deployment is in progress. Waiting 60 seconds before checking again.\")\n",
" sleep(60)\n",
" else:\n",
" print(response)\n",
" break"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once the model endpoint is up and running, we can prompt the model with a domain-specific question."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = \"Is solar a good way to achieve net zero?\"\n",
"\n",
"response = arcee.generate(deployment_name=deployment_name, query=query)\n",
"print(response[\"text\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Stopping our deployment\n",
"\n",
"When we're done working with our model, we should stop the deployment to save resources and avoid unwanted charges.\n",
"\n",
"The `stop_deployment()` API only requires the deployment name."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"arcee.stop_deployment(deployment_name=deployment_name)\n",
"arcee.deployment_status(deployment_name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This concludes the model pretraining demonstration. Thank you for your time!\n",
"\n",
"If you'd like to know more about using Arcee Cloud in your organization, please visit the [Arcee website](https://www.arcee.ai), or contact [sales@arcee.ai](mailto:sales@arcee.ai).\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Binary file added notebooks/model_pretraining_01.png
