|
710 | 710 | {
|
711 | 711 | "cell_type": "code",
|
712 | 712 | "execution_count": null,
|
713 |
| - "id": "102ddd7d-9421-4043-8b6c-5c98abf0a5c6", |
| 713 | + "id": "96972e35-62fe-49bf-a825-357e3addbe54", |
714 | 714 | "metadata": {},
|
715 | 715 | "outputs": [],
|
716 | 716 | "source": [
|
|
727 | 727 | "source": [
|
728 | 728 | "It's definitely not stellar, but this is a solid starting point given that we only gave it 12 labeled points and 5 features, covering a total of only 15 keywords."
|
729 | 729 | ]
|
| 730 | + }, |
| 731 | + { |
| 732 | + "cell_type": "markdown", |
| 733 | + "id": "be982bb2-a74d-4696-a5dc-9834017bcbe1", |
| 734 | + "metadata": {}, |
| 735 | + "source": [ |
| 736 | + "While one of the main use cases for this tool is interactively filtering down a dataset, it trains a model under the hood, so the resulting model can in principle be applied to other datasets. As an example, we demonstrate this on the testing subset of the 20 newsgroups data:" |
| 737 | + ] |
| 738 | + }, |
| 739 | + { |
| 740 | + "cell_type": "code", |
| 741 | + "execution_count": null, |
| 742 | + "id": "d77f39d9-9434-4c3e-9aa3-0da6a2f69a28", |
| 743 | + "metadata": {}, |
| 744 | + "outputs": [], |
| 745 | + "source": [ |
| 746 | + "dataset_test = fetch_20newsgroups(subset=\"test\")\n", |
| 747 | + "df_test = pd.DataFrame({\"text\": dataset_test[\"data\"], \"category\": [dataset_test[\"target_names\"][i] for i in dataset_test[\"target\"]]})\n", |
| 748 | + "df_test.category.value_counts()" |
| 749 | + ] |
| 750 | + }, |
| 751 | + { |
| 752 | + "cell_type": "markdown", |
| 753 | + "id": "7afccd3e-bbe1-433d-bed5-67453114ae11", |
| 754 | + "metadata": {}, |
| 755 | + "source": [ |
| 756 | + "Running the model on new data can be done in a few ways:\n", |
| 757 | + "\n", |
| 758 | + "1. Call `model.predict()`, passing in the dataframe to run predictions on" |
| 759 | + ] |
| 760 | + }, |
| 761 | + { |
| 762 | + "cell_type": "code", |
| 763 | + "execution_count": null, |
| 764 | + "id": "6c3c62f0-7bb6-4952-bb63-236e20f7bd57", |
| 765 | + "metadata": {}, |
| 766 | + "outputs": [], |
| 767 | + "source": [ |
| 768 | + "preds = model.predict(df_test)\n", |
| 769 | + "np.where(preds >= .5)" |
| 770 | + ] |
| 771 | + }, |
| 772 | + { |
| 773 | + "cell_type": "markdown", |
| 774 | + "id": "3ea92fd0-4215-4026-8418-3719311ca22a", |
| 775 | + "metadata": {}, |
| 776 | + "source": [ |
| 777 | + "2. Set the active data in the model to the new dataframe and explore/label/analyze in the interface as normal, using `model.data.set_data()`" |
| 778 | + ] |
| 779 | + }, |
| 780 | + { |
| 781 | + "cell_type": "code", |
| 782 | + "execution_count": null, |
| 783 | + "id": "3d3bddc5-d9ac-43ee-9517-95ec66ae48af", |
| 784 | + "metadata": {}, |
| 785 | + "outputs": [], |
| 786 | + "source": [ |
| 787 | + "model.data.set_data(df_test)" |
| 788 | + ] |
| 789 | + }, |
| 790 | + { |
| 791 | + "cell_type": "code", |
| 792 | + "execution_count": null, |
| 793 | + "id": "8ae5e890-7676-4d04-bcc8-66e500c73e75", |
| 794 | + "metadata": {}, |
| 795 | + "outputs": [], |
| 796 | + "source": [ |
| 797 | + "model.data.active_data[model.data.active_data._pred >= .5]" |
| 798 | + ] |
| 799 | + }, |
| 800 | + { |
| 801 | + "cell_type": "markdown", |
| 802 | + "id": "97f36c4e-cd5e-4c3c-af54-d3fb2d986241", |
| 803 | + "metadata": {}, |
| 804 | + "source": [ |
| 805 | + "<div style=\"margin-left: 20px; background-color: #00796B; color: white; padding: 10px;\">\n", |
| 806 | + "Note: although the interface no longer has the original dataset, this is \"non-destructive\" to the model - all labeled data is separately copied into <code>model.training_data</code>, so additional points can be labeled, new features can be added, etc. without losing any of the \"original\" signal.\n", |
| 807 | + "</div>" |
| 808 | + ] |
| 809 | + }, |
| 810 | + { |
| 811 | + "cell_type": "code", |
| 812 | + "execution_count": null, |
| 813 | + "id": "d671eb78-4fd4-4006-a442-7ae449f5cadb", |
| 814 | + "metadata": {}, |
| 815 | + "outputs": [], |
| 816 | + "source": [ |
| 817 | + "model.training_data" |
| 818 | + ] |
| 819 | + }, |
| 820 | + { |
| 821 | + "cell_type": "markdown", |
| 822 | + "id": "84a2065c-9a34-4fe6-98a2-4e81c1268f46", |
| 823 | + "metadata": {}, |
| 824 | + "source": [ |
| 825 | + "3. The underlying scikit-learn model (a logistic regression) is accessible at `model.classifier`, so you could in principle use it directly (note that this assumes you separately \"featurize\" any new data yourself, e.g. with `model.featurize()`)" |
| 826 | + ] |
| 827 | + }, |
| 828 | + { |
| 829 | + "cell_type": "code", |
| 830 | + "execution_count": null, |
| 831 | + "id": "d9fef6ab-ea0d-4978-82fe-2872aedbca5c", |
| 832 | + "metadata": {}, |
| 833 | + "outputs": [], |
| 834 | + "source": [ |
| 835 | + "model.classifier, model.classifier.coef_" |
| 836 | + ] |
| 837 | + }, |
| 838 | + { |
| 839 | + "cell_type": "code", |
| 840 | + "execution_count": null, |
| 841 | + "id": "48ec3627-1c4b-4705-a652-44894d20387f", |
| 842 | + "metadata": {}, |
| 843 | + "outputs": [], |
| 844 | + "source": [ |
| 845 | + "featurized_df = model.featurize(df_test, normalize=False).drop(columns=[\"text\", \"category\"])\n", |
| 846 | + "np.where(model.classifier.predict_proba(featurized_df)[:,1] >= .5)" |
| 847 | + ] |
730 | 848 | }
|
731 | 849 | ],
|
732 | 850 | "metadata": {
|