From 9813d13402063a9ca62a32c858dab8f754b72208 Mon Sep 17 00:00:00 2001 From: Winterflower Date: Thu, 5 Dec 2019 14:18:27 +0100 Subject: [PATCH 1/8] Adds first version of classification example companion notebook --- .../ml-analytics-classification.ipynb | 376 ++++++++++++++++++ 1 file changed, 376 insertions(+) create mode 100644 Machine Learning/Analytics Jupyter Notebooks/ml-analytics-classification.ipynb diff --git a/Machine Learning/Analytics Jupyter Notebooks/ml-analytics-classification.ipynb b/Machine Learning/Analytics Jupyter Notebooks/ml-analytics-classification.ipynb new file mode 100644 index 00000000..03148a69 --- /dev/null +++ b/Machine Learning/Analytics Jupyter Notebooks/ml-analytics-classification.ipynb @@ -0,0 +1,376 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Predicting delayed flights with classification analysis\n", + "\n", + "[This is a companion Jupyter notebook to the documentation example.](https://www.elastic.co/guide/en/elastic-stack-overview/7.5/flightdata-classification.html)\n", + "\n", + "Let’s try to predict whether a flight will be delayed or not by using the sample flight data. We want to be able to use information such as weather conditions, carrier, flight distance, origin, or destination to predict flight delays. There are only two possible outcome values: the flight is either delayed or not, therefore we use binary classification to make the prediction.\n", + "\n", + "We have chosen this dataset as an example because it is easily accessible for Kibana users and the use case is relevant. However, the data has been manually created and contains some inconsistencies. For example, a flight can be both delayed and canceled. 
Please remember that the quality of your input data will affect the quality of results.\n", + "\n", + "Each document in the dataset contains details for a single flight, so this data is ready for analysis as it is already in a two-dimensional entity-based data structure (data frame). In general, you often need to transform the data into an entity-centric index before you analyze the data.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "## imports\n", + "import pprint\n", + "\n", + "from elasticsearch import Elasticsearch\n", + "import requests\n", + "## create a client to connect to Elasticsearch\n", + "es_url = 'http://localhost:9200'\n", + "es_client = Elasticsearch()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Example Document" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "## insert example of reading docs from ES index" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice that each document contains a FlightDelay field with a boolean value. Classification is a supervised machine learning analysis and therefore needs to train on data that contains the ground truth, known as the dependent_variable. In this example, the ground truth is available in each document as the actual value of FlightDelay. In order to be analyzed, a document must contain at least one field with a supported data type (numeric, boolean, text, keyword or ip) and must not contain arrays with more than one item.\n", + "\n", + "If your source data consists of some documents that contain a dependent variable and some that do not, the model is trained on the subset of documents that contain ground truth. By default, all of that subset of documents is used for training. However, you can choose to specify a percentage of the documents as your training data. Predictions are made against all of the data. 
The current implementation of classification analysis supports a single batch analysis for both training and predictions." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Creating a Classification Model" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'allow_lazy_start': False,\n", + " 'analysis': {'classification': {'dependent_variable': 'FlightDelay',\n", + " 'num_top_classes': 2,\n", + " 'prediction_field_name': 'FlightDelay_prediction',\n", + " 'training_percent': 10.0}},\n", + " 'analyzed_fields': {'excludes': ['Cancelled',\n", + " 'FlightDelayMin',\n", + " 'FlightDelayType'],\n", + " 'includes': []},\n", + " 'create_time': 1575548781478,\n", + " 'dest': {'index': 'df-flight-delayed', 'results_field': 'ml'},\n", + " 'id': 'model-flight-delay-classification',\n", + " 'model_memory_limit': '100mb',\n", + " 'source': {'index': ['kibana_sample_data_flights'],\n", + " 'query': {'match_all': {}}},\n", + " 'version': '8.0.0'}\n" + ] + } + ], + "source": [ + "# 1. 
Creating a classification job \n", + "\n", + "endpoint_url = \"/_ml/data_frame/analytics/model-flight-delay-classification\"\n", + "\n", + "job_config = {\n", + " \"source\": {\n", + " \"index\": [\n", + " \"kibana_sample_data_flights\" \n", + " ]\n", + " },\n", + " \"dest\": {\n", + " \"index\": \"df-flight-delayed\", \n", + " \"results_field\": \"ml\" \n", + " },\n", + " \"analysis\": {\n", + " \"classification\": {\n", + " \"dependent_variable\": \"FlightDelay\", \n", + " \"training_percent\": 10 \n", + " }\n", + " },\n", + " \"analyzed_fields\": {\n", + " \"includes\": [],\n", + " \"excludes\": [ \n", + " \"Cancelled\",\n", + " \"FlightDelayMin\",\n", + " \"FlightDelayType\"\n", + " ]\n", + " },\n", + " \"model_memory_limit\": \"100mb\"}\n", + "\n", + "result = requests.put(es_url+endpoint_url, json=job_config)\n", + "pprint.pprint(result.json())\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'error': {'caused_by': {'reason': 'task with id '\n", + " '{data_frame_analytics-model-flight-delay-classification} '\n", + " 'already exist',\n", + " 'type': 'resource_already_exists_exception'},\n", + " 'reason': 'Cannot start data frame analytics '\n", + " '[model-flight-delay-classification] because it has '\n", + " 'already been started',\n", + " 'root_cause': [{'reason': 'task with id '\n", + " '{data_frame_analytics-model-flight-delay-classification} '\n", + " 'already exist',\n", + " 'type': 'resource_already_exists_exception'}],\n", + " 'type': 'status_exception'},\n", + " 'status': 409}\n" + ] + } + ], + "source": [ + "# 2. Start the job\n", + "\n", + "start_endpoint = \"/_ml/data_frame/analytics/model-flight-delay-classification/_start\"\n", + "result = requests.post(es_url+start_endpoint)\n", + "pprint.pprint(result.json())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The job takes a few minutes to run. 
Runtime depends on the local hardware and also on the number of documents and fields that are analyzed. The more fields and documents, the longer the job runs." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'count': 1,\n", + " 'data_frame_analytics': [{'id': 'model-flight-delay-classification',\n", + " 'progress': [{'phase': 'reindexing',\n", + " 'progress_percent': 100},\n", + " {'phase': 'loading_data',\n", + " 'progress_percent': 100},\n", + " {'phase': 'analyzing',\n", + " 'progress_percent': 100},\n", + " {'phase': 'writing_results',\n", + " 'progress_percent': 100}],\n", + " 'state': 'stopped'}]}\n" + ] + } + ], + "source": [ + "# 3. Check the job stats\n", + "\n", + "stats_endpoint = \"/_ml/data_frame/analytics/model-flight-delay-classification/_stats\"\n", + "result = requests.get(es_url+stats_endpoint)\n", + "pprint.pprint(result.json())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## View Classification Results" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "# insert code to get results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The example above shows that the analysis has predicted the probability of all possible classes. In this case, there are two classes: true and false. The class names along with the probability of the given classes are displayed in the top_classes object. The most probable class is the prediction. In the example above, false has a class_probability of 0.94 while true has only 0.06, so the prediction will be false which coincides with the ground truth contained by the FlightDelay field. The class probability values help you understand how sure the model is about the prediction. The higher number means that the model is more confident." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Evaluating Results\n", + "The results can be evaluated for documents which contain both the ground truth field and the prediction. In the example below, FlightDelay contains the ground truth and the prediction is stored as FlightDelay_prediction.\n", + "\n", + "We use the data frame analytics evaluate API to evaluate the results. First, we want to know the training error that represents how well the model performed on the training dataset. In the previous step, we saw that the new index contained a field that indicated which documents were used as training data, which we can now use to calculate the training error:" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'classification': {'multiclass_confusion_matrix': {'confusion_matrix': [{'actual_class': 'false',\n", + " 'actual_class_doc_count': 1000,\n", + " 'predicted_classes': [{'predicted_class': 'false', 'count': 904},\n", + " {'predicted_class': 'true', 'count': 96}],\n", + " 'other_predicted_class_doc_count': 0},\n", + " {'actual_class': 'true',\n", + " 'actual_class_doc_count': 334,\n", + " 'predicted_classes': [{'predicted_class': 'false', 'count': 20},\n", + " {'predicted_class': 'true', 'count': 314}],\n", + " 'other_predicted_class_doc_count': 0}],\n", + " 'other_actual_class_count': 0}}}" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# compute the training error\n", + "\n", + "evaluate_endpoint = \"/_ml/data_frame/_evaluate\"\n", + "\n", + "config = {\n", + " \"index\": \"df-flight-delayed\", \n", + " \"query\": {\n", + " \"term\": {\n", + " \"ml.is_training\": {\n", + " \"value\": True \n", + " }\n", + " }\n", + " },\n", + " \"evaluation\": {\n", + " \"classification\": {\n", + " \"actual_field\": \"FlightDelay\", \n", + " \"predicted_field\": \"ml.FlightDelay_prediction\", \n", + 
" \"metrics\": {\n", + " \"multiclass_confusion_matrix\" : {}\n", + " }\n", + " }\n", + " }\n", + "}\n", + "\n", + "result = requests.post(es_url+evaluate_endpoint, json=config)\n", + "result.json()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, we calculate the generalization error that represents how well the model performed on previously unseen data. The returned confusion matrix shows us how many datapoints were classified correctly (where the actual_class matches the predicted_class) and how many were misclassified (actual_class does not match predicted_class):" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'classification': {'multiclass_confusion_matrix': {'confusion_matrix': [{'actual_class': 'false',\n", + " 'actual_class_doc_count': 8779,\n", + " 'predicted_classes': [{'predicted_class': 'false', 'count': 7176},\n", + " {'predicted_class': 'true', 'count': 1603}],\n", + " 'other_predicted_class_doc_count': 0},\n", + " {'actual_class': 'true',\n", + " 'actual_class_doc_count': 2946,\n", + " 'predicted_classes': [{'predicted_class': 'false', 'count': 963},\n", + " {'predicted_class': 'true', 'count': 1983}],\n", + " 'other_predicted_class_doc_count': 0}],\n", + " 'other_actual_class_count': 0}}}" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# compute the generalization error\n", + "\n", + "config = {\n", + " \"index\": \"df-flight-delayed\", \n", + " \"query\": {\n", + " \"term\": {\n", + " \"ml.is_training\": {\n", + " \"value\": False\n", + " }\n", + " }\n", + " },\n", + " \"evaluation\": {\n", + " \"classification\": {\n", + " \"actual_field\": \"FlightDelay\", \n", + " \"predicted_field\": \"ml.FlightDelay_prediction\", \n", + " \"metrics\": {\n", + " \"multiclass_confusion_matrix\" : {}\n", + " }\n", + " }\n", + " }\n", + "}\n", + "\n", + "result = 
requests.post(es_url+evaluate_endpoint, json=config)\n", + "result.json()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} From 7f9897835fda6633620191f4aeb882915185340b Mon Sep 17 00:00:00 2001 From: Winterflower Date: Thu, 5 Dec 2019 14:19:43 +0100 Subject: [PATCH 2/8] Adds requirements for running jupyter and connecting to ES --- ...-analytics-classification-requirements.txt | 53 +++++++++++++++++++ 1 file changed, 53 insertions(+) create mode 100644 Machine Learning/Analytics Jupyter Notebooks/ml-analytics-classification-requirements.txt diff --git a/Machine Learning/Analytics Jupyter Notebooks/ml-analytics-classification-requirements.txt b/Machine Learning/Analytics Jupyter Notebooks/ml-analytics-classification-requirements.txt new file mode 100644 index 00000000..e19a32f5 --- /dev/null +++ b/Machine Learning/Analytics Jupyter Notebooks/ml-analytics-classification-requirements.txt @@ -0,0 +1,53 @@ +appnope==0.1.0 +attrs==19.3.0 +backcall==0.1.0 +bleach==3.1.0 +certifi==2019.11.28 +chardet==3.0.4 +decorator==4.4.1 +defusedxml==0.6.0 +elasticsearch==7.1.0 +entrypoints==0.3 +idna==2.8 +importlib-metadata==1.2.0 +ipykernel==5.1.3 +ipython==7.10.1 +ipython-genutils==0.2.0 +ipywidgets==7.5.1 +jedi==0.15.1 +Jinja2==2.10.3 +jsonschema==3.2.0 +jupyter==1.0.0 +jupyter-client==5.3.4 +jupyter-console==6.0.0 +jupyter-core==4.6.1 +MarkupSafe==1.1.1 +mistune==0.8.4 +more-itertools==8.0.0 +nbconvert==5.6.1 +nbformat==4.4.0 +notebook==6.0.2 +pandocfilters==1.4.2 +parso==0.5.1 +pexpect==4.7.0 +pickleshare==0.7.5 +prometheus-client==0.7.1 
+prompt-toolkit==2.0.10 +ptyprocess==0.6.0 +Pygments==2.5.2 +pyrsistent==0.15.6 +python-dateutil==2.8.1 +pyzmq==18.1.1 +qtconsole==4.6.0 +requests==2.22.0 +Send2Trash==1.5.0 +six==1.13.0 +terminado==0.8.3 +testpath==0.4.4 +tornado==6.0.3 +traitlets==4.3.3 +urllib3==1.25.7 +wcwidth==0.1.7 +webencodings==0.5.1 +widgetsnbextension==3.5.1 +zipp==0.6.0 From 4032969e7389f911f74be5cfc4a49e0a5b69597b Mon Sep 17 00:00:00 2001 From: Winterflower Date: Fri, 6 Dec 2019 11:40:09 +0100 Subject: [PATCH 3/8] Adds code to read docs from flights index --- .../ml-analytics-classification.ipynb | 107 ++++++++++++++++-- 1 file changed, 100 insertions(+), 7 deletions(-) diff --git a/Machine Learning/Analytics Jupyter Notebooks/ml-analytics-classification.ipynb b/Machine Learning/Analytics Jupyter Notebooks/ml-analytics-classification.ipynb index 03148a69..4d9b1bef 100644 --- a/Machine Learning/Analytics Jupyter Notebooks/ml-analytics-classification.ipynb +++ b/Machine Learning/Analytics Jupyter Notebooks/ml-analytics-classification.ipynb @@ -40,11 +40,54 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 25, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "{'hits': {'hits': [{'_index': 'kibana_sample_data_flights',\n", + " '_id': 'HZL81W4BszKtAdTQ9e-h',\n", + " '_score': 1.0,\n", + " '_source': {'FlightNum': '9HY9SWR',\n", + " 'DestCountry': 'AU',\n", + " 'OriginWeather': 'Sunny',\n", + " 'OriginCityName': 'Frankfurt am Main',\n", + " 'AvgTicketPrice': 841.2656419677076,\n", + " 'DistanceMiles': 10247.856675613455,\n", + " 'FlightDelay': False,\n", + " 'DestWeather': 'Rain',\n", + " 'Dest': 'Sydney Kingsford Smith International Airport',\n", + " 'FlightDelayType': 'No Delay',\n", + " 'OriginCountry': 'DE',\n", + " 'dayOfWeek': 0,\n", + " 'DistanceKilometers': 16492.32665375846,\n", + " 'timestamp': '2019-11-25T00:00:00',\n", + " 'DestLocation': {'lat': '-33.94609833', 'lon': '151.177002'},\n", + " 'DestAirportID': 'SYD',\n", 
+ " 'Carrier': 'Kibana Airlines',\n", + " 'Cancelled': False,\n", + " 'FlightTimeMin': 1030.7704158599038,\n", + " 'Origin': 'Frankfurt am Main Airport',\n", + " 'OriginLocation': {'lat': '50.033333', 'lon': '8.570556'},\n", + " 'DestRegion': 'SE-BD',\n", + " 'OriginAirportID': 'FRA',\n", + " 'OriginRegion': 'DE-HE',\n", + " 'DestCityName': 'Sydney',\n", + " 'FlightTimeHour': 17.179506930998397,\n", + " 'FlightDelayMin': 0}}]}}" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "## insert example of reading docs from ES index" + "## insert example of reading docs from ES index\n", + "\n", + "results = es_client.search(index='kibana_sample_data_flights', filter_path=['hits.hits._*'], size=1)\n", + "results" ] }, { @@ -207,18 +250,68 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 27, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "{'hits': {'hits': [{'_index': 'df-flight-delayed',\n", + " '_id': '-5L81W4BszKtAdTQ-PZn',\n", + " '_score': 1.0,\n", + " '_source': {'FlightNum': 'K29GWY6',\n", + " 'Origin': 'Syracuse Hancock International Airport',\n", + " 'OriginLocation': {'lon': '-76.10630035', 'lat': '43.11119843'},\n", + " 'DestLocation': {'lon': '128.445007', 'lat': '51.169997'},\n", + " 'FlightDelay': True,\n", + " 'DistanceMiles': 5774.020251542041,\n", + " 'FlightTimeMin': 799.492323179845,\n", + " 'OriginWeather': 'Cloudy',\n", + " 'dayOfWeek': 6,\n", + " 'AvgTicketPrice': 669.8655726244879,\n", + " 'Carrier': 'ES-Air',\n", + " 'FlightDelayMin': 180,\n", + " 'OriginRegion': 'US-NY',\n", + " 'FlightDelayType': 'Late Aircraft Delay',\n", + " 'DestAirportID': 'XHBU',\n", + " 'timestamp': '2019-12-01T02:53:59',\n", + " 'Dest': 'Ukrainka Air Base',\n", + " 'FlightTimeHour': 13.324872052997417,\n", + " 'Cancelled': False,\n", + " 'DistanceKilometers': 9292.384847697675,\n", + " 'OriginCityName': 'Syracuse',\n", + " 'DestWeather': 'Rain',\n", + " 
'OriginCountry': 'US',\n", + " 'ml__id_copy': '-5L81W4BszKtAdTQ-PZn',\n", + " 'DestCountry': 'RU',\n", + " 'DestRegion': 'RU-AMU',\n", + " 'OriginAirportID': 'SYR',\n", + " 'DestCityName': 'Belogorsk',\n", + " 'ml': {'top_classes': [{'class_probability': 0.919598560379138,\n", + " 'class_name': 'true'},\n", + " {'class_probability': 0.08040143962086195, 'class_name': 'false'}],\n", + " 'FlightDelay_prediction': 'true',\n", + " 'prediction_probability': 0.919598560379138,\n", + " 'is_training': True}}}]}}" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "# insert code to get results" + "# insert code to get results\n", + "\n", + "result = es_client.search(index='df-flight-delayed', filter_path=['hits.hits._*'], size=1)\n", + "result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The example above shows that the analysis has predicted the probability of all possible classes. In this case, there are two classes: true and false. The class names along with the probability of the given classes are displayed in the top_classes object. The most probable class is the prediction. In the example above, false has a class_probability of 0.94 while true has only 0.06, so the prediction will be false which coincides with the ground truth contained by the FlightDelay field. The class probability values help you understand how sure the model is about the prediction. The higher number means that the model is more confident." + "The example above shows that the analysis has predicted the probability of all possible classes. In this case, there are two classes: `true` and `false`. The class names along with the probability of the given classes are displayed in the top_classes object. The most probable class is the prediction. 
In the example above, `true` has a class_probability of 0.92 while `false` has only 0.08, so the prediction will be `true` which coincides with the ground truth contained by the FlightDelay field. The class probability values help you understand how sure the model is about the prediction. The higher number means that the model is more confident." ] }, { From 6b97effd4aa476084cb1a81349210671ffb0a25f Mon Sep 17 00:00:00 2001 From: Winterflower Date: Mon, 16 Dec 2019 17:14:54 +0100 Subject: [PATCH 4/8] Adds paragraph on training percentage --- .../ml-analytics-classification.ipynb | 150 ++++-------------- 1 file changed, 32 insertions(+), 118 deletions(-) diff --git a/Machine Learning/Analytics Jupyter Notebooks/ml-analytics-classification.ipynb b/Machine Learning/Analytics Jupyter Notebooks/ml-analytics-classification.ipynb index 4d9b1bef..83acc37b 100644 --- a/Machine Learning/Analytics Jupyter Notebooks/ml-analytics-classification.ipynb +++ b/Machine Learning/Analytics Jupyter Notebooks/ml-analytics-classification.ipynb @@ -40,54 +40,11 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'hits': {'hits': [{'_index': 'kibana_sample_data_flights',\n", - " '_id': 'HZL81W4BszKtAdTQ9e-h',\n", - " '_score': 1.0,\n", - " '_source': {'FlightNum': '9HY9SWR',\n", - " 'DestCountry': 'AU',\n", - " 'OriginWeather': 'Sunny',\n", - " 'OriginCityName': 'Frankfurt am Main',\n", - " 'AvgTicketPrice': 841.2656419677076,\n", - " 'DistanceMiles': 10247.856675613455,\n", - " 'FlightDelay': False,\n", - " 'DestWeather': 'Rain',\n", - " 'Dest': 'Sydney Kingsford Smith International Airport',\n", - " 'FlightDelayType': 'No Delay',\n", - " 'OriginCountry': 'DE',\n", - " 'dayOfWeek': 0,\n", - " 'DistanceKilometers': 16492.32665375846,\n", - " 'timestamp': '2019-11-25T00:00:00',\n", - " 'DestLocation': {'lat': '-33.94609833', 'lon': '151.177002'},\n", - " 'DestAirportID': 'SYD',\n", - " 
'Carrier': 'Kibana Airlines',\n", - " 'Cancelled': False,\n", - " 'FlightTimeMin': 1030.7704158599038,\n", - " 'Origin': 'Frankfurt am Main Airport',\n", - " 'OriginLocation': {'lat': '50.033333', 'lon': '8.570556'},\n", - " 'DestRegion': 'SE-BD',\n", - " 'OriginAirportID': 'FRA',\n", - " 'OriginRegion': 'DE-HE',\n", - " 'DestCityName': 'Sydney',\n", - " 'FlightTimeHour': 17.179506930998397,\n", - " 'FlightDelayMin': 0}}]}}" - ] - }, - "execution_count": 25, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ - "## insert example of reading docs from ES index\n", - "\n", - "results = es_client.search(index='kibana_sample_data_flights', filter_path=['hits.hits._*'], size=1)\n", - "results" + "## insert example of reading docs from ES index" ] }, { @@ -108,29 +65,21 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "{'allow_lazy_start': False,\n", - " 'analysis': {'classification': {'dependent_variable': 'FlightDelay',\n", - " 'num_top_classes': 2,\n", - " 'prediction_field_name': 'FlightDelay_prediction',\n", - " 'training_percent': 10.0}},\n", - " 'analyzed_fields': {'excludes': ['Cancelled',\n", - " 'FlightDelayMin',\n", - " 'FlightDelayType'],\n", - " 'includes': []},\n", - " 'create_time': 1575548781478,\n", - " 'dest': {'index': 'df-flight-delayed', 'results_field': 'ml'},\n", - " 'id': 'model-flight-delay-classification',\n", - " 'model_memory_limit': '100mb',\n", - " 'source': {'index': ['kibana_sample_data_flights'],\n", - " 'query': {'match_all': {}}},\n", - " 'version': '8.0.0'}\n" + "{'error': {'reason': 'A data frame analytics with id '\n", + " '[model-flight-delay-classification] already exists',\n", + " 'root_cause': [{'reason': 'A data frame analytics with id '\n", + " '[model-flight-delay-classification] '\n", + " 'already exists',\n", + " 'type': 'resource_already_exists_exception'}],\n", + " 
'type': 'resource_already_exists_exception'},\n", + " 'status': 400}\n" ] } ], @@ -152,7 +101,7 @@ " \"analysis\": {\n", " \"classification\": {\n", " \"dependent_variable\": \"FlightDelay\", \n", - " \"training_percent\": 10 \n", + " \"training_percent\": 10 # see comment below for a discussion on training percentage\n", " }\n", " },\n", " \"analyzed_fields\": {\n", @@ -170,6 +119,21 @@ "\n" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### A brief note on training percentage\n", + "\n", + "As you may have noticed, in the job configuration above we set the value of `training_percent` to 10. This means that out of the whole Flights dataset 10 percent of the data will be used to train model and the remaining 90 percent of the data will be used for testing the model. \n", + "You might wonder at this point, what is the best percentage for the train/test split and how you should choose what percentage to use in your own job? The answer will usually depend on your particular situation. In general it is useful to consider some of the following tradeoffs.\n", + "The more data you supply to the model at training time, the more examples the model will have to learn from, which usually leads to a better classification performance. However, more training data will also increase the training time of the model and at some point, providing the model with more training examples will only result in marginal increase in accuracy. \n", + "\n", + "Moreover, the more data you use for training, the less data you have for the testing phase. This means that you will have less previously unseen examples to show your model and thus perhaps your estimate for the generalization error will not be as accurate. \n", + "\n", + "In general, for datasets containing several thousand docs or more, start with a low 5-10% training percentage and see how your results and runtime evolve as you increase the training percentage. 
" + ] + }, { "cell_type": "code", "execution_count": 12, @@ -250,68 +214,18 @@ }, { "cell_type": "code", - "execution_count": 27, + "execution_count": 14, "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'hits': {'hits': [{'_index': 'df-flight-delayed',\n", - " '_id': '-5L81W4BszKtAdTQ-PZn',\n", - " '_score': 1.0,\n", - " '_source': {'FlightNum': 'K29GWY6',\n", - " 'Origin': 'Syracuse Hancock International Airport',\n", - " 'OriginLocation': {'lon': '-76.10630035', 'lat': '43.11119843'},\n", - " 'DestLocation': {'lon': '128.445007', 'lat': '51.169997'},\n", - " 'FlightDelay': True,\n", - " 'DistanceMiles': 5774.020251542041,\n", - " 'FlightTimeMin': 799.492323179845,\n", - " 'OriginWeather': 'Cloudy',\n", - " 'dayOfWeek': 6,\n", - " 'AvgTicketPrice': 669.8655726244879,\n", - " 'Carrier': 'ES-Air',\n", - " 'FlightDelayMin': 180,\n", - " 'OriginRegion': 'US-NY',\n", - " 'FlightDelayType': 'Late Aircraft Delay',\n", - " 'DestAirportID': 'XHBU',\n", - " 'timestamp': '2019-12-01T02:53:59',\n", - " 'Dest': 'Ukrainka Air Base',\n", - " 'FlightTimeHour': 13.324872052997417,\n", - " 'Cancelled': False,\n", - " 'DistanceKilometers': 9292.384847697675,\n", - " 'OriginCityName': 'Syracuse',\n", - " 'DestWeather': 'Rain',\n", - " 'OriginCountry': 'US',\n", - " 'ml__id_copy': '-5L81W4BszKtAdTQ-PZn',\n", - " 'DestCountry': 'RU',\n", - " 'DestRegion': 'RU-AMU',\n", - " 'OriginAirportID': 'SYR',\n", - " 'DestCityName': 'Belogorsk',\n", - " 'ml': {'top_classes': [{'class_probability': 0.919598560379138,\n", - " 'class_name': 'true'},\n", - " {'class_probability': 0.08040143962086195, 'class_name': 'false'}],\n", - " 'FlightDelay_prediction': 'true',\n", - " 'prediction_probability': 0.919598560379138,\n", - " 'is_training': True}}}]}}" - ] - }, - "execution_count": 27, - "metadata": {}, - "output_type": "execute_result" - } - ], + "outputs": [], "source": [ - "# insert code to get results\n", - "\n", - "result = es_client.search(index='df-flight-delayed', 
filter_path=['hits.hits._*'], size=1)\n", - "result" + "# insert code to get results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The example above shows that the analysis has predicted the probability of all possible classes. In this case, there are two classes: `true` and `false`. The class names along with the probability of the given classes are displayed in the top_classes object. The most probable class is the prediction. In the example above, `true` has a class_probability of 0.92 while `false` has only 0.08, so the prediction will be `true` which coincides with the ground truth contained by the FlightDelay field. The class probability values help you understand how sure the model is about the prediction. The higher number means that the model is more confident." + "The example above shows that the analysis has predicted the probability of all possible classes. In this case, there are two classes: true and false. The class names along with the probability of the given classes are displayed in the top_classes object. The most probable class is the prediction. In the example above, false has a class_probability of 0.94 while true has only 0.06, so the prediction will be false which coincides with the ground truth contained by the FlightDelay field. The class probability values help you understand how sure the model is about the prediction. The higher number means that the model is more confident." 
] }, { From 36b965a2dc21ab01ddce95cc587e83550bfb817a Mon Sep 17 00:00:00 2001 From: Winterflower Date: Tue, 17 Dec 2019 12:38:11 +0100 Subject: [PATCH 5/8] Adds query body to search to obtain a datapoint that is not part of the training set --- .../ml-analytics-classification.ipynb | 150 ++++++++++++++---- 1 file changed, 118 insertions(+), 32 deletions(-) diff --git a/Machine Learning/Analytics Jupyter Notebooks/ml-analytics-classification.ipynb b/Machine Learning/Analytics Jupyter Notebooks/ml-analytics-classification.ipynb index 83acc37b..f2cdc6c8 100644 --- a/Machine Learning/Analytics Jupyter Notebooks/ml-analytics-classification.ipynb +++ b/Machine Learning/Analytics Jupyter Notebooks/ml-analytics-classification.ipynb @@ -40,11 +40,54 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 25, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "{'hits': {'hits': [{'_index': 'kibana_sample_data_flights',\n", + " '_id': 'HZL81W4BszKtAdTQ9e-h',\n", + " '_score': 1.0,\n", + " '_source': {'FlightNum': '9HY9SWR',\n", + " 'DestCountry': 'AU',\n", + " 'OriginWeather': 'Sunny',\n", + " 'OriginCityName': 'Frankfurt am Main',\n", + " 'AvgTicketPrice': 841.2656419677076,\n", + " 'DistanceMiles': 10247.856675613455,\n", + " 'FlightDelay': False,\n", + " 'DestWeather': 'Rain',\n", + " 'Dest': 'Sydney Kingsford Smith International Airport',\n", + " 'FlightDelayType': 'No Delay',\n", + " 'OriginCountry': 'DE',\n", + " 'dayOfWeek': 0,\n", + " 'DistanceKilometers': 16492.32665375846,\n", + " 'timestamp': '2019-11-25T00:00:00',\n", + " 'DestLocation': {'lat': '-33.94609833', 'lon': '151.177002'},\n", + " 'DestAirportID': 'SYD',\n", + " 'Carrier': 'Kibana Airlines',\n", + " 'Cancelled': False,\n", + " 'FlightTimeMin': 1030.7704158599038,\n", + " 'Origin': 'Frankfurt am Main Airport',\n", + " 'OriginLocation': {'lat': '50.033333', 'lon': '8.570556'},\n", + " 'DestRegion': 'SE-BD',\n", + " 'OriginAirportID': 'FRA',\n", + " 
'OriginRegion': 'DE-HE',\n", + " 'DestCityName': 'Sydney',\n", + " 'FlightTimeHour': 17.179506930998397,\n", + " 'FlightDelayMin': 0}}]}}" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "## insert example of reading docs from ES index" + "## insert example of reading docs from ES index\n", + "\n", + "results = es_client.search(index='kibana_sample_data_flights', filter_path=['hits.hits._*'], size=1)\n", + "results" ] }, { @@ -65,21 +108,29 @@ }, { "cell_type": "code", - "execution_count": 28, + "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "{'error': {'reason': 'A data frame analytics with id '\n", - " '[model-flight-delay-classification] already exists',\n", - " 'root_cause': [{'reason': 'A data frame analytics with id '\n", - " '[model-flight-delay-classification] '\n", - " 'already exists',\n", - " 'type': 'resource_already_exists_exception'}],\n", - " 'type': 'resource_already_exists_exception'},\n", - " 'status': 400}\n" + "{'allow_lazy_start': False,\n", + " 'analysis': {'classification': {'dependent_variable': 'FlightDelay',\n", + " 'num_top_classes': 2,\n", + " 'prediction_field_name': 'FlightDelay_prediction',\n", + " 'training_percent': 10.0}},\n", + " 'analyzed_fields': {'excludes': ['Cancelled',\n", + " 'FlightDelayMin',\n", + " 'FlightDelayType'],\n", + " 'includes': []},\n", + " 'create_time': 1575548781478,\n", + " 'dest': {'index': 'df-flight-delayed', 'results_field': 'ml'},\n", + " 'id': 'model-flight-delay-classification',\n", + " 'model_memory_limit': '100mb',\n", + " 'source': {'index': ['kibana_sample_data_flights'],\n", + " 'query': {'match_all': {}}},\n", + " 'version': '8.0.0'}\n" ] } ], @@ -101,7 +152,7 @@ " \"analysis\": {\n", " \"classification\": {\n", " \"dependent_variable\": \"FlightDelay\", \n", - " \"training_percent\": 10 # see comment below for a discussion on training percentage\n", + " 
\"training_percent\": 10 \n", " }\n", " },\n", " \"analyzed_fields\": {\n", @@ -119,21 +170,6 @@ "\n" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### A brief note on training percentage\n", - "\n", - "As you may have noticed, in the job configuration above we set the value of `training_percent` to 10. This means that out of the whole Flights dataset 10 percent of the data will be used to train model and the remaining 90 percent of the data will be used for testing the model. \n", - "You might wonder at this point, what is the best percentage for the train/test split and how you should choose what percentage to use in your own job? The answer will usually depend on your particular situation. In general it is useful to consider some of the following tradeoffs.\n", - "The more data you supply to the model at training time, the more examples the model will have to learn from, which usually leads to a better classification performance. However, more training data will also increase the training time of the model and at some point, providing the model with more training examples will only result in marginal increase in accuracy. \n", - "\n", - "Moreover, the more data you use for training, the less data you have for the testing phase. This means that you will have less previously unseen examples to show your model and thus perhaps your estimate for the generalization error will not be as accurate. \n", - "\n", - "In general, for datasets containing several thousand docs or more, start with a low 5-10% training percentage and see how your results and runtime evolve as you increase the training percentage. 
" - ] - }, { "cell_type": "code", "execution_count": 12, @@ -214,18 +250,68 @@ }, { "cell_type": "code", - "execution_count": 14, + "execution_count": 29, "metadata": {}, - "outputs": [], + "outputs": [ + { + "data": { + "text/plain": [ + "{'hits': {'hits': [{'_index': 'df-flight-delayed',\n", + " '_id': '-5L81W4BszKtAdTQ-PZn',\n", + " '_score': 1.0,\n", + " '_source': {'FlightNum': 'K29GWY6',\n", + " 'Origin': 'Syracuse Hancock International Airport',\n", + " 'OriginLocation': {'lon': '-76.10630035', 'lat': '43.11119843'},\n", + " 'DestLocation': {'lon': '128.445007', 'lat': '51.169997'},\n", + " 'FlightDelay': True,\n", + " 'DistanceMiles': 5774.020251542041,\n", + " 'FlightTimeMin': 799.492323179845,\n", + " 'OriginWeather': 'Cloudy',\n", + " 'dayOfWeek': 6,\n", + " 'AvgTicketPrice': 669.8655726244879,\n", + " 'Carrier': 'ES-Air',\n", + " 'FlightDelayMin': 180,\n", + " 'OriginRegion': 'US-NY',\n", + " 'FlightDelayType': 'Late Aircraft Delay',\n", + " 'DestAirportID': 'XHBU',\n", + " 'timestamp': '2019-12-01T02:53:59',\n", + " 'Dest': 'Ukrainka Air Base',\n", + " 'FlightTimeHour': 13.324872052997417,\n", + " 'Cancelled': False,\n", + " 'DistanceKilometers': 9292.384847697675,\n", + " 'OriginCityName': 'Syracuse',\n", + " 'DestWeather': 'Rain',\n", + " 'OriginCountry': 'US',\n", + " 'ml__id_copy': '-5L81W4BszKtAdTQ-PZn',\n", + " 'DestCountry': 'RU',\n", + " 'DestRegion': 'RU-AMU',\n", + " 'OriginAirportID': 'SYR',\n", + " 'DestCityName': 'Belogorsk',\n", + " 'ml': {'top_classes': [{'class_probability': 0.919598560379138,\n", + " 'class_name': 'true'},\n", + " {'class_probability': 0.08040143962086195, 'class_name': 'false'}],\n", + " 'FlightDelay_prediction': 'true',\n", + " 'prediction_probability': 0.919598560379138,\n", + " 'is_training': True}}}]}}" + ] + }, + "execution_count": 29, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "# insert code to get results" + "# insert code to get results\n", + "query = {\"query\": {\"term\": 
{\"ml.is_training\": {\"value\": False }}}}\n", + "result = es_client.search(index='df-flight-delayed', filter_path=['hits.hits._*'], size=1, )\n", + "result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "The example above shows that the analysis has predicted the probability of all possible classes. In this case, there are two classes: true and false. The class names along with the probability of the given classes are displayed in the top_classes object. The most probable class is the prediction. In the example above, false has a class_probability of 0.94 while true has only 0.06, so the prediction will be false which coincides with the ground truth contained by the FlightDelay field. The class probability values help you understand how sure the model is about the prediction. The higher number means that the model is more confident." + "The example above shows that the analysis has predicted the probability of all possible classes. In this case, there are two classes: `true` and `false`. The class names along with the probability of the given classes are displayed in the top_classes object. The most probable class is the prediction. In the example above, `true` has a class_probability of 0.92 while `false` has only 0.08, so the prediction is `true`, which matches the ground truth in the FlightDelay field. The class probability values help you understand how sure the model is about the prediction. A higher value means that the model is more confident."
] }, { From 585e4fb3cddf1d7ac81d0b1e0fa14ff991312324 Mon Sep 17 00:00:00 2001 From: Winterflower Date: Tue, 17 Dec 2019 12:43:48 +0100 Subject: [PATCH 6/8] Adds paragraph about training percent --- .../ml-analytics-classification.ipynb | 91 +++++++++++-------- 1 file changed, 53 insertions(+), 38 deletions(-) diff --git a/Machine Learning/Analytics Jupyter Notebooks/ml-analytics-classification.ipynb b/Machine Learning/Analytics Jupyter Notebooks/ml-analytics-classification.ipynb index f2cdc6c8..09d449de 100644 --- a/Machine Learning/Analytics Jupyter Notebooks/ml-analytics-classification.ipynb +++ b/Machine Learning/Analytics Jupyter Notebooks/ml-analytics-classification.ipynb @@ -152,7 +152,7 @@ " \"analysis\": {\n", " \"classification\": {\n", " \"dependent_variable\": \"FlightDelay\", \n", - " \"training_percent\": 10 \n", + " \"training_percent\": 10 # see comment below on training percent\n", " }\n", " },\n", " \"analyzed_fields\": {\n", @@ -170,6 +170,21 @@ "\n" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### A brief note on training percentage\n", + "\n", + "As you may have noticed, in the job configuration above we set the value of `training_percent` to 10. This means that out of the whole Flights dataset, 10 percent of the data will be used to train the model and the remaining 90 percent will be used for testing the model. \n", + "You might wonder at this point what the best percentage for the train/test split is and how you should choose the percentage for your own job. The answer will usually depend on your particular situation. In general, it is useful to consider some of the following tradeoffs.\n", + "The more data you supply to the model at training time, the more examples the model will have to learn from, which usually leads to better classification performance.
However, more training data will also increase the training time of the model, and at some point, providing the model with more training examples will result in only a marginal increase in accuracy. \n", + "\n", + "Moreover, the more data you use for training, the less data you have for the testing phase. This means that you will have fewer previously unseen examples to show your model, so your estimate of the generalization error may be less accurate. \n", + "\n", + "In general, for datasets containing several thousand docs or more, start with a low 5-10% training percentage and see how your results and runtime evolve as you increase the training percentage. " + ] + }, { "cell_type": "code", "execution_count": 12, @@ -250,52 +265,52 @@ }, { "cell_type": "code", - "execution_count": 29, + "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'hits': {'hits': [{'_index': 'df-flight-delayed',\n", - " '_id': '-5L81W4BszKtAdTQ-PZn',\n", - " '_score': 1.0,\n", - " '_source': {'FlightNum': 'K29GWY6',\n", - " 'Origin': 'Syracuse Hancock International Airport',\n", - " 'OriginLocation': {'lon': '-76.10630035', 'lat': '43.11119843'},\n", - " 'DestLocation': {'lon': '128.445007', 'lat': '51.169997'},\n", - " 'FlightDelay': True,\n", - " 'DistanceMiles': 5774.020251542041,\n", - " 'FlightTimeMin': 799.492323179845,\n", - " 'OriginWeather': 'Cloudy',\n", - " 'dayOfWeek': 6,\n", - " 'AvgTicketPrice': 669.8655726244879,\n", - " 'Carrier': 'ES-Air',\n", - " 'FlightDelayMin': 180,\n", - " 'OriginRegion': 'US-NY',\n", - " 'FlightDelayType': 'Late Aircraft Delay',\n", - " 'DestAirportID': 'XHBU',\n", - " 'timestamp': '2019-12-01T02:53:59',\n", - " 'Dest': 'Ukrainka Air Base',\n", - " 'FlightTimeHour': 13.324872052997417,\n", + " '_id': '-5L81W4BszKtAdTQ-Pdn',\n", + " '_score': 0.10778817,\n", + " '_source': {'FlightNum': 'YHH7FJ3',\n", + " 'Origin': 'Rochester International Airport',\n", + " 'OriginLocation': {'lon': '-92.5', 'lat':
'43.90829849'},\n", + " 'DestLocation': {'lon': '-117.5339966', 'lat': '47.61989975'},\n", + " 'FlightDelay': False,\n", + " 'DistanceMiles': 1231.1973824768306,\n", + " 'FlightTimeMin': 94.35333906213299,\n", + " 'OriginWeather': 'Clear',\n", + " 'dayOfWeek': 0,\n", + " 'AvgTicketPrice': 126.52980202058899,\n", + " 'Carrier': 'Logstash Airways',\n", + " 'FlightDelayMin': 0,\n", + " 'OriginRegion': 'US-MN',\n", + " 'FlightDelayType': 'No Delay',\n", + " 'DestAirportID': 'GEG',\n", + " 'timestamp': '2019-12-02T12:16:06',\n", + " 'Dest': 'Spokane International Airport',\n", + " 'FlightTimeHour': 1.5725556510355498,\n", " 'Cancelled': False,\n", - " 'DistanceKilometers': 9292.384847697675,\n", - " 'OriginCityName': 'Syracuse',\n", - " 'DestWeather': 'Rain',\n", + " 'DistanceKilometers': 1981.4201203047928,\n", + " 'OriginCityName': 'Rochester',\n", + " 'DestWeather': 'Cloudy',\n", " 'OriginCountry': 'US',\n", - " 'ml__id_copy': '-5L81W4BszKtAdTQ-PZn',\n", - " 'DestCountry': 'RU',\n", - " 'DestRegion': 'RU-AMU',\n", - " 'OriginAirportID': 'SYR',\n", - " 'DestCityName': 'Belogorsk',\n", - " 'ml': {'top_classes': [{'class_probability': 0.919598560379138,\n", - " 'class_name': 'true'},\n", - " {'class_probability': 0.08040143962086195, 'class_name': 'false'}],\n", - " 'FlightDelay_prediction': 'true',\n", - " 'prediction_probability': 0.919598560379138,\n", - " 'is_training': True}}}]}}" + " 'ml__id_copy': '-5L81W4BszKtAdTQ-Pdn',\n", + " 'DestCountry': 'US',\n", + " 'DestRegion': 'US-WA',\n", + " 'OriginAirportID': 'RST',\n", + " 'DestCityName': 'Spokane',\n", + " 'ml': {'top_classes': [{'class_probability': 0.9876257877232794,\n", + " 'class_name': 'false'},\n", + " {'class_probability': 0.012374212276720642, 'class_name': 'true'}],\n", + " 'FlightDelay_prediction': 'false',\n", + " 'prediction_probability': 0.9876257877232794,\n", + " 'is_training': False}}}]}}" ] }, - "execution_count": 29, + "execution_count": 30, "metadata": {}, "output_type": "execute_result" } @@ 
-303,7 +318,7 @@ "source": [ "# insert code to get results\n", "query = {\"query\": {\"term\": {\"ml.is_training\": {\"value\": False }}}}\n", - "result = es_client.search(index='df-flight-delayed', filter_path=['hits.hits._*'], size=1, )\n", + "result = es_client.search(index='df-flight-delayed', filter_path=['hits.hits._*'], size=1, body=query)\n", "result" ] }, From 3462c21185da432cea77af35062ded334c57bc10 Mon Sep 17 00:00:00 2001 From: Winterflower Date: Tue, 17 Dec 2019 13:12:38 +0100 Subject: [PATCH 7/8] Fixes exception returned from ES --- .../ml-analytics-classification.ipynb | 20 ++++--------------- 1 file changed, 4 insertions(+), 16 deletions(-) diff --git a/Machine Learning/Analytics Jupyter Notebooks/ml-analytics-classification.ipynb b/Machine Learning/Analytics Jupyter Notebooks/ml-analytics-classification.ipynb index 09d449de..76c83e3d 100644 --- a/Machine Learning/Analytics Jupyter Notebooks/ml-analytics-classification.ipynb +++ b/Machine Learning/Analytics Jupyter Notebooks/ml-analytics-classification.ipynb @@ -108,7 +108,7 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": 31, "metadata": {}, "outputs": [ { @@ -124,7 +124,7 @@ " 'FlightDelayMin',\n", " 'FlightDelayType'],\n", " 'includes': []},\n", - " 'create_time': 1575548781478,\n", + " 'create_time': 1576584651508,\n", " 'dest': {'index': 'df-flight-delayed', 'results_field': 'ml'},\n", " 'id': 'model-flight-delay-classification',\n", " 'model_memory_limit': '100mb',\n", @@ -187,26 +187,14 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "{'error': {'caused_by': {'reason': 'task with id '\n", - " '{data_frame_analytics-model-flight-delay-classification} '\n", - " 'already exist',\n", - " 'type': 'resource_already_exists_exception'},\n", - " 'reason': 'Cannot start data frame analytics '\n", - " '[model-flight-delay-classification] because it has '\n", - " 
'already been started',\n", - " 'root_cause': [{'reason': 'task with id '\n", - " '{data_frame_analytics-model-flight-delay-classification} '\n", - " 'already exist',\n", - " 'type': 'resource_already_exists_exception'}],\n", - " 'type': 'status_exception'},\n", - " 'status': 409}\n" + "{'acknowledged': True}\n" ] } ], From 1bf5b157c3ad787ca6275b179cf38a0574460004 Mon Sep 17 00:00:00 2001 From: Winterflower Date: Tue, 17 Dec 2019 13:49:17 +0100 Subject: [PATCH 8/8] Adds README with installation instructions --- .../Analytics Jupyter Notebooks/README.md | 29 +++++++++++++++++++ 1 file changed, 29 insertions(+) create mode 100644 Machine Learning/Analytics Jupyter Notebooks/README.md diff --git a/Machine Learning/Analytics Jupyter Notebooks/README.md b/Machine Learning/Analytics Jupyter Notebooks/README.md new file mode 100644 index 00000000..e1e65a5b --- /dev/null +++ b/Machine Learning/Analytics Jupyter Notebooks/README.md @@ -0,0 +1,29 @@ +## Kibana Sample Flights Data Classification Example + +Set up a local instance of Jupyter using the following instructions: + +1. Set up a virtual environment called `env`: + +``` +python3 -m venv env +``` + +2. Activate it: + +``` +source env/bin/activate +``` + +3. Install the required dependencies for your chosen Jupyter notebook: + +``` +pip install -r some-requirements-file-name.txt +``` + +4. Launch Jupyter: + +``` +jupyter notebook +``` + +