What does this sample demonstrate:
- A simple example of how to use `ParallelRunStep` to process data in parallel.
- A `ParallelRunStep` can run a user script asynchronously and in parallel on multiple AmlCompute targets. This functionality suits scenarios where large amounts of data need to be processed.
- By setting up `m` nodes of AmlCompute and `n` processes per node, the total time is expected to be reduced to `1/mn` of a single-process run (not counting the time for environment setup).
- Sample data is created based on the California housing dataset obtained by the `sklearn.datasets.fetch_california_housing` function (see the sketch after this list). For more details, please refer to the sklearn documentation, Real world datasets - 7.2.7. California Housing dataset. According to that documentation, "This dataset was obtained from the StatLib repository. http://lib.stat.cmu.edu/datasets/"

  References: Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions, Statistics and Probability Letters, 33 (1997) 291-297
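As a rough illustration, the data generation could look like the following. This is a minimal sketch, assuming the files are plain CSV; the file naming and format used by the real `create_sample_data.py` may differ.

```python
# Minimal sketch of how sample text files could be generated from the
# California housing dataset. The CSV format and file naming here are
# assumptions; the real create_sample_data.py may differ.
import argparse
import os

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--count", type=int, default=100,
                        help="count of text files to create")
    args = parser.parse_args()

    # Load the dataset as a DataFrame (features plus target column).
    bunch = fetch_california_housing(as_frame=True)
    df = pd.concat([bunch.data, bunch.target], axis=1)

    # Spread the rows evenly across the requested number of CSV files.
    os.makedirs("input", exist_ok=True)
    for i, chunk in enumerate(np.array_split(df, args.count)):
        chunk.to_csv(os.path.join("input", f"sample_{i}.csv"), index=False)


if __name__ == "__main__":
    main()
```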
What doesn't this sample demonstrate:
- any tests
- any CI/CD
- dataset usage (no Azure ML Dataset is used)
Pipeline Structure
This example assumes a scenario where an Azure Machine Learning service pipeline is used to train a simple linear regression model based on the California housing data.
Three steps are included:
- preparation step, which downloads raw text files from the datastore.
- extraction step, a `ParallelRunStep` which extracts data from the text files and merges the data into a single file (a sketch of such a user script follows this list).
- training step, which trains a simple linear regression model using the data obtained from the extraction step.
- Please note that `sklearn.linear_model.LinearRegression` is used to create the model. For details of sklearn's license, please check the BSD 3-Clause License.
- Whether you run this project locally or in Azure DevOps CI/CD pipelines, the code needs to get Azure ML context for remote or offline runs. Create Azure resources as documented here.
- Review the folder structure explained here.
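For reference, a `ParallelRunStep` user script must define `init()` and `run(mini_batch)` functions. A minimal sketch of what the extraction step's script might look like, assuming each `mini_batch` arrives as a list of text file paths (as it does for file-based inputs) and the files are CSV:

```python
# Minimal sketch of a ParallelRunStep entry script for the extraction step.
# init() is called once per worker process; run() is called once per mini-batch.
import pandas as pd


def init():
    # One-time setup per worker process (e.g. loading config or models).
    pass


def run(mini_batch):
    # For file-based inputs, mini_batch is a list of file paths (an assumption
    # for this sketch). Read each file and return the combined rows; the step
    # aggregates the values returned from every run() call.
    frames = [pd.read_csv(file_path) for file_path in mini_batch]
    return pd.concat(frames)
```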
This sample takes input files from an Azure Machine Learning datastore mapped to Azure Blob Storage.
- Go to the Azure Storage Account and create a new container named 'azureml'.
- Change directory to `data` and run `create_sample_data.py` to create sample data:

  cd samples/parallel-processing-california-housing/data
  python -m create_sample_data --count [count of text files (default 100)]

- Upload the created files by adding an 'input' directory in the 'azureml' container (or script the upload as sketched below).
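If you would rather script the upload than use the portal, the generated files can also be pushed through the Azure ML SDK. A minimal sketch, assuming the blob datastore described below has already been registered (the datastore name here is a placeholder):

```python
# Minimal sketch: upload the generated sample files to the registered
# blob datastore instead of uploading them through the Azure portal.
from azureml.core import Datastore, Workspace

ws = Workspace.from_config()  # assumes workspace config is available locally

# The datastore name is a placeholder; use the value of AML_BLOB_DATASTORE_NAME.
datastore = Datastore.get(ws, datastore_name="california_housing_datastore")

# Copy the local data/input directory into the 'input' folder of the container.
datastore.upload(src_dir="data/input", target_path="input", overwrite=True)
```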
After you have all Azure resources and input data in Azure Storage, you need to create the following Azure Machine Learning components.
- Azure Machine Learning compute
- Azure Machine Learning datastore
- Change directory to samples/parallel-processing-california-housing.

  cd samples/parallel-processing-california-housing

- Run the following command to create the compute.

  python -m environment_setup.provisioning.create-compute

- Run the following command to create the datastore (a sketch of what these provisioning scripts roughly do follows below).

  python -m environment_setup.provisioning.create-datastore
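For orientation, the two provisioning scripts can be expected to do something along the lines of the sketch below, driven by the .env variables described in the next steps. This is a condensed assumption of their contents, not the scripts' exact code.

```python
# Condensed sketch of what create-compute and create-datastore roughly do:
# provision an AmlCompute cluster and register a blob container as a datastore.
import os

from azureml.core import Datastore, Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()

# Provision (or reuse) the compute cluster, sized by the .env variables.
compute_config = AmlCompute.provisioning_configuration(
    vm_size=os.environ["AML_COMPUTE_VM_SIZE"],
    min_nodes=int(os.environ["AML_COMPUTE_MIN_NODES"]),
    max_nodes=int(os.environ["AML_COMPUTE_MAX_NODES"]),
    idle_seconds_before_scaledown=int(os.environ["AML_COMPUTE_IDLE_TIME"]),
)
compute = ComputeTarget.create(
    ws, os.environ["AML_COMPUTE_CLUSTER_NAME"], compute_config)
compute.wait_for_completion(show_output=True)

# Register the blob container holding the input data as an Azure ML datastore.
Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name=os.environ["AML_BLOB_DATASTORE_NAME"],
    container_name=os.environ["AML_BLOB_CONTAINER_NAME"],
    account_name=os.environ["AML_STORAGE_ACCOUNT_NAME"],
    account_key=os.environ["AML_STORAGE_ACCOUNT_KEY"],
)
```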
- Make a copy of .env.example, place it in the root of this sample, configure the variables, and rename the file to .env.
- Update variable values.

  | name | description |
  | --- | --- |
  | SUBSCRIPTION_ID | Azure Subscription ID |
  | RESOURCE_GROUP | Azure Resource group name |
  | WORKSPACE_NAME | Azure Machine Learning workspace name |
  | AML_ENV_NAME | Azure Machine Learning Environment name |
  | AML_COMPUTE_CLUSTER_NAME | Azure Machine Learning compute cluster name |
  | AML_COMPUTE_VM_SIZE | Azure Machine Learning compute virtual machine size, e.g. "STANDARD_D2_V2" |
  | AML_COMPUTE_IDLE_TIME | Azure Machine Learning compute idle time (seconds) before scaling down |
  | AML_COMPUTE_MIN_NODES | Azure Machine Learning compute minimum number of nodes |
  | AML_COMPUTE_MAX_NODES | Azure Machine Learning compute maximum number of nodes |
  | AML_BLOB_DATASTORE_NAME | Azure Machine Learning blob datastore name |
  | AML_STORAGE_ACCOUNT_NAME | Azure Storage Account name for the Azure Machine Learning blob datastore |
  | AML_BLOB_CONTAINER_NAME | Blob container name which contains input data |
  | AML_STORAGE_ACCOUNT_KEY | Azure Storage Account key |
  | PIPELINE_ENDPOINT_NAME | Azure Machine Learning pipeline endpoint name |
  | PIPELINE_NAME | Azure Machine Learning pipeline name |
  | INPUT_DIR | Folder name used to save input files. Don't change this value. |
  | SOURCES_DIR_TRAIN | Source code directory for the Azure Machine Learning pipeline |
  | PREPARATION_STEP_SCRIPT_PATH | Python script path for the preparation step |
  | EXTRACTION_STEP_SCRIPT_PATH | Python script path for the extraction step |
  | TRAINING_STEP_SCRIPT_PATH | Python script path for the training step |

  There are also special settings needed for the `ParallelRunStep`. You can also find their details in another sample, Azure Machine Learning Batch Inference.

  | name | description |
  | --- | --- |
  | ERROR_THRESHOLD | The number of failures that should be ignored during processing. If the error count goes above this value, the job will be aborted. If the input dataset is `TabularDataset` type, this is the number of record (row) failures; if it is `FileDataset` type, this is the number of file failures. |
  | NODE_COUNT | Number of nodes in the compute target used for running the `ParallelRunStep` |
  | MINI_BATCH_SIZE | Size of data that can be processed in one run() call. For example, if an input dataset is `FileDataset` type consisting of 10,000 files in total and MINI_BATCH_SIZE is set to "25", the `ParallelRunStep` will partition the input dataset into 400 mini-batches (10,000 divided by 25). If the input dataset is `TabularDataset` type, this is the approximate size of data passed to each run(), e.g. "1024", "1024KB", "10MB", "1GB"; if it is `FileDataset` type, this is the approximate number of files passed to each run(). Note that MINI_BATCH_SIZE is a string, even if the value is a number such as 10. |
  | PROCESS_COUNT_PER_NODE | Number of processes executed on each node. Optional; the default value is the number of cores on the node. |
  | RUN_INVOCATION_TIMEOUT | Timeout in seconds for each invocation of the run() method |
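To show how these settings come together, here is a minimal sketch of building a `ParallelRunConfig` from the variables above. The `output_action` value and the environment lookup are assumptions made for illustration:

```python
# Minimal sketch: wiring the .env settings above into a ParallelRunConfig.
import os

from azureml.core import Environment, Workspace
from azureml.pipeline.steps import ParallelRunConfig

ws = Workspace.from_config()

parallel_run_config = ParallelRunConfig(
    source_directory=os.environ["SOURCES_DIR_TRAIN"],
    entry_script=os.environ["EXTRACTION_STEP_SCRIPT_PATH"],
    mini_batch_size=os.environ["MINI_BATCH_SIZE"],   # kept as a string, e.g. "25"
    error_threshold=int(os.environ["ERROR_THRESHOLD"]),
    output_action="append_row",                      # assumption for this sketch
    environment=Environment.get(ws, os.environ["AML_ENV_NAME"]),
    compute_target=ws.compute_targets[os.environ["AML_COMPUTE_CLUSTER_NAME"]],
    node_count=int(os.environ["NODE_COUNT"]),
    process_count_per_node=int(os.environ["PROCESS_COUNT_PER_NODE"]),
    run_invocation_timeout=int(os.environ["RUN_INVOCATION_TIMEOUT"]),
)
```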
- Use the VSCode dev container, or install Anaconda or Miniconda and create a Conda environment by running local_install_requirements.sh.
- In VSCode, open the root folder of this sample and select the Conda environment created above as the Python interpreter.
- Publish and run the Azure ML pipelines.
  - To publish and run the Azure ML pipeline, run:

    # publish the Azure ML pipeline
    python -m ml_service.pipelines.build_pipeline

    The above command will create a pipeline endpoint (with the default name `california_housing_pipeline_endpoint`). Now you can go to Azure Machine Learning, click Pipelines on the left panel, choose Endpoints, find the newly updated pipeline endpoint, and submit a new pipeline run (the publishing logic is sketched below).
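Behind the scenes, build_pipeline can be expected to publish the pipeline behind a pipeline endpoint roughly as sketched below; the step definitions are omitted and the endpoint-exists handling is a simplification:

```python
# Sketch of how build_pipeline might publish the pipeline behind an endpoint.
import os

from azureml.core import Workspace
from azureml.pipeline.core import Pipeline, PipelineEndpoint

ws = Workspace.from_config()

# The preparation, extraction (ParallelRunStep) and training steps would be
# constructed here; they are omitted in this sketch.
steps = []

pipeline = Pipeline(workspace=ws, steps=steps)
published = pipeline.publish(name=os.environ["PIPELINE_NAME"])

endpoint_name = os.environ["PIPELINE_ENDPOINT_NAME"]
try:
    # Add the new version to an existing endpoint and make it the default...
    endpoint = PipelineEndpoint.get(ws, name=endpoint_name)
    endpoint.add_default(published)
except Exception:
    # ...or create the endpoint on the first publish.
    PipelineEndpoint.publish(ws, name=endpoint_name,
                             description="California housing pipeline",
                             pipeline=published)
```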
This sample uses Flake8 as the linting tool. Ideally we should lint all Python code; however, we exclude the Kaldi sample source code as it comes from another repo. This happens a lot in real projects: some code comes from outside the project and you don't want to modify it.
See .flake8 for rule settings.
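For illustration only, an exclusion like the one described above might be expressed in .flake8 roughly as follows; the paths and rules in this repo's actual file may differ:

```ini
# Hypothetical excerpt of a .flake8 file. The exclude path is illustrative:
# Kaldi sample code comes from another repo and is not linted.
[flake8]
max-line-length = 120
exclude = samples/kaldi-sample/src
```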