Ingestion and analysis of a small mock database of real estate properties.
A data pipeline is built with DVC, which first cleans and transforms the `.csv` files in `data` with the scripts in `scripts`, then produces some charts for analysis in `plots`.
Note that the repo does not contain the data, and the pipeline structure follows dbt best practices. The scripts are very verbose and somewhat messy to allow the reader to follow the process of understanding and cleaning the data.
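As a rough illustration of what one of these cleaning steps might look like (the file names, column names, and paths below are hypothetical, and the real scripts in `scripts` are considerably more verbose):

```python
# Hypothetical sketch of a single cleaning/transform step; file names,
# column names and paths are illustrative, not the repo's actual ones.
from pathlib import Path

import pandas as pd

RAW = Path("data/raw/properties.csv")        # hypothetical raw input
OUT = Path("data/properties_clean.csv")      # hypothetical cleaned output

df = pd.read_csv(RAW)
df = df.drop_duplicates()                    # remove exact duplicate rows
df = df.dropna(subset=["price"])             # assuming a 'price' column exists
df["price"] = df["price"].astype(float)      # normalize dtypes before analysis

OUT.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(OUT, index=False)
```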
The `data` folder contains another README with thoughts on the data and a summary of the actions taken, while the `plots` folder contains a README with some of the insights gained and the business recommendations following from those insights.
It's assumed that you have Python 3 installed on your machine. After cloning the repo, follow these steps:
- Create a virtual environment:
  `python3 -m venv venv`
- Activate the environment:
  `source venv/bin/activate` (Unix) or `venv\Scripts\activate` (Windows)
- Upgrade pip and install the dependencies:
  `pip install --upgrade pip`
  `pip install -r requirements.txt`
Finally, you need to unzip the raw data onto your machine and point the pipeline to it. Create a `.env` file that points to the folder containing the unzipped `.csv` files, like so:
`DB_PATH=<absolute path to the folder containing the csv files>`
The default name of the unzipped folder is `mpe_database_randomized`.
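A minimal sketch of how a script can pick up this variable, assuming `python-dotenv` is available (whether the repo's scripts actually use it is an assumption; they may read the environment differently):

```python
# Sketch only: assumes python-dotenv is installed and a .env file exists
# in the repo root with a DB_PATH entry as described above.
import os
from pathlib import Path

from dotenv import load_dotenv

load_dotenv()                            # load variables from .env into the environment
db_path = Path(os.environ["DB_PATH"])    # folder with the unzipped .csv files

csv_files = sorted(db_path.glob("*.csv"))
print(f"Found {len(csv_files)} csv files in {db_path}")
```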
All processing is done through Python scripts located in `scripts` and orchestrated with DVC. To see the DAG of the pipeline, use the command:
`dvc dag`
The dependencies between the steps are controlled by the `dvc.yaml` file, which describes each step and its dependencies (both scripts and data).
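For orientation, a `dvc.yaml` typically looks something like the fragment below; the stage names, scripts, and paths here are hypothetical and will differ from the actual file in this repo:

```yaml
# Illustrative only: stage names, scripts and paths are made up.
stages:
  clean:
    cmd: python scripts/clean.py
    deps:
      - scripts/clean.py
      - data/raw                      # the unzipped .csv files
    outs:
      - data/properties_clean.csv
  analyze:
    cmd: python scripts/make_plots.py
    deps:
      - scripts/make_plots.py
      - data/properties_clean.csv
    outs:
      - plots/price_distribution.png
```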
To actually run everything, simply type:
`dvc repro`
This command runs every stage in the project that has not already been run. If something fails during the run, go to the script that failed, fix it, and run `dvc repro` again; steps that already completed successfully will not be rerun.
If you have already run `dvc repro` on your local repository and would like to run the pipeline again with new data, you have to force DVC to rerun the fetching stages, as they are normally only rerun when the SQL or Python scripts they use change. To force DVC to rerun everything, type:
`dvc repro --force`
This forces all stages to be rerun, even if nothing has changed.