Commit 7082742

Implement Dataset and Corpus data loaders. (DistrictDataLabs#535)
This PR implements a significant change in the way yellowbrick handles datasets, moving them from data that can be downloaded and loaded using example code to prime time members of the library that can be loaded into pandas data frames and series or into well-structured numpy arrays with correct data types. We have completely overhauled dataset management using the yellowbrick-datasets repository as our data management tool. Data is still stored on S3 but contains .csv.gz and meta.json files for loading into pandas if it's installed or .npz files for loading into valid numpy arrays. New `Dataset` and `Corpus` manage access to the data, downloading it if it's not already on disk and providing access to the contents in the source directory. We maintain our security checking with sha256 hashes and a new manifest.json method. Fixes DistrictDataLabs#416
1 parent cad0d25 commit 7082742
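
The commit message mentions security checking with sha256 hashes and a manifest.json file. As a rough illustration only, the Python sketch below shows one way a downloaded archive could be checked against a hash recorded in a manifest. The function names `sha256sum` and `verify_against_manifest`, and the manifest layout (a mapping from dataset name to a `signature` field), are assumptions made for this example; they are not taken from the commit and may differ from what the library actually does.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256sum(path, blocksize=65536):
    """Return the hex SHA-256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(blocksize), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_against_manifest(path, manifest_path, name):
    """Compare a file's digest with the one stored in a manifest.

    The layout {name: {"signature": hex_digest}} is a hypothetical
    schema for illustration, not the real manifest.json format.
    """
    with open(manifest_path) as f:
        manifest = json.load(f)
    expected = manifest[name]["signature"]
    return sha256sum(path) == expected


# Demo on a throwaway file: record its hash, verify, then tamper with it.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "example.zip"
    archive.write_bytes(b"not a real dataset")

    manifest = Path(tmp) / "manifest.json"
    manifest.write_text(
        json.dumps({"example": {"signature": sha256sum(archive)}})
    )

    ok = verify_against_manifest(archive, manifest, "example")

    # Changing the file contents should make verification fail.
    archive.write_bytes(b"tampered")
    tampered_ok = verify_against_manifest(archive, manifest, "example")
```

Reading the file in fixed-size chunks keeps memory use flat for large archives, which is why the sketch avoids loading the whole file at once.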

26 files changed, +1963 -689 lines changed

.gitignore

+2 -3
@@ -61,6 +61,7 @@ docs/_build/
 # IDE/editor droppings
 *.swp
 *.swo
+.vscode/settings.json
 
 # OS droppings
 .DS_Store
@@ -120,6 +121,4 @@ fabric.properties
 
 # Data downloaded from Yellowbrick
 data/
-.vscode/settings.json
-
-yellowbrick/datasets/fixtures
+yellowbrick/datasets/fixtures

MANIFEST.in

+21 -8
@@ -4,11 +4,24 @@ include *.txt
 include *.yml
 include *.cfg
 include Makefile
-recursive-include docs *.rst
-recursive-include docs *.jpg
-recursive-include docs *.png
-recursive-include docs *.py
-recursive-include docs Makefile
-recursive-include tests *.py
-recursive-include examples *.py
-recursive-include examples *.ipynb
+include MANIFEST.in
+
+include examples/*.ipynb
+include examples/*.md
+
+graft docs
+prune docs/_build
+
+graft tests
+prune tests/fixtures
+prune tests/actual_images
+
+graft yellowbrick
+prune yellowbrick/datasets/fixtures
+
+global-exclude __pycache__
+global-exclude *.py[co]
+global-exclude .ipynb_checkpoints
+global-exclude .DS_Store
+global-exclude .env
+global-exclude .coverage.*

docs/api/datasets.rst

+9
@@ -79,3 +79,12 @@ Unless otherwise specified, most of the examples currently use one or more of th
 - **mushroom**: suitable for classification/clustering
 - **occupancy**: suitable for classification
 - **spam**: suitable for binary classification
+
+
+API Reference
+-------------
+
+.. automodule:: yellowbrick.datasets.path
+    :members:
+    :undoc-members:
+    :show-inheritance:

docs/changelog.rst

+2
@@ -10,6 +10,8 @@ Version 1.0
 * Contributors: Benjamin Bengfort, Rebecca Bilbro, Nathan Danielsen, Kristen McIntyre, Larry Gray, Prema Roman, John Healy, Sourav Singh, Francois Dion, Jerome Massot
 
 Major Changes:
+- New datasets module that provide greater support for interacting with Yellowbrick example datasets including support for Pandas, npz, and text corpora.
+- Management repository for Yellowbrick example data, yellowbrick-datasets.
 - Add support for matplotlib 3.0.1 or greater.
 - ``UMAPVisualizer`` as an alternative manifold to TSNE for corpus visualization that is fast enough to not require preprocessing PCA or SVD decomposition and preserves higher order similarities and distances.
 

examples/README.md

+22 -45
@@ -1,66 +1,43 @@
-# Yellowbrick Examples
+# Yellowbrick Examples
 
 [![Visualizers](../docs/images/visualizers.png)](../docs/images/visualizers.png)
 
-Welcome to the yellowbrick examples directory! This directory contains a gallery of visualizers and their application to classification, regression, clustering, and other machine learning techniques with Scikit-Learn. Examples have been submitted both by the Yellowbrick team and also users like you! The result is a rich gallery of tools and techniques to equip your machine learning with visual diagnostics and visualizer workflows!
+Welcome to the yellowbrick examples directory! This directory contains a gallery of visualizers and their application to classification, regression, clustering, and other machine learning techniques with scikit-learn. Examples have been submitted both by the Yellowbrick team and also users like you! The result is a rich gallery of tools and techniques to equip your machine learning with visual diagnostics and visualizer workflows!
 
-## Getting Started
+## Getting Started
 
-The notebook to explore first is the `examples.ipynb` Jupyter notebook. This notebook contains the executable examples from the tutorial in the documentation. However, before you can successfully run this notebook, you must first download the sample datasets. To download the samples run the downloader script:
+The notebook to explore first is the `examples.ipynb` Jupyter notebook. This notebook contains the executable examples from the tutorial in the documentation. You can run the notebook as follows:
 
 ```
-$ python download.py
+$ jupyter notebook examples.ipynb
 ```
 
-This should create a directory called `examples/data`, which in turn will contain CSV or text datasets. There are two primary problems that the download script may have: first, you may get the error `"The requests module is required to download data"`. To fix this problem:
+If you don't have jupyter installed, or other dependencies, you may have to `pip install` them.
 
-```
-$ pip install requests
-```
-
-The second problem may be `"Download signature does not match hardcoded signature!"` This problem means that the file you're trying to download has changed. Either download a more recent version of Yellowbrick, or use the URLs in the `download.py` script to fetch the data manually. If there are any other problems, please notify us via [GitHub Issues](https://github.com/DistrictDataLabs/yellowbrick/issues).
-
-Once the example data has been downloaded, you can run the examples notebook as follows:
-
-```
-$ jupyter notebook examples.ipynb
-```
-
-If you don't have jupyter installed, or other dependencies, you may have to `pip install` them.
-
-## Organization
+## Organization
 
 The examples directory contains many notebooks, folders and files. At the top level you will see the following:
 
-- examples.ipynb: a notebook with executable versions of the tutorial visualizers
-- download.py: a script to download the example data sets
-- palettes.ipynb: a visualization of the Yellowbrick palettes
-- data: a directory containing the example datasets.
+- examples.ipynb: a notebook with executable versions of the tutorial visualizers
+- palettes.ipynb: a visualization of the Yellowbrick palettes
+- regression.ipynb: a notebook exploring the regression model visualizers.
 
-In addition to these files and directory, you will see many other directories, whose names are the GitHub usernames of their contributors. You can explore these user submitted examples or submit your own!
+In addition to these files and directory, you will see many other directories, whose names are the GitHub usernames of their contributors. You can explore these user submitted examples or submit your own!
 
-### Contributing
+### Contributing
 
 To contribute an example notebook of your own, perform the following steps:
 
-1. Fork the repository into your own account
-2. Checkout the develop branch (see [contributing to Yellowbrick](http://www.scikit-yb.org/en/latest/about.html#contributing) for more.
-3. Create a directory in the repo, `examples/username` where username is your GitHub username.
-4. Create a notebook in that directory with your example. See [user testing](http://www.scikit-yb.org/en/latest/evaluation.html) for more.
-5. Commit your changes back to your fork.
-6. Submit a pull-request from your develop branch to the Yellowbrick develop branch.
-7. Complete the code review steps with a Yellowbrick team member.
-
-That's it -- thank you for contributing your example!
-
-A couple of notes. First, please make sure that the Jupyter notebook you submit is "run" -- that is it has the output saved to the notebook and is viewable on GitHub (empty notebooks don't serve well as a gallery). Second, please do not commit datasets, but instead provide instructions for downloading the dataset. You can create a downloader utility similar to ours.
-
-One great tip, is to create your PR right after you fork the repo; that way we can work with you on the changes you're making and communicate about how to have a very successful contribution!
+1. Fork the repository into your own account
+2. Checkout the develop branch (see [contributing to Yellowbrick](http://www.scikit-yb.org/en/latest/about.html#contributing) for more.
+3. Create a directory in the repo, `examples/username` where username is your GitHub username.
+4. Create a notebook in that directory with your example. See [user testing](http://www.scikit-yb.org/en/latest/evaluation.html) for more.
+5. Commit your changes back to your fork.
+6. Submit a pull-request from your develop branch to the Yellowbrick develop branch.
+7. Complete the code review steps with a Yellowbrick team member.
 
-### User Examples
+That's it -- thank you for contributing your example!
 
-In this section we want to thank our examples contributors, and describe their notebooks so that you can find an example similar to your application!
+A couple of notes. First, please make sure that the Jupyter notebook you submit is "run" -- that is it has the output saved to the notebook and is viewable on GitHub (empty notebooks don't serve well as a gallery). Second, please do not commit datasets, but instead provide instructions for downloading the dataset. You can create a downloader utility similar to ours.
 
-- [bbengfort](https://github.com/bbengfort): visualizing text classification
-- [rebeccabilbro](https://github.com/rebeccabilbro): visualizing book reviews data
-- [nathan](https://github.com/ndanielsen/): visualizing the Iris dataset
+One great tip, is to create your PR right after you fork the repo; that way we can work with you on the changes you're making and communicate about how to have a very successful contribution!

examples/download.py

-149
This file was deleted.

setup.cfg

+1
@@ -1,5 +1,6 @@
 [metadata]
 description-file = DESCRIPTION.txt
+license_file = LICENSE.txt
 
 [wheel]
 universal = 1
