Skip to content

Commit

Permalink
[GSProcessing] Add support for numerical and multi-numerical transfor…
Browse files Browse the repository at this point in the history
…mations. (#575)

*Issue #, if available:*
Closes #572 

*Description of changes:*
* Adds a single-column and multi-column numerical transformations.
* We support missing data imputation, and value normalization.
* For missing data we support `mean`, `median` and `most_frequent`
imputation.
* For normalization we support `mix-max` and `standard` which divides
all values
      by their column sum.
* We update the GConstruct config converter to allow pass-through of the
min-max transformation.
* We change the GSProcessing feature config key that describes the
transformation from `transform` to `transformation`.


By submitting this pull request, I confirm that you can use, modify,
copy, and redistribute this contribution, under the terms of your
choice.

---------

Co-authored-by: xiang song(charlie.song) <[email protected]>
  • Loading branch information
thvasilo and classicsong authored Oct 27, 2023
1 parent 167c71c commit d6d5819
Show file tree
Hide file tree
Showing 18 changed files with 1,123 additions and 82 deletions.
99 changes: 69 additions & 30 deletions docs/source/gs-processing/developer/input-configuration.rst
Original file line number Diff line number Diff line change
Expand Up @@ -12,8 +12,8 @@ between other config formats, such as the one used
by the single-machine GConstruct module.

GSProcessing can take a GConstruct-formatted file
directly, and we also provide `a script <https://github.com/awslabs/graphstorm/blob/main/graphstorm-processing/scripts/convert_gconstruct_config.py>`
that can convert a `GConstruct <https://graphstorm.readthedocs.io/en/latest/configuration/configuration-gconstruction.html#configuration-json-explanations>`
directly, and we also provide `a script <https://github.com/awslabs/graphstorm/blob/main/graphstorm-processing/scripts/convert_gconstruct_config.py>`_
that can convert a `GConstruct <https://graphstorm.readthedocs.io/en/latest/configuration/configuration-gconstruction.html#configuration-json-explanations>`_
input configuration file into the ``GSProcessing`` format,
although this is mostly aimed at developers, users are
can rely on the automatic conversion.
Expand All @@ -30,11 +30,11 @@ The GSProcessing input data configuration has two top-level objects:
- ``version`` (String, required): The version of configuration file being used. We include
the package name to allow self-contained identification of the file format.
- ``graph`` (JSON object, required): one configuration object that defines each
of the node types and edge types that describe the graph.
of the edge and node types that constitute the graph.

We describe the ``graph`` object next.

``graph`` configuration object
Contents of the ``graph`` configuration object
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``graph`` configuration object can have two top-level objects:
Expand Down Expand Up @@ -71,7 +71,7 @@ objects:
},
"source": {"column": "String", "type": "String"},
"relation": {"type": "String"},
"destination": {"column": "String", "type": "String"},
"dest": {"column": "String", "type": "String"},
"labels" : [
{
"column": "String",
Expand All @@ -82,8 +82,8 @@ objects:
"test": "Float"
}
},
]
"features": [{}]
],
"features": [{}]
}
- ``data`` (JSON Object, required): Describes the physical files
Expand Down Expand Up @@ -135,13 +135,12 @@ objects:
``source`` key, with a JSON object that contains
``{“column: String, and ”type“: String}``.
- ``relation``: (JSON object, required): Describes the relation
modeled by the edges. A relation can be common among all edges, or it
can have sub-types. The top-level objects for the object are:
modeled by the edges. The top-level keys for the object are:

- ``type`` (String, required): The type of the relation described by
the edges. For example, for a source type ``user``, destination
``movie`` we can have a relation type ``interacted_with`` for an
edge type ``user:interacted_with:movie``.
``movie`` we can have a relation type ``rated`` for an
edge type ``user:rated:movie``.

- ``labels`` (List of JSON objects, optional): Describes the label
for the current edge type. The label object has the following
Expand Down Expand Up @@ -171,9 +170,9 @@ objects:
- ``train``: The percentage of the data with available labels to
assign to the train set (0.0, 1.0].
- ``val``: The percentage of the data with available labels to
assign to the train set [0.0, 1.0).
assign to the validation set [0.0, 1.0).
- ``test``: The percentage of the data with available labels to
assign to the train set [0.0, 1.0).
assign to the test set [0.0, 1.0).

- ``features`` (List of JSON objects, optional)\ **:** Describes
the set of features for the current edge type. See the :ref:`features-object` section for details.
Expand All @@ -194,13 +193,12 @@ following top-level keys:
"files": ["String"],
"separator": "String"
},
"column" : "String",
"type" : "String",
"column": "String",
"type": "String",
"labels" : [
{
"column": "String",
"type": "String",
"separator": "String",
"split_rate": {
"train": "Float",
"val": "Float",
Expand All @@ -215,8 +213,8 @@ following top-level keys:
the edges object, with one top-level key for the ``format`` that
takes a String value, and one for the ``files`` that takes an array
of String values.
- ``column``: (String, required): The column in the data that
corresponds to the column that stores the node ids.
- ``column``: (String, required): The name of the column in the data that
stores the node ids.
- ``type:`` (String, optional): A type name for the nodes described
in this object. If not provided the ``column`` value is used as the
node type.
Expand Down Expand Up @@ -248,12 +246,12 @@ following top-level keys:
- ``train``: The percentage of the data with available labels to
assign to the train set (0.0, 1.0].
- ``val``: The percentage of the data with available labels to
assign to the train set [0.0, 1.0).
assign to the validation set [0.0, 1.0).
- ``test``: The percentage of the data with available labels to
assign to the train set [0.0, 1.0).
assign to the test set [0.0, 1.0).

- ``features`` (List of JSON objects, optional): Describes
the set of features for the current edge type. See the next section, :ref:`features-object`
the set of features for the current node type. See the section :ref:`features-object`
for details.

--------------
Expand All @@ -272,10 +270,10 @@ can contain the following top-level keys:
"column": "String",
"name": "String",
"transformation": {
"name": "String",
"kwargs": {
"arg_name": "<value>"
}
"name": "String",
"kwargs": {
"arg_name": "<value>"
}
},
"data": {
"format": "String",
Expand All @@ -285,7 +283,7 @@ can contain the following top-level keys:
}
- ``column`` (String, required): The column that contains the raw
feature values in the dataset
feature values in the data.
- ``transformation`` (JSON object, optional): The type of
transformation that will be applied to the feature. For details on
the individual transformations supported see :ref:`supported-transformations`.
Expand All @@ -309,7 +307,7 @@ can contain the following top-level keys:
# Example node config with multiple features
{
# This is where the node structure data exist just need an id col
# This is where the node structure data exist, just need an id col in these files
"data": {
"format": "parquet",
"files": ["path/to/node_ids"]
Expand Down Expand Up @@ -356,7 +354,7 @@ Supported transformations

In this section we'll describe the transformations we support.
The name of the transformation is the value that would appear
in the ``transform['name']`` element of the feature configuration,
in the ``['transformation']['name']`` element of the feature configuration,
with the attached ``kwargs`` for the transformations that support
arguments.

Expand All @@ -373,6 +371,35 @@ arguments.
split the values in the column and create a vector column
output. Example: for a separator ``'|'`` the CSV value
``1|2|3`` would be transformed to a vector, ``[1, 2, 3]``.
- ``numerical``

- Transforms a numerical column using a missing data imputer and an
optional normalizer.
- ``kwargs``:

- ``imputer`` (String, optional): A method to fill in missing values in the data.
Valid values are:
``none`` (Default), ``mean``, ``median``, and ``most_frequent``. Missing values will be replaced
with the respective value computed from the data.
- ``normalizer`` (String, optional): Applies a normalization to the data, after
imputation. Can take the following values:
- ``none``: (Default) Don't normalize the numerical values during encoding.
- ``min-max``: Normalize each value by subtracting the minimum value from it,
and then dividing it by the difference between the maximum value and the minimum.
- ``standard``: Normalize each value by dividing it by the sum of all the values.
- ``multi-numerical``

- Column-wise transformation for vector-like numerical data using a missing data imputer and an
optional normalizer.
- ``kwargs``:

- ``imputer`` (String, optional): Same as for ``numerical`` transformation, will
apply the ``mean`` transformation by default.
- ``normalizer`` (String, optional): Same as for ``numerical`` transformation, no
normalization is applied by default.
- ``separator`` (String, optional): Same as for ``no-op`` transformation, used to separate numerical
values in CSV input. If the input data are in Parquet format, each value in the
column is assumed to be an array of floats.

--------------

Expand Down Expand Up @@ -403,15 +430,27 @@ OAG-Paper dataset
],
"nodes" : [
{
"type": "paper",
"column": "ID",
"data": {
"format": "csv",
"separator": ",",
"files": [
"node_feat.csv"
]
},
"type": "paper",
"column": "ID",
"features": [
{
"column": "n_citation",
"transformation": {
"name": "numerical",
"kwargs": {
"imputer": "mean",
"normalizer": "min-max"
}
}
}
],
"labels": [
{
"column": "field",
Expand Down
1 change: 1 addition & 0 deletions docs/source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ Welcome to the GraphStorm Documentation and Tutorials
gs-processing/usage/example
gs-processing/usage/distributed-processing-setup
gs-processing/usage/amazon-sagemaker
gs-processing/developer/input-configuration

.. toctree::
:maxdepth: 1
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -89,8 +89,7 @@ def convert_to_gsprocessing(self, input_dictionary: dict) -> dict:

gsprocessing_dict: dict[str, Any] = {}

# hardcode the version number for the first version
gsprocessing_dict["version"] = "gsprocessing-1.0"
gsprocessing_dict["version"] = "gsprocessing-v1.0"
gsprocessing_dict["graph"] = {}

# deal with nodes
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,8 @@
See the License for the specific language governing permissions and
limitations under the License.
"""
from typing import Any

from .converter_base import ConfigConverter
from .meta_configuration import NodeConfig, EdgeConfig

Expand Down Expand Up @@ -80,27 +82,40 @@ def _convert_feature(feats: list[dict]) -> list[dict]:
list[dict]
The feature information in the GSProcessing format
"""
feats_list = []
gsp_feats_list = []
if feats in [[], [{}]]:
return []
for ele in feats:
if "transform" in ele:
raise ValueError(
"Currently only support no-op operation, "
"we do not support any other no-op operation"
)
feat_dict = {}
kwargs = {"name": "no-op"}
for col in ele["feature_col"]:
feat_dict = {"column": col, "transform": kwargs}
feats_list.append(feat_dict)
if "out_dtype" in ele:
for gconstruct_feat_dict in feats:
gsp_feat_dict = {}
gsp_feat_dict["column"] = gconstruct_feat_dict["feature_col"][0]
if "feature_name" in gconstruct_feat_dict:
gsp_feat_dict["name"] = gconstruct_feat_dict["feature_name"]

gsp_transformation_dict: dict[str, Any] = {}
if "transform" in gconstruct_feat_dict:
gconstruct_transform_dict = gconstruct_feat_dict["transform"]

if gconstruct_transform_dict["name"] == "max_min_norm":
gsp_transformation_dict["name"] = "numerical"
gsp_transformation_dict["kwargs"] = {"normalizer": "min-max", "imputer": "mean"}
# TODO: Add support for other common transformations here
else:
raise ValueError(
"Unsupported GConstruct transformation name: "
f"{gconstruct_transform_dict['name']}"
)
else:
gsp_transformation_dict["name"] = "no-op"

if "out_dtype" in gconstruct_feat_dict:
assert (
ele["out_dtype"] == "float32"
gconstruct_feat_dict["out_dtype"] == "float32"
), "GSProcessing currently only supports float32 features"
if "feature_name" in ele:
feat_dict["name"] = ele["feature_name"]
return feats_list

gsp_feat_dict["transformation"] = gsp_transformation_dict
gsp_feats_list.append(gsp_feat_dict)

return gsp_feats_list

@staticmethod
def convert_nodes(nodes_entries):
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -16,12 +16,12 @@
Configuration parsing for edges and nodes
"""
from abc import ABC

from typing import Any, Dict, List, Optional, Sequence

from graphstorm_processing.constants import SUPPORTED_FILE_TYPES
from .label_config_base import LabelConfig, EdgeLabelConfig, NodeLabelConfig
from .feature_config_base import FeatureConfig, NoopFeatureConfig
from .numerical_configs import MultiNumericalFeatureConfig, NumericalFeatureConfig
from .data_config_base import DataStorageConfig


Expand Down Expand Up @@ -52,6 +52,10 @@ def parse_feat_config(feature_dict: Dict) -> FeatureConfig:

if transformation_name == "no-op":
return NoopFeatureConfig(feature_dict)
elif transformation_name == "numerical":
return NumericalFeatureConfig(feature_dict)
elif transformation_name == "multi-numerical":
return MultiNumericalFeatureConfig(feature_dict)
else:
raise RuntimeError(f"Unknown transformation name: '{transformation_name}'")

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -82,8 +82,13 @@ def _sanity_check(self) -> None:


class NoopFeatureConfig(FeatureConfig):
"""
Feature configuration for features that do not need to be transformed.
"""Feature configuration for features that do not need to be transformed.
Supported kwargs
----------------
separator: str
When provided will treat the input as strings, split each value in the string using
the separator, and convert the resulting list of floats into a float-vector feature.
"""

def __init__(self, config: Mapping):
Expand All @@ -98,4 +103,4 @@ def __init__(self, config: Mapping):
def _sanity_check(self) -> None:
super()._sanity_check()
if self._data_config and self.value_separator and self._data_config.format != "csv":
raise RuntimeError("value_separator should only be provided for CSV data")
raise RuntimeError("separator should only be provided for CSV data")
Loading

0 comments on commit d6d5819

Please sign in to comment.