Merge branch 'main' into gsprocessing-hard-negative
jalencato authored Nov 14, 2024
2 parents 35d4cbd + f5cf632 commit 11007eb
Showing 13 changed files with 675 additions and 108 deletions.
6 changes: 4 additions & 2 deletions docs/source/cli/graph-construction/distributed/example.rst
@@ -259,7 +259,9 @@ the graph structure, features, and labels. In more detail:
GSProcessing will use the transformation values listed here
instead of creating new ones, ensuring that models trained with the original
data can still be used in the newly transformed data. Currently only
categorical transformations can be re-applied.
categorical and numerical transformations can be re-applied. Note that
the Rank-Gauss transformation does not support re-application, so it may
only be suitable for transductive tasks.
* ``updated_row_counts_metadata.json``:
This file is meant to be used as the input configuration for the
distributed partitioning pipeline. ``gs-repartition`` produces
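To see why the Rank-Gauss transformation resists re-application, note that each output value depends on the value's rank within the full training dataset, so an unseen value has no precomputed mapping. A minimal sketch of the idea (illustrative only, not GSProcessing's Spark implementation) using the standard-library inverse normal CDF:

```python
from statistics import NormalDist


def rank_gauss(values: list[float]) -> list[float]:
    """Map values onto a standard normal distribution via their ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    norm = NormalDist()
    out = [0.0] * n
    for rank, idx in enumerate(order):
        # Scale the rank into (0, 1) before applying the inverse normal CDF.
        q = (rank + 0.5) / n
        out[idx] = norm.inv_cdf(q)
    return out


print(rank_gauss([10.0, 3.0, 7.0]))
```

Because the mapping is defined only over the ranks of the data it was fit on, it cannot be stored and replayed on new values the way categorical or min-max parameters can, which is why it is limited to transductive settings.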
@@ -313,7 +315,7 @@ you can use the following command to run the partition job locally:
--num-parts 2 \
--dgl-tool-path ./dgl/tools \
--partition-algorithm random \
--ip-config ip_list.txt
--ip-config ip_list.txt
The command above will first do graph partitioning to determine the ownership for each partition and save the results.
Then it will do data dispatching to physically assign the partitions to graph data and dispatch them to each machine.
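The two phases described above can be sketched in plain Python. This is an illustrative toy, not the actual pipeline (which relies on DGL's distributed partitioning tools): phase one assigns each node an owner partition with the ``random`` algorithm, and phase two groups ("dispatches") nodes by owner.

```python
import random


def random_partition(num_nodes: int, num_parts: int, seed: int = 0) -> list[int]:
    """Phase 1: assign every node an owner partition at random."""
    rng = random.Random(seed)
    return [rng.randrange(num_parts) for _ in range(num_nodes)]


def dispatch(ownership: list[int], num_parts: int) -> dict[int, list[int]]:
    """Phase 2: group node IDs by their owning partition."""
    parts: dict[int, list[int]] = {p: [] for p in range(num_parts)}
    for node, owner in enumerate(ownership):
        parts[owner].append(node)
    return parts


owners = random_partition(10, 2)
print(dispatch(owners, 2))
```

Each node ends up in exactly one partition; the real job additionally moves features and labels to the machine that owns each partition.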
21 changes: 21 additions & 0 deletions docs/source/cli/model-training-inference/distributed/sagemaker.rst
@@ -212,6 +212,27 @@ Users can use the following commands to check the corresponding outputs:
aws s3 ls s3://<PATH_TO_SAVE_GENERATED_NODE_EMBEDDING>/
aws s3 ls s3://<PATH_TO_SAVE_PREDICTION_RESULTS>/
Launch embedding generation task
``````````````````````````````````
Users can use the following example command to launch a GraphStorm embedding generation job on the ``ogbn-mag`` data without generating predictions.

.. code:: bash

    python3 launch/launch_infer.py \
        --image-url <AMAZON_ECR_IMAGE_URI> \
        --region <REGION> \
        --entry-point run/infer_entry.py \
        --role <ROLE_ARN> \
        --instance-count 3 \
        --graph-data-s3 s3://<PATH_TO_DATA>/ogbn_mag_lp_3p \
        --yaml-s3 s3://<PATH_TO_TRAINING_CONFIG>/mag_lp.yaml \
        --model-artifact-s3 s3://<PATH_TO_SAVE_TRAINED_MODEL>/ \
        --raw-node-mappings-s3 s3://<PATH_TO_DATA>/ogbn_mag_lp_3p/raw_id_mappings \
        --task-type compute_emb \
        --output-emb-s3 s3://<PATH_TO_SAVE_GENERATED_NODE_EMBEDDING>/ \
        --graph-name ogbn-mag \
        --restore-model-layers embed,gnn
Launch graph partitioning task
```````````````````````````````
If your data are in the `DGL chunked
@@ -15,7 +15,7 @@
"""

from dataclasses import dataclass
from typing import Sequence, Optional
from typing import Optional

from graphstorm_processing.constants import SUPPORTED_FILE_TYPES

@@ -27,7 +27,7 @@ class DataStorageConfig:
"""

format: str
files: Sequence[str]
files: list[str]
separator: Optional[str] = None

def __post_init__(self):
@@ -39,3 +39,7 @@ def __post_init__(self):
raise ValueError(
f"File paths need to be relative (not starting with '/'), got : {file}"
)

for idx, file in enumerate(self.files):
if file.startswith("./"):
self.files[idx] = file[2:]
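The new ``__post_init__`` lines above validate that paths are relative and then strip a leading ``./``. A self-contained sketch of the same behavior (the class name here is illustrative, not the actual GSProcessing class):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StorageConfigSketch:
    """Illustrative stand-in for DataStorageConfig's path handling."""

    format: str
    files: list[str]
    separator: Optional[str] = None

    def __post_init__(self):
        # Reject absolute paths outright.
        for file in self.files:
            if file.startswith("/"):
                raise ValueError(
                    f"File paths need to be relative (not starting with '/'), got : {file}"
                )
        # Normalize a leading './' so downstream path joins stay clean.
        for idx, file in enumerate(self.files):
            if file.startswith("./"):
                self.files[idx] = file[2:]


cfg = StorageConfigSketch(format="csv", files=["./data/a.csv", "data/b.csv"])
print(cfg.files)  # → ['data/a.csv', 'data/b.csv']
```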
@@ -32,7 +32,7 @@
)


class DistFeatureTransformer(object):
class DistFeatureTransformer:
"""
Given a feature configuration selects the correct transformation type,
which can then be applied through a call to apply_transformation.
@@ -57,7 +57,9 @@ def __init__(
if feat_type == "no-op":
self.transformation = NoopTransformation(**default_kwargs, **args_dict)
elif feat_type == "numerical":
self.transformation = DistNumericalTransformation(**default_kwargs, **args_dict)
self.transformation = DistNumericalTransformation(
**default_kwargs, **args_dict, json_representation=json_representation
)
elif feat_type == "multi-numerical":
self.transformation = DistMultiNumericalTransformation(**default_kwargs, **args_dict)
elif feat_type == "bucket-numerical":
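The ``json_representation`` argument threaded into ``DistNumericalTransformation`` above enables the re-application described in the docs change: fit once, persist the parameters, then transform new data using the stored values. A hedged sketch of the idea with a min-max normalization (field names and helpers here are illustrative, not GSProcessing's actual schema):

```python
import json


def fit_min_max(values: list[float]) -> dict:
    """Compute and record the parameters of a min-max normalization."""
    return {"transformation_name": "min-max", "min": min(values), "max": max(values)}


def apply_min_max(values: list[float], rep: dict) -> list[float]:
    """Re-apply the transformation from its stored representation."""
    lo, hi = rep["min"], rep["max"]
    return [(v - lo) / (hi - lo) for v in values]


rep = fit_min_max([0.0, 5.0, 10.0])
saved = json.dumps(rep)      # what a precomputed-transformations file might hold
reloaded = json.loads(saved)
print(apply_min_max([2.5, 12.0], reloaded))
```

Because the parameters are data-independent once fit, values unseen during training (like ``12.0`` here) are still mapped consistently, unlike the Rank-Gauss case.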
@@ -67,7 +67,7 @@ def get_transformation_name() -> str:
return "DistBucketNumericalTransformation"

def apply(self, input_df: DataFrame) -> DataFrame:
imputed_df = apply_imputation(self.cols, self.shared_imputation, input_df)
imputed_df = apply_imputation(self.cols, self.shared_imputation, input_df).imputed_df
# TODO: Make range optional by getting min/max from data.
min_val, max_val = self.range

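After imputation, the bucket-numerical transformation maps each value into one of a fixed number of equal-width buckets over ``[min_val, max_val]``. A minimal sketch of that bucketing step (assumed behavior in plain Python, not the actual Spark implementation):

```python
def bucket_index(value: float, min_val: float, max_val: float, num_buckets: int) -> int:
    """Map a value to an equal-width bucket index in [0, num_buckets - 1]."""
    width = (max_val - min_val) / num_buckets
    idx = int((value - min_val) / width)
    # Clamp so that max_val lands in the last bucket instead of overflowing.
    return max(0, min(idx, num_buckets - 1))


print([bucket_index(v, 0.0, 10.0, 5) for v in [0.0, 4.9, 10.0]])  # → [0, 2, 4]
```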