"No Datetime" results caused by different processors transforming the same datetime columns twice. #248
Comments
Only ignoring it in DatetimeFormatter is not a good solution. How can we do better?
Maybe reversing the processing order for the inverse transform is a better solution? FixedCombiningProcessor is applied first when transforming, so it should be applied last when inverse-transforming.
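The ordering argument above can be illustrated with a minimal sketch. The two toy processors below are hypothetical stand-ins, not sdgx classes; they only show that when transforms are applied in the order A then B, the inverse has to be applied as B then A to recover the original value.

```python
from datetime import datetime, timezone


class ToyDatetimeProcessor:
    """Hypothetical stand-in: date string <-> integer UTC timestamp."""

    fmt = "%Y-%m-%d"

    def transform(self, value: str) -> int:
        parsed = datetime.strptime(value, self.fmt).replace(tzinfo=timezone.utc)
        return int(parsed.timestamp())

    def inverse(self, value: int) -> str:
        return datetime.fromtimestamp(value, tz=timezone.utc).strftime(self.fmt)


class ToyOffsetProcessor:
    """Hypothetical stand-in: shifts an integer by a fixed offset."""

    def transform(self, value: int) -> int:
        return value + 1000

    def inverse(self, value: int) -> int:
        return value - 1000


def run_forward(processors, value):
    # Apply each transform in registration order.
    for processor in processors:
        value = processor.transform(value)
    return value


def run_inverse(processors, value):
    # Undo the transforms in the opposite order: last applied, first undone.
    for processor in reversed(processors):
        value = processor.inverse(value)
    return value


processors = [ToyDatetimeProcessor(), ToyOffsetProcessor()]
encoded = run_forward(processors, "2024-01-15")
restored = run_inverse(processors, encoded)
```

Running the inverse steps in forward order instead would hand an offset integer to the datetime step before the offset is removed, so the round trip would no longer return the original date.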
Additionally, might it be better to use a manager that arranges the processor instances and provides methods to transform and inverse-transform, compared to using |
Hi, @cyantangerine. Can you support |
FYI, sdg now supports the user passing the

@classmethod
def from_dataframe(
    cls,
    df: pd.DataFrame,
    include_inspectors: list[str] | None = None,
    exclude_inspectors: list[str] | None = None,
    inspector_init_kwargs: dict[str, Any] | None = None,
    check: bool = False,
) -> "Metadata":
    """Initialize a metadata from DataFrame and Inspectors

    Args:
        df(pd.DataFrame): the input DataFrame.
        include_inspectors(list[str]): data type inspectors used in this metadata (table).
        exclude_inspectors(list[str]): data type inspectors NOT used in this metadata (table).
        inspector_init_kwargs(dict): inspector args.
    """

Workaround
If you filter out by excluding |
bug.zip |
@Wh1isper What do you think about this? I think the original intention of SDG was to automate the processing pipeline, rather than requiring users to specify it manually.
@jalr4ever I don't think @cyantangerine meant that there should be no automated processing. With more and more processors, it seems necessary to introduce some processing rules and manage them. I'd be happy to see a |
As a side note, |
@cyantangerine I ran the code below (which you supplied), and the error didn't occur. Can you further clarify the bug's symptoms and your expectations?

from sdgx.data_connectors.dataframe_connector import DataFrameConnector
from sdgx.models.ml.single_table.ctgan import CTGANSynthesizerModel
from sdgx.synthesizer import Synthesizer
from sdgx.data_loader import DataLoader
from sdgx.data_models.metadata import Metadata
import pandas as pd
from faker import Faker

fake = Faker()
df = pd.read_csv("/tests/dataset/1.csv")
data_connector = DataFrameConnector(df)
data_loader = DataLoader(data_connector)
loan_metadata = Metadata.from_dataloader(data_loader)
loan_metadata.primary_keys = {"int"}
# Use a date-only format everywhere except the submission date column(s)
loan_metadata.datetime_format = {
    key: "%Y-%m-%d" if not key.startswith("Submission_TABLE_submission_date") else "%Y-%m-%d %H:%M:%S"
    for key in loan_metadata.datetime_columns
}
loan_metadata.categorical_threshold = {
    1: "label"
}
# Keep datetime columns out of the discrete columns
loan_metadata.discrete_columns = {
    key for key in loan_metadata.discrete_columns if key not in loan_metadata.datetime_format
}
# Initialize synthesizer, use CTGAN model
synthesizer = Synthesizer(
    metadata=loan_metadata,
    model=CTGANSynthesizerModel(epochs=1),
    data_connector=data_connector,
)
# Fit the model
synthesizer.fit()
# Sample
real_data = data_loader.load_all()
sampled_data = synthesizer.sample(100)
print(sampled_data)
@jalr4ever You can check whether FixedCombinationTransformer has taken effect.

from sdgx.data_processors.transformers.fixed_combination import FixedCombinationTransformer

fct: FixedCombinationTransformer = synthesizer.data_processors[1]
keys = fct.column_mappings.keys()
# Collect the mappings that involve at least one datetime column
mp = {}
for k in keys:
    if set(k) & loan_metadata.datetime_columns:
        mp[k] = fct.column_mappings[k]
mp

If the result is not empty, it has mapped datetime to int. Then check the datetime columns.

sampled_data[list(loan_metadata.datetime_columns)]
Description
Some date column values are converted unexpectedly in the sampled data when synthesizing data with CTGAN.

Reproduce
Run the code below with 1.csv:

from sdgx.data_connectors.dataframe_connector import DataFrameConnector
from sdgx.models.ml.single_table.ctgan import CTGANSynthesizerModel
from sdgx.synthesizer import Synthesizer
from sdgx.data_loader import DataLoader
from sdgx.data_models.metadata import Metadata
import pandas as pd
from faker import Faker

fake = Faker()
df = pd.read_csv("/tests/dataset/1.csv")
data_connector = DataFrameConnector(df)
data_loader = DataLoader(data_connector)
loan_metadata = Metadata.from_dataloader(data_loader)
loan_metadata.primary_keys = {"int"}
# Use a date-only format everywhere except the submission date column(s)
loan_metadata.datetime_format = {
    key: "%Y-%m-%d" if not key.startswith("Submission_TABLE_submission_date") else "%Y-%m-%d %H:%M:%S"
    for key in loan_metadata.datetime_columns
}
loan_metadata.categorical_threshold = {
    1: "label"
}
# Keep datetime columns out of the discrete columns
loan_metadata.discrete_columns = {
    key for key in loan_metadata.discrete_columns if key not in loan_metadata.datetime_format
}
# Initialize synthesizer, use CTGAN model
synthesizer = Synthesizer(
    metadata=loan_metadata,
    model=CTGANSynthesizerModel(epochs=1),
    data_connector=data_connector,
)
# Fit the model
synthesizer.fit()
# Sample
real_data = data_loader.load_all()
sampled_data = synthesizer.sample(100)
# Report every datetime column whose sampled values are all "No Datetime"
datetime_columns = sampled_data[list(loan_metadata.datetime_columns)]
invalid_date_columns = []
for column in datetime_columns:
    if (sampled_data[column] == "No Datetime").all():
        print(f'Column: {column} has all values as "No Datetime".')
        invalid_date_columns.append(column)
if not invalid_date_columns:
    print("All datetime columns have valid values.")

Expected behavior
Now it prints:
Expected print result:
Context
@cyantangerine I think that for To implement this approach, the What are your thoughts on this issue? |
@jalr4ever Yes, I agree with not using the FCT processor for datetime columns. It can use the metadata in its fit method to record the columns' types. But that doesn't solve the original problem. It's better to use a definite strategy to manage the order of fitting, transformation, and reverse transformation of the data processors, such as providing a manager, as the number and types of data processors increase. Meanwhile, parallelization techniques could be considered in the manager to optimize the performance of the data processors.
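As a rough illustration of the manager idea, here is a minimal sketch. `ProcessorManager`, its methods, and the processor and column names are all hypothetical and do not exist in sdgx; the sketch only shows one place where processing order and per-column ownership could be tracked so that doubly claimed columns are detected up front.

```python
from typing import Dict, List, Set


class ProcessorManager:
    """Hypothetical manager: keeps processors ordered and tracks column claims."""

    def __init__(self) -> None:
        self._order: List[str] = []
        self._claims: Dict[str, Set[str]] = {}

    def register(self, name: str, columns: Set[str]) -> None:
        # Registration order defines the fit/transform order.
        self._order.append(name)
        self._claims[name] = set(columns)

    def transform_order(self) -> List[str]:
        return list(self._order)

    def inverse_order(self) -> List[str]:
        # The inverse transform walks the processors back to front.
        return list(reversed(self._order))

    def conflicts(self) -> Dict[str, List[str]]:
        """Map each column claimed by more than one processor to its claimants."""
        owners: Dict[str, List[str]] = {}
        for name in self._order:
            for column in sorted(self._claims[name]):
                owners.setdefault(column, []).append(name)
        return {col: names for col, names in owners.items() if len(names) > 1}


manager = ProcessorManager()
manager.register("fixed_combination", {"start_date", "end_date", "amount"})
manager.register("datetime_formatter", {"start_date", "end_date"})
conflicts = manager.conflicts()
```

Here `conflicts` lists `start_date` and `end_date` as claimed by both processors, which is the kind of overlap this issue describes. A real manager could refuse such a configuration or assign each column to a single processor.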
The reason for this issue is that FCT and the date converter both processed certain columns, one after the other. If the order is reversed (using the date converter first), the data is usable. But I don't recommend using FCT for continuous values; it's better to use it only for discrete variables and integer data.
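The double-processing failure can be sketched in isolation. The function below is a simplified stand-in, not the sdgx implementation; it assumes the inverse step expects an integer timestamp and falls back to a placeholder string when conversion fails, as the "No Datetime" output suggests.

```python
from datetime import datetime, timezone


def format_timestamp(value, fmt="%Y-%m-%d"):
    """Simplified stand-in for a timestamp-to-string inverse step."""
    try:
        return datetime.fromtimestamp(value, tz=timezone.utc).strftime(fmt)
    except (TypeError, ValueError, OverflowError, OSError):
        # Assumed fallback: anything that is not a usable timestamp
        # becomes the placeholder seen in the sampled data.
        return "No Datetime"


# A column still holding timestamps converts cleanly.
from_timestamp = format_timestamp(1705276800)

# A column another processor already turned into strings cannot be
# converted a second time, so the placeholder is produced instead.
from_string = format_timestamp("2024-01-15")
```

With these inputs `from_timestamp` is `"2024-01-15"` and `from_string` is `"No Datetime"`, which matches the symptom of whole columns being filled with the placeholder.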
@cyantangerine Well, from the issue you mentioned, aside from the Point 1 Point 2 In summary, I think the first thing we should do is disable the automatic activation of the |
Description
When FixedCombinationTransformer and DatetimeFormatter are both used in data_processors, some date columns of the sampled data are filled entirely with "No Datetime". This happens because some (NOT ALL) date columns are matched by FixedCombinationTransformer and are transformed from timestamp to str. When DatetimeFormatter then performs its own transformation, the bug occurs because it is transforming a str (not a timestamp) to str.
Reproduce
Expected behavior
Context
Error message
Configuration
Additional context