NoDatetime result error because of duplicate transforming datetime columns by different processors. #248

cyantangerine · 2024-12-06T05:55:28Z

Description

When we both using FixedCombinationTransformer and DatetimeFormatter in data_processors, some date columns of sample data are full of No Datetime, because some (NOT ALL) date columns matched FixedCombinationTransformer, and are transformed from timestamp to str, thus when DatetimeFormatter transforming, the bug occred because of transforming str (not timestamp) to str.

Reproduce

Go to '...'
Click on '...'
Scroll down to '...'
See error '...'

Expected behavior

Context

Operating System and version:
Browser and version(if necessary):
Which version are you using:

Error message

Paste complete error message, logs, or stack traces here.

Configuration

Paste the contents of your configuration file here.

Additional context

Add any other context about the problem here.

The text was updated successfully, but these errors were encountered:

cyantangerine · 2024-12-06T06:15:03Z

Only ignoring it in DatetimeFormatter is not a good solution. How can do better?

cyantangerine · 2024-12-06T06:43:46Z

Maybe reversing the inversing order of processing is a better one? Because the FixedCombiningProcessor are using to transfrom first, so in inversing it should be the last.

cyantangerine · 2024-12-06T06:58:51Z

Additionally, may we use a manager to arrange the instance of processors and use it methods to transfrom or inverse transfrom is a better solution compare to using for d in processers cycle to processing.

jalr4ever · 2024-12-06T07:32:18Z

Hi, @cyantangerine.

Can you support Reproduce、Expected behavior、Context? So that we can quickly identify problems.

jalr4ever · 2024-12-06T07:38:36Z

FYI, we now in sdg support user passing the exclude_inspectors in Metadata:

    @classmethod
    def from_dataframe(
        cls,
        df: pd.DataFrame,
        include_inspectors: list[str] | None = None,
        exclude_inspectors: list[str] | None = None,
        inspector_init_kwargs: dict[str, Any] | None = None,
        check: bool = False,
    ) -> "Metadata":
        """Initialize a metadata from DataFrame and Inspectors

        Args:
            df(pd.DataFrame): the input DataFrame.
            include_inspectors(list[str]): data type inspectors used in this metadata (table).
            exclude_inspectors(list[str]): data type inspectors NOT used in this metadata (table).
            inspector_init_kwargs(dict): inspector args.
        """

Workaround

If you filter out by excluding FixedCombinationInspector, you may temporarily avoid this bug.

cyantangerine · 2024-12-06T08:46:39Z

bug.zip
This dataset can reproduct this bug. Have a try!

jalr4ever · 2024-12-06T10:11:15Z

Additionally, may we use a manager to arrange the instance of processors and use it methods to transfrom or inverse transfrom is a better solution compare to using for d in processers cycle to processing.

@Wh1isper How do think about this? I think the original intention of SDG was to automate the processing pipeline, rather than requiring users to specify it manually.

Wh1isper · 2024-12-06T10:20:03Z

@jalr4ever I don't think @cyantangerine meant no automated processing, with more and more processors it seems necessary to introduce some processing rules and manage them.

I'd be happy to see a manager for processor and we should use it in Synthesizer

Wh1isper · 2024-12-06T10:23:42Z

As a side note, sdg was initially designed to be used both directly through Synthesizer and as a lib for users to design their own Synthesizer. Any improvements that help achieve this are encouraged!

jalr4ever · 2024-12-18T08:56:52Z

@cyantangerine I have ran the code below(which you supplied), and the error didn't occurred. Can you further clarify the bug's symptoms and your expectations?

from sdgx.data_connectors.dataframe_connector import DataFrameConnector
from sdgx.models.ml.single_table.ctgan import CTGANSynthesizerModel
from sdgx.synthesizer import Synthesizer
from sdgx.data_loader import DataLoader
from sdgx.data_models.metadata import Metadata
import pandas as pd
from faker import Faker

fake = Faker()
df = pd.read_csv("/tests/dataset/1.csv")
data_connector = DataFrameConnector(df)
data_loader = DataLoader(data_connector)
loan_metadata = Metadata.from_dataloader(data_loader)

loan_metadata.primary_keys = {"int"}
loan_metadata.datetime_format = {
    key: "%Y-%m-%d" if not key.startswith("Submission_TABLE_submission_date") else "%Y-%m-%d %H:%M:%S" for key in
    loan_metadata.datetime_columns
}
loan_metadata.categorical_threshold = {
    1: "label"
}
loan_metadata.discrete_columns = {
    key for key in loan_metadata.discrete_columns if key not in loan_metadata.datetime_format
}
# Initialize synthesizer, use CTGAN model
synthesizer = Synthesizer(
    metadata=loan_metadata,
    model=CTGANSynthesizerModel(epochs=1),
    data_connector=data_connector,
)
# Fit the model
synthesizer.fit()
# Sample
real_data = data_loader.load_all()
sampled_data = synthesizer.sample(100)
print(sampled_data)

cyantangerine · 2024-12-18T09:15:53Z

@jalr4ever You can have a check for whether FixedCombinationTransformer has been worked.

from sdgx.data_processors.transformers.fixed_combination import FixedCombinationTransformer

fct: FixedCombinationTransformer = synthesizer.data_processors[1]
keys = fct.column_mappings.keys()
mp = {}
for k in keys:
    if set(k) & loan_metadata.datetime_columns:
        mp[k]= fct.column_mappings[k]
mp

If the result is not empty, it has mapped datetime to int.

Then check the datetime columns.

sampled_data[list(loan_metadata.datetime_columns)]

jalr4ever · 2024-12-18T10:55:50Z

Okay, now let’s clarify the misconceptions and expectations.

Description

Some of date columns value convert unexpectedly in sample data when synthetic data in CTGAN.

Reproduce

Running the code below with 1.csv:

  from sdgx.data_connectors.dataframe_connector import DataFrameConnector
  from sdgx.models.ml.single_table.ctgan import CTGANSynthesizerModel
  from sdgx.synthesizer import Synthesizer
  from sdgx.data_loader import DataLoader
  from sdgx.data_models.metadata import Metadata
  import pandas as pd
  from faker import Faker

  fake = Faker()
  df = pd.read_csv("/tests/dataset/1.csv")
  data_connector = DataFrameConnector(df)
  data_loader = DataLoader(data_connector)
  loan_metadata = Metadata.from_dataloader(data_loader)

  loan_metadata.primary_keys = {"int"}
  loan_metadata.datetime_format = {
      key: "%Y-%m-%d" if not key.startswith("Submission_TABLE_submission_date") else "%Y-%m-%d %H:%M:%S" for key in
      loan_metadata.datetime_columns
  }
  loan_metadata.categorical_threshold = {
      1: "label"
  }
  loan_metadata.discrete_columns = {
      key for key in loan_metadata.discrete_columns if key not in loan_metadata.datetime_format
  }
  # Initialize synthesizer, use CTGAN model
  synthesizer = Synthesizer(
      metadata=loan_metadata,
      model=CTGANSynthesizerModel(epochs=1),
      data_connector=data_connector,
  )
  # Fit the model
  synthesizer.fit()
  # Sample
  real_data = data_loader.load_all()
  sampled_data = synthesizer.sample(100)
  datetime_columns = sampled_data[list(loan_metadata.datetime_columns)]

  invalid_date_columns = []
  for column in datetime_columns:
      if (sampled_data[column] == "No Datetime").all():
          print(f'Column: {column} has all values as "No Datetime".')
          invalid_date_columns.append(column)
  if not invalid_date_columns:
      print("All datetime columns have valid values.")

Expected behavior

Now it print:

Column: Student_TABLE_date_of_birth_COPY0, Row count: 100
Column: BookLoan_TABLE_return_date, Row count: 100
Column: BookLoan_TABLE_loan_date, Row count: 100
Column: Submission_TABLE_submission_date, Row count: 100
Column: BookLoan_TABLE_loan_date_COPY0, Row count: 100
Column: Submission_TABLE_submission_date_COPY1, Row count: 100

Expected print result:

All datetime columns have valid values.

Context

Operating System and version: 0.2.4
Browser and version(if necessary):
Which version are you using:

Error message

Paste complete error message, logs, or stack traces here.

Configuration

Paste the contents of your configuration file here.

Additional context

Add any other context about the problem here.

jalr4ever · 2024-12-18T11:40:09Z

@cyantangerine I think that for FixedCombinationInspector, if it is a datetime column, it should not be treated as an categorical column for relationship detection.

To implement this approach, the FixedCombinationInspector needs to detect datetime_columns in order to filter combinations. Somehow makes this logic is redundant. It would be better to have an inspector_context passed into the FixedCombinationInspector so that it can access datetime_columns.

What are your thoughts on this issue?

cyantangerine · 2024-12-18T11:54:13Z

@jalr4ever Yes. I am agree to not using FCT processor for datetime columns.

It can using metadata in fit method to record columns' type.

But, it can't solve the problem in original. It's better to use a certain strategy to manage the order of fitting, transformation, and reverse transformation of data processors, such as providing a manager as the number and types of data processors increase. Meanwhile, parallelization techniques can be considered in the manager to optimize the performance of the data processor.

cyantangerine · 2024-12-18T11:59:43Z

The reason for this issue is that FCT and date converter processed certain columns twice in sequence. But if the order is reversed (using a date converter first), the data is feasible. But I don't recommend using FCT for continuous values, it's better to only use it for discrete variables and integer data.

jalr4ever · 2024-12-18T12:33:58Z

@cyantangerine Well, from the issue you mentioned, aside from the manager with the processor in sdg, I've alsl realized:

Point 1
The issue with the automatic detection combinations algorithm. Now the FixedCombinationInspector's detection of combinations through covariance and one-to-one correspondence algorithms is too idealistic. it only applies to the "ideal data" presented in the papers, while real-world data is much more complex. I think one approach we could try in the short term is to integrate a powerful 3B LLM model into the SDG for automatic combination detection.

Point 2
Default enbale behavior. for combination problems, we provide a manually solution with SpecificCombinationTransformer , I believe that for the automatic detection feature, it shall not be enabled by default, but offer an option for users to enable it.

In summary, I think the first thing we should do is disable the automatic activation of the FixedCombinationInspector, and then provide an option to enable it based on the idea (see point 2). This can proceed in parallel with the algorithm and engineering optimizations we've discussed, as they are independent of each other.

cyantangerine added the bug Something isn't working label Dec 6, 2024

cyantangerine mentioned this issue Dec 17, 2024

Model fit error when replicating example notebook sdgx_example_ctgan.ipynb #249

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NoDatetime result error because of duplicate transforming datetime columns by different processors. #248

NoDatetime result error because of duplicate transforming datetime columns by different processors. #248

cyantangerine commented Dec 6, 2024

cyantangerine commented Dec 6, 2024

cyantangerine commented Dec 6, 2024

cyantangerine commented Dec 6, 2024

jalr4ever commented Dec 6, 2024

jalr4ever commented Dec 6, 2024 •

edited

Loading

cyantangerine commented Dec 6, 2024

jalr4ever commented Dec 6, 2024

Wh1isper commented Dec 6, 2024 •

edited

Loading

Wh1isper commented Dec 6, 2024

jalr4ever commented Dec 18, 2024

cyantangerine commented Dec 18, 2024 •

edited

Loading

jalr4ever commented Dec 18, 2024

jalr4ever commented Dec 18, 2024

cyantangerine commented Dec 18, 2024

cyantangerine commented Dec 18, 2024

jalr4ever commented Dec 18, 2024

NoDatetime result error because of duplicate transforming datetime columns by different processors. #248

NoDatetime result error because of duplicate transforming datetime columns by different processors. #248

Comments

cyantangerine commented Dec 6, 2024

Description

Reproduce

Expected behavior

Context

cyantangerine commented Dec 6, 2024

cyantangerine commented Dec 6, 2024

cyantangerine commented Dec 6, 2024

jalr4ever commented Dec 6, 2024

jalr4ever commented Dec 6, 2024 • edited Loading

cyantangerine commented Dec 6, 2024

jalr4ever commented Dec 6, 2024

Wh1isper commented Dec 6, 2024 • edited Loading

Wh1isper commented Dec 6, 2024

jalr4ever commented Dec 18, 2024

cyantangerine commented Dec 18, 2024 • edited Loading

jalr4ever commented Dec 18, 2024

Description

Reproduce

Expected behavior

Context

jalr4ever commented Dec 18, 2024

cyantangerine commented Dec 18, 2024

cyantangerine commented Dec 18, 2024

jalr4ever commented Dec 18, 2024

jalr4ever commented Dec 6, 2024 •

edited

Loading

Wh1isper commented Dec 6, 2024 •

edited

Loading

cyantangerine commented Dec 18, 2024 •

edited

Loading