Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NoDatetime result error because of duplicate transforming datetime columns by different processors. #248

Open
cyantangerine opened this issue Dec 6, 2024 · 16 comments
Labels
bug Something isn't working

Comments

@cyantangerine
Copy link
Contributor

Description

When we both using FixedCombinationTransformer and DatetimeFormatter in data_processors, some date columns of sample data are full of No Datetime, because some (NOT ALL) date columns matched FixedCombinationTransformer, and are transformed from timestamp to str, thus when DatetimeFormatter transforming, the bug occred because of transforming str (not timestamp) to str.

Reproduce

  1. Go to '...'
  2. Click on '...'
  3. Scroll down to '...'
  4. See error '...'

Expected behavior

Context

  • Operating System and version:
  • Browser and version(if necessary):
  • Which version are you using:
Error message
Paste complete error message, logs, or stack traces here.
Configuration
Paste the contents of your configuration file here.
Additional context
Add any other context about the problem here.
@cyantangerine cyantangerine added the bug Something isn't working label Dec 6, 2024
@cyantangerine
Copy link
Contributor Author

Only ignoring it in DatetimeFormatter is not a good solution. How can do better?

@cyantangerine
Copy link
Contributor Author

Maybe reversing the inversing order of processing is a better one? Because the FixedCombiningProcessor are using to transfrom first, so in inversing it should be the last.

@cyantangerine
Copy link
Contributor Author

Additionally, may we use a manager to arrange the instance of processors and use it methods to transfrom or inverse transfrom is a better solution compare to using for d in processers cycle to processing.

@jalr4ever
Copy link
Collaborator

Hi, @cyantangerine.

Can you support ReproduceExpected behaviorContext? So that we can quickly identify problems.

@jalr4ever
Copy link
Collaborator

jalr4ever commented Dec 6, 2024

FYI, we now in sdg support user passing the exclude_inspectors in Metadata:

    @classmethod
    def from_dataframe(
        cls,
        df: pd.DataFrame,
        include_inspectors: list[str] | None = None,
        exclude_inspectors: list[str] | None = None,
        inspector_init_kwargs: dict[str, Any] | None = None,
        check: bool = False,
    ) -> "Metadata":
        """Initialize a metadata from DataFrame and Inspectors

        Args:
            df(pd.DataFrame): the input DataFrame.
            include_inspectors(list[str]): data type inspectors used in this metadata (table).
            exclude_inspectors(list[str]): data type inspectors NOT used in this metadata (table).
            inspector_init_kwargs(dict): inspector args.
        """

Workaround

If you filter out by excluding FixedCombinationInspector, you may temporarily avoid this bug.

@cyantangerine
Copy link
Contributor Author

bug.zip
This dataset can reproduct this bug. Have a try!

@jalr4ever
Copy link
Collaborator

Additionally, may we use a manager to arrange the instance of processors and use it methods to transfrom or inverse transfrom is a better solution compare to using for d in processers cycle to processing.

@Wh1isper How do think about this? I think the original intention of SDG was to automate the processing pipeline, rather than requiring users to specify it manually.

@Wh1isper
Copy link
Collaborator

Wh1isper commented Dec 6, 2024

@jalr4ever I don't think @cyantangerine meant no automated processing, with more and more processors it seems necessary to introduce some processing rules and manage them.

I'd be happy to see a manager for processor and we should use it in Synthesizer

@Wh1isper
Copy link
Collaborator

Wh1isper commented Dec 6, 2024

As a side note, sdg was initially designed to be used both directly through Synthesizer and as a lib for users to design their own Synthesizer. Any improvements that help achieve this are encouraged!

@jalr4ever
Copy link
Collaborator

@cyantangerine I have ran the code below(which you supplied), and the error didn't occurred. Can you further clarify the bug's symptoms and your expectations?

from sdgx.data_connectors.dataframe_connector import DataFrameConnector
from sdgx.models.ml.single_table.ctgan import CTGANSynthesizerModel
from sdgx.synthesizer import Synthesizer
from sdgx.data_loader import DataLoader
from sdgx.data_models.metadata import Metadata
import pandas as pd
from faker import Faker

fake = Faker()
df = pd.read_csv("/tests/dataset/1.csv")
data_connector = DataFrameConnector(df)
data_loader = DataLoader(data_connector)
loan_metadata = Metadata.from_dataloader(data_loader)

loan_metadata.primary_keys = {"int"}
loan_metadata.datetime_format = {
    key: "%Y-%m-%d" if not key.startswith("Submission_TABLE_submission_date") else "%Y-%m-%d %H:%M:%S" for key in
    loan_metadata.datetime_columns
}
loan_metadata.categorical_threshold = {
    1: "label"
}
loan_metadata.discrete_columns = {
    key for key in loan_metadata.discrete_columns if key not in loan_metadata.datetime_format
}
# Initialize synthesizer, use CTGAN model
synthesizer = Synthesizer(
    metadata=loan_metadata,
    model=CTGANSynthesizerModel(epochs=1),
    data_connector=data_connector,
)
# Fit the model
synthesizer.fit()
# Sample
real_data = data_loader.load_all()
sampled_data = synthesizer.sample(100)
print(sampled_data)

@cyantangerine
Copy link
Contributor Author

cyantangerine commented Dec 18, 2024

@jalr4ever You can have a check for whether FixedCombinationTransformer has been worked.

from sdgx.data_processors.transformers.fixed_combination import FixedCombinationTransformer

fct: FixedCombinationTransformer = synthesizer.data_processors[1]
keys = fct.column_mappings.keys()
mp = {}
for k in keys:
    if set(k) & loan_metadata.datetime_columns:
        mp[k]= fct.column_mappings[k]
mp

If the result is not empty, it has mapped datetime to int.
image

Then check the datetime columns.

sampled_data[list(loan_metadata.datetime_columns)]

image

@jalr4ever
Copy link
Collaborator

Okay, now let’s clarify the misconceptions and expectations.

Description

Some of date columns value convert unexpectedly in sample data when synthetic data in CTGAN.

Reproduce

Running the code below with 1.csv:

  from sdgx.data_connectors.dataframe_connector import DataFrameConnector
  from sdgx.models.ml.single_table.ctgan import CTGANSynthesizerModel
  from sdgx.synthesizer import Synthesizer
  from sdgx.data_loader import DataLoader
  from sdgx.data_models.metadata import Metadata
  import pandas as pd
  from faker import Faker

  fake = Faker()
  df = pd.read_csv("/tests/dataset/1.csv")
  data_connector = DataFrameConnector(df)
  data_loader = DataLoader(data_connector)
  loan_metadata = Metadata.from_dataloader(data_loader)

  loan_metadata.primary_keys = {"int"}
  loan_metadata.datetime_format = {
      key: "%Y-%m-%d" if not key.startswith("Submission_TABLE_submission_date") else "%Y-%m-%d %H:%M:%S" for key in
      loan_metadata.datetime_columns
  }
  loan_metadata.categorical_threshold = {
      1: "label"
  }
  loan_metadata.discrete_columns = {
      key for key in loan_metadata.discrete_columns if key not in loan_metadata.datetime_format
  }
  # Initialize synthesizer, use CTGAN model
  synthesizer = Synthesizer(
      metadata=loan_metadata,
      model=CTGANSynthesizerModel(epochs=1),
      data_connector=data_connector,
  )
  # Fit the model
  synthesizer.fit()
  # Sample
  real_data = data_loader.load_all()
  sampled_data = synthesizer.sample(100)
  datetime_columns = sampled_data[list(loan_metadata.datetime_columns)]

  invalid_date_columns = []
  for column in datetime_columns:
      if (sampled_data[column] == "No Datetime").all():
          print(f'Column: {column} has all values as "No Datetime".')
          invalid_date_columns.append(column)
  if not invalid_date_columns:
      print("All datetime columns have valid values.")

Expected behavior

Now it print:

Column: Student_TABLE_date_of_birth_COPY0, Row count: 100
Column: BookLoan_TABLE_return_date, Row count: 100
Column: BookLoan_TABLE_loan_date, Row count: 100
Column: Submission_TABLE_submission_date, Row count: 100
Column: BookLoan_TABLE_loan_date_COPY0, Row count: 100
Column: Submission_TABLE_submission_date_COPY1, Row count: 100

Expected print result:

All datetime columns have valid values.

Context

  • Operating System and version: 0.2.4
  • Browser and version(if necessary):
  • Which version are you using:
Error message
Paste complete error message, logs, or stack traces here.
Configuration
Paste the contents of your configuration file here.
Additional context
Add any other context about the problem here.

@jalr4ever
Copy link
Collaborator

@cyantangerine I think that for FixedCombinationInspector, if it is a datetime column, it should not be treated as an categorical column for relationship detection.

To implement this approach, the FixedCombinationInspector needs to detect datetime_columns in order to filter combinations. Somehow makes this logic is redundant. It would be better to have an inspector_context passed into the FixedCombinationInspector so that it can access datetime_columns.

What are your thoughts on this issue?

@cyantangerine
Copy link
Contributor Author

@jalr4ever Yes. I am agree to not using FCT processor for datetime columns.

It can using metadata in fit method to record columns' type.

But, it can't solve the problem in original. It's better to use a certain strategy to manage the order of fitting, transformation, and reverse transformation of data processors, such as providing a manager as the number and types of data processors increase. Meanwhile, parallelization techniques can be considered in the manager to optimize the performance of the data processor.

@cyantangerine
Copy link
Contributor Author

The reason for this issue is that FCT and date converter processed certain columns twice in sequence. But if the order is reversed (using a date converter first), the data is feasible. But I don't recommend using FCT for continuous values, it's better to only use it for discrete variables and integer data.

@jalr4ever
Copy link
Collaborator

@cyantangerine Well, from the issue you mentioned, aside from the manager with the processor in sdg, I've alsl realized:

Point 1
The issue with the automatic detection combinations algorithm. Now the FixedCombinationInspector's detection of combinations through covariance and one-to-one correspondence algorithms is too idealistic. it only applies to the "ideal data" presented in the papers, while real-world data is much more complex. I think one approach we could try in the short term is to integrate a powerful 3B LLM model into the SDG for automatic combination detection.

Point 2
Default enbale behavior. for combination problems, we provide a manually solution with SpecificCombinationTransformer , I believe that for the automatic detection feature, it shall not be enabled by default, but offer an option for users to enable it.

In summary, I think the first thing we should do is disable the automatic activation of the FixedCombinationInspector, and then provide an option to enable it based on the idea (see point 2). This can proceed in parallel with the algorithm and engineering optimizations we've discussed, as they are independent of each other.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants