Description
Towards the goal of adding support for computing statistics over structured data (e.g., arbitrary protocol buffers, parquet data), the GenerateStatistics API will take Arrow tables as input instead of Dict[FeatureName, ndarray]. The API will only accept Arrow tables whose columns are ListArray of primitive types (e.g., int8, int16, int32, int64, uint8, uint16, uint32, uint64, float16, float32, float64, binary, string, unicode).
This change should be a no-op if you construct the pipeline using the default decoders (e.g., tfdv.DecodeTFExample and tfdv.DecodeCSV) or if you are using the utility methods to generate statistics (e.g., tfdv.generate_statistics_from_tfrecord, tfdv.generate_statistics_from_csv, and tfdv.generate_statistics_from_dataframe).
TFDV 0.14 will have this new behavior. Let us know if you have any issues with migrating to the new API.