17 Mar 14:37

pedro93

cbe0334

v1.0.0 Latest

DataHub v1.0.0

Release Highlights

DataHub v1.0.0 is packed with exciting updates, including:

A completely redesigned user experience focused on simplified navigation and a visually stunning interface.
Unified support for Data & AI, including AI Model Group Versions, AI Model Lineage, Model Stats, and Experiment/Run ingestion.
DataHub Iceberg Catalog, allowing users to manage Iceberg tables directly from DataHub.

Read the blog post here!

Changelog

New User Interface: Putting Usability First

With a completely redesigned user interface, DataHub v1.0 represents a fundamental rethinking of how users interact with their metadata and data assets. The new experience includes:

Intuitive Platform-Based Navigation - Hierarchically browse data by database and schema in Snowflake, BigQuery, Redshift, Databricks, and more. Combine hierarchical navigation with filtering by data owners, domain, tags, and glossary terms to find the right data fast.
Seamless Lineage Exploration - Our reimagined lineage view features multi-level expansion, name-based search, and column-level visibility, making it easier than ever to understand data relationships and impact.
Integrated Data Quality - Make confident decisions with deeply integrated quality signals throughout the platform, helping you quickly identify and trust reliable data assets.

DataHub admins can enable the new UI for all users by setting the THEME_V2_DEFAULT environment variable to true. Until then, individual users can opt into the new experience by navigating to Settings > Appearance > Try New User Experience.
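As a minimal sketch of the admin-side setting, the snippet below renders an env-file line for the THEME_V2_DEFAULT variable named above. The helper name `render_env_file` is hypothetical, and how the resulting line reaches your deployment (Docker env file, Helm values, and so on) is an assumption to adapt to your own setup.

```python
from typing import Dict


def render_env_file(settings: Dict[str, object]) -> str:
    """Render KEY=value lines, writing booleans as lowercase true/false."""
    lines = []
    for key, value in settings.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# THEME_V2_DEFAULT comes from the release notes; the helper is illustrative only.
env_text = render_env_file({"THEME_V2_DEFAULT": True})
print(env_text, end="")
```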

Comprehensive AI Asset Support: Unifying Data and AI

DataHub v1.0 treats AI assets as first-class citizens within the data ecosystem, allowing users to track their entire data-to-AI pipeline in one place.

Unified Search and Discovery: Seamlessly search across models, model groups, and traditional data assets in one unified interface.
Advanced Versioning System: Track multiple versions of datasets and ML models with detailed performance metrics and clear linkages between versions.
Rich Model Statistics: Monitor key metrics across versions, understand performance trends, and make data-driven decisions about model deployment.
End-to-End Lineage: Trace data flows from raw inputs through models to final outputs, with complete versioning support.

DataHub Iceberg REST Catalog Beta: Simplifying Data Lake Management

This release introduces an integration with Apache Iceberg, allowing users to manage Iceberg tables directly through DataHub. It lets you:

Create and manage Iceberg tables through DataHub
Maintain consistent metadata across DataHub and Iceberg
Facilitate data discovery by exposing Iceberg table metadata in DataHub
Enable secure access to Iceberg tables through DataHub's permissions model

Read the docs here!

DataHub CLI

This release introduces the following improvements to our CLI:

Added a container command to apply tags, terms, and owners to all assets within a container. [#12418, #12436]
Improved the delete command to optionally reference a file with a list of URNs to be deleted. [#12247]
Expanded ingest command to support ingesting MCPs from S3. [#12649]
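To illustrate the delete command improvement above, here is a rough sketch of preparing a file of URNs. The one-URN-per-line layout, the file name, and the helpers `dataset_urn` and `write_urn_file` are assumptions for illustration; check the delete command's help output for the exact option name and expected file format.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN string (hypothetical helper for this example)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def write_urn_file(urns: Iterable[str], path: Path) -> int:
    """Write de-duplicated URNs, one per line (assumed format); return the count."""
    unique = list(dict.fromkeys(urns))  # drops repeats, keeps first-seen order
    path.write_text("\n".join(unique) + "\n", encoding="utf-8")
    return len(unique)


urns = [
    dataset_urn("snowflake", "analytics.public.orders"),
    dataset_urn("snowflake", "analytics.public.orders"),  # duplicate on purpose
    dataset_urn("snowflake", "analytics.public.customers"),
]
urn_file = Path(tempfile.mkdtemp()) / "urns_to_delete.txt"
written = write_urn_file(urns, urn_file)
print(f"wrote {written} URNs to {urn_file}")
```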

Metadata Ingestion

We’re continuously improving our integrations to add new capabilities and squash bugs.

dbt: Added the parameter include_database_name to support including the database name in URN generation. [#12411]
Iceberg: Alongside our new Iceberg Catalog API, we’ve made various improvements to our Iceberg integration. [#12744]
MLFlow: Significantly revamped our MLFlow connector, adding support for tracking Model Group Versions and Model Stats; tracking Model lineage to underlying datasets; and capturing Experiments and Runs.
MSSQL: Improved support for extracting stored procedures from MS SQL. [#12244, #12563]
Oracle: Improved the accuracy of column-level lineage resolution.
PowerBI: Improved lineage mapping so PowerBI Reports can now contain PowerBI Dashboards. [#12451]
Redshift: Added support for data shares and external schemas, including automatic lineage resolution across Redshift namespaces.
S3: Added functionality to the S3 ingestion process to ignore paths that do not match the specified depth, resolving warning messages triggered by mismatched paths. [#12326]
Snowflake: Added support for Snowflake Streams and Hybrid Tables, and fixed a bug with lineage resolution across table renames. [#12318]
Superset (community contribution!): Added support for Superset virtual datasets and lineage. [#12679]
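As a sketch of where the new dbt parameter from the list above would sit, the snippet below builds an ingestion recipe as a plain Python dictionary and prints it as JSON for inspection. The file paths, target platform, and server URL are placeholder assumptions, and the value chosen for include_database_name is only an example; consult the dbt source documentation for its default and exact effect.

```python
import json

# Placeholder values throughout; only include_database_name comes from these notes.
recipe = {
    "source": {
        "type": "dbt",
        "config": {
            "manifest_path": "./target/manifest.json",
            "catalog_path": "./target/catalog.json",
            "target_platform": "postgres",
            # Controls whether the database name is part of generated URNs.
            "include_database_name": True,
        },
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},
    },
}

recipe_text = json.dumps(recipe, indent=2)
print(recipe_text)
```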

Additionally, we’re working on a new integration with Vertex AI. Please reach out if you’re interested in joining the beta.

Of course, this only scratches the surface of changes. This release contains 100+ improvements across 25 different integrations.

Thank You to our Contributors!

View the full changelog: v0.15.0.1...v1.0.0

Contributors

sgomezvillamor, shirshanka, and 47 other contributors

13 Mar 09:53

pedro93

v1.0.0rc5

68cc2fa

v1.0.0rc5 Pre-release

Full Changelog: v1.0.0rc4...v1.0.0rc5

12 Mar 10:33

pedro93

v1.0.0rc4

1abac3c

v1.0.0rc4 Pre-release

Full Changelog: v1.0.0rc3...v1.0.0rc4

04 Mar 18:14

pedro93

v1.0.0rc3

6097820

v1.0.0rc3 Pre-release

What's Changed

fix(filters) Fix autocomplete for platforms and improve advanced search builder by @chriscollins3456 in #12560
fix(ingest): handle groups in pattern_cleanup_ownership transformer by @cccs-cat001 in #12536
tests(druid): integration tests for druid ingestion by @sgomezvillamor in #12717
feat(api): let admins use granted privileges for actors by @anshbansal in #12718
feat(build): use pull_request_target for datahub-wheels by @hsheth2 in #12722
feat(ui): access management docs by @kevinkarchacryl in #12719
fix(lineage): error message for edit lineage by @anshbansal in #12724
docs: clarify limits on AI docs by @hsheth2 in #12728
fix(urn-validation): additional test cases for urn validation by @david-leifker in #12727
fix(ui) Fix NPE in pluralize function by @chriscollins3456 in #12629
Fix platform instance support on Druid ingestion by @Rasnar in #12716
ci(coverage): update patch coverage threshold by @chakru-r in #12733
fix(ui) Fix bug with date dropdown in deprecation modal by @chriscollins3456 in #12633
fix(ui) Fix group membership inconsistencies on group page by @chriscollins3456 in #12704
fix(ui) Properly get display name when downloading search results by @chriscollins3456 in #12720
fix(ingest): bump avro dep by @hsheth2 in #12729
fix(ui) Filter healthy assets out of unhealthy upstreams component by @chriscollins3456 in #12705
docs: update slack link by @hsheth2 in #12731
fix(build): support datahub-wheels from forked PRs by @hsheth2 in #12730
docs: add scarf integration by @hsheth2 in #12739
fix(iceberg-cli): add missing filter for iceberg dataplatform by @chakru-r in #12732
dev: immutable args remove by @anshbansal in #12735
build(deps): bump dompurify from 2.5.4 to 3.2.4 in /datahub-web-react by @dependabot in #12643
refactor(ui): Migrate to use the new Button component consistently by @jjoyce0510 in #12597
docs(restore-indices): added best practices by @david-leifker in #12741
feat(ui/lineageV2): Show version pill in lineage sidebar and node by @asikowitz in #12599
chore(bump): Bump kafka-setup base by @david-leifker in #12743
dev: enable ruff rule by @anshbansal in #12742
revert(ci): revert datahub-wheel build changes by @hsheth2 in #12747
feat: API key support in Metabase source by @rajatgl17 in #12711
dev: enable ruff rule by @anshbansal in #12749
refactor(ingest/s3): enhance readability by @eagle-25 in #12686
feat(ingestion/superset): superset dataset lineage for metadata ingestion by @PeteMango in #12679
chore(ci): avoid dep on confluent-kafka 2.8.1 by @hsheth2 in #12753
feat(graphql): implement sort and facet for scroll by @david-leifker in #12746
feat(ingest): improve error messages for unknown metadata objects by @hsheth2 in #12745
fix(web) accurate error message for embeddedlistsearch by @jayacryl in #12622
feat(ingestion/iceberg): Several improvements to iceberg connector by @skrydal in #12744
fix(ingest): support pydantic v2 in file-based lineage by @hsheth2 in #12723
feat(iceberg): improve concurrency control and resilience by @ksrinath in #12664
docs(users+groups): show that you can set title via users YAML by @gabe-lyons in #12767
feat(sdk): add search client by @hsheth2 in #12754
feat(operations): ES and Kafka Operations Endpoints by @david-leifker in #12756
feat(auth): support guest access by @chakru-r in #12619
fix(iceberg): listnamespaces includes warehouse name as root by @chakru-r in #12761
feat(UI): make searchbar centered and wider by @v-tarasevich-blitz-brain in #12666
fix(ui) Fix order of parent containers on v2 autocomplete item by @chriscollins3456 in #12721
fix(test): handle empty log by @david-leifker in #12768
fix(lineage) Support views and sorting in impact analysis by @chriscollins3456 in #12769
feat(versioning): Support entity versioning ingestion by @asikowitz in #12755
fix(ui): add overflow wrap for dpi / model summary tab & add custom properties in mlmodelgroup queries by @yoonhyejin in #12771
feat(sdk): add support for institutional memory links by @hsheth2 in #12770

New Contributors

@Rasnar made their first contribution in #12716
@rajatgl17 made their first contribution in #12711
@PeteMango made their first contribution in #12679
@v-tarasevich-blitz-brain made their first contribution in #12666

Full Changelog: v1.0.0rc2...v1.0.0rc3

Contributors

sgomezvillamor, skrydal, and 19 other contributors

24 Feb 20:04

david-leifker

v1.0.0rc2

8dfd8fb

v1.0.0rc2 Pre-release

What's Changed

docs(ingest/mode): add details on authentication/permissions for mode by @hsheth2 in #12508
fix(ingest/snowflake): Create all structured property templates before assignation by @treff7es in #12469
docs: fix token to be not required in sample script by @yoonhyejin in #12511
fix(mssql): adds missing containers and browsepathsv2 for dataflow and datajob by @sgomezvillamor in #12483
fix(ingest/glue): change to warning on access denied by @anshbansal in #12519
fix(ingest/mode): remove unused field by @anshbansal in #12520
docs: fix link to executor helm chart by @anshbansal in #12522
fix(ingest): add missing dep for gcs by @hsheth2 in #12505
docs(entity-change-events): add docs for action request events by @gabe-lyons in #12493
docs(ingest): script to add ERModelRelationship Entity by @sagar-salvi-apptware in #12473
refactor(trace-model): refactor trace model package by @david-leifker in #12510
fix(ci): run smoke tests on release by @chakru-r in #12518
chore(bump): bump jmx version by @david-leifker in #12524
fix(cli): avoid false positive cli upgrade suggestions by @hsheth2 in #12497
fix(ingest/azure-ad): limit the size of the ingestion report by @hsheth2 in #12498
feat(metadata-io): enable rollback transaction support by @david-leifker in #12509
feat(snowflake): add missing pushdown_deny_usernames config to be used when use_queries_v2 by @sgomezvillamor in #12527
fix(model): fixes DashboardContainsDashboard relationship in DashboardInfo aspect by @sgomezvillamor in #12433
feat(restoreIndices): update restore indices args and docs by @RyanHolstien in #12529
fix(businessAttribute): fix business Attribute related entities by @deepgarg-visa in #12537
fix(ui): make data process instance visible in container in V2 & fix model/modelgroup names by @yoonhyejin in #12513
fix(ingest): avoid multiprocessing "fork" start method by @hsheth2 in #12543
fix(ui): revert backend breaking changes to mau by @kevinkarchacryl in #12461
tests(kafka-connect): fixes integration tests setup by @sgomezvillamor in #12531
fix(ingest/unity): add row count in table profile of delta tables by @mayurinehate in #12480
fix(ingest): use lossy collections by @anshbansal in #12523
fix(misc-openapi): fix openlineage, platform events & swagger by @david-leifker in #12539
fix(test): move reading env variable inside method by @anshbansal in #12549
feat(versioning): Add V2 UI; make backend more synchronous; add to component library by @asikowitz in #12542
docs(iceberg): add iceberg user guide by @chakru-r in #12533
feat(ingestion/snowflake):adds streams as a new dataset with lineage and properties. by @brock-acryl in #12318
feat(powerbi): Report to Dashboard lineage by @sgomezvillamor in #12451
fix(no-rows-updated): fix no rows updated by @david-leifker in #12530
ci(smoke): report smoke test results to codecov by @hsheth2 in #12556
feat(UI): Confirmation before deleting Link by @pinakipb2 in #12162
feat(ingestion/s3): ignore depth mismatched path by @eagle-25 in #12326
feat(docs-site) adding case studies and updating banner by @jayacryl in #12525
feat(ingestion/mongodb) re-order aggregation logic by @Haebuk in #12428
docs(salesforce): add missing salesforce source to cli doc by @remisalmon in #12550
feat(openapi): precondition exceptions return 412 by @david-leifker in #12552
feat(openapi): point in time parameter (elasticsearch only) by @david-leifker in #12553
fix(openapi-spec): fix openapi spec oneOf schema by @david-leifker in #12561
fix(autocomplete): fix autocomplete duplicate field by @david-leifker in #12558
build(deps): bump black from 23.7.0 to 24.3.0 in /metadata-service/iceberg-catalog by @dependabot in #12502
feat(sdk): add scaffolding for sdk v2 by @hsheth2 in #12554
doc(dbt): Add missing dbt extra requirement to cli doc by @remisalmon in #12568
feat(docs): Add live secret reload docs to k8s remote executor page by @pedro93 in #12541
fix(ingest): remove duplicate mcps,more typing by @mayurinehate in #12557
doc: update doc of first release by @anshbansal in #12574
fix(docs) explain need to restore indices when adding @searchable by @jayacryl in #12576
fix(sdk): fix platform instance generation in the sdk by @hsheth2 in #12573
fix(looker): sort user mapping for consistency by @hsheth2 in #12569
fix(ingestion/teradata): teradata profiling fix for pooling by @brock-acryl in #12507
fix(structuredProps) Add validation for allowedTypes and harden API for invalid types by @chriscollins3456 in #12578
fix(ui): better experience for analytics charts by @kevinkarchacryl in #12462
feat(ingest/mssql): improve stored procedure splitting by @hsheth2 in #12563
docs: add page on metadata standards by @hsheth2 in #12584
feat(gh-workflows) adding jayacryl to pr-labeler by @jayacryl in #12579
fix(iceberg): delete associated platform resources when deleting warehouse by @chakru-r in #12564
feat(ingest): add display name for dynamodb tables by @mayurinehate in #12534
fix(ui) Show editable field info for fields based on exact fieldPath version by @chriscollins3456 in #12570
fix(openapi-schema): fix openapi schema generator by @david-leifker in #12590
feat(ingestion/dbt): Add include_database_name parameter for dbt core by @svdimchenko in #12411
fix(web) ingestion page resets when filter updated by @jayacryl in #12589
dev: update pre-commit config by @anshbansal in #12592
feat(UI): add user location to user profile page by @samanthafigueredo5 in #12016
fix(graphql): Skip schema fields with empty fieldPaths to prevent the dataset mapper from erroring out by @jayasimhankv in #12562
feat(graphql,ui): Update ML system V2 UI by @asikowitz in #12598
fix(url-encoding): fix regression in url encoding by @david-leifker in #12601
fix(ingest/snowflake): order queries for queries_v2 by @hsheth2 in #12551
feat(ci): add pytest hooks for updating golden files by @hsheth2 in #12581
fix(ingest): pick topics from config for sink connector by @mayurinehate in #12535
doc: add note for subscription by @anshbansal in #12607
feat(okta): adds ingest_groups_users config parameter by @sgomezvillamor in #12371
feat(urn-validation): Add UrnValidation PDL annotation by @david-leifker in #12572
feat(search): include timestamp for entity metadata change by @deepgarg-visa in https://github.com/d...

Contributors

sgomezvillamor, treff7es, and 32 other contributors

30 Jan 15:01

david-leifker

v1.0.0rc1

a155470

v1.0.0rc1 Pre-release

fix(ci): disable ci telemetry modelDocUpload (#12504)

21 Jan 15:43

pedro93

v0.15.0.1

476df77

v0.15.0.1

DataHub v0.15.0.1 Release Notes

🎵 Listen to this release's theme song on Suno: Structured Flow
Shoutout to @DSchmidtDev for the genre inspo this round!

Structured Properties

- Added comprehensive support for managing structured properties, including creation, editing, deletion, and display preferences. Introduced timestamps for tracking creation and modification. [#12100, #11419]
- Enhanced property display options with badge styling, custom column types, and configurable visibility settings in asset sidebars and schema fields. [#12111, #12052]
- Added structured property filtering in UI with improved aggregation logic and entity metadata display. Introduced new property validators and display settings. [#12097, #12099]

UI Enhancements

- Enhanced container organization with parent hierarchy labels. [#11705]
- Added support for markdown in incident descriptions, enabling rich formatting capabilities. [#11759]
- Improved ingestion reporting with better visibility of successful ingestions with warnings. Enhanced browse paths display for business attributes and schema fields. [#11704, #11585]
- Added support for timeseries aspects in OpenAPI and customizable date range fields for Analytics charts. [#12096, #11366]

Authorization & Authentication

- Enabled authentication and API authorization by default, with support for URN-wildcard-based policies using STARTS_WITH condition. [#11484, #11441]
- Added authorization checks for managing Glossary terms, including privileges for ownership, domain management, and link actions. [#11337]
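The idea behind a STARTS_WITH condition for URN-wildcard-based policies can be sketched as simple prefix matching. This is an illustration of the concept only, not DataHub's policy engine; the `StartsWithRule` class and the example URNs are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StartsWithRule:
    """Hypothetical rule: in scope if the URN begins with any listed prefix."""

    prefixes: Tuple[str, ...]

    def matches(self, urn: str) -> bool:
        # str.startswith accepts a tuple and returns True if any prefix matches.
        return urn.startswith(self.prefixes)


rule = StartsWithRule(
    prefixes=("urn:li:dataset:(urn:li:dataPlatform:snowflake,finance.",)
)
in_scope = "urn:li:dataset:(urn:li:dataPlatform:snowflake,finance.public.invoices,PROD)"
out_of_scope = "urn:li:dataset:(urn:li:dataPlatform:snowflake,marketing.public.leads,PROD)"
print(rule.matches(in_scope), rule.matches(out_of_scope))
```

A prefix rule like this covers every asset under a database or schema without listing each URN individually.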

Metadata Ingestion

Ingestion Framework Improvements

Enhanced Data Source Support: Expanded ingestion capabilities for multiple platforms, including Superset (with dataset entities, schema fields, and column-level lineage), Feast (supporting tags and owners ingestion), Neo4j, and Cassandra. Added stateful ingestion support for file sources. [#11688, #11784, #11804, #11526, #11822]
SQL Processing Improvements: Replaced vulnerable sqlparse dependency with an in-house SQL parser, optimized CLL generation with reduced memory usage, and added special handling for MSSQL case sensitivity. Enhanced multi-query lineage support for Snowflake temporary tables. [#11645, #11708, #11920, #12020]
CLI Enhancements: Introduced new commands for managing ingestion, including listing source runs with filtering capabilities, undoing soft deletes with platform filtering, and listing structured properties. Added an offline flag to the SQL parser CLI. [#11740, #11980, #12012, #12283, #11635]
Ownership and Metadata Management: Extended ownership transformer capabilities across entities, improved glossary sync to preserve custom ownership types, and added support for multiple ownership types in glossaries and terms. Enhanced Forms CLI with additional filters for subtypes, platform instances, owners, tags, and glossary terms. [#11700, #11545, #12050, #10979]
Core Infrastructure Improvements: Implemented unique URN generation for all entities, added support for efficient entity ingestion through get_entity_as_mcps, improved empty field handling, and introduced progress reporting during ingestion. Added execution request cleanup job and support for dropping duplicate schema fields. [#11676, #11425, #11613, #12117, #11765, #12308]

Source-Specific Ingestion Improvements

Airflow

Upgraded infrastructure with support for Airflow 2.10, deprecated versions below 2.3, and improved template handling with Jinja support. Added configuration options for dag patterns and environment variables. [#11300, #11371, #11472, #11537, #11579, #12056]
Enhanced error handling and debugging with improved logging, fixed plugin stability issues on EMR, and added support for AthenaOperator lineage extraction. Introduced ability to disable plugin without restart. [#11857, #11877, #11880, #12098]

BigQuery

Enhanced data modeling capabilities with support for foreign/primary keys, BigLake tables, and improved handling of external tables. Added support for region qualifiers and partition management. [#11686, #11728, #11874, #11940]
Improved lineage tracking with GCS data source support and optimized query performance. Added platform resource entity generation from BigQuery labels. [#11442, #11492, #11534, #11602]
Enhanced profiling and performance with better type handling and size limits. Fixed issues with tag synchronization and platform instance settings. [#11807, #12060]

Dagster

Added support for skipping Asset ingestion, fixed input/output value formatting, and improved compatibility with latest Dagster versions (v1.9.6). Deprecated Python 3.8 support. [#11262, #11481, #12121, #12189]

dbt

Improved performance and functionality with node_name_patterns for faster CLL processing, support for multiple test paths, and better handling of custom owner types. [#11450, #11460, #11848]
Enhanced lineage handling by preventing cycles in SQL parsing and supporting multiple dataset assertions for tests. Added support for dbt Cloud's Explore page. [#11666, #11451, #12223]

Snowflake

Expanded support for various table types, including secure, dynamic, and hybrid tables. Enhanced lineage capabilities for renames, swaps, and external tables. [#11600, #12039, #12094, #12179]
Improved authentication with OAuth support and token management. Added incremental property processing and structured property support for tags. [#11888, #12048, #12080, #12285]
Enhanced error handling and logging with better parse failure reporting and dot handling in table names. [#12105, #12110, #12153]

Tableau

Enhanced project management with new path pattern filtering and improved handling of hidden assets. Added support for access roles and group permissions. [#10855, #11157, #11559]
Improved API integration with retry logic for various error codes (502, 504), better authentication handling, and consistent page size application. [#12213, #12216, #12233]
Enhanced reporting and debugging capabilities while maintaining efficient performance and proper permission handling. [#12015, #12024, #12175]
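Retrying on gateway errors such as 502 and 504 can be sketched as below. This is a generic pattern under stated assumptions, not the Tableau connector's actual code: the function name, attempt count, and exponential backoff schedule are all hypothetical, and the request is passed in as a callable so the example runs without a network.

```python
import time
from typing import Callable, Tuple

RETRYABLE_STATUS = {502, 504}  # the status codes named in the notes above


def call_with_retry(
    request: Callable[[], Tuple[int, str]],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call request(); retry with exponential backoff while the status is retryable."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status, body = request()
    attempt = 1
    while status in RETRYABLE_STATUS and attempt < max_attempts:
        sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1.0s, 2.0s, ...
        status, body = request()
        attempt += 1
    return status, body


# Simulated responses: two gateway errors, then success.
responses = iter([(502, "bad gateway"), (504, "gateway timeout"), (200, "ok")])
delays = []  # record the delays instead of actually sleeping
result = call_with_retry(lambda: next(responses), sleep=delays.append)
print(result, delays)
```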

PowerBI

Improved M-query parsing with support for comments, better handling of quotes, and DatabricksMultiCloud native query functionality. [#12177, #11743, #11756]
Enhanced workspace management with cross-workspace dataset linking and app ingestion support. Added timeouts for M-query parsing. [#11560, #11629, #11753]
Improved error reporting and performance optimization with reduced type casting and better organization of responsibilities. [#11763, #12004]

Developer Experience

Entity Management: Introduced entity versioning for Datasets and ML Models, with support for version set linking. Improved timeline functionality with better handling of primary key changes and rename events. Added data transformation logic models to enhance data processing capabilities. [#11819, #11843, #12166, #12198]
Enhanced Configuration Management: Added new customization options through environment variables and Helm charts, including editable dataset names and configurable garbage collection scheduling. The bootstrap process has been optimized to reduce latency during installation. [#11391, #11518]
Development Environment Updates: Added Git support to the ingestion-base image, enabling better source control integration for ingestion workflows. [#11477]
Security Logging Enhancement: Improved security audit trails by adding actor URN tracking for unauthorized access attempts. [#12030]

NEW: Garbage Collection

Comprehensive Metadata Cleanup: Introduced a new ingestion source, DataHubGC, that acts as a garbage collector for dataflows, data jobs, and data process instances, with configurable retention policies and deletion parameters. Added dry run mode for testing cleanup operations. [#11102, #11413]
Performance Optimizations: Cut processing time from 1 hour to 15 minutes by implementing batch processing, optimizing queries, and removing unnecessary operations. Increased default hard delete limit from 10k to 25k entities. [#11809, #12093, #12238]
Reliability Improvements: Enhanced garbage collection stability with additional validation checks, improved error handling, and better process visibility through ingestion stage reporting. Fixed issues with entity deletion logic and reference handling to preserve critical lineage relationships. [#12011, #12013, #12027, #12049, #12124, #12226]
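The combination of batch processing, a hard delete limit, and a dry run mode can be sketched roughly as follows. This is an illustration of the general approach, not the garbage collection source's implementation: `run_cleanup`, `batched`, the batch size, and the example URNs are hypothetical, and only the 25k default limit is taken from the notes above.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def run_cleanup(
    urns: Iterable[str],
    delete_batch: Callable[[List[str]], None],
    batch_size: int = 500,
    limit: int = 25_000,
    dry_run: bool = True,
) -> int:
    """Process up to `limit` URNs in batches; only delete when dry_run is False."""
    processed = 0
    for batch in batched(islice(urns, limit), batch_size):
        if not dry_run:
            delete_batch(batch)
        processed += len(batch)
    return processed


candidates = [f"urn:li:dataProcessInstance:{i}" for i in range(1200)]
calls: List[List[str]] = []
planned = run_cleanup(candidates, calls.append, limit=1000, dry_run=True)
dry_run_calls = len(calls)  # nothing is deleted during a dry run
deleted = run_cleanup(candidates, calls.append, limit=1000, dry_run=False)
print(planned, dry_run_calls, deleted, len(calls))
```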

Thank You to Our Contributors!

First-Time Contributors

@AColocho, @alberttwong, @Alice-608, @Bumyu, @chakru-r, @chriscc2, @dejan2609, @donovan-acryl, @eagle-25, @hwmarkcheng, @k-bartlett, @kanavnarula, @kartikey-visa, @kevinkarchacryl, @kousiknandy, @kris48k, @llance, @margaridafernandes-trip, @mikeburke24, @raudzis, @ronybony1990, @ryota-cloud, @shepherd44, @siong-tcha, @ssidorenko, @tanguyantoine, @th0ger, @udays-visa, @udbhav-hbk, @vejeta

Repeat Contributors

@aviv-julienjehannet, @bda618, @bossenti, @darnaut, @deepgarg-visa, @DSchmidtDev, @dushayntAW, @eboneil, @ethan-cartwright, @feldjay, @githendrik, @haeniya, @Jorricks, @Masterchen09, @mkamalas, @Nbagga14, @nicholas-fwang, @noggi, @pankajmahato-visa, @pinakipb2, @rtekal, @sagar-salvi-apptware, @steffengr

DataHub Maintainers

@acrylJonny, @anshbansal, @asikowitz, @chriscollins3456, @david-leifker, @gabe-lyons, @hsheth2, @jayacryl, @jjoyce0510, @maggiehays, @mayurinehate, @pedro93, @RyanHolstien, @sakethvarma397, @sgomezvillamor, @shirshanka, @sid-acryl, @skrydal, @treff7es, @yoonhyejin...

Contributors

tanguyantoine, githendrik, and 72 other contributors

16 Jan 15:37

pedro93

v0.15.0

3108b53

V0.15.0

DataHub v0.15.0 Release Notes

Please refer to v0.15.0.1 for full release notes.

What's Changed

fix(ingest): override setdefault in file-backed dict by @hsheth2 in #11359
fix(ingest/airflow): simplify env configuration by @hsheth2 in #11371
fix(airflow): added support for jinja template for datahub emitter operator by @dushayntAW in #11300
fix(smoke-test): add wait for sync to smoke-test by @david-leifker in #11405
fix(customSearch): apply query string interpolation to function score by @RyanHolstien in #11406
fix(docs): Fix typo in bigquery permissions error by @gabe-lyons in #11401
build(deps-dev): bump vite from 4.5.3 to 4.5.5 in /datahub-web-react by @dependabot in #11410
feat(ingest/gc): Add dataflow and soft deleted entities cleanup by @treff7es in #11102
feat(ingestion): adds env property in ContainerProperties by @sgomezvillamor in #11214
feat(ingest/gc): Add dry run mode to gc recipe by @treff7es in #11413
feat(kafka-setup): allow override KAFKA_HEAP_OPTS by @david-leifker in #11400
chore(docs): update release notes v0.14.1 by @david-leifker in #11408
Advance search - Added case sensitive flag for wildcard searches by @kanavnarula in #11272
fix(changeGenerator): fixes schema change generator corner cases v2 by @RyanHolstien in #11386
feat(docs-site) datahub homepage v2 by @jayacryl in #11342
feat(structuredProps) Add created and lastModified timestamps to structured prop entity by @chriscollins3456 in #11419
feat(docs-site) tours to open in a modal by @jayacryl in #11420
feat(structuredProps) Add delete structured props endpoint and handle null props by @chriscollins3456 in #11418
test(ingest/mcp_diff): Fallback to overwriting file on more complicated diffs by @asikowitz in #11407
feat(docs-site) fixed home paddings on mobile site by @jayacryl in #11431
feat(ingest): add get_entity_as_mcps method to client by @hsheth2 in #11425
fix(siblings) Combine siblings in embedded search results by @chriscollins3456 in #11421
fix(structuredProps) Fix adding new allowed types in updateStructuredProp endpoint by @chriscollins3456 in #11424
build(deps): bump path-to-regexp from 1.8.0 to 1.9.0 in /datahub-web-react by @dependabot in #11356
chore(bump): bump spring versions by @david-leifker in #11435
chore(bump): pac4j version by @david-leifker in #11436
fix(graphql/getDataset): Do not fetch parent for schema fields by @asikowitz in #11434
build: allow gradle daemon by @hsheth2 in #11437
fix(ingest/dbt): handle null index values by @hsheth2 in #11433
build(deps): bump dompurify from 2.3.8 to 2.5.4 in /datahub-web-react by @dependabot in #11387
build(deps): bump dset from 3.1.3 to 3.1.4 in /datahub-web-react by @dependabot in #11361
fix(ingest/dbt): fix dbt catalog version check by @sid-acryl in #11350
Add STARTS_WITH policy condition to allow for URN-wildcard-based policies by @githendrik in #11441
fix(restoreIndices): fix bug in urn paginated restoreIndices exit code by @david-leifker in #11443
feat(Analytics) Allow dateRangeField to be configurable for timeSeries chart by @mkamalas in #11366
fix(SearchDocumentTransformer): Use correct variable to update ES by @pinakipb2 in #11430
chore(version): bump protobuf version by @david-leifker in #11446
fix(search): restore prefix phrase match on quoted search by @david-leifker in #11444
fix(oidc): apply acr values to redirect url by @RyanHolstien in #11447
refactor(ui/lineage): Replace FetchedEntities Object with Map by @asikowitz in #11440
feat(ingest/sink): report datahub-rest sink mode by @hsheth2 in #11422
docs(ingest): add docs on pydantic compatibility by @hsheth2 in #11423
chore: use unique docker log artifact names by @hsheth2 in #11445
test(graphql): fix searchFlags in searchAcrossLineage by @david-leifker in #11448
Update pr-labeler.yml by @donovan-acryl in #11393
fix(docs): fix layout in documentation after #11380 by @Masterchen09 in #11390
Group Modal Css fix by @kanavnarula in #11403
feat(graph): graph index soft-delete support by @david-leifker in #11453
config(reindex): create reindex timeout configuration by @david-leifker in #11456
fix(ingest): sort by last modified not working in the UI by @sid-acryl in #11343
feat(data-contracts): support custom assertions in the data contracts builder by @jayacryl in #11454
fix(ingest/sqlglot): Make detach_ctes more robust by @asikowitz in #11449
fix(ingest/mode): add connection timeouts to avoid RemoteDisconnected errors by @sagar-salvi-apptware in #11245
build(gradle): Update gradle.properties by @david-leifker in #11458
build(yarn): increase yarn timeout and version bump by @david-leifker in #11461
fix(ingest/dbt): allow custom owner types for dbt meta by @hsheth2 in #11460
doc: fix typo by @anshbansal in #11464
fix(ingestion/looker): skip personal folder independent looks by @sid-acryl in #11415
bump(version): zookeeper by @david-leifker in #11465
fix(ingest): do not cache temporary tables schema resolvers by @mayurinehate in #11432
fix(structuredProps) Allow upserting structured props on schema fields that don't exist by @chriscollins3456 in #11466
fix(docs-website): disable dark mode by @hsheth2 in #11468
fix: fix broken global style by @yoonhyejin in #11470
feat(ingest/dbt): speed up dbt CLL with node_name_patterns by @hsheth2 in #11450
feat(ingest/dbt): produce multiple assertions for multi-table dbt tests by @hsheth2 in #11451
feat(ingest): add git to ingestion-base image by @hsheth2 in #11477
fix(ingest): include platform instance in looker usage urns by @hsheth2 in #11469
fix(ingest/openapi): update recipe for DataHub OpenAPI with url_complement and bearer token by @sagar-salvi-apptware in #10980
docs(aws): Update AWS docs to keep consistency with Docker docs by @AColocho in #11284
feat: add second navbar by @yoonhyejin in #11471
feat: CTA to live demos in cloud section and a few more case studies on home by @jayacryl in #11488
fix(ingest/bq): do not query PARTITIONS for biglake tables by @mayurinehate in #11463
config(rest-api): enable authentication and api authorization by default by @david-leifker in #11484
feat(ingest/databricks): add usage perf report by @mayurinehate in #11480
feat(ingestion/tableau): introduce project_path_pattern by @haeniya in #10855
docs(ingest/dbt): update run result paths examples...

Contributors

tanguyantoine, githendrik, and 72 other contributors


17 Sep 21:48

david-leifker

v0.14.1

6a165a8

v0.14.1

DataHub v0.14.1 Release Notes

User Experience

Enhanced Data Propagation UI: New features allow viewing propagated column documentation, source information, and asset-level propagation details. This improves visibility into data lineage and enables better understanding of data flow across the organization. (#11047)
Improved Search Result Tracking: Added page number to search result click events, enabling better measurement of search ranking performance. This helps users understand and optimize their search experience. (#11151)
Fixed Display Issues: Resolved issues with displaying "0" values for last ingested data and improved handling of multilingual characters in descriptions. These fixes ensure more accurate and readable information presentation. (#10840, #10975)

Developer Experience

Performance Improvements:
- Implemented lazy dataLoaders for GraphQL queries, significantly reducing latency for local environments. (#11293)
- Added option to log slow GraphQL queries, helping identify and address performance bottlenecks. (#11308)
- Introduced session authorization caching for faster access checks. (#11327)
Enhanced Search Capabilities:
- Added support for custom highlighting fields in GraphQL queries, allowing faster and more customizable data retrieval. (#11339)
- Implemented new search query functionality to filter by parents/children of Domains or Containers. (#11279)
- Added support for multiple values in 'CONTAIN', 'START_WITH', and 'END_WITH' operators, enabling more flexible and precise searches. (#11068)
API Improvements:
- Extended throttling to API requests, supporting non-browser ingestion/write requests and manual throttling for better control over system load. (#11325)
- Added support for 'START_WITH' and 'END_WITH' operators in GraphQL API, enhancing string query capabilities. (#11026)
Bug Fixes:
- Resolved issues with forward slash handling in search queries, empty key-value pairs in Elasticsearch mapping, and support for various data types in object fields. These fixes improve search accuracy and data representation. (#10932, #11004, #11066)
- Addressed Postgres regression by upgrading the ebean library from version 12.x to 15.x, resolving a read lock NPE issue. (#11379)
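
The multi-value string operators listed under Enhanced Search Capabilities can be pictured with a small, self-contained sketch. This is not DataHub's implementation: the operator names come from the notes above, while the `matches` helper and the rule that a condition passes when any one value matches are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable

# Hypothetical evaluator. Operator names follow the release notes; the
# "any value matches" rule is an assumption, not confirmed DataHub behavior.
OPERATORS: Dict[str, Callable[[str, str], bool]] = {
    "CONTAIN": lambda field, value: value in field,
    "START_WITH": lambda field, value: field.startswith(value),
    "END_WITH": lambda field, value: field.endswith(value),
}


def matches(field: str, condition: str, values: Iterable[str]) -> bool:
    """Return True if the field satisfies the operator for at least one value."""
    check = OPERATORS[condition]
    return any(check(field, value) for value in values)


# Example dataset names, invented for this sketch.
names = ["sales_daily", "sales_hourly", "finance_daily", "tmp_sales"]
daily_or_hourly = [n for n in names if matches(n, "END_WITH", ["_daily", "_hourly"])]
print(daily_or_hourly)
```

With a single-value operator the same filter would need two separate conditions combined by the caller; accepting a list keeps that in one condition.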

Metadata Ingestion

S3 Integration Enhancements:
- Enhanced partition support for S3 dataset ingestion, improving metadata representation and enabling advanced partition detection. (#11083)
- Enhanced S3 ingestion process to support reading specific file types, allowing more granular control over data ingestion. (#11177)
BigQuery Improvements:
- Implemented query log extractor for BigQuery, creating "Query" entities with usage statistics, lineage, and operation details. (#10994)
- Added support for filtering GCP project ingestion based on project labels, enabling more targeted data collection. (#11169)
- Implemented query job retries for transient errors, improving system robustness. (#11162)
Snowflake Updates:
- Added support for Iceberg tables in Snowflake access history, enhancing lineage capture capabilities. (#10961)
- Introduced ability to define clustering key formulas for Snowflake datasets. (#11254)
- Fixed tag exclusion issues in Snowflake ingestion process. (#11250)
New and Updated Connectors:
- Added ingestion source for SAP Analytics Cloud, expanding DataHub's integration capabilities. (#10958)
- Enhanced Salesforce connector with customizable API version and improved error messages. (#11145, #11266)
- Updated Tableau ingestion process with new parameters and improved field type parsing. (#11255, #11202)
Other Ingestion Improvements:
- Added support for MongoDB database ingestion as containers. (#11178)
- Implemented automatic capturing of Snowflake assets with Pandas I/O Manager in Dagster module. (#11189)
- Enhanced Fivetran ingestion with destination ID filtering capabilities. (#11277)
- Added support for browse-only tables in Databricks ingestion. (#10766)

Other Improvements and Fixes

Upgraded various dependencies including Kafka, Azure Identity, Acryl-SQLglot, and GraphQL/Spring versions.
Improved error handling and logging across multiple components.
Enhanced test coverage and reliability.
Updated documentation for various features and processes.

Breaking Changes

Notable breaking changes include:

Removal of lower method from get_db_name in SQLAlchemySource, affecting URNs of related entities.
Changes to default sink mode and aspect handling that require server version 0.14.0+.

See the full details here.
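
To see why dropping the lowercase step can change URNs, here is a minimal sketch. The `make_dataset_urn` helper below is a local stand-in written for this example, and the database and table names are hypothetical; whether a given deployment is affected depends on its source configuration, so treat this as an illustration only.

```python
def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Local helper (not imported from DataHub) that formats a dataset URN."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


db_name = "Analytics"      # hypothetical database name with mixed case
table = "public.orders"    # hypothetical schema.table

# Before: the database name was lowercased before being used in the URN.
urn_before = make_dataset_urn("postgres", f"{db_name.lower()}.{table}")
# After: the database name is used as reported by the source.
urn_after = make_dataset_urn("postgres", f"{db_name}.{table}")

print(urn_before)
print(urn_after)
print("same entity:", urn_before == urn_after)
```

Because a URN is an entity's identity, two different strings are treated as two different entities, which is why the change is listed as breaking.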

Contributors

We extend our heartfelt thanks to all contributors for their valuable work on this release:

First-Time Contributors

@AaronYang0628, @alexandrebunn, @alisa-aylward-toast, @arpanchakra29, @esselius, @eunseokyang, @ignitz, @milindgupta, @milindgupta9, @Nbagga14, @rohansun, @sakethvarma397, @vignesh-hbk

Repeat Contributors

@deepgarg-visa, @dushayntAW, @feldjay, @filipe-caetano-ovo, @ksrinath, @Masterchen09, @matthew-coudert-cko, @mayurinehate, @nmbryant, @pinakipb2, @prashanthic23, @sagar-salvi-apptware, @siladitya2, @sleeperdeep

DataHub Maintainers

@anshbansal, @asikowitz, @chriscollins3456, @darnaut, @david-leifker, @eboneil, @hsheth2, @jjoyce0510, @maggiehays, @pedro93, @RyanHolstien, @shirshanka, @sid-acryl, @skrydal, @treff7es, @yoonhyejin

Your contributions are invaluable in making DataHub better for everyone. Thank you!

What's Changed

test(smoke-test): updates to smoke-tests by @david-leifker in #11152
feat(dbt): support prefer_sql_parser_lineage with sources enabled by @hsheth2 in #11168
feat(actions): updates to gha workflows by @david-leifker in #11150
build: fix docker warnings by @anshbansal in #11163
feat(hooks): Make hook enable flag non-default by @pedro93 in #11159
fix(ci): smoke-test changes do not need to build images by @david-leifker in #11174
fix(ci): fix single tag comma split by @david-leifker in #11179
lint(restore-indices): clean-up restore indices class by @david-leifker in #11176
fix(ci): typo by @david-leifker in #11180
fix(ci): additional ci and smoke-test updates by @david-leifker in #11183
test(smoke-test): minor update to openapi test by @david-leifker in #11184
feat(ingest): use pre-built dockerize binary by @hsheth2 in #11181
doc: mark deprecated feature by @anshbansal in #11175
fix(delete) Fix removing completed/verified forms references by @chriscollins3456 in #11172
feat(docs): update docs for new release by @RyanHolstien in #11164
fix(ingest): invalid urn should not fail full batch of changes by @RyanHolstien in #11187
fix(kafka-setup): add missing script to image by @david-leifker in #11190
fix(config): fix hash algo config by @david-leifker in #11191
feat(ingest): allow custom SF API version by @skrydal in #11145
fix(ingestion/transformer): extend dataset_to_data_product_urns_pattern to support containers by @sagar-salvi-apptware in #11124
fix(ui) Fix bug with editing entity names by @chriscollins3456 in #11186
ci(smoke-test): allow smoke-test only PRs by @david-leifker in #11194
feat(ingestion/lookml): support looker -- if comments by @sid-acryl in #11113
fix(elasticsearch): refactor idHashAlgo setting by @david-leifker in #11193
fix(ingestion/airflow-plugin): fixed missing inlet/outlets by @dushayntAW in #11101
docs(readme): add security notes by @david-leifker in #11196
docs: Update README.md by @prashanthic23 in #11144
feat(ingest/dbt): skip CLL on sources with skip_sources_in_lineage by @hsheth2 in #11195
fix(graphql): Correct ownership check when removing owners by @pedro93 in #11154
feat(propagation): UI for rendering propagated column documentation by @jjoyce0510 in #11047
fix(ui): checks truthy value for last ingested by @pinakipb2 in #10840
docs(scim): document okta integration with datahub for scim provisioning by @ksrinath in #11120
fix(ingestion/tableau): Tableau field type parsing by @skrydal in #11202
feat(analytics): Add page numb...

Contributors

shirshanka, esselius, and 42 other contributors


21 Aug 15:29

RyanHolstien

v0.14.0.2

98ad824

v0.14.0.2

DataHub v0.14.0.2 Release Notes

User Experience

Renamed: Validation --> Quality: The Validation tab has been renamed to Quality to make it more intuitive to end-users that it contains outcomes from data quality checks. [#10935]
Data Contract UI: A new Data Contract UI is now available under the Quality Tab, allowing users to handle various data assertion types and add/remove contracts more easily. [#10625]
Updates to Customized Search Ranking: By default, explore (*) query results are ranked based on enrichment (tags, terms, owners, description, domains, row/column counts) as well as incident status. [#10774]
Custom Dataset Names: Business users can now maintain an editable dataset name separate from default properties, providing more control over dataset identification. [#10608]
Documentation Propagation Setting Page: A new settings page has been added to the UI for managing Documentation Propagation, giving users more control over how documentation is shared across the platform. [#11038]

Developer Experience

NEW: DataHub Open Assertions Specification:
- Announcing a universal assertions specification for declaring Data Quality checks and compiling them into artifacts for use by 3rd party Data Quality tools like Great Expectations, dbt tests, and Snowflake via Data Quality DMFs. [#10609]
- Added ability to define data quality rules using a YAML specification file, enabling users to set assertions like volume metrics and conditions, with the ability to compile and schedule them to run on Snowflake as the assertion backend. [#10602]
API and SDK Enhancements:
- New GraphQL APIs added for managing forms, structured properties, and data contracts. [#10826, #10825, #10632]
- Updates to Java and Python SDKs to support creating and updating structured properties on assets. [#10823, #10824]
- Support for conditional write semantics including If-Modified-Since, If-Unmodified-Since, and If-Version-Match in MetadataChangeProposals (MCP) and OpenAPI. [#10868]
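
The conditional write headers named above follow the familiar optimistic-concurrency pattern: a write goes through only if the stored aspect is still in the state the writer expects. The sketch below is a toy in-memory model of that idea, not DataHub's code. The `StoredAspect` class, the `conditional_write` function, the ISO-8601 timestamp format, and the exact comparison rules are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict


@dataclass
class StoredAspect:
    """Hypothetical stored aspect with a version and last-modified time."""
    value: dict
    version: int
    modified_at: datetime


class PreconditionFailed(Exception):
    """Raised when a conditional header does not match the stored state."""


def conditional_write(
    current: StoredAspect, new_value: dict, headers: Dict[str, str], now: datetime
) -> StoredAspect:
    # Header names come from the release notes; value formats are assumed.
    if "If-Version-Match" in headers:
        if int(headers["If-Version-Match"]) != current.version:
            raise PreconditionFailed("version changed since it was read")
    if "If-Unmodified-Since" in headers:
        limit = datetime.fromisoformat(headers["If-Unmodified-Since"])
        if current.modified_at > limit:
            raise PreconditionFailed("aspect was modified after the given time")
    if "If-Modified-Since" in headers:
        limit = datetime.fromisoformat(headers["If-Modified-Since"])
        if current.modified_at <= limit:
            raise PreconditionFailed("aspect was not modified after the given time")
    return StoredAspect(new_value, current.version + 1, now)


t0 = datetime(2024, 8, 1, tzinfo=timezone.utc)
t1 = datetime(2024, 8, 2, tzinfo=timezone.utc)
stored = StoredAspect({"description": "old"}, version=3, modified_at=t0)

# First writer read version 3 and succeeds.
updated = conditional_write(stored, {"description": "new"}, {"If-Version-Match": "3"}, now=t1)
print(updated.version)  # 4

# Second writer still holds version 3 and is rejected instead of overwriting.
try:
    conditional_write(updated, {"description": "stale"}, {"If-Version-Match": "3"}, now=t1)
    rejected = False
except PreconditionFailed as err:
    rejected = True
    print("rejected:", err)
```

The point of the pattern is that the second writer gets an explicit failure and can re-read, rather than silently clobbering the first writer's change.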
CLI Improvements:
- A new check server-config command has been added to test server credentials and retrieve diagnostic information. [#10990]
- The get command now includes a --details/--no-details flag for more detailed output, facilitating easier issue debugging. [#10815]
- Update to CLI to optionally display server configuration settings. [#10676]
- Added the ability to assign actors (users or groups) to forms through the forms YAML API in the CLI. [#10683]
Improved Logging and Monitoring:
- Unified request logging implemented across GraphQL, OpenAPI, and Restli requests, including additional information like actor, IP address, and API type. [#10802]
Performance Optimizations:
- Implemented throttling for the mce-consumer based on mae-consumer lag. [#10626]
- Added an ASYNC_BATCH mode to the rest sink for improved performance. [#10733]
- Improved the performance of read queries in Neo4j by specifying labels, and combined the multiple Neo4j statements within the addEdge function into a single statement. [#10593, #10598]
Security Enhancements:
- Updated encryption and decryption methods with a stronger cryptographic algorithm. [#11059]
- Optimized regular expressions to prevent potential ReDoS vulnerabilities. [#10315]
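
For readers unfamiliar with ReDoS, the general shape of the problem is sketched below. The patterns are textbook examples chosen for this illustration and are not the expressions changed in DataHub. A nested quantifier can force a backtracking regex engine to try a very large number of ways to split the input before giving up, while an equivalent pattern with a single quantifier accepts exactly the same strings without that risk.

```python
import re

# Nested quantifier: on a long non-matching input, a backtracking engine may
# explore a huge number of ways to divide the run of "a" characters.
RISKY = re.compile(r"^(a+)+$")

# Same set of accepted strings, written with a single quantifier.
SAFE = re.compile(r"^a+$")

# Short inputs only, so the risky pattern stays cheap to evaluate here.
samples = ["a", "aaaa", "", "aab", "b"]
agree = all(bool(RISKY.match(s)) == bool(SAFE.match(s)) for s in samples)
print("patterns agree on samples:", agree)

# The safe pattern can be applied to a long non-matching input directly.
print(SAFE.match("a" * 50 + "b"))
```

Rewrites of this kind keep the accepted inputs the same while removing the ambiguity that makes worst-case matching time explode.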

Metadata Ingestion

New Ingestion Sources:
- Azure Blob Storage: Added as a new ingestion source with support for Path Specs. [#10813]
- Grafana: New connector to ingest dashboards, providing documentation within DataHub for DevOps members on call. [#10891]
- IBM DB2: Added support for this platform. [#10601]
Snowflake Improvements:
- Enhanced view lineage parsing without query-based lineage/usage. [#10905]
- Added support for more than 10k views in a Snowflake database. [#10718]
- Implemented parallel schema extraction for improved performance. [#10653]
- Added snowflake-queries source for lineage, usage, queries, and operational metadata to improve performance and configurability. [#10835]
BigQuery Enhancements:
- Refactored and parallelized dataset metadata extraction for better performance. [#10884]
- Added support for new data types including BIGNUMERIC, NUMERIC, DECIMAL, BIGDECIMAL, FLOAT64, and RANGE. [#10950]
- Added support for ingesting View labels during ingestion. [#10648]
Looker Updates:
- Ingested explore tags into DataHub. [#10547]
- Fixed issues related to CLL generation when the view definition language is SQL. [#10542]
- Added support for including platform instance details in URNs for dashboards and charts. [#10771]
Other Improvements:
- dbt: Enhanced flexibility in lineage generation with the new experimental prefer_sql_parser_lineage flag. [#11039]
- Airflow: Task ownership info can now be set as a group rather than an individual user. [#10742]
- Athena: Enhanced profiling capabilities to support column quantiles and medians. [#10723]
- Fivetran: Improved connector performance for faster ingestion. [#10556]
- SageMaker: Added stateful ingestion capability to remove deleted assets during ingestion runs. [#10573]
- Tableau: Support added for ingesting multiple Tableau sites in a single configuration, with sites appearing as containers in DataHub. [#10498]
- Added support for ingesting schemas from schema registry in the Kafka module. [#10612]
- Introduced a TagsToTermMapper transformer for mapping specific tags to glossary terms. [#10758]
- Enhanced the SQL lineage parser with an optional default_dialect parameter for customized dialect selection. [#10830]

Other Improvements and Fixes

Fixed high-severity vulnerabilities related to sensitive information logging. [#11088]
Optimized regular expressions to prevent potential ReDoS vulnerabilities. [#10315]
Improved error handling and logging across various modules.
Enhanced test coverage for new features and existing functionality.

Breaking Changes

Protobuf CLI will no longer create binary encoded protoc custom properties by default.
Changes to Data flow info and data job info aspects may require a server upgrade.
OpenAPI V3 - Creation of aspects now requires wrapping within a value key.
Profiling configuration for Glue source has been updated.

For full details on breaking changes, please refer to the updating guide.
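
As a rough picture of the OpenAPI V3 change, the sketch below wraps each aspect body under a `value` key before building a request payload. The `wrap_aspects` helper, the example URN, and the overall payload layout are assumptions for illustration; the updating guide remains the authority on the exact request shape.

```python
import json


def wrap_aspects(urn: str, aspects: dict) -> dict:
    """Hypothetical helper: nest every aspect body under a "value" key."""
    entity = {"urn": urn}
    for aspect_name, body in aspects.items():
        entity[aspect_name] = {"value": body}
    return entity


# Example URN and aspect content, invented for this sketch.
urn = "urn:li:dataset:(urn:li:dataPlatform:postgres,analytics.public.orders,PROD)"
payload = [wrap_aspects(urn, {"datasetProperties": {"description": "Orders table"}})]
print(json.dumps(payload, indent=2))
```

Clients that previously sent the aspect body directly under the aspect name would need an equivalent wrapping step when upgrading.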

Contributors

Massive shoutout to all of the contributors who made this release possible:

First-Time Contributors

@aabharti-visa, @acrylJonny, @amit-apptware, @AndreasHegerNuritas, @aviv-julienjehannet, @brbrown25, @chardaway, @dragontail, @ipolding-cais, @joelmataKPN, @john-claro-cko, @jordanjeremy, @lima-renan, @nadavgross, @nephtyws, @obaltian, @PeamThom, @pie1nthesky, @pulsar256, @samblackk, @shtephlee, @simaov, @steffengr, @tkdrahn, @TristanHeisler, @wornjs, @xkollar

Repeat Contributors

@ajoymajumdar, @bossenti, @cburroughs, @cccs-eric, @deepgarg-visa, @dushayntAW, @fjmacagno, @githendrik, @haeniya, @jayasimhankv, @k7ragav, @kevin1chun, @ksrinath, @Kunal-kankriya, @looppi, @Masterchen09, @mayurinehate, @ngamanda, @nmbryant, @noggi, @pankajmahato-visa, @PatrickfBraz, @pinakipb2, @Rajasekhar-Vuppala, @rtekal, @sagar-salvi-apptware, @shubhamjagtap639, @siladitya2, @ssilb4, @Sukeerthi31, @sumitappt, @TonyOuyangGit, @walter9388

DataHub Maintainers

@anshbansal, @asikowitz, @chriscollins3456, @darnaut, @david-leifker, @eboneil, @ethan-cartwright, @gabe-lyons, @hsheth2, @jayacryl, @jjoyce0510, @maggiehays, @pedro93, @RyanHolstien, @shirshanka, @sid-acryl, @skrydal, @treff7es, @yoonhyejin

What's Changed

fix(ingest/unity-catalog) upstream lineage for hive_metastore external table with s3 location by @dushayntAW in #10546
feat(ingestion/looker): ingest explore tags into the DataHub by @sid-acryl in #10547
fix(instropection): fix configuration application order by @david-leifker in #10579
fix(ingest/slack): pull real names by @hsheth2 in #10565
fix(ingest): Remove env deprecation message by @treff7es in #10581
test(ingest/sql): refactor CLL generator + add tests by @hsheth2 in #10580
docs(remote-ingestion): update description and deployment instructions by @darnaut in #10574
fix(ingest): DataProcessInstance.emit_process_end() ignored start_timestamp_millis by @obaltian in #10539
fix(ingest/metabase): Fix for query template expressions and invalid URNs for Text Cards by @pulsar256 in #10381
feat(graphql): Support tagging incidents and assertions via GraphQL API by @jjoyce0510 in #10575
docs(update): updating-datahub by @david-leifker in #10585
docs: reorder semantics guide to the bottom by @yoonhyejin in #10541
feat(auth): add viewTests platform privilege by @ksrinath in https://github.com...