Releases: datahub-project/datahub
v0.10.4
Release Highlights
User Experience
- You can now create and assign Custom Ownership types within DataHub; plus, we now display the owner type on an Entity Page
- Various bug fixes to Column Level Lineage visualization
Metadata ingestion
- You can now define column-level lineage (aka fine-grained lineage) via our file-based lineage source
- Looker: Ingest Looks that are not part of a Dashboard
- Glue: Error reporting now includes lineage failures
- BigQuery: Now supports deduplicating LogEntries based on insertId, timestamp, and logName
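As an illustration of the BigQuery deduplication above, here is a minimal sketch. It assumes log entries are available as plain dictionaries carrying `insertId`, `timestamp`, and `logName` keys. The function name `dedupe_log_entries` and the sample data are hypothetical; this is not the connector's actual implementation.

```python
from typing import Any, Dict, Iterable, Iterator, Set, Tuple


def dedupe_log_entries(entries: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield each entry once, keyed on (insertId, timestamp, logName)."""
    seen: Set[Tuple[Any, Any, Any]] = set()
    for entry in entries:
        key = (entry.get("insertId"), entry.get("timestamp"), entry.get("logName"))
        if key in seen:
            continue  # same entry delivered more than once, skip it
        seen.add(key)
        yield entry


# Hypothetical sample data: the first two entries are duplicates of each other.
sample = [
    {"insertId": "a1", "timestamp": "2023-05-01T00:00:00Z", "logName": "log-x"},
    {"insertId": "a1", "timestamp": "2023-05-01T00:00:00Z", "logName": "log-x"},
    {"insertId": "a1", "timestamp": "2023-05-01T00:00:05Z", "logName": "log-x"},
]
unique = list(dedupe_log_entries(sample))
print(len(unique))  # 2
```

Keying on all three fields together means entries that share an `insertId` but differ in timestamp or log name are kept as distinct.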
Docs
- CSV Enricher: improvements to sample CSV and recipe
- Guide for changing default DataHub credentials
- Updated guide to apply time-based filters on Lineage
What's Changed
- ci(ingest/kafka): improve kafka integration test reliability by @hsheth2 in #8085
- fix(ingest/bigquery): Deduplicate LogEntries based on insertId, timestamp, logName by @asikowitz in #8132
- feat(ingest/glue): report glue job lineage failures, update doc by @mayurinehate in #8126
- feat(lineage source): add fine grained lineage support by @anshbansal in #7904
- docs(glue): fix broken link by @mayurinehate in #8135
- feat(custom ownership): Adds Custom ownership types as a top level entity by @pedro93 in #8045
- Update updating-datahub.md for v0.10.3 release by @iprentic in #8139
- feat: add dbt-athena adapter support for column types mapping by @svdimchenko in #8116
- docs(csv-enricher): add example csv file & recipe by @gabe-lyons in #8141
- chore(ci): update base requirements file by @anshbansal in #8144
- fix(ingest/s3): Path spec aware folder traversal by @treff7es in #8095
- fix(ui) Fix selecting columns in Lineage tab for CLL by @chriscollins3456 in #8129
- feat(search): adding support for _entityType filter in the application layer + frontend by @gabe-lyons in #8102
- docs(ingest/nifi): fix broken links by @mayurinehate in #8143
- fix(scroll): fix scroll cache key for hazelcast by @RyanHolstien in #8149
- chore(json): fix json vulnerability by @RyanHolstien in #8150
- fix(ingest/json-schema): handle property inheritance in unions by @hsheth2 in #8121
- chore(log): fix log as error instead of info by @anshbansal in #8146
- fix(lineagecounts) Include entities that are filtered out due to sibling logic in the filtered count of lineage counts by @iprentic in #8152
- fix(stats): display consistent query count on stats tab by @joshuaeilers in #8151
- fix(ingest): remove original_table_name logic in sql source by @hsheth2 in #8130
- feat(ingest): add more fail-safes to stateful ingestion by @hsheth2 in #8111
- feat(ingest/snowflake): support for more operation types by @mayurinehate in #8158
- fix(ui) Show Entities first on Domain pages again by @chriscollins3456 in #8159
- fix(ingest/nifi): allow nifi site url with context path by @mayurinehate in #8156
- feat(ingest): Create Browse Paths V2 under flag by @asikowitz in #8120
- fix(ingestion/looker): set project-name for imported_projects views by @mohdsiddique in #8086
- fix(docs): Fix ownership type typos by @pedro93 in #8155
- docs(townhall) feb and march town hall agenda and recording by @maggiehays in #7676
- feat(ingest/unity): Add qualified name to dataset properties by @asikowitz in #8164
- feat(ingest/bigquery_v2): enable platform instance using project id by @Khurzak in #8142
- feat(ingest/snowflake): Deprecate legacy lineage and optimize query history joins by @asikowitz in #8176
- fix(ingest/kafka): Fixing error printing in Kafka properties get call by @treff7es in #8145
- fix(ingest/snowflake): set use_quoted_name to profile lowercase tables by @mayurinehate in #8168
- feat(classification): support for regex based custom infotypes by @mayurinehate in #8177
- fix(restli): update base client retry logic by @david-leifker in #8172
- fix(ingest): Fix modeldocgen; bump feast to relax pyarrow constraint by @asikowitz in #8178
- refactor(ci): move from sleep to kafka lag based testing by @shirshanka in #8094
- docs(lineage): document timestamp filtering in lineage feature by @iprentic in #8174
- build(ingest/feast): Pin feast to minor version by @asikowitz in #8180
- feat(ingest/snowflake): Okta OAuth support; update docs by @asikowitz in #8157
- feat(ingest/presto-on-hive): add support for extra properties and merge property capabilities by @treff7es in #8147
- docs(managed datahub): release notes for v0.2.8 by @anshbansal in #8185
- fix(nocode): fix DeleteLegacyGraphRelationshipsStep for Elasticsearch by @david-leifker in #8181
- feat(docker):Add the jattach tool to the docker container(#7538) by @yangjiandan in #8040
- refactor: Return original exception as caused by by @Jorricks in #7722
- docs(ingest) Add MetadataChangeProposalWrapper import to example code by @iprentic in #8175
- fix(ingest/kafka): Better error handling around topic and topic description extraction by @asikowitz in #8183
- fix(vulnerabilities)/vulnerabilities_fixes_datahub (#8075) by @david-leifker in #8189
- fix: add dedicated guide on changing default credentials by @yoonhyejin in #8153
- feat(classification): configurable minimum values threshold by @mayurinehate in #8186
- fix(ingestion/looker): ingest looks not part of dashboard by @mohdsiddique in #8140
- fix(ingest/profiling): only apply monkeypatches once when profiling by @hsheth2 in #8160
- docs(tableau): site config is required for tableau cloud / tableau online by @mohdsiddique in #8041
- fix(ingest/bigquery): Swap log order to avoid confusion by @asikowitz in #8197
- fix(ingest/redshift): Adding env parameter where it was missing for urn generation by @treff7es in #8199
- revert(ingest/bigquery): Do not emit DataPlatformInstance; remove references to platform_instance by @asikowitz in #8196
- docs(managed datahub): add docs link to v0.2.8 by @anshbansal in #8202
- Add combined health check endpoint which can check multiple components by @iprentic in #8191
- chore(cp-schema-registry): bump minor version by @david-leifker in #8192
- feat(ingest): Produce browse paths v2 on demand and with platform instance by @asikowitz in #8173
New Contributors
- @svdimchenko made their first contribution in #8116
- @Khurzak made their first contribution in #8142
- @Jorricks made their first contribution in #7722
Full Changelog: v0.10.3...v0.10.4
v0.10.3
Release Highlights
User Experience
- Define Data Products via YAML and manage associated entities within a Domain
- Search experience: quickly apply a filter at time of search
- Form-based PowerBI ingestion
Developer Experience
- Progress toward Removing Confluent Schema Registry requirement -- Helm & Quickstart simplifications to follow
- NOTE: this will only work for new deployments of DataHub; if you have already deployed DataHub with Confluent Schema Registry, you will not be able to disable it
- Delete CLI - correctly handles deleting timeseries aspects
- Ongoing improvements to Quickstart stability
- Support entity types filter in get_urns_by_filter
- Search customization
  - regex based query matching
  - full control over scoring functions (usable on any document field, e.g. tags, deprecated flags, etc.)
  - enable/disable fuzzy, prefix, exact match queries
Ingestion
- BigQuery - Improve ingestion disk usage & speed; extract dataset usage from Views
- Unity Catalog - Capture create/last modified timestamps; extract usage; data profiling support
- PowerBI - Update workspace concept mapping; support modified_since, extract_dataset_schema, and more
- Superset – support stateful ingestion
- Business Glossary – Simplify ingestion source
- Kafka – Add description in dataset properties
- S3 – Support stateful ingestion & last_updated
- CSV Enricher – Support updating more types
- PII Classification - Configurable sample size
- Nifi - Support Kerberos authentication
What's Changed
- fix(ingest/bigquery): Add to lineage, not overwrite, when using sql parser by @asikowitz in #7814
- fix(ingest/bigquery): Enable lineage and usage ingestion without tables by @asikowitz in #7820
- fix(ingest/bigquery): Do not query columns when not ingesting tables or views by @asikowitz in #7823
- fix(ingest/bigquery): update usage query, remove erroneous init by @mayurinehate in #7811
- fix(ingest/bigquery): Handle null values from usage aggregation by @asikowitz in #7827
- perf(ingest/bigquery): Improve bigquery usage disk usage and speed by @asikowitz in #7825
- fix(cli): use correct ingestion image in script by @hsheth2 in #7826
- fix(release): prevent republish of images on release edits by @RyanHolstien in #7828
- feat(): finish populating the entity registry by @hsheth2 in #7818
- fix(ui) Fix 404 page routing bug by @chriscollins3456 in #7824
- feat(ui): Support PowerBI Ingestion via UI form by @jjoyce0510 in #7817
- fix(ingest/snowflake): fix column name in snowflake optimised lineage by @mayurinehate in #7834
- feat(ingest/unity): capture create/lastModified timestamps by @hsheth2 in #7819
- fix(test): fix spark lineage test by @david-leifker in #7829
- docs(): add markprompt help chat by @jeffmerrick in #7837
- Update DataJobInputOutput.pdl to express that CLL fields are not shown in the UI right now by @gabe-lyons in #7830
- feat(cli): improve quickstart stability by @hsheth2 in #7839
- chore(ci): regular upgrade base requirements.txt by @anshbansal in #7821
- feat(timeseries): Support sorting timeseries aspects by non-timestampMillis field + fix operations resolver by @jjoyce0510 in #7840
- doc(ingestion/tableau): Fix rendering ingestion quickstart guide by @mohdsiddique in #7808
- fix(ingest): pin sqlparse version by @hsheth2 in #7847
- feat(urn): Add a validator when creating an URN that it is no longer than the li… by @iprentic in #7836
- chore(ingest): bug fix in sqlparse pin by @hsheth2 in #7848
- feat: enriching guide on creating dataset by @yoonhyejin in #7777
- feat(docs): consolidate api guides by @yoonhyejin in #7857
- fix(ingest/salesforce): use report timestamp for operations by @hsheth2 in #7838
- chore(ci): fix CI failing due to lint by @anshbansal in #7863
- fix(mcl): fix improper pass by reference by @RyanHolstien in #7860
- feat(urn) Add validator to reject URNs which contain the character we plan to u… by @iprentic in #7859
- feat(elasticsearch): Add servlet which provides an endpoint for a healthcheck on the ES cl… by @iprentic in #7799
- fix(ui) Add UI fixes and design tweaks to AutoComplete by @chriscollins3456 in #7845
- fix(ui) Get all entity assertions in chrome extension by @chriscollins3456 in #7849
- refactor(platform): Refactoring ES Utils, adding EXISTS condition support to Filter Criterion by @jjoyce0510 in #7832
- chore(ui): change background color to transparent for avatar with photoUrl by @hieunt-itfoss in #7527
- refactor(ingest): Add helper DataHubGraph methods by @asikowitz in #7851
- fix(ui) Disable cache on Domain and Glossary Related Entities pages by @chriscollins3456 in #7867
- fix(cache): Fix cache key serialization in search service by @pedro93 in #7858
- docs(ingest): update dbt and aws docs by @hsheth2 in #7870
- docs(ingest): fix CorpGroup example by @hsheth2 in #7816
- docs(ingest/powerbi): update workspace concept mapping by @eeepmb in #7835
- feat(ingest/powerbi): support modified_since, extract_dataset_schema and many more by @aezomz in #7519
- Remove usages of commons-text library lower than 1.10.0 by @iprentic in #7850
- feat(glue): allow resource links to be ignored by @YusufMahtab in #7639
- feat(ingestion): lookml refinement support by @mohdsiddique in #7781
- feat(ingest/unity): Ingest ownership for containers; lookup service principal display names by @asikowitz in #7869
- Logging and test models fixes by @david-leifker in #7884
- feat(model) Add ContainerPath aspect model by @chriscollins3456 in #7774
- bug(7882): run kafka-configs.sh on DataHubUpgradeHistory_v1 to make sure the retention.ms is set to infinite by @jinlintt in #7883
- fix: refactor toc by @yoonhyejin in #7862
- feat(cli): Modifies ingest-sample-data command to use DataHub url & token based on config by @pedro93 in #7896
- feat(ingest/snowflake): optionally emit all upstreams irrespective of recipe pattern by @mayurinehate in #7842
- fix(ingestion/tableau): backward compatibility with version 2021.1 an… by @mayurinehate in #7864
- fix(ingest/dbt): ensure dbt shows view properties by @hsheth2 in #7872
- docs(airflow): add debug guide on url generation by @hsheth2 in #7885
- feat(sdk): support entity types filter in get_urns_by_filter by @hsheth2 in #7902
- fix(ingest/snowflake): fix optimised lineage query, filter temporary … by @mayurinehate in #7894
- fix(ingest/bigquery): fix handling of time decorator offset queries by @mayurinehate in #7843
- fix(ingest): fix minor bug + protective dep requirements by @hsheth2 in #7861
- fix(cli): remove duplicate labels from quickstart files by @hsheth2 in #7886
- Revert "feat(cli): Modifies ingest-sample-data command to use DataHub… by @pedro93 in #7899
- feat(sdk): add DataHubGraph.get_entity_semityped method by @hsheth2 in #7905
- test(ingest/biz-glossary): add test for enable_auto_id by @hsheth2 in #7911
- feat(ingest): add GCS ingestion source by @mayurinehate in #7903
- [bugfix] Fix remote file ingestion...
DataHub v0.10.2
Known Issues
- Postgresql: In release v0.10.1 the default value for max_threads was increased in the CLI from 1 to 15. This creates an issue with Postgresql transactions. The recommended workaround is to decrease max_threads in your ingestion recipes to 1 if running Postgresql for the GMS backend.
- BigQuery: The BigQuery connector depends on a bad version of SQLParse, which manifests as a "str object is not callable" error. This has since been fixed in CLI release version v0.10.2.2.
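The Postgresql workaround above amounts to setting max_threads to 1 explicitly in each affected recipe. The sketch below assumes the recipe has already been loaded into a Python dictionary and that max_threads belongs under the sink's `config` block; verify that location against your own recipes. The helper name `pin_max_threads` and the placeholder source type are hypothetical.

```python
import copy
from typing import Any, Dict


def pin_max_threads(recipe: Dict[str, Any], max_threads: int = 1) -> Dict[str, Any]:
    """Return a copy of the recipe with the sink's max_threads set explicitly."""
    patched = copy.deepcopy(recipe)  # leave the caller's recipe untouched
    sink = patched.setdefault("sink", {})
    sink.setdefault("config", {})["max_threads"] = max_threads
    return patched


# Hypothetical recipe; the source section is only a placeholder.
recipe = {
    "source": {"type": "example-source", "config": {}},
    "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
}
patched = pin_max_threads(recipe)
print(patched["sink"]["config"]["max_threads"])  # 1
```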
Release Highlights
Metadata Ingestion
New
- [redshift] Redshift Combining Usage and Metadata Extraction
- [bigquery] Cross-Project Usage Support (using File System)
- [snowflake] Push down Lineage Extraction to Snowflake Access History API
- [azure-ad] Support stateful ingestion - Automatically remove groups and users when they are removed in Azure.
- [okta] Support stateful ingestion - Automatically remove groups and users when they are removed in Okta.
- [tableau] Extract lineage from CSQL queries in Tableau ingestion
- [snowflake] Better error message on key pair authentication
- [sdk] Support executing GraphQL Queries via DataHubGraph
- [unity] Support extracting ownership
- [postgres] Support extracting metadata from all databases in a single recipe
Bug Fixes
- [bigquery] Capture all operation types when ingesting operational stats
- [bigquery] Fix and refactor exported audit logs query
- [redshift] Fix SQL for extracting lineage from insert queries
User Experience
New
- Auto-Complete UX Refresh - Quickly filter search results inside autocomplete experience
- View: Support views on the Auto-Complete Search Bar
Bug Fixes
- Fix bug where Tag names do not render properly in search previews
- Fix bug where Tag color does not render properly in search autocomplete
- Fix bug when adding Tags and Glossary Terms to nested schema fields
- Fix bug where DataHub would redirect you when clicking to navigate back home
- Fix bug where Metadata Tests results did not show if they were all passing
Documentation
- Redshift Ingestion Quickstart Guide: https://datahubproject.io/docs/quick-ingestion-guides/redshift/overview
- Tableau Ingestion Quickstart Guide: https://datahubproject.io/docs/quick-ingestion-guides/tableau/overview
- PowerBI Ingestion Quickstart Guide: https://datahubproject.io/docs/quick-ingestion-guides/powerbi/overview
- Add docs on creating users and groups: https://datahubproject.io/docs/api/tutorials/creating-users-and-groups/
- Add docs for our Python SDK: https://datahubproject.io/docs/python-sdk/builder
- Add docs on Windows compatibility: https://datahubproject.io/docs/developers/#windows-compatibility
Developer Experience
- Add performance testing framework for BigQuery usage
What's Changed
- fix(cli): allow usage without kafka by @hsheth2 in #7677
- test(elasticsearch): Add unit test for timestamp-based lineage feature by @iprentic in #7661
- feat(docs-website): add docs on creating users and groups by @yoonhyejin in #7574
- chore(ci): add coverage code for python by @anshbansal in #7681
- doc(release): managed datahub v0.2.4 release notes by @anshbansal in #7679
- refactor(ingest/bigquery): add inline comments + refactor in table name parsing by @mayurinehate in #7609
- fix(ingest/looker): skip empty user ids for usage by @hsheth2 in #7686
- fix(ingest/dbt): enable incremental lineage by default by @hsheth2 in #7674
- fix(ingest/bigquery): Fix BigQueryTableType enum accesses by @asikowitz in #7685
- fix(ingest/looker): correct looker/lookml capability reports by @hsheth2 in #7683
- feat(ingest/looker): enable looker usage ingestion by default by @hsheth2 in #7684
- doc(freshness): add faq for dataset freshness by @anshbansal in #7693
- chore(lint): fix lint in looker by @anshbansal in #7695
- fix(ingest/bigquery): quote string constants in query by @mayurinehate in #7694
- feat(ui) Update auto-complete functionality and design by @chriscollins3456 in #7515
- fix(ui) Update Looker/Lookml forms to set client id and deploy key as Secrets by @chriscollins3456 in #7479
- perf(ingest): Improve FileBackedDict iteration performance; minor refactoring by @asikowitz in #7689
- feat(quickstart): move quickstart back to master by @hsheth2 in #7697
- test(ingest/dbt): add test for column meta match by @hsheth2 in #7673
- feat(ingest/postgres): support extracting metadata from all databases in single recipe by @mayurinehate in #7581
- docs(): generate docs for our Python SDK by @hsheth2 in #7612
- fix(ingest/redshift): Lineage query fix to work with the latest redshift by @treff7es in #7698
- feat(ingestion): azure-ad stateful ingestion by @mohdsiddique in #7701
- chore(ingest): formatting + cleanup MCPW usages by @hsheth2 in #7706
- test(ingest/bigquery): Add performance testing framework for bigquery usage by @asikowitz in #7690
- fix(docs): Fixing timeseries delete doc until code path is fixed by @jjoyce0510 in #7711
- docs: add concept section by @yoonhyejin in #7655
- JWT authenticator with asymmetric PublicKey verification for JWT token. by @syedzoherer in #6495
- fix(ingestion): fix AssertionError in base_transformer by @sgomezvillamor in #7702
- feat(docs): support inlining code snippets from files by @hsheth2 in #7712
- feat(ingestion) Allow for ingestion to read files remotely by @xiphl in #7552
- feat: add pre-commit by @yoonhyejin in #7680
- docs(okta): add how to use email in urns by @anshbansal in #7708
- feat(ingest/snowflake): hide host_port from snowflake docs by @hsheth2 in #7717
- feat(ingest/bigquery): Capture all operation types when ingesting operational stats by @asikowitz in #7723
- doc(redshift) - Adding Redshift ingestion quickstart guide by @treff7es in #7700
- refactor(ingest): Minor cleanup of File, CsvEnricher, BusinessGlossary, and FileLineage sources by @asikowitz in #7718
- feat(ingest/lookml): support views with derived_table.explore_source by @hsheth2 in #7704
- fix(ci): Fixing broken Domains Test by @jjoyce0510 in #7746
- feat(ingest/dbt): include dbt unique_id in properties by @hsheth2 in #7737
- docs(airflow): update with information for new plugin by @anshbansal in #7732
- chore(ingest): change kafka connect mapped ports by @hsheth2 in #7728
- feat(docs): clear up source configs by @hsheth2 in #7720
- feat(ingest): emit state payloads as soft-deleted by @hsheth2 in #7714
- fix(sdk): remove rest emitter to graph cache in CorpGroup by @bossenti in #7743
- refactor(ingest): Use sqlite.Row row_factory for FileBackedCollections by @asikowitz in #7739
- refactor(ingest/bigquery): Standardize audit log parsing and make TopKDict a DefaultDict by @asikowitz in #7738
- doc(ingestion): tableau quick ingestion guide by @mohdsiddique in #7682
- docs(search): Add example search for finding tables without the name field by @iprentic in #7647
- feat(ingest/dbt): update subtypes for dbt by @hsheth2 in #7750
- feat(snowflake): better error message on key pair authentication by @anshbansal in #7734
- feat(sdk): fix ownership emission for groups by @hsheth2 in #7751
- fix(TestResults UI):show non-failing TestResult by @blankon123 in #7747
- fix(ingest/bigquery): fix and refractor exported audit logs query by @mayurinehate in #7699
- fix(ingest/demo-data): fix bug in path type by @hsheth2 in https://github.com/datahub-project...
DataHub v0.10.1
Known Issues
CLI
- BigQuery: Table- and column-level profiling is broken due to a bad assumption introduced in this version. Please use an alternate version if you are using the BigQuery Profiling feature.
ElasticSearch
Clusters on version 7.9 and below are no longer supported with this release due to lack of case sensitivity support in term queries
Release Highlights
User Experience
- The Queries Tab has a new look - supports manually adding and annotating queries directly from the UI, making it easier to share trusted SQL logic with others
- Glossary Terms now show "Contained by" and "Inherited by" relationships
- Resolved issues with Download to CSV for large volumes of entities
- Update to the Analytics tab - view Monthly Active Users to keep track of DataHub adoption and activity within your organization
- Ongoing UI optimizations focused on improving the navigation experience
Metadata Ingestion
BigQuery
- Improvements to memory usage during metadata extraction
- Ingestion now captures Dataset Labels
- Emit cross-project usage
PowerBI
- Support for Platform Instance to uniquely identify multiple instances of the same Platform
- Support for PowerBI <> (Redshift, BigQuery) lineage extraction
- Extract entity descriptions
Miscellaneous
- DataHub Integrations Catalog to quickly filter and search for supported integrations
- Kafka Connect - support for stateful ingestion & lowercasing URNs
- Snowflake: improvements to memory usage during metadata extraction
- Postgres: supports estimated row counts during profiling
- Fix to dbt ingestion to address inconsistent upper/lower casing
- S3 ingestion now supports path_specs of multiple buckets in the same recipe
- Looker: Upgrade Looker API from 3.1 to 4.0
- Great Expectations: support for lowercasing URNs
- Tableau: Support for Project Path & Containers; ingestion more resilient to timeout exceptions
Developer Experience
Miscellaneous
- Neo4j support for lineage time filter
- Metadata model support for JSON schemas stored in Files, Directories, and Kafka Schema Registry
- Timeline API now supports Glossary Terms
- Improvements to startup time for DataHub CLI
API Docs & Guides
- Table of contents to understand DataHub APIs at a glance
- Guides:
  - Add Tags, Terms, Owners to entities
  - Create datasets
  - Manage Lineage
Search Improvements
- searchAcrossEntities/Lineage improvements
- support searchAfter
- advanced query, identity autocomplete, exact match weight
Breaking Changes
Lineage Graph UI
- Previously, DataHub would display Nodes in Lineage Viz even for URNs that do not technically exist (do not have any aspects defined). Now, those nodes are filtered out. This means that lineage which previously existed may no longer appear in the Lineage Graph. This change was made to improve the correctness and consistency of the DataHub experience. If you have feedback, feel free to reach out to the core team. To restore missing nodes, produce "DatasetKey" aspects for any URNs that you'd like to show in the Lineage Graph.
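To make the fix above more concrete, the sketch below derives the three DatasetKey fields (`platform`, `name`, `origin`) from a dataset URN. It only builds a plain dictionary. Emitting the aspect to DataHub would go through the SDK or API and is not shown here. The function name `dataset_key_from_urn` and the sample URN are hypothetical.

```python
from typing import Dict

_DATASET_PREFIX = "urn:li:dataset:("


def dataset_key_from_urn(urn: str) -> Dict[str, str]:
    """Split a dataset URN into the fields carried by a DatasetKey aspect."""
    if not (urn.startswith(_DATASET_PREFIX) and urn.endswith(")")):
        raise ValueError(f"not a dataset URN: {urn}")
    inner = urn[len(_DATASET_PREFIX):-1]
    # The platform comes first and the origin (environment) last;
    # everything in between is the dataset name.
    platform, rest = inner.split(",", 1)
    name, origin = rest.rsplit(",", 1)
    return {"platform": platform, "name": name, "origin": origin}


key = dataset_key_from_urn("urn:li:dataset:(urn:li:dataPlatform:hive,db.table,PROD)")
print(key)
```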
What's Changed
- fix(test): cleanup test on setup error by @david-leifker in #7259
- feat(cli): add 0.10 awareness to upgrade prompt by @shirshanka in #7273
- chore(ci): cleanup build to remove dependencies duckdb, dev by @anshbansal in #7267
- feat(oidc): add options for preferred jws algorithm by @david-leifker in #7245
- chore(cypress): upgrade cypress to latest v12.5.1 by @aditya-radhakrishnan in #7276
- fix(ingest/bigquery) - Fix for Bigquery parser quoted semicolon in the FROM table name as well by @treff7es in #7277
- chore(ci): ensure kafka setup runs for smoke tests by @anshbansal in #7278
- feat(ingest/bigquery) - Reporting current state of BigQuery ingestion by @treff7es in #7282
- feat(graphql): enabling graphql for data platform instance aspects by @sgomezvillamor in #7177
- feat(api): Timeline API supports Glossary Terms now by @vojtechneradatos in #7229
- getting rid of build locally(broken) for ./gradlew quickstart(working) by @laulpogan in #7283
- chore(ci): remove redundant quickstart check by @anshbansal in #7286
- Update smoke.sh by @david-leifker in #7284
- docs(release notes): Managed DataHub v0.2.0 release notes by @david-leifker in #7299
- docs(release): updating docs per release process by @david-leifker in #7281
- doc(access): move heading above the images by @anshbansal in #7291
- fix(docs): kafka - update docs to indicate protobuf support by @shirshanka in #7280
- fix(system-update): fixes system-update with more than 1 partition by @david-leifker in #7302
- fix(ui): fix styling on sign up and reset screens by @aditya-radhakrishnan in #7301
- fix(cypress): fix broken cypress tag tests by @aditya-radhakrishnan in #7306
- chore(ci): speed up ingestion test runs by @anshbansal in #7296
- docs(release notes): Update updating-datahub.md by @david-leifker in #7311
- fix(ingest/bigquery): Usage rate limiting and lineage exported log fix by @treff7es in #7297
- fix(bootstrap): do not re-run retention policy ingestion by @anshbansal in #7295
- refactor(github): change github reference to git references by @anshbansal in #7308
- fix(datahub-upgrade): allow registry override by @david-leifker in #7258
- feat(cli): improve startup time by @hsheth2 in #7292
- fix(search): correctly filter fields in EDITABLE_FIELD_TO_QUERY_PAIRS with a list of values by @jinlintt in #7303
- fix(ingest/bigquery) Lowering significantly the memory usage of the BigQuery connector by @treff7es in #7315
- chore(ingest): upgrade to mypy 1.0.0 by @hsheth2 in #7313
- fix(tests): Remove rollback-reports, add to ignore by @david-leifker in #7312
- perf(ingest): speed up MCPW.validate() by @hsheth2 in #7319
- fix(ingest/bigquery): Fix for table cache was not cleared by @treff7es in #7323
- fix(ingest/bigquery): Improve memory usage of lineage extraction by @treff7es in #7326
- docs(): Adding notebook support disclaimer by @jjoyce0510 in #7327
- fix(docs): sort sources by display name in doc's sidebar by @Masterchen09 in #7322
- fix(transformers): pattern add domain transformer - enable replace_existing by @asikowitz in #7317
- fix(ci): remove command from cache key as irrelevant for dependency by @anshbansal in #7314
- fix(check upgrade): update logic to compare server and client version by @mayurinehate in #7238
- fix(tracking): Remove 'title' field from tracking by @jjoyce0510 in #7328
- fix(homepage): make entity counts execute in parallel and make cache configurable by @RyanHolstien in #7249
- docs(delete): cleanup removed option by @anshbansal in #7335
- feat(ingestion): powerbi # Configurable Admin API by @mohdsiddique in #7055
- fix(sso) Retrieve cookie configs separately from SSO configs by @chriscollins3456 in #7330
- logging(cli): dropping neo4j message to debug to avoid confusion by @shirshanka in #7340
- perf(matadata-io): neo4j generateLineageStatement use shortestPath by @shidianshifen in #7219
- fix(tableau): make Tableau ingestor resilient to timeout exceptions by @skrydal in #7333
- chore(ci): mark tests correctly by @anshbansal in #7337
- refactor(upgrade): Trim upgrade name before executing by @jjoyce0510 in #7343
- fix(ui) Update styles of embedded profile page to match designs by @chriscollins3456 in #7348
- fixed links and improved recommendations by @laulpogan in #7334
- gradle(development): add additional commands for development by @david-leifker in #7321
- fix(search): support searchFlags ...
DataHub v0.10.0
Release Highlights
Potential Downtime
This release introduces substantial improvements to search functionality which require reindexing indices.
During the reindexing:
- a system-update job will set indices to read-only and create a backup/clone of each index
- new components will be prevented from start-up until the reindex completes
- Helm deployments will go into read-only mode and new ingestion runs will fail
This process can take anywhere from 5 minutes to multiple hours; as a rough estimate, expect it to take 1 hour for every 2.3 million entities. After the reindex is complete, check your ingestion runs and re-run any that did not complete.
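The rough estimate above can be turned into a quick calculation. This is only a sketch of the stated rule of thumb (1 hour per 2.3 million entities); actual duration depends on your cluster. The function name `estimate_reindex_hours` is hypothetical.

```python
ENTITIES_PER_HOUR = 2_300_000  # rough rate quoted in these release notes


def estimate_reindex_hours(entity_count: int) -> float:
    """Rough reindex duration in hours for a given number of entities."""
    if entity_count < 0:
        raise ValueError("entity_count must be non-negative")
    return entity_count / ENTITIES_PER_HOUR


hours = estimate_reindex_hours(4_600_000)
print(f"Estimated reindex time: {hours:.1f} hours")  # Estimated reindex time: 2.0 hours
```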
If you are deploying containers yourself
If you're deploying the Docker containers yourself (without Helm or Docker-Compose Quickstart), then you'll need to ensure that you first run the acryldata/datahub-upgrade docker image (v0.10.0 tag) with the required environment variables set.
Then, run the container with this command:
docker run acryldata/datahub-upgrade:v0.10.0 -u SystemUpdate
For the full set of environment variables required, check out the default docker.env provided for Docker Compose deployments.
This will run the required reindex against your Elasticsearch instance, after which other DataHub components should start correctly. If you do not run the datahub-upgrade container successfully, other components in the stack will fail to start correctly.
User Experience
We have some really exciting improvements to the DataHub user experience in this release!
Improved documentation editor, contributed by @ngamanda and the Grab Team.
This work provides a much more intuitive documentation editing experience within the UI, providing “what you see is what you get” formatting & removing the need for markdown expertise.
Additionally, you can easily:
- Add links to other entities/users within DataHub
- Embed and resize tables & images
- Toggle between font sizes and formats
- Embed syntax-highlighted code blocks
Filter lineage graphs based on time windows
You can now easily see the full lineage graph of an entity at a specific point in time. This makes it much easier to understand how interdependencies have evolved over time and to troubleshoot data issues in the past.
Improvements in Search
As noted above, we have rolled out substantial improvements to Search functionality, making it easier than ever for end users to find the entities that matter most. This release includes:
- Stemming & Synonyms
- Search by full or partial URN
- Autocomplete improvements
- Quoted search analyzer for exact & prefix match
Metadata Ingestion
Here are some of the most notable ingestion-related improvements:
- Redshift: You can now extract lineage information from unload queries – thanks for the contrib, @mmmeeedddsss
- PowerBI: Ingestion now maps Workspaces to DataHub Containers – thanks for the contrib, @looppi
- BigQuery: You can now extract lineage metadata from the Catalog API – thanks for the contrib, @PatrickfBraz
- Glue: Ingestion now uses table name as the human-readable name – thanks for the contrib, @danielcmessias
Developer Experience
- This release introduces DataHub Lite - a new experimental lightweight implementation of DataHub. It is intended to enable local developer tooling use-cases such as simple access to metadata for scripts and other tools. DataHub Lite is compatible with the DataHub metadata format and all the ingestion connectors that DataHub supports. Check out the docs here.
Breaking Changes
#7103 This should only impact users who have configured explicit non-default names for DataHub's Kafka topics. The environment variables used to configure Kafka topics in the kafka-setup docker image have been updated to be in line with other DataHub components; for more info, see our docs on Configuring Kafka in DataHub. These variables were previously suffixed with _TOPIC; the correct suffix is now _TOPIC_NAME. This change should not affect any user who is using default Kafka topic names.
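If you do override topic names, a small script can help spot variables still using the old suffix. The sketch below assumes every affected variable ends in _TOPIC and simply rewrites that suffix to _TOPIC_NAME. Review the result against the Configuring Kafka docs before applying it. The function name `rename_topic_vars` and the variable names in the sample are hypothetical.

```python
from typing import Dict

OLD_SUFFIX = "_TOPIC"
NEW_SUFFIX = "_TOPIC_NAME"


def rename_topic_vars(env: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of env with the old _TOPIC suffix replaced by _TOPIC_NAME."""
    renamed: Dict[str, str] = {}
    for key, value in env.items():
        if key.endswith(OLD_SUFFIX):
            key = key[: -len(OLD_SUFFIX)] + NEW_SUFFIX
        renamed[key] = value  # keys already ending in _TOPIC_NAME pass through unchanged
    return renamed


# Hypothetical variable names, for illustration only.
old_env = {
    "EXAMPLE_EVENT_TOPIC": "my_custom_topic",
    "ANOTHER_EVENT_TOPIC_NAME": "already_migrated",
    "KAFKA_BOOTSTRAP_SERVER": "broker:9092",
}
new_env = rename_topic_vars(old_env)
print(sorted(new_env))
```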
What's Changed
- fix(ci): only scan on master branch by @anshbansal in #7047
- fix(ci): use trivy offline scanning by @anshbansal in #7050
- docs(get-started) Simplify copy on Get Started landing page by @maggiehays in #7043
- fix(ingest/kafka): fix ResourceType import error for confluent_kafka<1.9.0 by @mayurinehate in #7046
- docs(dbt): fix indentation in dbt meta mapping docs by @jx2lee in #7045
- fix(ingest): temporarily disable vertica tests by @hsheth2 in #7059
- feat(editor): improve documentation editor using Remirror by @ngamanda in #6631
- fix(bootstrap): add EDIT_LINEAGE privilege to some default policies by @aditya-radhakrishnan in #7060
- feat(ingest): add entity registry in codegen by @hsheth2 in #6984
- feat(ingest): extract powerbi endorsements to tags by @looppi in #6638
- feat(ingestion): pull metabase database, schema names from raw query and api by @remisalmon in #7039
- fix(ingest): support multiple entity_registry sections by @hsheth2 in #7066
- ci(ingest): add flag to skip tests but run codegen during release by @hsheth2 in #7067
- fix(ingest): preserve dbt column name casing by @hsheth2 in #7063
- fix(ingest/tableau): fix node limit exceeded error for workbooks query by @mayurinehate in #7068
- fix(build/airflow): Fixing gradlew path by @treff7es in #7069
- feat(ingest): support snapshots in dbt and dbt-cloud by @hsheth2 in #7062
- fix(ui) Fix duplicate schema field rendering with siblings by @chriscollins3456 in #7057
- refactor(ingest/athena): Replace s3_staging_dir parameter in Athena source with query_result_location by @bossenti in #7044
- feat(ingest): fix handling of unions with aliases in post restli conversion by @hsheth2 in #7058
- fix(ui) Make checkboxes in ingestion forms easier to see by @chriscollins3456 in #7061
- fix(ingest): support git clone of non-github repos by @hsheth2 in #7065
- feat(ingest): reporting revamp, part 1 by @hsheth2 in #7031
- fix(secret-service): fix default encrypt key by @david-leifker in #7074
- feat(datahub-lite): introduces a new experimental lightweight impleme… by @shirshanka in #7052
- feat(datahub-lite): adding tab completion, small serialization fixes by @shirshanka in #7079
- docs: add docs for managed DataHub v0.1.72 by @anshbansal in #7070
- docs(readme): add inovex as adopter by @DSchmidtDev in #7077
- docs: add warning about clearing cookies for login by @anshbansal in #7084
- feat(cache): add hazelcast distributed cache option by @RyanHolstien in #6645
- docs(datahub-lite): small improvement for zsh tab completion by @shirshanka in #7085
- fix(ingest/bigquery): clear stateful ingestion correctly by @hsheth2 in #7075
- fix(graphql): Return with appropriate status code instead of stacktrace by @szalai1 in #7086
- fix(sso): Clear cookies on SSO redirect error by @aditya-radhakrishnan in #7088
- fix(docs): add missing mutation literal by @ruedigerblock in #7082
- fix(ui): display the correct access token expiry in AccessTokenModal by @ngamanda in #7078
- fix(cli/lite): fix datahub lite serve command by @hsheth2 in #7089
- fix(profiling): Fix syntax for APPROX_COUNT_DISTINCT on bigquery and snowflake by @feljen in #7087
- fix(ingest): fix logic error of google protobuf wrapper type. by @wngus606 in #7076
- feat(ui): Documentation Editor Improvements by @jjoyce0510 in #7072
- fix(uri): marks uri field as deprecated, removes problem code, and adds coercer for usages of URI typeref by @RyanHolstien in #7093
- fix(build): postgres docker secret by @david-leifker in https://github.com/datahub-pr...
DataHub v0.9.6.1
Release Highlights
Please upgrade from 0.9.6 ASAP to avoid ongoing issues creating and using secrets.
Important Release Notes
With this release, if you are using Neo4J as your graph implementation, you need to set:
GRAPH_SERVICE_DIFF_MODE_ENABLED=false
For GMS (or MAE Consumer for standalone mode).
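A deployment script could verify this setting before starting GMS or the MAE Consumer. The sketch below is a minimal pre-flight check, assuming the variable is passed through the process environment; the function name is hypothetical and not part of DataHub.

```python
from typing import Mapping

DIFF_MODE_VAR = "GRAPH_SERVICE_DIFF_MODE_ENABLED"


def diff_mode_disabled(env: Mapping[str, str]) -> bool:
    """Return True only if the diff-mode variable is explicitly set to false."""
    # Treat a missing variable as "not disabled": the release note asks
    # Neo4j users to set the value explicitly.
    return env.get(DIFF_MODE_VAR, "").strip().lower() == "false"
```

In practice this would be called as `diff_mode_disabled(os.environ)` (after `import os`) and the deployment aborted with a clear message when it returns False on a Neo4j-backed install.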
Bug fix for secrets encryption
- Prevents decryption errors for existing secrets
- Affects reading ingestion secret created with a previous release
- Affects native user password validation
What's Changed
Full Changelog: v0.9.6...v0.9.6.1
DataHub v0.9.6
⚠️ This Release has been patched. Please upgrade to 0.9.6.1 ⚠️
As of January 19th, 2023, 0.9.6.1 is the official release build and should be used instead of 0.9.6. Upgrade to 0.9.6.1 when possible to avoid issues creating and using secrets.
Release Highlights
Important Release Notes
With this release, if you are using Neo4J as your graph implementation, you need to set:
GRAPH_SERVICE_DIFF_MODE_ENABLED=false
For GMS (or MAE Consumer for standalone mode).
User Experience
- We now support embedding Dashboards, Charts, and Datasets. This allows us to do things like directly embed Looker / Tableau / Mode / Redash Looks, Dashboards, Explores into the Dataset pages themselves.
- [Experimental] You can now customize the number of queries displayed on the Query tab of a Dataset entity
- Improved error messaging for bulk editing via the UI
Metadata Ingestion
- Update to data profiling to allow configurable number of sample values to be returned
- Postgres ingestion now supports emitting lineage edges for Views - shoutout to @LucasRoesler for the contribution!
- Snowflake ingestion now supports extracting tags - shoutout to @frsann for the contribution!
- Vertica ingestion now supports projections and lineage - thanks for the contribution, @vishalkSimplify!
- Glue ingestion now emits an s3 lineage edge when data was written with an s3a/s3n client - thanks for the contribution, @danielli-ziprecruiter!
Developer Experience
- Fixes quickstart/docker compose issues for M1 machines
- Improvements in reliability and performance of the Restli Service endpoints for ingestion:
  - Scale Restli Service thread pool based on CPU
  - Add retry (exp backoff) to Restli Entity Client
  - MCE no longer relies on GMS for Restli service
  - Converted Restli Service from standalone servlet to Spring injectable
- Docker build externalized (significantly faster on m1, <7 minute build times, based on this)
- Frontend asset generation refactor (causing tests to fail intermittently)
What's Changed
- feat(ingest): add pydantic helper for removed fields by @hsheth2 in #6853
- chore(0.9.5): Bump defaults for release v0.9.5 by @jjoyce0510 in #6856
- Revert "fix(ci): remove warnings due to deprecated action" by @anshbansal in #6857
- refactor(restli-mce-consumer) by @david-leifker in #6744
- fix(ci): reduce smoke test run time by @anshbansal in #6841
- fix(security): require signed/encrypted jwt tokens by @david-leifker in #6565
- feat(ingest): update profiling to fetch configurable number of sample values by @mayurinehate in #6859
- feat(ingest/airflow): support raw dataset urns in airflow lineage by @hsheth2 in #6854
- refactor(graphql): make graphqlengine easier to use by @anshbansal in #6865
- fix(kafka): datahub-upgrade job by @david-leifker in #6864
- feat(ingest): pass timeout config in kafka admin client api calls by @mayurinehate in #6863
- chore(ingest): loosen requirements file by @hsheth2 in #6867
- feat(ingest): upgrade pydantic version by @cccs-eric in #6858
- fix(elasticsearch): fixes out of order runId writes by @david-leifker in #6845
- chore(ingest): loosen additional requirements by @hsheth2 in #6868
- feat(ingest): bigquery/snowflake - Store last profile date in state by @treff7es in #6832
- docs(google-analytics): Correct grammatical error in README.md by @jx2lee in #6870
- feat(CI): add venv caching by @szalai1 in #6843
- feat(ingest/snowflake): handle failures gracefully and raise permission failures by @mayurinehate in #6748
- fix(runid): always update runid, except when queued by @david-leifker in #6876
- fix(ingest): conditionally include env in assertion guid by @hsheth2 in #6811
- chore(ci): update dependencies docs-website by @anshbansal in #6871
- feat(ui) - Add a custom error message for bulk edit to add clarity by @mkamalas in #6775
- docs(adding users): Refreshing the docs for adding new DataHub Users by @jjoyce0510 in #6879
- test(mce-consumer): mockbeans by @david-leifker in #6878
- feat(ingest): avoid embedding serialized json in metadata files by @hsheth2 in #6742
- refactor(gradle): move the local docker registry to common location by @david-leifker in #6881
- refactor(smoke): use env variables by @anshbansal in #6866
- fix(lint): pin pydantic version by @anshbansal in #6886
- refactor(docs): Correctly spell elasticsearch in docs by @jjoyce0510 in #6880
- fix(ingest): okta undefined variable error by @anshbansal in #6882
- fix(ci): reduce flakiness in add_users, siblings smoke test by @anshbansal in #6883
- fix(ingest): fall back to default table comment method for all Trino query errors by @marvin-roesch in #6873
- test(misc): misc test updates by @david-leifker in #6890
- deprecate(ingest): bigquery - Removing bigquery-legacy source by @treff7es in #6851
- chore(ingest): remove inferred args to MCPW, part 1 by @hsheth2 in #6819
- test(ingest/kafka-connect): make docker setup more reliable by @hsheth2 in #6902
- fix(ingest): profiling (bigquery) - Address biquery profiling query error due to timestamp vs data mismatch by @treff7es in #6874
- fix(cli): Make datahub quickstart work with latest docker compose in M1 by @pedro93 in #6891
- fix(cli): fix delete urn cli bug + stricter type annotations by @hsheth2 in #6903
- fix(ingest/airflow): reorder imports to avoid cyclical dependencies by @stijndehaes in #6719
- feat: remove jq requirement + tweak modeldocgen args by @hsheth2 in #6904
- chore(ingest): loosen pyspark and pydeequ deps by @hsheth2 in #6908
- docs(ingest/looker): fix typos + update lookml github action example by @hsheth2 in #6910
- fix(ingest/metabase): use card_id in dashboard to chart lineage by @ccpypy in #6583
- fix(es-setup): create data stream on non-aws by @szalai1 in #6926
- Adding missing Platform logos by @maggiehays in #6892
- feat(ingestion): PowerBI# Improve PowerBI source ingestion by @mohdsiddique in #6549
- Fix compose context for kafka-setup by @szalai1 in #6923
- feat(backend): Supporting Embeddable Previews for Dashboards, Charts, Datasets by @jjoyce0510 in #6875
- chore(deps): bump json5 from 2.2.1 to 2.2.3 in /docs-website by @dependabot in #6930
- chore(deps): bump json5 from 1.0.1 to 1.0.2 in /datahub-web-react by @dependabot in #6931
- fix(ci): managed ingestion test fix by @anshbansal in #6946
- feat(ingest): add include_table_location_lineage flag for SQL common by @hsheth2 in #6934
- feat(ingest): allow extracting snowflake tags by @frsann in #6500
- chore(ingest): unpin pydantic dep by @hsheth2 in #6909
- chore(ingest): partially revert pyspark dep from #6908 by @hsheth2 in #6954
- fix(ingest): use branch info when cloning git repos by @hsheth2 in #6937
- chore(ingest): remove i...
DataHub v0.9.5
Release Highlights
Notice: This release includes a fix for a Single Sign-On (OIDC) issue that was introduced in the previous release, v0.9.4.
Important Release Notes
With this release, if you are using Neo4J as your graph implementation, you need to set:
GRAPH_SERVICE_DIFF_MODE_ENABLED=false
For GMS (or MAE Consumer for standalone mode).
User Experience
- Manual Lineage is LIVE! You can now add and remove lineage between entities in the Lineage Visualization screen, making it easier than ever to manage the complex relationships between your data resources.
- Our new Views feature makes it easy to create curated sets of Entities within DataHub. This is a great way to start to isolate the entities that matter most, and provide your DataHub end-users with a streamlined view of the assets that are relevant to their use cases. See the original demo video.
- In-App Product Tours are here! When logging into DataHub and/or visiting a new page type for the first time, new users will be prompted with a helpful walkthrough of core functionality to get them familiar with the platform. We’ll continue to add modules as we roll out new features!
- Automatically send updates to Slack and/or Microsoft Teams when changes are made within DataHub by leveraging our new Slack and Teams Actions.
Metadata Ingestion
We’re continuing to improve the user experience for UI-based ingestion for the following sources:
- DataBricks Unity Catalog
- dbt Cloud
- MySQL
- Trino/Presto
- Microsoft SQL Server
- MariaDB
If you’re just getting started with UI-based Ingestion, check out our new BigQuery & Snowflake guides.
Stateful ingestion is now supported for Iceberg (thanks for the contrib, @cccs-Dustin!) and LDAP (thanks for the contrib, @bda618!)
What's Changed
- feat(ingest): add failure/warning counts to ingest_stats by @hsheth2 in #6823
- refactor(ingest): clean up pipeline init error handling by @hsheth2 in #6817
- fix(ingest): exclude ztsd from uber jar to prevent jni conflicts with spark by @danielli-ziprecruiter in #6787
- feat(ingest/bigquery): add option to enable/disable legacy sharded table support by @treff7es in #6822
- fix(ingest): support patches in auto_status_aspect by @hsheth2 in #6827
- fix(ci): reduce flakiness views select test by @anshbansal in #6821
- refactor(ingest): clean up exception types by @hsheth2 in #6818
- fix(ingest): fixed snowflake oauth ingestion not using role attribute… by @daguito81 in #6825
- refactor(ingestion): Browse Paths Upgrade V2 Feast & Sagemaker by @jjoyce0510 in #6002
- fix(lineage) Fix lineage viz with multiple siblings by @chriscollins3456 in #6826
- fix(pac4j-oidc): add verifier parameter by @david-leifker in #6835
- feat(ingest): extract kafka topic config properties as customProperties by @mayurinehate in #6783
- docs: Incorrect import statement fixed in example by @mirac-cisco in #6838
- feat(ingestion): support lineage for delta lake writes by @danielli-ziprecruiter in #6834
- feat(ui): Support defining the ID for Glossary Term and Glossary Term Group in UI by @jjoyce0510 in #6830
- feat(ci): add cypress test ui based ingestion by @anshbansal in #6769
- feat(ui): sortable domain list by @looppi in #6736
- fix(ci): add labels based on more folders by @anshbansal in #6840
- fix(ingest): kafka ingest task hand up with error bootstrap server by @wangsaisai in #6820
- fix(ingest): Fixing Kafka source linting by @jjoyce0510 in #6844
- fix(ingestion) Inject pipeline_name into recipes at runtime by @chriscollins3456 in #6833
- feat(ingest): add db/schema properties hook to SQL common by @hsheth2 in #6847
- fix(oidc): fix oidc authentication loop by @david-leifker in #6848
- docs(confluent): add details for actions pod for confluent by @RyanHolstien in #6810
- feat(ingestion): Business Glossary# Add domain support in GlossaryTerm ingestion by @mohdsiddique in #6829
- fix(ingest/looker): handle missing label fields by @hsheth2 in #6849
- refactor(ui): Misc domains improvements by @jjoyce0510 in #6850
New Contributors
- @daguito81 made their first contribution in #6825
- @mirac-cisco made their first contribution in #6838
Full Changelog: v0.9.4...v0.9.5
[Known Issues] DataHub v0.9.4
Known Issues
In this release, our OIDC SSO library received a major version upgrade. There is an issue with how the newer version of the library interacts with OIDC providers. We have addressed this issue in v0.9.5. We recommend not upgrading to this version if your organization is actively using OIDC to manage user authentication.
Important Release Notes
With this release, if you are using Neo4J as your graph implementation, you need to set:
GRAPH_SERVICE_DIFF_MODE_ENABLED=false
For GMS (or MAE Consumer for standalone mode).
What's Changed
- chore(): Updating default CLI version, update updating-datahub.md by @jjoyce0510 in #6590
- fix(ingest): profiling - Profiling failed if column cardinality threw an error by @treff7es in #6582
- fix(actions): add missing datahub-gms-protocol env var by @shirshanka in #6593
- fix(ingest): restrict snowflake-connector-python dependency by @mayurinehate in #6594
- feat(ingest/bigquery): avoid creating/deleting tables for profiling by @hsheth2 in #6578
- fix(ingest): unify emit interface by @hsheth2 in #6592
- fix(security): security version updates by @david-leifker in #6602
- docs: remove Kafka Streams from documentation by @maver1ck in #6596
- refactor(ui): Improving Kafka UI Ingestion Form, Create Domain, Create Secret Modals by @jjoyce0510 in #6588
- fix(ingest): clarify tableau auth error messages by @hsheth2 in #6600
- docs(graphql): fix deleteTest "Create"->"Delete" by @nickwu241 in #6574
- fix(gms/startup): remove set -x from start.sh by @timcosta in #6589
- feat(sql): Add SQL index on createdon field by @pedro93 in #6522
- feat(ml model): updating view of ml model feature list by @gabe-lyons in #6576
- fix(ingest/bigquery): ignore complex types from profiling by @treff7es in #6613
- feat(ingest): add external url for snowflake objects by @mayurinehate in #6580
- chore(ingest): bump and pin mypy by @hsheth2 in #6584
- fix(ingest): only require github_info for lookml and not looker by @hsheth2 in #6608
- docs(ingest): add airflow docs that use the PythonVirtualenvOperator by @hsheth2 in #6604
- fix(ui) Fix double scroll in embedded list search sections by @chriscollins3456 in #6618
- feat(ingest): print detailed GMS error messages by @djordje-mijatovic in #6519
- Townhall agenda wikimedia by @maggiehays in #6622
- fix(analytics): skip ListDomains if user cannot manage domains and have only one loading message by @aditya-radhakrishnan in #6624
- feat(quickstart): add support for passing thru env vars needed by Sla… by @shirshanka in #6591
- docs(actions): slack, teams by @shirshanka in #6632
- fix(logging): Remove lombok as source of slf4j-api by @david-leifker in #6616
- docs: add links from main README to slack, teams actions by @shirshanka in #6633
- feat(ingest): Support config variable for specifying a direct privat… by @mayurinehate in #6609
- Add AWS Postgres Iam Auth jar to GMS by @syedzoherer in #6371
- feat(ingest/snowflake): support filtering by fully qualified schema_pattern by @mayurinehate in #6611
- feat(ingest/kafka-connect): support MongoSourceConnector by @frsann in #6416
- feat(graph) Add createdOn, createdActor, updatedOn, updatedActor to graph edges by @chriscollins3456 in #6615
- refactor(ui): Making improvements to UI ingestion forms, adding MySQL, Trino, Presto, MSSQL, MariaDB forms by @jjoyce0510 in #6607
- perf(ui-ingestion): cache on creation or deletion of ingestion sources to reduce latency by @aditya-radhakrishnan in #6647
- feat(ingest): add dummy data source for automated testing by @anshbansal in #6550
- docs(managed datahub): adding release notes for v0.1.70 by @anshbansal in #6655
- feat(gms): Pluggable Authentication & Authorization Framework by @mohdsiddique in #6634
- docs: move rfcs to separate repo by @laulpogan in #6621
- fix(ingest): fix lingering demo-data source issues by @hsheth2 in #6659
- feat(ingest): bigquery - Running lineage extraction after metadata extraction by @treff7es in #6653
- fix(ingest): issue deprecation warning correctly by @hsheth2 in #6623
- chore(ingest): remove feast-legacy by @hsheth2 in #6661
- fix(ingest/snowflake): support domains for snowflake schema containers by @hsheth2 in #6662
- build(deps): bump decode-uri-component from 0.2.0 to 0.2.2 in /datahub-web-react by @dependabot in #6617
- feat(ingest/dbt): add support for latest DBT version 1.3 by @MatthieuBlais in #6651
- docs: add languages to code highlighting by @hsheth2 in #5576
- docs(typo) Correct typo in domains.md by @maggiehays in #6667
- feat(gms): Enable auth-api publishing to maven by @mohdsiddique in #6671
- fix(ingest/powerbi-report-server): deprecate unused graphql config by @daha in #6630
- fix(docker): Fix datahub-frontend dockerfile by @jjoyce0510 in #6670
- fix(ingest): profiling - Changing profiling defaults by @treff7es in #6640
- feat(ci): add smoke test for domain mutation by @anshbansal in #6641
- fix(datahub-protobuf): fix missing httpclient dependency by @shirshanka in #6672
- feat(ingest): update snowflake docs, add simple validations by @mayurinehate in #6636
- fix(gms): DataHub Auth API java doc fix by @mohdsiddique in #6674
- feat(ingest): run profiler in more cardinality cases by @hsheth2 in #6397
- docs(search) update broken youtube link by @maggiehays in #6678
- docs(protobuf): update examples for protobuf by @david-leifker in #6681
- feat(ingest): support knowledge links in business glossary by @mohdsiddique in #6375
- fix(ingestion/vertica): support columns with timestamp precision by @inancdokurel in #6295
- feat(ingest): add timestamps for snowflake objects by @mayurinehate in #6570
- feat(onboarding): adds framework and some steps for onboarding steps UI by @aditya-radhakrishnan in #6462
- feat(ingest): use entry point for registering transformers by @Masterchen09 in #6628
- chore(ci): update base ingestion image requirements file by @anshbansal in #6687
- fix(ci): reduce warnings due to deprecated action by @anshbansal in #6686
- refactor(ui): Adding caching for users, groups, and roles by @jjoyce0510 in #6673
- fix(ci): revert confluent kafka in base image by @anshbansal in #6690
- fix(security): version bump to latest minor python image by @david-leifker in #6694
- docs(ingest/salesforce): list required permissions by @orlandine in #6610
- feat(ingest): bigquery - option to set on behalf project by @treff7es in #6660
- ci: stop commenting test results on PR by @hsheth2 in #6700
- fix(auth-api): Attempting to fix publish for auth-api by @jjoyce0510 in https:...
DataHub v0.9.3
Release Highlights
Important Release Notes
With this release, if you are using Neo4J as your graph implementation, you need to set:
GRAPH_SERVICE_DIFF_MODE_ENABLED=false
For GMS (or MAE Consumer for standalone mode).
User Experience
- Column Level Lineage Impact Analysis is live! Read more about it here
- You can now sort Dataset field names alphabetically - this is super handy for finding columns within wide datasets that may not have an easy-to-follow order by default
- New - an “Explore All” button on the home page, making it easier to jump into the search experience
- Plus! We now have a “Share” button on entity pages, making it easier for you to share DataHub links with others
- [Community Contribution] You can now assign the same user as different owner types - thanks for the contrib, @rtekal!
- [Community Contribution] You can now see recommendations for Recently Edited entities on the homepage! - thanks for the contrib, @CorentinDuhamel
Metadata Ingestion
- Snowflake Automated PII Classification is here! We’re eager for feedback on the utility of this feature - check out this guide, take it for a spin, and let us know what you think!
- NEW! dbt Cloud ingestion is ready for ya - check out the module details here
- We’ve simplified the configs required to add stateful ingestion to an ingestion source - check out the updated docs here
- Speaking of stateful ingestion, it’s now available with:
  - Looker & LookML ingestion sources
  - [Community Contribution] Container-level ingestion – thanks for the contrib, @wangsaisai!
Developer Experience
- [Community Contribution] For those of you deploying DataHub with Neo4j, we now support Lineage Impact Analysis via Neo4j multihop functionality. Thanks for the contrib, @djordje-mijatovic!
- We’ve loosened our SQLAlchemy dependencies to support Airflow 2.3+
What's Changed
- fix(spark-lineage): Smoke test fix + smoke test m1 support by @treff7es in #6372
- feat(ingest): supports MCEs in domain transformer by @hsheth2 in #6364
- feat(ingest): enable container stateful ingestion by @wangsaisai in #6343
- build(ingest): pin mypy version by @hsheth2 in #6391
- build: use acryl's gradle-avro-plugin by @hsheth2 in #6390
- fix(ingest): unity - add missing date type by @ms32035 in #6385
- fix(ingest): unity-catalog - Removing unneeded sqlalchemy dependency to fix install by @treff7es in #6379
- feat(ingest/tableau): re-authenticate if the token expires by @hsheth2 in #6380
- fix(ingest): use profiler config settings correctly by @hsheth2 in #6354
- fix(ingest): handle error when query returns no columns in snowflake lineage by @mayurinehate in #6404
- fix(ingest): fix missing snowflake lineage when table_pattern is set by @mayurinehate in #6410
- feat(ingest): loosen sqlalchemy dep & support airflow 2.3+ by @hsheth2 in #6204
- fix(ingest/s3): add status aspect for detected s3 datasets by @mayurinehate in #6402
- fix(ingest/snowflake): loosen snowflake connector version requirement by @hsheth2 in #6418
- fix(mysql): fix native data type for mysql set type by @mayurinehate in #6407
- perf(ui): virtualized schema table rows by @stanbaker in #6287
- fix(ui) Improve HoverEntityTooltip and truncate parent glossary nodes by @chriscollins3456 in #6417
- feat(ingest): support incremental lineage to dbt node from external platform by @mayurinehate in #6392
- fix(ingest): init dataset props if missing in transformer by @hsheth2 in #6429
- fix(change-event): remove unnecessary dependencies on EntityChangeEventGeneratorRegistryFactory by @aditya-radhakrishnan in #6431
- build(deps): bump moment-timezone from 0.5.34 to 0.5.35 in /datahub-web-react by @dependabot in #5783
- feat(frontend): Adding support to show externalUrl and institutionalMemoryFields for MLModels by @lurecas in #6053
- feat(model): adds properties, ownership, deprecated, institutional memory and tags as aspects for data platform instance entity by @sgomezvillamor in #5728
- docs(ingest/airflow): clarify docs around 1.x compat by @hsheth2 in #6436
- feat(recommendations): add last edited entities by @CorentinDuhamel in #6329
- fix(ingest): correctly compute entity change percentage by @hsheth2 in #6438
- docs(townhall) Updating Townhall History by @maggiehays in #6336
- Neo4j multihop support by @djordje-mijatovic in #6104
- fix(mae-consumer): Set proper variable expansion for JMX_OPTS and JAVA_OPTS in MAE docker by @skrydal in #6378
- docs(ingest): move prerequisite section before the ingestion recipe example by @mayurinehate in #6341
- fix(dataset): improve glossary term load performance for datasets by @Reilman79 in #6396
- feat(lineage) Implement CLL impact analysis for inputFields by @chriscollins3456 in #6426
- feat(ui) Add upgrade step to enable CLL impact analysis for existing data by @chriscollins3456 in #6427
- Added functionality to copy fieldpath and urn of each column by @Ankit-Keshari-Vituity in #6398
- fix(ingestion): add output converters for ODBC unsuported datatype in… by @LavinaVRovine in #6134
- fix(ui) Fix parentNodes overfetching everywhere it's used by @chriscollins3456 in #6446
- fix(ingest): snowflake - Fixing top query trimming in snowflake by @treff7es in #6447
- feat(elasticsearch): Updates to elasticsearch configuration, dao, tests by @david-leifker in #6269
- chore(ingest): fix mssql lint by @hsheth2 in #6453
- fix(ingest): add cli info to ingestion reporter by @hsheth2 in #6451
- fix(ui) Fix glossary side browser width fluctuating by @chriscollins3456 in #6457
- fix(python): Fix python dependencies for doc generation by @david-leifker in #6460
- docs(website): add homepage links by @jeffmerrick in #6458
- build(ingest): loosen jinja2 dependency for superset by @KulykDmytro in #6433
- fix(ingest): lowercase db name in mssql ingestion by @hsheth2 in #6448
- fix(ingest): handle missing schema in transformer by @hsheth2 in #6445
- feat(ingest): allow specific profiler config fields to override profile_table_level_only by @hsheth2 in #6366
- docs(enrichment) updating enrichment landing page by @maggiehays in #6286
- fix(home-page): remove redundant getAuthenticatedUser query by @aditya-radhakrishnan in #6464
- feat(ingest): detect old or missing docker compose by @hsheth2 in #6466
- feat(ingestion): powerbi # Power BI report support by @mohdsiddique in #6339
- fix(ingest/dbt): disable incremental lineage by default by @hsheth2 in #6467
- fix(loggin): print logging timestamp in ISO8601 format instead of jus… by @szalai1 in #6474
- docs(ingest/trino): add example of http connect...