Releases: dagster-io/dagster
0.8.8
New
- The new `configured` API makes it easy to create configured versions of resources; a short sketch follows this list.
- Deprecated the `Materialization` event type in favor of the new `AssetMaterialization` event type, which requires the `asset_key` parameter. Solids yielding `Materialization` events will continue to work as before, though the `Materialization` event will be removed in a future release.
- We have added an `intermediate_storage_defs` argument to `ModeDefinition`, which will eventually replace system storage. You can only use one or the other for now. We will eventually deprecate system storage entirely, but continued usage for the time being is fine.
- The help panel in the dagit config editor can now be resized and toggled open or closed, to enable easier editing on smaller screens.
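As a rough illustration of the two items above, here is a minimal sketch of `configured` and `AssetMaterialization`. The `s3_session` resource and `emit_table` solid are hypothetical, and the exact call forms may vary slightly by version:

```python
from dagster import AssetMaterialization, Output, configured, resource, solid

# hypothetical resource used only for illustration
@resource(config_schema={"region": str})
def s3_session(init_context):
    return {"region": init_context.resource_config["region"]}

# `configured` returns a new resource definition with the config pre-filled,
# so runs no longer need to supply it at launch time
east_s3_session = configured(s3_session)({"region": "us-east-1"})

@solid
def emit_table(context):
    # AssetMaterialization replaces Materialization and requires an asset_key
    yield AssetMaterialization(asset_key="my_table", description="Wrote my_table")
    yield Output(None)
```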
Bugfixes
- Opening new Dagit browser windows maintains your current repository selection. #2722
- Pipelines with the same name in different repositories no longer incorrectly share playground state. #2720
- Setting `default_value` config on a field now works as expected. #2725
- Fixed a rendering bug in the dagit run reviewer where yet-to-be-executed execution steps were rendered on the left-hand side instead of the right.
0.8.7
Breaking Changes
- Loading python modules reliant on the working directory being on the PYTHONPATH is no longer supported. The `dagster` and `dagit` CLI commands no longer add the working directory to the PYTHONPATH when resolving modules, which may break some imports. Explicitly installed python packages can be specified in workspaces using the `python_package` workspace yaml config option. The `python_module` config option is deprecated and will be removed in a future release.
New
- Dagit can be hosted on a sub-path by passing `--path-prefix` to the dagit CLI. #2073
- The `date_partition_range` util function now accepts an optional `inclusive` boolean argument. By default, the function does not include the partition for which the end time of the date range is greater than the current time. If `inclusive=True`, then the list of partitions returned will include the extra partition. A hedged sketch follows this list.
- `MultiDependency` or fan-in inputs will now only cause the solid step to skip if all of the fanned-in inputs' upstream outputs were skipped.
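A rough sketch of the new `inclusive` flag. The import path and exact signature here are assumptions based on dagster's partition utilities and may differ by version:

```python
from datetime import datetime

# assumed import path for the partition utility
from dagster.utils.partitions import date_partition_range

# returns a callable producing daily partitions from the start date; with
# inclusive=True, the partition whose range ends after "now" is also included
get_partitions = date_partition_range(start=datetime(2020, 1, 1), inclusive=True)
partitions = get_partitions()
```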
Bugfixes
- Fixed accidental breaking change with `input_hydration_config` arguments
- Fixed an issue with yaml merging (thanks @shasha79!)
- Invoking `alias` on a solid output will produce a useful error message (thanks @iKintosh!)
- Restored missing run pagination controls
- Fixed error resolving partition-based schedules created via dagster schedule decorators (e.g. `daily_schedule`) for certain workspace.yaml formats
0.8.6
Breaking Changes
- The `dagster-celery` module has been broken apart to manage dependencies more coherently. There are now three modules: `dagster-celery`, `dagster-celery-k8s`, and `dagster-celery-docker`.
- Related to the above, the `dagster-celery worker start` command now takes a required `-A` parameter which must point to the `app.py` file within the appropriate module. E.g., if you are using the `celery_k8s_job_executor` then you must use the `-A dagster_celery_k8s.app` option when using the `celery` or `dagster-celery` CLI tools. Similarly for the `celery_docker_executor`: `-A dagster_celery_docker.app` must be used.
- Renamed the `input_hydration_config` and `output_materialization_config` decorators to `dagster_type_loader` and `dagster_type_materializer` respectively. Renamed DagsterType's `input_hydration_config` and `output_materialization_config` arguments to `loader` and `materializer` respectively. A short sketch of the renamed decorator follows this list.
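A minimal sketch of the renamed decorator and argument, using a simple string-backed type invented for illustration; it is not taken from the release itself:

```python
from dagster import DagsterType, dagster_type_loader

@dagster_type_loader(str)
def load_text_file(_context, path):
    # read the runtime value from the path supplied in run config
    with open(path) as f:
        return f.read()

TextFile = DagsterType(
    name="TextFile",
    type_check_fn=lambda _context, value: isinstance(value, str),
    loader=load_text_file,  # formerly input_hydration_config
)
```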
New
- New pipeline scoped runs tab in Dagit
- Add the following Dask Job Queue clusters: moab, sge, lsf, slurm, oar (thanks @DavidKatz-il!)
- K8s resource-requirements for run coordinator pods can be specified using the `dagster-k8s/resource_requirements` tag on pipeline definitions:

  ```python
  @pipeline(
      tags={
          'dagster-k8s/resource_requirements': {
              'requests': {'cpu': '250m', 'memory': '64Mi'},
              'limits': {'cpu': '500m', 'memory': '2560Mi'},
          }
      },
  )
  def foo_bar_pipeline():
      ...
  ```

- Added better error messaging in dagit for partition set and schedule configuration errors
- An initial version of the CeleryDockerExecutor was added (thanks @mrdrprofuroboros!). The celery workers will launch tasks in docker containers.
- Experimental: Great Expectations integration is currently under development in the new library dagster-ge. Example usage can be found here.
0.8.5
Breaking Changes
- Python 3.5 is no longer under test.
- `Engine` and `ExecutorConfig` have been deleted in favor of `Executor`. Instead of the `@executor` decorator decorating a function that returns an `ExecutorConfig`, it should now decorate a function that returns an `Executor`.
New
- The python built-in `dict` can be used as an alias for `Permissive()` within a config schema declaration; a short sketch follows this list.
- Use `StringSource` in the `S3ComputeLogManager` configuration schema to support using environment variables in the configuration (Thanks @mrdrprofuroboros!)
- Improve Backfill CLI help text
- Add options to spark_df_output_schema (Thanks @DavidKatz-il!)
- Helm: Added support for overriding the PostgreSQL image/version used in the init container checks.
- Update celery k8s helm chart to include liveness checks for celery workers and flower
- Support step-level retries in the celery k8s executor
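A small sketch of using the built-in `dict` in a config schema; the solid itself is hypothetical:

```python
from dagster import solid

@solid(config_schema={"options": dict})  # dict here is equivalent to Permissive()
def print_options(context):
    # any key/value pairs are accepted under "options"
    context.log.info(str(context.solid_config["options"]))
```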
Bugfixes
- Improve error message shown when a RepositoryDefinition returns objects that are not one of the allowed definition types (Thanks @sd2k!)
- Show error message when `$DAGSTER_HOME` environment variable is not an absolute path (Thanks @AndersonReyes!)
- Update default value for `staging_prefix` in the `DatabricksPySparkStepLauncher` configuration to be an absolute path (Thanks @sd2k!)
- Improve error message shown when Databricks logs can't be retrieved (Thanks @sd2k!)
- Fix errors in documentation for `input_hydration_config` (Thanks @joeyfreund!)
0.8.4
Bugfix
- Reverted a change in 0.8.3 that caused an error during run launch in certain circumstances
- Updated partition graphs on schedule page to select most recent run
- Forced reload of partitions for partition sets to avoid serving stale data
New
- Added reload button to dagit to reload current repository
- Added option to wipe a single asset key by using `dagster asset wipe <asset_key>`
- Simplified schedule page, removing ticks table, adding tags for last tick attempt
- Better debugging tools for launch errors
0.8.3
Breaking Changes
- Previously, the `gcs_resource` returned a `GCSResource` wrapper which had a single `client` property that returned a `google.cloud.storage.client.Client`. Now, the `gcs_resource` returns the client directly.

  To update solids that use the `gcs_resource`, change `context.resources.gcs.client` to `context.resources.gcs`. A hedged sketch follows.
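A sketch of what an updated solid might look like after this change; the bucket and blob names are placeholders:

```python
from dagster import solid

@solid(required_resource_keys={"gcs"})
def upload_report(context):
    # context.resources.gcs is now the google.cloud.storage client itself
    bucket = context.resources.gcs.get_bucket("my-bucket")
    bucket.blob("report.csv").upload_from_string("col_a,col_b\n1,2\n")
```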
New
- Introduced a new Python API `reexecute_pipeline` to reexecute an existing pipeline run; a short sketch follows this list.
- Performance improvements in Pipeline Overview and other pages.
- Long metadata entries in the asset details view are now scrollable.
- Added a `project` field to the `gcs_resource` in `dagster_gcp`.
- Added new CLI command `dagster asset wipe` to remove all existing asset keys.
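A rough sketch of `reexecute_pipeline` against a trivial pipeline; argument names are assumptions and may vary slightly by version:

```python
from dagster import DagsterInstance, execute_pipeline, pipeline, reexecute_pipeline, solid

@solid
def say_hello(context):
    context.log.info("hello")

@pipeline
def hello_pipeline():
    say_hello()

# assumes DAGSTER_HOME is set so both runs land in the same instance
instance = DagsterInstance.get()
result = execute_pipeline(hello_pipeline, instance=instance)

# re-execute the finished run; the new run is tracked as a child of the original
new_result = reexecute_pipeline(
    hello_pipeline,
    parent_run_id=result.run_id,
    instance=instance,
)
```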
Bugfix
- Several Dagit bugfixes and performance improvements
- Fixes pipeline execution issue with custom run launchers that call `executeRunInProcess`.
- Updates `dagster schedule up` output to be repository location scoped
0.8.2
Bugfix
- Fixes issues with `dagster instance migrate`.
- Fixes bug in `launch_scheduled_execution` that would mask configuration errors.
- Fixes bug in dagit where schedule-related errors were not shown.
- Fixes JSON-serialization error in `dagster-k8s` when specifying per-step resources.
New
- Makes `label` an optional parameter for materializations with `asset_key` specified.
- Changes Assets page to have a typeahead selector and hierarchical views based on asset_key path.
- dagster-ssh
  - adds SFTP get and put functions to `SSHResource`, replacing sftp_solid.
Docs
- Various docs corrections
0.8.1
Bugfix
- Fixed a file descriptor leak that caused `OSError: [Errno 24] Too many open files` when enough temporary files were created.
- Fixed an issue where an empty config in the Playground would unexpectedly be marked as invalid YAML.
- Removed "config" deprecation warnings for dask and celery executors.
New
- Improved performance of the Assets page.
0.8.0 "In The Zone"
Major Changes
Please see the 080_MIGRATION.md migration guide for details on updating existing code to be compatible with 0.8.0.
- Workspace, host and user process separation, and repository definition: Dagit and other tools no longer load a single repository containing user definitions such as pipelines into the same process as the framework code. Instead, they load a "workspace" that can contain multiple repositories sourced from a variety of different external locations (e.g., Python modules and Python virtualenvs, with containers and source control repositories soon to come).

  The repositories in a workspace are loaded into their own "user" processes distinct from the "host" framework process. Dagit and other tools now communicate with user code over an IPC mechanism. This architectural change has a couple of advantages:

  - Dagit no longer needs to be restarted when there is an update to user code.
  - Users can use repositories to organize their pipelines, but still work on all of their repositories using a single running Dagit.
  - The Dagit process can now run in a separate Python environment from user code, so pipeline dependencies do not need to be installed into the Dagit environment.
  - Each repository can be sourced from a separate Python virtualenv, so teams can manage their dependencies (or even their own Python versions) separately.

  We have introduced a new file format, `workspace.yaml`, in order to support this new architecture. The workspace yaml encodes what repositories to load and their location, and supersedes the `repository.yaml` file and associated machinery.

  As a consequence, Dagster internals are now stricter about how pipelines are loaded. If you have written scripts or tests in which a pipeline is defined and then passed across a process boundary (e.g., using the `multiprocess_executor` or dagstermill), you may now need to wrap the pipeline in the `reconstructable` utility function for it to be reconstructed across the process boundary.

  In addition, rather than instantiate the `RepositoryDefinition` class directly, users should now prefer the `@repository` decorator. As part of this change, the `@scheduler` and `@repository_partitions` decorators have been removed, and their functionality subsumed under `@repository`. (A short sketch of `reconstructable` and `@repository` appears after this list.)
- Dagit organization: The Dagit interface has changed substantially and is now oriented around pipelines. Within the context of each pipeline in an environment, the previous "Pipelines" and "Solids" tabs have been collapsed into the "Definition" tab; a new "Overview" tab provides summary information about the pipeline, its schedules, its assets, and recent runs; the previous "Playground" tab has been moved within the context of an individual pipeline. Related runs (e.g., runs created by re-executing subsets of previous runs) are now grouped together in the Playground for easy reference. Dagit also now includes more advanced support for display of scheduled runs that may not have executed ("schedule ticks"), as well as longitudinal views over scheduled runs, and asset-oriented views of historical pipeline runs.

- Assets: Assets are named materializations that can be generated by your pipeline solids, which support specialized views in Dagit. For example, if we represent a database table with an asset key, we can now index all of the pipelines and pipeline runs that materialize that table, and view them in a single place. To use the asset system, you must enable an asset-aware storage such as Postgres.
- Run launchers: The distinction between "starting" and "launching" a run has been effaced. All pipeline runs instigated through Dagit now make use of the `RunLauncher` configured on the Dagster instance, if one is configured. Additionally, run launchers can now support termination of previously launched runs. If you have written your own run launcher, you may want to update it to support termination. Note also that as of 0.7.9, the semantics of `RunLauncher.launch_run` have changed; this method now takes the `run_id` of an existing run and should no longer attempt to create the run in the instance.

- Flexible reexecution: Pipeline re-execution from Dagit is now fully flexible. You may re-execute arbitrary subsets of a pipeline's execution steps, and the re-execution now appears in the interface as a child run of the original execution.

- Support for historical runs: Snapshots of pipelines and other Dagster objects are now persisted along with pipeline runs, so that historical runs can be loaded for review with the correct execution plans even when pipeline code has changed. This prepares the system to be able to diff pipeline runs and other objects against each other.

- Step launchers and expanded support for PySpark on EMR and Databricks: We've introduced a new `StepLauncher` abstraction that uses the resource system to allow individual execution steps to be run in separate processes (and thus on separate execution substrates). This has made extensive improvements to our PySpark support possible, including the option to execute individual PySpark steps on EMR using the `EmrPySparkStepLauncher` and on Databricks using the `DatabricksPySparkStepLauncher`. The `emr_pyspark` example demonstrates how to use a step launcher.
- Clearer names: What was previously known as the environment dictionary is now called the `run_config`, and the previous `environment_dict` argument to APIs such as `execute_pipeline` is now deprecated. We renamed this argument to focus attention on the configuration of the run being launched or executed, rather than on an ambiguous "environment". We've also renamed the `config` argument on definitions to `config_schema`, which should reduce ambiguity between the configuration schema and the value being passed in some particular case. We've also consolidated and improved documentation of the valid types for a config schema.

- Lakehouse: We're pleased to introduce Lakehouse, an experimental, alternative programming model for data applications, built on top of Dagster core. Lakehouse allows developers to define data applications in terms of data assets, such as database tables or ML models, rather than in terms of the computations that produce those assets. The `simple_lakehouse` example gives a taste of what it's like to program in Lakehouse. We'd love feedback on whether this model is helpful!

- Airflow ingest: We've expanded the tooling available to teams with existing Airflow installations that are interested in incrementally adopting Dagster. Previously, we provided only injection tools that allowed developers to write Dagster pipelines and then compile them into Airflow DAGs for execution. We've now added ingestion tools that allow teams to move to Dagster for execution without having to rewrite all of their legacy pipelines in Dagster. In this approach, Airflow DAGs are kept in their own container/environment, compiled into Dagster pipelines, and run via the Dagster orchestrator. See the `airflow_ingest` example for details!
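A minimal sketch of the `reconstructable` wrapper and the `@repository` decorator mentioned above. The pipeline, solid, and run_config keys shown are placeholders for the default mode and may vary with your setup:

```python
from dagster import (
    DagsterInstance,
    execute_pipeline,
    pipeline,
    reconstructable,
    repository,
    solid,
)

@solid
def do_something(context):
    context.log.info("working")

@pipeline
def my_pipeline():
    do_something()

# when a pipeline crosses a process boundary (e.g. the multiprocess executor),
# wrap it so it can be reconstructed in the other process; the pipeline must be
# defined at module scope, and DAGSTER_HOME must be set for DagsterInstance.get()
execute_pipeline(
    reconstructable(my_pipeline),
    run_config={"execution": {"multiprocess": {}}, "storage": {"filesystem": {}}},
    instance=DagsterInstance.get(),
)

# repositories are now defined with the @repository decorator rather than by
# instantiating RepositoryDefinition directly
@repository
def my_repository():
    return [my_pipeline]
```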
Breaking Changes
- dagster
  - The `@scheduler` and `@repository_partitions` decorators have been removed. Instances of `ScheduleDefinition` and `PartitionSetDefinition` belonging to a repository should be specified using the `@repository` decorator instead.
  - Support for the Dagster solid selection DSL, previously introduced in Dagit, is now uniform throughout the Python codebase, with the previous `solid_subset` arguments (`--solid-subset` in the CLI) being replaced by `solid_selection` (`--solid-selection`). In addition to the names of individual solids, this argument now supports selection queries like `*solid_name++` (i.e., `solid_name`, all of its ancestors, its immediate descendants, and their immediate descendants).
  - The built-in Dagster type `Path` has been removed.
  - `PartitionSetDefinition` names, including those defined by a `PartitionScheduleDefinition`, must now be unique within a single repository.
  - Asset keys are now sanitized for non-alphanumeric characters. All characters besides alphanumerics and `_` are treated as path delimiters. Asset keys can also be specified using `AssetKey`, which accepts a list of strings as an explicit path. If you are running 0.7.10 or later and using assets, you may need to migrate your historical event log data for asset keys from previous runs to be attributed correctly. This `event_log` data migration can be invoked as follows:

    ```python
    from dagster.core.storage.event_log.migration import migrate_event_log_data
    from dagster import DagsterInstance

    migrate_event_log_data(instance=DagsterInstance.get())
    ```

  - The interface of the `Scheduler` base class has changed substantially. If you've written a custom scheduler, please get in touch!
  - The partitioned schedule decorators now generate `PartitionSetDefinition` names using the schedule name, suffixed with `_partitions`.
  - The `repository` property on `ScheduleExecutionContext` is no longer available. If you were using this property to pass to `Scheduler` instance methods, this interface has changed significantly. Please see the `Scheduler` class documentation for details.
  - The CLI option `--celery-base-priority` is no longer available for the command `dagster pipeline backfill`. Use the tags option to specify the celery priority, e.g. `dagster pipeline backfill my_pipeline --tags '{ "dagster-celery/run_priority": 3 }'`
  - The `execute_partition_set` API has been removed.
  - The deprecated `is_optional` parameter to `Field` and `OutputDefinition` has been removed. Use `is_required` instead.
...