The Deephaven Core Roadmap 1H24

Pete Goddard edited this page Mar 11, 2024 · 5 revisions

Our roadmap for the first half of 2024 is centered around Deephaven’s core usage patterns as (1) a UI framework for live data (and batch data, too), (2) a versatile query engine for live workloads in Python and Java, and (3) a live data pipeline utility.

Annotations:

Mark Status
* work not yet started
🏃 work in progress
work completed
💪 stretch goal
💡 needs research
🟡 particularly important

Project organization

We intend to release new versions of the project at the end of each month. At any time, deliveries planned for the following two months can be found in the corresponding GitHub Milestones.

Themes

Work will fall into the following categories:

  • UI/UX framework capabilities. (To be delivered in this deephaven-core project, as well as web-client-ui, plugins, ipywidgets, and other complementary projects.)
  • Live ingestion.
  • Data lake interoperability.
  • Engine capabilities, Python interoperability.
  • Client APIs and the “Barrage” wire protocol (as found in this repo and github/barrage).
  • Gen-AI integrations.

UI/UX framework

  • 🟡 Provide Deephaven as a library to be integrated naturally into Python clients.

  • 🟡 🏃 MVP version of deephaven.ui, a complete framework for live dashboards and widgets.

    • 🏃 Integration of the React Spectrum library of interactive and adaptive UI experiences.
    • 🏃 A rich interface for client-side callbacks and interactivity.
    • 🏃 Programmatic layouts via Python scripts.

  • 🟡 🏃 Ease-of-deployment related to Deephaven’s plug-in infrastructure.

    • 🏃 Deephaven’s integration with Plotly Express (“Deephaven Express”)
      • Integrate Deephaven’s smart-downsampler into line plots (including real-time, ticking ones).
      • Comprehensive documentation of Deephaven Express, Plotly, Matplotlib, seaborn, and Java plotting.
  • 💪 An interactive SQL UX (similar to the Python and Java/Groovy exploratory and development IDEs provided today).

  • 💡 🟡 UI/UX for writing queries using natural language.

    • Slick integration of LLM-to-SQL utilities, leveraging Deephaven’s declarative Query Syntax Tree (QST) API and Apache Calcite.
    • GUI experiences for typing English and inheriting tables, plots, widgets, and visualizations that update in real time.

Live ingestion

  • 🟡 Packaging of powerful, general-purpose abstractions for ingesting live data from a variety of sources.

  • 🏃 Live mapping of 1-to-N and streaming data with nested formats into live Deephaven tables.

  • 🏃 Improved Kafka support: Payload coverage and performance.

  • 🏃 Improved JSON capabilities:

    • Support _metadata and _common_metadata files in Parquet (this supports adding partitions to Parquet).
    • 💡 Integrate with Iceberg’s dynamic capabilities.
  • Continue investing in table operations’ engineering:

    • 🏃 Improve performance of static operations:
      • 🏃 Aggregations.
      • 🏃 Multi-column sort.
      • select() and update().
      • naturalJoin() and leftJoin().
    • 🏃 Improve performance of update operations:
      • 🏃 Aggregations.
      • 🏃 Multi-column sort.
      • select() and update().
    • Add an updateBy() operation – static and incremental.
    • Support outerJoin() – static and incremental.
    • Multi-threading:
      • Provide multi-threaded where().
      • 🏃 Provide multi-threaded update().
      • 💡 Research fetching and initializing the Python globals in a GIL-safe way, so that Python users can take advantage of concurrent operations.

  • 🟡 Support smart-keys to enable tableMaps() and popular aggregated structures:

    • Enhance Query Syntax Tree (“QST”) to support smart-key structures.
    • Plumb through tree tables.
    • Plumb through aggregated tables and roll-ups.
    • Plumb through pivot tables.

  • 💡 Research, as a POC, replacing the Java array implementation of chunks with Apache Arrow ValueVectors.


  • 🏃 Complete support for recent Java versions.

  • Improve support for data types as it relates to popular open formats.

    • Support a decimal(precision, scale) "proper" type (#1734).
    • Visit “Date”, “Time”, and “Duration” types to ensure maximum compatibility with open formats; contemplate time as nanos, improve local time.
    • Support “Half-Float” 2-byte floating point type as defined by the IEEE 754 extension.
    • Support “unsigned x {short, int, long}” types.
    • 💡 Research the implications of supporting nulls as a separate Index (rowset) (#1733).

  • DevRel: Provide examples related to atomicity and transactions; i.e., reconstruct consistent states for synchronizing tables when correlated sequence numbers do not exist.


  • Parameterized (on-demand) queries.
    • Support URI-format that is higher level than a Barrage ticket in support of parameterized queries.
    • Extend app-mode to support parameterized queries and remote URI-calls.

  • Support popular authentication technologies.

  • Support access controls in the engine once Barrage supports it in the API.

  • Improve functions for easily analyzing query performance; e.g., performanceOverview().

  • Contemplate efforts to align Deephaven software with core technology innovations:

    • Explore alternative bytecode-generation strategies.
    • Investigate a GraalVM compilation option.


Python, AI, ML, and Data Science Capabilities

  • 🏃 Improve server-side Python table-interface to be more native and extendable.

    • Release already-developed improvements.
    • Refactor away the TableTools module.

  • 🟡 🏃 Provide dockerized versions of Deephaven packaged with Python AI libraries.


  • Improve Python formula interface

    • Address mismatch between Java and Python in terms of syntax and errors.
    • Amortize and chunk to improve performance of formulas.
    • Make Java collections iterable in Python.

  • Other Python workflow items

    • Support Python debuggers.
    • Allow users to save and load tables with pickled Python.
    • Enable query scope variables to work inside Python packages.

  • 🏃 Improve deephaven.learn to make AI real-time easier.

    • Restore Python tooling related to deephaven-learn.
  • 🏃 Assign and manage Tensors more elegantly.

  • Tidy and extend the NumPy-integration.

    • Visit mapping of NumPy to Java types.
    • Provide NumPy array support in ungroup().

  • Increase coverage of AI library integration tests.

UI/UX and JS-API Features

  • 🏃 Deliver UI-driven table conditional formatting.

  • 🏃 Support and evolve a Plug-In framework.

    • Enable users to deliver enhanced functionality server-side.
      • Render matplotlib as a first example.
      • Render seaborn.
      • Contemplate extensions for ggplot, etc.; potentially via new contributors.
    • Enable users to extend UI capabilities, outright or as coupled with server-side Plug-Ins.
      • Support App-level Plug-Ins.
      • Support Dashboard-level Plug-Ins.
      • Support Table-level Plug-Ins.

  • 🟡 Deliver widgets

    • Write software to support widgets that are (i) natively Barrage tables, (ii) deployable independent of server-side code, (iii) plumbed for easy development by the community, and (iv) able to be nicely embedded in third party applications.
    • Release the Deephaven table and plot widgets as first examples.
    • Add to the suite of available widgets.


  • 🟡 🏃 Implement and extend Playwright testing suite.

  • 🟡 Render Pandas with full table-interactions in web UI.

  • Research more native plotting and visualization integrations to avail users of even greater flexibility:

    • Plot.ly
    • D3

  • 🟡 🏃 Support plotBy() (i.e., UI-driven keyed filtering for plots).

  • 🟡 Research improved dashboarding experiences:

    • Consider a Deephaven builder API to provide declarative development of widgets, layouts, input and parameterized-query experiences.
    • 💡 Potentially integrate with Dash or Streamlit.

  • Provide API support and UI integration for authentication.

  • Support rendering of scripted aggregation views:

    • Deliver tree tables.
    • Deliver roll-ups.
    • Design and implement rendering programmatically-created pivot tables.
    • Support creation of pivot tables from the web UI.
  • 💡 Research dashboard and website builder workflows (a la R-Shiny).


Client APIs and the DH-gRPC-API

  • Invest in client workflows generally:

    • Improve the set of control features that clients can impose on a DH server.
    • Support debugger workflows, breakpoints.
    • Ensure auto-complete is smooth for client libraries in remote IDEs.

  • Java

    • Improve packaging and deployment.
    • Improve the experience of delivering the UpdateGraphProcessor model (the updating Barrage client) into Java clients.

  • C++

    • 🏃 Support getting updating data from a DH server; doExchange().
    • 🏃 Support delivering updating data to a DH server; doPush().
    • Provide R client integration, converting snapshotted tables to and from R data frames, using C++.

  • Python

    • Wrap C++-client for doPush() and doExchange().
    • Continue to evolve the client experience to emulate the Python server experience.

  • 💪 Consider other language APIs.
    • Prototype of Julia client API.
    • Prototype of Rust client API.
    • Prototype of Go client API.
    • Prototype of server-side R.


Data Sources and Sinks

  • 🏃 Accelerate CSV and other file reading and parsing.

  • 🟡 Expand use cases for URI-driven data access and ingestion using DH’s resolve() library.

  • Integrate with SQL warehouses:

    • 🟡 Read ODBC/JDBC support.
    • CDC ingestion.
    • Write ODBC/JDBC support.
    • DevRel: Example Debezium integration.

  • Improve the Kafka integration

    • DevRel: Document Deephaven worker-to-worker with Kafka.

    • Ingest: Windowing on append-ingestion (#1054).

    • Improve Avro integration:

      • Support nested Avro data structures.
      • Provide direct-to-chunk ingestion for Avro (#1040).
      • Support Arrays in Avro schemas (#994).
    • Other consume-Kafka improvements:

      • Key/value marker interface (#1037).
      • In-memory SymbolTableSource to improve enum and reduce duplication overhead (#1236).
      • Enforcement of stream assumptions and attributes on ingest (#1322).
      • Improved error handling (#1025).
    • Support Kafka-exhaust of POJOs.

  • Refine the Parquet integration

    • Support nested definitions and complex types.
    • Custom-form indexes.
    • Reading: Support predicate pushdown related to (i) metadata per partition; and (ii) indexes and groups.
    • Writing: Write multiple partitions with proper tooling (#958).
    • Writing: Beef up encoding statistics (#949).

  • Arrow and Flight

    • Static data: doGet() and doPut() supported from Deephaven IDE.
    • Dynamic data: doPush() and doExchange() supported from Deephaven IDE.
    • 💡 Support on-disk, persisted Arrow.

  • 💡 Deliver in-line, real-time persistence.

    • DevRel: Document snapshot writing to Kafka and Parquet.
    • Research append-only persistence:
      • Consider designs for one-time persistence of input tables.
      • Consider designs for chunked persistence, using Parquet partitions or Arrow chunks.
      • Consider key-value persistence of append-only tables.
    • Support replay from checkpoints.
    • Research, design, and potentially implement proper real-time persistence, using open formats, with spills and robust recovery in mind.
    • Consider CDN-like models that leverage key-value engines as a backing store, and allow passing huge data, and updates thereof, to the edge.

  • Other data sources and sinks:

    • Streaming:
      • AWS pub-sub.
      • Google pub-sub.
      • Solace.
      • Confluent queues.

    • Batch and hybrids:
      • 🟡 Orc.
      • Iceberg.
      • Delta lakes (integration with commercial offerings).
      • Apache Kudu.
      • Spark dataframes.

    • Platforms:
      • Confluent Platform.
      • Databricks Lakehouse.
      • AWS.
      • GCP.
      • Azure.
      • Dremio Lakehouse.


“Barrage” Wire Protocol

  • 🏃 DevRel: Document examples of Deephaven worker-to-worker use cases and example implementations, using Barrage.

  • 🟡 🏃 Support Deephaven-extensions via Barrage.

  • Make stream-table semantics available directly via Barrage.

  • Extend (non-Java) data types supported by Barrage.

  • Make a publicly-available performance-measurement suite for Barrage.

  • 🟡 Contribute to Arrow Flight:

    • 🏃 Deliver the schema-evolution code to Arrow Flight.
    • Deliver column-backed implementation to Arrow Flight.

  • 🟡 Provide flow control for the Barrage API.

  • Worker-to-worker table and update communications via Barrage.

  • Govern access control to raw and derived data via Barrage:

    • At the source level
    • At the table level
    • At the row level

  • Consider having Barrage replay support remote replication, allowing Deephaven to serve as a caching layer for a key-value store.