Improving our understanding of our users with kedro-telemetry #510

Open
yetudada opened this issue Jan 8, 2024 · 10 comments

@yetudada
Contributor

yetudada commented Jan 8, 2024

Introduction

Analytics play a critical role in product management. As Marty Cagan highlights, analytics are essential for understanding user behaviour, measuring product progress, validating product ideas, informing decisions, and inspiring product work. In the context of Kedro, we have telemetry tools that help us qualitatively understand our users, namely:

  • kedro-telemetry, which gives insight into the feature usage and user adoption of the CLI in Kedro Framework and the CLI and UI of Kedro-Viz
  • Standard Heap Analytics configuration, which gives insights on our documentation and website

kedro-telemetry is the focus of this GitHub issue.

What principles should we adopt to govern the improvements of kedro-telemetry?

With all of these potential changes to kedro-telemetry, I thought it would be helpful to ground our work in certain principles that affect our users and our team. Therefore, I propose we adopt the following principles when improving kedro-telemetry:

  1. Trustworthy: Ensure all insights from kedro-telemetry are reliable and accurate. Team members should have full confidence in the data they're using to make decisions.
  2. Accessible: Make insights easy to obtain and understand for all team members, facilitating informed product development.
  3. User-aware: Clearly communicate to users about kedro-telemetry, including its activation process, ensuring informed consent and understanding.
  4. Transparent: Provide crystal clear information about what data kedro-telemetry collects and its scope, in an easily digestible format.
  5. Actionable: Design kedro-telemetry to provide insights that are directly applicable to product improvement strategies.
  6. Minimal: Only collect the data that is needed, and no more.
  7. Privacy-conscious: Ensure data collection complies with privacy laws and ethical standards, respecting user privacy at every step.
  8. Collaborative: Facilitate sharing and discussion of insights among team members to foster a collaborative approach to product development.

How was kedro-telemetry designed?

We have detailed some of the ways that kedro-telemetry was designed in a separate GitHub issue (#506).

What are the current challenges with its implementation?

There is room for improvement in the current implementation of kedro-telemetry. I've tried to capture all known issues here, but let me know if I'm missing any and I'll update the details.

| Theme | Problem | Priority | Linked GitHub Issue | Status |
| --- | --- | --- | --- | --- |
| Access to statistics | Make sure everyone on the Kedro Team can access Heap Analytics and/or Snowflake and knows how to find and understand the data | P1 | | Done |
| Data collection | Improve masking of CLI commands | P2 | #371 | Done |
| Data collection | Develop a methodology to track Kedro-Viz users who arrive through shareable URLs and thus do not activate Kedro-Viz using a CLI | P2 | | |
| Developer experience | Make sure kedro-telemetry does not interrupt the CI/CD workflow; right now users have to check the documentation when kedro-telemetry interrupts their workflow | P2 | kedro-org/kedro#1640 & #484 | Done |
| Developer experience | Determine whether telemetry should be active by default, with an opt-out workflow instead. This must be investigated with the LF AI & Data legal team. | P1 | | Done |
| Developer experience | Determine whether we should have kedro-telemetry as a mandatory dependency, meaning that users will have kedro-telemetry packaged in Kedro and it will no longer be part of the requirements of the starters | P2 | | Done |
| Developer experience | Fix how kedro-telemetry works with Databricks | P1 | #484 | Done |
| Documentation | Update our README.md on what data we're collecting about our users | P1 | #508 | Done |
| User identification | Develop a robust method to distinguish real users from CI/CD | P1 | #483 | Done |
| User identification | Figure out a way to have unique user identification, even if Docker is being used | P1 | #333 | Done |
| Documentation | Decide on the best place to publish information about kedro-telemetry to our users | P3 | #509 | Done |
| Project identification | Choose a single ID for projects (we collect both package_name and project_name), and investigate why project_name is a blank field in our data | P3 | #507 | Done |
| User identification | Figure out a new way to identify internal users of Kedro | P2 | | Done |
| Data collection | Develop a common understanding of why the number of kedro viz CLI command runs differs from the number of users of Kedro-Viz according to Heap Analytics | P3 | | |
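
One of the items above asks for a robust way to distinguish real users from CI/CD. A minimal sketch of one common heuristic, checking environment variables that CI systems are known to set, could look like the following. The variable list is illustrative and incomplete, and `is_ci_environment` is a hypothetical helper, not the method kedro-telemetry actually uses.

```python
import os

# Assumption: an illustrative, non-exhaustive list of variables set by common CI systems.
CI_ENV_VARS = (
    "CI",              # set by many CI providers
    "GITHUB_ACTIONS",  # GitHub Actions
    "GITLAB_CI",       # GitLab CI
    "JENKINS_URL",     # Jenkins
    "TRAVIS",          # Travis CI
    "CIRCLECI",        # CircleCI
    "BUILDKITE",       # Buildkite
    "TF_BUILD",        # Azure Pipelines
)


def is_ci_environment(environ=None):
    """Return True if any known CI variable is set to a non-empty, non-false value."""
    env = os.environ if environ is None else environ
    return any(
        env.get(name, "").strip().lower() not in ("", "0", "false")
        for name in CI_ENV_VARS
    )
```

A heuristic like this misses self-hosted or unusual CI setups, so it would only ever be one signal among several.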

What else could we learn from our users?

I'll always be forward-looking on how we could continue to learn more about our users and even improve our existing metrics. I'd like to use a key to detail the status of the metric.

Status of metric:

  • 🚀 - This insight exists and is trustworthy
  • 🧐 - This insight exists, has a defined methodology but could be improved
  • 📬 - The data for this insight exists but the insight does not exist yet
  • 🛠️ - The data for this insight does not exist and this insight does not exist

| Type of insight | Priority | Category | What does this insight allow us to do? | How is the data collected? | What limitations exist with the current implementation? | What are alternative data sources? | Status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Number of users | P1 | Adoption | Tracking the number of users helps gauge Kedro's product penetration and user base size | kedro-telemetry collects and hashes the computer's username upon user consent for user ID generation and counting. | This insight requires kedro-telemetry installation, CLI usage, user consent, and assumes unique computer usernames, though this may not hold in cases like Docker | Depends on the use case: for a total view of users, we could use MAUs from our documentation; to guess whether Kedro is being used more as a library and less as a framework, we could compare PyPI downloads with kedro-telemetry user data, i.e. if kedro-telemetry user data declines but PyPI downloads increase, that might be a sign. | 🧐 |
| Number of projects | P1 | Usage | Enables us to track feature adoption, total Kedro project count, and average team size per project | kedro-telemetry hashes package_name and project_name for project ID generation and counting. | This insight depends on kedro-telemetry installation, Kedro CLI usage, and user consent for data collection. | N/A | 🧐 |
| Number of projects in production | P2 | Value proposition | Indicates if projects are reaching production, approximated by the usage in CI/CD, aligning with Kedro's value proposition | N/A | N/A | N/A | 🛠️ |
| Code quality of projects | P2 | Value proposition | Kedro's usage should correlate with improved code quality in projects | N/A | N/A | N/A | 🛠️ |
| Average team size | P3 | Value proposition | Kedro's effectiveness should reflect in larger team sizes on projects | Leverages insights from total number of users and number of projects. | Bound by the same limitations as the insights for number of users and number of projects. The insight may not always be accurate because certain users for a project can opt out of telemetry. | N/A | 🧐 |
| Ratio of Spaceflights projects to real Kedro projects | P2 | Product development | Identifies if users stop at generating example projects without further Kedro engagement | N/A | N/A | We could approximate this data by looking at the project size data, i.e. we'd calculate the size of a Spaceflights project and see how many times a project of this size appears in our data. | 📬 |
| % of custom datasets used | P2 | Product development | Reveals the prevalence of custom datasets, indicating unsupported data types. | N/A | N/A | N/A | 🛠️ |
| Types of datasets used from kedro-datasets | P1 | Usage | Identifies priority datasets for fixes and feature development and the balance of supported versus custom datasets | N/A | N/A | N/A | 🛠️ |
| Types of cloud platforms used | P1 | Product development | Informs us about which cloud platforms to prioritize based on usage data | N/A | N/A | We could piggyback off the "types of datasets" data collection and collect the fsspec registry data. | 🛠️ |
| Error tracking | P1 | Usage | Helps identify user issues with Kedro and prioritize errors for clearer resolution | N/A | N/A | N/A | 🛠️ |
| Number of datasets used | P3 | Usage | Gauges Kedro project sizes, relevant to our claim of aiding larger data science projects | When kedro-telemetry is active, a hook counts this figure. | This insight depends on kedro-telemetry installation, Kedro CLI usage, and user consent for data collection. | N/A | 🚀 |
| Number of pipelines created | P3 | Usage | Gauges Kedro project sizes, relevant to our claim of aiding larger data science projects | When kedro-telemetry is active, a hook counts this figure. | This insight depends on kedro-telemetry installation, Kedro CLI usage, and user consent for data collection. | N/A | 🚀 |
| Number of nodes | P3 | Usage | Gauges Kedro project sizes, relevant to our claim of aiding larger data science projects | When kedro-telemetry is active, a hook counts this figure. | This insight depends on kedro-telemetry installation, Kedro CLI usage, and user consent for data collection. | N/A | 🚀 |
| Telemetry version | P1 | Product development | Identifies gaps in telemetry metrics due to data collection starting in newer kedro-telemetry versions | When kedro-telemetry is active, a hook reads this figure from their project. | This insight depends on kedro-telemetry installation, Kedro CLI usage, and user consent for data collection. | Could look at PyPI download data | 🚀 |
| Python version | P1 | Product development | Aids in deciding which Python versions to sunset, in conjunction with Kedro's download data | When kedro-telemetry is active, a hook reads this figure from their project. | This insight depends on kedro-telemetry installation, Kedro CLI usage, and user consent for data collection. | Could look at PyPI download data | 🚀 |
| Project version | P1 | Product development | Reveals telemetry data gaps and the popularity of specific Kedro versions | When kedro-telemetry is active, a hook reads this figure from their project. | This insight depends on kedro-telemetry installation, Kedro CLI usage, and user consent for data collection. | Could look at PyPI download data | 🚀 |
| Commands run from the CLI | P1 | Product development | Shows usage patterns of CLI features in Kedro | When kedro-telemetry is active, a hook counts this figure. | This insight depends on kedro-telemetry installation, Kedro CLI usage, and user consent for data collection. | N/A | 🧐 |
| Ratio of library or framework + project template users | P2 | Product development | Assesses Kedro's library versus framework adoption | N/A | N/A | N/A | 🛠️ |
| Usage frequency | P1 | Usage | Determines if users are repeat or one-time Kedro users | N/A | N/A | N/A | 📬 |
| Dependency analysis | P1 | Product development | Helps identify which integrations to build with Kedro | N/A | N/A | N/A | 🛠️ |
| Duration of a project | P2 | Value proposition | Evaluates the longevity of Kedro projects in relation to production readiness | N/A | N/A | N/A | 📬 |
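
To make the "Number of users" and "Number of projects" rows more concrete, here is a minimal sketch of hashing a username and a package/project name pair into opaque IDs. The choice of SHA-512, the `:` separator, and the helper names `hash_identifier`, `user_id` and `project_id` are assumptions for illustration; the table only says that these values are hashed.

```python
import getpass
import hashlib


def hash_identifier(value: str) -> str:
    """Return a hex digest so the raw value is never transmitted."""
    # Assumption: SHA-512 is used here purely for illustration.
    return hashlib.sha512(value.encode("utf-8")).hexdigest()


def user_id(username=None):
    """Derive a user ID from the computer's username."""
    name = getpass.getuser() if username is None else username
    return hash_identifier(name)


def project_id(package_name, project_name=""):
    """Derive a project ID from package_name and project_name."""
    # Assumption: the two fields are joined with ':' before hashing.
    return hash_identifier(f"{package_name}:{project_name}")
```

The sketch also shows the limitation noted for Docker: every container running as `root` yields the same `user_id("root")`, so distinct people collapse into one counted user.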

What are other projects that we can be inspired by?

I'm just going to list them and not detail what they're about and what we could learn:

@astrojuanlu
Member

astrojuanlu commented Jan 10, 2024

Status of personal data collection and consent in adjacent products presented:

| Project name | Tracks personal data | Uses opt-in consent | Opt-out mechanism | Telemetry collection mechanism is an optional dependency | Tracks individual users | Documentation | Comments |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Prefect | No ❌ | No ❌ | Environment variable | No 👎 | No 👎 they have a session_id instead | https://docs.prefect.io/latest/api-ref/prefect/settings/?h=prefect_server_analytics_enabled#prefect.settings.PREFECT_SERVER_ANALYTICS_ENABLED | PREFECT_SERVER_ANALYTICS_ENABLED is True by default |
| Great Expectations | No ❌ | No ❌ | Project settings + Global settings + Environment variable | No 👎 | Yes 👍 they write an oss_id to ~/.great_expectations/great_expectations.conf | https://docs.greatexpectations.io/docs/reference/learn/usage_statistics/ | Full schemas in https://github.com/great-expectations/great_expectations/blob/develop/great_expectations/core/usage_statistics/schemas.py |
| DVC | No ❌ | No ❌ | Project settings + Global settings + Environment variable | No 👎 | Yes 👍 they store a user_id in ~/.config/iterative/telemetry using pypi/iterative-telemetry | https://dvc.org/doc/user-guide/analytics#anonymized-usage-analytics | "This does not allow us to track individual users", uses https://pypi.org/project/iterative-telemetry/ |
| Evidently | No ❌ | No ❌ | Environment variable | No 👎 | Yes 👍 they store a user_id in ~/.config/evidentlyai/telemetry using pypi/iterative-telemetry | https://docs.evidentlyai.com/support/telemetry | "We only collect anonymous usage data. We DO NOT collect personal data". Uses https://pypi.org/project/iterative-telemetry/ |
| Homebrew | No ❌ | No ❌ | Global settings + Environment variable | No 👎 | No 👎 they used to do it by storing a UUID in a user-wide git-like config file, but removed user tracking 1 year ago | https://docs.brew.sh/Analytics | All stats are public https://formulae.brew.sh/analytics/ |
| LangChain | Unclear ❓ | No ❓ | Undocumented | Not applicable 🚫 | ? | No docs, only mention in https://blog.langchain.dev/langchain-state-of-ai-2023/ | LangSmith is a commercial platform, not an open source component |
| Reflex | No ❌ | No ❌ | Project settings | No 👎 | Yes 👍 they generate an installation_id in ~/.local/share/reflex | https://reflex.dev/docs/getting-started/configuration/#anonymous-usage-statistics | |
| Streamlit (OSS) | No ❌ | No ❌ | Project settings | No 👎 | Unclear ❓ They use front-end (rather than back-end) analytics powered by the Segment SDK (Analytics.js) | https://docs.streamlit.io/library/advanced-features/configuration#telemetry | The Privacy Notice covers both the open source library ("the Software") and Streamlit Cloud ("the Service"), and the latter does collect personal data https://streamlit.io/privacy-policy |
| dbt | No ❌ | No ❌ | Project settings + Environment variable | No 👎 | Unclear ❓ they have the concept of active_user, but looks like the open source code is not setting it | https://docs.getdbt.com/reference/global-configs/usage-stats | |
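
Several of the projects above combine project settings, global settings and an environment variable as opt-out mechanisms. A rough sketch of how such layers might be resolved in an opt-out model is shown below. The precedence order (environment variable, then project, then global), the variable name `EXAMPLE_DISABLE_TELEMETRY` and the function `telemetry_enabled` are all assumptions for illustration; none of them is taken from the projects listed.

```python
import os
from typing import Mapping, Optional

# Hypothetical variable name, used only for this example.
OPT_OUT_ENV_VAR = "EXAMPLE_DISABLE_TELEMETRY"


def telemetry_enabled(
    project_setting: Optional[bool] = None,
    global_setting: Optional[bool] = None,
    environ: Optional[Mapping[str, str]] = None,
    default: bool = True,
) -> bool:
    """Resolve consent from the most specific source that expresses a choice."""
    env = os.environ if environ is None else environ
    # Assumed precedence: an explicit environment opt-out always wins.
    if env.get(OPT_OUT_ENV_VAR, "").strip().lower() in ("1", "true", "yes"):
        return False
    # Then the project-level setting, then the user-wide one.
    for setting in (project_setting, global_setting):
        if setting is not None:
            return setting
    # Opt-out model: enabled unless someone said otherwise.
    return default
```

Passing `default=False` would turn the same resolution logic into an opt-in model.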

@astrojuanlu
Member

astrojuanlu commented Jan 12, 2024

Added Streamlit OSS (also does not collect personal data) thanks @Joseph-Perkins!

@astrojuanlu
Member

Added dbt

@astrojuanlu
Member

Pending: Add another column that shows whether the systems track individual users or not

@astrojuanlu astrojuanlu moved this to In Progress in Kedro Framework Feb 16, 2024
@astrojuanlu astrojuanlu self-assigned this Feb 16, 2024
@astrojuanlu
Member

Done 👍🏽

@merelcht merelcht changed the title Improving our understanding our users with kedro-telemetry Improving our understanding of our users with kedro-telemetry Feb 22, 2024
@merelcht merelcht added this to Roadmap Mar 28, 2024
@merelcht merelcht moved this to Current in Roadmap Mar 28, 2024
@astrojuanlu
Member

There's a couple of things in this issue. On one hand, we compiled a list of similar libraries to have references of how other projects do telemetry, and we also asked for legal advice. That is already done in #510 (comment).

On the other hand, there's the list of use cases @yetudada created in #510 (comment). Before getting to those we want to simplify our data collection process #375, for which we want to address #333 (done) and #507 (in progress).

For now this issue is blocked; for clarity, I'm removing it from the current sprint and focusing on #507.

Regardless, it's a good moment to make a release of kedro-telemetry cc @merelcht

@astrojuanlu astrojuanlu moved this from In Progress to To Do in Kedro Framework Apr 29, 2024
@merelcht merelcht removed the status in Kedro Framework Jun 4, 2024
@astrojuanlu
Member

Updated the first table of #510 (comment) with the current status, only 2 items remaining.

@merelcht
Member

Anything left to do here? @astrojuanlu

@astrojuanlu
Member

The "What are the current challenges with its implementation?" section still contains a couple of minor items, and "What else could we learn from our users?" also contains some valid points. I will have a look at this before EOY and give a summary of what the next steps should be, if any.

@astrojuanlu
Member

Today I learned that Daft collects telemetry on every function call: https://github.com/Eventual-Inc/Daft/blob/fd662c1/docs/source/faq/telemetry.rst

What data do we collect?

[...]
2. On calls of public methods on the DataFrame object, we track metadata about the execution: the name of the method, the walltime for execution and the class of error raised (if any). Function parameters and stacktraces are not logged, ensuring that user data remains private.

It's achieved by decorating every public method and function:

https://github.com/Eventual-Inc/Daft/blob/fd662c1a95c1697d19a321447f1a72da21961598/daft/dataframe/dataframe.py#L259-L262

And then it buffers the events, by default in groups of 100:

https://github.com/Eventual-Inc/Daft/blob/fd662c1a95c1697d19a321447f1a72da21961598/daft/analytics.py#L98-L113
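
For readers who don't want to follow the links, the general pattern can be sketched roughly as follows. This is not Daft's code: `EventBuffer`, `tracked` and the event field names are hypothetical, and only the ideas of decorating callables, recording name, walltime and error class, and flushing in groups of 100 come from the description above.

```python
import functools
import time


class EventBuffer:
    """Collects events and hands them to a sink in batches."""

    def __init__(self, flush_size=100, sink=None):
        self.flush_size = flush_size
        # The sink stands in for whatever would send a batch over the network.
        self.sink = sink if sink is not None else (lambda batch: None)
        self.events = []

    def record(self, event):
        self.events.append(event)
        if len(self.events) >= self.flush_size:
            self.flush()

    def flush(self):
        if self.events:
            batch, self.events = self.events, []
            self.sink(batch)


def tracked(buffer):
    """Decorator that records method name, walltime and error class only."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            error = None
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                # Only the class name is kept; arguments and stacktraces are not.
                error = type(exc).__name__
                raise
            finally:
                buffer.record(
                    {
                        "method": func.__name__,
                        "walltime_s": time.perf_counter() - start,
                        "error": error,
                    }
                )

        return wrapper

    return decorator
```

Because the wrapper never inspects `args` or `kwargs`, user data stays out of the events, which matches the privacy claim in the quoted documentation.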


3 participants