Improve documentation for pyspark setup ('kedro run' cant resolve path on starter project with tool pyspark enabled) #4366

Open
bf-malefiz opened this issue Dec 3, 2024 · 3 comments
Labels
Component: Documentation 📄 Issue/PR for markdown and API documentation

Comments

@bf-malefiz

Description

I'm initializing a new project with all tools enabled and an example pipeline. After installing the requirements, kedro run fails with:
The system cannot find the path specified.

Context

Just trying to get the example running. The spaceflights starter is working but can't be initialized with --tools=all

kedro new --name=basic --starter=spaceflights-pandas

Steps to Reproduce

  1. conda install -c conda-forge kedro
  2. kedro new --name=basic --tools=all --example=yes
  3. cd ./basic
  4. pip install -r requirements.txt
  5. kedro run

Expected Result

INFO Pipeline execution completed successfully.

Actual Result

[12/03/24 01:27:05] INFO Using 'conf\logging.yml' as logging configuration. You can change this by setting the KEDRO_LOGGING_CONFIG environment variable accordingly. init.py:270
[12/03/24 01:27:06] INFO Kedro project space session.py:329
The system cannot find the path specified.

The system cannot find the path specified.

Your Environment

Win11, conda-env

  • Kedro version used (pip show kedro or kedro -V): Version: 0.19.10
  • Python version used (python -V): Python 3.11.10
  • Operating system and version: Windows 11 Pro 24H2
@merelcht merelcht added the Community Issue/PR opened by the open-source community label Dec 3, 2024
@bf-malefiz
Author

Now I tested all tools separately and tool 6 (pyspark) is causing this issue. I didn't have spark installed so I checked the requirements.txt and pyspark wasn't present. Unfortunately adding and installing it with pip didn't solve the issue. I also added the env-var PYSPARK_HADOOP_VERSION=3 for a quick check but it didn't resolve the issue either.
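
To make the two checks described above repeatable, here is a minimal sketch that reports whether pyspark is importable in the active environment and what PYSPARK_HADOOP_VERSION is set to. The function name `pyspark_status` is hypothetical and not part of Kedro or PySpark. As far as I know, PYSPARK_HADOOP_VERSION is read when pip installs pyspark rather than at run time, so setting it afterwards may have no effect.

```python
import importlib.util
import os


def pyspark_status(env=None):
    """Report whether pyspark is importable and the value of PYSPARK_HADOOP_VERSION.

    `env` defaults to the real process environment; pass a dict to test.
    """
    env = os.environ if env is None else env
    return {
        # find_spec returns None when the top-level package is not installed.
        "pyspark_importable": importlib.util.find_spec("pyspark") is not None,
        "PYSPARK_HADOOP_VERSION": env.get("PYSPARK_HADOOP_VERSION"),
    }


if __name__ == "__main__":
    print(pyspark_status())
```

Running this inside the same conda env that runs kedro shows whether the pip install actually landed there.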

@bf-malefiz bf-malefiz changed the title from "Example pipeline from kedro new cant resolve Path on Win11" to "'kedro run' cant resolve path on starter project with tool pyspark enabled" Dec 3, 2024
@bf-malefiz
Author

I suppose it's less of a bug than poor documentation, related to this issue/comment:

kedro-starters/issues/237#
kedro-starters/pull/236# (comment)

My Hadoop installation is pretty old and links might be broken. I won't be able to test a fresh installation though; if you aren't able to reproduce this, it is most likely my machine. Still, the documentation and the error message (which path can't be found?) could be improved. Does Hadoop get installed when initializing a Kedro project? Why isn't it mentioned that it's necessary for the pyspark tool?
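
Since the error does not say which path is missing, one way to narrow it down is to check the environment variables that Spark and Hadoop setups commonly rely on. The sketch below is only a guess at the cause: the variable list (JAVA_HOME, HADOOP_HOME, SPARK_HOME) and the helper `broken_path_vars` are assumptions, not something Kedro or PySpark provides.

```python
import os
from pathlib import Path

# Assumption: these are the variables most likely to hold a stale path.
CANDIDATE_VARS = ("JAVA_HOME", "HADOOP_HOME", "SPARK_HOME")


def broken_path_vars(env=None, names=CANDIDATE_VARS):
    """Return {name: value} for variables that are set but point to a missing directory."""
    env = os.environ if env is None else env
    broken = {}
    for name in names:
        value = env.get(name)
        # Unset or empty variables are skipped; only stale paths are reported.
        if value and not Path(value).is_dir():
            broken[name] = value
    return broken


if __name__ == "__main__":
    for name, value in broken_path_vars().items():
        print(f"{name} points to a missing directory: {value}")
```

An empty result does not rule out a path problem elsewhere, but a stale entry from an old installation would show up here.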

@SajidAlamQB
Contributor

Hi @bf-malefiz thanks for reporting this.

I wasn't able to recreate your issue. Ensure you're using Java 8 or Java 11. I noticed issues with Java 21, which is not officially supported by Apache Spark or PySpark. You can check your current Java version with: java -version.
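
The same check can be scripted. This is a minimal sketch, assuming `java` is on PATH; the helper names `parse_java_major` and `installed_java_major` are hypothetical. It handles both the legacy "1.8" numbering and the newer "11", "21" style.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first, second = match.group(1), match.group(2)
    # Legacy scheme: "1.8.0_381" means Java 8.
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def installed_java_major():
    """Run `java -version` if java is on PATH; return the major version or None."""
    java = shutil.which("java")
    if java is None:
        return None
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    # Common JDKs print the version banner to stderr, so check both streams.
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    print("Detected Java major version:", installed_java_major())
```

A result of None means no `java` executable was found on PATH, which by itself could explain a path error when Spark starts.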

You mentioned a concern about documentation clarity. This is a valid point, and we'll look into making the setup steps for tools like PySpark clearer in the documentation.

@merelcht merelcht added Component: Documentation 📄 Issue/PR for markdown and API documentation and removed Community Issue/PR opened by the open-source community labels Dec 10, 2024
@merelcht merelcht changed the title from "'kedro run' cant resolve path on starter project with tool pyspark enabled" to "Improve documentation for pyspark setup ('kedro run' cant resolve path on starter project with tool pyspark enabled)" Dec 10, 2024