hdfs support #196
Conversation
Thank you @skewballfox -- this is a great first contribution and a great improvement! For reference, you may find the following PRs helpful, as they encapsulate the Object Store integration work done by @linhr. Side note: I linked your PR to the associated GitHub issue (#173).
Hey, trying to set up Spark for the tests but I'm running into an issue. Even with a clean slate I'm getting an error somewhere.
Here's the last bit of output. Scrolling up can't take me past a long stream of "copying" statements to the origin of the error.
Given that this post has a match for all the non-specific strings, I'm guessing this error comes from setup.py or immediately after setup.py is called. Any ideas what's going on?
We've seen this error occasionally, and you can ignore it and continue with the next step. The PySpark package has been built, but the Spark patch somehow is not reverted correctly. (You can manually drop the change in the Spark source directory.) Sorry for the confusion! We'll update the documentation and look into the root cause.
Would it be alright if I pushed some "off-topic" commits? I added some changes to the Spark build script.
Yeah, go for it!
I just took a look at the script change, and I feel it goes a bit beyond what the script is supposed to do. Since the script is also used in CI, it may conflict with other parts of the GitHub Actions workflow. Would you mind reverting the script change? Sorry for the back and forth. These are valuable suggested changes, though; we'll definitely consider a way to incorporate the idea to make the developer experience better.
BTW, I like the idea of using a lock file to skip the expensive Maven build. I'll need some time to test it in CI, so let's revisit this in a future PR.
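For readers skimming this thread, here is a minimal sketch of the lock-file idea in Python (the actual script is not shown in this thread; the file names and the set of tracked inputs are hypothetical):

```python
# Hypothetical sketch of the lock-file idea: skip the expensive Maven build
# when the inputs it depends on are unchanged since the last successful build.
import hashlib
from pathlib import Path

LOCK_FILE = Path("opt/maven-build.lock")  # hypothetical location

def inputs_digest() -> str:
    """Hash the files whose changes should trigger a rebuild."""
    digest = hashlib.sha256()
    for path in sorted(Path(".").glob("**/pom.xml")):
        digest.update(path.read_bytes())
    return digest.hexdigest()

def maven_build_needed() -> bool:
    return not (LOCK_FILE.exists() and LOCK_FILE.read_text().strip() == inputs_digest())

def record_successful_build() -> None:
    """Call this only after the Maven build succeeds."""
    LOCK_FILE.parent.mkdir(parents=True, exist_ok=True)
    LOCK_FILE.write_text(inputs_digest())
```

The key design point is that the lock file is written only after a successful build, so a failed or interrupted build gets retried on the next run.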
Force-pushed from a7befb8 to b35c144
@skewballfox Feel free to put the changes in a separate draft PR so that none of your work is lost.
No worries, I used git reset to undo the last commit, and I'll push the script to a separate branch. Sorry for the delayed response; I've been running into issues getting the test environment set up.
You're actually quite prompt! We're thrilled to have your contribution, and if there's anything we can do to help with setting up your test environment, don't hesitate to reach out.
A couple of things. First, is there a way to get pyo3 to use a uv-installed Python version? I tried and couldn't get it working. Second, I'm having trouble getting the Spark tests to run.
I've made sure the listed prerequisites are in place.
Have you gone through the Environment Setup steps? After the above, it should just be the following commands to set up your test env:
Although, I'm also curious whether there is a problem with the Java installation. Unfortunately, running the Spark tests requires Java to be installed. Also, the GitHub Actions workflow runs the tests, so it may be good to manually follow the steps there (it looks like the Actions workflow uses Java Corretto 17). Can we try to get some more verbose log output? It seems like the current logs aren't telling us much.
Lastly, yes, it should be possible to get pyo3 to use a uv-installed Python version (typically by pointing the `PYO3_PYTHON` environment variable at the desired interpreter). I would refer to the Environment Setup steps that I linked at the very beginning of this message.
I have Java installed, and I still get the error when I run that command. I'm getting a series of "no such file or directory" errors for several test-related files.
Could you run the Spark build script unmodified and see if it works? I took another look at #200, and it seems to me that the patch lock file may cause the patch to be skipped when you build the Python package. If the patch is not applied, neither the tests nor the test support files will be in the correct location in the zip package. The Maven lock file seems fine to me, though.
Before doing this, I suggest going into the opt directory and removing the lock files.
I renamed my current copy of sail and recloned the repo. Surprisingly it worked, but I'm not sure why, given that I had removed the opt directory a few times (including the lock files). So I retested on my original copy (pruned/recreated the virtual envs, removed opt, reran the build script, etc.), and it's working now.

I think I may finally have a Hadoop container set up in pseudo-distributed mode for testing HDFS support. Where should it go: under scripts, opt, or somewhere else?

EDIT: right now it's a Dockerfile, but I can try to integrate it into the existing compose file. It might take me a bit, though, because I'm running rootless podman, which comes with a few extra caveats around container networking.
Glad to hear that you got the setup working! Let's define the container in compose.yml. Testing external systems with containers is not part of CI right now, but it could be future work. For this PR, it would be good enough to have a local setup. Either Python or Rust tests would be great.
Hey, kind of stuck at the moment. I'm setting up HDFS in a local container, verifying it's working via the web portal, and then connecting to the HDFS container using the method mentioned in the dev docs.
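For context, the client side of that method looks roughly like the following (a sketch, not the exact snippet from the dev docs; the Spark Connect address, namenode port, and file path are assumptions):

```python
# Connect a PySpark Spark Connect client to a locally running Sail server
# and read JSON from the HDFS container. Address, port, and path are assumed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:50051").getOrCreate()
df = spark.read.json("hdfs://localhost:9000/tmp/example.json")
df.show()
```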
This throws an Object Store error originating from a generic IO error. I know it's connecting, since the client returns okay, and if I change the path to a non-existent file I get an error indicating the resource doesn't exist. So it's making it somewhere past the object store instantiation and failing during execution (I think), but I'm having trouble tracing it any further than that. Here's the last bit of the traceback from Python:
Here's the last trace from the server:
Any ideas what the issue is, or suggestions on what I should try, to track down the problem?
The default logging filters cover the Spark Connect server only. If you want expanded debug logging, you can start the server with a broader filter (e.g. `RUST_LOG=debug` in the environment). Let's see if we're able to get better logs that way. I also highly recommend using a debugger; let me know if you'd like help getting one set up!
Just an update: I managed to finally read a JSON file from the HDFS instance; it was a configuration issue. Still failing to write to HDFS, though that is probably also a configuration issue (working on fixing it). Launching the HDFS container still doesn't work locally via podman compose; it keeps dying after around 15 seconds without producing any error and returns a normal exit code. I'm running the Dockerfile with:

```bash
env HADOOP_USER_NAME=sail podman build -t hadoop-container .
podman run -it -p 50070:9870 -p 9000:9000 -p 9864:9864 -p 9866:9866 hadoop-container /bin/bash
# then, inside the container:
/hdfs-init.sh
```

Separately, I think I might have found a bug in the S3 code that may be upstream. If you upload a JSON file via the web portal and then try to read it, I get unexpected behavior. It doesn't happen with the code from the docs, but it does with the query.
This is the expected behavior. The content type needs investigation, though.
Related issue for hdfs-native-object-store.
So I got reading and writing files working, at least for JSON. What other functions should I implement or confirm working prior to merging this? As for testing, I'm still using the Dockerfile; bringing both containers up with the compose file doesn't seem to work, but it's most likely a misconfiguration of Hadoop or podman. Traffic goes into the container (write requests are fulfilled), but responses don't make it back out. Given that, should I just document how to set up the Dockerfile?
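For reference, the kind of JSON round trip being described here might look like this (a sketch under the same assumptions as above, not the PR's actual test code):

```python
# Write a small DataFrame to HDFS as JSON through the Spark Connect server,
# then read it back. The server address and target path are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:50051").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.mode("overwrite").json("hdfs://localhost:9000/tmp/roundtrip")

readback = spark.read.json("hdfs://localhost:9000/tmp/roundtrip")
assert readback.count() == 2
```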
This is exciting progress!
I just triggered the workflow. Let's format the code using the command here: https://docs.lakesail.com/sail/main/development/build/rust.html
Reading and writing JSON files is a perfect scope for this PR. Let's add a FIXME comment in compose.yml for future investigation of the container issues, but there is no need to add documentation for the manual container setup.
I have a few comments, but overall this is quite an impressive change! I can see that the Dockerfile in particular required non-trivial effort. I think the PR is pretty close to the ready state.
BTW, compose might work on other systems (with or without sudo). I think part of the issue might be differences in rootless network configuration between backends (slirp4netns vs. pasta).

EDIT: So I sort of figured out the problem. Also, from trial and error, it seems like trying to map port 9000 to another port breaks HDFS during initialization (at least with pasta); I'm guessing it interferes with traffic it's trying to send to itself. It's really not designed to run the entire cluster in a single container, lol.
Thanks for addressing the comments! Also thanks for the notes regarding container setup.
I have one more minor comment, but I feel this is in good shape!
It seems there is a formatting issue in compose.yml. Could you use the following commands to format the non-Rust code? More information can be found here.
```bash
pnpm install
pnpm run format
pnpm run lint
```
Also, when you commit the changes this time, you can add `[spark tests]` in the commit message so that we can get the Spark tests triggered. After this, I think this PR is ready to be merged!
Sorry, I forgot to put `[spark tests]` in the commit message.
No worries! Since this PR does not change PySpark logic, it's fine to skip the tests. The tests will be run after the code is merged to the main branch. I've also created an issue (#227) so that the tests can be triggered without the special commit message.
@skewballfox We are so glad to have you as the first community contributor of Sail! Thank you for your contribution!
Still trying to figure a few things out. This might be easier than I initially expected, given that HDFS is the default shared filesystem that other implementations have to be compatible with.
Some relevant info I've found so far:
All hits for `object_store` in `crates` happen within sail-plan, so I'm guessing this should primarily be implemented within that crate?