
S3 write failure with alluxio 2.9.4 release #18643

Open
pragnesh opened this issue Jul 2, 2024 · 4 comments
Labels
type-bug This issue is about a bug

Comments

@pragnesh
Contributor

pragnesh commented Jul 2, 2024

Alluxio Version:
2.9.4

Describe the bug
After upgrading Alluxio from 2.9.3 to 2.9.4, we are seeing the following exception when a Spark job writes output to a directory mounted on S3 as the UFS. Multiple jobs are running that write to the same location.

2024-06-28 14:15:33,548 ERROR [task-execution-service-2](S3AOutputStream.java:156) - Failed to upload s3path/y=2024/mo=06/d=28/h=13/part-00008-df7dc7d1-cde0-4ba9-98b4-e6563e37ee59.c000.snappy.parquet
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@22e014ef rejected from java.util.concurrent.ThreadPoolExecutor@722238d8[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
	at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
	at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134)
	at com.amazonaws.services.s3.transfer.internal.UploadMonitor.create(UploadMonitor.java:95)
	at com.amazonaws.services.s3.transfer.TransferManager.doUpload(TransferManager.java:685)
	at com.amazonaws.services.s3.transfer.TransferManager.upload(TransferManager.java:534)
	at alluxio.underfs.s3a.S3AOutputStream.close(S3AOutputStream.java:154)
	at com.google.common.io.Closer.close(Closer.java:218)
	at alluxio.job.plan.persist.PersistDefinition.runTask(PersistDefinition.java:183)
	at alluxio.job.plan.persist.PersistDefinition.runTask(PersistDefinition.java:57)
	at alluxio.worker.job.task.TaskExecutor.run(TaskExecutor.java:88)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
2024-06-28 14:15:33,553 INFO  [task-execution-service-2](TaskExecutorManager.java:204) - Task 0 for job 1719583916388 failed:

To Reproduce
Run a Spark job with Alluxio 2.9.4 using S3 as the UFS.

Expected behavior
Alluxio should be able to upload the file to S3; instead, the upload fails and throws the exception above.

Urgency
We are unable to upgrade to 2.9.4 because of this.

Are you planning to fix it
Not working on it, but I am willing to help with a PR if someone points me in the right direction.

Additional context
Other jobs are also running that write to the same location, so an S3 metadata update could be an issue.
I also see that the default value of the alluxio.proxy.s3.bucketpathcache.timeout property changed from 1 min to 0 min in the 2.9.4 release.
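For reference, if one wanted to experiment with restoring the pre-2.9.4 default mentioned above, the property could be set in alluxio-site.properties. This is a hypothetical workaround; whether this cache timeout relates to the executor failure at all is unverified.

```properties
# Hypothetical: restore the pre-2.9.4 default of the S3 proxy bucket path
# cache timeout (changed from 1min to 0min in 2.9.4).
alluxio.proxy.s3.bucketpathcache.timeout=1min
```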

@pragnesh pragnesh added the type-bug This issue is about a bug label Jul 2, 2024
@YichuanSun
Contributor

Do you mean that before the upgrade you used 2.9.3 and the Spark job writing output to the directory worked?

@pragnesh
Contributor Author

Do you mean that before the upgrade you used 2.9.3 and the Spark job writing output to the directory worked?

Yes, it was working with 2.9.3.

@iwasakims

#15748 seems to be the cause. The executor is used via the TransferManager of the AWS SDK, as described in the comments on that PR.

Since the thread pool is shut down on GC of the TransferManager by default, should we be OK to leave the executor open in close()?
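The failure mode in the stack trace above (an upload submitted after the shared executor has already been shut down) can be reproduced in isolation without the AWS SDK. This is a minimal sketch, not Alluxio's actual code: a plain ExecutorService stands in for the pool that TransferManager uses internally.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class RejectedAfterShutdown {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(1);

        // Simulates S3AOutputStream.close() tearing down the executor that
        // the AWS TransferManager still holds a reference to.
        pool.shutdownNow();

        try {
            // Analogous to TransferManager.upload(...) submitting the upload
            // task to the now-terminated pool: the default AbortPolicy rejects it.
            pool.submit(() -> "upload");
            System.out.println("submitted");
        } catch (RejectedExecutionException e) {
            // Mirrors the RejectedExecutionException in the issue's log.
            System.out.println("rejected");
        }
    }
}
```

This is why reverting the shutdown call (or deferring pool teardown to GC, as the SDK does by default) avoids the rejection.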

@pragnesh
Contributor Author

#15748 seems to be the cause. The executor is used via the TransferManager of the AWS SDK, as described in the comments on that PR.

Since the thread pool is shut down on GC of the TransferManager by default, should we be OK to leave the executor open in close()?

I have built Alluxio 2.9.6 from source with the above commit reverted, and that revert does fix the issue.
