
Persistent buffers using 100% of available space causes the job to fail #113

Open
jsteel44 opened this issue Oct 2, 2019 · 1 comment
Labels
bug Something isn't working


jsteel44 commented Oct 2, 2019

We can use 99% (technically 100% minus one unit of granularity) of the available space. However, when we create a buffer to consume the remaining space, it appears to be created successfully, but then squeue reports:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                20     debug create-p    test2 PD       0:00      1 (burst_buffer/datawarp: setup: panic: runtime error: index out of range [recovered]
        panic: runtime error: index out of range

goroutine 1 [running]:
main.main.func1()
        /home/circleci/data-acc/cmd/dacctl/main.go:187 +0xb9
panic(0xb0d7c0, 0x12394c0)
        /usr/local/go/src/runtime/panic.go:522 +0x1b5
github.com/RSE-Cambridge/data-acc/internal/pkg/dacctl/workflow_impl.sessionFacade.doAllocationAndWriteSession(0xce3440, 0xc000176d00, 0xcdf780, 0xc000179560, 0xce0e80, 0xc00017c750, 0xcccf60, 0x126c0c8, 0x7ffff8268d30, 0x2, ...)
        /home/circleci/data-acc/internal/pkg/dacctl/workflow_impl/session.go:166 +0x6ee
github.com/RSE-Cambridge/data-acc/internal/pkg/dacctl/workflow_impl.sessionFacade.CreateSession.func1(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /home/circleci/data-acc/internal/pkg/dacctl/workflow_impl/session.go:110 +0xfb
github.com/RSE-Cambridge/data-acc/internal/pkg/dacctl/workflow_impl.sessionFacade.submitJob(0xce3440, 0xc000176d00, 0xcdf780, 0xc000179560, 0xce0e80, 0xc00017c750, 0xcccf60, 0x126c0c8, 0x7ffff8268d30, 0x2, ...)
        /home/circleci/data-acc/internal/pkg/dacctl/workflow_impl/session.go:53 +0x3e8
github.com/RSE-Cambridge/data-acc/internal/pkg/dacctl/workflow_impl.sessionFacade.CreateSession(0xce3440, 0xc000176d00, 0xcdf780, 0xc000179560, 0xce0e80, 0xc00017c750, 0xcccf60, 0x126c0c8, 0x7ffff8268d30, 0x2, ...)
        /home/circleci/data-acc/internal/pkg/dacctl/workflow_impl/session.go:107 +0x1b4
github.com/RSE-Cambridge/data-acc/internal/pkg/dacctl/actions_impl.(*dacctlActions).CreatePerJobBuffer(0xc000179580, 0xcdcac0, 0xc0000e71e0, 0xc000179580, 0x0)
        /home/circleci/data-acc/internal/pkg/dacctl/actions_impl/job.go:93 +0x6f5
main.setup(0xc0000e71e0, 0x0, 0x0)
        /home/circleci/data-acc/cmd/dacctl/actions.go:92 +0xa3
github.com/urfave/cli.HandleAction(0xade2a0, 0xc15140, 0xc0000e71e0, 0xc0000e71e0, 0x0)
        /go/pkg/mod/github.com/urfave/[email protected]/app.go:514 +0xbe
github.com/urfave/cli.Command.Run(0xbe38a2, 0x5, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0a8db, 0x47, 0x0, ...)
        /go/pkg/mod/github.com/urfave/[email protected]/command.go:171 +0x4d2
github.com/urfave/cli.(*App).Run(0xc0000dc540, 0xc0000b4000, 0xe, 0xf, 0x0, 0x0)
        /go/pkg/mod/github.com/urfave/[email protected]/app.go:265 +0x733
main.runCli(0xc0000b4000, 0xf, 0xf, 0xbe21f8, 0x1)
        /home/circleci/data-acc/cmd/dacctl/main.go:172 +0x1255
main.main()
        /home/circleci/data-acc/cmd/dacctl/main.go:194 +0x1f1
)
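
The trace points at sessionFacade.doAllocationAndWriteSession (session.go:166). As a minimal sketch of this failure class (purely illustrative, not the data-acc source): an allocator that indexes into its free-brick list without first checking it holds enough entries panics in exactly this way once a request arrives that the pool can barely, or not quite, satisfy.

    // Hypothetical sketch only, not the actual session.go code. allocateBricks
    // picks bricksNeeded entries out of the free list without bounds-checking,
    // so it panics with "index out of range" when too few bricks remain.
    package main

    import "fmt"

    type brick struct{ device string }

    func allocateBricks(free []brick, bricksNeeded int) []brick {
        chosen := make([]brick, 0, bricksNeeded)
        for i := 0; i < bricksNeeded; i++ {
            chosen = append(chosen, free[i]) // panics once i >= len(free)
        }
        return chosen
    }

    func main() {
        free := []brick{{device: "nvme0n1"}} // one brick left in the pool
        fmt.Println(allocateBricks(free, 2)) // request needs more than remains
    }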

At this point everything looks normal with the buffers, which now show 0 FreeSpace as expected:

Name=datawarp DefaultPool=default Granularity=1600GiB TotalSpace=24000GiB FreeSpace=0 UsedSpace=24000GiB
  Flags=EnablePersistent,PrivateData
  StageInTimeout=3600 StageOutTimeout=3600 ValidateTimeout=5 OtherTimeout=3600
  GetSysState=/usr/local/bin/dacctl
  GetSysStatus=/usr/local/bin/dacctl
  Allocated Buffers:
    Name=small CreateTime=2019-10-02T14:24:55 Pool=default Size=3200GiB State=allocated UserID=test2(1002)
    Name=full CreateTime=2019-10-02T14:23:14 Pool=default Size=20800GiB State=allocated UserID=test2(1002)
  Per User Buffer Use:
    UserID=test2(1002) Used=24000GiB
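
For reference, the sequence that triggers this looks like the following two job scripts (reconstructed from the pool output above rather than copied verbatim, so treat the exact directives as illustrative; the pool is 24000GiB with 1600GiB granularity):

    # Job 1 ("full"): takes most of the pool, leaving 3200GiB free.
    #BB create_persistent name=full capacity=20800GiB access=striped type=scratch

    # Job 2 ("small"): consumes the remaining 3200GiB. The buffer is reported
    # as allocated, but the job sits in PD with the panic above as its reason.
    #BB create_persistent name=small capacity=3200GiB access=striped type=scratch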
jsteel44 commented

This seems to affect only persistent buffers; a per-job buffer can request 100% of the space without issue.
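
For contrast, a per-job request for the whole pool along these lines (illustrative values matching the pool above) goes through without error:

    #DW jobdw capacity=24000GiB access_mode=striped type=scratch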

@jsteel44 jsteel44 changed the title Using 100% of available space causes the job to fail Persistent buffers using 100% of available space causes the job to fail Jan 14, 2020
@jsteel44 jsteel44 added the bug Something isn't working label Jan 14, 2020