`brig ls` and repinning with large dataset takes forever and blocks other operations #119

max-privatevoid · 2022-06-15T15:36:18Z

Describe the bug
With a large dataset (~70GB, ~4500 files), brig ls takes a very long time, as does performing a repin. Worse yet, repinning seems to block all other file operations, so with the default configuration brig stalls for hours every 15 minutes. I suspect this is because IsCached is called on everything in the repo, which performs a /refs call on every CID brig knows about.

To Reproduce

1. Add a large dataset
2. brig ls

or

1. While (re-)adding a large dataset, trigger a repin
2. Staging process blocks until repin is complete

Please always include the output of the following commands in your report!

brig bug -s

go version: ````
uname -s -v -m: Linux #1-NixOS SMP PREEMPT Wed May 18 08:28:23 UTC 2022 x86_64
IPFS config (only datastore):

{
  "Datastore": {
    "BloomFilterSize": 0,
    "GCPeriod": "1h",
    "HashOnRead": false,
    "NoSync": true,
    "Spec": {
      "child": {
        "path": "badgerds",
        "syncWrites": false,
        "truncate": true,
        "type": "badgerds"
      },
      "prefix": "badger.datastore",
      "type": "measure"
    },
    "StorageGCWatermark": 90,
    "StorageMax": "200GB"
  },
}

brig client version: v.. [build: ] (built from 6b7eccf)
brig server version: v..+
IPFS Version: 0.12.2+

Expected behavior
brig ls should return (near) instantly. Repinning should be performed in the background, without blocking other commands and without hammering the disk with read operations.


evgmik · 2022-06-15T21:47:09Z

Hi,

Your general assessment is correct: checking the pinned status is quite an expensive operation, since we talk to IPFS and it then has to check all chunks. Consequently, ls will not come back instantly, since it waits for the pinned status of each file. I think we set up a cache for this status, so a second call to ls should be fast within some time interval.

As for hammering the disk every 15 minutes, it might be triggered by IPFS itself. I am not sure, but I recall that the IPFS backend does garbage collection every 15 minutes or so. With many large files it usually takes a while. See the IPFS configuration for the responsible option.

Also try to set

brig cfg set repo.autogc.enabled false

it should help too.

Try the development version of brig; we made several speed improvements.

But both IPFS and brig do not scale too well with repos that contain many files, mostly because of IPFS. One can disable garbage collection, which helps, but then you may have a lot of stale chunks in the backend.

IPFS also has a database backend that can be used instead of flatfs.datastore. It might work better, but I have not tried it in practice.

max-privatevoid · 2022-06-15T22:47:51Z

Thanks for your reply.

I am using the latest development version already, as well as badgerds as my datastore for the IPFS node.

I've worked around the performance issue caused by the IsCached lookup by simply ripping it out from the code paths for brig ls and brig info, making those commands near-instant even at the root of the repo. As for the repin issue, I've relaxed the constraints on locking the fs mutex during the repin loop a bit. I don't know whether this is still safe, but it does allow operations such as brig ls to run while a repin is running.
Performance patches in my config repo
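
To illustrate the idea of relaxing the lock during the repin loop, here is a minimal sketch. It is not the actual patch: the `repo` type, its fields, and the `decide` callback are hypothetical stand-ins, and whether this granularity is safe for brig's real data structures is an open question.

```go
package main

import (
	"fmt"
	"sync"
)

// repo is a hypothetical stand-in for brig's file system state.
type repo struct {
	mu     sync.Mutex
	pinned map[string]bool
}

// snapshot copies the current paths while holding the lock only briefly.
func (r *repo) snapshot() []string {
	r.mu.Lock()
	defer r.mu.Unlock()
	paths := make([]string, 0, len(r.pinned))
	for p := range r.pinned {
		paths = append(paths, p)
	}
	return paths
}

// repin takes the lock once per file instead of once for the whole loop,
// so other operations can acquire the mutex between iterations.
func (r *repo) repin(decide func(path string) bool) {
	for _, p := range r.snapshot() {
		want := decide(p) // slow work happens outside the lock
		r.mu.Lock()
		if _, ok := r.pinned[p]; ok { // the file may have been removed meanwhile
			r.pinned[p] = want
		}
		r.mu.Unlock()
	}
}

func main() {
	r := &repo{pinned: map[string]bool{"/a": false, "/b": false}}
	r.repin(func(string) bool { return true })
	fmt.Println(r.pinned["/a"], r.pinned["/b"]) // true true
}
```

The trade-off is that the loop works on a snapshot, so it has to tolerate files that appear or disappear while it runs.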

These patches combined with some config tweaks seem to have made the whole thing decently workable.

My current patches are probably far too hacky for upstreaming, but if I find better ways around these issues you'll see PRs from me.

evgmik · 2022-06-15T23:49:27Z

I would agree, it is too hacky to just disable the check. This way you cannot be sure about the state of the local repo. You might unpin something which has no replica anywhere else.

As I said, only the first check for pin status is slow. You can increase the cache retention time, and then subsequent ls calls will be fast. See

brig/backend/httpipfs/shell.go

Line 104 in 6b7eccf

locallyCached: cache.New(5*time.Minute, 10*time.Minute),

The rest is IPFS doing its own things.
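
For readers unfamiliar with this kind of cache, a rough standard-library sketch of a time-limited status cache follows. It does not reproduce the cache library brig uses; `pinCache`, `lookup`, and the `slow` callback are hypothetical names, and the 5 minute retention only mirrors the first argument in the line quoted above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type entry struct {
	cached  bool
	expires time.Time
}

// pinCache is a hypothetical TTL cache keyed by content hash.
type pinCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]entry
}

func newPinCache(ttl time.Duration) *pinCache {
	return &pinCache{ttl: ttl, entries: make(map[string]entry)}
}

// lookup returns a fresh cached status, or calls slow and stores its result.
func (c *pinCache) lookup(hash string, slow func(string) (bool, error)) (bool, error) {
	c.mu.Lock()
	e, ok := c.entries[hash]
	c.mu.Unlock()
	if ok && time.Now().Before(e.expires) {
		return e.cached, nil
	}
	v, err := slow(hash) // the expensive backend query runs outside the lock
	if err != nil {
		return false, err
	}
	c.mu.Lock()
	c.entries[hash] = entry{cached: v, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return v, nil
}

func main() {
	calls := 0
	slow := func(hash string) (bool, error) {
		calls++
		return true, nil
	}
	c := newPinCache(5 * time.Minute)
	c.lookup("QmExample", slow)
	c.lookup("QmExample", slow)
	fmt.Println("backend calls:", calls) // backend calls: 1
}
```

A longer retention time makes repeated ls calls cheaper, at the cost of showing a stale status for longer.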

An ideal solution would be an asynchronous pin status check.
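
One way such an asynchronous check could look is sketched below, purely as an illustration. Nothing here exists in brig: `asyncChecker`, the `status` values, and the `check` callback are assumptions. The first query for a hash returns "unknown" immediately and starts a background check, and later queries return the stored result.

```go
package main

import (
	"fmt"
	"sync"
)

type status int

const (
	unknown status = iota
	cached
	notCached
)

// asyncChecker is a hypothetical wrapper around a slow status check.
type asyncChecker struct {
	mu      sync.Mutex
	state   map[string]status
	pending map[string]bool
	wg      sync.WaitGroup
	check   func(hash string) bool // stands in for the slow backend call
}

func newAsyncChecker(check func(string) bool) *asyncChecker {
	return &asyncChecker{
		state:   make(map[string]status),
		pending: make(map[string]bool),
		check:   check,
	}
}

// get never blocks on the backend; it starts at most one check per hash.
func (a *asyncChecker) get(hash string) status {
	a.mu.Lock()
	defer a.mu.Unlock()
	if s, ok := a.state[hash]; ok {
		return s
	}
	if !a.pending[hash] {
		a.pending[hash] = true
		a.wg.Add(1)
		go func() {
			defer a.wg.Done()
			s := notCached
			if a.check(hash) {
				s = cached
			}
			a.mu.Lock()
			a.state[hash] = s
			delete(a.pending, hash)
			a.mu.Unlock()
		}()
	}
	return unknown
}

// wait blocks until all checks started so far have finished.
func (a *asyncChecker) wait() { a.wg.Wait() }

func main() {
	a := newAsyncChecker(func(hash string) bool { return hash == "QmLocal" })
	fmt.Println(a.get("QmLocal") == unknown) // true: the first call returns at once
	a.wait()
	fmt.Println(a.get("QmLocal") == cached) // true: the result is now available
}
```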

By the way, what is your use case?

max-privatevoid · 2022-06-16T17:20:52Z

My use case is Syncthing-like (or Nextcloud-like, since that's the actual thing I seek to replace) personal file synchronization. I'm hoping to use brig for this because I already have a decent IPFS setup on my machines and I like the "lazy synchronization" aspect where files are only fetched when accessed.

evgmik · 2022-06-16T18:14:56Z

Currently we ask IPFS about the pin status of every file separately.
I have an idea for speeding up the pin check: we ask IPFS about all available pins in one go and then work with the cached IPFS pin status.

Could you check how long the following commands take on your large data collection:
ipfs pin ls > /dev/null and ipfs refs local > /dev/null
If they finish fast enough, then we have hope.
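
A rough sketch of the "ask once, then look up locally" idea: parse the complete pin listing into a set and answer per-file queries from memory. It assumes the listing has one `<cid> <type>` pair per line; `parsePinList` and `isPinned` are hypothetical helpers, and the CIDs are placeholders. It covers only the pin status, not whether all chunks are available locally.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parsePinList turns the text output of a pin listing
// (assumed format: "<cid> <type>" per line) into a set of CIDs.
func parsePinList(out string) map[string]struct{} {
	pins := make(map[string]struct{})
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 0 {
			continue // skip blank lines
		}
		pins[fields[0]] = struct{}{}
	}
	return pins
}

// isPinned answers from the in-memory set, without another backend call.
func isPinned(pins map[string]struct{}, cid string) bool {
	_, ok := pins[cid]
	return ok
}

func main() {
	out := "QmExampleA recursive\nQmExampleB recursive\n"
	pins := parsePinList(out)
	fmt.Println(len(pins), isPinned(pins, "QmExampleA"), isPinned(pins, "QmExampleC"))
	// 2 true false
}
```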

On a separate note, if your file collection is static, brig will work. But if you are thinking about using it as a work tree with a lot of changes, then merging might give you trouble. @sahib and I worked on it a while ago, but did not make it fully bulletproof.

max-privatevoid · 2022-06-16T19:35:46Z

I cancelled ipfs pin ls after a couple of minutes since that's probably too long already. ipfs pin ls --type=recursive on the other hand returns very quickly. I believe everything brig pins is type=recursive, so that could work.

Benchmark 1: ipfs pin ls --type=recursive > /dev/null
  Time (mean ± σ):      91.9 ms ±   5.2 ms    [User: 130.1 ms, System: 25.4 ms]
  Range (min … max):    85.6 ms … 103.6 ms    33 runs
 
Benchmark 2: ipfs refs local > /dev/null
  Time (mean ± σ):      2.010 s ±  0.153 s    [User: 2.728 s, System: 0.216 s]
  Range (min … max):    1.906 s …  2.438 s    10 runs

Regarding work tree operations, I do intend to use brig like that and I believe I've already run into a problem. I've been debugging it with the following repro steps:

ali touch test/x.txt
bob sync ali
# bob now has the file
bob info test/x.txt
# bob removes the file
bob rm test/x.txt
# sync the removal
ali sync bob
ali ls -R
# ali should not have test/x.txt anymore

I can open a separate issue for that if you think it's worth looking at.

evgmik · 2022-06-17T01:38:37Z

Thanks for the benchmark. I double-checked the isPinned code at

brig/backend/httpipfs/pin.go

Line 76 in 6b7eccf

func (nd *Node) IsCached(hash h.Hash) (bool, error) {

The situation is a bit more complex: the pin status alone is not enough. A pin is equivalent to a download-and-keep request; it does not guarantee that the child chunks are already locally available. So we call for a --recursive check of the children as well. This takes a lot of time.

My use case was to ask brig to download a folder for offline use, so I needed to be sure that all chunks are locally available.

But pinning is difficult: a pin in brig just asks IPFS to pin the file. A file can be pinned but not yet locally available, i.e. not yet cached. As a user, I want to be aware of both situations, but this is a very time-consuming check. Maybe we should (or can) add a switch to ls to skip the caching status. Unfortunately, I do not have time to dig into it, but @sahib welcomes patches and used to review them quite quickly.

Please file a separate bug about tree operations. Only @sahib fully understands the tree storage system. I recall working on a similar issue, but I was not able to solve it. No merge conflict resolution mechanism was envisioned, so syncing has strange side effects.
