Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance for Cleaner implementation #1617

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

matthiasblaesing
Copy link
Member

The Cleaner used multiple monitors to protect its datastructures. And as datastructre a (manually) linked list was used. The datastructure was updated to a ConcurrentHashMap and the multiple monitor usages are replaced with a ReentrandReadWriteLocks.

Performance numbers:

Commandline:

java -jar target/benchmarks.jar -t 1000 -i 1 -wi 0

========== 5.14.0 ==========

Result "eu.doppelhelix.jna.jmh.MyBenchmark.testMethod":
1211666,184 ±(99.9%) 134595,856 ops/s [Average]
(min, avg, max) = (1178371,132, 1211666,184, 1271195,212), stdev = 34954,116
CI (99.9%): [1077070,328, 1346262,040] (assumes normal distribution)

Estimated CPU Load: 650%

========== 5.14.0 ==========

Result "eu.doppelhelix.jna.jmh.MyBenchmark.testMethod":
3260953,271 ±(99.9%) 655799,010 ops/s [Average]
(min, avg, max) = (3092006,068, 3260953,271, 3527896,224), stdev = 170308,920
CI (99.9%): [2605154,261, 3916752,281] (assumes normal distribution)

Estimated CPU Load: 1500%

============================

Code:

package eu.doppelhelix.jna.jmh;

import com.sun.jna.internal.Cleaner;
import java.util.concurrent.atomic.AtomicLong;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

public class MyBenchmark {

@Benchmark
public void testMethod(Blackhole blackhole) {
    DummyObject dummyObj = new DummyObject();
    DummyObjectCleaner dummyCleaner = new DummyObjectCleaner(blackhole, dummyObj.getDummyValue());
    Cleaner.getCleaner().register(dummyObj, dummyCleaner);
}

public static class DummyObject {
    private static final AtomicLong ai = new AtomicLong();
    private final String dummyValue;

    public DummyObject() {
        this.dummyValue = "d " + ai.incrementAndGet();
    }

    public String getDummyValue() {
        return dummyValue;
    }
}

public static class DummyObjectCleaner implements Runnable{
    private final Blackhole bh;
    private final String data;

    public DummyObjectCleaner(Blackhole bh, String data) {
        this.bh = bh;
        this.data = data;
    }

    public void run() {
        this.bh.consume(this.data);
    }
}

}

The Cleaner used multiple monitors to protect its datastructures. And
as datastructre a (manually) linked list was used. The datastructure
was updated to a ConcurrentHashMap and the multiple monitor usages are
replaced with a ReentrandReadWriteLocks.


Performance numbers:

Commandline:

java -jar target/benchmarks.jar -t 1000 -i 1 -wi 0

========== 5.14.0 ==========

Result "eu.doppelhelix.jna.jmh.MyBenchmark.testMethod":
  1211666,184 ±(99.9%) 134595,856 ops/s [Average]
  (min, avg, max) = (1178371,132, 1211666,184, 1271195,212), stdev = 34954,116
  CI (99.9%): [1077070,328, 1346262,040] (assumes normal distribution)

Estimated CPU Load: 650%

========== 5.14.0 ==========

Result "eu.doppelhelix.jna.jmh.MyBenchmark.testMethod":
  3260953,271 ±(99.9%) 655799,010 ops/s [Average]
  (min, avg, max) = (3092006,068, 3260953,271, 3527896,224), stdev = 170308,920
  CI (99.9%): [2605154,261, 3916752,281] (assumes normal distribution)

Estimated CPU Load: 1500%

============================

Code:

package eu.doppelhelix.jna.jmh;

import com.sun.jna.internal.Cleaner;
import java.util.concurrent.atomic.AtomicLong;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

public class MyBenchmark {

    @benchmark
    public void testMethod(Blackhole blackhole) {
        DummyObject dummyObj = new DummyObject();
        DummyObjectCleaner dummyCleaner = new DummyObjectCleaner(blackhole, dummyObj.getDummyValue());
        Cleaner.getCleaner().register(dummyObj, dummyCleaner);
    }

    public static class DummyObject {
        private static final AtomicLong ai = new AtomicLong();
        private final String dummyValue;

        public DummyObject() {
            this.dummyValue = "d " + ai.incrementAndGet();
        }

        public String getDummyValue() {
            return dummyValue;
        }
    }
    
    public static class DummyObjectCleaner implements Runnable{
        private final Blackhole bh;
        private final String data;

        public DummyObjectCleaner(Blackhole bh, String data) {
            this.bh = bh;
            this.data = data;
        }

        public void run() {
            this.bh.consume(this.data);
        }
    }
}
@matthiasblaesing
Copy link
Member Author

@ivanr-rs @pmconrad you both raised issues about performance of the cleaner implementation. I had a look at this and this is the result of that. Could you have a look at this and give it a spin?

If you need a build, here it is: jna.zip

@matthiasblaesing
Copy link
Member Author

For the interpretation of the performance numbers - the maximum CPU load archivable would be 1600% (8 core 2 threads).

@ivanr-rs
Copy link

@matthiasblaesing Many thanks for your quick response. We will take it for a spin.

It may take us some time, though, as we're experiencing this problem only in our production environment, and we have weekly development cycles with a week of testing in a preproduction environment. We were able to shift the load over the weekend so that we're now under the threshold at which this problem is manifesting.

try {
return add(new CleanerRef(this, obj, referenceQueue, cleanupTask));
} finally {
synchronized (obj) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is potentially dangerous. Some other thread could be holding a lock on obj already.
This may be a non-issue because Cleaner is for internal use only.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please elaborate, why you think we can deadlock here? For a deadlock we need two different locks, but Cleaner#register itself only locks the monitor of obj here. Sure, we can block at this position, but the other thread currently holding the monitor of obj can finish the critical section, release the monitor and then we run to completion.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think of two threads. One locks object A and registers object B, the other one locks B and registers A.
Like I said, this is an unlikely constellation in Cleaner's internal use.

cleanerThreadLock.readLock().lock();
try {
long count = trackedObjects.incrementAndGet();
if (cleanerThread == null && count > 0) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because register() ensures that ref cannot be cleaned before add() has completed, we know that count > 0 is always true here.

cleanerThread.start();
}
} finally {
cleanerThreadLock.readLock().lock();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lock is unlocked again immediately. That can be solved better without re-locking.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you are mistaken. The readLock is locked and the writeLock is unlocked. This downgrades the ReadWriteLock from write to read. This is needed, as the readLock is unlocked in the finally clause. My assumption here is, that while a thread holds the writeLock now other thread can enter it, thus taking the readLock can be done on a fast path.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed, as the readLock is unlocked in the finally clause.

What I meant is that you could set a flag instead and skip the unlock in finally if the flag is set.

registeredCleaners.append(cleanerRef.cleanupTask.toString());
break;
} finally {
cleanerThreadLock.readLock().lock();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above. Re-locking can be avoided.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

// Guard for trackedObjects and cleanerThread. The readlock is utilized when
// the trackedObjects are manipulated, the writelock protectes starting and
// stopping the CleanerThread
private final ReadWriteLock cleanerThreadLock = new ReentrantReadWriteLock();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the readlock is not necessary at all. trackedObjects is an AtomicLong and doesn't need any protection.
The important step is to protect the starting and stopping of the CleanerThread. In particular, it must be ensured that the thread doesn't terminate while a new reference is added without starting a new CleanerThread..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cleanerThreadLock protects the interaction of trackedObjects and cleanerThread. The critical section is the starting and stopping of the Thead if and only iff the number of tracked objects reaches zero or is above 1.

This is the sequence I want to prevent:

  1. T1 using the cleaner calls register and runs before incrementAndGet is invoked
  2. CleanerThread reaches the trackedObjects.get() == 0 check and enters the if block, but does not execute further yet.
  3. T1 executes incrementAndGet and receives 1, checks the value of cleanerThread and finds a Thread there.
  4. CleanerThread continues, clears cleanerThread and finishes execution

There is a subtle issue there, the break needs to move into the inner if in line 173 (https://github.com/java-native-access/jna/pull/1617/files/b901c180f4a1026675a2859bbf24727574e19e6a#diff-11b3b0981223743529d972f317f26dfd9488b4a9b80a34a2d6f76cc2c58dcc00R173) into the inner if.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I attempted to write some pseudocode to show how to do it with a single lock and found a race in it. The cause of the race is the double-check locking that you employ in order to achieve better parallelism.
Without the double-check a single lock is sufficient, but then you're back to a mutex block that is always executed, hence no more real parallelism.

I still wonder if a tiny mutex block would be faster than the read/write lock, in particular because the ReadWriteLock JavaDoc says just that:

Further, if the read operations are too short the overhead of the read-write lock implementation (which is inherently more complex than a mutual exclusion lock) can dominate the execution cost, particularly as many read-write lock implementations still serialize all threads through a small section of code.

@pmconrad
Copy link

FTR, I have run the overload test from my optimization branch and found that my original issue is not resolved by this change.

return inChain;
private void remove(CleanerRef ref) {
map.remove(ref);
cleanerThreadLock.readLock().lock();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the lock here is not required. Like you said, the lock protects the interaction of cleanerThread and trackedObjects, but there is no interaction here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants