Improve performance for Cleaner implementation #1617

matthiasblaesing · 2024-07-13T19:52:59Z

The Cleaner used multiple monitors to protect its data structures, and the data structure itself was a (manually) linked list. The data structure was changed to a ConcurrentHashMap, and the multiple monitor usages were replaced with a ReentrantReadWriteLock.
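
To illustrate the data structure change in isolation, here is a minimal sketch of a reference registry backed by a ConcurrentHashMap key set. It is not the PR's code: the class name `RefRegistrySketch` and its methods are hypothetical, and plain `Object` instances stand in for the Cleaner's reference objects.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the Cleaner's bookkeeping; not the PR's implementation.
final class RefRegistrySketch {
    // Key set backed by a ConcurrentHashMap: add/remove need no external monitor.
    // Plain Objects use identity-based equals/hashCode, so each reference is its own key.
    private final Set<Object> refs = ConcurrentHashMap.newKeySet();

    boolean add(Object ref) {
        return refs.add(ref);
    }

    boolean remove(Object ref) {
        return refs.remove(ref);
    }

    int size() {
        return refs.size();
    }
}
```

Compared with a manually linked list guarded by a monitor, insertion and removal here do not serialize all callers through one lock.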

Performance numbers:

Commandline:

java -jar target/benchmarks.jar -t 1000 -i 1 -wi 0

========== 5.14.0 ==========

Result "eu.doppelhelix.jna.jmh.MyBenchmark.testMethod":
1211666,184 ±(99.9%) 134595,856 ops/s [Average]
(min, avg, max) = (1178371,132, 1211666,184, 1271195,212), stdev = 34954,116
CI (99.9%): [1077070,328, 1346262,040] (assumes normal distribution)

Estimated CPU Load: 650%

========== 5.14.0 ==========

Result "eu.doppelhelix.jna.jmh.MyBenchmark.testMethod":
3260953,271 ±(99.9%) 655799,010 ops/s [Average]
(min, avg, max) = (3092006,068, 3260953,271, 3527896,224), stdev = 170308,920
CI (99.9%): [2605154,261, 3916752,281] (assumes normal distribution)

Estimated CPU Load: 1500%

============================

Code:

package eu.doppelhelix.jna.jmh;

import com.sun.jna.internal.Cleaner;
import java.util.concurrent.atomic.AtomicLong;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

public class MyBenchmark {

    @Benchmark
    public void testMethod(Blackhole blackhole) {
        DummyObject dummyObj = new DummyObject();
        DummyObjectCleaner dummyCleaner = new DummyObjectCleaner(blackhole, dummyObj.getDummyValue());
        Cleaner.getCleaner().register(dummyObj, dummyCleaner);
    }

    public static class DummyObject {
        private static final AtomicLong ai = new AtomicLong();
        private final String dummyValue;

        public DummyObject() {
            this.dummyValue = "d " + ai.incrementAndGet();
        }

        public String getDummyValue() {
            return dummyValue;
        }
    }

    public static class DummyObjectCleaner implements Runnable {
        private final Blackhole bh;
        private final String data;

        public DummyObjectCleaner(Blackhole bh, String data) {
            this.bh = bh;
            this.data = data;
        }

        public void run() {
            this.bh.consume(this.data);
        }
    }

}


matthiasblaesing · 2024-07-13T20:08:30Z

@ivanr-rs @pmconrad you both raised issues about performance of the cleaner implementation. I had a look at this and this is the result of that. Could you have a look at this and give it a spin?

If you need a build, here it is: jna.zip

matthiasblaesing · 2024-07-13T20:09:34Z

For the interpretation of the performance numbers: the maximum achievable CPU load would be 1600% (8 cores with 2 threads each).
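
As a rough worked example of how these figures relate, the sketch below computes the throughput ratio between the two result blocks and the share of the 1600% ceiling each run used. It assumes the first result block is the baseline and the second is the build with this change (both are labeled 5.14.0 in the output); the class and method names are made up for illustration.

```java
import java.util.Locale;

// Hypothetical helper for reading the benchmark output; not part of the PR.
final class BenchmarkArithmetic {
    /** Ratio of the second run's throughput to the first run's throughput. */
    static double speedup(double baselineOpsPerSec, double changedOpsPerSec) {
        return changedOpsPerSec / baselineOpsPerSec;
    }

    /** Fraction of the maximum CPU load, where each hardware thread contributes 100%. */
    static double utilization(double loadPercent, int cores, int threadsPerCore) {
        return loadPercent / (cores * threadsPerCore * 100.0);
    }

    public static void main(String[] args) {
        // Average ops/s taken from the two result blocks above.
        System.out.printf(Locale.ROOT, "speedup: %.2fx%n", speedup(1211666.184, 3260953.271));
        System.out.printf(Locale.ROOT, "first run: %.1f%% of max load%n", 100 * utilization(650, 8, 2));
        System.out.printf(Locale.ROOT, "second run: %.1f%% of max load%n", 100 * utilization(1500, 8, 2));
    }
}
```

With the numbers above this gives a throughput ratio of roughly 2.7x, with CPU utilization rising from about 41% to about 94% of the ceiling.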

ivanr-rs · 2024-07-15T15:15:40Z

@matthiasblaesing Many thanks for your quick response. We will take it for a spin.

It may take us some time, though, as we're experiencing this problem only in our production environment, and we have weekly development cycles with a week of testing in a preproduction environment. We were able to shift the load over the weekend so that we're now under the threshold at which this problem is manifesting.

pmconrad · 2024-07-17T11:37:39Z

src/com/sun/jna/internal/Cleaner.java

+        try {
+            return add(new CleanerRef(this, obj, referenceQueue, cleanupTask));
+        } finally {
+            synchronized (obj) {


This is potentially dangerous. Some other thread could be holding a lock on obj already.
This may be a non-issue because Cleaner is for internal use only.

Could you please elaborate on why you think we can deadlock here? For a deadlock we need two different locks, but Cleaner#register itself only locks the monitor of obj here. Sure, we can block at this position, but the other thread currently holding the monitor of obj can finish its critical section and release the monitor, and then we run to completion.

Think of two threads. One locks object A and registers object B, the other one locks B and registers A.
Like I said, this is an unlikely constellation in Cleaner's internal use.
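
The two-thread scenario can be reproduced with a small sketch. It is an illustration under assumptions, not JNA code: `register` below is a hypothetical stand-in that only synchronizes on the passed object, and the threads are daemon threads so the JVM can still exit after they deadlock.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

final class LockOrderDeadlockDemo {
    /** Hypothetical stand-in for a register() that synchronizes on the registered object. */
    static void register(Object obj) {
        synchronized (obj) {
            // bookkeeping would happen here
        }
    }

    /** Starts a thread that locks 'held' and then registers 'registered'. */
    static void lockThenRegister(Object held, Object registered, CountDownLatch bothHold) {
        Thread t = new Thread(() -> {
            synchronized (held) {
                bothHold.countDown();
                try {
                    bothHold.await(); // wait until the other thread holds its monitor too
                } catch (InterruptedException e) {
                    return;
                }
                register(registered); // blocks: the other thread owns this monitor
            }
        });
        t.setDaemon(true); // the threads never finish, so do not keep the JVM alive
        t.start();
    }

    /** Returns true if the JVM reports a monitor deadlock within roughly timeoutMillis. */
    static boolean provokeDeadlock(long timeoutMillis) {
        Object a = new Object();
        Object b = new Object();
        CountDownLatch bothHold = new CountDownLatch(2);
        lockThenRegister(a, b, bothHold); // locks A, registers B
        lockThenRegister(b, a, bothHold); // locks B, registers A
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long deadline = System.currentTimeMillis() + timeoutMillis;
        try {
            while (System.currentTimeMillis() < deadline) {
                if (threads.findMonitorDeadlockedThreads() != null) {
                    return true;
                }
                Thread.sleep(10);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return false;
    }
}
```

The deadlock only arises because callers hold a monitor on one object while registering another; a caller that holds no foreign monitors merely blocks briefly, as described in the reply above.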

pmconrad · 2024-07-17T11:44:55Z

src/com/sun/jna/internal/Cleaner.java

+        cleanerThreadLock.readLock().lock();
+        try {
+            long count = trackedObjects.incrementAndGet();
+            if (cleanerThread == null && count > 0) {


Because register() ensures that ref cannot be cleaned before add() has completed, we know that count > 0 is always true here.

pmconrad · 2024-07-17T11:47:41Z

src/com/sun/jna/internal/Cleaner.java

+                        cleanerThread.start();
+                    }
+                } finally {
+                    cleanerThreadLock.readLock().lock();


This lock is unlocked again immediately. That can be solved better without re-locking.

I think you are mistaken. The readLock is locked and the writeLock is unlocked. This downgrades the ReadWriteLock from write to read. This is needed, as the readLock is unlocked in the finally clause. My assumption here is that while a thread holds the writeLock no other thread can enter it, thus taking the readLock can be done on a fast path.

This is needed, as the readLock is unlocked in the finally clause.

What I meant is that you could set a flag instead and skip the unlock in finally if the flag is set.
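
The two variants discussed here can be sketched side by side. This is a simplified illustration, not the PR's code: the class, the `started` field (standing in for `cleanerThread != null`), and the method names are hypothetical.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class DowngradeSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private boolean started; // guarded by lock

    /** Variant 1: downgrade write -> read so the outer finally can always unlock the read lock. */
    boolean ensureStartedWithDowngrade() {
        boolean startedHere = false;
        lock.readLock().lock();
        try {
            if (!started) {
                // ReentrantReadWriteLock cannot upgrade: release read before taking write.
                lock.readLock().unlock();
                lock.writeLock().lock();
                try {
                    if (!started) { // re-check, another thread may have won the race
                        started = true;
                        startedHere = true;
                    }
                } finally {
                    lock.readLock().lock();    // take read while still holding write
                    lock.writeLock().unlock(); // now only the read lock is held
                }
            }
            return startedHere;
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Variant 2: remember whether the read lock is still held instead of re-locking. */
    boolean ensureStartedWithFlag() {
        boolean startedHere = false;
        boolean holdsRead = true;
        lock.readLock().lock();
        try {
            if (!started) {
                lock.readLock().unlock();
                holdsRead = false;
                lock.writeLock().lock();
                try {
                    if (!started) {
                        started = true;
                        startedHere = true;
                    }
                } finally {
                    lock.writeLock().unlock();
                }
            }
            return startedHere;
        } finally {
            if (holdsRead) {
                lock.readLock().unlock();
            }
        }
    }

    int readHolds() {
        return lock.getReadHoldCount();
    }

    boolean writeHeld() {
        return lock.isWriteLockedByCurrentThread();
    }
}
```

Both variants leave the lock fully released on exit. The flag version saves one read-lock acquisition at the cost of an extra local variable.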

pmconrad · 2024-07-17T11:48:12Z

src/com/sun/jna/internal/Cleaner.java

-                                    registeredCleaners.append(cleanerRef.cleanupTask.toString());
+                                    break;
+                                } finally {
+                                    cleanerThreadLock.readLock().lock();


See above. Re-locking can be avoided.

pmconrad · 2024-07-17T11:53:10Z

src/com/sun/jna/internal/Cleaner.java

+    // Guard for trackedObjects and cleanerThread. The readlock is utilized when
+    // the trackedObjects are manipulated, the writelock protectes starting and
+    // stopping the CleanerThread
+    private final ReadWriteLock cleanerThreadLock = new ReentrantReadWriteLock();


I believe the readlock is not necessary at all. trackedObjects is an AtomicLong and doesn't need any protection.
The important step is to protect the starting and stopping of the CleanerThread. In particular, it must be ensured that the thread doesn't terminate while a new reference is added without starting a new CleanerThread.

The cleanerThreadLock protects the interaction of trackedObjects and cleanerThread. The critical section is the starting and stopping of the Thread if and only if the number of tracked objects reaches zero or is above 1.

This is the sequence I want to prevent:

1. T1, using the cleaner, calls register and runs up to the point just before incrementAndGet is invoked.
2. CleanerThread reaches the trackedObjects.get() == 0 check and enters the if block, but does not execute further yet.
3. T1 executes incrementAndGet and receives 1, checks the value of cleanerThread and finds a Thread there.
4. CleanerThread continues, clears cleanerThread and finishes execution.

The result would be a registered reference with no CleanerThread left to process it.
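
A stripped-down sketch of how a read/write lock rules out this interleaving is shown below. It is an assumption-laden model rather than the PR's code: `workerRunning` stands in for `cleanerThread != null`, no real thread is started, and all names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class ThreadLifecycleSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final AtomicLong tracked = new AtomicLong();
    private boolean workerRunning; // guarded by lock

    /** Many registering threads may run the read-locked part concurrently. */
    void register() {
        lock.readLock().lock();
        try {
            tracked.incrementAndGet();
            if (workerRunning) {
                return; // a stop cannot be in progress while a read lock is held
            }
        } finally {
            lock.readLock().unlock();
        }
        lock.writeLock().lock();
        try {
            if (!workerRunning) { // re-check under the exclusive lock
                workerRunning = true; // the real code would start the thread here
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    void unregister() {
        tracked.decrementAndGet();
    }

    /** Called by the worker: the zero check and the clearing are one exclusive step. */
    boolean tryStopWorker() {
        lock.writeLock().lock();
        try {
            if (workerRunning && tracked.get() == 0) {
                workerRunning = false;
                return true;
            }
            return false;
        } finally {
            lock.writeLock().unlock();
        }
    }

    boolean isWorkerRunning() {
        lock.readLock().lock();
        try {
            return workerRunning;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Because the increment happens under the read lock and the zero check under the write lock, steps 2 and 3 of the sequence can no longer overlap.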

There is a subtle issue there: the break needs to move into the inner if in line 173 (https://github.com/java-native-access/jna/pull/1617/files/b901c180f4a1026675a2859bbf24727574e19e6a#diff-11b3b0981223743529d972f317f26dfd9488b4a9b80a34a2d6f76cc2c58dcc00R173).

So, I attempted to write some pseudocode to show how to do it with a single lock and found a race in it. The cause of the race is the double-checked locking that you employ in order to achieve better parallelism.
Without the double-check a single lock is sufficient, but then you're back to a mutex block that is always executed, hence no more real parallelism.

I still wonder if a tiny mutex block would be faster than the read/write lock, in particular because the ReadWriteLock JavaDoc says just that:

Further, if the read operations are too short the overhead of the read-write lock implementation (which is inherently more complex than a mutual exclusion lock) can dominate the execution cost, particularly as many read-write lock implementations still serialize all threads through a small section of code.
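
For comparison, the single-mutex alternative without a double check could look roughly like the following. This is only a sketch of the idea raised above, with hypothetical names and a boolean in place of the actual thread; whether it is faster than the read/write lock would have to be measured.

```java
final class SingleMutexSketch {
    private final Object mutex = new Object();
    private long tracked;          // guarded by mutex
    private boolean workerRunning; // guarded by mutex, stands in for cleanerThread != null

    /** Every registration passes through the same short critical section. */
    void register() {
        synchronized (mutex) {
            tracked++;
            if (!workerRunning) {
                workerRunning = true; // the real code would start the thread here
            }
        }
    }

    void unregister() {
        synchronized (mutex) {
            tracked--;
        }
    }

    /** Called by the worker: check and clear happen atomically under the mutex. */
    boolean tryStopWorker() {
        synchronized (mutex) {
            if (workerRunning && tracked == 0) {
                workerRunning = false;
                return true;
            }
            return false;
        }
    }

    boolean isWorkerRunning() {
        synchronized (mutex) {
            return workerRunning;
        }
    }
}
```

The trade-off is the one named in the comment: correctness is easy to see, but all registering threads are serialized through the mutex.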

pmconrad · 2024-07-17T12:09:46Z

FTR, I have run the overload test from my optimization branch and found that my original issue is not resolved by this change.

pmconrad · 2024-07-18T09:51:07Z

src/com/sun/jna/internal/Cleaner.java

-            return inChain;
+    private void remove(CleanerRef ref) {
+        map.remove(ref);
+        cleanerThreadLock.readLock().lock();


I think the lock here is not required. Like you said, the lock protects the interaction of cleanerThread and trackedObjects, but there is no interaction here.


Move break to correct location

f9d7bbf
