Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is some unexpected lock triggered? #534

Open
iredmail opened this issue Nov 8, 2024 · 9 comments
Open

Is some unexpected lock triggered? #534

iredmail opened this issue Nov 8, 2024 · 9 comments

Comments

@iredmail
Copy link
Contributor

iredmail commented Nov 8, 2024

Dear all,

I experienced some weird issue recently, it looks like some kind of locking inside ldap library but i'm not sure.

i prepared a code snippet to reproduce the issue, what it does is:

  • it connects to OpenLDAP server running on localhost
  • run 10 goroutines to endlessly add new ldap objects and print a short message with timestamp on console. so there's up to 10 concurrent ldap add operations.
  • it sleeps 20ms after successfully added object. this is used to simulate ldap operations which may take slightly longer in one ldap connection.
  • it establishes a new ldap connection for each add operation, and close the connection immediately. No shared connection, no ldap connection pool.

Expected result

The log messages on console are rolling without pause. It may be slightly slow if hardware spec is not that good, but it won't pause for X seconds (up to 60 seconds in my tests).

The real result

The log messages are rolling immediately right after launching the program, but it stops rolling shortly (and randomly) for X seconds, then it continues rolling. The pause may happens again and again randomly, and the pause interval is not always same (it could be 30 seconds, or 60 seconds, or X seconds).

How to use this code snippet

  • create a new directory and initialize it with go mod init xxx
  • copy this full code snippet to file main.go
  • edit main.go, update uri, bindDN, bindPassword, parentDN with correct values for your testing LDAP server.
  • run it: go run .
  • If no pause at all, try to stop the program and re-launch. Repeat this until it happens.
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"

	"github.com/go-ldap/ldap/v3"
	"github.com/google/uuid"
)

const (
	uri          = "ldap://127.0.0.1:389"
	bindDN       = "cn=manager,dc=a.io"
	basePassword = "Zp3CRSySlF8lR618WVKqa5ySMf42XE0m"

	// we will create new ldap objects under this parent dn.
	parentDN = "ou=Users,domainName=test.io,o=domains,dc=a.io"
)

func addUser() error {
	conn, err := ldap.DialURL(uri)
	if err != nil {
		return err
	}

	defer conn.Close()

	err = conn.Bind(bindDN, basePassword)
	if err != nil {
		fmt.Println(err)

		os.Exit(255)
	}

	user := uuid.New().String()

	dn := fmt.Sprintf("uid=%s,%s", user, parentDN)

	attrs := map[string][]string{
		"objectClass": {"inetOrgPerson"},
		"uid":         {user},
		"cn":          {user},
		"sn":          {user},
	}

	request := ldap.NewAddRequest(dn, nil)
	for k, v := range attrs {
		request.Attribute(k, v)
	}

	err = conn.Add(request)
	if err != nil {
		return err
	}
	fmt.Printf("%s Added: %s\n", time.Now().Format(time.DateTime), user)

	// simulate operations which takes slightly long time here,
	// easily reproduce the "pause / lock" issue.
	time.Sleep(time.Millisecond * 20)

	return nil
}

func endlessLoop() {
	for {
		err := addUser()
		if err != nil {
			fmt.Println(err)
		}
	}
}

func main() {
	for i := 0; i < 10; i++ {
		go endlessLoop()
	}

	chanQuit := make(chan os.Signal, 1)
	signal.Notify(chanQuit, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT, os.Interrupt)
	<-chanQuit
}
@cpuschma
Copy link
Member

cpuschma commented Nov 8, 2024

Hi,

I setup an OpenLDAP server and tried to reproduce your issue by running your program for 5 minutes straight. None of the threads ended up in a softlock and ran flawlessly until I stopped the program. All new directory objects were successfully created.

Can you attach a debgger to your program in the affected environment?

@iredmail
Copy link
Contributor Author

iredmail commented Nov 9, 2024

Forgot to mention my testing environment, sorry about this.

  • MacBook Pro, CPU is M3 Pro (arm64).
  • OpenLDAP (version 2.6.7) and Go (version 1.22.2) are running inside a Ubuntu 24.04 virtual machine (arm64 too), powered by VMware Fusion.
  • I also tried cross building the Go program on macOS (Go 1.23.3), then copy and run it inside VM, same result.

I'm not sure how to attach a debugger. :(

  • Did you try to abort the program with Ctrl-C and re-launch it? Repeat few times until it happens.
  • The longer the simulation sleeps (time.Sleep(time.Millisecond * 20)), the easier to reproduce the issue during my testing.
  • Not sure whether the CPU / arch matters.

@iredmail
Copy link
Contributor Author

Hi @cpuschma

Is there anyway I can help you reproduce the issue? Or cooperate to reproduce it?

@cpuschma
Copy link
Member

@iredmail Can you try and place several debugging statements (e.g. println, fmt.Println) before you

  • Open a connection
  • Send a bind request
  • Send a search request

and prepend the current goroutine so we can know at what stage the routine starts to hang? Thank you

@iredmail
Copy link
Contributor Author

iredmail commented Nov 12, 2024

Sure, new code below. Changes:

  • Add multiple fmt.Printf to indicate the stages.
  • Increase goroutines to 20. Easier to reproduce the issue.

Result:

  • It hangs / pauses multiple times after printed Binding. Does it mean there's a pause (lock) in Bind()?

Sample output with pause:

2024-11-12 06:46:21 [9] Close succeeded.
2024-11-12 06:46:21 [9] Dial succeeded.
2024-11-12 06:46:21 [9] Binding.
2024-11-12 06:46:21 [17] Close succeeded.
2024-11-12 06:46:21 [17] Dial succeeded.
2024-11-12 06:46:21 [17] Binding.    # <-- it hangs / pauses here.
2024-11-12 06:46:42 [13] Add succeeded. # <- check the time, it pauses for 20 seconds.
2024-11-12 06:46:42 [0] Bind succeeded.
2024-11-12 06:46:42 [11] Bind succeeded.
2024-11-12 06:46:42 [9] Bind succeeded.
2024-11-12 06:46:42 [16] Add succeeded.
2024-11-12 06:46:42 [15] Add succeeded.
2024-11-12 06:46:42 [17] Bind succeeded.

Full code (main.go):

package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"

	"github.com/go-ldap/ldap/v3"
	"github.com/google/uuid"
)

const (
	uri          = "ldap://127.0.0.1:389"
	bindDN       = "cn=manager,dc=a.io"
	basePassword = "Zp3CRSySlF8lR618WVKqa5ySMf42XE0m"

	// we will create new ldap objects under this parent dn.
	parentDN = "ou=Users,domainName=test.io,o=domains,dc=a.io"
)

func addUser(r int) error {
	conn, err := ldap.DialURL(uri)
	if err != nil {
		return err
	}
	fmt.Printf("%s [%d] Dial succeeded.\n", time.Now().Format(time.DateTime), r)

	defer func() {
		conn.Close()
		fmt.Printf("%s [%d] Close succeeded.\n", time.Now().Format(time.DateTime), r)
	}()

	fmt.Printf("%s [%d] Binding.\n", time.Now().Format(time.DateTime), r)
	err = conn.Bind(bindDN, basePassword)
	if err != nil {
		fmt.Println(err)

		os.Exit(255)
	}
	fmt.Printf("%s [%d] Bind succeeded.\n", time.Now().Format(time.DateTime), r)

	user := uuid.New().String()

	dn := fmt.Sprintf("uid=%s,%s", user, parentDN)

	attrs := map[string][]string{
		"objectClass": {"inetOrgPerson"},
		"uid":         {user},
		"cn":          {user},
		"sn":          {user},
	}

	request := ldap.NewAddRequest(dn, nil)
	for k, v := range attrs {
		request.Attribute(k, v)
	}

	err = conn.Add(request)
	if err != nil {
		return err
	}
	fmt.Printf("%s [%d] Add succeeded.\n", time.Now().Format(time.DateTime), r)

	// simulate operations which takes slightly long time here,
	// easily reproduce the "pause / lock" issue.
	time.Sleep(time.Millisecond * 20)

	return nil
}

func endlessLoop(i int) {
	for {
		err := addUser(i)
		if err != nil {
			fmt.Println(err)
		}
	}
}

func main() {
	for i := 0; i < 20; i++ {
		go endlessLoop(i)
	}

	chanQuit := make(chan os.Signal, 1)
	signal.Notify(chanQuit, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT, os.Interrupt)
	<-chanQuit
}

@iredmail
Copy link
Contributor Author

iredmail commented Nov 12, 2024

i cloned go-ldap/ldap and use local git clone by using replace directive in go.mod, added more fmt.Printf and located the pause starts after Sending message (https://github.com/go-ldap/ldap/blob/master/v3/conn.go#L522), it finished the tasks inside case MessageRequest block immediately, but toke 8 seconds to print Receiving message 2.

2024-11-12 07:36:04 [17] Binding.
LDAP Request: (Universal, Constructed, Sequence and Sequence of) Len=62 "<nil>"
 MessageID: (Universal, Primitive, Integer) Len=1 "1"
 Bind Request: (Application, Constructed, 0x00) Len=57 "<nil>"
  Version: (Universal, Primitive, Integer) Len=1 "3"
  User Name: (Universal, Primitive, Octet String) Len=18 "cn=manager,dc=a.io"
  Password: (Context, Primitive, 0x00) Len=32 "Zp3CRSySlF8lR618WVKqa5ySMf42XE0m"
sendMessageWithFlags locking
sendMessageWithFlags inside Lock()
2024/11/12 07:36:04 flags&startTLS = 0
sendMessageWithFlags l.outstandingRequests++
sendMessageWithFlags Unlocking
sendMessageWithFlags Unlocked.
2024/11/12 07:36:04 1: returning
2024/11/12 07:36:04 1: waiting for response
2024/11/12 07:36:04 Sending message 1
2024-11-12 07:47:57 [19] Close succeeded.
2024-11-12 07:47:57 [19] Dial succeeded.
2024-11-12 07:47:57 [19] Binding.
LDAP Request: (Universal, Constructed, Sequence and Sequence of) Len=62 "<nil>"
 MessageID: (Universal, Primitive, Integer) Len=1 "1"
 Bind Request: (Application, Constructed, 0x00) Len=57 "<nil>"
  Version: (Universal, Primitive, Integer) Len=1 "3"
  User Name: (Universal, Primitive, Octet String) Len=18 "cn=manager,dc=a.io"
  Password: (Context, Primitive, 0x00) Len=32 "Zp3CRSySlF8lR618WVKqa5ySMf42XE0m"
sendMessageWithFlags locking
sendMessageWithFlags inside Lock()
2024/11/12 07:47:57 flags&startTLS = 0
sendMessageWithFlags l.outstandingRequests++
sendMessageWithFlags Unlocking
sendMessageWithFlags Unlocked.
2024/11/12 07:47:57 1: returning
DEBUG 11
2024/11/12 07:47:57 1: waiting for response
2024/11/12 07:47:57 Sending message 1
2024/11/12 07:48:06 Receiving message 2.  # <-- it pauses 8 seconds to print this debug log line

@iredmail
Copy link
Contributor Author

iredmail commented Nov 12, 2024

Above result was tested without running conn.SetTimeout(), here's the result with conn.SetTimeout(3*time.Second).
Interesting result:

  • Pause starts when it reaches readPacket() (line 51): https://github.com/go-ldap/ldap/blob/master/v3/request.go#L51
  • it pauses as long as the request timeout (3 seconds). I also changed the timeout to 10 seconds, and it paused for 10 seconds.
  • The program exits immediately after received one error LDAP Result Code 200 "Network Error": ldap: connection timed out. I started 20 goroutines, and each goroutine establishes its own ldap connection, so i expect to receive 20 errors instead of just one. Is my expectation correct? Seems some data was shared in all ldap connections or globally?
    • UPDATE: I ran the program again and again, sometimes it received 16 error messages, so the number of error messages before program exit are different, not always just one.
    • NOTE: Without call SetTimeout(), the program runs endlessly (it pauses, but never exits).
# ./a 
2024-11-12 08:13:59 [2] Dial succeeded.
2024-11-12 08:13:59 [2] Binding.
2024-11-12 08:13:59 [19] Dial succeeded.
2024-11-12 08:13:59 [19] Binding.
2024-11-12 08:13:59 [12] Dial succeeded.
2024-11-12 08:13:59 [18] Dial succeeded.
2024-11-12 08:13:59 [18] Binding.
2024-11-12 08:13:59 [12] Binding.
...
...[omit duplicate lines] ...
...
2024-11-12 08:13:59 [11] Binding.
1: readPacket waiting for response at 1731399239
1: readPacket waiting for response at 1731399239
2024-11-12 08:13:59 [14] Dial succeeded.
2024-11-12 08:13:59 [14] Binding.
1: readPacket waiting for response at 1731399239
2024-11-12 08:13:59 [4] Dial succeeded.
2024-11-12 08:13:59 [4] Binding.
1: readPacket waiting for response at 1731399239
1: readPacket received at 1731399242                   # <- it pauses for 3 seconds, which is same as the request timeout.
LDAP Result Code 200 "Network Error": ldap: connection timed out
1: readPacket received at 1731399242

@iredmail
Copy link
Contributor Author

By the way, conn.SetTimeout() is used to set request timeout, not the connection timeout, i think the method name may confuse users.
Better use SetRequestTimeout() instead, also add SetConnTimeout() to set connection timeout so that users don't need to modify DialURL with custom connection timeout.

@iredmail
Copy link
Contributor Author

Hi @cpuschma, could you help take a look? Thanks. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants