Multiple threads querying same table takes a long time #1358
Replies: 19 comments
-
Very old but possibly related:
-
Are those 5 threads sharing one connection, or is this with 5 connections to the database?
-
It is 5 different connections: each thread opens its own connection and executes the query. I assume there's some internal thread pool, but it's definitely not the same connection object.
-
Yeah, I stumbled across this and tried it out, but no success. Also, 5 processes running the same query don't block.
-
Does this code illustrate what you're asking about?

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor

import pyodbc

connection_string = "DSN=mssql_199;UID=scott;PWD=tiger^5HHH"
select_statement = "SELECT 1 AS foo"


def thread_function(name):
    logging.info("Thread %s: starting", name)
    t0 = time.perf_counter()
    cnxn = pyodbc.connect(connection_string)
    crsr = cnxn.cursor()
    results = crsr.execute(select_statement).fetchval()
    crsr.close()
    cnxn.close()
    elapsed = time.perf_counter() - t0
    logging.info("Thread %s: finishing (%s seconds)", name, elapsed)


if __name__ == "__main__":
    format = "%(asctime)s: %(message)s"
    logging.basicConfig(format=format, level=logging.INFO, datefmt="%H:%M:%S")
    t0_main = time.perf_counter()
    num_threads = 3
    logging.info("[num_threads = %s]", num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        executor.map(thread_function, range(num_threads))
    logging.info(
        "Threads all done (%s seconds)", time.perf_counter() - t0_main
    )
    t0_main = time.perf_counter()
    for i in range(num_threads):
        cnxn = pyodbc.connect(connection_string)
        crsr = cnxn.cursor()
        results = crsr.execute(select_statement).fetchval()
        crsr.close()
        cnxn.close()
    logging.info("Loop done (%s seconds)", time.perf_counter() - t0_main)
```

Sample output:
-
Yes, except that I'm not using a ThreadPoolExecutor but spinning up threads in a for loop. I also didn't include the connect in my timing block, and I don't close the cursor afterwards. I also call
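Roughly, a sketch of the pattern described (this is a simulation, not the actual test: `run_query` is an illustrative stand-in for the blocking `cursor.execute(...)` call, with `time.sleep` simulating query latency the way a blocking ODBC call releases the GIL):

```python
import threading
import time


def run_query(results, i):
    # Stand-in for executing the SELECT on an already-open connection;
    # a real ODBC call would block here and release the GIL.
    t0 = time.perf_counter()
    time.sleep(0.2)  # simulated query latency
    results[i] = time.perf_counter() - t0


num_threads = 3
results = [None] * num_threads

# Spin up the threads in a plain for loop (no ThreadPoolExecutor).
t0_main = time.perf_counter()
threads = [threading.Thread(target=run_query, args=(results, i))
           for i in range(num_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = time.perf_counter() - t0_main

print(f"per-thread times: {results}")
print(f"total wall time:  {total:.3f}s")
```

If the simulated queries truly overlap, the total wall time stays close to a single query's latency rather than `num_threads` times it.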
-
Similarly, if each thread runs a SQL statement against a different table, the time doesn't grow with the number of threads. I also tried using a WITH clause to make sure the queries are completely different even when hitting the same table, but no luck.
-
Have you tried using
-
Ah yes, I forgot to mention: I do see slightly better performance, but the issue is still there; the performance degrades with the number of threads.
-
For my test, the time does indeed increase with the number of threads, but that is the elapsed "clock on the wall" time from the moment the thread starts until it finishes. Each thread is obviously not executing the whole time, because the elapsed time of each thread is approximately the same as the elapsed time for all threads ("Threads all done"). In other words, the elapsed time for a thread is not the amount of time it is actually executing; it also includes the time when that thread is dormant while one of the other threads is active.

As for selecting from one table vs. selecting from different tables, I purposely chose

TL;DR: Think about what your metrics really mean. Perhaps a more realistic test would be to define a task to be done and then compare the time it takes to complete that task using different numbers of threads. For example, using a single (main) thread to read 1_000_000 rows from a table vs. using 100 threads to read 10_000 rows each.
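The fixed-task comparison suggested above can be sketched like this. It is a simulation under stated assumptions: `read_rows` is an illustrative stand-in for fetching rows, `time.sleep` stands in for I/O-bound fetch work (which releases the GIL just as a blocking ODBC fetch does), and the row counts are scaled down so the sketch runs quickly:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_rows(n_rows, cost_per_row=0.0002):
    # Stand-in for fetching n_rows from the database; the sleep
    # releases the GIL the way a blocking ODBC fetch would.
    time.sleep(n_rows * cost_per_row)
    return n_rows


TOTAL_ROWS = 1000  # scaled down from 1_000_000 for a quick sketch

# One (main) thread reads all the rows.
t0 = time.perf_counter()
read_rows(TOTAL_ROWS)
single = time.perf_counter() - t0

# Ten threads read one tenth of the rows each.
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as ex:
    list(ex.map(read_rows, [TOTAL_ROWS // 10] * 10))
threaded = time.perf_counter() - t0

print(f"single thread: {single:.3f}s  ten threads: {threaded:.3f}s")
```

Because the total work is fixed, comparing the two wall times actually answers "do threads help?", which the per-thread elapsed times in the earlier test do not.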
-
I think it does make sense up to a point. Since execute releases the GIL, the threads should execute concurrently (theoretically, if not quite in practice), so each thread's time should be smaller than the time taken to execute all the commands sequentially. I understand that on an ancient machine you might hit a resource limit, but as I said, running the same 5 threads against 5 different tables shows a time that is substantially lower than running them against the same table, and there are no resource bottlenecks on the DB server. Row-level locking should not apply to a SELECT query if I set the transaction isolation level to READ UNCOMMITTED, I think.

I think not hitting a real database might also be misleading. I wrote an equivalent piece of C++ code and I don't see the same issue there; the time taken to run the query does not increase with the number of threads.
-
Also this:
makes little sense.
-
It's not really the use case here. Imagine a client-server application where the server has a gRPC pool of workers handling requests. If all my requests are queries to different tables, they respond fast; if they hit the same table, they all slow down massively.
-
Sorry, but I'm not seeing that either. I created 5 tables (tbl_0 .. tbl_4) with 5 columns x 1000 rows. I tweaked my test so that each thread either hit the same table (tbl_0) or hit a different one (e.g., thread_#0 hit tbl_0, thread_#1 hit tbl_1, etc.). The SELECT statement used

[same]      0.050622984  mean: 0.087852513
[different] 0.057655652  mean: 0.083586224

A t-Test in Excel gave a t statistic of 0.323172196 and a two-tail critical value of 2.10092204, implying that the two means are not significantly different. Certainly there was nothing like a 2- or 3-fold difference in the time taken to complete the tests for the two scenarios.

Can you reproduce your results if you use pymssql instead of pyodbc? It appears that pymssql has the same threading limitations as pyodbc, namely threadsafety=1: "threads may share the module but not connections".
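For reference, that kind of significance check can be reproduced without Excel. A minimal Welch's t-statistic computed from two samples of per-run timings with only the standard library (the sample values below are made up for illustration, not the measurements above):

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    va, vb = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(va / len(a) + vb / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


same_table = [1.0, 2.0, 3.0, 4.0, 5.0]        # illustrative timings
different_tables = [1.0, 2.0, 3.0, 4.0, 6.0]  # illustrative timings
t_stat = welch_t(same_table, different_tables)
print(f"t = {t_stat:.4f}")
```

A |t| well below the two-tail critical value, as in the Excel result above, means the two sets of timings are statistically indistinguishable.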
-
I will try. Is your example using the same driver as above? I did try this with two different databases and saw the same result, but I will look into the finer details tomorrow. I also tried SQLAlchemy, which showed better performance but a similar pattern, although I think under the hood it is still using pyodbc.
-
Gentle terminology correction: The ODBC Driver is something like "ODBC Driver 18 for SQL Server", or "msodbcsql18" for short (from the name of the .so or .dll file). unixODBC is an ODBC Driver Manager (DM). I was running my tests on Windows, so I was using Windows' built-in DM (odbc32.dll) — not unixODBC — but I wasn't too concerned because we are looking at (what we think might be) a server-side blocking issue and the DM is unlikely to be relevant.
If you use
-
Just repeated my test on Ubuntu 22.04 with unixODBC version 2.3.9 and msodbcsql18 version 18.3.3:

[same]      mean: 0.460556142
[different] mean: 0.47910388

t Stat: 0.512121114 => means are not significantly different
-
Your claim that querying different tables vs. the same table results in different times already eliminates pyodbc as the cause: pyodbc doesn't attempt to interpret the query, so it can't behave differently depending on it. I agree with gordthompson that this is almost certainly server-side.
-
It's because I ran that test on Windows, and ODBC connection pooling is automatically enabled for "ODBC Driver 17/18 for SQL Server" on Windows. The three threads each had to open their own connection, and when they were finished the connections were returned to the pool, so the

I actually added the loop to try to provide insight into the overhead required to manage the threads, but I didn't consider the connection pool. My bad.
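For anyone wanting to rule pooling in or out when comparing runs: pyodbc exposes ODBC connection pooling through a module-level flag that must be set before the first connection is made in the process. A minimal configuration fragment (the connection string is illustrative and requires a live DSN, so this is not runnable as-is):

```python
import pyodbc

# Disable ODBC connection pooling at the Driver Manager level.
# This must happen before the first pyodbc.connect() anywhere
# in the process; afterwards the setting cannot be changed.
pyodbc.pooling = False

cnxn = pyodbc.connect("DSN=mssql_199;UID=scott;PWD=tiger^5HHH")
```

With pooling disabled, each thread's `connect()` pays the full connection cost, which makes per-thread timings comparable between Windows and platforms where pooling is off by default.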
-
Hi folks,
I see the following behavior when running a Python application that has multiple threads.
I managed to reproduce this with a test script but can't figure out where the issue is coming from, so I hope you might be able to shed some light.
I have a database that has 5 tables. If I create 5 threads and do a SELECT, all of the results are returned in a reasonable amount of time.
If, however, I make each thread run the same SELECT against just one of the tables, the time taken to return increases hugely. The threads all finish at about the same time, but this time grows with the number of threads.
ApplicationIntent is set to ReadOnly.
Autocommit=True.
I also set SQL_ATTR_TXN_ISOLATION to SQL_TXN_READ_UNCOMMITTED, and also tried calling "set transaction isolation level read uncommitted".
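For reference, the isolation level can be set at connect time via pyodbc's `attrs_before` parameter, which applies the attribute before the connection is opened. A configuration fragment (the connection string is illustrative and needs a live DSN, so this is not runnable as-is):

```python
import pyodbc

cnxn = pyodbc.connect(
    "DSN=mssql_199;UID=scott;PWD=tiger^5HHH",
    autocommit=True,
    attrs_before={
        # Apply READ UNCOMMITTED before the connection is opened,
        # instead of issuing a SET TRANSACTION statement afterwards.
        pyodbc.SQL_ATTR_TXN_ISOLATION: pyodbc.SQL_TXN_READ_UNCOMMITTED
    },
)
```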
It's worth mentioning that I don't see the same behavior with multiple processes: multiple processes return in a reasonable amount of time, and the time doesn't grow.
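The multi-process comparison can be sketched the same way as the threaded test. This is a simulation, not the actual test script: `run_query` is an illustrative stand-in for connect + execute + fetch, with `time.sleep` simulating the query latency:

```python
import time
from multiprocessing import Pool


def run_query(i):
    # Stand-in for connect + execute + fetch running in a
    # separate process; the sleep simulates query latency.
    time.sleep(0.2)
    return i


if __name__ == "__main__":
    num_procs = 3
    t0 = time.perf_counter()
    with Pool(processes=num_procs) as pool:
        pool.map(run_query, range(num_procs))
    elapsed = time.perf_counter() - t0
    print(f"{num_procs} processes done in {elapsed:.3f}s")
```

With processes there is no shared GIL at all, so if the processes scale but the threads don't, the contention is somewhere below Python (driver, Driver Manager, or server).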
I tried looking in the pyodbc code, but I don't see anything that would lock, and pystack doesn't show any odd behavior. I didn't manage to turn on ODBC logs.
Your help is much appreciated.