Passing None to SQL Server INSERTs drastically slows down inserts #741
Have you tried using setinputsizes to see if that helps improve performance? |
I didn't at first but tried your suggestion using the following map:
Unfortunately nothing changes. I'm currently using a workaround in which I make the first row of the data values all non-None. This increases the speed from around 70 records / second to 7-8k records / second. While I don't know the internals of pyodbc, it seems like it uses the first row's values to determine the data types; setting setinputsizes beforehand has no effect. |
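For illustration, a minimal sketch of that first-row workaround, assuming the data is a list of parameter tuples (the connection string, table, and column names here are placeholders, not the original schema):

import pyodbc

cnxn = pyodbc.connect("DSN=my_dsn", autocommit=True)  # placeholder connection
crsr = cnxn.cursor()
crsr.fast_executemany = True

rows = [
    (1, None, None),   # a first row with Nones triggers the slow path
    (2, "abc", 3.14),
    (3, "def", 2.72),
]

# Move a fully populated row (if any) to the front so the type detection on the
# first row sees real values instead of having to guess for the NULLs.
for i, row in enumerate(rows):
    if all(v is not None for v in row):
        rows[0], rows[i] = rows[i], rows[0]
        break

crsr.executemany("INSERT INTO mytable (id, txt, val) VALUES (?, ?, ?)", rows)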
Can we assume that you're using fast_executemany = True? |
Yes.
I'm also using the latest ODBC Driver 17 for SQL Server (I was previously using Driver 13, and both show the same issue). |
I could provide you the serialized object of the list I'm inserting in the SQL Server and the table structure, if this would help. |
Could you post an ODBC trace? Since you mention performance I guess there are lots of rows to insert but you can truncate the trace after it starts to become repetitive. |
Sure. Here they are. I'm sending you two log files generated by a sample of 1000 records.
SQL_without_Nones.zip

If you want, I can provide the dataset I used. For some reason, not having Nones in the first record makes a huge difference. |
The first difference I notice is at line 608: "without_Nones" is calling SQLBindParameter with SQL_C_WCHAR while "with_None" is calling it with SQL_C_BINARY. |
Anything else I can provide that would help your assessment? I've spent half an hour looking at the tracing files but they are beyond my ability to debug. If this is relevant: in this article https://github.com/mkleehammer/pyodbc/wiki/Binding-Parameters, Michael Kleehammer says that for SQL Server "None could be a NULL of any type. VARCHAR can be converted to most types, but SQL Server won't convert it to binary, so we get an error inserting into binary columns." Could this be related? |
In fast_executemany mode, pyODBC attempts to insert as many rows at once as possible by allocating the entire (2-dimensional) array of parameters, binding them, converting the Python objects into the ODBC C representation needed by the driver, and then executing. Without Nones, the log shows a single execution with 1001 rows. With Nones, the types it detected for the first row are ultimately incorrect (as Gord above mentioned, it chose binary instead of character) and thus has to re-detect each time it finds a new type which doesn't match the previous ones for the column, so you get many smaller execution batches. |
@v-chojas - Does the fast_executemany code really ignore setinputsizes, at least with regard to dealing with null (None) values? |
Yes, it does ignore it. I used the following code:

It works the same as without the .setinputsizes, and I know the .setinputsizes values are being read, because if I set the precision to None it raises an exception: pyodbc.Error: ('HY104', '[HY104] [Microsoft][ODBC Driver 17 for SQL Server]Invalid precision value (0) (SQLBindParameter)'). The table structure is as follows (CREATE TABLE statement):

Would the behavior explained by @v-chojas be expected in the None situation? If so, and as @gordthompson suggested, wouldn't setting the types and sizes before the execution be the solution? Based on the previous comments, this does seem to be a bug, right? Do you want me to provide the tracing with the setinputsizes declared before the execution? |
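For reference, a minimal sketch of declaring the parameter types and sizes before executemany via setinputsizes (the connection string, table, types, and sizes here are placeholder assumptions, not the original code):

import pyodbc

cnxn = pyodbc.connect("DSN=my_dsn", autocommit=True)  # placeholder connection
crsr = cnxn.cursor()
crsr.fast_executemany = True

# One (sql_type, size, decimal_digits) tuple per parameter marker, in order.
crsr.setinputsizes([
    (pyodbc.SQL_INTEGER, 0, 0),
    (pyodbc.SQL_WVARCHAR, 50, 0),
    (pyodbc.SQL_DOUBLE, 0, 0),
])

crsr.executemany(
    "INSERT INTO mytable (id, txt, val) VALUES (?, ?, ?)",
    [(1, None, None), (2, "abc", 3.14)],
)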
Even if you use setinputsizes, which sets the SQL type, I suspect a None cannot be detected as any valid C type for the binding, so it causes the code to go through the redetect path. Yes, a trace with setinputsizes would be useful. |
Here is the log with setinputsizes (as I described above). Insertion of 999 records. |
One question: why do we avoid the problem only when the first row has no Nones? If only the second row has no Nones, the entire insert flow is still impacted. I mean, shouldn't pyodbc correctly bind all the fields after it finds a non-None value? Why does it keep re-detecting even after it finds a valid record? Why is the first line special? |
pyODBC can only bind based on the information it has so far; if a column was bound as C type X, then any further cells that can't be converted to that type will require redetecting and rebinding, which means submitting what it has found so far. |
Since pyODBC can't bind None because SQL Server has different types of NULLs for different types of fields, the alternative would be to tell pyodbc beforehand what the type is. Theoretically, the way to do that would be using .setinputsizes(). Since this doesn't work in practice, can this be considered a bug to be worked on in the future? |
@v-chojas - sqlext.h has mappings between C types and SQL types starting here:

If we provide the SQL types via setinputsizes, couldn't those mappings be used to determine the corresponding C types? |
The C type must match the Python type, because it determines the data format of the bound buffer, and that also explains why in fast executemany mode, pyODBC must scan the first row to find out what the Python type of each column is (and why Nones are such a problem.) |
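To illustrate that point with a rough, non-authoritative sketch (this is not pyodbc's actual lookup table, just an approximation of the idea): the C buffer type follows from the Python type of the value, and None carries no such hint.

# Approximate illustration only: a Python value's type suggests an ODBC C buffer type.
APPROX_C_TYPE_FOR_PYTHON_TYPE = {
    str:   "SQL_C_WCHAR",    # character data
    bytes: "SQL_C_BINARY",   # binary data
    int:   "SQL_C_SBIGINT",  # 64-bit integer
    float: "SQL_C_DOUBLE",
    bool:  "SQL_C_BIT",
}
# type(None) has no natural entry here, which is why a None in the first row forces a guess.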
No, not if I've already told it what to expect by calling setinputsizes. Or to put it another way, it sounds like the behaviour is to look at each column value in the current row and, if it's None, assume that it is the same type as the corresponding column of the previous row. If so, then just use the information from setinputsizes instead. |
Yes, but from this: I was able to derive this:
Not sure about the ones with the brackets, but the most common players are present and there are no duplicates in either the "SQL_type" or "C_type" columns. |
I would go even further and suggest that if the parameter classes in the provided Python parameters are not the same (for each field, ignoring None), then executemany should error out immediately.
It doesn't seem unreasonable to insist the parameter types for each field are the same. |
The C type needs to match the Python type, or it will result in incorrect data inserted; you've only found the default conversions but the driver can do others. Python is a dynamic language, so it's absolutely normal to have heterogeneous types within a single column as long as they are convertible. |
@nelsonwcf - I created an MCVE hoping to reproduce the issue, but it didn't. The following code inserts 1_000_000 rows in 53 seconds, regardless of whether the first row contains a None (i.e., regardless of the value of row_index_for_null):

import os
import time
import pyodbc
cnxn = pyodbc.connect(
"DRIVER=ODBC Driver 17 for SQL Server;"
"SERVER=192.168.0.199;"
"DATABASE=mydb;"
"UseFMTONLY=yes;"
f"UID=sa;PWD={os.getenv('sa_PWD')};",
autocommit=True,
)
crsr = cnxn.cursor()
crsr.execute("CREATE TABLE #tmp (id int PRIMARY KEY, txt varchar(50))")
rows_to_insert = 1_000_000
row_index_for_null = 0
row_data = [
(x, (None if x == row_index_for_null else f"row{x}"),)
for x in range(rows_to_insert)
]
sql = "INSERT INTO #tmp (id, txt) VALUES (?, ?)"
crsr.fast_executemany = True
t0 = time.time()
crsr.executemany(sql, row_data)
print(f"{rows_to_insert:,} rows inserted in {(time.time() - t0):0.1f} seconds")
Yes, I tested and can confirm that your code didn't have any of the issues even though the first line has a None. Maybe there's an additional hidden criterion that is required for the issue to show? I will prepare a small dataset that shows the problem and share it here (a sample from the one I used at my job). |
I think the best way to avoid guessing what the requirements are for the issue to show is to use the original file that created the problem. I'm attaching a 999-record CSV, the one I used to generate the ODBC tracing log files. It would take 1 second to insert all records when no Nones were present in the first line, and around 20 seconds otherwise. Maybe this will help? |
@nelsonwcf – Thanks for posting the sample data. On reviewing the issue I noticed two things:
I am able to sort-of reproduce your issue, although in my case the performance penalty for null(s) in the first row is more like a factor of 2, not 20:

sample_test_no_null_first_line.csv
sample_test.csv
The "packets" and "bytes" numbers come from Wireshark. Interesting that the second case is only generating 4% more network traffic, but it's taking 43% more packets to get the job done. |
Did you try to compare the odbc tracing for both as well? |
No, because we already know from the ODBC trace logs you posted earlier that "without_Nones" calls the ODBC SQLBindParameter function with SQL_C_WCHAR while "with_None" calls it with SQL_C_BINARY. |
Are there any planned fixes for this bug in upcoming releases? |
I do not think it is a bug: it is just slower but still working, and in any case, no slower than without fast_executemany. |
Are there any plans or ongoing work to fix this bug? The performance impact is huge: I'm seeing bulk inserts of 25 rows/second when the first row contains NULLs and 1000 rows/second when it doesn't (database in the cloud, high network latency). In the meantime, does anyone have any workarounds that can be used? I'm looking at either changing all NULLs in the first row to a placeholder value and then NULLing them back in the DB afterwards, or inserting a fake first row and then immediately deleting it afterwards. I assume there's no other way around this? |
@nevado - You could use the bcp utility to either upload the data directly to a table or upload to a temporary table and then do a MERGE into the "real" table. |
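As a rough sketch of that approach (the server name, credentials, file, and table names below are all placeholders): bulk-load the CSV into a staging table with bcp, then MERGE it into the target table.

import subprocess

import pyodbc

# Load the CSV into a staging table using the bcp utility (character mode, comma-delimited).
subprocess.run(
    [
        "bcp", "mydb.dbo.staging_table", "in", "data.csv",
        "-S", "myserver", "-U", "myuser", "-P", "mypassword",
        "-c", "-t,",
    ],
    check=True,
)

# Then merge the staged rows into the "real" table.
cnxn = pyodbc.connect("DSN=my_dsn", autocommit=True)  # placeholder connection
cnxn.execute("""
    MERGE dbo.real_table AS tgt
    USING dbo.staging_table AS src ON tgt.id = src.id
    WHEN MATCHED THEN UPDATE SET tgt.txt = src.txt
    WHEN NOT MATCHED THEN INSERT (id, txt) VALUES (src.id, src.txt);
""")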
There is also the JSON trick described here. |
I think some of the ideas here could work. It does seem like the fast_executemany version could be sped up by passing the SQL type of the None parameters up front.

The reason it doesn't matter for non-None values is that we will already be sending a real int or a real float in the bound buffer. We should also make sure that the known types can be used on subsequent batches (within the same executemany call).

For the non-fast_executemany path, it does seem like setinputsizes should be good, even if it is cumbersome. In the non-fast_executemany version, the GetParamType function is called when a NULL parameter is encountered.

Ideally I'd like to integrate the two code paths more in future versions. As was pointed out, since it is working, I'd like to postpone this to version 5.1; the 5.0 release comes first. |
+1 if this means that the fast_executemany code could "learn as it goes" instead of inferring the type information from the first row and using that for the whole batch. I recently revisited this issue via the SQLAlchemy/pandas discussion sqlalchemy/sqlalchemy#9436 (comment) and re-ran some tests using the sample data that @nelsonwcf provided here. The ODBC trace showed that once the INSERT INTO statement had been sent to the server, the fast_executemany code called SQLDescribeParam for the parameters. |
As I noted in #996, the right answer is probably to scan rows before binding. That is, instead of binding based on the first row, record the types of the first row. Then refine those types and lengths as more rows are scanned. When a row is found that is incompatible with the types, bind and execute the scanned rows. If all scanned rows had None, we'd still need to call SQLDescribeParam, but if any had a value we would deduce the SQL type from that value. It would also ensure we use the longest value for any length parameters. Thoughts? I'd like to drop Python 2 support first, so let's wait for 5.x which I'd like to release soon. |
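To make the proposal concrete, here is a rough Python-level sketch of that scan-then-bind flow (pyodbc's real implementation is C++; detect_c_type and execute_batch are hypothetical stand-ins used only to show the control flow):

def detect_c_type(value):
    # Hypothetical helper: map a non-None Python value to an ODBC C buffer type name.
    return {str: "SQL_C_WCHAR", bytes: "SQL_C_BINARY",
            int: "SQL_C_SBIGINT", float: "SQL_C_DOUBLE"}[type(value)]

def scan_then_bind(rows, execute_batch):
    # Accumulate rows while their types stay compatible; bind and execute when they don't.
    # If every scanned row had None for a column, a real implementation would fall back
    # to SQLDescribeParam for that column, as described above.
    col_types, batch = {}, []
    for row in rows:
        conflict = any(
            value is not None and i in col_types and col_types[i] != detect_c_type(value)
            for i, value in enumerate(row)
        )
        if conflict:
            execute_batch(batch, col_types)   # flush everything scanned so far
            col_types, batch = {}, []
        for i, value in enumerate(row):       # record/refine types from this row
            if value is not None:
                col_types[i] = detect_c_type(value)
        batch.append(row)
    if batch:
        execute_batch(batch, col_types)

# Example: the type conflict in the third row causes one flush of the first two rows.
scan_then_bind([(1, None), (2, "a"), (3, b"\x00")], lambda b, t: print(len(b), t))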
Do you mean scan all the rows!? That will certainly not be fast, especially if there are a lot of rows, and seems to defeat the purpose of fast_executemany. |
What I mean by having the fast_executemany code "learn as it goes" is illustrated by the following example:

import pyodbc
cnxn = pyodbc.connect("DSN=mssql_199;UID=scott;PWD=tiger^5HHH", autocommit=True)
crsr = cnxn.cursor()
crsr.execute("DROP TABLE IF EXISTS foo")
crsr.execute("CREATE TABLE foo (col1 varchar(max), col2 varchar(max))")
data = [
("a", None),
(None, "b"),
]
sql = "INSERT INTO foo (col1, col2) VALUES (?, ?)"
crsr.fast_executemany = True
crsr.executemany(sql, data)

SQL.LOG shows that when processing the first row, pyodbc determines the column types to be SQL_C_WCHAR and SQL_C_BINARY:
When it goes to process the second row it sees the wrong type for col2 and re-scans the row, determining the column types to be SQL_C_BINARY and SQL_C_WCHAR:
That is, it "forgets" the correct binding for col1, so the next time col1 is non-null it will detect the wrong type (SQL_C_BINARY) and trigger another re-scan. If we're lucky and hit a re-scan where all the columns are non-null — and hence all of the C types are correct — then we should be fine for the rest of the rows (barring any weird heterogeneous data in a given column).

A possible solution would be to store the column types with an extra attribute indicating whether that type was a "guess" based on a null value. For example, after the scan of the first row, instead of something like (the C++ equivalent of)

{"col1": SQL_C_WCHAR, "col2": SQL_C_BINARY}

it would be

{"col1": {"c_type": SQL_C_WCHAR, "guess_for_null": False}, "col2": {"c_type": SQL_C_BINARY, "guess_for_null": True}}

Then, when it hits the null value for col1 in the second row it will remember the previous binding for col1 and not overwrite it with a wrong guess. After re-scanning for the second row the type map would be

{"col1": {"c_type": SQL_C_WCHAR, "guess_for_null": False}, "col2": {"c_type": SQL_C_WCHAR, "guess_for_null": False}} |
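A minimal Python-level sketch of that bookkeeping (illustrative only; detect_c_type is the same hypothetical helper idea as in the earlier sketch, and the real code would be C++):

def update_type_map(type_map, row):
    # Remember confident bindings; only overwrite entries that were guesses for NULL.
    rescan_needed = False
    for col, value in enumerate(row):
        if value is None:
            if col not in type_map:
                # No information yet: guess a binding, but remember it is only a guess.
                type_map[col] = {"c_type": "SQL_C_BINARY", "guess_for_null": True}
            continue
        detected = detect_c_type(value)          # hypothetical helper, as above
        entry = type_map.get(col)
        if entry is None or entry["guess_for_null"]:
            if entry is not None and entry["c_type"] != detected:
                rescan_needed = True             # a guessed binding turned out wrong
            type_map[col] = {"c_type": detected, "guess_for_null": False}
        elif entry["c_type"] != detected:
            rescan_needed = True                 # genuinely heterogeneous column data
    return rescan_needed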
Maybe create a 5.1 milestone and add this to it @mkleehammer ? |
Are there any updates on progress relative to this issue? I have a use case where I am inserting 18k rows, and with None values present the insert is dramatically slower.

In my use case, it may be possible for all rows of a particular column to contain None. The progressive enhancement concept, if I understand the proposal, would assume at least one row for a given column is non-None so that the correct type can eventually be detected.

To state it differently: if the progressive enhancement may still hit use cases where performance is < 50 rows per second, that is far too slow for our needs, and we'd likely opt for a fake first row as @nelsonwcf describes in the original posting. I would prefer to make use of an API like setinputsizes if that meant the performance was consistently > 1000 rows per second or better. |
@abillingsley - Have you tried the JSON trick mentioned in the comment above? |
I did a quick POC using the JSON trick. Performance is better than fast_executemany with Nones, but still slower than without them:

31 seconds is obviously better than the out-of-the-box executemany option, but still not as good as if executemany were performed without None values. |
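For anyone curious, a minimal sketch of that JSON approach (table, column names, and connection string are placeholders; OPENJSON requires SQL Server 2016 or later):

import json

import pyodbc

cnxn = pyodbc.connect("DSN=my_dsn", autocommit=True)  # placeholder connection
crsr = cnxn.cursor()

rows = [{"id": 1, "txt": None}, {"id": 2, "txt": "abc"}]

# Send all rows as one JSON document; the server decides the column types via the
# WITH clause, so NULLs never go through pyodbc's parameter type detection.
crsr.execute(
    """
    INSERT INTO mytable (id, txt)
    SELECT id, txt
    FROM OPENJSON(?) WITH (id int '$.id', txt varchar(50) '$.txt')
    """,
    json.dumps(rows),
)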
@abillingsley - The performance hit depends on how many "re-scans" take place. One particularly unfortunate circumstance is a table structure with two interdependent columns where only one of them can be non-null. A re-scan is triggered each time the null value swaps from one column to the other, and no row can have non-null values in both columns so the issue never gets resolved as the rows are processed. If you're curious about what is triggering re-scans with your data you could experiment with the little routine I posted here. If you find that a couple of columns are triggering the majority of re-scans then you could mitigate the effect by sorting your source data on those columns to keep the blocks of null values together. |
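For example, a minimal sketch of that sorting mitigation, assuming row_data is the list of parameter tuples and that the column indexes below are the (hypothetical) ones found to trigger most re-scans:

rescan_columns = [3, 7]  # hypothetical indexes identified from your own analysis

# Group rows with the same NULL pattern in those columns so the C-type changes
# pyodbc sees are limited to a few transitions instead of one per row.
row_data.sort(key=lambda row: tuple(row[i] is None for i in rescan_columns))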
Environment
To diagnose, we usually need to know the following, including version numbers. On Windows, be sure to specify 32-bit Python or 64-bit:
Issue
There's a drastic slowdown in parameterized inserts when None/NaN values are present in the list, usually around two orders of magnitude.
I already browsed through issue #213, but it's old and still open. I also read through https://github.com/mkleehammer/pyodbc/wiki/Binding-Parameters, which gave me the idea of replacing the Nones in the first line with valid data; this somewhat fixes the issue (but forces me to run an UPDATE query at the end of the INSERT statement to fix the 'substitute' NULLs).
Also, I've tried pyodbc.set_none_binding(pyodbc.SQL_VARCHAR) as suggested in the Proposal 1 workaround, but the current pyodbc doesn't recognize this as a property or method.
What is the current status of the NULL/None issue when inserting in SQL Server? Are there any developments?