
Passing None to SQL Server INSERTs drastically slows down inserts #741

Open
nelsonwcf opened this issue Apr 14, 2020 · 47 comments
Comments

@nelsonwcf

nelsonwcf commented Apr 14, 2020

Environment

To diagnose, we usually need to know the following, including version numbers. On Windows, be
sure to specify 32-bit Python or 64-bit:

  • Python: 3.7.7
  • pyodbc: 4.0.30
  • OS: Windows 10 x64 [Version 10.0.18362.720]
  • DB: SQL Server 13.0.5698.0
  • driver: ODBC Driver 17 for SQL Server (2017.175.01.01)

Issue

There's a drastic slowdown in parameterized inserts when None/NaN values are present in the list, usually of two orders of magnitude.

I have already browsed through issue #213, but it's old and still open. I also read through https://github.com/mkleehammer/pyodbc/wiki/Binding-Parameters, which gave me the idea of replacing the Nones in the first row with valid data; that somewhat fixes the issue (but forces me to run an UPDATE query after the INSERT to turn the 'substitute' values back into NULLs).

Also, I've tried pyodbc.set_none_binding(pyodbc.SQL_VARCHAR), as suggested in the Proposal 1 workaround, but the current pyodbc doesn't recognize this as a property or method.

What is the current status of the NULL/None issue when inserting in SQL Server? Are there any developments?

@gordthompson
Collaborator

Have you tried using setinputsizes to see if that helps improve performance?

@nelsonwcf
Author

I didn't at first but tried your suggestion using the following map:

    csr.setinputsizes([
        (pyodbc.SQL_VARCHAR, 128, 0)
        , (pyodbc.SQL_VARCHAR, 64, 0)
        , (pyodbc.SQL_VARCHAR, 64, 0)
        , (pyodbc.SQL_VARCHAR, 64, 0)
        , (pyodbc.SQL_TYPE_TIMESTAMP, 7, 0)
        , (pyodbc.SQL_REAL, 4, 0)
        , (pyodbc.SQL_VARCHAR, 64, 0)
        , (pyodbc.SQL_VARCHAR, 64, 0)
        , (pyodbc.SQL_VARCHAR, 512, 0)
        , (pyodbc.SQL_VARCHAR, 64, 0)
        , (pyodbc.SQL_VARCHAR, 64, 0)
        , (pyodbc.SQL_VARCHAR, 64, 0)
        , (pyodbc.SQL_SMALLINT, 2, 0)
        , (pyodbc.SQL_VARCHAR, 64, 0)
        , (pyodbc.SQL_VARCHAR, 64, 0)
        , (pyodbc.SQL_BIT, 1, 0)
        , (pyodbc.SQL_BIT, 1, 0)
    ])

Unfortunately nothing changes. I'm currently using a workaround in which I make all values in the first row of the data non-None. This increases the speed from around 70 records/second to 7-8k records/second. While I don't know the internals of pyodbc, it seems to use the first row's values to determine the data types; setting setinputsizes beforehand has no effect.
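
A variation on that workaround, sketched below, is to move a fully populated row (if one exists) to the front instead of substituting values. The helper is hypothetical, not part of pyodbc, and only changes row order:

    def put_complete_row_first(rows):
        """Move a row with no None values to the front, if one exists.

        Only the row order changes, so this is safe for a plain INSERT where
        order does not matter. If every row contains a None, the data is
        returned unchanged and the slowdown will still occur.
        """
        for i, row in enumerate(rows):
            if all(value is not None for value in row):
                rows[0], rows[i] = rows[i], rows[0]
                return rows
        return rows

    data = put_complete_row_first(data)
    csr.fast_executemany = True
    csr.executemany(sql, data)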

@gordthompson
Collaborator

Can we assume that you're using fast_executemany = True?

@nelsonwcf
Author

Yes.

    with pyodbc.connect(cnn_str) as cnn:
        with cnn.cursor() as csr:
            csr.fast_executemany = True
            csr.executemany(sql, data)

I'm also using the latest ODBC Driver 17 for SQL Server (but previously was using the 13 instead and both show the same issue).

@nelsonwcf
Author

I could provide the serialized object of the list I'm inserting into SQL Server, along with the table structure, if that would help.

@v-chojas
Contributor

Could you post an ODBC trace? Since you mention performance I guess there are lots of rows to insert but you can truncate the trace after it starts to become repetitive.

@nelsonwcf
Author

nelsonwcf commented Apr 17, 2020

Sure. Here they are. I'm sending you two log files generated by a sample of 1000 records.

  • In the file "SQL_with_None", I inserted the rows of the dataset as they were, without removing any Nones. It generated a 41k log file.

  • In the file "SQL_without_Nones", I manually changed the Nones in the first record only to valid values (for example, for a varchar column I replaced the None with a '_' character, a 0 for a smallint, a 0 for a bit, etc.). The generated log was 797 lines.

SQL_without_Nones.zip
SQL_with_None.zip

If you want, I can provide the dataset I used.

For some reason, not having Nones in the first record makes a huge difference.

@gordthompson
Collaborator

The first difference I notice is at line 608: "without_Nones" is calling SQLBindParameter with SQL_C_WCHAR while "with_None" is calling it with SQL_C_BINARY.

@nelsonwcf
Author

Is there anything else I can provide that would help your assessment? I've spent half an hour looking at the trace files, but they are beyond my ability to debug. If this is relevant: in the article https://github.com/mkleehammer/pyodbc/wiki/Binding-Parameters, Michael Kleehammer says that for SQL Server "None could be a NULL of any type. VARCHAR can be converted to most types, but SQL Server won't convert it to binary, so we get an error inserting into binary columns."

Could this be related?

@v-chojas
Contributor

In fast_executemany mode, pyODBC attempts to insert as many rows at once as possible by allocating the entire (2-dimensional) array of parameters, binding them, converting the Python objects into the ODBC C representation needed by the driver, and then executing. Without Nones, the log shows a single execution with 1001 rows. With Nones, the types it detected for the first row are ultimately incorrect (as Gord above mentioned, it chose binary instead of character) and thus has to re-detect each time it finds a new type which doesn't match the previous ones for the column, so you get many smaller execution batches.

@gordthompson
Collaborator

gordthompson commented Apr 20, 2020

@v-chojas - Does the fast_executemany code really ignore setinputsizes, at least with regard to dealing with null (None) values?

@nelsonwcf
Author

nelsonwcf commented Apr 24, 2020

Yes, it does ignore it. I used the following code:

with pyodbc.connect(cnn_str) as cnn:
    with cnn.cursor() as csr:
        csr.setinputsizes([
            (pyodbc.SQL_VARCHAR, 128, 0)
            , (pyodbc.SQL_VARCHAR, 64, 0)
            , (pyodbc.SQL_VARCHAR, 64, 0)
            , (pyodbc.SQL_VARCHAR, 64, 0)
            , (pyodbc.SQL_TYPE_TIMESTAMP, None, 0)
            , (pyodbc.SQL_REAL, 4, 7)
            , (pyodbc.SQL_VARCHAR, 64, 0)
            , (pyodbc.SQL_VARCHAR, 64, 0)
            , (pyodbc.SQL_VARCHAR, 512, 0)
            , (pyodbc.SQL_VARCHAR, 64, 0)
            , (pyodbc.SQL_VARCHAR, 64, 0)
            , (pyodbc.SQL_VARCHAR, 64, 0)
            , (pyodbc.SQL_SMALLINT, None, 0)
            , (pyodbc.SQL_VARCHAR, 64, 0)
            , (pyodbc.SQL_VARCHAR, 64, 0)
            , (pyodbc.SQL_BIT, None, 0)
        ])
        csr.fast_executemany = True
        csr.executemany(sql, data)

It works the same as without .setinputsizes, and I know the .setinputsizes values are being read, because if I set a precision to None it raises an exception: pyodbc.Error: ('HY104', '[HY104] [Microsoft][ODBC Driver 17 for SQL Server]Invalid precision value (0) (SQLBindParameter)')

The table structure is as follows (CREATE TABLE statement):
CREATE TABLE [iu].[poc_tmp](
[iu_methodpath] varchar NULL,
[iu_txnid] varchar NULL,
[iu_accountnumber] varchar NULL,
[iu_tmsid] varchar NULL,
[iu_datetime] datetime2 NULL,
[iu_duration] [real] NULL,
[iu_market] varchar NULL,
[iu_macaddress] varchar NULL,
[iu_accesspoint] varchar NULL,
[iu_controller] varchar NULL,
[iu_resptype] varchar NULL,
[iu_client] varchar NULL,
[iu_cpid] [smallint] NULL,
[iu_vodproductname] varchar NULL,
[iu_vodprovidername] varchar NULL,
[iu_language] [bit] NULL
)

Is the behavior explained by @v-chojas expected in the None situation? If so, and as @gordthompson suggested, wouldn't setting the types and sizes before execution be the solution?

Based on the previous comments, this does seem to be a bug, right? Do you want me to provide the tracing with setinputsizes declared before the execution?

@v-chojas
Contributor

Even if you use setinputsizes, which sets the SQL type, I suspect a None cannot be detected as any valid C type for the binding, so it causes the code to go through the redetect path. Yes, a trace with setinputsizes would be useful.

@nelsonwcf
Author

Here is the log with setinputsizes (as I described above). Insertion of 999 records.

SQL.zip

@nelsonwcf
Author

@v-chojas

One question: why do we avoid problems only when the first row has no Nones? If the second row is the one with no Nones, the entire insert flow is still impacted. I mean, shouldn't pyodbc correctly bind all the fields once it finds a non-None value? Why does it keep re-detecting even after it finds a valid record? Why is the first row special?

@v-chojas
Contributor

v-chojas commented May 7, 2020

pyODBC can only bind based on the information it has so far; if a column was bound as C type X, then any further cells that can't be converted to that type will require re-detecting and rebinding, which means submitting what it has accumulated so far.

@nelsonwcf
Author

Since pyODBC can't bind None because SQL Server has different types of NULLs for different types of fields, the alternative would be to tell pyodbc beforehand what the type is. Theoretically, the way to do that would be to use .setinputsizes(). Since this doesn't work in practice, can this be considered a bug to be worked on in the future?

@gordthompson
Collaborator

@v-chojas - sqlext.h has mappings between C types and SQL types starting here:

https://github.com/microsoft/ODBC-Specification/blob/b7ef71fba508ed010cd979428efae3091b732d75/Windows/inc/sqlext.h#L548

If we provide the SQL types via setinputsizes then couldn't pyodbc look up the C types that way?

@v-chojas
Contributor

The C type must match the Python type, because it determines the data format of the bound buffer, and that also explains why in fast executemany mode, pyODBC must scan the first row to find out what the Python type of each column is (and why Nones are such a problem.)

@gordthompson
Collaborator

pyODBC must scan the first row to find out what the Python type of each column is

No, not if I've already told it what to expect by calling setinputsizes. Sure, if you want to support a "zero configuration" option that auto-detects and auto-rebinds, that's fine. But if I explicitly tell you what the types are then you should take my word for it.

Or to put it another way, it sounds like the behaviour is to look at each column value in the current row and if it's None then assume that it is the same type as the corresponding column of the previous row. If so, then just use the information from setinputsizes to pre-load that list of column types instead of using the data values in the first row.

@v-chojas
Contributor

setinputsizes is used for setting the SQL type, not the C type.

@gordthompson
Collaborator

gordthompson commented May 19, 2020

setinputsizes is used for setting the SQL type, not the C type.

Yes, but from this:

https://github.com/microsoft/ODBC-Specification/blob/b7ef71fba508ed010cd979428efae3091b732d75/Windows/inc/sqlext.h#L548-L611

I was able to derive this:

SQL_type                           C_type                              Comment                              
---------------------------------  ----------------------------------  -------------------------------------
(SQL_BIGINT+SQL_SIGNED_OFFSET)     SQL_C_SBIGINT                       /* SIGNED BIGINT */                  
(SQL_BIGINT+SQL_UNSIGNED_OFFSET)   SQL_C_UBIGINT                       /* UNSIGNED BIGINT */                
(SQL_C_LONG+SQL_SIGNED_OFFSET)     SQL_C_SLONG                         /* SIGNED INTEGER  */                
(SQL_C_LONG+SQL_UNSIGNED_OFFSET)   SQL_C_ULONG                         /* UNSIGNED INTEGER*/                
(SQL_C_SHORT+SQL_SIGNED_OFFSET)    SQL_C_SSHORT                        /* SIGNED SMALLINT */                
(SQL_C_SHORT+SQL_UNSIGNED_OFFSET)  SQL_C_USHORT                        /* UNSIGNED SMALLINT*/               
(SQL_TINYINT+SQL_SIGNED_OFFSET)    SQL_C_STINYINT                      /* SIGNED TINYINT  */                
(SQL_TINYINT+SQL_UNSIGNED_OFFSET)  SQL_C_UTINYINT                      /* UNSIGNED TINYINT*/                
SQL_BINARY                         SQL_C_BINARY                                                             
SQL_BIT                            SQL_C_BIT                                                                
SQL_C_UBIGINT                      SQL_C_BOOKMARK                      /* BOOKMARK        */                
SQL_C_ULONG                        SQL_C_BOOKMARK                      /* BOOKMARK        */                
SQL_CHAR                           SQL_C_CHAR                          /* CHAR, VARCHAR, DECIMAL, NUMERIC */
SQL_DATE                           SQL_C_DATE                                                               
SQL_DOUBLE                         SQL_C_DOUBLE                        /* FLOAT, DOUBLE                */   
SQL_GUID                           SQL_C_GUID                                                               
SQL_INTEGER                        SQL_C_LONG                          /* INTEGER                      */   
SQL_INTERVAL_DAY                   SQL_C_INTERVAL_DAY                                                       
SQL_INTERVAL_DAY_TO_HOUR           SQL_C_INTERVAL_DAY_TO_HOUR                                               
SQL_INTERVAL_DAY_TO_MINUTE         SQL_C_INTERVAL_DAY_TO_MINUTE                                             
SQL_INTERVAL_DAY_TO_SECOND         SQL_C_INTERVAL_DAY_TO_SECOND                                             
SQL_INTERVAL_HOUR                  SQL_C_INTERVAL_HOUR                                                      
SQL_INTERVAL_HOUR_TO_MINUTE        SQL_C_INTERVAL_HOUR_TO_MINUTE                                            
SQL_INTERVAL_HOUR_TO_SECOND        SQL_C_INTERVAL_HOUR_TO_SECOND                                            
SQL_INTERVAL_MINUTE                SQL_C_INTERVAL_MINUTE                                                    
SQL_INTERVAL_MINUTE_TO_SECOND      SQL_C_INTERVAL_MINUTE_TO_SECOND                                          
SQL_INTERVAL_MONTH                 SQL_C_INTERVAL_MONTH                                                     
SQL_INTERVAL_SECOND                SQL_C_INTERVAL_SECOND                                                    
SQL_INTERVAL_YEAR                  SQL_C_INTERVAL_YEAR                                                      
SQL_INTERVAL_YEAR_TO_MONTH         SQL_C_INTERVAL_YEAR_TO_MONTH                                             
SQL_NUMERIC                        SQL_C_NUMERIC                                                            
SQL_REAL                           SQL_C_FLOAT                         /* REAL                         */   
SQL_SMALLINT                       SQL_C_SHORT                         /* SMALLINT                     */   
SQL_TIME                           SQL_C_TIME                                                               
SQL_TIMESTAMP                      SQL_C_TIMESTAMP                                                          
SQL_TINYINT                        SQL_C_TINYINT                                                            
SQL_TYPE_DATE                      SQL_C_TYPE_DATE                                                          
SQL_TYPE_TIME                      SQL_C_TYPE_TIME                                                          
SQL_TYPE_TIME_WITH_TIMEZONE        SQL_C_TYPE_TIME_WITH_TIMEZONE                                            
SQL_TYPE_TIMESTAMP                 SQL_C_TYPE_TIMESTAMP                                                     
SQL_TYPE_TIMESTAMP_WITH_TIMEZONE   SQL_C_TYPE_TIMESTAMP_WITH_TIMEZONE

Not sure about the ones with the brackets, but the most common players are present and there are no duplicates in either the "SQL_type" or "C_type" columns.

@keitherskine
Collaborator

I would go even further and suggest that if the parameter classes in the provided Python parameters are not the same (for each field, ignoring None), then executemany should error out immediately.

params = [ ('A', 1), ('B', 2  ) ]  # OK, all field types are the same (or None)
params = [ ('A', 1), ('B', 2.0) ]  # not OK because 1 is an int and 2.0 is a float

It doesn't seem unreasonable to insist the parameter types for each field are the same.
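
A rough sketch of what such a check could look like on the caller's side (a hypothetical helper, not an existing pyodbc API):

    def check_homogeneous_columns(params):
        """Raise if any column mixes Python types (ignoring None values)."""
        seen = {}  # column index -> first non-None type encountered
        for row_num, row in enumerate(params):
            for col, value in enumerate(row):
                if value is None:
                    continue
                expected = seen.setdefault(col, type(value))
                if type(value) is not expected:
                    raise TypeError(
                        f"row {row_num}, column {col}: got {type(value).__name__}, "
                        f"expected {expected.__name__}"
                    )

    check_homogeneous_columns([('A', 1), ('B', 2)])    # OK
    check_homogeneous_columns([('A', 1), ('B', 2.0)])  # raises TypeError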

@v-chojas
Contributor

The C type needs to match the Python type, or it will result in incorrect data inserted; you've only found the default conversions but the driver can do others.

Python is a dynamic language, so it's absolutely normal to have heterogeneous types within a single column as long as they are convertible.

@gordthompson
Collaborator

@nelsonwcf - I created an MCVE hoping to reproduce the issue but it didn't. The following code inserts 1_000_000 rows in 53 seconds, regardless of whether the None is in the first or second row.

import os
import time

import pyodbc

cnxn = pyodbc.connect(
    "DRIVER=ODBC Driver 17 for SQL Server;"
    "SERVER=192.168.0.199;"
    "DATABASE=mydb;"
    "UseFMTONLY=yes;"
    f"UID=sa;PWD={os.getenv('sa_PWD')};",
    autocommit=True,
)
crsr = cnxn.cursor()
crsr.execute("CREATE TABLE #tmp (id int PRIMARY KEY, txt varchar(50))")

rows_to_insert = 1_000_000
row_index_for_null = 0
row_data = [
    (x, (None if x == row_index_for_null else f"row{x}"),)
    for x in range(rows_to_insert)
]

sql = "INSERT INTO #tmp (id, txt) VALUES (?, ?)"
crsr.fast_executemany = True
t0 = time.time()
crsr.executemany(sql, row_data)
print(f"{rows_to_insert:,} rows inserted in {(time.time() - t0):0.1f} seconds")

@nelsonwcf
Author

Yes, I tested and confirm that your code didn't show the issue even though the first row has a None. Maybe there's an additional hidden criterion required for the issue to appear? I will prepare a small dataset that shows the problem and share it here (a sample from the one I use at my job).

@nelsonwcf
Author

I think the best way to avoid guessing at what the issue requires is to use the original file that exposed the problem. I'm attaching a 999-record CSV, the one I used to generate the ODBC trace log files. It takes 1 second to insert all records when no Nones are present in the first row, and around 20 seconds otherwise.

Maybe this will help?
sample_test.zip

@gordthompson
Collaborator

@nelsonwcf – Thanks for posting the sample data. On reviewing the issue I noticed two things:

  1. Your original post talks about "when None/Nan are present", and the mention of NaN suggests that you may have been trying to use pandas to_sql to upload the data. Is that the case?

  2. In your follow-up post you provide the DDL for the table, which begins with CREATE TABLE [iu].[poc_tmp]. Is [iu] the default schema for the user under which the insert code is running? We've seen other cases of things getting a bit weird when working with an object that does not belong to the default schema. Issue #595 (Queries with Table Value Parameters (TVP) when the Type table is not in the dbo schema) involves TVPs, but TVPs and .executemany use the same approach for packing data values into a parameter array (ref: issue #601, Table-Valued Parameters (TVPs) with fast_executemany).

@nelsonwcf
Author

nelsonwcf commented Jun 3, 2020

  1. That was when I found the issue. I noticed the problem using pandas, with fast_executemany enabled on the SQLAlchemy engine. After I noticed the problem, however, I stopped using pandas with SQLAlchemy altogether.

  2. Yes, I did execute the insert statement using the [iu] schema. The default, dbo, is reserved on the SQL Server I use. But if that were the cause, wouldn't your code to insert 1M records also show the issue, since I modified it to use the iu schema? I had to modify it to be able to test on the same server where I noticed the problem.

@gordthompson
Collaborator

I am able to sort-of-reproduce your issue, although in my case the performance penalty for null(s) in the first row is more like a factor of 2, not 20:

sample_test_no_null_first_line.csv

  • 999 rows inserted in 0.9 seconds
  • 646 packets, 563019 bytes

sample_test.csv

  • 999 rows inserted in 1.7 seconds
  • 922 packets, 585162 bytes

The "packets" and "bytes" numbers come from Wireshark. Interesting that the second case is only generating 4% more network traffic, but it's taking 43% more packets to get the job done.

@nelsonwcf
Author

Did you try to compare the odbc tracing for both as well?

@gordthompson
Collaborator

Did you try to compare the odbc tracing for both as well?

No, because we already know from the ODBC trace logs you posted earlier that "without_Nones" calls the ODBC SQLExecute function once, while "with_None" calls that same function 91 times. With two packets per call (send and ack), the extra 90 calls represent at least 180 packets, which accounts for much of the increase in packet counts between my two runs.

@jordantshaw

Are there any planned fixes for this bug in upcoming releases?

@v-chojas
Contributor

v-chojas commented Nov 5, 2020

I do not think it is a bug: it is just slower but still working, and in any case, no slower than without fast_executemany.

@nevado

nevado commented Nov 6, 2021

Are there any plans or ongoing work to fix this bug? The performance impact is huge: I'm seeing bulk inserts of 25 rows/second when the first row contains NULLs and 1000 rows/second when it doesn't (database in the cloud, high network latency).

In the meantime, does anyone have any workarounds that can be used? I'm looking at either changing all NULLs in the first row to a placeholder value and then NULLing them back in the DB afterwards, or inserting a fake first row and then immediately deleting it; I assume there's no other way around this?
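
Something like this is what I have in mind for the first option (sketch only, assuming a hypothetical table t with two nullable varchar columns and a sentinel value that cannot occur in the real data):

    SENTINEL = "\x01"  # assumed never to appear in the real data

    first = tuple(SENTINEL if v is None else v for v in data[0])
    patched = [first] + list(data[1:])

    crsr.fast_executemany = True
    crsr.executemany("INSERT INTO t (col1, col2) VALUES (?, ?)", patched)
    # Put the NULLs back where the sentinel was inserted.
    crsr.execute("UPDATE t SET col1 = NULL WHERE col1 = ?", SENTINEL)
    crsr.execute("UPDATE t SET col2 = NULL WHERE col2 = ?", SENTINEL)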

@gordthompson
Collaborator

@nevado - You could use the bcp utility to either upload the data directly to a table or upload to a temporary table and then do a MERGE into the "real" table.

@gordthompson
Collaborator

There is also the JSON trick described here.
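
The gist of it: send the whole batch as a single JSON string, so there is only one parameter and Nones never influence the binding. A minimal sketch, assuming SQL Server 2016+ (OPENJSON) and an illustrative table dbo.foo:

    import json

    rows = [{"id": 1, "txt": None}, {"id": 2, "txt": "b"}]
    sql = """
        INSERT INTO dbo.foo (id, txt)
        SELECT id, txt
        FROM OPENJSON(?) WITH (id int '$.id', txt varchar(50) '$.txt');
    """
    crsr.execute(sql, json.dumps(rows))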

@mkleehammer
Owner

I think some of the ideas here could work.

It does seem like the fastexecute version could be sped up by passing the SQL type of the
previous non-null parameter. In this very specific case, it doesn't matter that the types can
change as long as the type we pass for the NULL parameter is accepted by the server. For
example, if we have a single column with 3 rows containing int, null, and float, it doesn't
matter in which order they are passed:

  • 1, null, 2.3: Use 'int' as the type for null. Submit batch when float is detected and resume
    with float.
  • 2.3, null, 1: Use 'float' as the type for null. Submit batch when int is detected and resume
    with int.

The reason it doesn't matter is that we will already be sending a real int or a real float in
the previous row. If those are accepted, then the null will be too.

We should also make sure that the known types can be used on subsequent batches (within the
same executemany call) so that we'll have a valid type when the first row of those batches
contains NULL. The code may already do this.

For the non-fastexecute many, it does seem like setinputsizes should be good. It is cumbersome
and people need to remember to turn it back off by calling cursor.setinputsizes(None), but it
is possible to set the expected SQL type.

In the non-fastexecute version, the GetParamType function is called when a NULL parameter is
encountered, specifically for SQL Server. This maintains an array of SQL types for the current
statement and the type is retrieved from SQLDescribeParam. This should first check if the type
was provided by setinputsizes instead.

Ideally I'd like to integrate the two code paths more in future versions.

As was pointed out, since it is working, I'd like to postpone this to version 5.1. The 5.0
version will be a Python 3-only release with as few changes as possible, then we'll start the
improvements in 5.1.

@gordthompson
Collaborator

+1 if this means that the fast_executemany code could "learn as it goes" instead of inferring the type information from the first row and using that for the whole batch.

I recently revisited this issue via the SQLAlchemy/pandas discussion

sqlalchemy/sqlalchemy#9436 (comment)

and I re-ran some tests using the sample data that @nelsonwcf provided here

#741 (comment)

The ODBC trace showed that once the INSERT INTO statement had been sent to the server the fast_executemany code called SQLExecute 216 times. After changing all null values in just the first row of the data to something non-null, the fast_executemany code only called SQLExecute once.

@mkleehammer
Owner

As I noted in #996, the right answer is probably to scan rows before binding. That is, instead of binding based on the first row, record the types of the first row. Then refine those types and lengths as more rows are scanned. When a row is found that is incompatible with the types, bind and execute the scanned rows.

If all scanned rows had None, we'd still need to call SQLDescribeParam, but if any had a value we would deduce the SQL type from that value.

It would also ensure we use the longest value for any length parameters.
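
Purely to illustrate the scan-then-bind idea (the actual implementation is C++; this is just the bookkeeping expressed in Python terms):

    def scan_compatible_prefix(rows):
        """Scan rows until a value's Python type conflicts with what was seen.

        Returns (row_count, col_types, col_lengths): how many leading rows can
        share one set of bindings, the deduced Python type per column (None if
        only Nones were seen, in which case SQLDescribeParam is still needed),
        and the longest str/bytes length per column.
        """
        col_types = [None] * len(rows[0])
        col_lengths = [0] * len(rows[0])
        for row_count, row in enumerate(rows):
            for i, value in enumerate(row):
                if value is None:
                    continue
                if col_types[i] is None:
                    col_types[i] = type(value)
                elif type(value) is not col_types[i]:
                    # Incompatible value: bind and execute rows[:row_count],
                    # then start scanning a new batch from this row.
                    return row_count, col_types, col_lengths
                if isinstance(value, (str, bytes)):
                    col_lengths[i] = max(col_lengths[i], len(value))
        return len(rows), col_types, col_lengths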

Thoughts?

I'd like to drop Python 2 support first, so let's wait for 5.x which I'd like to release soon.

@v-chojas
Contributor

Do you mean scan all the rows!? That will certainly not be fast, especially if there are a lot of rows, and seems to defeat the purpose of fast_executemany.

@gordthompson
Collaborator

What I mean by having the fast_executemany code "learn as it goes" is illustrated by the following example:

import pyodbc

cnxn = pyodbc.connect("DSN=mssql_199;UID=scott;PWD=tiger^5HHH", autocommit=True)
crsr = cnxn.cursor()

crsr.execute("DROP TABLE IF EXISTS foo")
crsr.execute("CREATE TABLE foo (col1 varchar(max), col2 varchar(max))")

data = [
    ("a", None),
    (None, "b"),
]
sql = "INSERT INTO foo (col1, col2) VALUES (?, ?)"
crsr.fast_executemany = True
crsr.executemany(sql, data)

SQL.LOG shows that when processing the first row, pyodbc determines the column types to be SQL_C_WCHAR and SQL_C_BINARY

issue_741       1cd4-1d8c   EXIT  SQLBindParameter  with return code 0 (SQL_SUCCESS)
        HSTMT               0x0000009556AB0080
        UWORD                        1 
        SWORD                        1 <SQL_PARAM_INPUT>
        SWORD                       -8 <SQL_C_WCHAR>
        SWORD                       12 <SQL_VARCHAR>
        SQLULEN                    0
        SWORD                        0 
        PTR                0x0000000000000010
        SQLLEN                    16
        SQLLEN *            0x0000000000000020 (BADMEM)

…

issue_741       1cd4-1d8c   EXIT  SQLBindParameter  with return code 0 (SQL_SUCCESS)
        HSTMT               0x0000009556AB0080
        UWORD                        2 
        SWORD                        1 <SQL_PARAM_INPUT>
        SWORD                       -2 <SQL_C_BINARY>
        SWORD                       12 <SQL_VARCHAR>
        SQLULEN                    0
        SWORD                        0 
        PTR                0x0000000000000028
        SQLLEN                    16
        SQLLEN *            0x0000000000000038 (BADMEM)

When it goes to process the second row it sees the wrong type for col2 and re-scans the row, determining the column types to be SQL_C_BINARY and SQL_C_WCHAR:

issue_741       1cd4-1d8c   EXIT  SQLBindParameter  with return code 0 (SQL_SUCCESS)
        HSTMT               0x0000009556AB0080
        UWORD                        1 
        SWORD                        1 <SQL_PARAM_INPUT>
        SWORD                       -2 <SQL_C_BINARY>
        SWORD                       12 <SQL_VARCHAR>
        SQLULEN                    0
        SWORD                        0 
        PTR                0x0000000000000010
        SQLLEN                    16
        SQLLEN *            0x0000000000000020 (BADMEM)

…

issue_741       1cd4-1d8c   EXIT  SQLBindParameter  with return code 0 (SQL_SUCCESS)
        HSTMT               0x0000009556AB0080
        UWORD                        2 
        SWORD                        1 <SQL_PARAM_INPUT>
        SWORD                       -8 <SQL_C_WCHAR>
        SWORD                       12 <SQL_VARCHAR>
        SQLULEN                    0
        SWORD                        0 
        PTR                0x0000000000000028
        SQLLEN                    16
        SQLLEN *            0x0000000000000038 (BADMEM)

That is, it "forgets" the correct binding for col1, so the next time col1 is non-null it will detect the wrong type (SQL_C_BINARY) and trigger another re-scan. If we're lucky and hit a re-scan where all the columns are non-null — and hence all of the C types are correct — then we should be fine for the rest of the rows (barring any weird heterogeneous data in a given column).

A possible solution would be to store the column types with an extra attribute indicating whether that type was a "guess" based on a null value. For example, after the scan of the first row instead of something like (the C++ equivalent of)

{"col1": SQL_C_WCHAR, "col2": SQL_C_BINARY}

it would be

{"col1": {"c_type": SQL_C_WCHAR, "guess_for_null": False}, "col2": {"c_type": SQL_C_BINARY, "guess_for_null": True}}

Then, when it hits the null value for col1 in the second row it will remember the previous binding for col1 and not overwrite it with a wrong guess. After re-scanning for the second row the type map would be

{"col1": {"c_type": SQL_C_WCHAR, "guess_for_null": False}, "col2": {"c_type": SQL_C_WCHAR, "guess_for_null": False}}
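
In Python terms, the update rule for that type map might look roughly like this (illustrative only; the real logic would live in the C++ parameter-binding code):

    def update_type_map(type_map, col, detected_c_type, value_was_null):
        """Only let a new detection replace an entry that was a guess for null."""
        entry = type_map.get(col)
        if entry is None or entry["guess_for_null"]:
            type_map[col] = {"c_type": detected_c_type,
                             "guess_for_null": value_was_null}
        # An entry detected from a real (non-null) value is kept as-is.

    type_map = {}
    # First row ("a", None): col1 confirmed WCHAR, col2 guessed BINARY.
    update_type_map(type_map, "col1", "SQL_C_WCHAR", value_was_null=False)
    update_type_map(type_map, "col2", "SQL_C_BINARY", value_was_null=True)
    # Second row (None, "b"): col1 keeps WCHAR, col2 is corrected to WCHAR.
    update_type_map(type_map, "col1", "SQL_C_BINARY", value_was_null=True)
    update_type_map(type_map, "col2", "SQL_C_WCHAR", value_was_null=False)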

@gordthompson
Collaborator

As was pointed out, since it is working, I'd like to postpone this to version 5.1

Maybe create a 5.1 milestone and add this to it @mkleehammer ?

@abillingsley

Are there any updates on progress relative to this issue?

I have a use case where I am inserting 18k rows. With fast_executemany = True, the inserts occur at a rate of > 1000 per second as long as the first row contains no Nones, but < 50 per second if the first row contains a None.

In my use case, it may be possible for all rows of a particular column to contain None. The progressive enhancement concept, if I understand the proposal, assumes that at least one row for a given column is non-None so that the guess_for_null binding can eventually be resolved. I appreciate the automatic nature of the progressive enhancement approach, but for my use case, if being more explicit in the column definition improves performance significantly, it would be worth it.

To state it differently: if progressive enhancement may still leave use cases where performance is < 50 rows per second, that is far too slow for our needs, and we'd likely opt for a fake first row as @nelsonwcf describes in the original posting. I would prefer to make use of an API like setinputsizes if that meant performance was consistently 1000 rows per second or better.

@gordthompson
Collaborator

@abillingsley - Have you tried the JSON trick mentioned in the comment above?

@abillingsley

I did a quick POC using the JSON trick. Performance is better than fast_executemany = True with None.
In my case

  • fast_executemany = True with None takes around ~420 sec for 18382 rows
  • fast_executemany = True with non None first row takes around 12 sec
  • JSON Trick = 31sec

31 sec is obviously better than the out-of-the-box executemany option, but still not as good as executemany performed without Nones in the first row.

@gordthompson
Collaborator

gordthompson commented Sep 23, 2024

@abillingsley - The performance hit depends on how many "re-scans" take place. One particularly unfortunate circumstance is a table structure with two interdependent columns where only one of them can be non-null. A re-scan is triggered each time the null value swaps from one column to the other, and no row can have non-null values in both columns so the issue never gets resolved as the rows are processed.

If you're curious about what is triggering re-scans with your data you could experiment with the little routine I posted here. If you find that a couple of columns are triggering the majority of re-scans then you could mitigate the effect by sorting your source data on those columns to keep the blocks of null values together.
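
A rough sketch of that kind of routine (not the exact one linked above): bindings are "re-detected" from a row whenever a non-None value doesn't match the Python type currently assumed for its column. The data below is made up and models the two-interdependent-columns case described above.

    from collections import Counter

    def count_rescans(rows):
        """Rough approximation of what triggers fast_executemany re-scans."""
        def detect(row):
            return [None if v is None else type(v) for v in row]

        bindings = detect(rows[0])
        triggers = Counter()
        for row in rows[1:]:
            for col, value in enumerate(row):
                if value is not None and bindings[col] is not type(value):
                    triggers[col] += 1
                    bindings = detect(row)  # re-detect every column from this row
                    break
        return sum(triggers.values()), triggers

    rows = [("cash", None), (None, "4111"), ("cash", None), (None, "4242")]
    print(count_rescans(rows))  # (3, Counter({1: 2, 0: 1})) -- one re-scan per swap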
