Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restore binary offsets of PDOStatement parameters #5897

Open
wants to merge 4 commits into
base: 3.9.x
Choose a base branch
from

Conversation

mcurland
Copy link

@mcurland mcurland commented Feb 2, 2023

Entity identifiers that are PHP resource streams need to work for all references, not just the first one. PDOStatement->execute is reading the binary resources but not restoring the original offsets. No other code is restoring these streams to their original position so that they can be reused.

Examples of failures include empty collections on read (the first lazy load collection on an entity populates correctly, the second is empty) and foreign-key violations while persisting changes (the first entity join produces the correct SQL, the second has no data to read and the FK is violated with the missing binary data).

Making this change as close as possible to the external code that moves the stream pointer eliminates the need to do this in calling code.

This might also be an issue with other driver's Statement implementations. This request just handles the PDO driver. I have also not attempted to fix the deprecated bindParam code path. I do not believe this is called by the current Doctrine code, and is regardless much harder to patch because of the binding to variables that may or may not be populated when they are bound.

#5895

@derrabus
Copy link
Member

derrabus commented Feb 6, 2023

Please always add a functional test that reproduces the problem that you want to address.

@derrabus
Copy link
Member

derrabus commented Feb 26, 2023

Hello again. In order to consider your patch, I need a green CI (including PHPCS and static analysis) and a functional test that covers your change.

@derrabus
Copy link
Member

derrabus commented Mar 8, 2023

Not sure if you're reading any of my comments, but there's a tool called phpcbf in the vendor/bin directory which can fix most of the CS errors automatically.

@mcurland
Copy link
Author

mcurland commented Mar 8, 2023

I just saw your comment on what you needed and started with coding standards. Did I miss some inline comments? Up until now I've been testing integrated into another project (changing the file in the vendor directory), not directly in this one. I figured I'd need a couple of pushes, but the errors have come in piecemeal. I think it is clean now from a code standards perspective.

I am not even close to being a PHP programmer (>90% of the PHP code I generally use is generated, and over half of the rest is JSON-in-php, so I'm only writing a few non-declarative lines for each project and am only in PHP at all to use Doctrine, which was chosen because a previous client wanted me to plug into a PHP project). However, I'm several decades removed from being a newbie programmer (first production code was with Microsoft in 1989) and have also been a primary code reviewer on multiple projects for decades. I have to say some of the things in this set of standards seem arbitrary. Spaces after a ! operator? Enforcing early exit (I find code with multiple return statements much harder to follow than a single-indent if block, and asking for early exit to avoid an if block with 3 lines in it is absurd)? No non-boolean falsy/truthy statements (a core part of the language instead of trying to emulate C# language rules without compiler support)? Aligned assignment operators (this just makes it hard to search for assignment in your codebase, and routinely increases the size of git diffs)?

Seriously, I get some of the spacing checks and type checks (just took some work for me to figure out where to put all of the declarations with some in code and some in annotations), but most of the rest of this is frankly subjective. It is obviously easier with a tool to do it for you, but that doesn't make it less arbitrary. Anyway, it's not my codebase, so do as you will.

So as not to blast your CI server more I can try to run out of the codebase. I'm not really seeing how I'm supposed to set all of this up (I'm assuming you're running against real databases at some point instead of mocking the backend), or any guidelines on where you want functional tests. Do I need a test for each PDO driver type or can I cover PDO once for any driver? Do you want to replicate this for the other drivers (which I'm pretty sure will fail and need the same treatment)? Any tips or document pointers would be useful.

@derrabus
Copy link
Member

derrabus commented Mar 8, 2023

I just saw your comment on what you needed and started with coding standards. Did I miss some inline comments?

No, it's just that I never got a response from you and I've been asking for a test for a very long time now.

I have to say some of the things in this set of standards seem arbitrary.

Very well, but that shouldn't be your concern. As I said, there's a tool that fixes the code to match the configured standard. Just run it after you've changed the code.

I'm not really seeing how I'm supposed to set all of this up

You don't have to. We run the same test suite against different databases. Just run it against your favorite one. By default, the test suite is run against SQLite which does not require any additional setup.

any guidelines on where you want functional tests.

Have a look at the existing test suite. They should give you good examples of how functional tests look like.

Do I need a test for each PDO driver type

Usually you don't. Most of our functional tests are run with all drivers.

@mcurland
Copy link
Author

OK, I added the functional test, which covers bindValue on a resource and is passing with this fix (and fails without it).

This fix is currently not implemented for other drivers or for bindParam.

The bindParam test (testBindParamProcessesStream) does not actually pass a resource (the form of fopen returns a string, not a resource). However, even if I passed the resettable resource into the bindParam function the bound variable was a string after the call, so could not be reset. I do not know if this is intentional or not or if it is consistent across other drivers, but it definitely complicates resetting the stream. bindParam was no actually used by the original doctrine scenario and is deprecated as far as I can tell.

This could also be implemented for bindParam, but would need to change somewhat due to the use of reference variables. The resources array would need to be iterated immediately before the call to execute, with the variable type (is a resource) verified and stream positions recorded at this time. We could also consider doing this for the bindValue case to eliminate the possible modification of the resource position between the bindValue and execute calls. This would also eliminate one of the global variables.

I think this is all that you asked me to do. Let's see what CI says.

@github-actions
Copy link

There hasn't been any activity on this pull request in the past 90 days, so it has been marked as stale and it will be closed automatically if no further activity occurs in the next 7 days.
If you want to continue working on it, please leave a comment.

@github-actions github-actions bot added the Stale label Jul 12, 2023
@derrabus
Copy link
Member

@mcurland Do you intend to continue your work? Do you need any more input from me? The CI is still red.

@mcurland
Copy link
Author

I'm seeing one static CLI issue now at:
ERROR: TypeDoesNotContainType - src/Driver/PDO/Statement.php:195:25 - Type null for $resourceOffsets is always null (see https://psalm.dev/056)
The $resourceOffsets is set inside a finally that is inside a for loop. It will be null on the first pass, but may not be on subsequent passes. This psalm error is clearly wrong. I'm not sure if there is a way to skip this check or if a psalm fix available. (I didn't see this until reviewing right now.)

As far as I know CI is passing for the original issue flagged here, which is with the PDO driver along with the 3 other similar drivers (PDO, IBMDB2, MySQL and PgSQL). However, the process seems to be aborting testing later drivers if an earlier one fails so I'm not getting CI feedback on the 5 drivers I believe to be fixed. I know I haven't done OCI8, SQLite3 (possibly affected by #5994) and SQLSrv drivers.

Honestly I need some core team feedback on the approach I'm taking here before I spend another minute on it. The CI issues is that this is a problem with all of the drivers, not just PDO. I'm swamped on multiple other projects that I'm 100% responsible for (especially after new tasks resulting from a week long June meeting at ESTEC--I'm providing the primary modeling tool and related generative technologies for the https://mb4se.esa.int/OSMOSE_Main.html project and have major tech demonstrations in September and November and 2 little kids full time through the end of July while school is out). I've spent days on this already, including initial trackdown of the issue. I just don't have more days to burn trying to get the testing environments running locally for each driver.

As a primary on this project I'm sure you or collaborators have local environments set up to test each of the drivers without waiting overnight for the CI process. This would let you add the 10-20 lines of code (mostly copied) in a few minutes per driver. This is a severe defect from the perspective of the full doctrine super project, so I don't think asking for a little collaboration on specific drivers is breaking PR protocol. This issue isn't blocking me (as much as I hate to replace individual vendor files in principal, fixing a symfony version and updating one file in my standard base container is a trivial build step).

I can merge up to the working head as the branch is obviously well behind now (maybe the psalm hallucination will go away), but beyond that it might be a while before I have bandwidth to touch the last 3 drivers.

@github-actions github-actions bot removed the Stale label Jul 13, 2023
@derrabus
Copy link
Member

Thank you for your response.

This psalm error is clearly wrong. I'm not sure if there is a way to skip this check or if a psalm fix available. (I didn't see this until reviewing right now.)

That's okay. Psalm is mainly a review tool for us. We are allowed to be smarter than the tool. ✌🏻

If we are certain that it reports a false positive, I can help you with making Psalm understand your code. And if this turns out to be a bug in Psalm, we can ignore the error and open a bug report in the Psalm repository. However, let's sort this out last. Fixing the failing tests should be the priority.

Honestly I need some core team feedback on the approach I'm taking here before I spend another minute on it. The CI issues is that this is a problem with all of the drivers, not just PDO.

Okay, no worries. My initial problem with your PR was that you did not include a test which (given the little time that I can spend on open source) made it hard for me to understand what problem exactly you wanted to solve. You've delivered that test now and I thank you for that!

I'm swamped on multiple other projects that I'm 100% responsible for

Heh, we all are. The Doctrine core team is maintaining this library in their freetime. I (like the other team members) have to prioritize payed work and family over unpaid work. That mainly means that I can help getting a change merged, but I'm not investigating or fixing bugs right away unless they block me personally.

As a primary on this project I'm sure you or collaborators have local environments set up to test each of the drivers without waiting overnight for the CI process. This would let you add the 10-20 lines of code (mostly copied) in a few minutes per driver.

I have a local PHP interpreter and Docker. That's all it takes to spin up an environment for running the tests and this is also what the CI uses to run them. But I realize that this could be documented better.

This is a severe defect

As far as I understand the current code, the implementation assumes that the stream that is passed to the DBAL has been opened exclusively for the DBAL to consume. And under that assumption, the current implementation works well. This and the fact that this behavior has existed for years without complaints lowers the severity of this issue for me, to be honest.

src/Driver/IBMDB2/Statement.php Outdated Show resolved Hide resolved
src/Driver/IBMDB2/Statement.php Outdated Show resolved Hide resolved
src/Driver/PDO/Statement.php Outdated Show resolved Hide resolved
src/Driver/PDO/Statement.php Outdated Show resolved Hide resolved
src/Driver/PDO/Statement.php Outdated Show resolved Hide resolved
src/Driver/PDO/Statement.php Outdated Show resolved Hide resolved
src/Driver/PDO/Statement.php Outdated Show resolved Hide resolved
src/Driver/PDO/Statement.php Outdated Show resolved Hide resolved
src/Driver/PgSQL/Statement.php Outdated Show resolved Hide resolved
tests/Functional/BlobTest.php Outdated Show resolved Hide resolved
@derrabus
Copy link
Member

Please perform a rebase onto the 3.6.x branch to resolve the conflicts. We've added a new scenario to the BlobTest functional test and I'd like to see that it still passes with your changes.

@mcurland
Copy link
Author

Thank you for the feedback. I've rebuilt, replaced and rebased the branch on my fork, so you can now looking at a single commit without the noise (unless github didn't like the cross-fork history rewrite). I'll batch other suggestions after comments.

A pointer or two on how to retrieve a Docker image for running all tests locally without trying to set up each db system would be very helpful.

Acknowledged that we all have day jobs. Your after hours work is much appreciated.

We can agree to disagree on severity. I came of age in this business in Microsoft, which used a severity/priority rating system and I still think in those lines. "Nobody else has hit this yet" can be an argument for lowering priority, but not severity. Crash and data-loss defects have the highest severity. This isn't data loss, but is very close because you can't retrieve the data from the second (or third,...) delay-loaded collection, which is where my severity analysis comes from. As you say, issues that affect you personally get higher priority (as they should when time is volunteered).

@derrabus
Copy link
Member

I've rebuilt, replaced and rebased the branch on my fork, so you can now looking at a single commit without the noise

Perfect.

A pointer or two on how to retrieve a Docker image for running all tests locally without trying to set up each db system would be very helpful.

I don't know how fluent you are in GitHub Actions config files, but there's a .github/workflows/continuous-integration.yml file in the repository that configures our CI. Here you can see which Docker images we use when running the tests. For instance, this is the Postgres server:

services:
postgres:
image: "postgres:${{ matrix.postgres-version }}"
env:
POSTGRES_PASSWORD: "postgres"
options: >-
--health-cmd "pg_isready"
ports:
- "5432:5432"

I think, with basic knowledge on Docker, you should be able to turn this into a docker run command line.

In the ci/github/phpunit directory, you find PHPUnit configuration files that configure the database connection that our functional tests are run against.

This isn't data loss

That's my point. We're not talking about losing data in production under some rare circumstance. We're talking about you using DBAL in a way that in 100% of the cases simply won't work, which you probably already noticed while testing your app locally.

We could even argue that this isn't a bug because the state of the stream pointer is not part of the contract (yet). You assumed a post-condition that we never defined. This would turn your "severe defect" into a feature request.

The way I've seen streams being used in apps using DBAL so far was, "I have some binary data stored in a file and I want it to be streamed to the database. I fopen() that file and fclose() it once DBAL is done." That basically means:

  • Only DBAL reads from that stream.
  • I don't care about the state of the stream pointer after DBAL has read from the stream.

If we go with your change, we would rewind every stream once we're done with it. In the scenario I've outlined, that feels very unnecessary. Question that we need to ask:

  • Does such an unnecessary fseek() hurt performance?
  • Could this new fseek() call be a breaking change for apps that (for whatever reason) assume that the pointer is at EOF after executing the query, although we've never guaranteed that either?

Or we could argue that the current behavior is a correct one.

If we know, that DBAL is going to mess with our stream pointer, can we compensate for it? In our app, we could ftell() before we execute our statement and fseek() afterwards if we really need the previous state restored. Would that be feasible? In that case, this whole thing would boil down to a documentation issue.

@mcurland
Copy link
Author

mcurland commented Jul 23, 2023

I'm proposing this as fix for a doctrine issue as a whole library. I'm 100% with you that this is a dbal feature request looking only at dbal, but experiencing the library as a whole it is a defect in the doctrine stack. My calling code for postgres doesn't even deal with resources. I changed an identifier type (from serial int to uuid) and regenerated a bunch of code, none of which even touches resources. The string-to-resource-to-string translation is internal to the Doctrine stack (and resets the resource location after reading it.) So, all I knew going into this is that with a UUID identifier--which is nominally a supported feature--only the first delay-loaded collection populates. No data is returned for any other collection. This is a bad defect. Even if the data made it to the db, if the only way of ever seeing that data again is to use a different data access layer then it might as well be missing..

Fundamentally, this results from a loose contract between the different doctrine layers. The code calling dbal (going back multiple layers) is treating all parameters as immutable and is assuming that that the original values can be used for multiple calls. Obviously, that is not true here. For the input values to be reusable someone needs to reset the immutable state, effectively making these objects immutable.

So, the question is where is the best place to fix this in the Doctrine stack to minimize code changes? This would be the place that moves the stream pointer and doesn't reset it. If you look at the reverse call tree, there are a ton of calls that eventually get to $driver->execute, which is the code that actually reads the stream without resetting it (outside the PHP runtime layer, which is not going to change). I just didn't see a better point to do this without large amounts of new code.

There are two areas to consider on the legitimate question of using fseek here: side effects, and cost.
-For side-effects, these streams are at the end, so are effectively empty after execute. Any code that was using them afterwards would already be doing the same seek operation we're doing here, so an existing fseek which would now (likely) be a noop. Obviously, if the stream is not seekable this code has no effect, so a large forward-only stream (which would be data, not a key) should not be affected.
-As for costs, I would surmise that any penalty from the seek call (certainly at the disk access or memory level) occurs when you read the data, which derefs the seek pointer. Therefore, moving the pointer itself--even multiple times--without a corresponding read of the stream should not have measurable performance implications. [I'm not digging through the PHP runtime code to verify this specifically for PHP.]

So, overall, I think fixing the Doctrine issue with this dbal feature change is the least amount of code and the least risky change when viewing the framework as a whole stack (not just dbal).

I'll see what I can do with the docker images. I'm reasonably conversant in docker. I haven't used the github actions feature, but thank you for the starting point.

@derrabus derrabus changed the base branch from 3.6.x to 3.7.x September 26, 2023 21:52
@mcurland
Copy link
Author

mcurland commented Dec 8, 2023

Code review implemented on latest commit. New drivers have not been added yet. I need an updated report from the CI.

@mcurland
Copy link
Author

All tests are passing now, so you can proceed with review.

I ended up with very similar code in each driver. Moving the trackParamResource/getResourceOffsets/restoreResourceOffsets functions into shared code is obviously an option, but I'm not sure where you would want this helper class. The code that calls each of these methods would still need to be called per-driver.

@derrabus derrabus changed the base branch from 3.7.x to 3.8.x January 25, 2024 22:55
Copy link

There hasn't been any activity on this pull request in the past 90 days, so it has been marked as stale and it will be closed automatically if no further activity occurs in the next 7 days.
If you want to continue working on it, please leave a comment.

@github-actions github-actions bot closed this Aug 3, 2024
@mcurland
Copy link
Author

mcurland commented Aug 3, 2024

I take one week of vacation and ... poof.

As far as I know, this had been a 'green' status ready to commit awaiting review since December 18, 2023. I don't know what else I can do.

Obviously, at this point the active branch has moved on, so this would need to be reset to the other branch. However, I don't know why I would take the time to do this when I reached this point 8 months ago and got no action or feedback in response.

I don't see a mechanism for me to reopen this, but it is still needed.

@derrabus
Copy link
Member

derrabus commented Aug 3, 2024

Sorry for the bot closing the PR. It was me who did not find the time to answer you properly. I'll try to have another look at your changes and your answer to my last message next week. 🤞

@derrabus derrabus reopened this Aug 3, 2024
@derrabus derrabus removed the Stale label Aug 3, 2024
@derrabus derrabus changed the base branch from 3.8.x to 3.9.x August 14, 2024 11:02
@mcurland
Copy link
Author

I rebased the branch onto 3.9.x to match your pull rebase. It looks like you want to go there instead of 4.x.x directly.

It would be really nice to get this (and doctrine/orm#11109) pushed so I can eliminate side processes.

Please let me know how you want to proceed or if you want this rebased to a later branch.

Copy link

There hasn't been any activity on this pull request in the past 90 days, so it has been marked as stale and it will be closed automatically if no further activity occurs in the next 7 days.
If you want to continue working on it, please leave a comment.

@github-actions github-actions bot added the Stale label Nov 18, 2024
@mcurland
Copy link
Author

Please, someone review this. The bot is telling me it has been another 90 days. CI had a 'green' run on Dec 18, 2023. It has been almost a year. I still get to explain to my downstream people why UUID identifiers don't work out-of-the-box for some database systems. I will upgrade to whatever branch you want now, just let me know.

@derrabus
Copy link
Member

I realize that this must be a most frustrating experience for you. Yes, a lot of time has passed during that time I've had phases where I couldn't invest much time in open source work. And this issue is one where I always need some time to get my head around what where actually trying to fix here. I'm taking this time now.

The problem with this PR is not that it needs another review. It needs a decision if we want to include this change or not. This is what I summed up in #5897 (comment).

The use-case that justifies the change for me would be that I'd like to bind a stream to a statement that I want to execute twice. And this seems to be what you're trying to do. I'm fine with accepting that use-case but then please as a last change, let's adjust the functional test to reflect exactly that use-case instead of running assertions against ftell().

But let me take over the PR from here.

I still get to explain to my downstream people why UUID identifiers don't work out-of-the-box for some database systems.

This is what still puzzles me. When I bind streams to a statement, I do that because I'm dealing with large data that I don't want to load i to memory first. But a UUID? Even in it's binary representation, that's how much? 16 byte? Why's that a resource and not a string?

So, you're basically representing UUIDs as streams (in-memory probably) and that doesn't work because you can only ever bind that UUID once and need to rewind it afterwards.

I don't want to make this an endless discussion, but can't rid myself of the feeling that we're fixing the wrong issue here.

Entity identifiers that are PHP resource streams need to work for all
references, not just the first one. PDOStatement->execute is reading
the binary resources but not restoring the original offsets. No other
code is restoring these streams to their original position so that
they can be reused.

Examples of failures include empty collections on read (the first lazy
load collection on an entity populates correctly, the second is empty)
and foreign-key violations while persisting changes (the first entity
join produces the correct SQL, the second has no data to read and the
FK is violated with the missing binary data).

Making this change as close as possible to the external code that
moves the stream pointer eliminates the need to do this in calling
code. Resource offsets are retrieved immediately before execute in
case they change between the bindValue and execute calls.

The request was originally for the PDO driver but IBMDB2, Mysql, and
PgSQL drivers are also covered. Other drivers will likely also need
work. No attempt has been made to fix the deprecated bindParam code
path. I do not believe this is called by the current Doctrine code,
and is regardless much harder to patch because the reference variables
can be replaced during execute, so the original resources may no
longer be available to restore after the call.

A functional test was added for bindValue and a resource with a
seekable position.

doctrine#5895
Adding drivers to completed set.
@derrabus derrabus force-pushed the RestoreResourceOffset_5895 branch 2 times, most recently from 150fdc7 to 904f4b5 Compare November 24, 2024 11:45
@derrabus
Copy link
Member

@mcurland I have updated your test to reflect multiple statement executions with the same resource. It seems like your implementation is incomplete here. While the stream is correctly reset after the first execution, this does not happen after the second execution, so the third execution reads an empty stream again. I've tested on MySQL. Can you look into this?

@mcurland
Copy link
Author

Thanks for looking at this. I am well acquainted with the delicate balance between food on the table and open source work. I should be able to find some space over Thanksgiving weekend, so Friday/Saturday. Very odd that it only resets once.

The test was written against a generic resource instead of UUID because UUID is handled differently across drivers. A standard UUID representation is 36 characters, so significantly more data as a string that the 16 bytes as a resource. There are also other issues with strings. For example, SQL Server offers both NEWID and NEWSEQUENTIALID, with the latter still unique but less random to offer much better index compaction (than would not be achieved with a string representation). Other identifying resource types (using a small number of bytes as an id instead of a string) will also cause this problem. I had a full rundown of which systems it was affecting a long, long time ago (I believe Postgres uses a byte representation and Oracle uses a string (but can optionally convert to a RAW(16))).

Basically, I didn't choose the underlying data type mapping as a string or resource. I just followed the documentation to specify the UUID data type in the relational mapping, some systems mapped it to a resource, and only the first delay-loaded query returned data. I wasn't trying to repeatedly send mutable resources as query parameters to a system that tacitly assumed all incoming parameters were immutable data.

@derrabus
Copy link
Member

derrabus commented Nov 25, 2024

A standard UUID representation is 36 characters, so significantly more data as a string that the 16 bytes as a resource.

No. A string can hold any binary data, not just characters. You can store a binary UUID in a string and it's 16 bytes. Storing UUIDs in a stream only complicates the access.

I just followed the documentation to specify the UUID data type in the relational mapping

Can you point me to "the documentation" that you're referring to?

@mcurland
Copy link
Author

Can you point me to "the documentation" that you're referring to?
https://www.doctrine-project.org/projects/doctrine-dbal/en/4.2/reference/types.html is a good starting point. This mentions the guid data type and indicates the mappings depend on native support. You'll also see the 'binary' type, which says that a binary string type is implemented as a resource.

The bottom line is that some supported key data types are mapped to PHP resources. As these types can be used as keys (a random guid is meaningless if it isn't an identifier) they will naturally be sent by the ORM as parameterized values to DBAL execute statements when retrieving delay-loaded collections.

If you look at this as an overall Doctrine problem and not a DBAL problem everything happening here is basic, boring boilerplate. All you need for a resource identifier is a 'guid' data type (run on postgres, which has native UUID support with a standard extension enabled). You should be able to do the same with 'binary' data type (add a length value on the Column) on any supported RDBMS.

namespace App\Entities;
use Doctrine\ORM\Mapping as ORM;

#[ORM\Entity]
#[ORM\Table(name: 'entity1')]
#[ORM\UniqueConstraint(name: 'entity1_PK', columns: ['uuid'])]
class Entity1 {
	#[ORM\Column(name: 'uuid', type: 'guid', nullable: false)]
	#[ORM\Id]
	protected $uuid;
	public function getUuid() {
		return $this->uuid;
	}
	public function setUuid($value) {
		$this->uuid = $value;
	}
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants