Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RawState distorted by schema-aware transformations #1667

Open
t0yv0 opened this issue Feb 1, 2024 · 9 comments
Open

RawState distorted by schema-aware transformations #1667

t0yv0 opened this issue Feb 1, 2024 · 9 comments
Labels
kind/bug Some behavior is incorrect or out of spec

Comments

@t0yv0
Copy link
Member

t0yv0 commented Feb 1, 2024

What happened?

There are incorrect migration or replace plans (example: pulumi/pulumi-gcp#1488) where Pulumi
bridged providers differ from canonical TF behavior in what gets passed as RawState to the
provider's PlanResourceChange method.

Specifically, bridged providers store a Pulumi-oriented representation in the statefile. When
recovering stored state to the TF representation bridged providers call the MakeTerraformState
method that is informed by the provider's schema to not lose information. Schema is a necessary part
of the process as it informs decisions such as MaxItems=1 flattening where Pulumi and TF
representations do not naturally agree. The problem however is that when the state was written with
V1 version of the provider and is being processed by the V2 version of the provider, it is the V2
version of the schema that is informing the process. When V2/V1 schemas do not agree, this may lead
to incorrect results.

Migrations

Normally upstream providers should version the should be considering state migration framework when
taking on changes in the schema:

https://github.com/hashicorp/terraform-plugin-sdk/blob/main/website/docs/plugin/sdkv2/resources/state-migration.mdx

This framework versions the schema by numbers 1, 2, 3, and runs custom code to migrate forward.
Pulumi supports this by storing the resource version under __meta. However, this area is not
sufficiently tested and Pulumi might be deserializing the state under the wrong schema, using the
current provider schema instead of the schema the resource was written with. This is worth
double-checking.

DiffCustomizers

Upstream providers do not always use migrations. In pulumi/pulumi-gcp#1488 upstream is not using the
state migration facility explicitly, but instead just uses a customize diff function that assumes
that previously-shaped data is passed in:

https://github.com/hashicorp/terraform-provider-google-beta/blob/main/google-beta/services/secretmanager/resource_secret_manager_secret.go#L36

Excerpt:

func secretManagerSecretAutoCustomizeDiff(_ context.Context, diff *schema.ResourceDiff, meta interface{}) error {
	oAutomatic, nAutomatic := diff.GetChange("replication.0.automatic")
	_, nAuto := diff.GetChange("replication.0.auto")
	autoLen := len(nAuto.([]interface{}))

	// Do not ForceNew if we are removing "automatic" while adding "auto"
	if oAutomatic == true && nAutomatic == false && autoLen > 0 {
		return nil
	}

	if diff.HasChange("replication.0.automatic") {
		if err := diff.ForceNew("replication.0.automatic"); err != nil {
			return err
		}
	}

	if diff.HasChange("replication.0.auto") {
		if err := diff.ForceNew("replication.0.auto"); err != nil {
			return err
		}
	}

	return nil
}

This presents a problem because the original schema is not available, even through the migration
framework. Current Pulumi behavior is to drop "automatic" data during translation since it is not
accounted for by the V2 schema, which prevents the above diff customize function from doing its job,
causing an unexpected replacement plans.

Solutions

Opaque Raw State

Pulumi TF providers could write TF state as-is in an opaque blob under __meta to using the TF
canonical JSON representation, deserialize it accordingly and populate the RawState that way. This
method fully removes the behavior discrepancy between Pulumi and TF.

It is easy to implement and roll out but unfortunately adding an opaque blob costs extra space in
the test file (2x?), and may need masking to account for secret material in the state.

New State Representation

Perhaps a new representation could be designed that replaces what is currently stored in the
statefiles so that space is not wasted, that could serve both the purpose of recovering raw TF state
and the purpose of tracking Pulumi metadata and secrets correctly. In this case the previous schema
of storing state can be retired.

When rolling this out care needs to be taken to around backwards compatibility.

Incremental Representation

Perhaps something can be designed that encodes just enough extra information in the __meta field so
that the bridged provider can recover the TF RawState from the Pulumi representation without
resorting to the schema. This optimizes for minimizing space and migration problems (purely
additive) at the cost of some code complexity that will need to be careful and co-evolve with any
more impedance-mismatch features in the bridge.

Deserialize using the right schema version

Pulumi could consult resources's schema version from the statefile, pull the right schema from the TF migration machinery and deserialize against that. This is an incomplete solution though because:

  1. some resources do not use state migration machinery
  2. the information on Pulumi flags such as MaxItems=1 info overrides from the provider version that wrote the resource is still not available

Example

See above.

Output of pulumi about

N/A

Additional context

N/A

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

@t0yv0 t0yv0 added needs-triage Needs attention from the triage team kind/bug Some behavior is incorrect or out of spec labels Feb 1, 2024
@VenelinMartinov
Copy link
Contributor

VenelinMartinov commented Feb 1, 2024

may need masking to account for secret material in the state.

Would the same state which is secret in pulumi also be marked secret on tf side? I seem to remember some discussion around TF having weaker support for secrets - is this correct? Would that all be resolvable?

costs extra space in the test file

Do you mean the state file? How much is storage space a concern here since this approach seems really simple to implement.

Perhaps something can be designed that encodes just enough extra information in the __meta field

I'm not sure I understand that - isn't the issue that the schema for some property changed, so we might no longer have access to the old schema - would we need to store this special information for all properties?

@t0yv0
Copy link
Member Author

t0yv0 commented Feb 1, 2024

TF representation does not have first class secrets, but Pulumi does. This is one of the impedance mismatch issues.

I'm not sure I understand that - isn't the issue that the schema for some property changed, so we might no longer have access to the old schema - would we need to store this special information for all properties?

This is another way of saying, we can develop a sophisticated algorithm that stores TF representation cty.Value in Pulumi statefile and recovers it as is, which satisfies the expectations of TF around GetRawState(), in an incremental way, this would be something like storing paths at which flattening occurred or some such delta information. To optimize for space. Serializing the schema could be one way but that's likely too verbose.

@iwahbe
Copy link
Member

iwahbe commented Feb 2, 2024

Thanks for digging into this! @t0yv0 Do you have any sense of how many bugs correspond to this problem?

Incremental Representation seems like the most viable option, simply by eliminating other options. I don't think we can 2x our state. A New State Representation would require deep collaboration between all of Pulumi's engineering teams, and I think we would need a more compelling argument then improving TF state recovery.

@t0yv0
Copy link
Member Author

t0yv0 commented Feb 2, 2024

We have no easy way of telling. We can start cross-correlating bugs here. I suspect this condition is rare and specific to changing upstream resources schemas and upgrades, however when it does hit it can be very impactful like the GCP P1 issue.

Yes, IR can be interesting but introduces complexity and some brittleness going forward that needs to be maintained. Perhaps we can at least explore it though to see exactly how this may look like.

@iwahbe
Copy link
Member

iwahbe commented Apr 25, 2024

A design document has been written, but we have decided to delay this item of work.

@VenelinMartinov
Copy link
Contributor

Hit this again in pulumi/pulumi-aws#4410 (comment)

@VenelinMartinov
Copy link
Contributor

One idea here for fixing this is potentially using the cty Type in the StateUpgrader for recovering the old version state in the cases where there is a migration.

@iwahbe
Copy link
Member

iwahbe commented Sep 5, 2024

I've explained the problem a couple of times recently, so I though I'd write a quick diagram illustrating the issue.

Write TF State to Pulumi State

stateDiagram
TF_Provider --> TFState
TFState --> MakePulumiState
Provider@v1 --> (TFSchema,ProviderInfo)@v1
(TFSchema,ProviderInfo)@v1 --> MakePulumiState
MakePulumiState --> PulumiState
PulumiState --> gRPC(state_file)
Loading

Read TF State from Pulumi State

stateDiagram
gRPC(state_file) --> OldPulumiState
OldPulumiState --> MakeTerraformState
Provider@v2 --> (TFSchema,ProviderInfo)@v2
(TFSchema,ProviderInfo)@v2 --> MakeTerraformState: Takes old state and new info
MakeTerraformState --> TF_Provider
Loading

The Problem

When we read & write TF state, we use the "current" version of the provider. If we write with v1 and read with v2, then any information we use from the v2 state may not be correct for the v1 state.

@mjeffryes mjeffryes modified the milestones: 0.102, 0.110 Sep 12, 2024
@mjeffryes mjeffryes removed this from the 0.110 milestone Oct 2, 2024
@VenelinMartinov
Copy link
Contributor

@zbuchheit hit this with a user for a TypeList nested object which had MaxItems: 1. The MaxItems: 1 was dropped which resulted in the new provider having an object in the state for a List type property.
The workaround there was to add a TransformFromState to fix the state.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Some behavior is incorrect or out of spec
Projects
None yet
Development

No branches or pull requests

4 participants