Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Poison message handling for DT.AS V2 #1130

Open
wants to merge 3 commits into
base: azure-storage-v12
Choose a base branch
from

Conversation

bachuv
Copy link
Collaborator

@bachuv bachuv commented Jul 9, 2024

This PR adds poison message handling for DurableTask.AzureStorage V2. There's a new PoisonMessageDeuqueCountThreshold setting in AzureStorageOrchestrationServiceSettings where the default is set to 20. If a message is dequeued and fails to deserialize more than the number that's set for PoisonMessageDeuqueCountThreshold, then it gets added to a new <TaskHubName>Poison table in Azure Storage. Customers can go to this table in their storage account and look at these poison messages.

Copy link
Member

@cgillum cgillum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Biggest blocker for me on this PR are the changes to DurableTask.Core. I don't think a case has been made for this yet, and it needs further discussion. At this point, my preference is to change only DurableTask.AzureStorage.

/// Gets or sets the maximum dequeue count of any message before it is flagged as a "poison message".
/// The default value is 20.
/// </summary>
public int PoisonMessageDeuqueCountThreshold { get; set; } = 20;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public int PoisonMessageDeuqueCountThreshold { get; set; } = 20;
public int PoisonMessageDequeueCountThreshold { get; set; } = 20;

{
// We have limited information about the details of the message
// since we failed to deserialize it.
this.settings.Logger.MessageFailure(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Abandoning a message is not necessarily an error. There are other cases, such as out-of-order messaging race conditions where we're able to recover gracefully by just abandoning a message once.

Consider putting this error message inside of an if (exception != null) block.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this overload is only called in cases where there's an exception, as you can see by the comment in line 193:

// This overload is intended for cases where we aren't able to deserialize an instance of MessageData.
. In that case, doesn't that mean the exception will never be null, @cgillum ?

using Azure.Storage.Queues.Models;
using DurableTask.AzureStorage.Storage;
using DurableTask.Core;
using DurableTask.Core.History;


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove extra newline

// Create the poison message table if it doesn't exist
string poisonMessageTableName = this.settings.TaskHubName.ToLowerInvariant() + "Poison";
Table poisonMessagesTable = this.azureStorageClient.GetTableReference(poisonMessageTableName);
await poisonMessagesTable.CreateIfNotExistsAsync();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to consider caching to remember if we've already created this table. Otherwise, we may end up spamming the storage account with a lot of these calls.

}
}

public async Task<bool> TryHandlingDeserializationPoisonMessage(QueueMessage queueMessage, Exception deserializationException)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like there's a lot of redundancy between these two methods. Is it possible to refactor them to share some logic?

/// Gets or sets user-facing details for why a message was labeled as poison.
/// This is to be set by each storage provider.
/// </summary>
public string PoisonGuidance { get; set; } = "";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not comfortable with adding these properties to every DTFx history event, especially if we don't have a plan or design for whether/how to implement poison message handling for other backend providers. For the purposes of this PR, I'd prefer we make changes only to DurableTask.AzureStorage unless we can get broad agreement from across the team that this is the right approach.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Just to add more context here (may be obvious) - the reason for the DTFx.Core change here is solely that it allows us to easily fail the function (activity or orchestrator) with the poison message. In the activity/orchestrator dispatchers, we are checking if isPoison is true and, if so, we're replacing the history event of the poison message with a failure event.

I assume that if instead of doing this, we replace the message itself with a failure event in the DTFx.AS level, we should be able to get away without this "isPoison" property. I recall trying to do this, and not finding it to be super obvious, but it's worth trying again. @cgillum if you have suggestions, I'm all ears!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants