-
Notifications
You must be signed in to change notification settings - Fork 296
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Poison message handling for DT.AS V2 #1130
base: azure-storage-v12
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Biggest blocker for me on this PR are the changes to DurableTask.Core. I don't think a case has been made for this yet, and it needs further discussion. At this point, my preference is to change only DurableTask.AzureStorage.
/// Gets or sets the maximum dequeue count of any message before it is flagged as a "poison message". | ||
/// The default value is 20. | ||
/// </summary> | ||
public int PoisonMessageDeuqueCountThreshold { get; set; } = 20; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public int PoisonMessageDeuqueCountThreshold { get; set; } = 20; | |
public int PoisonMessageDequeueCountThreshold { get; set; } = 20; |
{ | ||
// We have limited information about the details of the message | ||
// since we failed to deserialize it. | ||
this.settings.Logger.MessageFailure( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Abandoning a message is not necessarily an error. There are other cases, such as out-of-order messaging race conditions where we're able to recover gracefully by just abandoning a message once.
Consider putting this error message inside of an if (exception != null)
block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this overload is only called in cases where there's an exception, as you can see by the comment in line 193:
// This overload is intended for cases where we aren't able to deserialize an instance of MessageData. |
using Azure.Storage.Queues.Models; | ||
using DurableTask.AzureStorage.Storage; | ||
using DurableTask.Core; | ||
using DurableTask.Core.History; | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove extra newline
// Create the poison message table if it doesn't exist | ||
string poisonMessageTableName = this.settings.TaskHubName.ToLowerInvariant() + "Poison"; | ||
Table poisonMessagesTable = this.azureStorageClient.GetTableReference(poisonMessageTableName); | ||
await poisonMessagesTable.CreateIfNotExistsAsync(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to consider caching to remember if we've already created this table. Otherwise, we may end up spamming the storage account with a lot of these calls.
} | ||
} | ||
|
||
public async Task<bool> TryHandlingDeserializationPoisonMessage(QueueMessage queueMessage, Exception deserializationException) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like there's a lot of redundancy between these two methods. Is it possible to refactor them to share some logic?
/// Gets or sets user-facing details for why a message was labeled as poison. | ||
/// This is to be set by each storage provider. | ||
/// </summary> | ||
public string PoisonGuidance { get; set; } = ""; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not comfortable with adding these properties to every DTFx history event, especially if we don't have a plan or design for whether/how to implement poison message handling for other backend providers. For the purposes of this PR, I'd prefer we make changes only to DurableTask.AzureStorage unless we can get broad agreement from across the team that this is the right approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Just to add more context here (may be obvious) - the reason for the DTFx.Core change here is solely that it allows us to easily fail the function (activity or orchestrator) with the poison message. In the activity/orchestrator dispatchers, we are checking if isPoison
is true and, if so, we're replacing the history event of the poison message with a failure event.
I assume that if instead of doing this, we replace the message itself with a failure event in the DTFx.AS level, we should be able to get away without this "isPoison" property. I recall trying to do this, and not finding it to be super obvious, but it's worth trying again. @cgillum if you have suggestions, I'm all ears!
This PR adds poison message handling for DurableTask.AzureStorage V2. There's a new
PoisonMessageDeuqueCountThreshold
setting inAzureStorageOrchestrationServiceSettings
where the default is set to 20. If a message is dequeued and fails to deserialize more than the number that's set forPoisonMessageDeuqueCountThreshold
, then it gets added to a new<TaskHubName>Poison
table in Azure Storage. Customers can go to this table in their storage account and look at these poison messages.