Send an email when a file transfer fails #1353

Open
andreleblanc11 opened this issue Dec 20, 2024 · 13 comments
Labels
enhancement New feature or request likely-fixed likely fix is in the repository, success not confirmed yet. NewUseCase needed to address a use case, we can't yet support. Sugar nice to have features... not super important. UserStory interesting to read to consider improving

Comments

@andreleblanc11
Member

We have a client that would like to have an email be sent when a transfer fails.

I've been thinking this might be possible with message reports.

@petersilva said

the report messages have an added field "report" ... and if the original messages contained the data, that is removed too (no embedded content), so it's mostly the same as a normal message.

 "report": { "code": 999                           - HTTP-style response code.
             "timeCompleted": "YYYYMMDDTHHMMSS.ss" - UTC date/timestamp.
             "message":                            - status report message documented in `Report Messages`_
           }

If the report message can include the transfer error, and we feed that message to an email sender, this might work. I haven't checked the code to confirm it.

I'm not sure of another way to do this. Possibly another option could be to introduce a new flowCB entry point?

This could also be good for our team to have this implemented if critical data feeds start having transfer problems. Could send an email to NetOps.

@andreleblanc11 andreleblanc11 added enhancement New feature or request NewUseCase needed to address a use case, we can't yet support. UserStory interesting to read to consider improving Sugar nice to have features... not super important. labels Dec 20, 2024
@andreleblanc11
Member Author

This could also be good for our team to have this implemented if critical data feeds start having transfer problems. Could send an email to NetOps.

#1350 should also already help in a similar way.

@petersilva
Contributor

petersilva commented Dec 20, 2024

We don't need reports or any special entry_points ... there are multiple worklists for exactly this purpose...

  • worklist.incoming ... messages received but not transferred.
  • worklist.rejected ... messages for which transfers will not be attempted.
  • worklist.ok ... messages for files which were successfully transferred.
  • worklist.failed ... messages for files where the transfers failed.

when a send fails, the corresponding message should be in worklist.failed.
you can write an after_work plugin that sends an email for each message in that worklist.

  1. The original message and all its fields are available at that point.
  2. It will only generate the mail after it has tried 3 times (based on the attempts setting.)

but that message will go into the retry queue, and be retried five minutes later... so if it fails again, there will be an email every five minutes ... for about 3 days (based on default settings.)
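The after_work approach described above can be sketched roughly as follows. This is a minimal illustration, not a working sarracenia plugin: `MailOnFailure` is a plain stand-in class (a real callback would presumably subclass `sarracenia.flowcb.FlowCB`), `notify` is a hypothetical callable that delivers one notification, and the worklist is faked with a `SimpleNamespace`.

```python
from types import SimpleNamespace


class MailOnFailure:
    """Stand-in for a flow callback; only the after_work hook is sketched."""

    def __init__(self, notify):
        # notify: hypothetical callable that delivers one notification per message
        self.notify = notify

    def after_work(self, worklist):
        # Only read worklist.failed; leaving it untouched lets the normal
        # retry logic keep handling the failed messages.
        for msg in worklist.failed:
            self.notify(msg)


# Fake worklist with the four lists named in this thread.
sent = []
plugin = MailOnFailure(notify=sent.append)
worklist = SimpleNamespace(
    incoming=[],
    rejected=[],
    ok=[{"relPath": "a.txt"}],
    failed=[{"relPath": "b.txt"}],
)
plugin.after_work(worklist)
print([m["relPath"] for m in sent])  # ['b.txt']
```

As written, this notifies on every pass through after_work, so a message that keeps failing would trigger a notification on each retry.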

@andreleblanc11
Member Author

andreleblanc11 commented Dec 20, 2024

I didn't even think about the fact that worklist.failed stuff will go through an after_work entry point. That's definitely a better option than what I was thinking.

@andreleblanc11
Member Author

To avoid multiple emails being sent, I think we could probably leverage msg_get_from_file in Diskqueue.py to check if the file message is already in the retry queue or not.

If we also add callback_prepend work.my-plugin in the config, I think we should be able to run the plugin before the retry queue gets appended.

@petersilva
Contributor

when I said the message and "all its fields" ... that includes the "report" field you mentioned... so that could be leveraged in writing the mail message.
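As a rough illustration of using the message fields in the mail text, the sketch below builds an `EmailMessage` from a message dict with the standard library. The function name `build_failure_mail`, the addresses, and the sample values are made up for the example. The "report" field is read with `.get()` because this thread does not confirm that it is always present.

```python
from email.message import EmailMessage


def build_failure_mail(msg, sender, recipient):
    """Compose a notification from a message dict (hypothetical helper)."""
    mail = EmailMessage()
    mail["From"] = sender
    mail["To"] = recipient
    mail["Subject"] = f"Transfer failed: {msg.get('relPath', 'unknown file')}"

    lines = [
        f"baseUrl: {msg.get('baseUrl', '')}",
        f"relPath: {msg.get('relPath', '')}",
    ]
    # The report field may be absent, so every lookup is optional.
    report = msg.get("report") or {}
    for key in ("code", "timeCompleted", "message"):
        if key in report:
            lines.append(f"{key}: {report[key]}")

    mail.set_content("\n".join(lines))
    return mail


example = {
    "baseUrl": "file:",
    "relPath": "/tmp/example.txt",
    "report": {"code": 499, "message": "illustrative failure text"},
}
mail = build_failure_mail(example, "sender@example.com", "netops@example.com")
print(mail["Subject"])
```

Delivery could then go through something like `smtplib.SMTP(host).send_message(mail)`, with the host taken from whichever option the plugin ends up using.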

@andreleblanc11
Member Author

andreleblanc11 commented Dec 27, 2024

I was able to get an email to send when a transfer failed in my test plugin.

However, I had to work around the sendTo option to do this. Both the email sender and a regular sender use sendTo.

This is what I did to work around the problem (in the __init__).

        self.o.add_option('email_server', 'str', default_value='')
        # Hacky way of having a correct mail server
        self.sendTo = self.o.sendTo
        self.o.sendTo = self.o.email_server

        self.email = sarracenia.flowcb.send.email.Email(self.o)
        self.o.sendTo = self.sendTo

A workaround in the email plugin could be to add a new option that defaults to self.o.sendTo.

self.o.add_option('email_server', 'str', default_value=f"{self.o.sendTo}")
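The temporary swap of sendTo shown earlier in this comment could also be wrapped in a context manager, so the original value is restored even if constructing the email sender raises. This is a minimal sketch: `temporarily_set` is a hypothetical helper, and `options` is a `SimpleNamespace` standing in for `self.o`, with made-up values.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporarily_set(obj, name, value):
    """Set obj.<name> to value inside the with-block, then restore it."""
    saved = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield obj
    finally:
        # Runs on normal exit and on exceptions alike.
        setattr(obj, name, saved)


# Stand-in for the plugin options object (values are illustrative).
options = SimpleNamespace(sendTo="sftp://user@host", email_server="smtp.example.com")

with temporarily_set(options, "sendTo", options.email_server):
    seen_inside = options.sendTo  # what the email sender would read

print(seen_inside, options.sendTo)  # smtp.example.com sftp://user@host
```

In the plugin, the body of the with-block would be the `sarracenia.flowcb.send.email.Email(self.o)` construction. This assumes the email sender only reads sendTo while it is being constructed, which this thread does not confirm.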

@andreleblanc11
Member Author

I've been trying to integrate the diskqueue in the plugin (to avoid multiple sendings of emails) and have gotten unsatisfactory results.

During housekeeping, before files get retried, I'm not able to find the diskqueue file within the config's cache directory. This is what the running process logs.

# Housekeeping runs, gets the files from the diskqueue
2024-12-30 14:38:25,561 [DEBUG] sarracenia.diskqueue on_housekeeping work_retry_01 on_housekeeping, 0 msgs in queue file, 1 in new file
2024-12-30 14:38:25,563 [DEBUG] sarracenia.diskqueue on_housekeeping has queue False
2024-12-30 14:38:25,564 [DEBUG] sarracenia.diskqueue msg_get_from_file DEBUG /net/local/home/leblanca/.cache/sr3/sender/test_email_on_failure/diskqueue_work_retry_01.new open read
2024-12-30 14:38:25,565 [DEBUG] sarracenia.diskqueue on_housekeeping retrieved 1 from the 1 retry
2024-12-30 14:38:25,566 [INFO] sarracenia.diskqueue on_housekeeping work_retry_01 Number of messages in retry list 1
2024-12-30 14:38:25,567 [DEBUG] sarracenia.diskqueue on_housekeeping on_housekeeping elapse 0.004850
2024-12-30 14:38:25,567 [DEBUG] sarracenia.diskqueue on_housekeeping post_retry_001 on_housekeeping, 0 msgs in queue file, 0 in new file
2024-12-30 14:38:25,568 [DEBUG] sarracenia.diskqueue on_housekeeping has queue False
2024-12-30 14:38:25,568 [DEBUG] sarracenia.diskqueue msg_get_from_file DEBUG /net/local/home/leblanca/.cache/sr3/sender/test_email_on_failure/diskqueue_post_retry_001.new open read
2024-12-30 14:38:25,569 [DEBUG] sarracenia.diskqueue on_housekeeping retrieved 0 from the 0 retry
2024-12-30 14:38:25,569 [DEBUG] sarracenia.diskqueue on_housekeeping post_retry_001 No retry in list
2024-12-30 14:38:25,570 [DEBUG] sarracenia.diskqueue on_housekeeping on_housekeeping elapse 0.003351

# The retry files are now in diskqueue_work_retry_01
2024-12-30 14:38:39,919 [DEBUG] sarracenia.diskqueue msg_get_from_file DEBUG /net/local/home/leblanca/.cache/sr3/sender/test_email_on_failure/diskqueue_work_retry_01 open read


# It retries to send the file.
# When my after_work plugin is called, the retry file doesn't exist, so we can't compare with what we are trying to filter out.
2024-12-30 14:38:25,580 [DEBUG] work.send_email_on_failure after_work Checking if message in retry queue
# os.listdir of the config's cache directory
2024-12-30 14:38:25,580 [CRITICAL] work.send_email_on_failure after_work Files ['subscriptions.json', 'sender_test_email_on_failure_01.pid', 'sender.test_email_on_failure.tfeed.qname']
# Checking if the file pointer exists. It doesn't because we can't find the file.
2024-12-30 14:38:25,580 [DEBUG] work.send_email_on_failure after_work FP : None . queue file /net/local/home/leblanca/.cache/sr3/sender/test_email_on_failure/diskqueue_work_retry_01

I went back to the diskqueue logic and saw that when a get is made after housekeeping runs, the retry file gets discarded.
😢

# after getting the last message from the file, close it
if self.msg_count == 0:
    try:
        os.unlink(self.queue_file)
    except:
        pass
    self.queue_fp = None
return ml

@andreleblanc11
Member Author

andreleblanc11 commented Dec 30, 2024

A band-aid workaround for this is to append the retried messages to a list, and check whether they exist within the list every time the plugin is called.

This is not a good long-term workaround though. There's no way to clear the list if the retried file eventually gets sent. Also, if the process gets restarted, the emails will get sent again.

(in the __init__)
       self.accumulated_msgs = []

(in the after_work)
            if msg in self.accumulated_msgs: continue
            else: self.accumulated_msgs.append(msg)

@petersilva
Contributor

petersilva commented Dec 30, 2024

I might not understand what you are trying to do. If you want to prevent retries after you have sent the email... all you need to do is remove the messages from the worklist.failed.
That should be it.

so the loop should be something like:

to_mail = worklist.failed
worklist.failed = []

for m in to_mail:
    ...  # whatever the mail logic is

@petersilva
Contributor

Do you want to suppress retrying of the file send... or just prevent multiple emails (but keep retrying so it gets sent eventually)?

I guess the problem here is that you are trying to use the unmodified email sender... sounded like a great idea at first... but it probably doesn't quite match (you need to do different things with the worklists vs. the built-in email send thing).

I think you might need a custom callback that re-implements mail logic.

@andreleblanc11
Member Author

andreleblanc11 commented Dec 30, 2024

I might not understand what you are trying to do. If you want to prevent retries after you have sent the email... all you need to do is remove the messages from the worklist.failed.

That won't work because we want the file to keep retrying. The email sent would just be to notify the client that a transfer failed, that's why we would only want to send it once. We don't want to spam the client, but we want to try to resend the file normally.

@petersilva
Contributor

OK then look at the fields in the message... I think there is a field set when a message is a retry... something like msg['retry'] or msg['isRetry'] and you don't send the mail if that field is set.

@andreleblanc11
Member Author

andreleblanc11 commented Dec 30, 2024

There is no retry field available in the message when it retries, even with report True set.

{'_format': 'v02', '_deleteOnPost': {'exchange', 'new_dir', 'new_baseUrl', '_mask_index', 'post_format', '_format', 'local_offset', 'topic', 'new_subtopic', 'new_file', 'new_relPath', 'subtopic'}, 'to_clusters': 'ALL', 'mtime': '20241108T204642.551517248', 'atime': '20241108T204802.322180033', 'mode': '755', 'pubTime': '20241230T143100.155074835', 'baseUrl': 'file:', 'relPath': '/net/local/home/leblanca/kill_orphaned_children.sh', 'subtopic': ['net', 'local', 'home', 'leblanca'], 'identity': {'method': 'md5', 'value': 'HHIlcIoBRG1Y5GYS4unzgw=='}, 'size': 3270, 'exchange': 'xs_tfeed', 'source': 'tsource', 'topic': 'v02.post.net.local.home.leblanca', 'local_offset': 0, '_mask_index': 0, 'new_dir': '/net/local/home/leblanca/test', 'new_file': 'kill_orphaned_children.sh', 'post_format': 'v03', 'new_baseUrl': 'file:', 'new_relPath': 'net/local/home/leblanca/test/kill_orphaned_children.sh', 'new_subtopic': ['net', 'local', 'home', 'leblanca', 'test'], 'contentType': 'text/x-shellscript'}

However, we can still add the field manually in the message. Adding the below works in my plugin.

# When checking if the field is there

            # Don't resend emails for messages that have already been through this plugin
            if msg.get('isRetry'): continue

# When setting the field

            # The message will get retried. Add a field in the message so that we can check for future occurrences
            msg['isRetry'] = True
            # We still want to delete the field if ever it posts
            msg['_deleteOnPost'] |= set(['isRetry'])
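Putting the two fragments together, the notify-once logic might look roughly like the sketch below. It is a self-contained illustration, not the actual plugin: `MailOnceOnFailure` is a plain stand-in class, `notify` is a hypothetical callable, and the second after_work call simulates the same message coming back from the retry queue and failing again. It assumes the 'isRetry' field survives the round trip through the retry queue, as reported above.

```python
from types import SimpleNamespace


class MailOnceOnFailure:
    """Stand-in for the after_work callback discussed in this thread."""

    def __init__(self, notify):
        # notify: hypothetical callable that sends one email per message
        self.notify = notify

    def after_work(self, worklist):
        for msg in worklist.failed:
            # Already reported on an earlier attempt: stay quiet, keep retrying.
            if msg.get("isRetry"):
                continue
            self.notify(msg)
            # Mark the message so later failures are not reported again.
            msg["isRetry"] = True
            # Drop the marker if the message is ever posted.
            msg.setdefault("_deleteOnPost", set()).add("isRetry")


sent = []
plugin = MailOnceOnFailure(notify=sent.append)
msg = {"relPath": "b.txt", "_deleteOnPost": {"topic"}}

# First failure, then the same message failing again after a retry.
plugin.after_work(SimpleNamespace(failed=[msg]))
plugin.after_work(SimpleNamespace(failed=[msg]))
print(len(sent))  # 1
```

The message stays in worklist.failed on both passes, so the file transfer itself keeps being retried as usual.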

@andreleblanc11 andreleblanc11 added the likely-fixed likely fix is in the repository, success not confirmed yet. label Jan 16, 2025