# Maintaining Running Instances

This document collects some common issues, workarounds for them, and plans for potential fixes in the future.

## Prerequisites

- Access to the AWS Web Console

## Known Issues

### SIGKILL search matches

To solve the issue of Celery workers dying in production ([#1516](https://github.com/WikiWatershed/model-my-watershed/issues/1516), [#2981](https://github.com/WikiWatershed/model-my-watershed/issues/2981), [#3084](https://github.com/WikiWatershed/model-my-watershed/issues/3084)), we instituted two workarounds:

1. Restart Celery every night ([#3085](https://github.com/WikiWatershed/model-my-watershed/issues/3085), [#3286](https://github.com/WikiWatershed/model-my-watershed/issues/3286))
2. Add a Papertrail alert for SIGKILL ([#3278](https://github.com/WikiWatershed/model-my-watershed/issues/3278))

The second one incidentally also catches SIGKILLs of `snapd.service`. We don't actively use `snapd` in this project, but such a SIGKILL generally indicates that the instance is out of hard drive space. Because of a combination of higher base disk usage in fresh VMs and more verbose logs, the Worker VMs run out of space more quickly than other instances.

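To confirm that a suspect Worker VM really is out of space, the quickest check is the root filesystem's usage. The snippet below is a minimal, standard-library-only sketch of that check; the 90% threshold is an arbitrary illustration, not a project setting.

```python
#!/usr/bin/env python3
"""Report root filesystem usage and exit non-zero when it looks nearly full."""
import shutil
import sys

total, used, _free = shutil.disk_usage("/")
percent_used = used / total * 100
print(f"/ : {used / 1e9:.1f} GB used of {total / 1e9:.1f} GB ({percent_used:.0f}%)")

# 90% is an illustrative threshold, not an official project limit.
sys.exit(1 if percent_used >= 90 else 0)
```

Running `df -h /` on the instance gives the same information without any scripting.
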
A longer-term fix would involve a combination of reducing the base disk use, reducing the amount of logging done on those VMs, and rotating the logs so that they don't fill the disk up. Until we have the funding for such an investigation, we employ the following workaround.

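As an illustration of the log rotation piece, the sketch below caps a hypothetical worker log at a fixed size using Python's standard `logging.handlers.RotatingFileHandler`. The path, size limit, and backup count are placeholders for illustration, not values the project uses; the same effect can also be achieved at the OS level with `logrotate`, without touching application code.

```python
import logging
from logging.handlers import RotatingFileHandler

# Hypothetical log path and limits, chosen for illustration only.
handler = RotatingFileHandler(
    "/var/log/mmw/worker.log",  # placeholder path
    maxBytes=50 * 1024 * 1024,  # start a new file after ~50 MB
    backupCount=5,              # keep at most 5 rotated files, then drop the oldest
)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

logger = logging.getLogger("worker")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Worker log rotation configured")
```
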
#### Workaround

1. Log in to the production account using the `mmw-prd` credentials in Azavea's BitWarden
2. Navigate to EC2 → Instances
3. Filter the Instances by `Instance state = running` and `Name = Worker`
4. Select all of them
5. Go to Instance state → Terminate (a scripted alternative is sketched after this list)

https://user-images.githubusercontent.com/1430060/204665413-370025b1-091e-4351-bc3d-04d5595e575b.mp4

6. The load balancer will spin up new instances to replace these
7. Once they are up, go to https://modelmywatershed.org/ and **draw** a shape
   - Drawing a shape is important so that we don't use cached values and bypass Celery
8. Ensure that the analyses for the drawn shape complete successfully
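
For reference, steps 2–5 of the workaround can also be scripted. The sketch below is an untested equivalent using `boto3` with the same filters as the console walkthrough; it assumes the `mmw-prd` credentials are already configured locally, that the region shown is only a placeholder, and that the `Name = Worker` tag identifies only the worker fleet.

```python
import boto3

# Assumes credentials for the production (mmw-prd) account are configured,
# e.g. via an AWS profile; the region below is a placeholder.
ec2 = boto3.client("ec2", region_name="us-east-1")

# Same filters as the console walkthrough: running instances named "Worker".
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "instance-state-name", "Values": ["running"]},
        {"Name": "tag:Name", "Values": ["Worker"]},
    ]
)["Reservations"]

instance_ids = [
    instance["InstanceId"]
    for reservation in reservations
    for instance in reservation["Instances"]
]
print("Terminating:", instance_ids)

if instance_ids:
    ec2.terminate_instances(InstanceIds=instance_ids)
```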