Alert handling steps
If you are on call and receive an OpsGenie alert, you are responsible for handling that alert until another team member explicitly takes over. This document covers the steps you should take to manage the alert and communicate with the rest of the team.
For information on how to investigate an outage, check out the on-call incident guide or, if relevant, the investigating user questions page.
```mermaid
graph TD
    newAlert(New OpsGenie alert) -->|Acknowledge| triage(Triage the event)
    triage --> investigate(Investigate the issue)
    investigate -->|I can solve this quickly| solve(Solve the problem)
    solve --> notifyOfResolution(Notify the team on Slack that the problem is resolved)
    investigate -->|Not sure if I can solve this quickly| askForHelp(Ask for help on Slack)
    investigate -->|I definitely can't solve this| escalateInOpsGenie(Escalate in OpsGenie)
    askForHelp --> continueInvestigation(Continue investigation)
    continueInvestigation -->|Can't resolve in a reasonable timeframe| escalateInOpsGenie
    continueInvestigation -->|Not a real bug| ignore("Ignore the issue, explain the situation on Slack")
    continueInvestigation -->|I solved the issue| notifyOfResolution
    escalateInOpsGenie -->|Wait for next on-call person to come online| pairResolution(Work together on determining next steps)
    pairResolution -->|Solve the issue together| notifyOfResolution
    pairResolution -->|Hand off ownership to new engineer| noLongerResponsible(You are no longer responsible)
    pairResolution -->|Neither of you knows what to do| escalateToMichael(Escalate to Michael Snoyman)
    investigate -->|There's a third party involved| contactThirdParty(Contact third party)
    contactThirdParty -->|Continue investigating| continueInvestigation
```
Basic steps
We will flesh out the details of these steps below.
- Acknowledge in OpsGenie
- Acknowledge the alert within OpsGenie to take ownership and prevent the alert from escalating to someone else
- Do basic triage
- Review the alert message, investigate status pages, and test the frontend site (see the sketch after this list)
- Determine if the alert is real
- Determine if you'll be able to quickly solve this on your own
- Ask for help
- Do this if you are not sure you'll be able to handle the issue on your own
- Support engineers should err on the side of asking for help
- For developers, if you're completely unfamiliar with the source of the problem, also feel free to ask for help immediately
- While still working on resolving the issue, see if anyone else is available on #production-outage-discussion who can assist. Feel free to use an @channel ping.
- Escalate
- When it becomes clear that you won't be able to resolve the issue on your own in a reasonable amount of time, use OpsGenie to escalate the alert to the next on-call person
- Until that person takes over, you are still responsible
- Stay on with the new person, provide any information you've collected so far, and decide together how to proceed
- Communicate externally
- Some issues may be beyond Levana's infrastructure, and require third party assistance
- If you identify a third party that is relevant to the alert, reach out to them for support
- See the Slack contact list for more information on how to contact external teams
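For the "test the frontend site" step, here is a minimal triage sketch, assuming Python with the requests library. The URL is the public site mentioned later in this document; adjust it for whichever app you're checking.

```python
# Minimal triage sketch: is the frontend reachable at all?
# Assumes the public site at levana.finance; adjust for the app you're checking.
import requests

FRONTEND_URL = "https://levana.finance"

try:
    resp = requests.get(FRONTEND_URL, timeout=10)
    print(f"{FRONTEND_URL} -> HTTP {resp.status_code}")
    if resp.status_code >= 500:
        print("Server-side error: the alert is likely real.")
except requests.exceptions.RequestException as exc:
    # DNS failure, TLS problem, timeout, connection refused, etc.
    print(f"Request failed entirely: {exc}")
```

This only confirms basic reachability; if the site responds but the alert persists, continue with the remaining triage questions above.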
These steps primarily apply to OpsGenie alerts. For alerts in #production-monitoring which do not have an OpsGenie alert associated with them, use your best judgement on how to proceed with the review.
Note that even after asking for help on Slack, you are responsible for managing this alert until someone explicitly takes over from you.
Acknowledging the Alert
- Once you receive an alert, the first step is to acknowledge it. This can be done through the OpsGenie app or web interface.
- Acknowledging an alert signals that you are aware of the issue and are taking steps to resolve it. This prevents the alert from escalating further in the immediate term.
- If you will not be able to resolve the alert within 5-10 minutes, put a message in the #production-outage-discussion channel that you're investigating so that, if others are able to assist, they can lend guidance.
- Additionally, see the escalation protocol below. If you are unable to resolve the issue and do not receive assistance in Slack, you need to use OpsGenie to escalate to the next level of support (on-call developer or higher).
- Review the error message you've received and see if you know what the problem is. If so, address the problem and/or notify the responsible parties.
- Note that, for bot errors, you'll need to click through to the bot status page to see the real error message.
- If the alert has already resolved by the time you get the status page open, you can usually click on "view incident details" in UptimeRobot and then download the "full response" to see the message.
- Check live: is the app accessible? Does the page load at all? If it loads, are there errors displayed? If there are no errors, are you able to open a transaction?
- Check if other apps on the chain are still working.
- Check Discord to see if there are notifications about it. Notify Discord users that we're investigating the issue.
- Dive into the error details in Slack messages on the monitoring channels and any OpsGenie report, and potentially log into AWS and look at the logs (a log-fetching sketch follows this list).
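If you need to pull logs without clicking through the AWS console, a hedged sketch using boto3 is below. It assumes the service writes to CloudWatch Logs and that your AWS credentials are already configured; the log group name is a placeholder.

```python
# Sketch: pull the last 15 minutes of ERROR lines from CloudWatch Logs.
# The log group name below is hypothetical; find the real one in the AWS console.
import time
import boto3

logs = boto3.client("logs")  # assumes credentials/region are configured

resp = logs.filter_log_events(
    logGroupName="/levana/production/bots",        # placeholder log group
    startTime=int((time.time() - 15 * 60) * 1000), # CloudWatch uses epoch millis
    filterPattern="ERROR",
)
for event in resp["events"]:
    print(event["timestamp"], event["message"])
```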
Discussions of ongoing outages can take place in the #production-outage-discussion channel on Slack.
If you require crypto for testing, please set up the shared team wallet.
You can check who is currently on call within the Levana Slack workspace by sending the message /genie whoisoncall.
Reviewing Alert Details
- Examine the details provided in the alert to understand the nature and severity of the issue.
- Check any attached logs, metrics, or links for additional context.
- Remember that alerts from UptimeRobot will not include the response body from endpoints! You'll need to look at the status page to see those details, or fetch the endpoint directly as in the sketch below.
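Since UptimeRobot's alert omits the body, one way to capture it yourself is to hit the monitored endpoint directly. A minimal sketch, where the URL is a placeholder for whichever endpoint is alerting:

```python
# Fetch the monitored endpoint directly to see the response body
# that UptimeRobot's alert omits. The URL below is a placeholder.
import requests

endpoint = "https://bots.example.com/status"  # hypothetical monitored endpoint

resp = requests.get(endpoint, timeout=10)
print(f"HTTP {resp.status_code}")
print(resp.text[:2000])  # print the start of the body; errors usually show here
```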
Initial Troubleshooting
- Start troubleshooting based on the information provided in the alert.
- Document your actions and findings for future reference and communication.
Escalation (if solution not found in 30 minutes)
- If you cannot resolve the issue within 30 minutes, you should escalate to the next on-call person.
- If you know in less than 30 minutes that you will not be able to resolve the issue, escalate earlier.
- If you believe it is warranted, escalate directly to Michael Snoyman even if he is not on call.
- To escalate an alert in OpsGenie, open the alert in the OpsGenie app or web interface and add Michael Snoyman as a responder. An API-based alternative is sketched below.
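If the app or web interface isn't handy, OpsGenie also exposes this through its Alert API. A hedged sketch follows; the API key, alert id, and username are all placeholders, and you should verify the endpoint and payload against the current OpsGenie API documentation before relying on it.

```python
# Sketch: add a responder to an existing alert via the OpsGenie Alert API.
# All three values below are placeholders -- this is not a tested integration.
import requests

API_KEY = "YOUR_GENIE_KEY"            # an API integration key from OpsGenie
ALERT_ID = "alert-id-from-opsgenie"   # the id of the alert you're escalating
RESPONDER = "oncall.dev@example.com"  # hypothetical username of the next responder

resp = requests.post(
    f"https://api.opsgenie.com/v2/alerts/{ALERT_ID}/responders",
    params={"identifierType": "id"},
    headers={"Authorization": f"GenieKey {API_KEY}"},
    json={"responder": {"type": "user", "username": RESPONDER}},
)
resp.raise_for_status()
print(resp.json())
```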

Follow-Up
- After escalating, stay available for any follow-up questions or assistance.
- Monitor the progress of the escalated alert.
Resolution and Documentation
- Once the issue is resolved, ensure that the alert is marked as resolved in OpsGenie.
- Document the resolution steps and any lessons learned to improve future responses.
Recommended communication around outages
- If you're looking into an outage, say so on Slack. The #engineering channel is a reasonable place to do so, or add a threaded comment on #production-monitoring.
- If you're unable to resolve an alert, you need to decide whether to ignore the alert, defer the alert, or escalate the alert. Let's go through each case.
- Ignore an alert if you know that the alert is bogus.
- Real-life example: UptimeRobot mistakenly alerted that levana.finance was down. It's the middle of the night. You've confirmed manually in your browser and via isup.me that the site is working.
- Acknowledge the OpsGenie alert so that no one else receives the alert.
- Add a threaded comment on #production-monitoring that you're ignoring the alert because it's spurious.
- Add a comment on the #engineering channel about ignoring the alert. This is important for two reasons:
- We need to make sure to resolve the spurious alert in the morning so that real alerts can fire again.
- You may be mistaken for some reason, and this actually needs to be addressed.
- Defer an alert if there's a real problem but fixing it right now is a bad idea.
- Example: you see that the indexer is unable to process new blocks because of a bug, but it's 3am.
- You know it's a terrible idea to deploy production code in the middle of the night without code review.
- Acknowledge the OpsGenie alert.
- See if users are impacted and consider setting an emergency banner.
- Send a message on #engineering describing the situation, and resume bug fixing in the morning.
- Escalate an alert if the production system is impacted in a significant way and you're unable to resolve it.
- If it's the middle of the night, escalate within OpsGenie to make sure the notification pierces Do Not Disturb settings.
- Put messages on whatever Slack channels make the most sense, and don't be stingy: let everyone know there's a major issue.
- There's a balancing act here between proper responsiveness to a production system and overzealously disturbing people's personal time. You'll need to decide whether the situation warrants it on a case-by-case basis.