On-call overview
This page covers the overall responsibilities of the on-call engineer. The purpose of the on-call engineer is to examine alerts, fix anything that can be fixed easily, and escalate to others when necessary.
More details and recommendations on how to resolve problems are available in the on-call guide document; please use that page as a reference. All engineers working at Levana must be familiar with the contents of this page.
- Slack channels, OpsGenie, and UptimeRobot
- Slack channel rundown
- Handling OpsGenie and production-monitoring alerts
Slack channels, OpsGenie, and UptimeRobot
The basic mechanism for our monitoring and alerting system looks as follows:
- We have a number of web services that we monitor, such as frontend sites and bots
- Some of these web services provide status pages which return an error HTTP status code if something is broken (see the sketch after this list)
- UptimeRobot is a monitoring tool that detects these failures and sends alerts to external systems as necessary
- Depending on which piece of the system is broken, UptimeRobot is configured to do one or more of the following:
- Send a message via webhook to a Slack channel (channels discussed below)
- Send an alert to OpsGenie, which will generate an alarm for the on-call engineer
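For reference, a status page of the kind UptimeRobot polls is conceptually just an HTTP endpoint that returns 200 when healthy and a 5xx status otherwise. The following is a minimal sketch only; the route name, the axum framework, and the check_downstream function are illustrative assumptions, not the real implementation of our services.

```rust
use axum::{http::StatusCode, routing::get, Router};

// Stand-in for whatever internal checks a real service performs
// (database connectivity, bot liveness, etc.).
async fn check_downstream() -> bool {
    true
}

// Returns 200 when healthy and 500 when something is broken, so an
// external HTTP monitor such as UptimeRobot can tell the two apart.
async fn healthz() -> StatusCode {
    if check_downstream().await {
        StatusCode::OK
    } else {
        StatusCode::INTERNAL_SERVER_ERROR
    }
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/healthz", get(healthz));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000")
        .await
        .unwrap();
    axum::serve(listener, app).await.unwrap();
}
```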
Since OpsGenie will send an alert to the on-call engineer at any time, and will bypass Do Not Disturb settings, we try to configure the system so that only the most serious alerts go to it. Therefore, observing the Slack channels while working is vital as well, since some alerts only go there.
Slack channel rundown
There are several Slack channels for monitoring; we use multiple channels so that alerts are segregated by priority. Anyone on-call, and in general anyone on the team, must respond to messages in the following channels:
- #production-monitoring: any production system alert should go here.
- #production-monitoring-opsgenie: keeps track of alerts that have escalated to OpsGenie. This provides some OpsGenie-specific information, such as who was alerted about an incident, whether it has been acknowledged, etc.
There are additional monitoring channels which are useful for specific purposes but do not indicate production system outages and do not require on-call coverage:
- #production-monitoring-gas: gives an early warning when the bots are running low on funds. This should fire approximately three days before we run out of funds, so there's no crisis if it goes unresolved for a few hours. If you see something here, please notify Michael Belote and ask him to refill the wallet.
- #monitoring: monitors our sandbox systems. You'll see alerts here for both testnet and mainnet, because the sandbox system is a test environment for new code and needs to operate on both testnet and mainnet data to detect issues early. If you're working on backend changes, such as modifications to the bots, querier, or indexer, you'll likely want to pay attention to messages here.
- #monitoring-gas: tracks the testnet gas funds only and should have little to no activity.
- #production-devops-monitoring: gives more fine-grained alerts about the production system. Unless you're working on the DevOps system itself, it's safe to ignore these. They generate a lot of false positives, which is why they're sent to a separate channel.
- #production-monitoring-stats: provides alerts when the mainnet markets have undesired utilization ratios, delta neutrality, etc. No engineer needs to monitor this channel, though watching it may give you opportunities to make some money by opening unpopular positions and benefiting from funding fees and DNF payments.
Summary: if you're on-call, or simply working, and a message comes into #production-monitoring, you should check it out.
Handling OpsGenie and production-monitoring alerts
See the dedicated page on alert handling steps for details.