Engineering Processes

This page covers the processes we follow as an engineering team. The goal is to improve communication around discussions and decision making. As areas needing clarification arise over time, the contents of this page will grow. Use this page as a reference whenever additional clarity is needed.

Monitoring and alerting

What needs to be monitored

graph TD
  everValid(Is this event ever valid in the system?)
  everValid-->|No|canBan(Can we prevent this event from ever occurring?)
  canBan-->|Yes|banInCode(Update code to make the situation impossible)
  banInCode-->canStillHappen(Are we concerned that this situation may still be possible?)
  canStillHappen-->|No|noAlertingNecessary(No alerting is necessary)
  canStillHappen-->|Yes|setUpAlert(Set up an alert)
  canBan-->|No|setUpAlert
  everValid-->|Yes|howCommonValid(How frequently will this event occur and be a valid state?)
  howCommonValid-->|Very infrequently|setUpAlert
  howCommonValid-->|At least somewhat frequently|designComplex(Design a complex monitor)
  setUpAlert-->canLevanaRespond(Can Levana reasonably respond to this alert condition?)
  canLevanaRespond-->|Yes|configureAlert(Configure the alert)
  canLevanaRespond-->|No|discussInternally(Discuss further internally)
  configureAlert-->teachTeam(Document and teach the team how to respond to such an alert)

The first step in monitoring a system is determining the events to look at. Generally, events that can be monitored fall into one of the following categories:

  1. Always invalid, and we can prevent them from being possible programmatically. Example: opening a position with someone else's funds. We don't need to configure an alert for such a situation, assuming we trust our code to behave as expected.
  2. Always invalid, but we can't guarantee it will never happen. Example: our frontend site goes down. We do everything possible to make our site resilient, but as we've seen, even highly trusted entities like Cloudflare can have outages.
  3. Some versions of the event are normal, but may indicate a problem in the system. Example: traders taking profits. A single trader taking profits is expected. A single trader taking very large profits once? Probably expected. Do we need an alert for it? Maybe, maybe not. See the next section for details on this.
  4. The event is completely normal, and not even indicative of a problem. For example: trade volume increases or decreases by 5%. We may want to have business level stats to track this, but there's no monitoring event needed.
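
As a concrete illustration of category (1), the preferred fix is to make the invalid event impossible at the code level rather than to monitor for it. The sketch below is hypothetical (the types and function names are illustrative, not taken from our codebase), but it shows the general pattern: validate ownership before touching funds, so the invalid state is rejected at the boundary and no downstream alert is needed.

// Hypothetical sketch: making "opening a position with someone else's funds"
// impossible by construction, rather than alerting on it after the fact.

#[derive(Debug, PartialEq, Eq)]
struct AccountId(String);

struct Wallet {
    owner: AccountId,
    balance: u128,
}

#[derive(Debug)]
enum OpenPositionError {
    NotOwner,
    InsufficientFunds,
}

/// The caller must present their own account ID; collateral can only be drawn
/// from a wallet whose owner matches. The invalid state is rejected here,
/// so no alert is needed for it downstream.
fn open_position(
    caller: &AccountId,
    wallet: &mut Wallet,
    collateral: u128,
) -> Result<(), OpenPositionError> {
    if &wallet.owner != caller {
        return Err(OpenPositionError::NotOwner);
    }
    if wallet.balance < collateral {
        return Err(OpenPositionError::InsufficientFunds);
    }
    wallet.balance -= collateral;
    // ... proceed to open the position with `collateral` ...
    Ok(())
}

If the check lives in one well-tested place like this, category (1) events only need revisiting when we doubt the code path itself (the "are we concerned that this situation may still be possible?" branch in the diagram above).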

Another aspect to keep in mind for alerts is whether or not the team can do anything meaningful about them. For example, "the Osmosis chain is not accepting new transactions" would likely be something beyond Levana's control, but Levana could still notify the community, communicate with Osmosis to get updates, and set an emergency banner. Such an alert would make sense.

A final point to mention here is alert fatigue. Depending on where alerts are sent, they may wake people up, or at the very least make them spend significant time processing. Having too many alerts, and especially false positives, is dangerous and needs to be avoided. This is discussed more in the next section.

Where monitoring alerts are sent

We have essentially four levels to which alerts can be sent:

  1. OpsGenie: the situation is so dire that, if it ever occurs, it warrants waking someone up to address it. (This level automatically implies that level (2) is warranted as well.)
  2. #production-monitoring: this is the primary Slack channel for time-critical alerts. Alerts which don't necessarily warrant a wake-up, but do demand urgent action, should go here.
  3. Alternative Slack channels: if urgent action and team-wide awareness aren't necessary, using separate Slack channels helps avoid alert fatigue.
  4. Other collection systems: this applies to things like Sentry frontend errors, Amazon analytics, and more. The idea is that, in these cases, someone has to proactively decide to go and review these events.

The goal is that, if an alert ever lands in (1) or (2), the team will understand that this is important and should be handled.
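
One way to keep these levels unambiguous in tooling is to model them explicitly and route each alert by level. The sketch below is hypothetical (the enum, function, and routing table are illustrative, not an existing API); the destination names come from the list above.

/// Hypothetical sketch of the four alerting levels described above.
enum AlertLevel {
    /// Level 1: wake someone up via OpsGenie (implies level 2 as well).
    PageOnCall,
    /// Level 2: time-critical, goes to #production-monitoring.
    Urgent,
    /// Level 3: team awareness without urgency, goes to a dedicated channel.
    Informational,
    /// Level 4: collected passively (e.g. Sentry, analytics) for later review.
    Passive,
}

/// Illustrative routing: where an alert at each level should end up.
fn destinations(level: &AlertLevel) -> Vec<&'static str> {
    match level {
        AlertLevel::PageOnCall => vec!["OpsGenie", "#production-monitoring"],
        AlertLevel::Urgent => vec!["#production-monitoring"],
        AlertLevel::Informational => vec!["#some-dedicated-channel"],
        AlertLevel::Passive => vec!["passive collection (Sentry, analytics, ...)"],
    }
}

The useful property of modeling it this way is that every new alert must pick a level, which forces the "should this wake someone up?" conversation to happen at design time rather than in the middle of the night.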

How to handle complex monitoring requirements

Many topics that warrant alerting are complex. The December 26, 2023 exploit is a prime example of this. How do you set up a monitoring system to detect that? Some ideas:

  1. Look for the exact situation that occurred: someone opens and closes a position in a short period of time using older-than-expected price points for entry. This is great for detecting a known attack vector, but doesn't help much with unknown attack vectors. And for known attack vectors, the correct solution is usually not to monitor for it, but to instead prevent it from happening (as we've done with deferred execution).
  2. Raise an alert every time a trader takes profit. That's silly: the alert fatigue would be huge, and we'd lose the forest for the trees.
  3. Raise an alert every time a trader takes profit over a certain limit. It's a possibility, but (1) may often still lead to false positives and (2) may miss many attack vectors.
  4. Raise an alert when aggregate profits over some period of time go above a certain level (sketched below). This may work, but (1) finding the right parameters is very tricky and (2) the detection may come too long after the attack has already completed.
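
To make idea (4) a little more concrete, here is a minimal, hypothetical sketch of a rolling-window aggregate check (the struct, parameter names, and units are illustrative, not an existing component). Choosing window and threshold well is exactly the parameter-tuning difficulty noted in the item above, and the window length bounds how late the alert can fire.

use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical sketch of idea (4): alert when aggregate trader profits
/// over a rolling time window exceed a configured threshold.
struct AggregateProfitMonitor {
    window: Duration,
    threshold: f64,
    /// (time the profit was realized, profit amount in collateral units)
    events: VecDeque<(Instant, f64)>,
}

impl AggregateProfitMonitor {
    fn new(window: Duration, threshold: f64) -> Self {
        Self { window, threshold, events: VecDeque::new() }
    }

    /// Record a realized profit and return true if the rolling total
    /// now exceeds the threshold (i.e. an alert should fire).
    fn record(&mut self, now: Instant, profit: f64) -> bool {
        self.events.push_back((now, profit));
        // Drop events that have fallen out of the window.
        while self
            .events
            .front()
            .map_or(false, |&(t, _)| now.duration_since(t) > self.window)
        {
            self.events.pop_front();
        }
        let total: f64 = self.events.iter().map(|&(_, p)| p).sum();
        total > self.threshold
    }
}

Something like monitor.record(Instant::now(), realized_profit) would be called for each profit-taking event; a true return value means the rolling total has crossed the threshold and an alert should be raised.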

The point of this section isn't to say "this is how you monitor something complex." It's also not to say "it's impossible to monitor for complex situations." Instead, it's to point out that there are many cases where there are no easy wins. When we've identified such a situation, we need to:

  1. Set aside serious time to design the requirements
  2. Have technical brainstorming sessions involving relevant stakeholders
  3. Design and implement a solution
  4. Regularly review data manually and adjust our initial solution

Technical disagreements

There's an older and related document, originally from Notion but since migrated to this site: guideline to efficient technical discussions. This section is intended as a more direct guide for troubleshooting a broken discussion.

  1. Identify why each side is making their claims. Every stance should ultimately have a business need it's trying to address.
  2. Compare how each solution addresses those core business needs. If there are any gaps, identify them. In some cases, no solution will fully address all business needs, and it's ultimately a business decision around which trade-offs are acceptable.
  3. If a topic is contentious, and there are other decisions that can be made without making a decision on that topic, table the discussion until later.
  4. In emergency situations, taking short-cuts may occasionally be absolutely necessary, and it's worth calling those out explicitly.
  5. If you believe your proposal is not being considered fairly, your best course of action is to stop debating and instead articulate the proposal clearly. Oftentimes, the process of articulating it clearly will either convince the other side or reveal a flaw in your proposal. Either outcome helps move the discussion forward.
  6. Do not repeat the same proposal. If a topic is contentious, and a proposal is not being accepted, continuing to raise it in future discussions, or slightly modified versions of it, is bad communication.
  7. If you're convinced that you are correct, that you've answered all objections from the other side, and you're still not making progress by articulating the idea more clearly, you will need to resort to escalation: asking someone with more authority to intervene. Since this is Michael Snoyman writing, I'll say explicitly for myself: at any time, feel free to raise a concern directly with Jonathan. My recommendation is to consider carefully how you do this, and weigh what you believe are critical errors versus minor differences of opinion. Be sure to properly explain the motivations of whoever you're arguing with (e.g., me) to avoid strawman arguments and wasting more time.
  8. There are many valid considerations to weigh when making a technical decision. These include, but are not limited to:
    • What are the business needs?
    • How complex will the implementation be? This is relevant because of both:
      • Time to market/cost of implementation
      • Risk to the project from making the change
    • How familiar are we with the technologies involved? Even if an alternative technology seems like a better fit, familiarity with an existing approach is a very valid reason to stick with it. One simple reason why: you may be experiencing a "grass is greener" fallacy, and the new technology in fact has flaws you're simply not familiar with yet.
    • How much code have we already written in a different direction? This is not only about the risks and costs of change; it's also about hidden requirements. When a codebase has been developed over a long time, it's common to have "old knowledge" baked in: resolutions to issues you may not even remember having encountered.