On-call incident guide
Please see the on-call overview document for the general requirements of being on-call. This page provides recommendations and collected knowledge on how to resolve different kinds of errors.
Anyone on the team should feel free at any point to add new information here, no PR required. (If you need feedback on your updates, feel free to use a PR instead.)
- Checking status
- Checking Incident reports
- Emergency banner
- Checking AWS logs
- Slack contact list
- Raw ingester grpc endpoint fails
- Out of Sync validator nodes
- Validator health
- Indexer starts failing after a deploy
- Bots fail when trying to fetch factory contract
- Alternate GRPC endpoints
- Find error ID in Sentry
- Osmosis mainnet daily downtime
- Cosmos SDK error codes
- Gas versus fees
- Stale markets error page
- Running ingester locally to resolve ingestion issues
- Events API from indexer
- Price issue
Checking status
The status pages document contains links to various status pages to check on the health of the system. This is always the current, live status. Historical outages require looking at our various logging services like Sentry, Uptime Robot, and AWS logs.
Checking Incident reports
Uptime robot sends incident reports to various monitoring channels. The most important is #production-monitoring as this is where mainnet outages are sent.
- Click the "View Incident Details" button; this will open the Uptime Robot page.
- (If necessary) switch teams to Michael's team and then hit the "View Incident Details" button again to get to the specific incident.
- There is unlikely to be enough useful information on the page itself; instead hit the "Full Response" button, which will download a log of the incident.
- In this file, scroll down past the styling and HTML preamble to get to the actual meat of the page, which may contain something like "Transaction failed" or some other error.
Emergency banner
The Downtime Announcement Page on Notion is a vital communication tool for interacting with Levana Perps users during service disruptions. It enables the support team to keep users informed while they investigate and develop solutions to resolve the issue. This page includes a text field specifically for announcements; text entered here is displayed as a banner on the website, accessible to all users. The banner automatically vanishes when the text is deleted and the field is emptied.


Checking AWS logs
- Login to AWS
- Navigate to the Elastic Container Services in the Paris region
- Go to the service in question, and switch to the Logs tab (e.g. bots)
- Use the filters - narrow down the time window, search for keywords like "sequence" or "error", etc.
Related external status pages:
- Osmosis: https://status.osmosis.zone/
- Injective: https://status.injective.network/
- Pyth: https://pyth.network/stats
- Kingnodes: https://levana-status.kingnodes.com/
- Cloudflare: https://www.cloudflarestatus.com/
For checking logs via the CLI, refer to the Deployment logs page.
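If you prefer the terminal, the same CloudWatch logs can also be tailed with the AWS CLI. A minimal sketch; the log group name below is hypothetical and should be taken from the service's Logs tab, and the Paris region is eu-west-3:
❯ aws logs tail /ecs/bots --region eu-west-3 --since 1h --filter-pattern "error" --follow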
Slack contact list
Support members on-call on Saturdays should join these Slack channels to ensure they can coordinate with third parties if issues arise.
| Component | Slack channel | Primary contact | Secondary contact |
|---|---|---|---|
| Osmosis | #levana-osmosis | Adam | Sunny |
| Sei | #levana-sei-collab | Philip | Uday |
| Injective | #injective-levana | Achilleas | |
| Pyth | #levana-pyth | Marc | |
| Kingnodes | #levana-kingnodes | Jerome | |
Raw ingester grpc endpoint fails
Solution: Run the raw ingester locally with an alternative grpc endpoint. Ingester is here.
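A minimal sketch of what that looks like, reusing the environment variables and just recipe described under "Running ingester locally to resolve ingestion issues" below (the endpoint shown is just an example that appears elsewhere on this page; the Postgres and S3 prerequisites from that section still apply):
export COSMOS_NETWORK="osmosis-testnet"
export COSMOS_GRPC="https://grpc.testnet.osmosis.zone"
❯ just run-ingester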
Out of Sync validator nodes
If the indexer or bots are giving alerts, there is a likelihood that the validator nodes themselves are out of sync.
One way to check that is to use the Mintscan explorer and look at the recent transactions. These are the links:
A sample snapshot showing that there have been no transactions for an hour:

This most likely indicates that the validator is out of sync with the chain. You would want to check with the node providers about this. You can also check in Mintscan which validators they are using. Example for Osmosis testnet: https://testnet.mintscan.io/osmosis-testnet/validators
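You can also ask a node directly whether it is catching up via its RPC status endpoint. A sketch, assuming a public CometBFT RPC URL (replace it with the provider's actual endpoint):
❯ curl -s https://rpc.osmosis.zone/status | jq .result.sync_info
If catching_up is true, or latest_block_time is far in the past, the node is behind the chain.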
Validator health
Osmosis stats page: https://osmosis-stats.blockpane.com/. This may also be useful for detecting congestion issues.
Indexer starts failing after a deploy
Solution: Immediately roll back to the previous stable version. That is done by changing the commit hash in the devops repository. As of this writing, this is done in two places:
- For the testnet ECS deployment.
- For the mainnet ECS deployment.
Only then investigate what’s wrong in the current version.
Bots fail when trying to fetch factory contract
This is what a sample log looks like when it fails:
[2023-10-17T02:01:07Z INFO perps_bots::cli] Wallet address for gas: osmo1s503v5v37qpvwcdchj395l925trnje6hw46krl
Error: Unable to get 'factory' contract
Caused by:
0: Calling ContractByFamily with factory and osmoqa against osmo1l3ypxctzcdev9dmyydlzrflqz52f0wsqmzkuyq9dvpdnqx69wvvq9zszh2
1: Error response from gRPC endpoint: Status { code: Unknown, message: "panic", metadata: MetadataMap { headers: {"server": "nginx", "date": "Tue, 17 Oct 2023 02:01:07 GMT", "content-type": "application/grpc", "x-cosmos-block-height": "3197200"} }, source: None }
You can use the cosmos binary to debug it further. Check if you are able to query the contract:
❯ cosmos query-contract --network osmosis-testnet osmo1l3ypxctzcdev9dmyydlzrflqz52f0wsqmzkuyq9dvpdnqx69wvvq9zszh2 '{"version": {}}'
Error: Error response from gRPC endpoint: Status { code: Unknown, message: "panic", metadata: MetadataMap { headers: {"server": "nginx", "date": "Tue, 17 Oct 2023 02:23:44 GMT", "content-type": "application/grpc", "x-cosmos-block-height": "3197200"} }, source: None }
You can also try to override the GRPC endpoint and see if it works. Example:
❯ cosmos contract-info --cosmos-grpc="https://grpc.testnet.osmosis.zone"
You can check if the contract actually exists:
❯ cosmos contract-info --cosmos-grpc="https://grpc.testnet.osmosis.zone" --network osmosis-testnet osmo1l3ypxctzcdev9dmyydlzrflqz52f0wsqmzkuyq9dvpdnqx69wvvq9zszh2
code_id: 48
creator: osmo12g96ahplpf78558cv5pyunus2m66guykt96lvc
admin: osmo12g96ahplpf78558cv5pyunus2m66guykt96lvc
label: Levana Perps Tracker
Alternate GRPC endpoints
Find error ID in Sentry
Errors on the webapp can be mapped to Sentry with an ID.
- Click the more details button to the right of the error notification
- A modal will be presented and the error ID can be found at the top
- Copy this ID and navigate to Sentry
- Within Sentry, focus on the Custom Search input and add the key transaction.id with the value as the error ID
- Example: transaction.id:0TX720DLL
Note: Only error IDs in production and staging will be sent to Sentry.
Osmosis mainnet daily downtime
Osmosis mainnet has a period of about 15 minutes every day in which it's not processing any new transactions. That period may trigger crank alerts and potentially some other types of alerts also.
You can read more about how the epoch works here. According to the Osmosis team, each epoch happens 24 hours after the previous one.
Query for finding the last epoch:
❯ osmosisd query epochs epoch-infos --node https://osmo-priv.kingnodes.com:443
epochs:
- current_epoch: "852"
current_epoch_start_height: "11955200"
current_epoch_start_time: "2023-10-18T17:16:09.898160996Z"
duration: 86400s
epoch_counting_started: true
identifier: day
start_time: "2021-06-18T17:00:00Z"
- current_epoch: "121"
current_epoch_start_height: "11880838"
current_epoch_start_time: "2023-10-13T17:02:07.229632445Z"
duration: 604800s
epoch_counting_started: true
identifier: week
start_time: "2021-06-18T17:00:00Z"
Use the date utility to convert it to your local time:
❯ date -d "2023-10-18T17:16:09.898160996Z"
Wed Oct 18 10:46:09 PM IST 2023
So the last epoch happened on October 18th at 10:46 PM local time. If you get alerts around the same time on October 19th, it's highly likely that they are caused by the epoch.
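Since the epoch duration is 86400s, you can estimate when the next epoch (and therefore the next likely alert window) will start by adding 24 hours to current_epoch_start_time. A sketch using GNU date; the output shown is for a machine in IST, as in the example above:
❯ date -d "@$(( $(date -d '2023-10-18T17:16:09Z' +%s) + 86400 ))"
Thu Oct 19 10:46:09 PM IST 2023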
Cosmos SDK error codes
You can find a list of Cosmos SDK error codes in the cosmos-sdk repo.
Note that the error codes are namespaced, so CosmWasm error codes are defined elsewhere, in the wasmd repo.
These do conflict over the error code numbers, e.g. code 5 in the Cosmos SDK means "insufficient funds" while code 5 in CosmWasm means "contract execution failed".
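If you have the hash of a failed transaction, the codespace field in the transaction result tells you which namespace its code belongs to. A sketch using osmosisd with the node endpoint reused from the epoch query below; the transaction hash is a placeholder:
❯ osmosisd query tx <TX_HASH> --node https://osmo-priv.kingnodes.com:443 --output json | jq '{code, codespace, raw_log}'
A codespace of "sdk" generally means the Cosmos SDK table applies, while "wasm" means the CosmWasm table applies.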
Gas versus fees
There are a few similar error messages you may see coming from the Cosmos SDK:
- out of gas in location: wasm contract; gasWanted: 198152, gasUsed: 202921: out of gas (code 11)
- Transaction failed with insufficient fee (13) and log: insufficient fees; got: 1488uosmo which converts to 1488uosmo. required: 9129uosmo: insufficient fee (code 13)
- Insufficient funds (code 5)
All of these have the flavor of "not enough gas," but they all mean different things, and therefore the solution to them is completely different. Let's start with terminology:
- Every action you take on chain requires some gas. Gas is a unit of the work necessary to perform your action. You can think of it as something like "how much CPU time will it take to perform this action," though that's a huge simplification.
- When you run a transaction, you have to declare in advance how much gas you want to use. This is gasWanted. This is defined when you construct the transaction.
- The usual way we calculate gasWanted is by simulating the transaction, seeing how much gas it actually took during simulation to get simulated gasUsed, and then multiplying it by the gasMultiplier. The gasMultiplier accounts for both variations in how much work is performed (more on this in a bit), plus some straight up bugs in Cosmos where simulated gas underreports how much gas will be needed for a transaction.
- Once you have a gasWanted, you need to provide a gas fee to pay for that gas. This is usually done in the native coin for the chain (e.g. uosmo). You determine how much gas fee to provide by multiplying gasWanted by the gas price (see the worked example after this list).
- Osmosis has rolled out a fee market mechanism where the gas price goes up and down based on demand. Most other Cosmos chains have a fixed gas price.
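To make these relationships concrete, here is a small worked example. The numbers are made up for illustration and are not real chain parameters:
simulated gasUsed = 150000
gasMultiplier = 1.3
gasWanted = 150000 * 1.3 = 195000
gas price = 0.0075 uosmo per unit of gas
gas fee = 195000 * 0.0075 = 1462.5, rounded up to 1463uosmo
If the transaction ends up needing more than 195000 gas, you hit code 11; if the chain's minimum gas price is higher than 0.0075, you hit code 13; and if the wallet cannot cover the 1463uosmo fee, you hit code 5.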
With this information in mind, let's go back to our errors above.
- Code 11, out of gas, occurs when the gasWanted you specified in the transaction is insufficient to perform the transaction. This can happen because of bugs in Cosmos (this is especially true for coin transfers for some reason), or because the work performed is different from the work simulated. We have a common case of this in Perps: cranking. Each time you crank, you're working on a new set of work items. It's fairly common for cranking to fail with code 11 because the simulated work was simpler to perform than the actual work.
- Code 13 (insufficient fee) is when the gas price you used is insufficient. This used to just be a configuration issue, but with Osmosis's fee market implementation, it seems that this can happen because different nodes have miscalculated the gas price.
- Code 5 (insufficient funds) looks a lot like code 13, but actually means something else entirely: you specified a certain amount of gas funds, let's say 0.3 OSMO. However, your wallet didn't have that balance available.
Solutions to each of these are completely different:
- Code 5 can almost always be solved by transferring more coins into the wallet in question.
- Code 13 is usually solved by increasing the gas price. How you do that is a separate question. With Keplr, for instance, you can manually set a gas fund amount. Remember that this is different from the gas amount itself! This is the same concept as Keplr showing low, medium, and high gas prices.
- Code 11 is usually solved by either trying again (the correct response for crank errors), increasing the gasMultiplier, or manually setting a higher gasWanted.
If you're looking at bots, the cosmos-rs library handles a lot of this logic, and you'd probably need to make a modification to that library, or at least to environment variables on the bots, to make a change. Generally, if you're hitting an issue with an automated tool, the best thing to do is to try to manually perform the same action (either via the cosmos-rs CLI tool or the smart contract GUI).
The current Osmosis gas price can be obtained via https://lcd.osmosis.zone/osmosis/txfees/v1beta1/cur_eip_base_fee or https://querier-mainnet.levana.finance/v1/chain/osmosis-mainnet-gas-price
Note that our bots set a maximum on the gas price on Osmosis mainnet. If you see "transaction failed with insufficient fee," you should check at the URLs above if the gas price has gone too high. At time of writing, our bots limit the gas price to 0.0054. If the number is higher than that, it's expected that transactions will fail because validator nodes will not pick up our transactions. In such a case, you should contact the Osmosis team for clarity on the current situation and notify the #product channel of the congestion.
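To check quickly from a terminal, a sketch using the two URLs above (jq is only for readability; the exact JSON shape is whatever each endpoint returns):
❯ curl -s https://lcd.osmosis.zone/osmosis/txfees/v1beta1/cur_eip_base_fee | jq .
❯ curl -s https://querier-mainnet.levana.finance/v1/chain/osmosis-mainnet-gas-price | jq .
Compare the returned value against the 0.0054 cap mentioned above.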
Stale markets error page
When some markets become stale, open the status page for the bots (included in the alert) and check not only the Stale section, but also the sections for individual crank messages, even if they're green. They might include useful information.
Running ingester locally to resolve ingestion issues
Note that for this step to work, you would have to get in touch with KingNodes to whitelist your IP address temporarily. If instead you want to ingest a single block at a specific height, do this under the indexer repository:
❯ just single-ingest 69443328
There may be issues with the ingester not being able to process blocks. One easy way to resolve this is to run the ingester locally and see if it's able to process blocks and subsequently upload them to the S3 storage.
For running this locally, you would need these things:
- Access to the indexer codebase.
- Postgres instance running locally
- S3 bucket credentials (Get in touch with Sibi for this)
- just tool.
These are the high-level steps for running the ingester locally for Sei testnet:
- Run Postgres locally. Go inside the indexer repository's packages/indexer directory, where there is a justfile to make things easier. You can run Postgres like this:
❯ just postgres
docker run --name indexer_pg --rm -it -e POSTGRES_PASSWORD=postgres -p 5432:5432 postgres:15.3-alpine -c log_statement=all
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.utf8".
...
- Export the S3 credentials locally:
export AWS_ACCESS_KEY_ID="REDACTED"
export AWS_SECRET_ACCESS_KEY="REDACTED"
- Based on the network, export the appropriate environment variables. Since we are going to run for Sei testnet, these are the steps:
export COSMOS_NETWORK="sei-testnet"
export COSMOS_GRPC="https://grpc-testnet.sei-apis.com"
export COSMOS_GRPC_FALLBACKS="https://grpc.atlantic-2.seinetwork.io"
To know the exact string representation for the COSMOS_NETWORK environment variable, you can refer to the code here.
To find out the gRPC endpoints that you can use, refer to the file here.
- Do a db reset to initialize the tables:
❯ just db-reset
env DATABASE_URL="postgres://postgres:postgres@localhost/indexer" sqlx database reset -y
Applied 0/migrate initial (32.973392ms)
Applied 1/migrate event tables (84.317193ms)
...
Now let's say that you want to start from a block height of 60000000; you would have to do this:
❯ just psql
env PGPASSWORD="postgres" psql -U postgres -h localhost -d indexer
psql (15.5, server 15.3)
Type "help" for help.
indexer=# select * from chain;
id | chain_id
----+---------------
1 | atlantic-2
2 | osmosis-1
10 | injective-1
11 | injective-888
14 | pacific-1
(5 rows)
indexer=# INSERT INTO latest_block_bulk VALUES (1, 60000000, now());
INSERT 0 1
If a value is already present, you can update it like this:
UPDATE latest_block_bulk SET height = '60000001' WHERE chain = 1;
- And now you can start the ingester:
❯ just run-ingester
env LEVANA_INDEXER_BIND="[::]:3003" PGPASSWORD="postgres" PGHOST="localhost" PGUSER="postgres" PGDATABASE="indexer" PGPORT="5432" cargo run --bin indexer raw-ingester-bulk
Finished dev [unoptimized + debuginfo] target(s) in 0.17s
Running `/home/sibi/fpco/github/levana/levana-indexer/target/debug/indexer raw-ingester-bulk`
pid1-rs: Process not running as Pid 1: PID 36349
2024-01-12T06:10:07.416625Z INFO indexer::raw_ingester: Checking if https://static.levana.finance/perps-indexer/ingested__atlantic-2__60000000-60002000.json.br exists, received: Response { url: Url { scheme: "https", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("static.levana.finance")), port: None, path: "/perps-indexer/ingested__atlantic-2__60000000-60002000.json.br", query: None, fragment: None }, status: 403, headers: {"date": "Fri, 12 Jan 2024 06:10:07 GMT", "content-type": "application/xml", "cf-ray": "84433c0abbef6033-SIN", "cf-cache-status": "DYNAMIC", "access-control-allow-origin": "*", "access-control-allow-headers": "*", "access-control-allow-methods": "GET, OPTIONS", "x-amz-id-2": "749v7S+wEOIdz8I/rGiDC/GOwwW7XvtlG3KNZU1EMF5rtNyKOHZufSa6fxtRcOuDMYgzQTZ3mD0=", "x-amz-request-id": "Z1T07RWWYQRWEJNQ", "server": "cloudflare"} }
2024-01-12T06:10:07.417025Z INFO indexer::raw_ingester: https://static.levana.finance/perps-indexer/ingested__atlantic-2__60000000-60002000.json.br does not exist
2024-01-12T06:10:09.985046Z INFO indexer::raw_ingester::ingest: Successfully ingested block 60000031, total txs: 1, filtered txs: 0
2024-01-12T06:10:10.877006Z INFO indexer::raw_ingester::ingest: Successfully ingested block 60000029, total txs: 2, filtered txs: 0
2024-01-12T06:10:10.937725Z INFO indexer::raw_ingester::ingest: Successfully ingested block 60000008, total txs: 3, filtered txs: 0
2024-01-12T06:10:11.021031Z INFO indexer::raw_ingester::ingest: Successfully ingested block 60000017, total txs: 2, filtered txs: 0
2024-01-12T06:10:11.927936Z INFO indexer::raw_ingester::ingest: Successfully ingested block 60000015, total txs: 4, filtered txs: 0
Events API from indexer
The events API in the indexer can be a convenient way to get a stream of events related to a wallet. For example, when investigating a user claim of incorrect calculations, I used the following URL:
https://indexer-testnet.levana.finance/events?wallet=sei1jdr8zu4j488wqxx97frr0u8qsz799ywkhsfq3v
This gave me a JSON response with all events for the wallet. (Note: you may have to paginate.) By looking through, I was able to find the close position event I was looking for, get the transaction hash, and then look up the transaction on a block explorer.
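The same data can be fetched from the command line; a sketch using the URL above (jq is only for readability):
❯ curl -s "https://indexer-testnet.levana.finance/events?wallet=sei1jdr8zu4j488wqxx97frr0u8qsz799ywkhsfq3v" | jq .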
Price issue
At a high level, this is what the Price bot does:
- Fetches the prices from the Hermes HTTP endpoint
- Fetches the oracle price from on chain (stored in the Pyth contracts)
Based on the above, it calculates whether the price should be updated. This fetch layer can also result in errors if there is an issue with either the Hermes endpoint or the Pyth contract.
If you receive an error like this:
AKT_USD: price is too old. Check the price feed and try manual cranking in the frontend.
Feed info: Pyth feed 0x4ea5bb4d2f5900cc2e97ba534240950740b4d3b89fe712a94a7304fd2fd92702.
Publish time: 2024-08-06 15:56:58 UTC. Checked at: 2024-08-06 15:57:57.539826136 UTC.
Age: 59s. Tolerance: 42s.
You can receive the above error in two possible scenarios:
- The price fetched from the Hermes endpoint is not new enough and has exceeded the configured tolerance in seconds.
- There is significant chain congestion delaying when the price lands on-chain, i.e. you have already updated the price via the bots, but it has not yet landed on chain because of ongoing congestion. There is a Jira ticket to report and distinguish this kind of issue.
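If you suspect the first scenario, you can check whether Hermes itself is serving a fresh price by querying it directly and comparing the reported publish time with the current time. A sketch assuming the public Hermes instance at hermes.pyth.network (the bots may be configured against a different endpoint); the feed ID is taken from the error above, and curl's -g disables URL globbing so the ids[] parameter passes through unchanged:
❯ curl -s -g "https://hermes.pyth.network/v2/updates/price/latest?ids[]=0x4ea5bb4d2f5900cc2e97ba534240950740b4d3b89fe712a94a7304fd2fd92702" | jq '.parsed[0].price.publish_time'
❯ date -u +%s
If the difference between the two numbers exceeds the configured tolerance, the problem is on the Hermes side.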
If it's happening because of issues with the Hermes endpoint, reporting to these channels would be a good next step:
- #levana-pyth
- #p2p-levana-grpc