Consensus voting in uptime monitoring - why it makes sense

Naive multi-region monitoring reduces false negatives but increases false positives. Consensus voting addresses both. An educational breakdown of the pattern.

The weak spot of single-region monitoring

Single-region monitoring means your checker runs in one location. If a network problem happens on the path between this location and your server (BGP flap, routing change, ISP maintenance), the monitor reports DOWN even though the server is fine.

This is a false positive. You get the alert, you wake up at 3 AM, you open your laptop and find the site is working.

False positives hurt for two reasons:

  1. Alert fatigue. If notifications keep turning out to be false, you gradually start to ignore them, even the real ones.
  2. Loss of trust in monitoring. The team stops reacting to alerts because "it's probably just BGP again".

Naive multi-region solves one problem, makes the other worse

The common way to address this is multi-region monitoring. The checker runs from multiple locations and an alert fires any time any of them reports DOWN.

This improves detection of real outages: when the server is truly unreachable, several checkers report DOWN, not just one.

But the false positive problem gets worse. With one checker you had some rate of false alerts. With three checkers it is mathematically more likely that at least one of them will falsely report DOWN. You get more false alerts, not fewer.
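
To put a rough number on this, here is a minimal sketch. It assumes every checker falsely reports DOWN with the same probability p on a given check, and that checkers fail independently of each other. Both the value p = 0.01 and the independence are illustrative assumptions, not measurements; the function name is made up for this example.

```python
def any_false_down(p: float, n: int) -> float:
    """Probability that at least one of n independent checkers
    falsely reports DOWN, given a per-checker false-DOWN probability p."""
    return 1.0 - (1.0 - p) ** n


p = 0.01  # assumed per-check false-DOWN rate, for illustration only
for n in (1, 3, 5):
    print(f"{n} checker(s): {any_false_down(p, n):.4f}")
```

Under these assumptions, three checkers with an "alert on any DOWN" rule produce a false alert on roughly 3% of checks instead of 1%.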

Consensus voting solves both sides

Consensus voting works differently: the first DOWN signal does not fire an alert. You ask the other regions first. If enough of them also report DOWN to reach the threshold (by default 2 of 3 regions, counting the primary), it's a real outage. If not, it's most likely a network anomaly in one region.

Pseudocode:

result = check_http(monitor)  # check from the primary region
if result.status == 'down':
    # Ask the other regions before alerting
    secondary_results = check_from_other_regions(monitor)
    if secondary_results:
        # Default rule: 2 of 3 regions must agree on DOWN.
        # The primary's own DOWN counts as one vote (hence the + 1).
        if count_down(secondary_results) + 1 < 2:
            result.status = 'up'
            result.note = 'consensus mismatch'
    # With no secondary results, the primary's DOWN stands.

Example scenario:

primary  → DOWN  (one checker had a BGP flap)
region_a → UP    (other regions see the server normally)
region_b → UP

Result: UP. No alert. Entry in debug log.

And the opposite, a real outage:

primary  → DOWN
region_a → DOWN
region_b → DOWN

Result: DOWN. Alert goes via Telegram/email/webhook.
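
The two scenarios above can be reproduced with a small runnable sketch of the voting rule. The function name `decide_status`, its signature, and the "no secondary results" note are assumptions made for this example; only the 2-of-3 default and the "consensus mismatch" note come from the pseudocode.

```python
from typing import Mapping, Tuple


def decide_status(primary: str, others: Mapping[str, str],
                  min_down: int = 2) -> Tuple[str, str]:
    """Return (final_status, note) for one check.

    primary  -- status seen by the primary region ('up' or 'down')
    others   -- statuses from the other regions, keyed by region name
    min_down -- total DOWN votes (primary included) needed to confirm
    """
    if primary != "down":
        return primary, ""
    if not others:
        # No second opinion available: the primary's DOWN stands.
        return "down", "no secondary results"
    down_votes = 1 + sum(1 for status in others.values() if status == "down")
    if down_votes >= min_down:
        return "down", ""
    return "up", "consensus mismatch"


# BGP flap seen only by the primary region
print(decide_status("down", {"region_a": "up", "region_b": "up"}))
# -> ('up', 'consensus mismatch')

# Real outage seen everywhere
print(decide_status("down", {"region_a": "down", "region_b": "down"}))
# -> ('down', '')
```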

Tradeoff: latency

Consensus voting has a cost: on a DOWN signal, querying the other regions adds a few seconds of latency before the alert fires. For most use cases (uptime monitoring at a one-minute interval) that is negligible. For very strict SLAs that require detection in under 30 seconds, those extra seconds can take a significant share of the budget.
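
One way to keep that extra latency bounded is to query the other regions in parallel and stop waiting after a fixed timeout. The sketch below uses Python's standard thread pool; the probes are stand-in functions (a real one would make an HTTP call to a worker), and the names `query_regions` and `slow_probe` as well as the 0.3 s timeout are assumptions for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def query_regions(probes, timeout_s):
    """Run all probes concurrently and return {region: status}
    for those that answered within timeout_s seconds."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(probes)))
    futures = {pool.submit(fn): region for region, fn in probes.items()}
    done, _late = wait(futures, timeout=timeout_s)
    pool.shutdown(wait=False)  # do not block on regions that are still running
    results = {}
    for fut in done:
        try:
            results[futures[fut]] = fut.result()
        except Exception:
            results[futures[fut]] = "error"
    return results


def slow_probe():
    time.sleep(1.0)  # simulates a region that answers too late
    return "down"


start = time.monotonic()
results = query_regions({"region_a": lambda: "down", "region_b": slow_probe},
                        timeout_s=0.3)
elapsed = time.monotonic() - start
print(results, f"{elapsed:.1f}s")
```

With this approach the added delay is capped by the timeout rather than by the sum of all regional checks. How to count a region that did not answer in time (ignore it, or treat it as a vote) is a policy decision this sketch leaves open.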

When multi-region makes no sense

  • Internal APIs on 192.168.x.x. Nobody outside your network can reach them, so multi-region checks from the internet are meaningless. For LAN targets, use a pull-agent pattern - the agent runs in your network and pushes results over HTTPS.
  • Single-customer internal app. If only a few people use it and you're one of them, you'll know about an outage before monitoring does.
  • A service that only runs in one region of the world. If your service is EU-only and you see it as DOWN from the US, it's not a false positive - it's expected.
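
For the LAN case in the first bullet, an agent inside the network could look roughly like the sketch below. Everything specific here is a placeholder: the ingest URL, the JSON field names, and the function names are assumptions for illustration and do not describe any particular product's API. The example only builds the HTTPS request; actually sending it would be a `urllib.request.urlopen(req)` call.

```python
import json
import time
import urllib.request
from typing import Tuple


def check_local(url: str, timeout_s: float = 5.0) -> Tuple[str, float]:
    """Probe a target inside the LAN; return (status, latency in ms)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = "up" if resp.status < 400 else "down"
    except OSError:
        # Covers connection errors, timeouts and HTTP 4xx/5xx responses.
        status = "down"
    return status, (time.monotonic() - start) * 1000.0


def build_report(ingest_url: str, monitor_id: str,
                 status: str, latency_ms: float) -> urllib.request.Request:
    """Build the HTTPS POST the agent would push to the monitoring service."""
    body = json.dumps({
        "monitor": monitor_id,        # hypothetical field names
        "status": status,
        "latency_ms": round(latency_ms, 1),
    }).encode("utf-8")
    return urllib.request.Request(
        ingest_url, data=body,
        headers={"Content-Type": "application/json"}, method="POST")


# Placeholder values; no network traffic happens in this example.
req = build_report("https://monitoring.example.com/ingest",
                   "intranet-api", "up", 12.5)
print(req.get_method(), req.full_url)
```

Because the agent initiates the connection outward, nothing in the private network has to be exposed to the internet.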

How we do it in ePulz.io

Consensus voting is built into the ePulz.io architecture. gather_multiregion() and combine_consensus() in monitoring.py implement the pattern above. The threshold (how many regions must confirm DOWN) is configurable via the min_down parameter.

In the Check table, each row stores a consensus field as a comma-separated string (e.g. "primary:up,region_a:up,region_b:down"), so when debugging you have an exact record of how the decision was made.
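
When digging through such records, a few lines of Python are enough to turn the field back into per-region votes. This is a sketch based only on the example string above; the helper name `parse_consensus` is made up here, and it assumes region names contain neither commas nor colons.

```python
def parse_consensus(field: str) -> dict:
    """Parse 'region:status,region:status' into {region: status}."""
    votes = {}
    for item in field.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty fields and trailing commas
        region, sep, status = item.partition(":")
        if not sep:
            raise ValueError(f"malformed consensus entry: {item!r}")
        votes[region.strip()] = status.strip().lower()
    return votes


votes = parse_consensus("primary:up,region_a:up,region_b:down")
down_regions = [region for region, status in votes.items() if status == "down"]
print(votes)         # {'primary': 'up', 'region_a': 'up', 'region_b': 'down'}
print(down_regions)  # ['region_b']
```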

For self-hosting a worker in another region, the admin panel has a WireGuard bundle generator - it creates a tar.gz with configuration for a new worker node and adds it to worker_urls.

Conclusion

Consensus voting is not a magic solution. It won't give you zero false positives and it won't save you in a real outage. But it's a better compromise than single-region (high false positives during network anomalies) or naive multi-region (multiplication of false positives).

Check out ePulz.io. 7-day trial, 3 monitors, no credit card.

