Consensus voting in uptime monitoring: why it makes sense

Naive multi-region monitoring cuts false negatives but adds false positives. Consensus voting addresses both. Here is how the pattern works.

The weak spot of single-region monitoring

Single-region monitoring means your checker runs in one location. If a network problem happens on the path between this location and your server (BGP flap, routing change, ISP maintenance), the monitor reports DOWN even though the server is fine.

This is a false positive. You get the alert, you wake up at 3 AM, you open your laptop and find the site is working.

False positives hurt for two reasons:

Alert fatigue. If you keep getting alerts that turn out to be false, you gradually start to ignore them, including the real ones.
Loss of trust in monitoring. The team stops reacting to alerts because "it's probably just BGP again".

Naive multi-region solves one problem, makes the other worse

The common way to address this is multi-region monitoring. The checker runs from multiple locations and an alert fires whenever any one of them reports DOWN.

This improves detection of real outages: when the server is truly unreachable, every checker reports DOWN rather than just one, and a problem that is visible only from some locations is still caught. So far, so good.

But the false-positive problem gets worse. With one checker you had a certain rate of false alerts. With three checkers it is mathematically more likely that at least one of them will falsely report DOWN. You end up with more false alerts, not fewer.
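The effect can be made concrete with a small calculation. This is a minimal sketch that assumes each checker falsely reports DOWN independently with the same probability p; the 1% figure is an illustrative assumption, not a measured rate, and the function name is made up for this example.

```python
def any_of_n_false_alert(p: float, n: int) -> float:
    """Chance that at least one of n independent checkers falsely reports DOWN."""
    return 1 - (1 - p) ** n


p = 0.01  # assumed per-checker false-positive rate, for illustration only
for n in (1, 3, 5):
    print(f"{n} checker(s): {any_of_n_false_alert(p, n):.4f}")
```

Under these assumptions, three checkers with an "alert on any DOWN" rule raise the false-alert chance from 1% to just under 3% per check.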

Consensus voting solves both sides

Consensus voting works differently: on the first DOWN signal, no alert fires. You ask the other regions first. If enough of them confirm DOWN to form a majority across all regions, it's a real outage. If not, it's just a network anomaly in one region.

Pseudocode:

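result = check_http(monitor)  # check from the primary region
if result.status == 'down':
    # Ask the other regions before alerting
    secondary_results = check_from_other_regions(monitor)
    if secondary_results:
        # Default rule: 2 of 3 worker nodes must confirm DOWN.
        # The primary's own DOWN counts as one vote, hence the + 1.
        confirmations = count_down(secondary_results) + 1
        if confirmations < 2:
            result.status = 'up'
            result.note = 'consensus mismatch'
    # If no secondary region answered, the primary's DOWN stands.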

Example scenario:

primary  → DOWN  (a BGP flap between Liptov and your hosting)
eu1      → UP    (Bratislava; the other regions see the server normally)
eu2      → UP    (Liptov, secondary)

Result: UP. No alert. Entry in debug log.

And the opposite, a real outage:

primary  → DOWN
eu1      → DOWN
eu2      → DOWN

Result: DOWN. Alert goes via Telegram/email/webhook.
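Both scenarios can be replayed with a small runnable version of the rule. This is a sketch, not ePulz.io's actual code: the Verdict class, the decide function and its threshold parameter are names invented for this example, and the statuses are passed in directly instead of coming from real checks.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class Verdict:
    status: str
    note: str = ""


def decide(primary: str, secondaries: Mapping[str, str], threshold: int = 2) -> Verdict:
    """Apply the consensus rule to the results of one check cycle."""
    if primary != "down":
        return Verdict("up")
    if not secondaries:
        # No second opinion available: the primary's DOWN stands.
        return Verdict("down")
    # The primary's own DOWN counts as the first vote.
    down_votes = 1 + sum(1 for s in secondaries.values() if s == "down")
    if down_votes >= threshold:
        return Verdict("down")
    return Verdict("up", "consensus mismatch")


bgp_flap = decide("down", {"eu1": "up", "eu2": "up"})
real_outage = decide("down", {"eu1": "down", "eu2": "down"})
print(bgp_flap)     # status 'up', note 'consensus mismatch'
print(real_outage)  # status 'down'
```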

Trade-off: latency

Consensus voting has a cost: on a DOWN signal it adds a few seconds of latency while it queries the other regions. For most use cases (uptime monitoring with a one-minute interval) this is negligible. For extremely strict SLAs with detection time under 30 seconds, it can become a real trade-off.
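One way to keep that extra latency small is to query the other regions in parallel, so the added delay is roughly that of the slowest region rather than the sum of all of them. The sketch below illustrates the idea with simulated checkers; query_regions and fake_check are hypothetical names, and the 0.5 s delay is an arbitrary stand-in for a real network round trip. A real checker would also need its own request timeout so that a hung region cannot stall the vote.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def query_regions(checkers):
    """Run all secondary checks in parallel and return {region: status}."""
    with ThreadPoolExecutor(max_workers=len(checkers)) as pool:
        futures = {region: pool.submit(fn) for region, fn in checkers.items()}
        return {region: future.result() for region, future in futures.items()}


def fake_check(status, delay):
    """Build a stand-in checker that sleeps for `delay` seconds, then reports `status`."""
    def run():
        time.sleep(delay)
        return status
    return run


checkers = {"eu1": fake_check("up", 0.5), "eu2": fake_check("up", 0.5)}
start = time.monotonic()
secondary_results = query_regions(checkers)
elapsed = time.monotonic() - start
print(secondary_results, f"{elapsed:.2f}s")  # about 0.5 s in total, not 1.0 s
```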

When multi-region makes no sense

Internal APIs on 192.168.x.x. Nobody outside your network can reach them, so multi-region monitoring from the internet is meaningless. For the LAN, use a pull-agent pattern: the agent runs inside your network and pushes results out over HTTPS.
Single-customer internal app. If only a few people use it and you're one of them, you'll learn about an outage before monitoring does.
A service that runs in only one region of the world. If your service is EU-only and you see it as DOWN from the US, that's not a false positive; it's expected.
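For the internal-API case, the agent pattern mentioned above could look roughly like this: a process inside the LAN runs the check and reports the result outward over HTTPS, so nothing has to reach into the network. Everything specific here is an assumption made for illustration: the report URL, the payload fields and the build_report function are hypothetical and do not describe any real API.

```python
import json
import urllib.request

# Hypothetical collector endpoint; replace with whatever your monitoring service provides.
REPORT_URL = "https://monitoring.example/agent/results"


def build_report(monitor_id, status, latency_ms):
    """Build the outbound HTTPS POST that carries one check result."""
    body = json.dumps(
        {"monitor": monitor_id, "status": status, "latency_ms": latency_ms}
    ).encode("utf-8")
    return urllib.request.Request(
        REPORT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


report = build_report("intranet-api", "up", 42)
print(report.get_method(), report.full_url)
# Sending it would be: urllib.request.urlopen(report, timeout=10)
```

Because the connection is always initiated from inside the network, no inbound firewall rule is needed.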

How we do it in ePulz.io

ePulz.io runs consensus voting across 3 real worker nodes: primary in Liptovský Mikuláš, eu1 in Bratislava and eu2 in Liptovský Mikuláš (a secondary machine). The default threshold is 2 of 3: an outage is recorded once at least two worker nodes confirm it. This gives geographic diversity across 2 cities and hardware redundancy through two independent machines in Liptov, in case of a HW/SW failure of the primary node.

For every check the per-region results are stored (e.g. "primary:up, eu1:down"), so when debugging you have an exact record of how the decision was made.
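A record in that shape is easy to produce and to read back when debugging. The sketch below assumes the "region:status" pairs are joined with a comma and a space, as in the example above; the actual storage format may differ, and format_votes and parse_votes are hypothetical helper names.

```python
def format_votes(votes):
    """Serialize {region: status} into a compact record such as 'primary:up, eu1:down'."""
    return ", ".join(f"{region}:{status}" for region, status in votes.items())


def parse_votes(record):
    """Turn a stored record back into a {region: status} dictionary."""
    return dict(item.split(":", 1) for item in record.split(", "))


record = format_votes({"primary": "down", "eu1": "up", "eu2": "up"})
print(record)
print(parse_votes(record))
```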

Conclusion

Consensus voting is not a magic solution. It won't give you zero false positives and it won't save you in a real outage. But it's a better compromise than single-region (high false positives during network anomalies) or naive multi-region (multiplication of false positives).

Check out ePulz.io. 7-day trial, 3 monitors, no credit card.
