Monitoring system /
Technical details

This page describes the monitoring system, which has evolved significantly since its initial production deployment in March 2023. It has progressed through multiple versions; the current v4 system introduces constraint management, candidate states, and enhanced selection algorithms.

Monitoring servers

Servers registered in the NTP Pool system are monitored by monitoring servers around the world.

For each server it monitors, a monitoring server is in one of three states: “candidate”, “testing”, or “active”:

Candidate: Monitor is selected for potential assignment and serves as backup coverage
Testing: Monitor actively tests the server with reduced frequency while being evaluated
Active: Monitor performs frequent testing and contributes to the server’s performance score

For each NTP server, the system targets 7 “active” monitors and 5 “testing” monitors, with additional candidates maintained as backups. The system continuously optimizes these assignments based on performance, network topology, and various constraints.
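The targets above can be expressed as a small coverage check. This is a minimal sketch, not the system's own code: the constant and function names are made up for illustration, and only the numbers 7 and 5 come from the text.

```python
from collections import Counter

# Targets stated in the text; the names are illustrative only.
TARGET_ACTIVE = 7
TARGET_TESTING = 5


def coverage_gaps(assignment_states):
    """Return how many more active and testing monitors a server needs.

    assignment_states is a list of per-monitor states for one server,
    for example ["active", "testing", "candidate"].
    """
    counts = Counter(assignment_states)
    return {
        "active": max(0, TARGET_ACTIVE - counts["active"]),
        "testing": max(0, TARGET_TESTING - counts["testing"]),
    }


# A server with 5 active, 5 testing, and 3 candidate monitors.
states = ["active"] * 5 + ["testing"] * 5 + ["candidate"] * 3
gaps = coverage_gaps(states)
```

Here `gaps` reports that two more active monitors are needed and no more testing monitors.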

Monitor and Server States

The monitoring system uses two separate state systems that work together to ensure reliable monitoring of the pool.

Monitor Global States

Every monitoring agent has a global state that determines whether it’s actively working:

Active means your monitor is fully operational and running all its assigned tests. This is the normal working state for a healthy monitor that’s been approved and is functioning properly.

Testing indicates your monitor is operational but still being evaluated by the system. New monitors typically start in this state while the system verifies they’re working correctly and determines optimal assignments.

Pending indicates your monitor is newly added and waiting to be activated. This state is rarely used in practice—most monitors are added directly as “testing” to begin evaluation immediately.

Paused means your monitor has been stopped and isn’t running any tests. This can happen due to technical issues, constraint violations, or administrative decisions. A paused monitor won’t contribute any data to the pool.

Deleted indicates the monitor has been removed from the system entirely.

Server-Monitor Relationship States

Separate from the global monitor state, each server-monitor relationship has its own status. These appear on individual server detail pages and show how the monitoring assignment is progressing:

Candidate means a monitor is being considered for assignment to test your server. Candidates also help validate the monitoring setup and serve as backups for testing monitors, ensuring ample coverage is available as conditions and monitoring needs change. The system evaluates factors like network location, existing monitor coverage, and constraint rules before making assignments.

Testing shows a monitor is actively testing your server and collecting performance data. Testing monitors check your server less frequently than active monitors while the system evaluates their suitability.

Active indicates a monitor is confirmed for regular monitoring of your server. Active monitors test your server more frequently, and their results typically contribute to your server’s overall performance score calculation.

State Transitions and System Optimization

The system continuously optimizes monitor assignments through automatic state transitions. Monitors typically progress through the flow: candidate → testing → active. However, the system also regularly reevaluates assignments and may move monitors in the opposite direction or try different monitor-server combinations to maintain optimal coverage.

Candidate to Testing: When the system needs more monitoring coverage for a server, it promotes candidates to testing status to begin collecting performance data.

Testing to Active: Monitors that demonstrate good performance and reliability in testing status get promoted to active for more frequent monitoring.

Backwards Transitions: The system may demote active monitors back to testing or testing monitors back to candidate status as part of ongoing optimization. This isn’t necessarily a sign of poor performance—it often reflects the system balancing coverage across the monitoring network or trying different combinations for better overall results.

This dynamic behavior ensures the pool maintains optimal monitoring coverage while adapting to changing network conditions, monitor availability, and performance patterns.
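The transitions described above can be pictured as a small state machine. The sketch below assumes that moves happen one step at a time (candidate to testing to active, and back), which the description suggests but does not state outright; the names are hypothetical.

```python
# Single-step moves between server-monitor states, as described above.
# Assumption: a monitor never jumps directly between candidate and active.
ALLOWED_TRANSITIONS = {
    "candidate": {"testing"},
    "testing": {"active", "candidate"},
    "active": {"testing"},
}


def can_transition(current, new):
    """Return True if a monitor may move from `current` to `new` in one step."""
    return new in ALLOWED_TRANSITIONS.get(current, set())


promotion_ok = can_transition("candidate", "testing")
demotion_ok = can_transition("active", "testing")
skip_ok = can_transition("candidate", "active")
```

Under this assumption, both the promotion and the demotion are allowed, while skipping the testing stage is not.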

How the Systems Work Together

A monitor must have an “active” or “testing” global state before it can be assigned to test any servers. Even if a monitor-server relationship shows “active,” the monitor won’t actually perform tests if its global state is “paused.”
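The interaction between the two state systems can be sketched as follows. This is an illustration of the rule above, not the system's code; the function name and the returned strings are invented for the example.

```python
# Global states in which a monitor may be assigned to and test servers.
GLOBAL_STATES_THAT_RUN_TESTS = {"active", "testing"}


def effective_status(global_state, relationship_state):
    """Combine a monitor's global state with its state for one server.

    The global state wins: a paused, pending, or deleted monitor runs no
    tests, whatever the server-monitor relationship says.
    """
    if global_state not in GLOBAL_STATES_THAT_RUN_TESTS:
        return f"inactive (monitor is {global_state})"
    return relationship_state


paused_case = effective_status("paused", "active")
normal_case = effective_status("testing", "active")
```

A paused monitor is reported as inactive even though its relationship is “active”, while a monitor whose global state is “testing” can still be “active” for an individual server.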

The system automatically manages these assignments based on network topology, geographic distribution, and various constraints designed to ensure fair and effective monitoring. For example, monitors and servers owned by the same account or located in the same network subnet typically won’t be paired together.
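The two example constraints can be checked with a few lines of Python. This is a sketch under stated assumptions: the record layout (`account_id`, `ip`) is hypothetical, and the subnet sizes (/24 for IPv4, /48 for IPv6) are placeholders, since the text does not say which prefix lengths the system uses.

```python
import ipaddress


def same_subnet(ip_a, ip_b, v4_prefix=24, v6_prefix=48):
    """Return True if both addresses fall in the same subnet.

    The prefix lengths are assumptions made for this example.
    """
    a = ipaddress.ip_address(ip_a)
    b = ipaddress.ip_address(ip_b)
    if a.version != b.version:
        return False
    prefix = v4_prefix if a.version == 4 else v6_prefix
    network = ipaddress.ip_network(f"{a}/{prefix}", strict=False)
    return b in network


def pairing_violations(monitor, server):
    """List the reasons a monitor should not be paired with a server."""
    violations = []
    if monitor["account_id"] is not None and monitor["account_id"] == server["account_id"]:
        violations.append("same account")
    if same_subnet(monitor["ip"], server["ip"]):
        violations.append("same subnet")
    return violations


monitor = {"account_id": 1, "ip": "192.0.2.10"}
near_server = {"account_id": 2, "ip": "192.0.2.200"}
far_server = {"account_id": 2, "ip": "198.51.100.5"}
near_result = pairing_violations(monitor, near_server)
far_result = pairing_violations(monitor, far_server)
```

An empty list means the pairing passes both checks. The text says such pairs “typically” are not assigned, so a real implementation would likely treat violations as input to the selection logic rather than as an absolute ban.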

What This Means for Operators

Server operators will see different monitors cycling through candidate, testing, and active states on their server pages. This is normal system behavior as the monitoring network continuously optimizes coverage. Active monitors provide more frequent testing and contribute more heavily to your server’s performance score, while testing monitors help validate the monitoring setup. The system handles all transitions automatically.

Monitor operators should expect their agents to start in “testing” state and potentially progress to “active” for some servers. Your monitor may serve different roles for different servers—active monitoring for some, testing for others—as the system finds the optimal assignments. State changes reflect ongoing optimization rather than problems with your monitor setup.

Selector

The selector runs continuously and re-evaluates each server every 20 minutes (or every 60 minutes if changes were made in the previous run) to optimize which monitors should be in “candidate”, “testing”, or “active” states.

This is done via the GetMonitorPriority query and selection algorithms in the selector package. The system applies multiple criteria including constraint validation, performance evaluation, global monitor status, and gradual transitions for existing assignments that violate new constraints.

The selector uses change limits to ensure stability, typically making only a few changes per server per evaluation run. The system includes emergency override logic to maintain minimum monitor coverage and bootstrap logic for new servers.
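The idea of a change limit can be illustrated with a short sketch. Everything here is hypothetical: the function name, the default limit of 2, the convention that a lower priority value is better, and the way the emergency flag lifts the limit. The actual rules live in the selector package and are not described in detail here.

```python
def plan_promotions(candidates, needed, change_limit=2, emergency=False):
    """Pick which monitors to promote in one evaluation run.

    candidates: list of (monitor_id, priority) tuples; lower priority
        values are assumed to be better.
    needed: how many more monitors the server needs in the target state.
    change_limit: maximum promotions per run (illustrative value).
    emergency: if True, ignore the limit so coverage can be restored quickly.
    """
    if needed <= 0:
        return []
    allowed = needed if emergency else min(needed, change_limit)
    ranked = sorted(candidates, key=lambda c: c[1])
    return [monitor_id for monitor_id, _priority in ranked[:allowed]]


candidates = [("mon-a", 3), ("mon-b", 1), ("mon-c", 2)]
normal_run = plan_promotions(candidates, needed=3)
emergency_run = plan_promotions(candidates, needed=3, emergency=True)
```

In the normal run only the two best-ranked monitors are promoted and the third waits for a later evaluation; in the emergency run all three are promoted at once.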

Scorer

After each monitoring result, a “step” is calculated and applied to the score for that monitor/server pair (the ‘1-score’).

The new scoring calculation is called recent median. In the UI it’s listed as “overall score”. It takes the median of the ‘1-scores’ from “active” monitors in the last 20 minutes (GetScorerRecentScores). In testing this has proven very effective at avoiding impact from a few errant measurements while still reacting quickly enough to server or network trouble (thanks to Miroslav Lichvar for this idea).

If there aren’t any scores from “active” monitors in the last 20 minutes (for example, if the server is new), the score is calculated from the most recent scores from all monitors (including “testing” and “candidate” monitors) in the last 45 minutes.
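The recent median calculation and its fallback can be sketched as below. This is an approximation of the description, not the GetScorerRecentScores query itself. The row layout is hypothetical, the sketch assumes that only the latest score per monitor is used, and `statistics.median` averages the two middle values for an even count, which may differ from what the real scorer does.

```python
from statistics import median

ACTIVE_WINDOW = 20 * 60    # seconds, from the text
FALLBACK_WINDOW = 45 * 60  # seconds, from the text


def latest_score_per_monitor(rows):
    """Keep only the newest score from each monitor (an assumption)."""
    latest = {}
    for row in rows:
        previous = latest.get(row["monitor_id"])
        if previous is None or row["ts"] > previous["ts"]:
            latest[row["monitor_id"]] = row
    return [row["score"] for row in latest.values()]


def recent_median(rows, now):
    """Return the overall score for one server, or None if there is no data.

    rows: dicts with keys monitor_id, status, score, ts (unix seconds).
    """
    active = [r for r in rows
              if r["status"] == "active" and now - r["ts"] <= ACTIVE_WINDOW]
    if active:
        return median(latest_score_per_monitor(active))
    # No recent scores from active monitors: fall back to all monitors.
    fallback = [r for r in rows if now - r["ts"] <= FALLBACK_WINDOW]
    if fallback:
        return median(latest_score_per_monitor(fallback))
    return None


rows = [
    {"monitor_id": 1, "status": "active", "score": 20.0, "ts": 1000},
    {"monitor_id": 2, "status": "active", "score": 19.5, "ts": 1100},
    {"monitor_id": 3, "status": "active", "score": -5.0, "ts": 1150},
    {"monitor_id": 4, "status": "testing", "score": 10.0, "ts": 1150},
]
overall = recent_median(rows, now=1200)
```

The single errant measurement from monitor 3 does not drag the result down, and the testing monitor is ignored because active scores are available.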

The score is recalculated after every monitoring probe, but it is recorded only every 15 minutes or when it has changed.
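The recording rule amounts to a simple check. The sketch below uses hypothetical names and assumes that “changed” means any difference from the last recorded value; the real system may use a tolerance.

```python
RECORD_INTERVAL = 15 * 60  # seconds, from the text


def should_record(new_score, last_recorded_score, last_recorded_at, now):
    """Decide whether a freshly calculated score should be stored.

    Timestamps are unix seconds. Pass None for last_recorded_score when
    nothing has been recorded yet.
    """
    if last_recorded_score is None:
        return True
    if new_score != last_recorded_score:
        return True
    return now - last_recorded_at >= RECORD_INTERVAL


unchanged_soon = should_record(20.0, 20.0, last_recorded_at=0, now=5 * 60)
changed_soon = should_record(19.2, 20.0, last_recorded_at=0, now=5 * 60)
unchanged_later = should_record(20.0, 20.0, last_recorded_at=0, now=15 * 60)
```

An unchanged score five minutes after the last record is skipped, while a changed score, or an unchanged one after 15 minutes, is recorded.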
