Security monitoring optimization: typical problems and their solutions



Security issues can be a challenge, and properly configured monitoring can reduce both downtime and investigation effort. However, as the network grows, the list of resources that should be watched may grow much faster.

A typical case is a data center: each new host (server) brings several monitors of the same type with it, depending on its role (Web server, mail server and so on). In such a situation, the number of monitors should be kept as small as possible. Below we propose several approaches to optimizing the monitoring of many critical resources.

Properly select dependencies

In the “Main parameters” section of every monitor, you can specify a dependency monitor: while the latter is in a certain state (for example, Down), the monitor depending on it is paused, and no alerts are generated until the dependency monitor leaves the specified state.

Assign dependencies from upstream availability checks toward the services that rely on them. Each logical group of hosts and monitors should depend on as few well-chosen upstream checks as possible, so one confirmed root problem does not generate dozens of secondary alerts.

For example, when monitoring a group of servers connected through a single network switch, make their monitors depend on the monitor checking switch availability: if the switch goes offline, all the connected devices become unavailable. Instead of dozens of alerts from the affected servers’ monitors, a single alert from the network switch is enough to draw attention to the issue.
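The suppression logic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the product's implementation: the `Monitor` class, the `is_paused` and `alerts` functions and the state names are hypothetical, and we assume that a paused state propagates down a chain of dependencies.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Monitor:
    """Hypothetical monitor model used only for this illustration."""
    name: str
    state: str = "OK"                       # "OK", "Warning" or "Down"
    depends_on: Optional["Monitor"] = None  # upstream dependency monitor
    suppress_when: str = "Down"             # dependency state that pauses this monitor


def is_paused(monitor: Monitor) -> bool:
    """Return True if any upstream dependency is in its suppressing state."""
    seen = set()
    current = monitor
    # Walk up the dependency chain; the id() set guards against cycles.
    while current.depends_on is not None and id(current) not in seen:
        seen.add(id(current))
        if current.depends_on.state == current.suppress_when:
            return True
        current = current.depends_on
    return False


def alerts(monitors: List[Monitor]) -> List[str]:
    """Names of monitors that are Down and not paused by a dependency."""
    return [m.name for m in monitors if m.state == "Down" and not is_paused(m)]


switch = Monitor("switch", state="Down")
servers = [Monitor(f"web{i}", state="Down", depends_on=switch) for i in range(3)]
print(alerts([switch] + servers))  # only the switch raises an alert
```

With the switch back in the OK state, the same call would report every server that is still Down, because nothing is paused any longer.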

As a current best practice, it is also worth combining key upstream checks with spike filters, so that a short-lived network glitch neither suppresses nor triggers large groups of alerts unnecessarily.
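One common way to filter spikes is to change state only after several consecutive failed checks. The sketch below assumes that approach; the `SpikeFilter` class and its `threshold` parameter are hypothetical names, and a real monitoring tool may implement spike filtering differently.

```python
class SpikeFilter:
    """Report Down only after `threshold` consecutive failed checks."""

    def __init__(self, threshold: int = 3):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.failures = 0   # current run of consecutive failures
        self.state = "OK"

    def update(self, check_ok: bool) -> str:
        """Feed one check result and return the filtered state."""
        if check_ok:
            # Any successful check resets the run and clears the Down state.
            self.failures = 0
            self.state = "OK"
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.state = "Down"
        return self.state


spike_filter = SpikeFilter(threshold=3)
for result in [False, True, False, False, False]:
    print(spike_filter.update(result))
```

A single failed poll followed by a success never reaches the threshold, so monitors depending on this check are neither paused nor alerted by a momentary glitch.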

Similarly, for all processes running on a server, the default PING dependency is enough: if the host stops responding, a single PING alert replaces a spurious alert from every process monitor.

Attend to possible primary failures first

A single malfunction can trigger multiple alarms that are not directly related to the true problem.

For example, if a site’s SSL certificate has expired, connections to the site may fail, and so may all the monitors using HTTP(S) on that host. However, if the SSL certificate monitor is set as the dependency for all HTTP(S)/Web Transaction Monitors of that site, these “secondary alerts” will not be generated.

For such critical service checks, it is useful to distinguish between early warning and actual failure. For example, an SSL certificate monitor can switch to Warning when renewal time is approaching and to Down when the remaining time becomes critically short. That way, the monitoring system highlights the primary issue before it causes a wider service problem. Optional HTTP response validation can then be used on the site itself to confirm that the application still returns the expected content.
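The Warning/Down distinction for certificates can be illustrated with a short Python sketch. The thresholds of 30 and 7 days and the function names are our assumptions for the example, not product defaults; the network part uses only the standard `ssl` and `socket` modules.

```python
import socket
import ssl
from datetime import datetime, timezone


def classify_certificate(days_left: float, warn_days: int = 30, down_days: int = 7) -> str:
    """Map remaining certificate lifetime to a monitor state (assumed thresholds)."""
    if days_left <= down_days:
        return "Down"
    if days_left <= warn_days:
        return "Warning"
    return "OK"


def days_until_expiry(host: str, port: int = 443, timeout: float = 5.0) -> float:
    """Fetch the server certificate and return the days left until it expires.

    An already expired certificate fails verification and raises ssl.SSLError,
    which a caller would normally treat as Down.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    now = datetime.now(timezone.utc).timestamp()
    return (expires - now) / 86400


# Classification alone needs no network access:
print(classify_certificate(45), classify_certificate(20), classify_certificate(3))
```

Keeping the classification separate from the network call makes the thresholds easy to adjust and to test without contacting a real server.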

The same approach works wherever a resource can become scarce and cause multiple cascading failures (disk space, CPU usage, RAM usage and so on).

Channel and aggregate monitoring data

The Syslog monitor is a versatile tool for detecting many kinds of events on Unix-like systems.

However, as the number of servers grows, the amount of possible Syslog alerts can become excessive. In that case, it is often better to aggregate those events before they reach the main monitoring logic. A single relay host can still be used for this, but current versions also allow Linux Remote Network Agents to receive Syslog messages and SNMP traps inside remote networks and forward the monitoring workflow back to the main installation. This is often a cleaner architecture for centralized security monitoring than exposing every sender directly.
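As a rough illustration of aggregating events before they reach the main monitoring logic, a relay could forward only the more severe messages. The sketch below relies on the standard Syslog priority prefix, where the `<PRI>` value equals facility × 8 + severity and lower severity numbers are more urgent. The function names and the default cutoff are hypothetical, and this is not a description of how the Remote Network Agent works internally.

```python
import re
from typing import Optional, Tuple

# Syslog messages start with a priority value in angle brackets, e.g. "<34>".
PRI_RE = re.compile(r"^<(\d{1,3})>")


def parse_priority(message: str) -> Optional[Tuple[int, int]]:
    """Return (facility, severity) from the <PRI> prefix, or None if absent."""
    match = PRI_RE.match(message)
    if match is None:
        return None
    pri = int(match.group(1))
    if pri > 191:  # highest valid value: facility 23, severity 7
        return None
    return divmod(pri, 8)


def should_forward(message: str, max_severity: int = 3) -> bool:
    """Forward only messages at or above the chosen urgency (0 = Emergency, 3 = Error)."""
    parsed = parse_priority(message)
    if parsed is None:
        return False
    _, severity = parsed
    return severity <= max_severity


print(should_forward("<34>Oct 11 22:14:15 mymachine su: 'su root' failed"))  # severity 2
print(should_forward("<14>Oct 11 22:14:15 mymachine cron: job finished"))    # severity 6
```

Filtering on severity at the relay keeps informational noise inside the remote network, while authentication failures and errors still reach the central installation.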

In the case of script-based monitors, including scripts or programs run over SSH, it is also possible to collect several related signals at once and return one overall state. Alongside separate monitors, such an aggregate check can provide one high-value alert when multiple related issues appear together, which helps reduce alert fatigue without removing visibility.
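An aggregate check of this kind might look like the following Python sketch, which runs several checks and returns the worst state among them. The state names, the `aggregate` function and the example disk check with its thresholds are assumptions made for illustration; how a script reports its result to the monitoring tool depends on the tool and is not shown here.

```python
import shutil
from typing import Callable, Dict, Tuple

# Assumed ordering of states from least to most severe.
SEVERITY = {"OK": 0, "Warning": 1, "Down": 2}


def aggregate(checks: Dict[str, Callable[[], str]]) -> Tuple[str, Dict[str, str]]:
    """Run all checks and return (worst state, per-check states)."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            state = check()
        except Exception:
            state = "Down"   # a check that cannot run counts as a failure
        if state not in SEVERITY:
            state = "Down"   # unknown states are treated as failures too
        results[name] = state
    overall = max(results.values(), key=SEVERITY.__getitem__, default="OK")
    return overall, results


def disk_check(path: str = "/", warn: float = 0.80, down: float = 0.95) -> str:
    """Example signal: fraction of disk space used on `path` (assumed thresholds)."""
    usage = shutil.disk_usage(path)
    used = usage.used / usage.total
    if used >= down:
        return "Down"
    return "Warning" if used >= warn else "OK"


overall, details = aggregate({
    "disk": disk_check,
    "service": lambda: "OK",        # placeholder for a real service probe
    "backlog": lambda: "Warning",   # placeholder for a real queue-length probe
})
print(overall, details)
```

Returning the per-check details alongside the overall state keeps the single alert informative: the recipient sees at once which of the related signals caused it.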

Use visual representation to notice possible problems

In real-life use cases, the Web interface can still be valuable as a visual security dashboard, especially when it highlights only the most important monitors or host groups.

When combined with proper access control, such dashboards are useful on many kinds of devices, including mobile browsers. Current alerting options also allow sound playback in the Web Interface browser, which can make a visual dashboard more noticeable in situations where e-mail is not the fastest signal.

Conclusion

If you need assistance with any of the monitoring setups mentioned above, just contact us. We can help you set up aggregate monitors, or provide samples of what a custom Web interface dashboard can look like.

Do you know any other useful tricks for saving time and resources when monitoring crucial resources? Please let us know by contacting us via the above link.