Introduction
Let’s explore some of the advanced monitoring capabilities of the 5×9 Lightweight Monitoring System (LMS).
Our previous blog post compared the Raspberry Pi 4 and Raspberry Pi 5 Single Board Computers (SBCs), focusing solely on WiFi performance. If you missed that post or want to refresh your memory, we recommend reading it first.
This post builds on the wired and WiFi network testing environment from the previous entry and shifts the focus to comprehensive monitoring of remote sites that offer both wired and WiFi access, such as offices or hotels.
A key question is what should be monitored to detect problems and pinpoint their root cause. The 5×9 approach, built on the LMS, organizes monitoring into three layers: network, services, and applications. The network layer focuses purely on IP transport performance, measured with TWAMP, network bandwidth tests, and ICMP. Services such as DHCP, Radius, and DNS are essential for establishing network connections and ensuring application functionality. Applications such as HTTP, Speedtest, and YouTube rely on both the network and service layers to operate effectively. Whenever possible, we also collect historical non-performance data to aid future troubleshooting and correlation analysis; for this, we use the WiFi RadioScan module to collect historical snapshots of WiFi radio conditions.
Test setup
Test network, Probe and Probe Management
The test is conducted in a real home network environment based on MikroTik equipment. A Raspberry Pi 5 SBC is used as the Probe for both wired and wireless measurements. Full details of the test network and the RPi5 hardware are described in the previous blog post.
- RPi5 Probe setup
  - WiFi interface
    - WiFi client role
  - 1GE interface
    - Wired client role
    - Network Bandwidth server
    - TWAMP reflector
- Probe management
  - WiFi interface
The 5×9 LMS is used for test creation, schedule definition, test execution, measurement data collection, alerting, and performance metric visualization.
LMS modules used
In total, 12 LMS modules are used for site monitoring across the Probe's Ethernet and WiFi interfaces. LMS modules fall into two functional types: connectivity framework modules and measurement modules.
Connectivity modules
The connectivity framework modules are responsible for establishing the connections required to execute measurements. Multiple types are supported; broadly, they can be divided into Wired (Ethernet, VPN, PPPoE), Wireless (WiFi), and Mobile (2G/3G/4G/5G, NB-IoT) frameworks.
Two connectivity framework modules are used:
- Ethernet – connects to a wired Ethernet network (VLAN, IP address, …) and executes measurement modules
- WiFi – connects to a WiFi network (SSID, encryption, Captive Portal pass-through) and executes measurement modules
It is important to note that the connectivity framework modules operate as fully independent entities. This independence enables parallel test execution across different framework modules, as well as measurements that, as in this case, run from the WiFi interface to the wired Probe interface (TWAMP and Network Bandwidth).
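To illustrate what independent connectivity frameworks buy you, the following is a minimal Python sketch under a purely hypothetical model: each framework is just a list of measurement callables, and none of the names reflect LMS internals. Each framework runs in its own thread, and an error inside one framework is recorded rather than propagated, so the other framework keeps measuring.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical model: a measurement is a zero-argument callable returning a result dict.
Measurement = Callable[[], dict]


def run_framework(name: str, measurements: List[Measurement]) -> List[dict]:
    """Run one framework's measurements in order; failures are recorded, not raised."""
    results = []
    for measure in measurements:
        try:
            results.append({"framework": name, **measure()})
        except Exception as exc:  # one failing test must not stop the others
            results.append({"framework": name, "error": str(exc)})
    return results


def run_all(frameworks: Dict[str, List[Measurement]]) -> Dict[str, List[dict]]:
    """Run every framework in its own thread so they execute in parallel and independently."""
    with ThreadPoolExecutor(max_workers=max(1, len(frameworks))) as pool:
        futures = {name: pool.submit(run_framework, name, ms) for name, ms in frameworks.items()}
        return {name: fut.result() for name, fut in futures.items()}


def failing_dhcp() -> dict:
    raise RuntimeError("no DHCP lease")


results = run_all({
    "ethernet": [lambda: {"module": "dns", "ms": 12.0}],
    "wifi": [failing_dhcp, lambda: {"module": "twamp", "rtt_ms": 4.1}],
})
```

The WiFi DHCP failure shows up as an error entry, while both the remaining WiFi measurement and the Ethernet measurement still produce results.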
Measurement Modules
Measurement modules are dedicated pieces of software, each designed to perform a single type of measurement, such as DNS, TWAMP, HTTP, or VoLTE.
Ten measurement modules are used to provide a comprehensive overview of site performance:
Network
- TWAMP – Two-Way Active Measurement Protocol; reports round-trip delay, jitter, and loss; client/reflector based
  - comes as two separate modules for maximum control (TWAMP Client and Reflector)
- Network Bandwidth – reports TCP and UDP throughput; client/server based
  - comes as two separate modules for maximum control (Network BW Client and Server)
- ICMP latency – reports two-way latency towards the configured host; ICMP based
Services
- DHCP – DHCP server check; reports server availability and response time
- Radius – RADIUS server check; reports server availability, authentication success, and response time
- DNS – A/AAAA DNS resolver check; reports DNS server availability and response time
Applications
- Speedtest – reports upload and download bandwidth from the client to publicly available Internet infrastructure or a TR-471 server
- HTTP – HTTP/HTTPS check; reports server availability and response time
- YouTube – reports YouTube server availability, response time, and user-perceived service quality
Troubleshooting
- WiFi radio scan – creates a snapshot of the remote Probe's 2.4 and 5 GHz WiFi radio conditions
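The TWAMP metrics listed above can be derived from the four timestamps defined in RFC 5357: T1 (sender transmit), T2 (reflector receive), T3 (reflector transmit), and T4 (sender receive). The sketch below assumes a hypothetical `TwampSample` record per test packet. Round-trip delay subtracts the reflector's processing time, loss is the share of packets without a reply, and jitter is taken here as the mean absolute difference between consecutive round-trip delays; that is only one of several common definitions and not necessarily the one the LMS TWAMP module reports.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class TwampSample:
    """One test packet (hypothetical record). Times are in seconds."""
    seq: int
    t1: float                   # sender transmit timestamp
    t2: Optional[float] = None  # reflector receive timestamp
    t3: Optional[float] = None  # reflector transmit timestamp
    t4: Optional[float] = None  # sender receive timestamp; None means the packet was lost


def round_trip_ms(s: TwampSample) -> float:
    """Total elapsed time at the sender minus the time spent inside the reflector."""
    if s.t2 is None or s.t3 is None or s.t4 is None:
        raise ValueError(f"sample {s.seq} has no complete reply")
    return ((s.t4 - s.t1) - (s.t3 - s.t2)) * 1000.0


def summarize(samples: List[TwampSample]) -> dict:
    received = sorted((s for s in samples if s.t4 is not None), key=lambda s: s.seq)
    rtts = [round_trip_ms(s) for s in received]
    # Jitter: mean absolute difference between consecutive round-trip delays.
    diffs = [abs(b - a) for a, b in zip(rtts, rtts[1:])]
    lost = len(samples) - len(received)
    return {
        "rtt_avg_ms": mean(rtts) if rtts else None,
        "jitter_ms": mean(diffs) if diffs else 0.0,
        "loss_pct": 100.0 * lost / len(samples) if samples else 0.0,
    }


samples = [
    TwampSample(1, 0.000, 0.010, 0.011, 0.020),  # about 19 ms round trip
    TwampSample(2, 1.000, 1.012, 1.013, 1.025),  # about 24 ms round trip
    TwampSample(3, 2.000),                       # no reply received
]
stats = summarize(samples)
```

Separating reflector processing time from the total is what lets the reflector on the wired interface isolate the WiFi path delay.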
Monitoring setup
For wired access monitoring over the 1GE connection, the following modules are employed: DHCP, DNS, ICMP Latency, and Speedtest. The first two, DHCP and DNS, verify the services required for client connectivity. The latter two, ICMP Latency and Speedtest, assess the latency and speed of the fixed Internet connection.
- DHCP: Monitors DHCP server performance from a fixed network client perspective
- DNS: Monitors DNS server performance from a fixed network client perspective
- Radius: Monitors Radius server performance, primarily used for WiFi WPA2 Enterprise authentication
- ICMP: Monitors Internet connection latency
- Speedtest: Monitors Internet connection speed
WiFi connectivity is secured with WPA2 Enterprise, so in addition to validating DNS and DHCP services, a Radius health check is introduced (the RADIUS server is responsible for client authentication). The TWAMP and Network Bandwidth modules, running between the Probe's WiFi and wired interfaces, validate the stability of the WiFi service by monitoring delay, packet loss, jitter, and bandwidth. The ICMP, HTTP, and YouTube modules target Internet destinations. The ICMP module validates WiFi client latency towards the public DNS servers used in the test (Google, Quad9, and Cloudflare). The HTTP module accesses public websites to verify NAT operation, while the YouTube module streams video and provides user-perspective statistics that reflect WiFi stability, such as the number of playback interruptions, their duration, and average video quality.
- DHCP: Monitors DHCP server performance from the WiFi client perspective; performed by default by the WiFi connectivity framework
- TWAMP: Measures delay, jitter, and packet loss over the WiFi interface; the TWAMP reflector runs on the Probe's wired interface, so results directly reflect WiFi stability and quality
- DNS: Monitors DNS server performance from the WiFi client perspective
- Network Bandwidth: Assesses WiFi bandwidth capacity and utilization; the bandwidth server runs on the Probe's wired interface, so results directly reflect WiFi stability and quality
- HTTP: Measures HTTP server performance and response time
- YouTube: Evaluates YouTube streaming performance and video quality, directly related to WiFi stability and quality
- WiFi radio scan: Creates a snapshot of the remote Probe's 2.4 and 5 GHz WiFi radio conditions
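The YouTube statistics mentioned above (interruptions, their duration, and average quality) can be illustrated with a small Python sketch. It assumes a hypothetical playback log made of `PlaybackSegment` entries labeled "playing" or "stalled"; this is not the LMS YouTube module's data format, and average quality is approximated here as the time-weighted vertical resolution.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlaybackSegment:
    """Hypothetical playback log entry."""
    start_s: float
    end_s: float
    state: str          # "playing" or "stalled" (hypothetical labels)
    height_px: int = 0  # vertical resolution while playing, e.g. 1080


def playback_stats(segments: List[PlaybackSegment]) -> dict:
    stalls = [s for s in segments if s.state == "stalled"]
    playing = [s for s in segments if s.state == "playing"]
    play_time = sum(s.end_s - s.start_s for s in playing)
    # Weight each resolution by how long it was actually shown.
    weighted = sum((s.end_s - s.start_s) * s.height_px for s in playing)
    return {
        "interruptions": len(stalls),
        "stall_time_s": sum(s.end_s - s.start_s for s in stalls),
        "avg_height_px": weighted / play_time if play_time else 0.0,
    }


yt_stats = playback_stats([
    PlaybackSegment(0.0, 10.0, "playing", 1080),
    PlaybackSegment(10.0, 12.0, "stalled"),
    PlaybackSegment(12.0, 20.0, "playing", 720),
])
```

A drop in resolution right after a stall is the kind of pattern that points back towards unstable WiFi rather than a server-side problem.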
Alarming setup
Of course, you do not need to watch measurement results constantly and search for problems; that is the responsibility of the 5×9 LMS alarming module. In this scenario, email notifications are used, and, for demonstration purposes, alarm thresholds are defined for the DHCP, DNS, Speedtest, HTTP, and YouTube modules. Should any performance degradation occur, the LMS notifies you through the designated communication channel.
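Threshold-based alarming boils down to comparing each reported metric with a limit whose direction depends on the metric: response times alarm when they rise, throughput alarms when it falls. The Python sketch below is a hypothetical illustration; the metric names and limits are made up for this example and are not LMS defaults or its configuration syntax, and the email delivery step is left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    higher_is_worse: bool  # True for response times, False for throughput


# Illustrative values only; these are not LMS defaults.
THRESHOLDS = (
    Threshold("dhcp_response_ms", 500.0, higher_is_worse=True),
    Threshold("dns_response_ms", 100.0, higher_is_worse=True),
    Threshold("speedtest_download_mbps", 200.0, higher_is_worse=False),
)


def evaluate(measurement: Dict[str, float],
             thresholds: Sequence[Threshold] = THRESHOLDS) -> List[str]:
    """Return one alarm message per breached threshold in a single measurement run."""
    alarms = []
    for t in thresholds:
        value = measurement.get(t.metric)
        if value is None:
            continue  # this run did not report the metric
        breached = value > t.limit if t.higher_is_worse else value < t.limit
        if breached:
            alarms.append(f"{t.metric}={value} breaches limit {t.limit}")
    return alarms


alarms = evaluate({"dns_response_ms": 180.0, "speedtest_download_mbps": 350.0})
```

A real deployment might also require several consecutive breaches before raising an alarm, so that a single noisy sample does not trigger a notification.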
Why all this? To detect and pinpoint the problem
At this point, the LMS Probe measurements and Node Manager alarming are fully set up, and all visualizations are available through automatically created Grafana dashboards.
When you arrive at work, you can check the home dashboard, which gives an overview of your monitoring setup, including a geo map, probe infrastructure status, and a list of active alarms. If everything is green with no alarms, you can go ahead and grab some coffee.
If there are active alarms, you can easily navigate through the performance panels to identify the issue and pinpoint its root cause.
In this particular case, there are two active alarms: one indicating a drop in Internet connection speed and another reflecting increased DNS response times over the 5 GHz WiFi network. The suspects are therefore the Internet connection link and possible site WiFi issues.
Let’s start by examining the WiFi-related performance panels: connection establishment, RADIUS performance, and bandwidth.
Since these panels show no signs of any problem or performance degradation, we can immediately rule out WiFi issues. Therefore, the active DNS alarm was not caused by a WiFi-related issue.
Let’s now examine what was happening with the fixed Internet connection speed. We observed a temporary drop in speed, most likely originating on the service provider side. With WiFi ruled out, the Internet connection is the sole remaining suspect for the degradation behind both alarms.
As the Internet connection speed has recovered, this appears to have been an isolated event, so no further intervention is necessary. However, we will keep watching the dashboards and email inbox for any future degradation or speed-related alarms.
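The reasoning in this walkthrough is elimination: keep the components that lie on the path of every alarmed measurement, then drop those that other panels show to be healthy. The Python sketch below formalizes that idea under two stated assumptions: a single common fault explains all active alarms, and the component names and paths are hypothetical. In particular, treating "lan" as healthy assumes the WiFi-to-wired TWAMP and Network Bandwidth tests exercise the local network. The post itself reached its conclusion by reading dashboards, not with this code.

```python
from functools import reduce
from typing import Dict, Iterable, Set


def suspects(alarm_paths: Dict[str, Set[str]], healthy: Iterable[str]) -> Set[str]:
    """Return components shared by every alarmed path, minus those shown healthy.

    Assumes a single common fault explains all active alarms.
    """
    if not alarm_paths:
        return set()
    # Components that lie on the path of every alarmed measurement.
    common = reduce(lambda a, b: a & b, alarm_paths.values())
    # Drop anything that other measurements have shown to be working.
    return common - set(healthy)


# Hypothetical component names for the scenario in this post.
alarm_paths = {
    "Speedtest (wired)": {"lan", "internet_link"},
    "DNS (5 GHz WiFi)": {"wifi_5ghz", "lan", "internet_link"},
}
healthy = {"wifi_5ghz", "radius", "lan"}
remaining = suspects(alarm_paths, healthy)
```

With the WiFi and local-network components cleared, only the Internet link remains, matching the conclusion above.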
And that is it – troubleshooting completed and the case closed within three minutes 🙂
To keep this blog post concise, we did not repeat the dashboards for modules covered in previous posts. If you are interested in the dashboards for the WiFi RadioScan, Speedtest, and Network Bandwidth modules, you can find them in our previous blog posts.
Last Thoughts
LMS is a versatile tool that provides comprehensive performance insight across all monitored layers, with alarming aimed at rapidly detecting problems in scenarios such as AAA failures, WiFi connection and throughput issues, DHCP- and DNS-related problems, Internet connection disruptions, and more.
Cleverly deployed Probes with the right configuration can pinpoint the exact cause of problems in a single site (as in this case) or in distributed and complex network scenarios (such as ISP networks or data centers). Stay tuned for more insights on distributed and complex network scenarios in upcoming blog posts.
Equally important, LMS establishes a baseline for all performance measurements. This baseline, which shows what historically normal performance looks like, is crucial for a deep understanding of your network, services, and applications, and is a prerequisite for fast and efficient troubleshooting.
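One simple way to turn history into a baseline is to take the median of past samples as "normal" and the median absolute deviation (MAD) as the typical spread, then flag values that sit too many spreads away. The Python sketch below uses that robust technique as an assumption; it is not a description of how LMS computes its baselines, and the DNS history values are invented for illustration.

```python
from statistics import median
from typing import List, Tuple


def baseline(history: List[float]) -> Tuple[float, float]:
    """Return (median, median absolute deviation) of historical samples."""
    center = median(history)
    mad = median(abs(x - center) for x in history)
    return center, mad


def is_anomalous(value: float, history: List[float], k: float = 3.0) -> bool:
    """Flag a value that deviates from the historical median by more than k scaled MADs."""
    center, mad = baseline(history)
    scale = 1.4826 * mad  # scales MAD to approximate a standard deviation for normal data
    if scale == 0:
        return value != center  # perfectly flat history: any change is a deviation
    return abs(value - center) > k * scale


# Hypothetical DNS response times (ms) from a quiet week.
dns_history = [20, 22, 19, 21, 20, 23, 18, 21]
```

The median and MAD are used instead of mean and standard deviation so that a few past outliers do not inflate what counts as normal.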
Author: Mario Jurcevic, 5×9 Networks