Ping daemon that pings a number of targets at once, collecting their response times in histograms. Meant to be deployed at strategic points in your network in order to detect weak links.
- Meshping instances can be peered with one another and will then ping the same targets.
- Scrapeable by Prometheus.
- Targets can be added and removed on-the-fly, without restarting or reloading anything.
- CLI tool to interact with the daemon.
- IPv6 supported.
- Docker images: https://hub.docker.com/r/svedrin/meshping
Meshping can render heatmaps from the pings measured over the last few (by default, three) days. They look like this:
Here you can nicely see that, while most of the pings are between 11 and 16ms, a significant number take around 22ms. This indicates that under some conditions, packets may take a different route to the recipient, or the recipient may just take longer to send a reply.
Here's the heatmap for a WiFi point-to-point link that spans a few kilometers:
Most pings are fine, but there does appear to be a fair bit of disturbance - maybe there's a tree in the way.
The time span covered by these can be configured by setting the MESHPING_HISTOGRAM_DAYS
environment variable to a value other than 3
.
If you have set up multiple Meshping instances, you can have them exchange targets via peering. To do this, set the
MESHPING_PEERS
env var in each instance to point to each other. That way, they will exchange target lists regularly,
and you will be able to retrieve statistics from both sides to see how your links are doing.
When doing mathematical analyses on measurements, monitoring tools usually apply calculations based on averages and standard deviations. With latency data however, these methods yield unsatisfactory results.
Let's take another look at this heatmap:
You'll see that most of the pings are between 11 and 16ms, a significant number take around 22ms. The average as calculated by meshping is 16ms, and the standard deviation is probably somewhere around 2ms.
Now suppose you're trying to formulate an alerting rule based on those numbers. Say you'd want to be alerted whenever ping results differ from the average for more than two standard deviations. This means that data points smaller than 12ms or greater than 20ms would potentially trigger an alert. Since this would probably be a bit noisy, let's assume you'll only alert after a bunch of those arrive over a given time span.
But as you can see from the graph, there's a significant number of pings that just take 22ms, for whatever reason. Since this is a WAN link that we don't have control over, we won't be able to fix it - we just have to take it for what it is. Now how do you express that in terms of averages and standard deviations? The answer is: you can't, because the data does not follow a Normal distribution. Instead, this signal consists of two separate signals (because it seems that these packets can take two different routes, resulting in a small difference in latency), each of which can be described using those terms: one lives at 13±3ms, the other one at 22±2ms. To conduct a meaningful analysis of this data, you can't just approach it as if it consisted of one single signal.
I'd like to start looking for a solution to this. At the moment I'm focusing on getting the graphs to a point that they visualize this, and I'm pretty satisfied with the heatmaps as they are currently. Next, I'll probably look into modality detection and finding ways to extract patterns out of the data that I can then use to draw conclusions from.
Meshping provides a /metrics
endpoint that is meant to be scraped by Prometheus. You can run queries on the data for things such as:
- loss rate in %:
rate(meshping_lost{target="$target"}[2m]) / rate(meshping_sent[2m]) * 100
- quantiles:
histogram_quantile(0.95, rate(meshping_pings_bucket{target="$target"}[2m]))
- averages:
rate(meshping_pings_sum{target="10.5.1.2"}[2m]) / rate(meshping_pings_count[2m])
Grafana added Heatmap support in v5.1, so we now can produce graphs similar to the images above, that contain a histogram over time that shows the pings.
To get these graphs, add a Heatmap panel to Grafana, and configure it with:
- Query:
increase(meshping_pings_bucket{target=\"$target\"}[1h])
- Legend format:
{{ le }}
- Unit:
ms
- Min step:
1h
For a one-histogram-per-day panel, use these settings:
- Query:
increase(meshping_pings_bucket{target=\"$target\"}[1d])
- Min step:
1d
Meshping is meant to look at pings over a long time range (e.g. two days or a week), so be sure not to make those time frames too short. Otherwise you'll lose the heatmap effect because every data point will be its own histogram.
In the examples directory, there's also a json dashboard definition that you can import.
Deploying meshping is easiest using docker-compose
, with the
docker-compose.yaml file from this repo.
This will deploy meshping, along with a Watchtower instance that keeps
Meshping up-to-date. It can be deployed as-is by adding a Stack through Portainer, or using docker-compose
:
mkdir meshping
cd meshping
wget https://github.com/Svedrin/meshping/raw/master/examples/docker-compose.yml
docker-compose up --detach
Meshping should now be reachable at http://<your-ip>:9922
.
Meshping is configured through environment variables. These exist:
MESHPING_TIMEOUT
: Ping timeout (default: 5s).MESHPING_PEERS
: Comma-separated list of other Meshping instances to peer with (onlyip:port
, no URLs).MESHPING_HISTOGRAM_DAYS
: How many days of data to keep in the histograms (default: 3).
Building locally for development is easiest by running the ./run-dev.sh
script. It will build the container and start up Meshping.
- If you're running meshping behind nginx, be sure to set
proxy_http_version 1.1;
or it'll be unbearably slow.