WIP: First iteration of a prometheus exporter for ara #483
base: master
Conversation
Heavily a work in progress and a learning experience that we will iterate on a number of times. The intent is to build a prometheus exporter that gathers metrics from an ara instance and exposes them so that prometheus can scrape them.
- Added support for querying results through pagination
- Added support for paginating through pages of results
- Query everything at boot via result limit (e.g., ?limit=1000) and pagination
- Store the latest object timestamp so that the next scrape only picks up objects created after it, using ?created_after=<timestamp> (see the sketch after this list)
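A rough sketch of that backfill-then-poll pattern, assuming the ara API's paginated response shape (count/next/previous/results) and using requests; this is illustrative rather than the branch's actual code:

```python
import requests

API = "http://127.0.0.1:8000/api/v1"  # assumed local ara API server

def fetch(endpoint, limit=1000, created_after=None):
    """Page through an ara API endpoint, optionally only objects newer than a timestamp."""
    params = {"limit": limit}
    if created_after is not None:
        params["created_after"] = created_after
    url = f"{API}/{endpoint}"
    while url:
        page = requests.get(url, params=params).json()
        yield from page["results"]
        url = page["next"]  # ready-made URL for the next page, or None
        params = None       # the "next" URL already carries the query string

# Backfill everything at boot, remembering the newest timestamp seen...
latest = None
for playbook in fetch("playbooks"):
    latest = max(latest or playbook["created"], playbook["created"])
# ...so that the next poll only picks up new objects:
# fetch("playbooks", created_after=latest)
```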
- Move it under our existing ara CLI so it can re-use all the boilerplate for instantiating an API client with all the settings
- Add args for limits, poll frequency and the port for the exporter to listen on (a sketch follows this list)
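Not the branch's code, but a minimal sketch of the overall shape this takes with prometheus_client (the flag names are assumptions):

```python
import argparse
import time

from prometheus_client import start_http_server

parser = argparse.ArgumentParser()
parser.add_argument("--prometheus-port", type=int, default=8001)
parser.add_argument("--poll-frequency", type=int, default=60, help="seconds between API polls")
parser.add_argument("--limit", type=int, default=1000, help="objects per API query")
args = parser.parse_args()

# Serve /metrics from a background thread, then poll the ara API forever.
start_http_server(args.prometheus_port)
while True:
    # ...query the ara API here and update the metric objects...
    time.sleep(args.poll_frequency)
```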
Force-pushed from 405c187 to 86dfdf8.

Build failed. ✔️ ara-tox-py3 SUCCESS in 4m 09s
- Added --max-days to limit backfill at boot (see the snippet after this list)
- Added a bit of verbosity
- Adjust hosts to be scanned before tasks (there are way, way more tasks than hosts in terms of volume)
- First try at a playbook histogram containing the timestamp and duration
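The --max-days backfill limit boils down to computing a created_after cutoff at boot; a minimal sketch:

```python
from datetime import datetime, timedelta, timezone

# Only backfill objects newer than --max-days at boot (default: 90).
max_days = 90
created_after = (datetime.now(timezone.utc) - timedelta(days=max_days)).isoformat()
# then query e.g. /api/v1/playbooks?created_after=<that timestamp>
```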
I've added a bit more context in the issue (#177 (comment)) and got two quick iterations in:
Edit: I've put up an example /metrics response for a single playbook's metric as a histogram in the gist: https://gist.github.com/dmsimard/68c149eea34dbff325c9e4e9c39980a0#file-playbooks_as_histogram-txt It wants to group metrics based on their label uniqueness; I suppose in our case we want each playbook to be represented individually, so we should include their id? More on that later.
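For reference, prometheus_client keeps one time series per unique combination of label values, so making each playbook its own series means adding a distinguishing label. A sketch (metric and label names are assumptions):

```python
from prometheus_client import Histogram

playbooks = Histogram(
    "ara_playbooks",
    "Duration of playbooks recorded by ara",
    ["path", "status", "playbook"],  # including the playbook id keeps each run distinct
)
playbooks.labels(path="site.yml", status="completed", playbook="30").observe(42.3)
```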
Build failed. ✔️ ara-tox-py3 SUCCESS in 3m 24s
Still heavily a work in progress but getting a better understanding of how things work. Hosts and tasks now have gauges by status. Playbook metrics are disabled temporarily until we revisit them with newfound knowledge.
I think my brain is starting to understand what is happening. I've temporarily commented out the current iteration of the playbook metrics until I revisit it with newfound knowledge. This latest iteration re-works the host and task metrics to have gauges per status such that we are able to do graphs like this, for example:

[Image: Prometheus task results in grafana]
[Image: Prometheus host results in grafana]

A snippet of what this looks like when querying the prometheus exporter is shown below.
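As a hedged illustration (metric names and values here are assumptions, not necessarily the branch's actual output), gauges per status with prometheus_client look like this, with the kind of scrape output they produce in the comments:

```python
from prometheus_client import Gauge

tasks = Gauge("ara_tasks", "Tasks recorded by ara", ["status"])
hosts = Gauge("ara_hosts", "Hosts recorded by ara", ["status"])

# Refreshed from the API counts on every poll:
tasks.labels(status="completed").set(1234)
tasks.labels(status="failed").set(7)
hosts.labels(status="ok").set(56)

# Scraping /metrics then yields lines such as:
#   ara_tasks{status="completed"} 1234.0
#   ara_tasks{status="failed"} 7.0
#   ara_hosts{status="ok"} 56.0
```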
Build failed. ✔️ ara-tox-py3 SUCCESS in 9m 57s
- Add a summary metric for tracking the duration of tasks. This is what was intended when trying to do the playbook histogram, so we'll come back to that later.
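With prometheus_client, a Summary tracks a running count and sum of observations; a sketch (metric name and labels are assumptions):

```python
from prometheus_client import Summary

tasks_duration = Summary(
    "ara_tasks_duration", "Duration of tasks recorded by ara", ["action"]
)
# Each observation feeds the _count and _sum series the Summary exposes,
# which is enough to graph rates and averages of task durations.
tasks_duration.labels(action="command").observe(3.14)
```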
Build failed. ✔️ ara-tox-py3 SUCCESS in 4m 14s

Build succeeded. ✔️ ara-tox-py3 SUCCESS in 4m 15s
- Substantial cleanup and cut down on code duplication
- Fix linting and style
- Metric labels moved to default constants, leaving the door open for the possibility of customizing them
- Retrofit what we learned back to the playbook metrics
- Re-enable playbook metrics
Force-pushed from feadacf to 7558a6f.

Build failed. ✔️ ara-tox-py3 SUCCESS in 3m 12s
- More cleanup
- Removed the Gauges for each status of playbooks and tasks; they were not useful once I understood how to use Summaries, and in hindsight they generated a lot of needless metrics
- Added a package extra for [prometheus] (see the install example after this list)
- First iteration of docs
- Add first iteration of a grafana dashboard
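With that extra in place, installing the exporter's dependency would presumably look like:

```shell
pip install "ara[prometheus]"
```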
Force-pushed from b82da8c to 6283872.

Build failed. ✔️ ara-tox-py3 SUCCESS in 3m 15s
I feel this is ready for a first look by a wider audience so I've asked around for testing and feedback:

The final implementation may change before landing (for example if I screwed up the metric types) but this will be useful to make sure we made the right decisions and do the necessary changes before merging. I am narrowing the scope of this first PR to playbooks, tasks and hosts for now. Results and plays can come in a later patch as necessary.
Build failed. ✔️ ara-tox-py3 SUCCESS in 3m 50s
Force-pushed from 0ce3cf1 to c92b29b.

Build failed. ✔️ ara-tox-py3 SUCCESS in 3m 49s
Force-pushed from c92b29b to 6283872.

Nothing special pushed, just rebased on top of latest master.

Build failed. ✔️ ara-tox-py3 SUCCESS in 4m 05s
I will eventually include it in the docs but in the meantime, I've come up with the following graph that explains how one might use the exporter:
```mermaid
flowchart TD
    G[Grafana] -->|promql| P(Prometheus)
    P -->|scrapes /metrics<br/> stores data| E(Prometheus Exporter<br/>prometheus_client)
    E -->|query metrics| D(ara API server <br/> django <br/>fa:fa-database recorded playbooks)
    A(ansible playbook) -->|collects data<br/>& sends it| D
```
ara doesn't provide monitoring or alerting out of the box (they are out of scope) but it records a number of granular metrics about Ansible playbooks, tasks and hosts, amongst other things.

Starting with version 1.6.2, ara provides an integration of `prometheus_client <https://github.com/prometheus/client_python>`_ that queries the ara API and then exposes these metrics for prometheus to scrape.
1.6.2 didn't pan out, we went straight to 1.7.0. It can be included in a release as soon as it's ready.
```python
help='Maximum number of days to backfill metrics for (default: 90)',
default=90,
type=int
)
```
I think it could be interesting for the exporter to be able to filter queries the same way the general CLI commands do; for example, ara playbook list (docs) has:

```
--ansible_version <ansible_version>
                      List playbooks that ran with the specified Ansible
                      version (full or partial)
--client_version <client_version>
                      List playbooks that were recorded with the specified
                      ara client version (full or partial)
--server_version <server_version>
                      List playbooks that were recorded with the specified
                      ara server version (full or partial)
--python_version <python_version>
                      List playbooks that were recorded with the specified
                      python version (full or partial)
--user <user>         List playbooks that were run by the specified user
                      (full or partial)
--controller <controller>
                      List playbooks that ran from the provided controller
                      (full or partial)
--name <name>         List playbooks matching the provided name (full or
                      partial)
--path <path>         List playbooks matching the provided path (full or
                      partial)
--status <status>     List playbooks matching a specific status
                      ('completed', 'running', 'failed')
```
Hi, I think you can transform this metric into several metrics, for example:

```
ara_tasks_duration{action="command", name="Echo the abc binary string", path="/home/.......", playbook="30"} <number of seconds (or microseconds if needed)>
ara_tasks_results{action="command", name="Echo the abc binary string", path="/home/.......", playbook="30"} 1
```

We can work together to build the correct metrics, then we will produce the correct python for the exporter.
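Rendered with prometheus_client, that suggestion might look like the sketch below (label values here are placeholders, including the path):

```python
from prometheus_client import Gauge

labelnames = ["action", "name", "path", "playbook"]
tasks_duration = Gauge("ara_tasks_duration", "Task duration in seconds", labelnames)
tasks_results = Gauge("ara_tasks_results", "Task result (1 = ok)", labelnames)

# Placeholder label values for illustration:
labels = dict(
    action="command",
    name="Echo the abc binary string",
    path="/home/example/playbook.yml",
    playbook="30",
)
tasks_duration.labels(**labels).set(0.42)  # seconds (or microseconds if needed)
tasks_results.labels(**labels).set(1)
```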
Hi @voileux and thanks for reaching out! What you suggest makes sense to me and it's worth looking into. I don't have bandwidth to look into this *right now* but I will revisit this in the near future.
Hello, depending on your goal here: it might be easier for you to limit the "exporter part" to what you want to monitor live (i.e. what you want to trigger alerts on). And for the visualization aspects, directly connect grafana to your database with the specific grafana datasource,
something like:

```mermaid
flowchart TD
    G[Grafana] -->|promql <br/> visualize <b>alerts</b><br/> and correlate current metrics| P(Prometheus)
    G -->|db datasource <br/> visualize <b>metrics</b> <br/>current and historical| D
    W(alertmanager) -->|promql<br/>trigger alerts| P
    P -->|scrapes /metrics<br/> stores data| E(Prometheus Exporter<br/>prometheus_client)
    E -->|query metrics| D(ara API server <br/> django <br/>fa:fa-database recorded playbooks)
    A(ansible playbook) -->|collects data<br/>& sends it| D
```
instead of (from your previous schema here):

```mermaid
flowchart TD
    G[Grafana] -->|promql| P(Prometheus)
    P -->|scrapes /metrics<br/> stores data| E(Prometheus Exporter<br/>prometheus_client)
    E -->|query metrics| D(ara API server <br/> django <br/>fa:fa-database recorded playbooks)
    A(ansible playbook) -->|collects data<br/>& sends it| D
```
(edit: I forgot to put the mermaid keyword, and took this opportunity to add …)

This indeed requires you to rewrite your panels in grafana in order to make use of the proper SQL, and you will need to open the connection between grafana and your DB. It also avoids transforming the whole content of the DB to the opentelemetry format and scraping it each time, which will scale better :-D
Hi, I haven't revisited this in a little while but I wanted to say it was still on my radar and I plan to work on this some more in the near future.
Hello @dmsimard, Thank you for the great project! Really nice to see and use! I'm interested in taking over the topic if that's alright with you? I'm also willing to build the grafana dashboard based on the metrics gathered; I'm no expert, but I've used them a bit. I've created a branch on my repo and tried to take into account your suggestions & @voileux's. However, I'm currently stuck on the testing phase. I've read your documentation and code, but I can't make the `ara prometheus` command work.
The project runs locally, I still have access to everything as before, but no way to get access to prometheus through the CLI:
Logs of the previous steps:
I feel like this part of the documentation is a bit thin, and having to use/understand buildah/docker/tox (is it needed?) or the overall parser is difficult for me. My feeling is that there's either some cache that I haven't cleaned and that it still uses some old version (…). I'm also willing to help with the docs on those parts so that other people can participate, but so far it's still too blurry for me to write anything clear. If you can fill in the blanks it would be amazing! Thanks!
Hi @xlr-8, thanks for your interest and for looking into this. I haven't yet revisited this topic but I did talk about it at configuration management camp last year. I am still interested in making this work :) In the backup slides for last year's presentation there's a condensed how-to for testing this: "Demo: Trying out the exporter".
This should help you get started without needing to re-build container images after every change.
You can make changes to the exporter code and re-run it with the `ara prometheus` command. The prometheus config supplied in the backup slides:
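A minimal equivalent of such a config (the job name, target port and scrape interval here are assumptions, matching the exporter defaults sketched earlier):

```yaml
scrape_configs:
  - job_name: ara
    scrape_interval: 60s
    static_configs:
      - targets: ["127.0.0.1:8001"]
```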
Start a Prometheus container:
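For example, with a config like the one above mounted into the stock image (podman shown; docker works the same way):

```shell
podman run --rm --network=host \
    -v ./prometheus.yml:/etc/prometheus/prometheus.yml:z \
    docker.io/prom/prometheus
```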
It's probably worthwhile for the branch to be rebased on top of the latest master by now. There haven't been changes that would impact the prometheus implementation, I don't think, but there have been things like django updates and such. I can take care of that if you'd like. Otherwise:
Personal preference :) I can be reached over matrix (or the slack bridge) and maybe IRC for discussion.
Awesome! Thank you so much for the detailed answer!
Alright, I figured perhaps there was some better integration with RedHat / RedHat-like distros, as I could see you were using Fedora/CentOS. No worries for the rebase, I'll take care of it. I should take a look at it within the next few days ❤️
@xlr-8 did you end up spending some cycles on this? It is coming back onto my radar in the not-too-distant future.
As discussed on the issue for this topic: #177
It's not finished and still very much a WIP but I figured it might be worthwhile to iterate under a branch in a PR instead of the gist: https://gist.github.com/dmsimard/68c149eea34dbff325c9e4e9c39980a0
If prometheus_client is installed, there will be an `ara prometheus` command to expose prometheus metrics gathered and parsed from an ara instance:
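For illustration, an invocation might look like the following; the flag spellings are assumptions based on the commit messages above, not confirmed:

```shell
ara prometheus --prometheus-port 8001 --poll-frequency 60 --max-days 90
```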