1 | 1 | # Platform Monitoring |
2 | 2 |
3 | | -Platform monitoring is a work in progress. If you wish to follow the progress you can do so [here](https://github.com/deis/monitor). |
| 3 | +## Description |
| 4 | +With the release of Workflow Beta4 we now include a monitoring stack for introspection on a running Kubernetes cluster. The stack includes four components: |
| 5 | +* [Telegraf](https://docs.influxdata.com/telegraf/v0.12/) - Metrics collection daemon written by the team behind InfluxDB |
| 6 | +* [InfluxDB](https://docs.influxdata.com/influxdb/v0.12/) - Time series database |
| 7 | +* [Grafana](http://grafana.org/) - Graphing tool for time series data |
| 8 | +* [Stdout-Metrics](https://github.com/deis/stdout-metrics) - Tool that consumes metrics from standard out and forwards them to InfluxDB |
| 9 | + |
| 10 | +## Architecture Diagram |
| 11 | + |
| 12 | +``` |
| 13 | + ┌────────┐ |
| 14 | + │ Router │ |
| 15 | + └────────┘ |
| 16 | + │ |
| 17 | + │ |
| 18 | + ▼ ┌──────────┐ |
| 19 | +┌─────────────┐ ┌─────────┐ │ stdout │ |
| 20 | +│ HOST │ │ fluentd │────▶│ metrics │ |
| 21 | +│ Telegraf │───┐ └─────────┘ └──────────┘ |
| 22 | +└─────────────┘ │ │ |
| 23 | + │ │ |
| 24 | +┌─────────────┐ │ │ |
| 25 | +│ HOST │ │ ┌───────────┐ │ |
| 26 | +│ Telegraf │───┼───▶│ InfluxDB │◀─────────┘ |
| 27 | +└─────────────┘ │ └───────────┘ |
| 28 | + │ │ |
| 29 | +┌─────────────┐ │ │ |
| 30 | +│ HOST │ │ ▼ |
| 31 | +│ Telegraf │───┘ ┌──────────┐ |
| 32 | +└─────────────┘ │ Grafana │ |
| 33 | + └──────────┘ |
| 34 | +``` |
| 35 | + |
| 36 | +### Grafana |
| 37 | +We expose Grafana through the router using [service annotations](https://github.com/deis/router#how-it-works), so users can reach the Grafana UI at `grafana.mydomain.com`. The default username/password is `admin/admin`. You can override it at any time by setting the environment variables `GRAFANA_USER` and `GRAFANA_PASSWD` in `$CHART_HOME/workspace/workflow-$WORKFLOW_RELEASE/manifests/deis-monitor-grafana-rc.yaml`. |
| 38 | + |
| 39 | +Grafana preloads several dashboards that we've created to help operators get started with monitoring their Kubernetes and Workflow installations. Each dashboard is meant to be a starting point for the operator and does not represent all the dashboards needed to monitor a production installation. |
| 40 | + |
| 41 | +Grafana's data is currently not written to the host file system or to long-term storage. Therefore, if the Grafana instance dies you will lose all custom and modified dashboards. We recommend exporting your dashboards and storing them in version control until a solution for long-term storage is implemented. |
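One way to script the export recommended above is sketched below. It is not part of Workflow. It assumes the Grafana HTTP API endpoints `/api/search` and `/api/dashboards/<uri>` as found in Grafana releases of this era, and that basic authentication is enabled; `make_fetcher` and `export_dashboards` are hypothetical helper names.

```python
import base64
import json
import os
import urllib.request


def make_fetcher(base_url, user, password):
    """Return a function that GETs a Grafana API path and decodes the JSON body."""
    token = base64.b64encode("{}:{}".format(user, password).encode()).decode()

    def fetch(path):
        request = urllib.request.Request(
            base_url + path, headers={"Authorization": "Basic " + token}
        )
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read().decode("utf-8"))

    return fetch


def export_dashboards(fetch, out_dir):
    """Write every dashboard returned by the search endpoint to out_dir as JSON."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for entry in fetch("/api/search"):
        uri = entry["uri"]  # assumed to look like "db/my-dashboard"
        payload = fetch("/api/dashboards/" + uri)
        path = os.path.join(out_dir, uri.replace("/", "_") + ".json")
        with open(path, "w") as handle:
            # Sorted keys keep diffs small when the files live in version control.
            json.dump(payload["dashboard"], handle, indent=2, sort_keys=True)
        written.append(path)
    return written
```

With the default credentials, a run might look like `export_dashboards(make_fetcher("http://grafana.mydomain.com", "admin", "admin"), "dashboards")`, after which the `dashboards` directory can be committed.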
| 42 | + |
| 43 | +### InfluxDB |
| 44 | +As of the Beta4 release InfluxDB writes data to the host disk. However, if the InfluxDB pod dies and comes back on another host, the data will not be recovered. We intend to fix this in a future release. The InfluxDB Admin UI is also exposed through the router, allowing users to access the query engine by going to `influx.mydomain.com`. You will need to configure where to find the `influx-api` endpoint by clicking the "gear" icon at the top right and changing the host to `influxapi.mydomain.com` and the port to `80`. |
| 45 | + |
| 46 | +**Note: Each user accessing the Influx UI will need to make this change.** |
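The same API endpoint can also be queried without the Admin UI. The sketch below builds a request against InfluxDB's HTTP `/query` endpoint using the host and port given above. The database name `telegraf` in the usage note is an assumption, not something this document states, and `build_query_url` and `run_query` are hypothetical helper names.

```python
import json
import urllib.request
from urllib.parse import urlencode


def build_query_url(query, database, host="influxapi.mydomain.com", port=80):
    """Build a URL for InfluxDB's HTTP /query endpoint."""
    params = urlencode({"db": database, "q": query})
    return "http://{}:{}/query?{}".format(host, port, params)


def run_query(query, database):
    """Send the query and return the decoded JSON response."""
    with urllib.request.urlopen(build_query_url(query, database)) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `run_query("SHOW MEASUREMENTS", "telegraf")` would list the measurements in a database named `telegraf`, if one exists in your installation.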
| 47 | + |
| 48 | +You can choose not to expose the Influx UI and API to the world by editing `$CHART_HOME/workspace/workflow-$WORKFLOW_RELEASE/manifests/deis-monitor-influxdb-api-svc.yaml` and `$CHART_HOME/workspace/workflow-$WORKFLOW_RELEASE/manifests/deis-monitor-influxdb-ui-svc.yaml` and removing the line `router.deis.io/routable: "true"` from each. |
| 49 | + |
| 50 | +### Telegraf |
| 51 | +Telegraf is the metrics collection daemon used within the monitoring stack. It will collect and send the following metrics to InfluxDB: |
| 52 | + |
| 53 | +* System level metrics such as CPU, Load Average, Memory, Disk, and Network stats |
| 54 | +* Container level metrics such as CPU and Memory |
| 55 | +* Kubernetes metrics such as API request latency, Pod Startup Latency, and number of running pods |
| 56 | + |
| 57 | +It is possible to send these metrics to endpoints other than InfluxDB. For more information, see the Telegraf [configuration template](https://github.com/deis/monitor/blob/master/telegraf/rootfs/config.toml.tpl). |
| 58 | + |
| 59 | +### Stdout-Metrics |
| 60 | +Stdout-Metrics is a custom tool built by the Deis team to collect metrics from components, such as Nginx, that report them via standard out. It consumes the log stream from Fluentd and filters out messages that are not from the [Deis Router](https://github.com/deis/router). When it finds a message it can parse, it turns that message into a metric and sends it directly to InfluxDB. |
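The parse-and-forward step described above can be illustrated with a minimal sketch. This is not the actual Stdout-Metrics code. The log layout in `LOG_PATTERN`, the measurement name `router_request`, and the helper names are all hypothetical; only the InfluxDB line protocol shape (`measurement,tag=value field=value`) is standard.

```python
import re

# Hypothetical access-log layout; the real Deis Router format may differ.
LOG_PATTERN = re.compile(
    r"^\[(?P<time>[^\]]+)\] - (?P<app>\S+) - (?P<status>\d{3})"
    r" - (?P<request_time>\d+\.\d+)$"
)


def escape_tag(value):
    """Escape commas, equals signs, and spaces in a line-protocol tag value."""
    return value.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def to_line_protocol(line):
    """Turn one log line into a line-protocol point, or None if it does not parse."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None  # Not a message we can parse, so it is skipped.
    # No timestamp is appended, so InfluxDB assigns its own on write.
    return "router_request,app={},status={} request_time={}".format(
        escape_tag(match.group("app")),
        match.group("status"),
        match.group("request_time"),
    )
```

Lines that do not match simply yield `None`, which mirrors the filtering behaviour described above: unparseable messages are dropped rather than reported as errors.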
| 61 | + |
| 62 | +### Customizing |
| 63 | +Each of these components can be customized via environment variables. To learn more, visit the following GitHub repositories: |
| 64 | + |
| 65 | +* [stdout-metrics](https://github.com/deis/stdout-metrics) |
| 66 | +* [monitor](https://github.com/deis/monitor) |