Commit 6181e8b

Merge pull request #39 from jianxiaoguo/main
feat(monitor): use prometheus as grafana datasource
2 parents cb83811 + 03d664b commit 6181e8b

8 files changed

Lines changed: 93 additions & 86 deletions

README.md

Lines changed: 2 additions & 1 deletion
@@ -19,7 +19,8 @@ Please see below for links and descriptions of each component:
 - [registry](https://github.com/drycc/registry) - The Docker registry
 - [logger](https://github.com/drycc/logger) - The (in-memory) log buffer for `drycc logs`
 - [monitor](https://github.com/drycc/monitor) - The platform monitoring components
-- [influxdb](https://github.com/drycc/influxdb) - The monitor database
+- [influxdb](https://github.com/drycc/influxdb) - The controller app metrics database
+- [prometheus](https://github.com/drycc/prometheus) - The monitor database
 - [rabbitmq](https://github.com/drycc/rabbitmq) - RabbitMQ is a message broker used with controller celery
 - [storage](https://github.com/drycc/storage) - The in-cluster, kubernetes storage, s3 api compatible, hybrid storage system.
 - [workflow-cli](https://github.com/drycc/workflow-cli) - Workflow CLI `drycc`

_scripts/install.sh

Lines changed: 8 additions & 0 deletions
@@ -478,6 +478,14 @@ monitor:
   telegraf:
     imageRegistry: ${DRYCC_REGISTRY}
 
+prometheus:
+  prometheus-server:
+    retention: ${PROMETHEUS_SERVER_RETENTION:-"15d"}
+    persistence:
+      enabled: true
+      accessMode: ReadWriteOnce
+      size: ${PROMETHEUS_SERVER_PERSISTENCE_SIZE:-10Gi}
+      storageClass: ${PROMETHEUS_SERVER_PERSISTENCE_STORAGE_CLASS:-""}
 
 passport:
   replicas: ${PASSPORT_REPLICAS}
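
The `${VAR:-default}` expansions added in this hunk make each Prometheus setting fall back to a default when its environment variable is unset or empty. A minimal sketch of that behaviour, assuming a POSIX shell; the function name `render_prometheus_values` is hypothetical, the defaults are written without inner quotes for simplicity, and the real `install.sh` may assemble its values differently:

```shell
#!/bin/sh
# Hypothetical helper: prints a values fragment similar to the one added to install.sh.
# Each ${VAR:-default} expands to the default when VAR is unset or empty.
render_prometheus_values() {
  cat <<EOF
prometheus:
  prometheus-server:
    retention: ${PROMETHEUS_SERVER_RETENTION:-15d}
    persistence:
      enabled: true
      accessMode: ReadWriteOnce
      size: ${PROMETHEUS_SERVER_PERSISTENCE_SIZE:-10Gi}
      storageClass: "${PROMETHEUS_SERVER_PERSISTENCE_STORAGE_CLASS:-}"
EOF
}

# With none of the variables set, the defaults are printed.
render_prometheus_values

# Setting one variable in a subshell changes only that field.
(PROMETHEUS_SERVER_RETENTION=30d; render_prometheus_values)
```

Running the override in a subshell keeps the assignment from leaking into the rest of the script.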

charts/workflow/Chart.yaml

Lines changed: 3 additions & 0 deletions
@@ -42,6 +42,9 @@ dependencies:
   - name: passport
     repository: oci://registry.drycc.cc/charts-testing
     version: x.x.x
+  - name: prometheus
+    repository: oci://registry.drycc.cc/charts-testing
+    version: x.x.x
 description: Drycc Workflow
 home: https://github.com/drycc/workflow
 maintainers:

charts/workflow/values.yaml

Lines changed: 29 additions & 0 deletions
@@ -42,6 +42,12 @@ global:
   # - on-cluster: Run Influxdb within the Kubernetes cluster
   # - off-cluster: Influxdb is running outside of the cluster and credentials and connection information will be provided.
   influxdbLocation: "on-cluster"
+  # Set the location of Workflow's prometheus cluster
+  #
+  # Valid values are:
+  # - on-cluster: Run prometheus within the Kubernetes cluster
+  # - off-cluster: prometheus is running outside of the cluster and credentials and connection information will be provided.
+  prometheusLocation: "on-cluster"
   # Set the location of Workflow's grafana instance
   #
   # Valid values are:
@@ -286,6 +292,29 @@ passport:
   databaseUrl: ""
   databaseReplicaUrl: ""
 
+prometheus:
+  ## prometheus-server configuration##
+  prometheus-server:
+    replicas: 1
+    retention: 15d
+    # persistence config
+    persistence:
+      enabled: true
+      accessMode: ReadWriteOnce
+      size: 10Gi
+      storageClass: ""
+  ## node-exporter configuration##
+  node-exporter:
+    enabled: true
+  ## kube-state-metrics configuration
+  ##
+  kube-state-metrics:
+    enabled: true
+  # Configure the following ONLY if using an off-cluster prometheus database
+  # URL configuration is only available in off-cluster prometheus database
+  url: "http://my.prometheus.url:9090"
+
+
 # acme configuration takes effect if and only if certManagerEnabled is true
 acme:
   server: https://acme-v02.api.letsencrypt.org/directory
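
For the off-cluster case described in the comments of this hunk, a values override needs only the location switch and the URL. A minimal sketch, assuming `url` sits directly under the top-level `prometheus` key (the nesting is an assumption, since indentation is not preserved in this view); the helper name and output path are hypothetical, and the default URL is the placeholder from `values.yaml`:

```shell
#!/bin/sh
# Hypothetical helper: writes a values override for an off-cluster Prometheus.
# $1 = output file, $2 = Prometheus base URL (falls back to the placeholder).
write_offcluster_values() {
  out_file="$1"
  prom_url="${2:-http://my.prometheus.url:9090}"
  cat > "$out_file" <<EOF
global:
  prometheusLocation: "off-cluster"
prometheus:
  url: "${prom_url}"
EOF
}

write_offcluster_values ./prometheus-offcluster.yaml
cat ./prometheus-offcluster.yaml
```

The resulting file could then be passed to `helm install` with `-f`; the release and chart names depend on your installation.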

src/managing-workflow/platform-logging.md

Lines changed: 5 additions & 21 deletions
@@ -39,29 +39,13 @@ Error: There are currently no log messages. Please check the following things:
                         │ Router │                  ┌────────┐     ┌─────┐
                         └────────┘                  │ Logger │◀───▶│Redis│
                             │                       └────────┘     └─────┘
-                        Log file                        ▲
+                        Log file
                             │                           │
                             ▼                           │
-┌────────┐             ┌─────────┐    logs/metrics   ┌──────────────┐
-│App Logs│──Log File──▶│ fluentd │───────topics─────▶│ Redis Stream │
-└────────┘             └─────────┘                   └──────────────┘
-
-
-┌─────────────┐                                          │
-│ HOST        │                                          ▼
-│  Telegraf   │───┐                                  ┌────────┐
-└─────────────┘   │                                  │Telegraf│
-                  │                                  └────────┘
-┌─────────────┐   │                                      │
-│ HOST        │   │    ┌───────────┐                     │
-│  Telegraf   │───┼───▶│ InfluxDB  │◀────Wire ───────────┘
-└─────────────┘   │    └───────────┘   Protocol
-                  │          ▲
-┌─────────────┐   │          │
-│ HOST        │   │          ▼
-│  Telegraf   │───┘    ┌──────────┐
-└─────────────┘        │ Grafana  │
-                       └──────────┘
+┌────────┐             ┌─────────┐    logs/metrics   ┌──────────────┐
+│App Logs│──Log File──▶│ fluentd │───────topics─────▶│ Redis Stream │
+└────────┘             └─────────┘                   └──────────────┘
+
 ```
 
 ## Default Configuration

src/managing-workflow/platform-monitoring.md

Lines changed: 37 additions & 64 deletions
@@ -2,42 +2,31 @@
 
 ## Description
 
-We now include a monitoring stack for introspection on a running Kubernetes cluster. The stack includes 3 components:
+We now include a monitoring stack for introspection on a running Kubernetes cluster. The stack includes 4 components:
 
-* [Telegraf](https://docs.influxdata.com/telegraf) - Metrics collection daemon written by team behind InfluxDB.
-* [InfluxDB](https://docs.influxdata.com/influxdb) - Time series database
-* [Grafana](http://grafana.org/) - Graphing tool for time series data
+* [kube-state-metrics](https://github.com/kubernetes/kube-state-metrics), kube-state-metrics (KSM) is a simple service that listens to the Kubernetes API server and generates metrics about the state of the objects.
+* [Node Exporter](http://github.com/prometheus/node_exporter), Prometheus exporter for hardware and OS metrics exposed by *NIX kernels.
+* [Prometheus](https://prometheus.io/), a [Cloud Native Computing Foundation](https://cncf.io/) project, is a systems and service monitoring system.
+* [Grafana](http://grafana.org/), Graphing tool for time series data
 
 ## Architecture Diagram
 
 ```
-                        ┌────────┐
-                        │ Router │                  ┌────────┐     ┌─────┐
-                        └────────┘                  │ Logger │◀───▶│Redis│
-                            │                       └────────┘     └─────┘
-                        Log file                        ▲
-                            │                           │
-                            ▼                           │
-┌────────┐             ┌─────────┐    logs/metrics   ┌──────────────┐
-│App Logs│──Log File──▶│ fluentd │───────topics─────▶│ Redis Stream │
-└────────┘             └─────────┘                   └──────────────┘
-
-
-┌─────────────┐                                          │
-│ HOST        │                                          ▼
-│  Telegraf   │───┐                                  ┌────────┐
-└─────────────┘   │                                  │Telegraf│
-                  │                                  └────────┘
-┌─────────────┐   │                                      │
-│ HOST        │   │    ┌───────────┐                     │
-│  Telegraf   │───┼───▶│ InfluxDB  │◀────Wire ───────────┘
-└─────────────┘   │    └───────────┘   Protocol
-                  │          ▲
-┌─────────────┐   │          │
-│ HOST        │   │          ▼
-│  Telegraf   │───┘    ┌──────────┐
-└─────────────┘        │ Grafana  │
-                       └──────────┘
+┌────────────────┐
+│ HOST           │
+│ node-exporter  │◀──┐                      ┌──────────────────┐
+└────────────────┘   │                      │kube-state-metrics│
+                     │                      └──────────────────┘
+┌────────────────┐   │                               ▲
+│ HOST           │   │    ┌────────────┐             │
+│ node-exporter  │◀──┼────│ Prometheus │─────────────┘
+└────────────────┘   │    └────────────┘
+                     │           ▲
+┌───────────────┐    │           │
+│ HOST          │    │           ▼
+│  node-exporter│◀───┘      ┌──────────┐
+└───────────────┘           │ Grafana  │
+                            └──────────┘
 ```
 
 ## [Grafana](https://grafana.com/)
@@ -75,44 +64,28 @@ If you wish to have persistence for Grafana you can set `enabled` to `true` in t
 
 If you wish to provide your own Grafana instance you can set `grafanaLocation` in the `values.yaml` file before running `helm install`.
 
-## [InfluxDB](https://docs.influxdata.com/influxdb)
-InfluxDB writes data to the host disk; however, if the InfluxDB pod dies and comes back on another host, the data will not be recovered. The InfluxDB Admin UI is also exposed through the router allowing users to access the query engine by going to `influx.mydomain.com`. You will need to configure where to find the `influx-api` endpoint by clicking the "gear" icon at the top right and changing the host to `influx-api.mydomain.com` and port to `80`.
+## [Prometheus](https://prometheus.io/)
+Prometheus writes data to the host disk; however, if the Prometheus pod dies and comes back on another host, the data will not be recovered. The Prometheus graph UI is also exposed through the router, allowing users to access the query engine by going to `prometheus.mydomain.com`.
 
 ### On Cluster Persistence
-If you wish to have persistence for InfluxDB you can set `enabled` to `true` in the `values.yaml` file before running `helm install`.
+You can enable or disable `node-exporter` and `kube-state-metrics` by setting their `enabled` value to `true` or `false` in the `values.yaml` file.
+If you wish to have persistence for Prometheus you can set `enabled` to `true` in the `values.yaml` file before running `helm install`.
 
 ```
-influxdb:
-  # Configure the following ONLY if you want persistence for on-cluster grafana
-  # GCP PDs and EBS volumes are supported only
-  persistence:
-    enabled: true # Set to true to enable persistence
-    size: 5Gi # PVC size
+prometheus:
+  prometheus-server:
+    persistence:
+      enabled: true # Set to true to enable persistence
+      size: 10Gi # PVC size
+  node-exporter:
+    enabled: true
+  kube-state-metrics:
+    enabled: true
 ```
 
-### Off Cluster Influxdb
-
-To use off-cluster Influx v2, please provide the following values in the `values.yaml` file before running `helm install`.
-
-* `influxdbLocation=off-cluster`
-* `url = "http://my-influxhost.com:8086"`
-* `bucket = "metrics"`
-* `org = "drycc"`
-* `token = "MysuperSecurePassword"`
-
-
-## [Telegraf](https://docs.influxdata.com/telegraf)
-
-Telegraf is the metrics collection daemon used within the monitoring stack. It will collect and send the following metrics to InfluxDB:
-
-* System level metrics such as CPU, Load Average, Memory, Disk, and Network stats
-* Container level metrics such as CPU and Memory
-* Kubernetes metrics such as API request latency, Pod Startup Latency, and number of running pods
-
-It is possible to send these metrics to other endpoints besides InfluxDB. For more information please consult the following [file](https://github.com/drycc/monitor/blob/main/telegraf/rootfs/config.toml.tpl)
-
-### Customizing the Monitoring Stack
+### Off Cluster Prometheus
 
-To learn more about customizing each of the above components please visit the [Tuning Component Settings][] section.
+To use off-cluster Prometheus, please provide the following values in the `values.yaml` file before running `helm install`.
 
-[Tuning Component Settings]: tuning-component-settings.md#customizing-the-monitor
+* `global.prometheusLocation=off-cluster`
+* `url = "http://my.prometheus.url:9090"`
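
Beyond the graph UI mentioned in the new Prometheus section, Prometheus serves an HTTP query API under `/api/v1/query`. A minimal sketch that builds an instant-query URL for the built-in `up` metric, assuming the router exposes the API on the same `prometheus.mydomain.com` host as the UI (an assumption, not something this change states); the helper name is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: builds an instant-query URL for the Prometheus HTTP API.
# $1 = base URL (one trailing slash is stripped), $2 = PromQL expression.
# No URL-encoding is performed, so only simple expressions such as "up" are safe here.
prometheus_query_url() {
  printf '%s/api/v1/query?query=%s\n' "${1%/}" "$2"
}

prometheus_query_url "http://prometheus.mydomain.com" "up"

# To actually run the query (requires network access to the cluster):
#   curl -s "$(prometheus_query_url "http://prometheus.mydomain.com" "up")"
```

The `up` series reports `1` for every target Prometheus scraped successfully, which makes it a quick check that the node-exporter and kube-state-metrics targets are being collected.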

src/quickstart/install-workflow.md

Lines changed: 3 additions & 0 deletions
@@ -234,6 +234,9 @@ HELMBROKER_REPLICAS | Number of helmbroker api replicas t
 HELMBROKER_CELERY_REPLICAS | Number of helmbroker celery replicas to deploy
 HELMBROKER_PERSISTENCE_SIZE | The size of the persistence space allocated to `helmbroker`, which is `5Gi` by default
 HELMBROKER_PERSISTENCE_STORAGE_CLASS | StorageClass of `helmbroker`; the default StorageClass is used by default
+PROMETHEUS_SERVER_RETENTION | Prometheus data retention period (defaults to 15 days if not specified)
+PROMETHEUS_SERVER_PERSISTENCE_SIZE | The size of the persistence space allocated to `prometheus-server`, which is `10Gi` by default
+PROMETHEUS_SERVER_PERSISTENCE_STORAGE_CLASS | StorageClass of `prometheus-server`; the default StorageClass is used by default
 K3S_DATA_DIR | The config of k3s data dir; If not set, the default path is used
 ACME_SERVER | ACME Server url, default use letsencrypt
 ACME_EAB_KEY_ID | The key ID of which your external account binding is indexed by the external account

src/understanding-workflow/components.md

Lines changed: 6 additions & 0 deletions
@@ -146,6 +146,12 @@ Helm Broker is a Service Broker that exposes Helm charts as Service Classes in S
 To do so, Helm Broker uses the concept of addons. An addon is an abstraction layer over a Helm chart
 which provides all information required to convert the chart into a Service Class.
 
+## Prometheus
+
+**Project Location:** [drycc/prometheus](https://github.com/drycc/prometheus)
+
+Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud.
+
 ## See Also
 
 * [Workflow Concepts][concepts]
