Bug 1776135 - ansible-service-broker-operator doesn't notify users and admins via alerts in prometheus
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Service Broker
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Jesus M. Rodriguez
QA Contact: Cuiping HUO
URL:
Whiteboard:
Depends On: 1782058 1783829
Blocks:
Reported: 2019-11-25 08:25 UTC by Cuiping HUO
Modified: 2020-01-23 11:14 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1782058
Environment:
Last Closed: 2020-01-23 11:14:07 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Github openshift ansible-service-broker pull 1260 0 None closed Bug 1776135: Allow ASBO to access endpoints 2020-08-14 08:49:46 UTC
Github openshift ansible-service-broker pull 1265 0 None closed Bug 1776135: make labels match the metrics endpoint 2020-08-14 08:49:46 UTC
Github openshift ansible-service-broker pull 1267 0 None closed Bug 1776135: fix the service monitor selector 2020-08-14 08:49:46 UTC
Red Hat Product Errata RHBA-2020:0062 0 None None None 2020-01-23 11:14:24 UTC

Description Cuiping HUO 2019-11-25 08:25:21 UTC
Description of problem:
ansible-service-broker-operator doesn't notify users and admins via alerts in prometheus

Version-Release number of selected component (if applicable):
4.3.0-0.nightly-2019-11-24-183610
asb operator commit.id: d8cb6fe7bdeef19b888aa9a01e02701e356f630f

How reproducible:
Always

Steps to Reproduce:
1. Install the ASB operator
2. Check the Prometheus rule for the Ansible Service Broker operator
3. Check the Prometheus targets for the Ansible Service Broker operator

Actual results:
2. no rule for the ansible service broker operator
3. no targets for the ansible service broker operator

Expected results:
2. A Prometheus rule with the alert name AnsibleServiceBrokerEnabled can be found
3. Prometheus targets for ansible-service-broker-operator with an Endpoint can be found


Additional info:
$ oc get csv -n openshift-ansible-service-broker
NAME                                               DISPLAY                                     VERSION              REPLACES   PHASE
openshiftansibleservicebroker.4.3.0-201911220712   OpenShift Ansible Service Broker Operator   4.3.0-201911220712              Succeeded
$ oc image info registry-proxy.engineering.redhat.com/rh-osbs/openshift-ose-ansible-service-broker-operator:v4.3.0-201911220712 --filter-by-os linux/amd64 | grep commit.id
               io.openshift.build.commit.id=d8cb6fe7bdeef19b888aa9a01e02701e356f630f
$ oc image info registry-proxy.engineering.redhat.com/rh-osbs/openshift-ose-ansible-service-broker:v4.3.0-201911220712 --filter-by-os linux/amd64 | grep commit.id
               io.openshift.build.commit.id=d8cb6fe7bdeef19b888aa9a01e02701e356f630f

Comment 2 Jesus M. Rodriguez 2019-12-03 04:05:13 UTC
I noticed that in the script used to deploy the brokers the namespace did not have the monitoring labels. 

https://docs.openshift.com/container-platform/4.2/applications/service_brokers/installing-ansible-service-broker.html#sb-install-asb-operator_sb-installing-asb

Enter openshift-ansible-service-broker in the Name field and openshift.io/cluster-monitoring=true in the Labels field and click Create.

Please verify that the monitor label is on the namespace.
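
A minimal sketch of that check, not part of the documented procedure. The function names ns_has_monitoring_label and ensure_monitoring_label are made up for illustration; the first only parses the text that `oc get ns <name> --show-labels` prints, and the second assumes the logged-in user is allowed to label namespaces.

```shell
#!/bin/sh
# Hypothetical helper: reads the output of
#   oc get ns openshift-ansible-service-broker --show-labels
# on stdin and succeeds only if the monitoring label is present.
ns_has_monitoring_label() {
  awk '
    NR > 1 {                                  # skip the header row
      n = split($NF, labels, ",")             # LABELS is the last column
      for (i = 1; i <= n; i++)
        if (labels[i] == "openshift.io/cluster-monitoring=true") ok = 1
    }
    END { exit !ok }
  '
}

# Hypothetical wrapper: adds the label only when the check above fails.
ensure_monitoring_label() {
  ns=openshift-ansible-service-broker
  oc get ns "$ns" --show-labels | ns_has_monitoring_label ||
    oc label namespace "$ns" openshift.io/cluster-monitoring=true
}
```

Running ensure_monitoring_label once should be enough; it does nothing when the label is already there.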

Comment 3 Cuiping HUO 2019-12-03 10:35:42 UTC
Confirming that part of ASB is working.

After adding the label openshift.io/cluster-monitoring=true to the openshift-ansible-service-broker namespace, the Prometheus rule with the alert name AnsibleServiceBrokerEnabled can be found, and the targets for the Ansible Service Broker operator also show the Endpoint.


alert: AnsibleServiceBrokerEnabled
expr: automationbroker_info{automationbroker="ansible-service-broker",namespace="openshift-ansible-service-broker"}
  > 0
labels:
  severity: warning
annotations:
  message: Indicates whether Ansible Service Broker is enabled
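
The rule above should fire whenever a sample of automationbroker_info carrying both labels has a value above zero. A rough offline approximation of that condition, assuming the metrics text has already been scraped and is piped in on stdin (the function name alert_would_fire is hypothetical, and this is plain text matching, not a PromQL evaluation):

```shell
#!/bin/sh
# Hypothetical helper: succeeds if stdin (Prometheus text format) contains an
# automationbroker_info sample with both expected labels and a value > 0.
alert_would_fire() {
  awk '
    index($0, "automationbroker_info{") == 1 &&
    index($0, "automationbroker=\"ansible-service-broker\"") &&
    index($0, "namespace=\"openshift-ansible-service-broker\"") &&
    $NF + 0 > 0 { found = 1 }                 # $NF is the sample value
    END { exit !found }
  '
}
```

If this succeeds on the scraped text but the alert is still silent in Prometheus, the target is probably not being scraped at all.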

$ oc get ns openshift-ansible-service-broker --show-labels
NAME                               STATUS   AGE   LABELS
openshift-ansible-service-broker   Active   31h   openshift.io/cluster-monitoring=true

Comment 6 Cuiping HUO 2019-12-12 08:39:39 UTC
Verification failed. Alert: AnsibleServiceBrokerEnabled is not firing.
cluster version:4.3.0-0.nightly-2019-12-12-004325
asb commit.id: 346a81a77323baeb9f8bcb13437f7e7e32a0824f

$ oc get clusterservicebroker
NAME                     URL                                                          STATUS   AGE
ansible-service-broker   https://asb.openshift-ansible-service-broker.svc:1338/osb/   Ready    32m

$ oc -n openshift-ansible-service-broker get ep
NAME                                                ENDPOINTS                         AGE
asb                                                 10.130.2.8:1338,10.130.2.8:1337   50m
openshift-ansible-service-broker-operator-metrics   <none>                            50m
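
A small sketch for spotting this situation in a listing like the one above. The helper name endpoints_without_addresses is made up for illustration, and it relies only on the column layout shown here (name first, endpoints second).

```shell
#!/bin/sh
# Hypothetical helper: reads `oc -n <namespace> get ep` output on stdin and
# prints the name of every Endpoints object that has no addresses.
endpoints_without_addresses() {
  awk 'NR > 1 && $2 == "<none>" { print $1 }'
}
```

Piping `oc -n openshift-ansible-service-broker get ep` into it would list the services that Prometheus has nothing to scrape behind.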

$ token=`oc -n openshift-monitoring sa get-token prometheus-k8s`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-1  -- curl -k -H "Authorization: Bearer $token" 'https://10.130.2.8:1338/metrics'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
# HELP apiserver_audit_event_total Counter of audit events generated and sent to the audit backend.
# TYPE apiserver_audit_event_total counter
apiserver_audit_event_total 0
# HELP apiserver_client_certificate_expiration_seconds Distribution of the remaining lifetime on the certificate used to authenticate a request.
# TYPE apiserver_client_certificate_expiration_seconds histogram
apiserver_client_certificate_expiration_seconds_bucket{le="0"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="21600"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="43200"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="86400"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="172800"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="345600"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="604800"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="2.592e+06"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="7.776e+06"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="1.5552e+07"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="3.1104e+07"} 0
apiserver_client_certificate_expiration_seconds_bucket{le="+Inf"} 0
apiserver_client_certificate_expiration_seconds_sum 0
apiserver_client_certificate_expiration_seconds_count 0
# HELP asb_deprovision_jobs How many deprovision jobs are actively in the buffer.
# TYPE asb_deprovision_jobs gauge
asb_deprovision_jobs 0
# HELP asb_provision_jobs How many provision jobs are actively in the buffer.
# TYPE asb_provision_jobs gauge
asb_provision_jobs 0
# HELP asb_sandbox Gauge of all sandbox namespaces that are active.
# TYPE asb_sandbox gauge
asb_sandbox 0
# HELP asb_specs_deleted Specs deleted from data-store.
# TYPE asb_specs_deleted gauge
asb_specs_deleted 0
# HELP asb_specs_total Spec count of different registries and marked for deletion.
# TYPE asb_specs_total gauge
asb_specs_total{source="marked_for_deletion"} 0
asb_specs_total{source="test"} 4
# HELP asb_update_jobs How many update jobs are actively in the buffer.
# TYPE asb_update_jobs gauge
asb_update_jobs 0
# HELP authenticated_user_requests Counter of authenticated requests broken out by username.
# TYPE authenticated_user_requests counter
authenticated_user_requests{username="other"} 655
# HELP bundlelib_sandbox Guage of all sandbox namespaces that are active.
# TYPE bundlelib_sandbox gauge
bundlelib_sandbox 0
# HELP etcd_helper_cache_entry_count Counter of etcd helper cache entries. This can be different from etcd_helper_cache_miss_count because two concurrent threads can miss the cache and generate the same entry twice.
# TYPE etcd_helper_cache_entry_count counter
etcd_helper_cache_entry_count 0
# HELP etcd_helper_cache_hit_count Counter of etcd helper cache hits.
# TYPE etcd_helper_cache_hit_count counter
etcd_helper_cache_hit_count 0
# HELP etcd_helper_cache_miss_count Counter of etcd helper cache miss.
# TYPE etcd_helper_cache_miss_count counter
etcd_helper_cache_miss_count 0
# HELP etcd_request_cache_add_latencies_summary Latency in microseconds of adding an object to etcd cache
# TYPE etcd_request_cache_add_latencies_summary summary
etcd_request_cache_add_latencies_summary{quantile="0.5"} NaN
etcd_request_cache_add_latencies_summary{quantile="0.9"} NaN
etcd_request_cache_add_latencies_summary{quantile="0.99"} NaN
etcd_request_cache_add_latencies_summary_sum 0
etcd_request_cache_add_latencies_summary_count 0
# HELP etcd_request_cache_get_latencies_summary Latency in microseconds of getting an object from etcd cache
# TYPE etcd_request_cache_get_latencies_summary summary
etcd_request_cache_get_latencies_summary{quantile="0.5"} NaN
etcd_request_cache_get_latencies_summary{quantile="0.9"} NaN
etcd_request_cache_get_latencies_summary{quantile="0.99"} NaN
etcd_request_cache_get_latencies_summary_sum 0
etcd_request_cache_get_latencies_summary_count 0
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 1.3107e-05
go_gc_duration_seconds{quantile="0.25"} 1.8163e-05
go_gc_duration_seconds{quantile="0.5"} 2.1872e-05
go_gc_duration_seconds{quantile="0.75"} 4.7245e-05
go_gc_duration_seconds{quantile="1"} 0.000413544
go_gc_duration_seconds_sum 0.00234589
go_gc_duration_seconds_count 60
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 25
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 9.079216e+06
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 2.7715248e+08
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.49898e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 1.105552e+06
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 2.422784e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 9.079216e+06
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 5.459968e+07
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 1.142784e+07
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 39558
# HELP go_memstats_heap_released_bytes_total Total number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes_total counter
go_memstats_heap_released_bytes_total 0
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 6.602752e+07
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.5761396081224551e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 0
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 1.14511e+06
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 6944
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 16384
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 128304
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 147456
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 1.0550512e+07
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 1.124756e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 1.048576e+06
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 1.048576e+06
# HELP go_memstats_sys_bytes Number of bytes obtained by system. Sum of all system allocations.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 7.2286456e+07
100 11866  100 11866    0     0  96503      0 --:--:-- --:--:-- --:--:-- 97262
# HELP http_request_duration_microseconds The HTTP request latencies in microseconds.
# TYPE http_request_duration_microseconds summary
http_request_duration_microseconds{handler="ansible-service-broker",quantile="0.5"} 6066.34
http_request_duration_microseconds{handler="ansible-service-broker",quantile="0.9"} 6066.34
http_request_duration_microseconds{handler="ansible-service-broker",quantile="0.99"} 6066.34
http_request_duration_microseconds_sum{handler="ansible-service-broker"} 42629.947
http_request_duration_microseconds_count{handler="ansible-service-broker"} 6
http_request_duration_microseconds{handler="prometheus",quantile="0.5"} 1620.509
http_request_duration_microseconds{handler="prometheus",quantile="0.9"} 2559.998
http_request_duration_microseconds{handler="prometheus",quantile="0.99"} 4577.65
http_request_duration_microseconds_sum{handler="prometheus"} 309983.51600000006
http_request_duration_microseconds_count{handler="prometheus"} 172
# HELP http_request_size_bytes The HTTP request sizes in bytes.
# TYPE http_request_size_bytes summary
http_request_size_bytes{handler="ansible-service-broker",quantile="0.5"} 142
http_request_size_bytes{handler="ansible-service-broker",quantile="0.9"} 142
http_request_size_bytes{handler="ansible-service-broker",quantile="0.99"} 142
http_request_size_bytes_sum{handler="ansible-service-broker"} 852
http_request_size_bytes_count{handler="ansible-service-broker"} 6
http_request_size_bytes{handler="prometheus",quantile="0.5"} 214
http_request_size_bytes{handler="prometheus",quantile="0.9"} 214
http_request_size_bytes{handler="prometheus",quantile="0.99"} 214
http_request_size_bytes_sum{handler="prometheus"} 35908
http_request_size_bytes_count{handler="prometheus"} 172
# HELP http_requests_total Total number of HTTP requests made.
# TYPE http_requests_total counter
http_requests_total{code="200",handler="ansible-service-broker",method="get"} 6
http_requests_total{code="200",handler="prometheus",method="get"} 172
# HELP http_response_size_bytes The HTTP response sizes in bytes.
# TYPE http_response_size_bytes summary
http_response_size_bytes{handler="ansible-service-broker",quantile="0.5"} 33010
http_response_size_bytes{handler="ansible-service-broker",quantile="0.9"} 33010
http_response_size_bytes{handler="ansible-service-broker",quantile="0.99"} 33010
http_response_size_bytes_sum{handler="ansible-service-broker"} 198060
http_response_size_bytes_count{handler="ansible-service-broker"} 6
http_response_size_bytes{handler="prometheus",quantile="0.5"} 2292
http_response_size_bytes{handler="prometheus",quantile="0.9"} 2297
http_response_size_bytes{handler="prometheus",quantile="0.99"} 11865
http_response_size_bytes_sum{handler="prometheus"} 450346
http_response_size_bytes_count{handler="prometheus"} 172
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 5.7
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 11
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 4.089856e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.57613711977e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 5.2985856e+08

Comment 8 Cuiping HUO 2019-12-13 08:38:36 UTC
Verification failed on two points:
1. Alert: AnsibleServiceBrokerEnabled is not firing.
2. No metric named ansible_service_broker_enabled or automationbroker_info is found.

cluster version:4.3.0-0.nightly-2019-12-12-004325
asb commit.id: 489b0fa7201136510e6145f74fb133ba50c8a809

$ oc get clusterservicebroker
NAME                     URL                                                          STATUS   AGE
ansible-service-broker   https://asb.openshift-ansible-service-broker.svc:1338/osb/   Ready    32m

$ oc -n openshift-ansible-service-broker get ep
NAME                                                ENDPOINTS                           AGE
asb                                                 10.131.0.22:1338,10.131.0.22:1337   4m21s
openshift-ansible-service-broker-operator-metrics   10.131.0.12:8383,10.131.0.12:8686   20m


$ token=`oc -n openshift-monitoring sa get-token prometheus-k8s`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-1  -- curl -k -H "Authorization: Bearer $token" 'https://10.131.0.22:1338/metrics' | grep ansible-service-broker
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 11864  100 11864    0     0  92863      0 --:--:-- --:--:-- --:--:-- 93417
http_request_duration_microseconds{handler="ansible-service-broker",quantile="0.5"} 8810.245
http_request_duration_microseconds{handler="ansible-service-broker",quantile="0.9"} 16059.347
http_request_duration_microseconds{handler="ansible-service-broker",quantile="0.99"} 16059.347
http_request_duration_microseconds_sum{handler="ansible-service-broker"} 33130.590000000004
http_request_duration_microseconds_count{handler="ansible-service-broker"} 3
http_request_size_bytes{handler="ansible-service-broker",quantile="0.5"} 142
http_request_size_bytes{handler="ansible-service-broker",quantile="0.9"} 142
http_request_size_bytes{handler="ansible-service-broker",quantile="0.99"} 142
http_request_size_bytes_sum{handler="ansible-service-broker"} 426
http_request_size_bytes_count{handler="ansible-service-broker"} 3
http_requests_total{code="200",handler="ansible-service-broker",method="get"} 3
http_response_size_bytes{handler="ansible-service-broker",quantile="0.5"} 33010
http_response_size_bytes{handler="ansible-service-broker",quantile="0.9"} 33010
http_response_size_bytes{handler="ansible-service-broker",quantile="0.99"} 33010
http_response_size_bytes_sum{handler="ansible-service-broker"} 99030
http_response_size_bytes_count{handler="ansible-service-broker"} 3
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-1  -- curl -k -H "Authorization: Bearer $token" 'https://10.131.0.22:1338/metrics' | grep automationbroker_info
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 11853  100 11853    0     0  99169      0 --:--:-- --:--:-- --:--:-- 99605

$ oc get csv -n openshift-ansible-service-broker
NAME                                               DISPLAY                                     VERSION              REPLACES   PHASE
openshiftansibleservicebroker.4.3.0-201912121917   OpenShift Ansible Service Broker Operator   4.3.0-201912121917              Succeeded

Comment 9 Jesus M. Rodriguez 2019-12-13 14:03:13 UTC
You need to grep the 8383/8686 endpoints to see whether they expose automationbroker_info. Port 1338 serves the metrics that the broker itself outputs. The 8383/8686 endpoints serve the metrics that the operator publishes, which is what the alert looks at.

What really confuses me is why the alert still isn't firing.
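
A minimal sketch of that check, under some assumptions: the helper names has_metric and scrape are made up, the pod name prometheus-k8s-1 is taken from the commands earlier in this bug, and the operator address 10.131.0.12 comes from the endpoint listing in comment 8. Whether the operator ports answer over plain HTTP is also an assumption here.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if stdin (Prometheus text format) contains
# at least one sample of the metric named in $1. Comment lines start with
# "#", so HELP/TYPE lines never match.
has_metric() {
  grep -Eq "^$1[{ ]"
}

# Hypothetical helper: fetches one metrics URL from inside the Prometheus pod.
# $1 is a full URL, for example http://10.131.0.12:8686/metrics
scrape() {
  oc -n openshift-monitoring exec -c prometheus prometheus-k8s-1 -- curl -s "$1"
}
```

For example, `scrape 'http://10.131.0.12:8686/metrics' | has_metric automationbroker_info && echo found` would check the operator endpoint rather than the broker's port 1338.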

Comment 11 Cuiping HUO 2019-12-17 06:46:57 UTC
Verification is blocked by bug 1783829.

Comment 12 Cuiping HUO 2019-12-19 03:14:36 UTC
Verified.
asb packagemanifest tag: 4.3.0-201912171717
1. Alert: AnsibleServiceBrokerEnabled is firing.
2. The automationbroker_info metric is shown as designed.


$ oc -n openshift-ansible-service-broker get ep
NAME                                                ENDPOINTS                           AGE
asb                                                 10.128.2.49:1338,10.128.2.49:1337   46m
openshift-ansible-service-broker-operator-metrics   10.128.2.47:8383,10.128.2.47:8686   44h

$ oc -n openshift-monitoring exec prometheus-k8s-1 -c prometheus -- curl 'http://10.128.2.47:8686/metrics'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
# HELP automationbroker_info Information about the AutomationBroker custom resource.
# TYPE automationbroker_info gauge
automationbroker_info{namespace="openshift-ansible-service-broker",automationbroker="ansible-service-broker"} 1
100   232  100   232    0     0  35365      0 --:--:-- --:--:-- --:--:-- 38666

Comment 14 errata-xmlrpc 2020-01-23 11:14:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062