Bug 1746097

Summary: configmap of trusted-ca-bundle for alertmanager fails to be mounted in alertmanager pod
Product: OpenShift Container Platform
Reporter: Martin Bukatovic <mbukatov>
Component: Monitoring
Assignee: Lili Cosic <lcosic>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: medium
Priority: unspecified
Version: 4.2.0
CC: alegrand, anpicker, erooth, juzhao, kakkoyun, lcosic, mloibl, pkrupa, surbania
Target Release: 4.2.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-10-16 06:37:54 UTC
Type: Bug

Description Martin Bukatovic 2019-08-27 16:14:29 UTC
Description of problem
======================

There are events in the openshift-monitoring namespace that complain about a
missing alertmanager-trusted-ca-bundle-XXXXXX config map, but when one checks,
the configmap seems to exist.

Version-Release number of selected component
============================================

cluster channel: stable-4.2
cluster version: 4.2.0-0.nightly-2019-08-26-235330
cluster image: registry.svc.ci.openshift.org/ocp/release@sha256:4b1f127d3d13e63ec0210568bc5aada642d1a97e3dfebd4b534257657011acce

namespace openshift-cluster-storage-operator
image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:135c4846c99f3da1f3e3e9c17ad37135efdd9d1bc3fa61231f1c41e10b8c2172
 * quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:135c4846c99f3da1f3e3e9c17ad37135efdd9d1bc3fa61231f1c41e10b8c2172

namespace openshift-storage
image quay.io/cephcsi/cephcsi:canary
 * quay.io/cephcsi/cephcsi@sha256:65bda97c05d01dd6bcb76c93e61cf0f0972b7e130406692143a6c18c7e9c00fa
image quay.io/k8scsi/csi-node-driver-registrar:v1.1.0
 * quay.io/k8scsi/csi-node-driver-registrar@sha256:13daf82fb99e951a4bff8ae5fc7c17c3a8fe7130be6400990d8f6076c32d4599
image quay.io/k8scsi/csi-attacher:v1.2.0
 * quay.io/k8scsi/csi-attacher@sha256:26fccd7a99d973845df1193b46ebdcc6ab8dc5f6e6be319750c471fce1742d13
image quay.io/k8scsi/csi-provisioner:v1.3.0
 * quay.io/k8scsi/csi-provisioner@sha256:e615e92233248e72f046dd4f5fac40e75dd49f78805801953a7dfccf4eb09148
image quay.io/k8scsi/csi-snapshotter:v1.2.0
 * quay.io/k8scsi/csi-snapshotter@sha256:6f12a57ef46c340c475489cac8d63c2431033961deaf40414208edebee50b640
image docker.io/ceph/ceph:v14.2.2-20190722
 * docker.io/ceph/ceph@sha256:567fe78d90a63ead11deadc2cbf5a912e42bfcc6ef4b1d6154f4b4fea4019052
image docker.io/rook/ceph:master
 * docker.io/rook/ceph@sha256:16feb1c77281e9eee66cdd3ee78e1b7642283e0f3537322873bf2cf5744b7517

How reproducible
================

1/1

Steps to Reproduce
==================

1. Install OCP/OCS cluster (I did this via red-hat-storage/ocs-ci, using
   upstream OCS images and with monitoring enabled, ocs-ci commit b304d0a)
2. List events in openshift-monitoring namespace

Actual results
==============

There are events reporting a mount failure for a configmap that supposedly doesn't exist:

```
$ oc get events -n openshift-monitoring
LAST SEEN   TYPE      REASON        OBJECT                    MESSAGE
77m         Warning   FailedMount   pod/alertmanager-main-0   MountVolume.SetUp failed for volume "configmap-alertmanager-trusted-ca-bundle-cquddrmb6dfoh" : configmaps "alertmanager-trusted-ca-bundle-cquddrmb6dfoh" not found
4m49s       Warning   FailedMount   pod/alertmanager-main-1   MountVolume.SetUp failed for volume "configmap-alertmanager-trusted-ca-bundle-cquddrmb6dfoh" : configmaps "alertmanager-trusted-ca-bundle-cquddrmb6dfoh" not found
84m         Warning   FailedMount   pod/alertmanager-main-2   MountVolume.SetUp failed for volume "configmap-alertmanager-trusted-ca-bundle-cquddrmb6dfoh" : configmaps "alertmanager-trusted-ca-bundle-cquddrmb6dfoh" not found
```

But when I query for the configmap that was reported as not found, I see it
without any problems:

```
$ oc get configmap/alertmanager-trusted-ca-bundle-cquddrmb6dfoh -n openshift-monitoring
NAME                                           DATA   AGE
alertmanager-trusted-ca-bundle-cquddrmb6dfoh   1      7s
```
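
To cross-check such events against the objects that actually exist, the names reported as missing can be pulled out of the event messages. The following is a minimal sketch, assuming the kubelet message format shown in the events above; the `missing_objects` helper and the `sample` text are illustrative only and not part of any OpenShift tooling.

```python
import re

# Matches the tail of the FailedMount message seen in the events above,
# e.g.: configmaps "alertmanager-trusted-ca-bundle-cquddrmb6dfoh" not found
NOT_FOUND = re.compile(r'(configmaps|secrets) "([^"]+)" not found')


def missing_objects(events_text):
    """Return a sorted list of unique (kind, name) pairs reported as not found."""
    found = set()
    for line in events_text.splitlines():
        match = NOT_FOUND.search(line)
        if match:
            found.add((match.group(1), match.group(2)))
    return sorted(found)


# Two event lines copied from the `oc get events` output above.
sample = '''
77m    Warning   FailedMount   pod/alertmanager-main-0   MountVolume.SetUp failed for volume "configmap-alertmanager-trusted-ca-bundle-cquddrmb6dfoh" : configmaps "alertmanager-trusted-ca-bundle-cquddrmb6dfoh" not found
4m49s  Warning   FailedMount   pod/alertmanager-main-1   MountVolume.SetUp failed for volume "configmap-alertmanager-trusted-ca-bundle-cquddrmb6dfoh" : configmaps "alertmanager-trusted-ca-bundle-cquddrmb6dfoh" not found
'''

for kind, name in missing_objects(sample):
    print(kind, name)
```

Each resulting name can then be checked by hand with `oc get configmap/<name> -n openshift-monitoring`, as done above.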

Expected results
================

Either there is no such event, or the event reports the problem in a more
specific way that does not contradict what one sees when querying the
configmap directly.

Additional info
===============

All pods in the openshift-monitoring namespace seem to be running:

```
$ oc get pods -n openshift-monitoring
NAME                                           READY   STATUS    RESTARTS   AGE
alertmanager-main-0                            3/3     Running   0          8h
alertmanager-main-1                            3/3     Running   0          8h
alertmanager-main-2                            3/3     Running   0          8h
cluster-monitoring-operator-775b45bc8b-sx4h7   1/1     Running   0          8h
grafana-867bfddd4d-bsj2g                       2/2     Running   0          8h
kube-state-metrics-7f4cdccd7c-4vlt2            3/3     Running   0          8h
node-exporter-hbvxl                            2/2     Running   0          8h
node-exporter-hs2q8                            2/2     Running   0          8h
node-exporter-jf6wd                            2/2     Running   0          8h
node-exporter-kxp88                            2/2     Running   0          8h
node-exporter-n89lb                            2/2     Running   0          8h
node-exporter-p9bh7                            2/2     Running   0          8h
openshift-state-metrics-6d66db6574-l8s7j       3/3     Running   0          8h
prometheus-adapter-bf745d6cd-25jlm             1/1     Running   0          8h
prometheus-adapter-bf745d6cd-jjx8x             1/1     Running   0          8h
prometheus-k8s-0                               6/6     Running   1          8h
prometheus-k8s-1                               6/6     Running   1          8h
prometheus-operator-6d5b8887d6-xtxgm           1/1     Running   0          8h
telemeter-client-678957d86-6wjk4               3/3     Running   0          8h
```

There is no obvious error with monitoring at first sight: the watchdog alert is
firing and there are 115 alerts that are not firing.
That said, I haven't tested particular monitoring features in more detail.

The alertmanager-main web interface is reachable (as listed e.g. in `oc get routes -n openshift-monitoring`),
but it asks me for kubeadmin credentials again, even when I choose to log in with OpenShift.

Comment 4 Martin Bukatovic 2019-08-28 17:15:25 UTC
With 4.2.0-0.nightly-2019-08-28-083236 build, I noticed a similar problem with openshift-storage, and since this was triaged and fixed in cluster-monitoring-operator, I reported a new bug for openshift-storage: BZ 1746536 

Would it make sense to check whether this is the same kind of issue in a different operator, or whether there is something deeper in OCP to improve? Of course, it's also possible that the two bugs are not related at all.

Comment 5 Junqi Zhao 2019-08-30 08:19:24 UTC
I let the cluster run for a few hours and then checked the events; there are no "missing alertmanager-trusted-ca-bundle-XXXXXX config map" events.
$ oc -n openshift-monitoring get event

payload:  4.2.0-0.nightly-2019-08-29-170426

Comment 7 Lili Cosic 2019-08-30 16:11:27 UTC
Just to confirm: this now only happens, at least for me, when you delete the configmap, correct?

Comment 8 Junqi Zhao 2019-09-03 02:46:01 UTC
(In reply to Lili Cosic from comment #7)
> Just to confirm: this now only happens, at least for me, when you delete the
> configmap, correct?

I did not delete the configmap.

The issue no longer occurs in my fresh environment, but some secrets are now reported as missing:
# oc -n openshift-monitoring get event | grep "not found"
93m         Warning   FailedMount         pod/grafana-787654dccf-ccprz                        MountVolume.SetUp failed for volume "secret-grafana-tls" : secrets "grafana-tls" not found
95m         Warning   FailedMount         pod/node-exporter-4l7xt                             MountVolume.SetUp failed for volume "node-exporter-tls" : secrets "node-exporter-tls" not found
95m         Warning   FailedMount         pod/node-exporter-r454t                             MountVolume.SetUp failed for volume "node-exporter-tls" : secrets "node-exporter-tls" not found
95m         Warning   FailedMount         pod/node-exporter-wzkq4                             MountVolume.SetUp failed for volume "node-exporter-tls" : secrets "node-exporter-tls" not found


# oc -n openshift-monitoring get secrets | grep -e grafana-tls -e node-exporter-tls
grafana-tls                                   kubernetes.io/tls                     2      93m
node-exporter-tls                             kubernetes.io/tls                     2      95m
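
One way to read these listings is to compare each event's LAST SEEN value with the AGE of the object it names: if the event is at least as old as the object, it was emitted no later than the object's creation, which would be consistent with a transient mount failure while the object did not yet exist. This is an inference from the output, not a cause confirmed in this bug. A minimal sketch, assuming the compact duration strings printed by `oc` (such as `93m`, `4m49s`, `7s`, `8h`); the helper names are hypothetical.

```python
import re

# Seconds per unit for the compact durations printed by `oc get`.
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def age_to_seconds(age):
    """Convert a duration such as '4m49s' or '93m' to seconds."""
    parts = re.findall(r"(\d+)([dhms])", age)
    # Reject anything that is not made up entirely of <number><unit> pairs.
    if not parts or "".join(num + unit for num, unit in parts) != age:
        raise ValueError(f"unrecognized duration: {age!r}")
    return sum(int(num) * UNIT_SECONDS[unit] for num, unit in parts)


def event_predates_object(event_last_seen, object_age):
    """True if the event is at least as old as the object (to display precision)."""
    return age_to_seconds(event_last_seen) >= age_to_seconds(object_age)


# Values taken from the listings in this bug report.
print(event_predates_object("93m", "93m"))   # grafana-tls event vs. secret
print(event_predates_object("4m49s", "7s"))  # alertmanager event vs. configmap
```

Because `oc` rounds the displayed durations, equal values only show that the event and the creation happened within about the same minute.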

Comment 9 Junqi Zhao 2019-09-03 02:51:48 UTC
(In reply to Lili Cosic from comment #7)
> Just to confirm: this now only happens, at least for me, when you delete the
> configmap, correct?

I will watch for this on different clusters; if the issue does not appear, I will close this bug.

Comment 12 Junqi Zhao 2019-09-05 11:07:34 UTC
Closing this with the 4.2.0-0.nightly-2019-09-04-142146 build; there are no "missing alertmanager-trusted-ca-bundle-XXXXXX config map" events.

Comment 13 errata-xmlrpc 2019-10-16 06:37:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922