Bug 1679500

Summary: Failed to attach PVs for monitoring
Product: OpenShift Container Platform
Reporter: Junqi Zhao <juzhao>
Component: Monitoring
Assignee: Sergiusz Urbaniak <surbania>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: medium
Docs Contact:
Priority: high
Version: 4.1.0
CC: fan-wxa, fbranczy, hongkliu, juzhao, mloibl, surbania
Target Milestone: ---   
Target Release: 4.1.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-06-04 10:44:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  "x509: certificate signed by unknown authority" for worker nodes (flags: none)
  info for Comment 6 (flags: none)

Description Junqi Zhao 2019-02-21 09:35:06 UTC
Description of problem:
Cloned from https://jira.coreos.com/browse/MON-579
Create a cluster-monitoring-config ConfigMap to attach PVs; its content is shown below.

apiVersion: v1
data:
  config.yaml: |
    prometheusK8s:
      volumeClaimTemplate:
        spec:
          storageClassName: gp2
          resources:
            requests:
              storage: 2Gi
    alertmanagerMain:
      volumeClaimTemplate:
        spec:
          storageClassName: gp2
          resources:
            requests:
              storage: 2Gi
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
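
For reference, a minimal sketch of applying this ConfigMap, assuming the YAML above is saved locally as cluster-monitoring-config.yaml (the file name is only an example):

$ oc apply -f cluster-monitoring-config.yaml
$ oc -n openshift-monitoring get configmap cluster-monitoring-config -o yaml   # verify the ConfigMap exists with the expected config.yaml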

# oc get sc
NAME            PROVISIONER             AGE
gp2 (default)   kubernetes.io/aws-ebs   144m

But the cluster-monitoring-operator pod reports the error "spec.storage.volumeClaimTemplate.metadata.creationTimestamp in body must be of type string: "null""; details:
# oc -n openshift-monitoring logs cluster-monitoring-operator-89d8d78df-rlbpc | grep "openshift-monitoring/cluster-monitoring-config"
E0219 10:56:33.794706       1 operator.go:244] Syncing "openshift-monitoring/cluster-monitoring-config" failed
E0219 10:56:33.794731       1 operator.go:245] sync "openshift-monitoring/cluster-monitoring-config" failed: running task Updating Prometheus-k8s failed: reconciling Prometheus object failed: updating Prometheus object failed: Prometheus.monitoring.coreos.com "k8s" is invalid: []: Invalid value: map[string]interface {}{"apiVersion":"monitoring.coreos.com/v1", "metadata":map[string]interface {}{"name":"k8s", "namespace":"openshift-monitoring", "resourceVersion":"12632", "generation":1, "uid":"5c5e724b-3421-11e9-a787-0ad8c958fe58", "creationTimestamp":"2019-02-19T08:35:49Z", "labels":map[string]interface {}{"prometheus":"k8s"}}, "spec":map[string]interface {}{"serviceMonitorNamespaceSelector":map[string]interface {}{}, "serviceAccountName":"prometheus-k8s", "image":"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ba01869048bf44fc5e8c57f0a34369750ce27e3fb0b5eb47c78f42022640154c", "baseImage":"openshift/prometheus", "secrets":[]interface {}{"prometheus-k8s-tls", "prometheus-k8s-proxy", "prometheus-k8s-htpasswd", "kube-rbac-proxy"}, "resources":map[string]interface {}{}, "nodeSelector":map[string]interface {}{"beta.kubernetes.io/os":"linux"}, "ruleSelector":map[string]interface {}{"matchLabels":map[string]interface {}{"prometheus":"k8s", "role":"alert-rules"}}, "version":"v2.5.0", "storage":map[string]interface {}{"volumeClaimTemplate":map[string]interface {}{"spec":map[string]interface {}{"resources":map[string]interface {}{"requests":map[string]interface {}{"storage":"2Gi"}}, "storageClassName":"gp2", "dataSource":interface {}(nil)}, "status":map[string]interface {}{}, "metadata":map[string]interface {}{"creationTimestamp":interface {}(nil)}}}, "containers":[]interface {}{map[string]interface {}{"volumeMounts":[]interface {}{map[string]interface {}{"name":"secret-prometheus-k8s-tls", "mountPath":"/etc/tls/private"}, map[string]interface {}{"name":"secret-prometheus-k8s-proxy", "mountPath":"/etc/proxy/secrets"}, map[string]interface {}{"name":"secret-prometheus-k8s-htpasswd", "mountPath":"/etc/proxy/htpasswd"}}, "name":"prometheus-proxy", "image":"quay.io/openshift/origin-oauth-proxy:latest", "args":[]interface {}{"-provider=openshift", "-https-address=:9091", "-http-address=", "-email-domain=*", "-upstream=http://localhost:9090", "-htpasswd-file=/etc/proxy/htpasswd/auth", "-openshift-service-account=prometheus-k8s", "-openshift-sar={\"resource\": \"namespaces\", \"verb\": \"get\"}", "-openshift-delegate-urls={\"/\": {\"resource\": \"namespaces\", \"verb\": \"get\"}}", "-tls-cert=/etc/tls/private/tls.crt", "-tls-key=/etc/tls/private/tls.key", "-client-secret-file=/var/run/secrets/kubernetes.io/serviceaccount/token", "-cookie-secret-file=/etc/proxy/secrets/session_secret", "-openshift-ca=/etc/pki/tls/cert.pem", "-openshift-ca=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt", "-skip-auth-regex=^/metrics"}, "ports":[]interface {}{map[string]interface {}{"name":"web", "containerPort":9091}}, "resources":map[string]interface {}{}}, map[string]interface {}{"resources":map[string]interface {}{}, "volumeMounts":[]interface {}{map[string]interface {}{"name":"secret-prometheus-k8s-tls", "mountPath":"/etc/tls/private"}, map[string]interface {}{"name":"secret-kube-rbac-proxy", "mountPath":"/etc/kube-rbac-proxy"}}, "name":"kube-rbac-proxy", "image":"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:451274b24916b97e5ba2116dd0775cdb7e1de98d034ac8874b81c1a3b22cf6b1", "args":[]interface {}{"--secure-listen-address=0.0.0.0:9092", "--upstream=http://127.0.0.1:9095", 
"--config-file=/etc/kube-rbac-proxy/config.yaml", "--tls-cert-file=/etc/tls/private/tls.crt", "--tls-private-key-file=/etc/tls/private/tls.key", "--tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256", "--logtostderr=true", "--v=10"}, "ports":[]interface {}{map[string]interface {}{"name":"tenancy", "containerPort":9092}}}, map[string]interface {}{"image":"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8675adb4a2a367c9205e3879b986da69400b9187df7ac3f3fbf9882e6a356252", "args":[]interface {}{"--insecure-listen-address=127.0.0.1:9095", "--upstream=http://127.0.0.1:9090", "--label=namespace"}, "resources":map[string]interface {}{}, "name":"prom-label-proxy"}}, "affinity":map[string]interface {}{"podAntiAffinity":map[string]interface {}{"preferredDuringSchedulingIgnoredDuringExecution":[]interface {}{map[string]interface {}{"weight":100, "podAffinityTerm":map[string]interface {}{"labelSelector":map[string]interface {}{"matchExpressions":[]interface {}{map[string]interface {}{"values":[]interface {}{"k8s"}, "key":"prometheus", "operator":"In"}}}, "namespaces":[]interface {}{"openshift-monitoring"}, "topologyKey":"kubernetes.io/hostname"}}}}}, "securityContext":map[string]interface {}{}, "replicas":2, "listenLocal":true, "serviceMonitorSelector":map[string]interface {}{}, "retention":"15d", "alerting":map[string]interface {}{"alertmanagers":[]interface {}{map[string]interface {}{"namespace":"openshift-monitoring", "name":"alertmanager-main", "port":"web", "scheme":"https", "tlsConfig":map[string]interface {}{"caFile":"/etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt", "serverName":"alertmanager-main.openshift-monitoring.svc"}, "bearerTokenFile":"/var/run/secrets/kubernetes.io/serviceaccount/token"}}}, "externalUrl":"https://prometheus-k8s-openshift-monitoring.apps.qe-juzhao2.qe.devcluster.openshift.com/", "configMaps":[]interface {}{"serving-certs-ca-bundle", "csr-controller-ca-bundle"}}, "kind":"Prometheus"}: validation failure list:
spec.storage.volumeClaimTemplate.metadata.creationTimestamp in body must be of type string: "null"
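
A couple of commands that can help when debugging this (a sketch; the resource name k8s is taken from the error above):

$ oc -n openshift-monitoring get prometheus k8s -o jsonpath='{.spec.storage}'   # check whether the storage spec was accepted on the Prometheus object
$ oc -n openshift-monitoring get pvc                                            # list PVCs in the namespace; none are expected while the update is rejected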

Version-Release number of selected component (if applicable):
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-20-194410   True        False         58m     Cluster version is 4.0.0-0.nightly-2019-02-20-194410

configmap-reloader: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:037fa98f23ff812b6861675127d52eea43caa44bb138e7fe41c7199cb8d4d634
prometheus-config-reloader: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0b88f4c0bfc31f15d368619b951b9020853686ce46d36692f62ef437d83b1012
kube-state-metrics: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:36f168dc7fc6ada9af0f2eeb88f394f2e7311340acc25f801830fe509fd93911
prometheus-node-exporter: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:42be8e58f00a54b4f4cbf849203a139c93bebde8cc40e5be84305246be620350
prometheus-alertmanager: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:455855037348f33f9810f7531d52e86450e5c75d9d06531d144abc5ac53c6786
kube-rbac-proxy: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4d229dee301eb7452227fefc2704b30cf58e7a7f85e0c66dd3798b6b64b79728
prometheus-operator: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:50de7804ddd623f1b4e0f57157ce01102db7e68179c5744bac4e92c81714a881
cluster-monitoring-operator: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:534a71a355e3b9c79ef5a192a200730b8641f5e266abe290b6f7c6342210d8a0
telemeter: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9021d3e9ce028fc72301f8e0a40c37e488db658e1500a790c794bfd38903bef1
prom-label-proxy: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:90a29a928beffc938345760f88b6890dccdc6f1a6503f09fea7399469a6ca72a
prometheus: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ba51ac66b4c3a46d5445bdfa32f1f04b882498fe5405d88dc78a956742657105
k8s-prometheus-adapter: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ee79721af3078dfbcfaa75e9a47da1526464cf6685a7f4195ea214c840b59e9f
grafana: quay.io/openshift/origin-grafana:latest
oauth-proxy: quay.io/openshift/origin-oauth-proxy:latest

How reproducible:
Always

Steps to Reproduce:
1. Create the cluster-monitoring-config ConfigMap as shown in the Description.

Actual results:
Failed to attach PVs for monitoring

Expected results:
Be able to attach PVs for monitoring

Additional info:

Comment 3 Junqi Zhao 2019-02-26 07:41:03 UTC
PVs could be attached, but the fix brings another problem: kubelet on the worker nodes can no longer be scraped; "x509: certificate signed by unknown authority" is reported for the 10250/metrics/cadvisor and 10250/metrics targets on the worker nodes.

As shown below, the alertmanager-main and prometheus-k8s pods are recreated after attaching the PVs and are scheduled onto worker nodes.
$ oc -n openshift-monitoring get pvc
NAME                                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
alertmanager-main-db-alertmanager-main-0   Bound    pvc-f509119d-398e-11e9-8827-0e9060eacf7c   2Gi        RWO            gp2            62m
alertmanager-main-db-alertmanager-main-1   Bound    pvc-05d73872-398f-11e9-8827-0e9060eacf7c   2Gi        RWO            gp2            62m
alertmanager-main-db-alertmanager-main-2   Bound    pvc-166daa08-398f-11e9-8827-0e9060eacf7c   2Gi        RWO            gp2            61m
prometheus-k8s-db-prometheus-k8s-0         Bound    pvc-d16161d8-398e-11e9-8827-0e9060eacf7c   4Gi        RWO            gp2            63m
prometheus-k8s-db-prometheus-k8s-1         Bound    pvc-d16637ec-398e-11e9-8827-0e9060eacf7c   4Gi        RWO            gp2            63m

$ oc -n openshift-monitoring get pod -o wide | grep -e alertmanager-main -e prometheus-k8s
alertmanager-main-0                            3/3     Running   0          116m    10.129.2.34    ip-10-0-174-68.us-east-2.compute.internal    <none>
alertmanager-main-1                            3/3     Running   0          115m    10.128.2.11    ip-10-0-143-223.us-east-2.compute.internal   <none>
alertmanager-main-2                            3/3     Running   0          115m    10.131.0.93    ip-10-0-146-225.us-east-2.compute.internal   <none>
prometheus-k8s-0                               6/6     Running   1          117m    10.128.2.10    ip-10-0-143-223.us-east-2.compute.internal   <none>
prometheus-k8s-1                               6/6     Running   1          117m    10.131.0.92    ip-10-0-146-225.us-east-2.compute.internal   <none>

$ oc get node -o wide | grep worker | awk '{print $1"   "$3"   "$6}'
ip-10-0-143-223.us-east-2.compute.internal   worker   10.0.143.223
ip-10-0-146-225.us-east-2.compute.internal   worker   10.0.146.225
ip-10-0-174-68.us-east-2.compute.internal   worker   10.0.174.68

As shown in the attached picture, "x509: certificate signed by unknown authority" is reported for all the worker nodes.
BTW, because Bug 1678645 is not fixed yet, the following was used to check the targets:
$ prometheus_route=$(oc -n openshift-monitoring get route | grep prometheus-k8s | awk '{print $2}');curl -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://${prometheus_route}/targets > page_targets.html


$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-25-194625   True        False         6h32m   Cluster version is 4.0.0-0.nightly-2019-02-25-194625

RHCOS build: 47.330

Comment 4 Junqi Zhao 2019-02-26 07:42:26 UTC
Created attachment 1538705 [details]
"x509: certificate signed by unknown authority" for worker nodes

Comment 5 Junqi Zhao 2019-02-26 07:46:43 UTC
Adding info for Comment 3: all targets were UP before attaching PVs for monitoring; there was no "x509: certificate signed by unknown authority" error for the 10250/metrics/cadvisor and 10250/metrics targets on the worker nodes.

Comment 6 Frederic Branczyk 2019-02-26 10:25:29 UTC
You should be able to `kubectl port-forward` to the Prometheus pod just fine for testing :) . Looking at the attachment, I find it striking that this only applies to compute nodes. Did this maybe resolve itself after a few minutes? We may just need to wait for the kubelet serving certs CA to be (re-)mounted. Could you share the Prometheus StatefulSet as well as the content of the "openshift-monitoring/kubelet-serving-ca-bundle" and "openshift-config-managed/kubelet-serving-ca" ConfigMaps? Thanks!
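
One possible way to gather the information requested above (a sketch; the pod name and port assume the defaults shown elsewhere in this bug):

$ oc -n openshift-monitoring port-forward prometheus-k8s-0 9090:9090            # then check targets at http://localhost:9090/targets
$ oc -n openshift-monitoring get statefulset prometheus-k8s -o yaml
$ oc -n openshift-monitoring get configmap kubelet-serving-ca-bundle -o yaml
$ oc -n openshift-config-managed get configmap kubelet-serving-ca -o yaml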

Comment 7 Frederic Branczyk 2019-02-26 11:08:13 UTC
For what it's worth, I just tested the exact same thing on an origin cluster and was not able to reproduce. I feel like what you saw is unrelated to this bug.

Comment 9 Junqi Zhao 2019-02-26 12:41:20 UTC
Created attachment 1538804 [details]
info for Comment 6

Comment 10 Junqi Zhao 2019-02-26 12:55:47 UTC
BTW: the PVs are already attached to the pods, for example:
    volumes:
    - name: alertmanager-main-db
      persistentVolumeClaim:
        claimName: alertmanager-main-db-alertmanager-main-0
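
(A snippet like the one above can be pulled with something along these lines; the pod name is only an example:)

$ oc -n openshift-monitoring get pod alertmanager-main-0 -o yaml | grep -B 2 claimName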

Comment 18 Frederic Branczyk 2019-02-28 08:39:20 UTC
I agree having Kubernetes apply defaults (phase: Pending in status and the creationTimestamp default) is a bit of a distraction, but the functionality works as expected. Should these beauty marks be an issue, please file an RFE that we can schedule for later improvement. The TLS issue is distinct from using/provisioning persistence, and is being tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1683913. Due to all of these facts, I'm moving this concrete issue to modified.
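
For illustration, the defaulted fields in question look roughly like this in the Prometheus object's storage section (reconstructed from the error output in the Description; the creationTimestamp: null and the empty status are the cosmetic defaults referred to here):

    storage:
      volumeClaimTemplate:
        metadata:
          creationTimestamp: null
        spec:
          storageClassName: gp2
          resources:
            requests:
              storage: 2Gi
        status: {}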

Comment 19 Junqi Zhao 2019-02-28 09:21:21 UTC
(In reply to Frederic Branczyk from comment #18)
> I agree having Kubernetes apply defaults (phase: Pending in status and
> creationTimestamp default) is a bit of a distraction, but the functionality
> works as expected. Should these beauty marks be an issue please file an RFE
> that we can schedule for later improvement. The TLS issue is distinct from
> using/provisioning persistence, and is being tracked in
> https://bugzilla.redhat.com/show_bug.cgi?id=1683913. Due to all of these
> facts, I'm moving this concrete issue to modified.

Agree, will verify this bug

Comment 20 Junqi Zhao 2019-03-01 03:47:55 UTC
For the RFE mentioned in Comment 18, please see bug 1684352.

Since PVs can now be attached for monitoring, closing this issue.


$ for i in $(oc -n openshift-monitoring get pod | grep -e alertmanager-main -e prometheus-k8s | grep -v NAME |awk '{print $1}'); do echo $i; oc -n openshift-monitoring get po $i -oyaml | grep -i claim;done
alertmanager-main-0
    persistentVolumeClaim:
      claimName: alertmanager-main-db-alertmanager-main-0
alertmanager-main-1
    persistentVolumeClaim:
      claimName: alertmanager-main-db-alertmanager-main-1
alertmanager-main-2
    persistentVolumeClaim:
      claimName: alertmanager-main-db-alertmanager-main-2
prometheus-k8s-0
    persistentVolumeClaim:
      claimName: prometheus-k8s-db-prometheus-k8s-0
prometheus-k8s-1
    persistentVolumeClaim:
      claimName: prometheus-k8s-db-prometheus-k8s-1


$ oc -n openshift-monitoring get pvc
NAME                                       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
alertmanager-main-db-alertmanager-main-0   Bound    pvc-2adefe29-3bd1-11e9-8b6c-0ac2ab4d1ff2   2Gi        RWO            gp2            25m
alertmanager-main-db-alertmanager-main-1   Bound    pvc-3bb52eb9-3bd1-11e9-8b6c-0ac2ab4d1ff2   2Gi        RWO            gp2            24m
alertmanager-main-db-alertmanager-main-2   Bound    pvc-4c427826-3bd1-11e9-8b6c-0ac2ab4d1ff2   2Gi        RWO            gp2            24m
prometheus-k8s-db-prometheus-k8s-0         Bound    pvc-3208ccb8-3bd1-11e9-8b6c-0ac2ab4d1ff2   4Gi        RWO            gp2            24m
prometheus-k8s-db-prometheus-k8s-1         Bound    pvc-3214392e-3bd1-11e9-8b6c-0ac2ab4d1ff2   4Gi        RWO            gp2            24m

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-27-213933   True        False         80m     Cluster version is 4.0.0-0.nightly-2019-02-27-213933

Comment 23 errata-xmlrpc 2019-06-04 10:44:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

Comment 24 Sergiusz Urbaniak 2020-02-13 10:23:34 UTC
*** Bug 1801023 has been marked as a duplicate of this bug. ***

Comment 25 Sergiusz Urbaniak 2020-02-18 12:16:48 UTC
*** Bug 1801023 has been marked as a duplicate of this bug. ***