Description of problem:

After upgrading from 3.5 to 3.6.1, the metrics pods do not come back up after running the metrics playbook.

MountVolume.SetUp failed for volume "kubernetes.io/secret/a91e8025-aebd-11e7-a9ca-005056b50758-heapster-certs" (spec.Name: "heapster-certs") pod "a91e8025-aebd-11e7-a9ca-005056b50758" (UID: "a91e8025-aebd-11e7-a9ca-005056b50758") with: secrets "heapster-certs" not found

Unable to mount volumes for pod "heapster-5m4dj_openshift-infra(a91e8025-aebd-11e7-a9ca-005056b50758)": timeout expired waiting for volumes to attach/mount for pod "openshift-infra"/"heapster-5m4dj". list of unattached/unmounted volumes=[heapster-certs]

[root@home]# oc describe svc heapster
Name:              heapster
Namespace:         openshift-infra
Labels:            metrics-infra=heapster
                   name=heapster
Annotations:       service.alpha.openshift.io/serving-cert-secret-name=heapster-certs
Selector:          name=heapster
Type:              ClusterIP
IP:                172.30.36.101
Port:              <unset> 80/TCP
Endpoints:         <none>
Session Affinity:  None
Events:            <none>

Cassandra and Metrics are up, but heapster will not start.

NAME                           READY   STATUS              RESTARTS   AGE
po/hawkular-cassandra-1-zmdkb  1/1     Running             0          2h
po/hawkular-metrics-s7f1g      1/1     Running             0          2h
po/heapster-clb0b              0/1     ContainerCreating   0          12h
po/recycler-for-pvnfs0003      0/1     Terminating         0          18h

There is nothing in the logs related to heapster-certs. The only potentially interesting thing on the master is:

Oct 11 15:47:18 atomic-openshift-master-api[49110]: W1011 15:47:18.089292 49110 watcher.go:337] Fast watcher, slow processing. Number of buffered events: 100.Probably caused by slow decoding, user not receiving fast, or other processing logic
Oct 11 15:47:18 atomic-openshift-master-api[49110]: W1011 15:47:18.092346 49110 watcher.go:337] Fast watcher, slow processing. Number of buffered events: 100.Probably caused by slow decoding, user not receiving fast, or other processing logic
Oct 11 15:47:18 atomic-openshift-master-api[49110]: W1011 15:47:18.095631 49110 watcher.go:337] Fast watcher, slow processing. Number of buffered events: 100.Probably caused by slow decoding, user not receiving fast, or other processing logic
Oct 11 15:47:18 atomic-openshift-master-api[49110]: W1011 15:47:18.097806 49110 watcher.go:217] Fast watcher, slow processing. Number of buffered events: 100.Probably caused by slow dispatching events to watchers

Metrics were subsequently fully uninstalled and reinstalled, with the same failure.

Version-Release number of selected component (if applicable):
3.6.1

How reproducible:
Has happened in 2 of 2 environments.

Steps to Reproduce:
1. Upgrade ETCD
2. Upgrade Masters
3. Upgrade NODES
4. Migrate Etcd2->3
5. Enable Etcd data encryption
6. Upgrade registry
7. Upgrade router
8. Upgrade metrics

Actual results:
Metrics fails

Expected results:
Metrics upgrades

Additional info:
Which OpenShift Ansible version are they using? If they are installing metrics onto OCP 3.6.1, they need to use a 3.6.x version of OpenShift Ansible. I suspect that is the problem here.
Looks like this was a documentation issue (or rather, lack thereof). I'm closing this issue as NOTABUG, but if you think there's something we could do here, feel free to reopen.
*** Bug 1500946 has been marked as a duplicate of this bug. ***
Same problem in v3.6.173.0.21. Nothing in the docs covers this particular issue. The heapster rc has a volume that mounts the secret `heapster-certs`, but the playbook does not create such a secret in `openshift-infra`.
If you're running into this, here is the solution:

Check the master controller log for a message that the service-serving-cert controller was skipped. You may need to restart the controllers if the log has rolled over:

Oct 12 10:23:27 atomic-openshift-master-controllers: I1012 10:23:27.804711 49015 start_master.go:773] Starting "openshift.io/service-serving-cert"
Oct 12 10:23:27 atomic-openshift-master-controllers: W1012 10:23:27.804716 49015 start_master.go:780] Skipping "openshift.io/service-serving-cert"

The root cause is a missing serviceServingCert signer configuration in master-config.yaml. Check whether this exists on the masters:

controllerConfig:
  serviceServingCert:
    signer:
      certFile: service-signer.crt
      keyFile: service-signer.key

You can generate the certificate and key with:

oc adm ca create-signer-cert --cert=service-signer.crt --key=service-signer.key --name=openshift-service-serving-signer --serial=service-signer.serial.txt

Copy the crt and key into /etc/origin/master/, add the above config to master-config.yaml, and restart the masters. Service certs should start working once you recreate the service.

Typically this would have been created during an OpenShift upgrade via:
https://github.com/openshift/openshift-ansible/blob/release-3.6/playbooks/common/openshift-cluster/upgrades/create_service_signer_cert.yml

But if you do manual upgrades, this would not have been done.
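The log check described above can be scripted. The following is a minimal sketch in Python, assuming the controller log has already been captured as text (for example from the journal of the atomic-openshift-master-controllers unit). The function name `skipped_controllers` is hypothetical, and the regular expression is based only on the two log lines quoted in this comment, so other log formats may need a different pattern.

```python
import re

# Pattern modelled on the 'Skipping "..."' warning quoted in this bug.
# Assumption: the message always comes from start_master.go in this form.
SKIP_RE = re.compile(r'start_master\.go:\d+\]\s+Skipping\s+"([^"]+)"')


def skipped_controllers(log_text):
    """Return the controller names reported as skipped, in log order."""
    return SKIP_RE.findall(log_text)


sample = (
    'Oct 12 10:23:27 atomic-openshift-master-controllers: I1012 10:23:27.804711 '
    '49015 start_master.go:773] Starting "openshift.io/service-serving-cert"\n'
    'Oct 12 10:23:27 atomic-openshift-master-controllers: W1012 10:23:27.804716 '
    '49015 start_master.go:780] Skipping "openshift.io/service-serving-cert"\n'
)
print(skipped_controllers(sample))
```

If `openshift.io/service-serving-cert` shows up in the returned list, the signer configuration is the first thing to check.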
Raised a docs issue here:
https://bugzilla.redhat.com/show_bug.cgi?id=1501994

And oc adm diagnostics / master startup will be enhanced to better WARN about this:
https://github.com/openshift/origin/pull/16863

The missing secret is not created by the playbook. It is generated for the heapster service by the service-serving-cert controller, triggered by the special annotation 'service.alpha.openshift.io/serving-cert-secret-name':

[root@osemaster1 etcd]# oc describe svc heapster
Name:         heapster
Namespace:    openshift-infra
Labels:       metrics-infra=heapster
              name=heapster
Annotations:  kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"Service","metadata":{"annotations":{"service.alpha.openshift.io/serving-cert-secret-name":"heapster-certs"},"labels":{"metri...
              service.alpha.openshift.io/serving-cert-secret-name=heapster-certs
              service.alpha.openshift.io/serving-cert-signed-by=openshift-service-serving-signer@1498853387
Selector:     name=heapster
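To compare a broken and a working service, the annotations can be read from the JSON form of the service (for example the output of `oc get svc heapster -n openshift-infra -o json`). The sketch below is illustrative only: the function name `serving_cert_status` is hypothetical, and treating a missing `serving-cert-signed-by` annotation as "not yet signed" is an assumption drawn from the two `oc describe` outputs in this bug, where that annotation appears only on the working cluster.

```python
import json

# Annotation keys as they appear in the `oc describe svc heapster` output.
SECRET_NAME_ANNOTATION = "service.alpha.openshift.io/serving-cert-secret-name"
SIGNED_BY_ANNOTATION = "service.alpha.openshift.io/serving-cert-signed-by"


def serving_cert_status(service_json):
    """Summarize the serving-cert annotations of a Service given as JSON text."""
    metadata = json.loads(service_json).get("metadata") or {}
    annotations = metadata.get("annotations") or {}
    return {
        # Name of the secret the service asks for, or None if not requested.
        "secret_name": annotations.get(SECRET_NAME_ANNOTATION),
        # Assumption: the signed-by annotation is only present once a
        # certificate has been issued for the service.
        "signed": SIGNED_BY_ANNOTATION in annotations,
    }


unsigned = json.dumps(
    {"metadata": {"annotations": {SECRET_NAME_ANNOTATION: "heapster-certs"}}}
)
print(serving_cert_status(unsigned))
```

A result with a secret name but `signed` set to false matches the failing state reported in the description.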
Thank you for those details. I am in fact missing `serviceServingCert`. I do use the playbook for upgrades and have upgraded this to each point release since 3.0, so perhaps there is a corner case caused by cruft. It appears that those tasks were skipped by conditional evaluation in the playbook. I have a case open 01948010 and will update details there when I investigate further.

From upgrade log:

```
[root@ose-prod-master-01 3.6]# grep -B1 -A3 create_service_signer_cert.yml 20171008-1736-ansible-upgrade.log
TASK [Create local temp directory for syncing certs] ******************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/create_service_signer_cert.yml:8
skipping: [localhost] => {
    "changed": false,
    "skip_reason": "Conditional result was False",
--
TASK [Create remote temp directory for creating certs] ****************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/create_service_signer_cert.yml:17
skipping: [ose-prod-master-01.example.com] => {
    "changed": false,
    "skip_reason": "Conditional result was False",
--
TASK [Create service signer certificate] ******************************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/create_service_signer_cert.yml:23
skipping: [ose-prod-master-01.example.com] => {
    "changed": false,
    "skip_reason": "Conditional result was False",
--
TASK [Retrieve service signer certificate] ****************************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/create_service_signer_cert.yml:34
skipping: [ose-prod-master-01.example.com] => (item=service-signer.crt) => {
    "changed": false,
    "item": "service-signer.crt",
--
TASK [Delete remote temp directory] ***********************************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/create_service_signer_cert.yml:46
skipping: [ose-prod-master-01.example.com] => {
    "changed": false,
    "skip_reason": "Conditional result was False",
--
TASK [Delete local temp directory] is listed last below; the deploy task precedes it.
```
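The same check can be done without reading grep context by hand. This is a rough sketch, assuming a verbose Ansible log that contains `TASK [...]`, `task path:` and `skipping:` lines in the layout shown above; the function name `skipped_tasks` is hypothetical and the parser is deliberately simple.

```python
import re

# Matches the task banner lines, e.g. "TASK [Create service signer certificate] ****".
TASK_RE = re.compile(r"^TASK \[(.+?)\]")


def skipped_tasks(log_text, playbook_file):
    """Return names of tasks defined in playbook_file that the log shows as skipped."""
    skipped = []
    task_name = None
    in_playbook = False
    for line in log_text.splitlines():
        line = line.strip()
        match = TASK_RE.match(line)
        if match:
            # A new task banner resets the state until its task path is seen.
            task_name = match.group(1)
            in_playbook = False
        elif line.startswith("task path:"):
            in_playbook = playbook_file in line
        elif line.startswith("skipping:") and in_playbook and task_name:
            # Record each task once, even if it was skipped on several hosts.
            if task_name not in skipped:
                skipped.append(task_name)
    return skipped


sample_log = """TASK [Create service signer certificate] ****
task path: /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/create_service_signer_cert.yml:23
skipping: [ose-prod-master-01.example.com] => {
    "changed": false,
    "skip_reason": "Conditional result was False",
"""
print(skipped_tasks(sample_log, "create_service_signer_cert.yml"))
```

If every task from create_service_signer_cert.yml is in the result, the signer certificate was never generated during that upgrade run.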
See https://github.com/openshift/openshift-ansible/commit/3e5d38caf39d53c917a78542a04ebb6a109e7e6f
Reopening this bug because the customer in the newly attached case is seeing the `start_master.go:780] Skipping "openshift.io/service-serving-cert"` message in the master's logs even though they do have the controllerConfig section in the master config. We even tried recreating the certs, but the message still appears.
And we might've identified a typo or possibly another bug. I will update this bug again after getting confirmation from the customer.
Problem was caused by a typo. Customer had [0] but it should be [1].

[0]
controllerConfig:
  servicesServingCert:
    signer:
      certFile: service-signer.crt
      keyFile: service-signer.key

[1]
controllerConfig:
  serviceServingCert:
    signer:
      certFile: service-signer.crt
      keyFile: service-signer.key

Correcting this fixed the customer's issue. Reclosing this bz.
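Since a misspelled key leaves the controller skipped without any obvious error, a quick check of master-config.yaml can catch this class of mistake. The sketch below is an illustration, not a supported tool: the function names are hypothetical, it uses a line-based scan of block-style YAML rather than a real YAML parser, and the 0.8 similarity cutoff is an arbitrary choice.

```python
import difflib
import re

EXPECTED_KEY = "serviceServingCert"


def controller_config_keys(config_text):
    """Return the keys nested directly under a top-level controllerConfig block."""
    keys = []
    inside = False
    child_indent = None
    for raw in config_text.splitlines():
        if not raw.strip() or raw.lstrip().startswith("#"):
            continue  # ignore blank lines and comments
        indent = len(raw) - len(raw.lstrip())
        if indent == 0:
            # Any top-level key either opens or closes the block of interest.
            inside = raw.startswith("controllerConfig:")
            child_indent = None
            continue
        if not inside:
            continue
        if child_indent is None:
            child_indent = indent  # first nested line sets the child level
        if indent == child_indent:
            match = re.match(r"([A-Za-z0-9_]+):", raw.strip())
            if match:
                keys.append(match.group(1))
    return keys


def check_service_serving_cert(config_text):
    """Return 'ok', 'missing', or 'typo: <key>' for the serviceServingCert key."""
    keys = controller_config_keys(config_text)
    if EXPECTED_KEY in keys:
        return "ok"
    close = difflib.get_close_matches(EXPECTED_KEY, keys, n=1, cutoff=0.8)
    return "typo: " + close[0] if close else "missing"


bad_config = (
    "controllerConfig:\n"
    "  servicesServingCert:\n"
    "    signer:\n"
    "      certFile: service-signer.crt\n"
    "      keyFile: service-signer.key\n"
)
print(check_service_serving_cert(bad_config))
```

Run against the contents of the master config (typically under /etc/origin/master/), this would have flagged configuration [0] as a near miss of the expected key.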