1500981 – Metrics upgrade failing to create heapster-certs - 3.5 to 3.6.1

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Hawkular
Sub Component:
Version:	3.6.1
Hardware:	All
OS:	All
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Target Release:	---
Assignee:	Juraci Paixão Kröhling
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1500946
Depends On:
Blocks:

Reported:	2017-10-11 21:59 UTC by Matthew Robson
Modified:	2021-03-11 15:58 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-11-27 18:47:56 UTC
Target Upstream Version:
Embargoed:

Description Matthew Robson 2017-10-11 21:59:27 UTC

Description of problem:

After upgrading from 3.5 to 3.6.1, the metrics pods do not come back up after running the metrics playbook.

MountVolume.SetUp failed for volume "kubernetes.io/secret/a91e8025-aebd-11e7-a9ca-005056b50758-heapster-certs" (spec.Name: "heapster-certs") pod "a91e8025-aebd-11e7-a9ca-005056b50758" (UID: "a91e8025-aebd-11e7-a9ca-005056b50758") with: secrets "heapster-certs" not found

Unable to mount volumes for pod "heapster-5m4dj_openshift-infra(a91e8025-aebd-11e7-a9ca-005056b50758)": timeout expired waiting for volumes to attach/mount for pod "openshift-infra"/"heapster-5m4dj". list of unattached/unmounted volumes=[heapster-certs]

[root@home]# oc describe svc heapster
Name:			heapster
Namespace:		openshift-infra
Labels:			metrics-infra=heapster
			name=heapster
Annotations:		service.alpha.openshift.io/serving-cert-secret-name=heapster-certs
Selector:		name=heapster
Type:			ClusterIP
IP:			172.30.36.101
Port:			<unset>	80/TCP
Endpoints:		<none>
Session Affinity:	None
Events:			<none>

Cassandra and Metrics are up ok, but heapster will not start.

NAME                            READY     STATUS              RESTARTS   AGE
po/hawkular-cassandra-1-zmdkb   1/1       Running             0          2h
po/hawkular-metrics-s7f1g       1/1       Running             0          2h
po/heapster-clb0b               0/1       ContainerCreating   0          12h
po/recycler-for-pvnfs0003       0/1       Terminating         0          18h

There is nothing in the logs related to heapster-certs.

The only potentially interesting thing on the master is:

Oct 11 15:47:18 atomic-openshift-master-api[49110]: W1011 15:47:18.089292   49110 watcher.go:337] Fast watcher, slow processing. Number of buffered events: 100.Probably caused by slow decoding, user not receiving fast, or other processing logic

Oct 11 15:47:18 atomic-openshift-master-api[49110]: W1011 15:47:18.092346   49110 watcher.go:337] Fast watcher, slow processing. Number of buffered events: 100.Probably caused by slow decoding, user not receiving fast, or other processing logic

Oct 11 15:47:18 atomic-openshift-master-api[49110]: W1011 15:47:18.095631   49110 watcher.go:337] Fast watcher, slow processing. Number of buffered events: 100.Probably caused by slow decoding, user not receiving fast, or other processing logic

Oct 11 15:47:18 atomic-openshift-master-api[49110]: W1011 15:47:18.097806   49110 watcher.go:217] Fast watcher, slow processing. Number of buffered events: 100.Probably caused by slow dispatching events to watchers

Metrics were subsequently fully uninstalled and reinstalled with the same failure.

Version-Release number of selected component (if applicable):

3.6.1

How reproducible:

Has happened in 2 of 2 environments.


Steps to Reproduce:

Upgrade etcd
Upgrade masters
Upgrade nodes
Migrate etcd v2 to v3
Enable etcd data encryption
Upgrade registry
Upgrade router
Upgrade metrics

Actual results:
Metrics fails

Expected results:
Metrics upgrades

Additional info:

Comment 6 Matt Wringe 2017-10-12 13:43:12 UTC

What is the OpenShift Ansible version they are using? If they are installing metrics onto OCP 3.6.1, they need to be using a 3.6.x version of OpenShift Ansible. I suspect that is the problem here.

Comment 12 Juraci Paixão Kröhling 2017-10-13 08:38:34 UTC

Looks like this was a documentation issue (or rather, lack thereof). I'm closing this issue as NOTABUG, but if you think there's something we could do here, feel free to reopen.

Comment 13 Juraci Paixão Kröhling 2017-10-13 08:45:07 UTC

*** Bug 1500946 has been marked as a duplicate of this bug. ***

Comment 14 dlbewley 2017-10-13 17:38:39 UTC

Same problem in v3.6.173.0.21. Nothing in docs on this particular issue.

The heapster rc has a volume that mounts the secret `heapster-certs`, but the playbook does not create such a secret in `openshift-infra`.

Comment 15 Matthew Robson 2017-10-13 17:53:43 UTC

If you're running into this, here is the solution:

Check the master controllers log for a message showing that the service-serving-cert controller was skipped. You may need to restart the controllers if the log has rolled over:

Oct 12 10:23:27 atomic-openshift-master-controllers: I1012 10:23:27.804711   49015 start_master.go:773] Starting "openshift.io/service-serving-cert"
Oct 12 10:23:27 atomic-openshift-master-controllers: W1012 10:23:27.804716   49015 start_master.go:780] Skipping "openshift.io/service-serving-cert"
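
A minimal sketch of that log check. It assumes the controllers run as the systemd unit atomic-openshift-master-controllers (the name seen in the log lines above; other install layouts may use a different unit). The function name serving_cert_controller_skipped is made up for illustration. It reads log text on stdin, so it works with journalctl output or a saved log file.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if the log text on stdin shows the
# service-serving-cert controller being skipped at startup, 1 otherwise.
serving_cert_controller_skipped() {
    grep -qF 'Skipping "openshift.io/service-serving-cert"'
}

# Example use on a master (the unit name is an assumption):
#   journalctl -u atomic-openshift-master-controllers --no-pager \
#       | serving_cert_controller_skipped && echo "controller was skipped"
```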

The root cause is a missing serviceServingCert signer configuration in master-config.yaml.

Check if this exists for the masters:

controllerConfig:
  serviceServingCert:
    signer:
      certFile: service-signer.crt
      keyFile: service-signer.key
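
One way to run that check from a shell, as a rough sketch. It assumes the config lives at /etc/origin/master/master-config.yaml (the directory is the one named in this comment; the exact file path is an assumption) and that certFile and keyFile are relative to that directory. check_service_signer is a hypothetical helper, and the grep is a plain text match, not a real YAML parse.

```shell
#!/bin/sh
# Hypothetical helper: verify that a master config directory has the
# serviceServingCert stanza and both signer files.
# Usage: check_service_signer [master_dir]   (default: /etc/origin/master)
check_service_signer() {
    dir="${1:-/etc/origin/master}"
    conf="$dir/master-config.yaml"
    rc=0

    # Text match only; indentation and nesting are not validated.
    if ! grep -Eq '^[[:space:]]+serviceServingCert:' "$conf" 2>/dev/null; then
        echo "missing serviceServingCert in $conf"
        rc=1
    fi

    # Both signer files must exist and be non-empty.
    for f in service-signer.crt service-signer.key; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty $dir/$f"
            rc=1
        fi
    done
    return "$rc"
}
```

Running it on each master should flag both a missing stanza and missing signer files.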

You can generate the signer certificate and key with:

oc adm ca create-signer-cert --cert=service-signer.crt --key=service-signer.key --name=openshift-service-serving-signer --serial=service-signer.serial.txt

Copy the crt and key into /etc/origin/master/, add the above config to master-config.yaml, and restart the masters. Service certs should then start working once you recreate the service.

Typically this would have been created on an OpenShift upgrade via:

https://github.com/openshift/openshift-ansible/blob/release-3.6/playbooks/common/openshift-cluster/upgrades/create_service_signer_cert.yml

But if you do manual upgrades, this would not have been done.

Raised a docs issue here: https://bugzilla.redhat.com/show_bug.cgi?id=1501994

And oc adm diagnostics / master startup will be enhanced to better WARN about this:
https://github.com/openshift/origin/pull/16863

The missing secret is not created by the playbook itself. It is generated for the heapster service by the service-serving-cert controller, which acts on the service annotation 'service.alpha.openshift.io/serving-cert-secret-name':

[root@osemaster1 etcd]# oc describe svc heapster
Name:			heapster
Namespace:		openshift-infra
Labels:			metrics-infra=heapster
			name=heapster
Annotations:		kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"Service","metadata":{"annotations":{"service.alpha.openshift.io/serving-cert-secret-name":"heapster-certs"},"labels":{"metri...
			service.alpha.openshift.io/serving-cert-secret-name=heapster-certs
			service.alpha.openshift.io/serving-cert-signed-by=openshift-service-serving-signer@1498853387
Selector:		name=heapster
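
Going by the output above, a service that the controller has handled carries a serving-cert-signed-by annotation next to the serving-cert-secret-name request, while the failing service in the description shows only the request. A small sketch that tells the two apart from `oc describe svc` text on stdin. serving_cert_status is a made-up name and the three status words are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: classify `oc describe svc` output read from stdin.
# Prints one of: signed, pending, not-requested.
serving_cert_status() {
    out=$(cat)
    if printf '%s\n' "$out" | grep -qF 'service.alpha.openshift.io/serving-cert-signed-by='; then
        echo signed
    elif printf '%s\n' "$out" | grep -qF 'service.alpha.openshift.io/serving-cert-secret-name='; then
        # Requested but never signed; this stays "pending" for as long as
        # the service-serving-cert controller is being skipped.
        echo pending
    else
        echo not-requested
    fi
}

# Example use:
#   oc describe svc heapster -n openshift-infra | serving_cert_status
```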

Comment 16 dlbewley 2017-10-13 18:17:42 UTC

Thank you for those details. I am in fact missing `serviceServingCert`. 

I do use the playbook for upgrades and have upgraded this to each point release since 3.0, so perhaps there is a corner case caused by cruft. It appears that those tasks were skipped by conditional evaluation in the playbook. I have a case open 01948010 and will update details there when I investigate further.

From upgrade log:

```
[root@ose-prod-master-01 3.6]# grep -B1 -A3 create_service_signer_cert.yml 20171008-1736-ansible-upgrade.log
TASK [Create local temp directory for syncing certs] ******************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/create_service_signer_cert.yml:8
skipping: [localhost] => {
    "changed": false,
    "skip_reason": "Conditional result was False",
--
TASK [Create remote temp directory for creating certs] ****************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/create_service_signer_cert.yml:17
skipping: [ose-prod-master-01.example.com] => {
    "changed": false,
    "skip_reason": "Conditional result was False",
--
TASK [Create service signer certificate] ******************************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/create_service_signer_cert.yml:23
skipping: [ose-prod-master-01.example.com] => {
    "changed": false,
    "skip_reason": "Conditional result was False",
--
TASK [Retrieve service signer certificate] ****************************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/create_service_signer_cert.yml:34
skipping: [ose-prod-master-01.example.com] => (item=service-signer.crt)  => {
    "changed": false,
    "item": "service-signer.crt",
--
TASK [Delete remote temp directory] ***********************************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/create_service_signer_cert.yml:46
skipping: [ose-prod-master-01.example.com] => {
    "changed": false,
    "skip_reason": "Conditional result was False",
--
TASK [Deploy service signer certificate] ******************************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/create_service_signer_cert.yml:56
skipping: [ose-prod-master-01.example.com] => (item=service-signer.crt)  => {
    "changed": false,
    "item": "service-signer.crt",
--
TASK [Delete local temp directory] ************************************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/create_service_signer_cert.yml:71
skipping: [localhost] => {
    "changed": false,
    "skip_reason": "Conditional result was False",
```
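
To summarize a log like this instead of reading it task by task, a sketch along these lines counts how many tasks from create_service_signer_cert.yml were skipped. It assumes verbose Ansible output in which each "task path:" line is immediately followed by the result line, as in the excerpt above. count_skipped_signer_tasks is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: count tasks from create_service_signer_cert.yml
# whose result line starts with "skipping:". Reads an Ansible log on stdin.
count_skipped_signer_tasks() {
    awk '
        # Remember that this line names a task in the signer playbook.
        /^task path: .*create_service_signer_cert\.yml:[0-9]+/ { in_task = 1; next }
        # Count the result line only if it directly follows such a task path.
        in_task && /^skipping:/ { skipped++ }
        { in_task = 0 }
        END { print skipped + 0 }
    '
}

# Example use:
#   count_skipped_signer_tasks < 20171008-1736-ansible-upgrade.log
```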

Comment 17 dlbewley 2017-10-13 20:27:17 UTC

See https://github.com/openshift/openshift-ansible/commit/3e5d38caf39d53c917a78542a04ebb6a109e7e6f

Comment 18 Eric Jones 2017-11-27 16:44:57 UTC

Reopening this bug: the customer in the newly attached case is seeing the message start_master.go:780] Skipping "openshift.io/service-serving-cert" in the master's logs, even though they do have the controllerConfig section in the master config.

We even tried recreating the certs, but the master still produces that message.

Comment 19 Eric Jones 2017-11-27 16:54:45 UTC

And we might've identified a typo/possible other bug. 

I will update this bug again after getting confirmation from customer

Comment 20 Eric Jones 2017-11-27 18:47:56 UTC

The problem was caused by a typo in the key name. The customer had servicesServingCert [0], but it should be serviceServingCert [1].

[0]
controllerConfig:
  servicesServingCert:
    signer:
      certFile: service-signer.crt
      keyFile: service-signer.key

[1] 
controllerConfig:
  serviceServingCert:
    signer:
      certFile: service-signer.crt
      keyFile: service-signer.key

Correcting this fixed the customer's issue. Reclosing this BZ.
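
Since a misspelled key like [0] produced the same "Skipping" message as a missing stanza, a quick text check for near-miss spellings can save time. A sketch, assuming the only valid spelling is serviceServingCert as shown in [1]. find_signer_key_typos is a hypothetical helper, and the pattern only covers pluralization and letter-case slips.

```shell
#!/bin/sh
# Hypothetical helper: print lines (with line numbers) whose key looks like
# serviceServingCert but is not spelled exactly that way.
# Exits 0 if at least one suspicious line is found, 1 otherwise.
# Usage: find_signer_key_typos /path/to/master-config.yaml
find_signer_key_typos() {
    # First grep: case-insensitive match on singular/plural variants.
    # Second grep: drop lines that use the exact, correct spelling.
    grep -niE '^[[:space:]]*services?servingcerts?:' "$1" \
        | grep -vE '^[0-9]+:[[:space:]]*serviceServingCert:'
}
```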
