Bug 1680342
Summary: | Installer failed at [openshift_service_catalog : Wait for Controller Manager rollout success] | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Takayoshi Tanaka <tatanaka>
Component: | Service Catalog | Assignee: | Dan Geoffroy <dageoffr>
Status: | CLOSED ERRATA | QA Contact: | Jian Zhang <jiazha>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 3.11.0 | CC: | aos-bugs, asolanas, jokerman, jrosenta, mmccomas, szustkowski
Target Milestone: | --- | |
Target Release: | 4.1.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | No Doc Update
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2019-06-04 10:44:27 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Bug Depends On: | 1679898 | |
Bug Blocks: | | |
Description Takayoshi Tanaka 2019-02-24 05:16:07 UTC
I'm not convinced you are hitting the issue of the secret not being created yet or the volume failing to mount. I see this early in the event list: MountVolume.SetUp failed for volume "service-catalog-ssl" : secrets "controllermanager-ssl" not found, but the last-seen timestamp indicates an hour ago, and there are only 3 occurrences. Do you have additional evidence that indicates this condition persisted for more than 3 attempts? Could you collect several describe outputs for the controller-manager pods during these failures and also attempt to get some logs from them? Thanks!

Thanks Joel. I suspect you may be hitting https://github.com/kubernetes/kubernetes/issues/65848, but I don't know for certain. What verbosity level is the Kube API server configured for? If it's >5, it will cause this error and log "unable to set dialer for kube-service-catalog/apiserver as rest transport is of type *transport.debuggingRoundTripper".

Indeed, they have DEBUG_LOGLEVEL=8, but do you mean this is what is causing the issue?

Yes, that is correct. If they drop the level to 5 or less and restart the master API servers, it should address the issue.

Customer confirmed that lowering the debug_loglevel did the trick, thanks Jay!

Excellent, thanks Joel. This is a nasty issue that impacts all aggregated API servers; it is fixed in Kubernetes 1.12.

For 4.0, the cluster version is 4.0.0-0.nightly-2019-03-06-074438.

1. Change the log level of the kube-apiserver to "TraceAll" (-v=8).

[jzhang@dhcp-140-18 ocp-09]$ oc edit kubeapiserver cluster
spec:
  forceRedeploymentReason: ""
  logLevel: TraceAll

But it doesn't take effect; this depends on bug 1679898.

Set the kubeapiserver/openshiftapiserver log level to "8".

[jzhang@dhcp-140-18 ocp14]$ oc edit kubeapiserver cluster
kubeapiserver.operator.openshift.io/cluster edited
[jzhang@dhcp-140-18 ocp14]$ oc rsh kube-apiserver-ip-10-0-141-91.us-east-2.compute.internal
Defaulting container name to kube-apiserver-11.
Use 'oc describe pod/kube-apiserver-ip-10-0-141-91.us-east-2.compute.internal -n openshift-kube-apiserver' to see all of the containers in this pod.
sh-4.2# ps -elf|cat
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
4 S root 1 0 54 80 0 - 413686 futex_ 09:17 ? 00:01:42 hypershift openshift-kube-apiserver --config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml -v=8

[jzhang@dhcp-140-18 ocp14]$ oc edit openshiftapiserver cluster
openshiftapiserver.operator.openshift.io/cluster edited
[jzhang@dhcp-140-18 ocp14]$ oc rsh apiserver-gq9sx
sh-4.2# ps -elf |cat
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
4 S root 1 0 5 80 0 - 258941 futex_ 09:25 ? 00:00:18 hypershift openshift-apiserver --config=/var/run/configmaps/config/config.yaml -v=8

[jzhang@dhcp-140-18 ocp14]$ oc version
oc v4.0.0-0.177.0
kubernetes v1.12.4+6a9f178753
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://api.jian-14.qe.devcluster.openshift.com:6443
kubernetes v1.12.4+4836bc9

The apiserver of the Service Catalog works well.

[jzhang@dhcp-140-18 ocp14]$ oc get pods -n openshift-service-catalog-apiserver
NAME              READY   STATUS    RESTARTS   AGE
apiserver-kq828   1/1     Running   0          2m
apiserver-mrt9z   1/1     Running   0          2m
apiserver-rxvm7   1/1     Running   0          2m2s

LGTM, verify it.

*** Bug 1689263 has been marked as a duplicate of this bug. ***
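As an illustration of the workaround discussed above for 3.11 (where the Kubernetes 1.12 fix is not available): drop the master API server verbosity to 5 or less and restart the master API servers. The following is a minimal sketch only, assuming the level is set via DEBUG_LOGLEVEL in /etc/origin/master/master.env as in this report, that the master-restart utility is available (3.10+), and that the kube-service-catalog daemonsets use the default names apiserver and controller-manager:

```sh
# Run on each master host.

# Confirm the current verbosity; anything above 5 triggers the
# debuggingRoundTripper wrapping that breaks aggregated API servers.
grep DEBUG_LOGLEVEL /etc/origin/master/master.env

# Drop it back to 5 or less (2 is the usual default).
sed -i 's/^DEBUG_LOGLEVEL=.*/DEBUG_LOGLEVEL=2/' /etc/origin/master/master.env

# Restart the master API static pod so the new level takes effect
# (the exact master-restart invocation may differ between 3.10 and 3.11;
# on releases without it, restart the master API service instead).
/usr/local/bin/master-restart api api

# Check that the service catalog recovers: the pods should go Ready and the
# "unable to set dialer" message should stop appearing in the apiserver logs.
oc get pods -n kube-service-catalog
oc logs ds/apiserver -n kube-service-catalog | grep -i "unable to set dialer" || true
oc describe ds/controller-manager -n kube-service-catalog
```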
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
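For reference, on 4.x the API server verbosity exercised in the verification above is driven through the operator resources rather than a master config file. The following is a small sketch only, assuming the kubeapiserver/openshiftapiserver cluster resources accept the logLevel values shown in the comments and that Normal is the default to revert to:

```sh
# Raise the operand verbosity on both API servers (TraceAll corresponds to -v=8).
oc patch kubeapiserver cluster --type=merge -p '{"spec":{"logLevel":"TraceAll"}}'
oc patch openshiftapiserver cluster --type=merge -p '{"spec":{"logLevel":"TraceAll"}}'

# Confirm the spec took (should print TraceAll).
oc get kubeapiserver cluster -o jsonpath='{.spec.logLevel}{"\n"}'

# Watch the static pods roll out with the new -v flag, then check that the
# aggregated service catalog apiserver pods stay Running, as in the
# verification comment above.
oc get pods -n openshift-kube-apiserver -w
oc get pods -n openshift-service-catalog-apiserver

# Revert to the default when finished.
oc patch kubeapiserver cluster --type=merge -p '{"spec":{"logLevel":"Normal"}}'
oc patch openshiftapiserver cluster --type=merge -p '{"spec":{"logLevel":"Normal"}}'
```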