Bug 1859161
| Summary: | [OCP v4.5][OpenStack] OCP installation fails on OSP because the `machine-config` operator is Degraded and times out | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Prashant Dhamdhere <pdhamdhe> |
| Component: | Machine Config Operator | Assignee: | Antonio Murdaca <amurdaca> |
| Status: | CLOSED DUPLICATE | QA Contact: | Michael Nguyen <mnguyen> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 4.5 | CC: | lmohanty, m.andre, vlaad, wjiang, wking, wsun, xtian |
| Target Milestone: | --- | Keywords: | Reopened, Upgrades |
| Target Release: | 4.6.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-07-27 09:16:06 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
I can't access the master nodes with oc debug:

➜ ~ oc debug node/wj45uos721ao-sgg99-master-0
Starting pod/wj45uos721ao-sgg99-master-0-debug ...
To use host binaries, run `chroot /host`
Removing debug pod ...
error: Back-off pulling image "registry.redhat.io/rhel7/support-tools"

Anyway, this looks like the usual installation drift, since the workers came up just fine. I was trying to grab `/etc/mcs-machine-config-content.json` to do some preliminary diffing; can you jump onto a master directly and grab that file?

kube-controller-manager is degraded:
status:
conditions:
- lastTransitionTime: "2020-07-21T12:47:04Z"
message: |-
StaticPodsDegraded: pod/kube-controller-manager-wj45uos721ao-sgg99-master-0 container "kube-controller-manager-recovery-controller" is not ready: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-wj45uos721ao-sgg99-master-0_openshift-kube-controller-manager(d51ea6177c672c352c69793059d20a7a)
StaticPodsDegraded: pod/kube-controller-manager-wj45uos721ao-sgg99-master-0 container "kube-controller-manager-recovery-controller" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-wj45uos721ao-sgg99-master-0_openshift-kube-controller-manager(d51ea6177c672c352c69793059d20a7a)
NodeControllerDegraded: All master nodes are ready
reason: AsExpected
status: "False"
type: Degraded
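
Not part of the original report, but to see why the recovery-controller container keeps crashlooping, a starting point is to pull its logs straight from the static pod named in the condition above. A minimal sketch, using the pod name from the Degraded message (adjust for the actual master):

```bash
# Sketch only: pod and container names are taken from the Degraded condition above.
# Previous-run logs usually show why the container keeps being restarted.
oc -n openshift-kube-controller-manager logs \
  pod/kube-controller-manager-wj45uos721ao-sgg99-master-0 \
  -c kube-controller-manager-recovery-controller --previous

# Recent events in the namespace can also point at the restart reason.
oc -n openshift-kube-controller-manager get events --sort-by=.lastTimestamp | tail -n 20
```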
must-gather fails to look up quay.io to pull its image; `oc debug node/...` seems to be failing to pull images too.
sdodson@t490: ~$ oc adm must-gather
[must-gather ] OUT unable to resolve the imagestream tag openshift/must-gather:latest
[must-gather ] OUT
[must-gather ] OUT Using must-gather plugin-in image: quay.io/openshift/origin-must-gather:latest
[must-gather ] OUT namespace/openshift-must-gather-mqtn7 created
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-smw62 created
[must-gather ] OUT pod for plug-in image quay.io/openshift/origin-must-gather:latest created
[must-gather-pzqf7] OUT gather did not start: unable to pull image: ErrImagePull: rpc error: code = Unknown desc = error pinging docker registry quay.io: Get https://quay.io/v2/: dial tcp: lookup quay.io on 10.0.77.163:53: read udp 192.168.2.183:47037->10.0.77.163:53: i/o timeout
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-smw62 deleted
[must-gather ] OUT namespace/openshift-must-gather-mqtn7 deleted
error: gather did not start for pod must-gather-pzqf7: unable to pull image: ErrImagePull: rpc error: code = Unknown desc = error pinging docker registry quay.io: Get https://quay.io/v2/: dial tcp: lookup quay.io on 10.0.77.163:53: read udp 192.168.2.183:47037->10.0.77.163:53: i/o timeout
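
The failure above is a DNS timeout (the lookup of quay.io against 10.0.77.163:53 times out) rather than a registry problem, and debug pods can't be used here because their image can't be pulled either. A rough way to check resolution directly on a node is over SSH; this is only a sketch, `<master-ip>` is a placeholder, and it assumes SSH access as the `core` user (possibly via a bastion or floating IP):

```bash
# Sketch only: <master-ip> is a placeholder for a master's reachable address.
ssh core@<master-ip> 'cat /etc/resolv.conf'    # which nameserver the node is actually using
ssh core@<master-ip> 'getent hosts quay.io'    # does name resolution work at all from the node?
ssh core@<master-ip> 'curl -sS -o /dev/null -w "%{http_code}\n" https://quay.io/v2/'  # registry reachability
```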
The kube-controller-manager issue is likely going to be solved by https://github.com/openshift/baremetal-runtimecfg/pull/73. I'm not sure what the issue with the DNS resolution could be.

*** This bug has been marked as a duplicate of bug 1826150 ***

The error reported in this issue, "StaticPodsDegraded: pod/kube-controller-manager-ocp4-tg4n7-master-2 container "kube-controller-manager-recovery-controller" is not ready: CrashLoopBackOff: back-off 5m0s restarting failed", looks different from https://bugzilla.redhat.com/show_bug.cgi?id=1826150, but the current bug was marked as a duplicate. Can you please add more information about why it is a duplicate?

(In reply to Antonio Murdaca from comment #8)
>
> *** This bug has been marked as a duplicate of bug 1826150 ***

(In reply to Lalatendu Mohanty from comment #10)
> The error reported in this issue "StaticPodsDegraded:
> pod/kube-controller-manager-ocp4-tg4n7-master-2 container
> "kube-controller-manager-recovery-controller" is not ready:
> CrashLoopBackOff: back-off 5m0s restarting failed" looks different from
> https://bugzilla.redhat.com/show_bug.cgi?id=1826150 but the current bug was marked
> as a duplicate. Can you please add more information about why it is a duplicate?

The error reported in this issue is not that one - that's Scott's comment in https://bugzilla.redhat.com/show_bug.cgi?id=1859161#c4. The underlying installation issue _is_ 1826150, which prevents the installation from completing and is the cause of the actual issue reported in the first comment. 1826150 is not a regression, it's fixed as of 4.6, and there's an easy workaround for installation.

Why that pod is crashlooping, I'm not sure; can you please open a different Bugzilla, as it's not directly related to the MCO? It can be a side effect of the MCO bug, but as said, it's fixed in 4.6 and there's a workaround.

*** This bug has been marked as a duplicate of bug 1826150 ***
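
For the preliminary diffing asked for earlier (grabbing `/etc/mcs-machine-config-content.json` from a master while `oc debug` can't pull its image), something like the following could work over SSH. This is only a sketch: `<master-ip>` and `<rendered-master-hash>` are placeholders, the rendered object may simply not exist given the "not found" errors, and the two JSON documents won't be byte-identical, so the diff is only meant to show structural drift.

```bash
# Sketch only: <master-ip> and <rendered-master-hash> are placeholders.
# The config served by the Machine Config Server at ignition time, as stored on the node:
ssh core@<master-ip> 'sudo cat /etc/mcs-machine-config-content.json' > node-mcs-content.json

# The rendered MachineConfig the master pool currently points at (and whether it exists at all):
oc get mcp master -o jsonpath='{.spec.configuration.name}{"\n"}'
oc get machineconfig | grep rendered-master
oc get machineconfig <rendered-master-hash> -o json > rendered-master.json

# Preliminary diff between what the node booted with and what the pool expects.
diff <(python3 -m json.tool node-mcs-content.json) <(python3 -m json.tool rendered-master.json) | head -n 40
```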
Description of problem:

OCP installation fails on OSP because the `machine-config` operator is Degraded and times out.

level=info msg="Cluster operator machine-config Progressing is True with : Working towards 4.5.3"
level=error msg="Cluster operator machine-config Degraded is True with RequiredPoolsFailed: Unable to apply 4.5.3: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with \"3 nodes are reporting degraded status on sync\": \"Node qeci-6445-dw974-master-0 is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-a56063191760ecece55a112f4f32046e\\\\\\\" not found\\\", Node qeci-6445-dw974-master-1 is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-a56063191760ecece55a112f4f32046e\\\\\\\" not found\\\", Node qeci-6445-dw974-master-2 is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-a56063191760ecece55a112f4f32046e\\\\\\\" not found\\\"\", retrying"
level=info msg="Cluster operator machine-config Available is False with : Cluster not available for 4.5.3"
level=fatal msg="failed to initialize the cluster: Cluster operator machine-config is still updating"

+ rc=1
+ pkill -P 726
/home/jenkins/workspace/Launch Environment Flexy/private-openshift-misc/v3-launch-templates/functionality-testing/aos-4_5/hosts/upi_on_openstack-scripts/provision.sh: line 64: 3480 Terminated until oc observe --maximum-errors=-1 --exit-after=3600s csr -- oc adm certificate approve &>/dev/null; do :; done
+ '[' 1 -ne 0 ']'
+ exit 1
+ teardown
+ deactivate
+ '[' -n /opt/rh/rh-ruby26/root/usr/local/bin:/opt/rh/rh-ruby26/root/usr/bin:/opt/rh/rh-git218/root/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin ']'
+ PATH=/opt/rh/rh-ruby26/root/usr/local/bin:/opt/rh/rh-ruby26/root/usr/bin:/opt/rh/rh-git218/root/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+ export PATH
+ unset _OLD_VIRTUAL_PATH
+ '[' -n '' ']'
+ '[' -n /usr/bin/bash -o -n '' ']'
+ hash -r
+ '[' -n '' ']'
+ unset VIRTUAL_ENV
+ '[' '!' '' = nondestructive ']'
+ unset -f deactivate
+ rm -rf /home/jenkins/venv

tools/launch_instance.rb:623:in `installation_task': shell command failed execution, see logs (RuntimeError)
        from tools/launch_instance.rb:748:in `block in launch_template'
        from tools/launch_instance.rb:747:in `each'
        from tools/launch_instance.rb:747:in `launch_template'
        from tools/launch_instance.rb:55:in `block (2 levels) in run'
        from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.5.2/lib/commander/command.rb:184:in `call'
        from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.5.2/lib/commander/command.rb:155:in `run'
        from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.5.2/lib/commander/runner.rb:452:in `run_active_command'
        from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.5.2/lib/commander/runner.rb:68:in `run!'
        from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.5.2/lib/commander/delegates.rb:15:in `run!'
        from tools/launch_instance.rb:92:in `run'
        from tools/launch_instance.rb:887:in `<main>'

waiting for operation up to 36000 seconds..
[09:15:32] INFO> Exit Status: 1

Version-Release number of the following components:
4.5.3-x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install OCP on upi-on-osp using 4.5.3-x86_64

Actual results:
The OCP installation fails on OpenStack because the `machine-config` operator is Degraded and times out.

level=info msg="Cluster operator machine-config Progressing is True with : Working towards 4.5.3"
level=error msg="Cluster operator machine-config Degraded is True with RequiredPoolsFailed: Unable to apply 4.5.3: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with \"3 nodes are reporting degraded status on sync\": \"Node qeci-6445-dw974-master-0 is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-a56063191760ecece55a112f4f32046e\\\\\\\" not found\\\", Node qeci-6445-dw974-master-1 is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-a56063191760ecece55a112f4f32046e\\\\\\\" not found\\\", Node qeci-6445-dw974-master-2 is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-a56063191760ecece55a112f4f32046e\\\\\\\" not found\\\"\", retrying"
level=info msg="Cluster operator machine-config Available is False with : Cluster not available for 4.5.3"
level=fatal msg="failed to initialize the cluster: Cluster operator machine-config is still updating"

Expected results:
The OCP installation should not fail on OpenStack because of a degraded or timed-out operator.
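
Not part of the original report, but given that every master reports "rendered-master-a56063191760ecece55a112f4f32046e not found", a quick way to confirm the mismatch is to compare the rendered MachineConfigs that actually exist with the configuration each master node and the pool reference. A sketch:

```bash
# Sketch only: compare existing rendered configs with what the nodes and pool reference.
oc get machineconfig | grep rendered-master

# The MCO records per-node state in node annotations
# (machineconfiguration.openshift.io/currentConfig, desiredConfig, state).
for node in $(oc get nodes -l node-role.kubernetes.io/master -o name); do
  echo "== ${node}"
  oc get "${node}" -o yaml | grep 'machineconfiguration.openshift.io/'
done

# The master pool's own target configuration and degraded conditions.
oc get mcp master -o yaml
```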