Description of problem:

OCP installation on OSP fails because the `machine-config` operator is Degraded and the installer times out:

level=info msg="Cluster operator machine-config Progressing is True with : Working towards 4.5.3"
level=error msg="Cluster operator machine-config Degraded is True with RequiredPoolsFailed: Unable to apply 4.5.3: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with \"3 nodes are reporting degraded status on sync\": \"Node qeci-6445-dw974-master-0 is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-a56063191760ecece55a112f4f32046e\\\\\\\" not found\\\", Node qeci-6445-dw974-master-1 is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-a56063191760ecece55a112f4f32046e\\\\\\\" not found\\\", Node qeci-6445-dw974-master-2 is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-a56063191760ecece55a112f4f32046e\\\\\\\" not found\\\"\", retrying"
level=info msg="Cluster operator machine-config Available is False with : Cluster not available for 4.5.3"
level=fatal msg="failed to initialize the cluster: Cluster operator machine-config is still updating"

+ rc=1
+ pkill -P 726
/home/jenkins/workspace/Launch Environment Flexy/private-openshift-misc/v3-launch-templates/functionality-testing/aos-4_5/hosts/upi_on_openstack-scripts/provision.sh: line 64:  3480 Terminated  until oc observe --maximum-errors=-1 --exit-after=3600s csr -- oc adm certificate approve &>/dev/null; do :; done
+ '[' 1 -ne 0 ']'
+ exit 1
+ teardown
+ deactivate
+ '[' -n /opt/rh/rh-ruby26/root/usr/local/bin:/opt/rh/rh-ruby26/root/usr/bin:/opt/rh/rh-git218/root/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin ']'
+ PATH=/opt/rh/rh-ruby26/root/usr/local/bin:/opt/rh/rh-ruby26/root/usr/bin:/opt/rh/rh-git218/root/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+ export PATH
+ unset _OLD_VIRTUAL_PATH
+ '[' -n '' ']'
+ '[' -n /usr/bin/bash -o -n '' ']'
+ hash -r
+ '[' -n '' ']'
+ unset VIRTUAL_ENV
+ '[' '!' '' = nondestructive ']'
+ unset -f deactivate
+ rm -rf /home/jenkins/venv
tools/launch_instance.rb:623:in `installation_task': shell command failed execution, see logs (RuntimeError)
        from tools/launch_instance.rb:748:in `block in launch_template'
        from tools/launch_instance.rb:747:in `each'
        from tools/launch_instance.rb:747:in `launch_template'
        from tools/launch_instance.rb:55:in `block (2 levels) in run'
        from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.5.2/lib/commander/command.rb:184:in `call'
        from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.5.2/lib/commander/command.rb:155:in `run'
        from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.5.2/lib/commander/runner.rb:452:in `run_active_command'
        from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.5.2/lib/commander/runner.rb:68:in `run!'
        from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.5.2/lib/commander/delegates.rb:15:in `run!'
        from tools/launch_instance.rb:92:in `run'
        from tools/launch_instance.rb:887:in `<main>'
waiting for operation up to 36000 seconds..
[09:15:32] INFO> Exit Status: 1

Version-Release number of the following components:
4.5.3-x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install OCP on upi-on-osp using 4.5.3-x86_64

Actual results:
The OCP installation fails on OpenStack because the `machine-config` operator is Degraded and the installer times out.

level=info msg="Cluster operator machine-config Progressing is True with : Working towards 4.5.3"
level=error msg="Cluster operator machine-config Degraded is True with RequiredPoolsFailed: Unable to apply 4.5.3: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with \"3 nodes are reporting degraded status on sync\": \"Node qeci-6445-dw974-master-0 is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-a56063191760ecece55a112f4f32046e\\\\\\\" not found\\\", Node qeci-6445-dw974-master-1 is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-a56063191760ecece55a112f4f32046e\\\\\\\" not found\\\", Node qeci-6445-dw974-master-2 is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-a56063191760ecece55a112f4f32046e\\\\\\\" not found\\\"\", retrying"
level=info msg="Cluster operator machine-config Available is False with : Cluster not available for 4.5.3"
level=fatal msg="failed to initialize the cluster: Cluster operator machine-config is still updating"

Expected results:
The OCP installation should not fail on OpenStack because of any operator being Degraded or timing out.
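Additional info:
A minimal sketch of where to start looking at the pool state (assuming a working kubeconfig for the failed cluster; the rendered config name below is the one taken from the error message):

  oc get clusteroperator machine-config -o yaml
  oc get machineconfigpool master -o yaml
  oc get machineconfig | grep rendered-master
  oc get machineconfig rendered-master-a56063191760ecece55a112f4f32046e -o yaml

If the rendered MachineConfig the nodes reference really is missing from the API, that would point at the controller side rather than at the nodes.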
I can't access master nodes with oc debug:

➜  ~ oc debug node/wj45uos721ao-sgg99-master-0
Starting pod/wj45uos721ao-sgg99-master-0-debug ...
To use host binaries, run `chroot /host`

Removing debug pod ...
error: Back-off pulling image "registry.redhat.io/rhel7/support-tools"

Anyway, this looks like the usual installation drift, since the workers came up just fine. I was trying to grab `/etc/mcs-machine-config-content.json` to do some preliminary diffing; can you jump on any master directly and grab that?
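If it helps, a minimal sketch of pulling that file over SSH instead of oc debug (assuming the standard RHCOS `core` user and that the masters are reachable from the bastion; <master-ip> is a placeholder):

  ssh core@<master-ip> 'sudo cat /etc/mcs-machine-config-content.json' > master-mcs-machine-config-content.json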
kube-controller-manager is degraded:

status:
  conditions:
  - lastTransitionTime: "2020-07-21T12:47:04Z"
    message: |-
      StaticPodsDegraded: pod/kube-controller-manager-wj45uos721ao-sgg99-master-0 container "kube-controller-manager-recovery-controller" is not ready: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-wj45uos721ao-sgg99-master-0_openshift-kube-controller-manager(d51ea6177c672c352c69793059d20a7a)
      StaticPodsDegraded: pod/kube-controller-manager-wj45uos721ao-sgg99-master-0 container "kube-controller-manager-recovery-controller" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-wj45uos721ao-sgg99-master-0_openshift-kube-controller-manager(d51ea6177c672c352c69793059d20a7a)
      NodeControllerDegraded: All master nodes are ready
    reason: AsExpected
    status: "False"
    type: Degraded

must-gather fails to look up quay.io to pull the image; oc debug node seems to be failing to pull images too.

sdodson@t490: ~$ oc adm must-gather
[must-gather ] OUT unable to resolve the imagestream tag openshift/must-gather:latest
[must-gather ] OUT
[must-gather ] OUT Using must-gather plugin-in image: quay.io/openshift/origin-must-gather:latest
[must-gather ] OUT namespace/openshift-must-gather-mqtn7 created
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-smw62 created
[must-gather ] OUT pod for plug-in image quay.io/openshift/origin-must-gather:latest created
[must-gather-pzqf7] OUT gather did not start: unable to pull image: ErrImagePull: rpc error: code = Unknown desc = error pinging docker registry quay.io: Get https://quay.io/v2/: dial tcp: lookup quay.io on 10.0.77.163:53: read udp 192.168.2.183:47037->10.0.77.163:53: i/o timeout
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-smw62 deleted
[must-gather ] OUT namespace/openshift-must-gather-mqtn7 deleted
error: gather did not start for pod must-gather-pzqf7: unable to pull image: ErrImagePull: rpc error: code = Unknown desc = error pinging docker registry quay.io: Get https://quay.io/v2/: dial tcp: lookup quay.io on 10.0.77.163:53: read udp 192.168.2.183:47037->10.0.77.163:53: i/o timeout
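For what it's worth, a quick way to check whether that DNS failure also reproduces from a node itself (a minimal sketch, assuming SSH access to a master as described above and that getent and curl are present on the RHCOS host; <master-ip> is a placeholder):

  ssh core@<master-ip> 'cat /etc/resolv.conf; getent hosts quay.io; curl -sS -o /dev/null -v https://quay.io/v2/'

That should exercise the same resolver (10.0.77.163 in the error above) and the same registry endpoint the image pulls are failing against.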
The kube-controller-manager issue is likely going to be solved by https://github.com/openshift/baremetal-runtimecfg/pull/73. Not sure what the issue with the DNS resolution could be.
*** This bug has been marked as a duplicate of bug 1826150 ***
The error reported in this issue, "StaticPodsDegraded: pod/kube-controller-manager-ocp4-tg4n7-master-2 container "kube-controller-manager-recovery-controller" is not ready: CrashLoopBackOff: back-off 5m0s restarting failed", looks different than https://bugzilla.redhat.com/show_bug.cgi?id=1826150, but the current bug was marked as a duplicate. Can you please add more information about why it is a duplicate?
(In reply to Antonio Murdaca from comment #8)
>
> *** This bug has been marked as a duplicate of bug 1826150 ***

(In reply to Lalatendu Mohanty from comment #10)
> The error reported in this issue "StaticPodsDegraded:
> pod/kube-controller-manager-ocp4-tg4n7-master-2 container
> "kube-controller-manager-recovery-controller" is not ready:
> CrashLoopBackOff: back-off 5m0s restarting failed" looks different than
> https://bugzilla.redhat.com/show_bug.cgi?id=1826150 but current bug marked
> as duplicate. Can you please add more information around why it is duplicate.

The error reported in this issue is not that one; that is Scott's comment in https://bugzilla.redhat.com/show_bug.cgi?id=1859161#c4.

The underlying installation issue _is_ bug 1826150, which prevents the installation from completing and is the cause of the actual issue reported in the first comment. Bug 1826150 is not a regression, it is fixed as of 4.6, and there is an easy workaround for installation.

Why that pod is crashlooping, I'm not sure. Can you please open a separate Bugzilla, as it's not directly related to the MCO? It may be a side effect of the MCO bug, but as said, that is fixed in 4.6 and there is a workaround.

*** This bug has been marked as a duplicate of bug 1826150 ***