Bug 1859161
| Summary: | [OCP v4.5][OpenStack] OCP installation fails on OSP because the `machine-config` operator is Degraded and times out | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Prashant Dhamdhere <pdhamdhe> |
| Component: | Machine Config Operator | Assignee: | Antonio Murdaca <amurdaca> |
| Status: | CLOSED DUPLICATE | QA Contact: | Michael Nguyen <mnguyen> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 4.5 | CC: | lmohanty, m.andre, vlaad, wjiang, wking, wsun, xtian |
| Target Milestone: | --- | Keywords: | Reopened, Upgrades |
| Target Release: | 4.6.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-07-27 09:16:06 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
I can't access the master nodes with oc debug:

➜ ~ oc debug node/wj45uos721ao-sgg99-master-0
Starting pod/wj45uos721ao-sgg99-master-0-debug ...
To use host binaries, run `chroot /host`
Removing debug pod ...
error: Back-off pulling image "registry.redhat.io/rhel7/support-tools"

Anyway, this looks like the usual installation drift, since the workers came up just fine. I was trying to grab `/etc/mcs-machine-config-content.json` to do some preliminary diffing; can you jump onto a master directly and grab that file?

kube-controller-manager is degraded:
status:
conditions:
- lastTransitionTime: "2020-07-21T12:47:04Z"
message: |-
StaticPodsDegraded: pod/kube-controller-manager-wj45uos721ao-sgg99-master-0 container "kube-controller-manager-recovery-controller" is not ready: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-wj45uos721ao-sgg99-master-0_openshift-kube-controller-manager(d51ea6177c672c352c69793059d20a7a)
StaticPodsDegraded: pod/kube-controller-manager-wj45uos721ao-sgg99-master-0 container "kube-controller-manager-recovery-controller" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=kube-controller-manager-recovery-controller pod=kube-controller-manager-wj45uos721ao-sgg99-master-0_openshift-kube-controller-manager(d51ea6177c672c352c69793059d20a7a)
NodeControllerDegraded: All master nodes are ready
reason: AsExpected
status: "False"
type: Degraded
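
Not part of the original report, but to see why the recovery-controller container keeps crashlooping, a starting point is to pull its logs straight from the static pod named in the condition above. A minimal sketch, using the pod name from the Degraded message (adjust for the actual master):

```bash
# Sketch only: pod and container names are taken from the Degraded condition above.
# Previous-run logs usually show why the container keeps being restarted.
oc -n openshift-kube-controller-manager logs \
  pod/kube-controller-manager-wj45uos721ao-sgg99-master-0 \
  -c kube-controller-manager-recovery-controller --previous

# Recent events in the namespace can also point at the restart reason.
oc -n openshift-kube-controller-manager get events --sort-by=.lastTimestamp | tail -n 20
```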
must-gather fails to look up quay.io to pull its image; `oc debug node/...` seems to be failing to pull images too.
sdodson@t490: ~$ oc adm must-gather
[must-gather ] OUT unable to resolve the imagestream tag openshift/must-gather:latest
[must-gather ] OUT
[must-gather ] OUT Using must-gather plugin-in image: quay.io/openshift/origin-must-gather:latest
[must-gather ] OUT namespace/openshift-must-gather-mqtn7 created
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-smw62 created
[must-gather ] OUT pod for plug-in image quay.io/openshift/origin-must-gather:latest created
[must-gather-pzqf7] OUT gather did not start: unable to pull image: ErrImagePull: rpc error: code = Unknown desc = error pinging docker registry quay.io: Get https://quay.io/v2/: dial tcp: lookup quay.io on 10.0.77.163:53: read udp 192.168.2.183:47037->10.0.77.163:53: i/o timeout
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-smw62 deleted
[must-gather ] OUT namespace/openshift-must-gather-mqtn7 deleted
error: gather did not start for pod must-gather-pzqf7: unable to pull image: ErrImagePull: rpc error: code = Unknown desc = error pinging docker registry quay.io: Get https://quay.io/v2/: dial tcp: lookup quay.io on 10.0.77.163:53: read udp 192.168.2.183:47037->10.0.77.163:53: i/o timeout
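
The failure above is a DNS timeout (the lookup of quay.io against 10.0.77.163:53 times out) rather than a registry problem, and debug pods can't be used here because their image can't be pulled either. A rough way to check resolution directly on a node is over SSH; this is only a sketch, `<master-ip>` is a placeholder, and it assumes SSH access as the `core` user (possibly via a bastion or floating IP):

```bash
# Sketch only: <master-ip> is a placeholder for a master's reachable address.
ssh core@<master-ip> 'cat /etc/resolv.conf'    # which nameserver the node is actually using
ssh core@<master-ip> 'getent hosts quay.io'    # does name resolution work at all from the node?
ssh core@<master-ip> 'curl -sS -o /dev/null -w "%{http_code}\n" https://quay.io/v2/'  # registry reachability
```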
The kube-controller-manager issue is likely going to be solved by https://github.com/openshift/baremetal-runtimecfg/pull/73. I'm not sure what the issue with the DNS resolution could be.

*** This bug has been marked as a duplicate of bug 1826150 ***

The error reported in this issue, "StaticPodsDegraded: pod/kube-controller-manager-ocp4-tg4n7-master-2 container "kube-controller-manager-recovery-controller" is not ready: CrashLoopBackOff: back-off 5m0s restarting failed", looks different from https://bugzilla.redhat.com/show_bug.cgi?id=1826150, but the current bug was marked as a duplicate. Can you please add more information about why it is a duplicate?

(In reply to Antonio Murdaca from comment #8)
>
> *** This bug has been marked as a duplicate of bug 1826150 ***

(In reply to Lalatendu Mohanty from comment #10)
> The error reported in this issue "StaticPodsDegraded:
> pod/kube-controller-manager-ocp4-tg4n7-master-2 container
> "kube-controller-manager-recovery-controller" is not ready:
> CrashLoopBackOff: back-off 5m0s restarting failed" looks different from
> https://bugzilla.redhat.com/show_bug.cgi?id=1826150 but the current bug was marked
> as a duplicate. Can you please add more information about why it is a duplicate?

The error reported in this issue is not that one - that's Scott's comment in https://bugzilla.redhat.com/show_bug.cgi?id=1859161#c4. The underlying installation issue _is_ 1826150, which prevents the installation from completing and is the cause of the actual issue reported in the first comment. 1826150 is not a regression, it's fixed as of 4.6, and there's an easy workaround for installation.

Why that pod is crashlooping, I'm not sure; can you please open a different Bugzilla, as it's not directly related to the MCO? It can be a side effect of the MCO bug, but as said, it's fixed in 4.6 and there's a workaround.

*** This bug has been marked as a duplicate of bug 1826150 ***
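
For the preliminary diffing asked for earlier (grabbing `/etc/mcs-machine-config-content.json` from a master while `oc debug` can't pull its image), something like the following could work over SSH. This is only a sketch: `<master-ip>` and `<rendered-master-hash>` are placeholders, the rendered object may simply not exist given the "not found" errors, and the two JSON documents won't be byte-identical, so the diff is only meant to show structural drift.

```bash
# Sketch only: <master-ip> and <rendered-master-hash> are placeholders.
# The config served by the Machine Config Server at ignition time, as stored on the node:
ssh core@<master-ip> 'sudo cat /etc/mcs-machine-config-content.json' > node-mcs-content.json

# The rendered MachineConfig the master pool currently points at (and whether it exists at all):
oc get mcp master -o jsonpath='{.spec.configuration.name}{"\n"}'
oc get machineconfig | grep rendered-master
oc get machineconfig <rendered-master-hash> -o json > rendered-master.json

# Preliminary diff between what the node booted with and what the pool expects.
diff <(python3 -m json.tool node-mcs-content.json) <(python3 -m json.tool rendered-master.json) | head -n 40
```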
Description of problem:

OCP installation fails on OSP because the `machine-config` operator is Degraded and times out.

level=info msg="Cluster operator machine-config Progressing is True with : Working towards 4.5.3"
level=error msg="Cluster operator machine-config Degraded is True with RequiredPoolsFailed: Unable to apply 4.5.3: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with \"3 nodes are reporting degraded status on sync\": \"Node qeci-6445-dw974-master-0 is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-a56063191760ecece55a112f4f32046e\\\\\\\" not found\\\", Node qeci-6445-dw974-master-1 is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-a56063191760ecece55a112f4f32046e\\\\\\\" not found\\\", Node qeci-6445-dw974-master-2 is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-a56063191760ecece55a112f4f32046e\\\\\\\" not found\\\"\", retrying"
level=info msg="Cluster operator machine-config Available is False with : Cluster not available for 4.5.3"
level=fatal msg="failed to initialize the cluster: Cluster operator machine-config is still updating"

+ rc=1
+ pkill -P 726
/home/jenkins/workspace/Launch Environment Flexy/private-openshift-misc/v3-launch-templates/functionality-testing/aos-4_5/hosts/upi_on_openstack-scripts/provision.sh: line 64: 3480 Terminated until oc observe --maximum-errors=-1 --exit-after=3600s csr -- oc adm certificate approve &>/dev/null; do :; done
+ '[' 1 -ne 0 ']'
+ exit 1
+ teardown
+ deactivate
+ '[' -n /opt/rh/rh-ruby26/root/usr/local/bin:/opt/rh/rh-ruby26/root/usr/bin:/opt/rh/rh-git218/root/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin ']'
+ PATH=/opt/rh/rh-ruby26/root/usr/local/bin:/opt/rh/rh-ruby26/root/usr/bin:/opt/rh/rh-git218/root/usr/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+ export PATH
+ unset _OLD_VIRTUAL_PATH
+ '[' -n '' ']'
+ '[' -n /usr/bin/bash -o -n '' ']'
+ hash -r
+ '[' -n '' ']'
+ unset VIRTUAL_ENV
+ '[' '!' '' = nondestructive ']'
+ unset -f deactivate
+ rm -rf /home/jenkins/venv

tools/launch_instance.rb:623:in `installation_task': shell command failed execution, see logs (RuntimeError)
        from tools/launch_instance.rb:748:in `block in launch_template'
        from tools/launch_instance.rb:747:in `each'
        from tools/launch_instance.rb:747:in `launch_template'
        from tools/launch_instance.rb:55:in `block (2 levels) in run'
        from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.5.2/lib/commander/command.rb:184:in `call'
        from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.5.2/lib/commander/command.rb:155:in `run'
        from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.5.2/lib/commander/runner.rb:452:in `run_active_command'
        from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.5.2/lib/commander/runner.rb:68:in `run!'
        from /opt/rh/rh-ruby26/root/usr/share/gems/gems/commander-4.5.2/lib/commander/delegates.rb:15:in `run!'
        from tools/launch_instance.rb:92:in `run'
        from tools/launch_instance.rb:887:in `<main>'

waiting for operation up to 36000 seconds..
[09:15:32] INFO> Exit Status: 1

Version-Release number of the following components:
4.5.3-x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install OCP on upi-on-osp using 4.5.3-x86_64

Actual results:
The OCP installation fails on OpenStack because the `machine-config` operator is Degraded and times out.

level=info msg="Cluster operator machine-config Progressing is True with : Working towards 4.5.3"
level=error msg="Cluster operator machine-config Degraded is True with RequiredPoolsFailed: Unable to apply 4.5.3: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with \"3 nodes are reporting degraded status on sync\": \"Node qeci-6445-dw974-master-0 is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-a56063191760ecece55a112f4f32046e\\\\\\\" not found\\\", Node qeci-6445-dw974-master-1 is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-a56063191760ecece55a112f4f32046e\\\\\\\" not found\\\", Node qeci-6445-dw974-master-2 is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-a56063191760ecece55a112f4f32046e\\\\\\\" not found\\\"\", retrying"
level=info msg="Cluster operator machine-config Available is False with : Cluster not available for 4.5.3"
level=fatal msg="failed to initialize the cluster: Cluster operator machine-config is still updating"

Expected results:
The OCP installation should not fail on OpenStack because of a degraded or timed-out operator.
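
Not part of the original report, but given that every master reports "rendered-master-a56063191760ecece55a112f4f32046e not found", a quick way to confirm the mismatch is to compare the rendered MachineConfigs that actually exist with the configuration each master node and the pool reference. A sketch:

```bash
# Sketch only: compare existing rendered configs with what the nodes and pool reference.
oc get machineconfig | grep rendered-master

# The MCO records per-node state in node annotations
# (machineconfiguration.openshift.io/currentConfig, desiredConfig, state).
for node in $(oc get nodes -l node-role.kubernetes.io/master -o name); do
  echo "== ${node}"
  oc get "${node}" -o yaml | grep 'machineconfiguration.openshift.io/'
done

# The master pool's own target configuration and degraded conditions.
oc get mcp master -o yaml
```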