Description of problem:
In the QE build watch of http://10.0.76.54/buildcorp/upgrade_CI/4169/console (you can see more details there), the upgrade got stuck with the error below:
"failed to run command oc (6 tries): timed out waiting for the condition: running oc image extract --path /:/run/mco-machine-os-content/os-content-176084702 ... Timeout exceeded while awaiting headers"

Matrix: 4.5.4-x86_64 -> 4.6.0-0.nightly-2020-08-01-172303, 15_Disconnected UPI on GCP with RHEL7.7 OVN & http_proxy & Etcd Encryption on

Running the command manually indeed took quite a while, 6m+ for the 1.4G of content:
[xxia 2020-08-03 17:13:50 CST my]$ time oc image extract --path /:os-content-176084702 quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bb8cb875ed5ef903df8f3f056a3d48eaf4cca3b34af02a9d6728125ff507bcdc
real    6m18.920s
user    0m17.649s
sys     0m9.285s
[xxia 2020-08-03 17:20:10 CST my]$
[xxia 2020-08-03 17:40:49 CST my]$ du -sh os-content-176084702/
1.4G    os-content-176084702/

Version-Release number of the following components:
4.6.0-0.nightly-2020-08-01-172303

How reproducible:
Not known yet

Steps to Reproduce:
1. The QE CI launched a 4.5.4 env with the above matrix
2. The QE CI upgraded it to 4.6.0-0.nightly-2020-08-01-172303

Actual results:
2. The CI printed the upgrade process's output with one master in SchedulingDisabled state, showing "timed out waiting for the condition: running oc image extract --path /:/run/mco-machine-os-content/os-content-176084702":
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.4     True        True          25m     Working towards 4.6.0-0.nightly-2020-08-01-172303: 84% complete

NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.4     True        True          26m     Unable to apply 4.6.0-0.nightly-2020-08-01-172303: the cluster operator machine-config has not yet successfully rolled out
...
...
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.4     True        True          119m    Unable to apply 4.6.0-0.nightly-2020-08-01-172303: the cluster operator openshift-apiserver is degraded

**************Post Action after upgrade fail****************
Post action: # oc get node:
NAME                                                 STATUS                     ROLES    AGE    VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                        KERNEL-VERSION                 CONTAINER-RUNTIME
ugdci02033536-08011937-m-0.c.openshift-qe.internal   Ready                      master   4h6m   v1.18.3+012b3ec   10.0.0.4                    Red Hat Enterprise Linux CoreOS 45.82.202007240629-0 (Ootpa)   4.18.0-193.13.2.el8_2.x86_64   cri-o://1.18.3-5.rhaos4.5.git1c13d1d.el8
ugdci02033536-08011937-m-1.c.openshift-qe.internal   Ready,SchedulingDisabled   master   4h5m   v1.18.3+012b3ec   10.0.0.6                    Red Hat Enterprise Linux CoreOS 45.82.202007240629-0 (Ootpa)   4.18.0-193.13.2.el8_2.x86_64   cri-o://1.18.3-5.rhaos4.5.git1c13d1d.el8
ugdci02033536-08011937-m-2.c.openshift-qe.internal   Ready                      master   4h6m   v1.18.3+012b3ec   10.0.0.5                    Red Hat Enterprise Linux CoreOS 45.82.202007240629-0 (Ootpa)   4.18.0-193.13.2.el8_2.x86_64   cri-o://1.18.3-5.rhaos4.5.git1c13d1d.el8
ugdci02033536-08011937-w-a-l-rhel-0                  Ready,SchedulingDisabled   worker   157m   v1.18.3+08c38ef   10.0.32.5                   Red Hat Enterprise Linux Server 7.7 (Maipo)                    3.10.0-1127.18.2.el7.x86_64    cri-o://1.18.3-8.rhaos4.5.gitbefe37e.el7
ugdci02033536-08011937-w-a-l-rhel-1                  Ready                      worker   157m   v1.18.3+08c38ef   10.0.32.6                   Red Hat Enterprise Linux Server 7.7 (Maipo)                    3.10.0-1127.18.2.el7.x86_64    cri-o://1.18.3-8.rhaos4.5.gitbefe37e.el7

Post action: # oc get co:
NAME                  VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication        4.6.0-0.nightly-2020-08-01-172303   True        False         True       107m
...
ingress               4.6.0-0.nightly-2020-08-01-172303   True        False         True       113m
...
machine-config        4.5.4                               False       True          True       103m
...
openshift-apiserver   4.6.0-0.nightly-2020-08-01-172303   True        False         True       112m
...

print detail msg for node(SchedulingDisabled) if exist:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Abnormal node details~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Name:               ugdci02033536-08011937-m-1.c.openshift-qe.internal
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=n1-standard-4
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-central1
                    failure-domain.beta.kubernetes.io/zone=us-central1-b
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ugdci02033536-08011937-m-1.c.openshift-qe.internal
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/master=
                    node.kubernetes.io/instance-type=n1-standard-4
                    node.openshift.io/os_id=rhcos
                    topology.kubernetes.io/region=us-central1
                    topology.kubernetes.io/zone=us-central1-b
Annotations:        k8s.ovn.org/l3-gateway-config: {"default":{"mode":"local","interface-id":"br-local_ugdci02033536-08011937-m-1.c.openshift-qe.internal","mac-address":"7a:2d:9d:09:04:47",...
                    k8s.ovn.org/node-chassis-id: 7459e79d-9f10-46e8-97c6-c7deae5cdf47
                    k8s.ovn.org/node-join-subnets: {"default":"100.64.0.0/29"}
                    k8s.ovn.org/node-mgmt-port-mac-address: 1a:0c:30:4e:f8:d9
                    k8s.ovn.org/node-subnets: {"default":"10.128.0.0/23"}
                    machineconfiguration.openshift.io/currentConfig: rendered-master-9cd892e38d01a0786e750356d578fada
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-eb5b061fa7a90a14a31f03301d1c9e2e
                    machineconfiguration.openshift.io/reason: failed to run command oc (6 tries): timed out waiting for the condition: running oc image extract --path /:/run/mco-machine-os-content/os-... : exit status 1
                    machineconfiguration.openshift.io/state: Degraded
                    volumes.kubernetes.io/controller-managed-attach-detach: true

print detail msg for co(AVAILABLE != True or PROGRESSING != False or DEGRADED != False or version != target_version) if exist:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Abnormal co details~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
...
Name:         machine-config
...
Status:
  Conditions:
    Last Transition Time:  2020-08-01T22:17:55Z
    Message:               Working towards 4.6.0-0.nightly-2020-08-01-172303
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-08-01T22:33:26Z
    Message:               Unable to apply 4.6.0-0.nightly-2020-08-01-172303: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-9cd892e38d01a0786e750356d578fada expected 057d852d0d10f94120aaa91e771503baa5b3c242 has 99eb744f5094224edb60d88ca85d607ab151ebdf: pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node ugdci02033536-08011937-m-1.c.openshift-qe.internal is reporting: \"failed to run command oc (6 tries): timed out waiting for the condition: running oc image extract --path /:/run/mco-machine-os-content/os-content-176084702 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bb8cb875ed5ef903df8f3f056a3d48eaf4cca3b34af02a9d6728125ff507bcdc failed: error: unable to connect to image repository quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bb8cb875ed5ef903df8f3f056a3d48eaf4cca3b34af02a9d6728125ff507bcdc: Get https://quay.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\\n: exit status 1\"", retrying
    Reason:                RequiredPoolsFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-08-01T22:15:52Z
    Message:               Cluster not available for 4.6.0-0.nightly-2020-08-01-172303
    Status:                False
    Type:                  Available
    Last Transition Time:  2020-08-01T19:56:02Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:
    Last Sync Error:  pool master has not progressed to latest configuration: controller version mismatch for rendered-master-9cd892e38d01a0786e750356d578fada expected 057d852d0d10f94120aaa91e771503baa5b3c242 has 99eb744f5094224edb60d88ca85d607ab151ebdf: pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node ugdci02033536-08011937-m-1.c.openshift-qe.internal is reporting: \"failed to run command oc (6 tries): timed out waiting for the condition: running oc image extract --path /:/run/mco-machine-os-content/os-content-176084702 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bb8cb875ed5ef903df8f3f056a3d48eaf4cca3b34af02a9d6728125ff507bcdc failed: error: unable to connect to image repository quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bb8cb875ed5ef903df8f3f056a3d48eaf4cca3b34af02a9d6728125ff507bcdc: Get https://quay.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\\n: exit status 1\"", retrying
    Master:           pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node ugdci02033536-08011937-m-1.c.openshift-qe.internal is reporting: \"failed to run command oc (6 tries): timed out waiting for the condition: running oc image extract --path /:/run/mco-machine-os-content/os-content-176084702 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bb8cb875ed5ef903df8f3f056a3d48eaf4cca3b34af02a9d6728125ff507bcdc failed: error: unable to connect to image repository quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bb8cb875ed5ef903df8f3f056a3d48eaf4cca3b34af02a9d6728125ff507bcdc: Get https://quay.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\\n: exit status 1\""
    Worker:           pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node ugdci02033536-08011937-w-a-l-rhel-0 is reporting: \"failed to drain node (5 tries): timed out waiting for the condition: [error when evicting pod \\\"router-default-b465b6d9c-bzhwx\\\": global timeout reached: 1m30s, error when evicting pod \\\"router-default-76f5cc7db9-xn2rd\\\": global timeout reached: 1m30s]\""

Expected results:
No such errors.

Additional info:
The separate ingress drain issue shown above was filed as https://bugzilla.redhat.com/show_bug.cgi?id=1862892
*** This bug has been marked as a duplicate of bug 1862979 ***
(In reply to Xingxing Xia from comment #0)
> Description of problem:
> Matrix: 4.5.4-x86_64 -> 4.6.0-0.nightly-2020-08-01-172303, 15_Disconnected UPI on GCP with RHEL7.7 OVN & http_proxy & Etcd Encryption on

This bug's env had http_proxy, so it is not a pure disconnected-env bug like bug 1862979, which is a pure disconnected env without a proxy. The error message is also different: this bug's error is "Timeout exceeded while awaiting headers", while that bug's is "Get "https://quay.io/v2/": Forbidden". So I don't think it is the same issue. Reopening therefore... Thanks.
(In reply to Xingxing Xia from comment #3)
> This bug env had http_proxy

One more point for clarity: with http_proxy, this bug's env does not use a "mirror registry" to host the images at all, unlike bug 1862979.
(In reply to Xingxing Xia from comment #3)
> (In reply to Xingxing Xia from comment #0)
> > Description of problem:
> > Matrix: 4.5.4-x86_64 -> 4.6.0-0.nightly-2020-08-01-172303, 15_Disconnected UPI on GCP with RHEL7.7 OVN & http_proxy & Etcd Encryption on
> This bug's env had http_proxy, so it is not a pure disconnected-env bug like
> bug 1862979, which is a pure disconnected env without a proxy. The error
> message is also different: this bug's error is "Timeout exceeded while
> awaiting headers", while that bug's is "Get "https://quay.io/v2/":
> Forbidden". So I don't think it is the same issue. Reopening therefore...
> Thanks.

Please attach the must-gather. So far the cause appears to be the same - the MCO now uses `oc image extract` instead of `podman run`, which means mirror and proxy settings are not respected.
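To make that difference concrete, a minimal sketch (illustrative only, not the MCD's exact invocation; the digest is the one from comment #0):

# podman resolves mirrors from the host's /etc/containers/registries.conf
# (which the MCO renders from ImageContentSourcePolicy), so a disconnected pull can succeed:
podman pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bb8cb875ed5ef903df8f3f056a3d48eaf4cca3b34af02a9d6728125ff507bcdc

# oc image extract does not consult registries.conf, so without explicit mirror/proxy
# plumbing it contacts quay.io directly -- which is exactly what times out here:
oc image extract --path /:/run/mco-machine-os-content/os-content-176084702 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bb8cb875ed5ef903df8f3f056a3d48eaf4cca3b34af02a9d6728125ff507bcdc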
Unlike other failed QE CI jobs, which produced a successful must-gather, this QE CI job's must-gather failed with "must-gather file creation fails". Will rebuild it, keep the cluster, and come back.
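For reference, once the cluster is rebuilt and kept, the must-gather can be regenerated with something like the following (the destination directory name is arbitrary):

oc adm must-gather --dest-dir=./must-gather-upgrade-4169
tar czf must-gather-upgrade-4169.tar.gz must-gather-upgrade-4169/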
Reproduced in QE build upgrade testing: http://10.0.76.54/buildcorp/upgrade_CI/4207/console
Cluster profile: Disconnected IPI on Azure & Private Cluster
From 4.5.4 to a 4.6 nightly build:
oc adm upgrade --to-image=ugdci05072549.mirror-registry.qe.azure.devcluster.openshift.com:5000/openshift-release-dev/ocp-release:4.6.0-0.nightly-2020-08-04-210224 --force=true --allow-explicit-upgrade=true

Hit the same problem:
...
Status:
  Conditions:
    Last Transition Time:  2020-08-05T02:43:01Z
    Message:               Working towards 4.6.0-0.nightly-2020-08-04-210224
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-08-05T02:57:45Z
    Message:               Unable to apply 4.6.0-0.nightly-2020-08-04-210224: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-09c695ebfc8d928626a4c4b84684d89a expected ade383fc8b27be6bdc6aa7985b3154350beaec88 has 99eb744f5094224edb60d88ca85d607ab151ebdf: pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node ugdci05072549-x2q48-master-2 is reporting: \"failed to run command oc (6 tries): timed out waiting for the condition: running oc image extract --path /:/run/mco-machine-os-content/os-content-424723676 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:16a8dde4b893ff0b5b4aeb05474f2f5e2ce9cac45d5d3e98b40c4309e23215a7 failed: error: unable to connect to image repository quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:16a8dde4b893ff0b5b4aeb05474f2f5e2ce9cac45d5d3e98b40c4309e23215a7: Get https://quay.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\\n: exit status 1\"", retrying
    Reason:                RequiredPoolsFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-08-05T02:41:04Z
    Message:               Cluster not available for 4.6.0-0.nightly-2020-08-04-210224
...
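Since this profile relies on the mirror registry, a quick sanity check would be to confirm the mirror mapping exists and was rendered onto the node (a sketch; the node name is the degraded master from the status above):

# Show the mirror mappings the cluster should be applying:
oc get imagecontentsourcepolicy -o yaml | grep -B1 -A3 'mirrors:'

# Confirm the MCO rendered them into the node's registries.conf:
oc debug node/ugdci05072549-x2q48-master-2 -- chroot /host cat /etc/containers/registries.conf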
Reproduced the issue in QE CI upgrade testing. Below are the details:

Build details: 4.5.5-x86_64 -> 4.6.0-0.nightly-2020-08-05-153221
Matrix: 14_Disconnected IPI on Azure & Private Cluster
Upgrade command:
./oc adm upgrade --to-image=ugdci06022546.mirror-registry.qe.azure.devcluster.openshift.com:5000/openshift-release-dev/ocp-release:4.6.0-0.nightly-2020-08-05-153221 --force=true --allow-explicit-upgrade=true

Hit the same problem:
========================
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Abnormal node details~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Name:               ugdci06022546-lk2z9-master-2
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=Standard_D8s_v3
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=centralus
                    failure-domain.beta.kubernetes.io/zone=centralus-1
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ugdci06022546-lk2z9-master-2
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/master=
                    node.kubernetes.io/instance-type=Standard_D8s_v3
                    node.openshift.io/os_id=rhcos
                    topology.kubernetes.io/region=centralus
                    topology.kubernetes.io/zone=centralus-1
Annotations:        machine.openshift.io/machine: openshift-machine-api/ugdci06022546-lk2z9-master-2
                    machineconfiguration.openshift.io/currentConfig: rendered-master-7ff4949bb27c168d12c4ed000abfbea5
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-e805edb843a99dd5f5f82c9c07789565
                    machineconfiguration.openshift.io/reason: failed to run command oc (6 tries): timed out waiting for the condition: running oc image extract --path /:/run/mco-machine-os-content/os-... : exit status 1
                    machineconfiguration.openshift.io/state: Degraded
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 05 Aug 2020 15:02:27 -0400
Taints:             node-role.kubernetes.io/master:NoSchedule
                    node.kubernetes.io/unschedulable:NoSchedule
Unschedulable:      true
Lease:
  HolderIdentity:   ugdci06022546-lk2z9-master-2

Last Sync Error:  pool master has not progressed to latest configuration: controller version mismatch for rendered-master-7ff4949bb27c168d12c4ed000abfbea5 expected ade383fc8b27be6bdc6aa7985b3154350beaec88 has 807abb900cf9976a1baad66eab17c6d76016e7b7: pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node ugdci06022546-lk2z9-master-2 is reporting: \"failed to run command oc (6 tries): timed out waiting for the condition: running oc image extract --path /:/run/mco-machine-os-content/os-content-308854769 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:16a8dde4b893ff0b5b4aeb05474f2f5e2ce9cac45d5d3e98b40c4309e23215a7 failed: error: unable to connect to image repository quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:16a8dde4b893ff0b5b4aeb05474f2f5e2ce9cac45d5d3e98b40c4309e23215a7: Get https://quay.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\\n: exit status 1\"", retrying
Hit a similar issue on another matrix and upgrade path. Below are the details:

Build details: 4.5.5-x86_64 -> 4.6.0-0.nightly-2020-08-05-174122
Matrix: 26_Disconnected IPI on OSP13 with https_proxy & Etcd Encryption on
Upgrade command:
oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-05-174122 --force=true --allow-explicit-upgrade=true

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Abnormal node details~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Name:               ugdci06040054-dl2kr-master-0
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m1.xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=regionOne
                    failure-domain.beta.kubernetes.io/zone=nova
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ugdci06040054-dl2kr-master-0
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/master=
                    node.kubernetes.io/instance-type=m1.xlarge
                    node.openshift.io/os_id=rhcos
                    topology.kubernetes.io/region=regionOne
                    topology.kubernetes.io/zone=nova
Annotations:        machine.openshift.io/machine: openshift-machine-api/ugdci06040054-dl2kr-master-0
                    machineconfiguration.openshift.io/currentConfig: rendered-master-af74a630fea233a531a58c2184fcaa29
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-c1501d842176d612c7c001810cd00069
                    machineconfiguration.openshift.io/reason: failed to run command oc (6 tries): timed out waiting for the condition: running oc image extract --path /:/run/mco-machine-os-content/os-... : exit status 1
                    machineconfiguration.openshift.io/state: Degraded
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 05 Aug 2020 16:15:35 -0400
Taints:             node-role.kubernetes.io/master:NoSchedule
                    node.kubernetes.io/unschedulable:NoSchedule
Unschedulable:      true
Lease:
  HolderIdentity:   ugdci06040054-dl2kr-master-0
  AcquireTime:      <unset>
  RenewTime:        Wed, 05 Aug 2020 20:02:29 -0400

Conditions:
    Message:  Unable to apply 4.6.0-0.nightly-2020-08-05-174122: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-af74a630fea233a531a58c2184fcaa29 expected ade383fc8b27be6bdc6aa7985b3154350beaec88 has 807abb900cf9976a1baad66eab17c6d76016e7b7: pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node ugdci06040054-dl2kr-master-0 is reporting: \"failed to run command oc (6 tries): timed out waiting for the condition: running oc image extract --path /:/run/mco-machine-os-content/os-content-507018740 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:16a8dde4b893ff0b5b4aeb05474f2f5e2ce9cac45d5d3e98b40c4309e23215a7 failed: error: unable to connect to image repository quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:16a8dde4b893ff0b5b4aeb05474f2f5e2ce9cac45d5d3e98b40c4309e23215a7: Get https://quay.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\\n: exit status 1\"", retrying
    Reason:   RequiredPoolsFailed
    Status:   True
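In this https_proxy matrix, connectivity can be sanity-checked from the degraded master (a sketch; the proxy URL is a hypothetical placeholder for the cluster's configured httpsProxy):

# A direct request to quay.io should time out in this env:
oc debug node/ugdci06040054-dl2kr-master-0 -- chroot /host curl -sS --max-time 10 https://quay.io/v2/

# The same request via the proxy should get an HTTP response back:
oc debug node/ugdci06040054-dl2kr-master-0 -- chroot /host curl -sS --max-time 10 -x http://proxy.example.com:3128 https://quay.io/v2/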
The must-gather provided in comment #9 shows the same behaviour as in bug 1862979:

2020-08-06T03:35:16.45502974Z error: unable to connect to image repository quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bb8cb875ed5ef903df8f3f056a3d48eaf4cca3b34af02a9d6728125ff507bcdc: Get https://quay.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2020-08-06T03:35:16.457232233Z W0806 03:35:16.457158  264417 run.go:44] oc failed: running oc image extract --path /:/run/mco-machine-os-content/os-content-688451146 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bb8cb875ed5ef903df8f3f056a3d48eaf4cca3b34af02a9d6728125ff507bcdc failed: error: unable to connect to image repository quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bb8cb875ed5ef903df8f3f056a3d48eaf4cca3b34af02a9d6728125ff507bcdc: Get https://quay.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Updated the bug to mention proxy environments too.

*** This bug has been marked as a duplicate of bug 1862979 ***
See https://bugzilla.redhat.com/show_bug.cgi?id=1862979#c19

*** This bug has been marked as a duplicate of bug 1862979 ***
For future reference, the proxy issue was fixed in bug https://bugzilla.redhat.com/show_bug.cgi?id=1857162