Created attachment 1793996 [details]
must-gather

A RHEL node is stuck in SchedulingDisabled state after upgrading from 4.7.17 to 4.8.0-0.nightly-2021-06-22-192915.

Profile: UPI on Azure, HTTP Proxy, FIPS, etcd encryption.

The worker machine config pool is stuck with UPDATING=True.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-06-22-192915   True        False         5h32m   Cluster version is 4.8.0-0.nightly-2021-06-22-192915

$ oc get nodes -o wide
NAME   STATUS   ROLES   AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE   KERNEL-VERSION   CONTAINER-RUNTIME
schoudha241528-06240728-master-0   Ready   master   9h   v1.21.0-rc.0+120883f   10.0.0.8   <none>   Red Hat Enterprise Linux CoreOS 48.84.202106220017-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-10.rhaos4.8.gitd3e59a4.el8
schoudha241528-06240728-master-1   Ready   master   9h   v1.21.0-rc.0+120883f   10.0.0.7   <none>   Red Hat Enterprise Linux CoreOS 48.84.202106220017-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-10.rhaos4.8.gitd3e59a4.el8
schoudha241528-06240728-master-2   Ready   master   9h   v1.21.0-rc.0+120883f   10.0.0.6   <none>   Red Hat Enterprise Linux CoreOS 48.84.202106220017-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-10.rhaos4.8.gitd3e59a4.el8
schoudha241528-06240728-rhel-0   Ready,SchedulingDisabled   worker   7h48m   v1.21.0-rc.0+766a5fe   10.0.1.7   <none>   Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.31.1.el7.x86_64   cri-o://1.21.1-10.rhaos4.8.gitd3e59a4.el7
schoudha241528-06240728-rhel-1   Ready   worker   7h48m   v1.20.0+87cc9a4   10.0.1.8   <none>   Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.31.1.el7.x86_64   cri-o://1.20.3-4.rhaos4.7.gitbaade70.el7
schoudha241528-06240728-worker-centralus-1   Ready   worker   8h   v1.21.0-rc.0+120883f   10.0.1.5   <none>   Red Hat Enterprise Linux CoreOS 48.84.202106220017-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-10.rhaos4.8.gitd3e59a4.el8
schoudha241528-06240728-worker-centralus-2   Ready   worker   8h   v1.21.0-rc.0+120883f   10.0.1.4   <none>   Red Hat Enterprise Linux CoreOS 48.84.202106220017-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-10.rhaos4.8.gitd3e59a4.el8
schoudha241528-06240728-worker-centralus-3   Ready   worker   8h   v1.21.0-rc.0+120883f   10.0.1.6   <none>   Red Hat Enterprise Linux CoreOS 48.84.202106220017-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-10.rhaos4.8.gitd3e59a4.el8

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-69913347f90de5cdcb6775b82a9ca3b4   True      False      False      3              3                   3                     0                      9h
worker   rendered-worker-85d54b151dc3dae658b002023a688f7a   False     True       False      5              4                   5                     0                      9h

$ oc describe mcp worker
Name:         worker
Namespace:
Labels:       machineconfiguration.openshift.io/mco-built-in=
              pools.operator.machineconfiguration.openshift.io/worker=
Annotations:  <none>
API Version:  machineconfiguration.openshift.io/v1
Kind:         MachineConfigPool
Metadata:
  Creation Timestamp:  2021-06-24T07:51:56Z
  Generation:          4
  Managed Fields:
    API Version:  machineconfiguration.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .:
          f:machineconfiguration.openshift.io/mco-built-in:
          f:pools.operator.machineconfiguration.openshift.io/worker:
      f:spec:
        .:
        f:configuration:
          .:
          f:source:
        f:machineConfigSelector:
          .:
          f:matchLabels:
            .:
            f:machineconfiguration.openshift.io/role:
        f:nodeSelector:
          .:
          f:matchLabels:
            .:
            f:node-role.kubernetes.io/worker:
        f:paused:
    Manager:      machine-config-operator
    Operation:    Update
    Time:         2021-06-24T07:51:56Z
    API Version:  machineconfiguration.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:configuration:
          f:name:
          f:source:
      f:status:
        .:
        f:conditions:
        f:configuration:
          .:
          f:name:
          f:source:
        f:degradedMachineCount:
        f:machineCount:
        f:observedGeneration:
        f:readyMachineCount:
        f:unavailableMachineCount:
        f:updatedMachineCount:
    Manager:      machine-config-controller
    Operation:    Update
    Time:         2021-06-24T07:52:50Z
  Resource Version:  124500
  UID:               13a6ba4a-1766-42ad-ac72-4d88215d6de2
Spec:
  Configuration:
    Name:  rendered-worker-85d54b151dc3dae658b002023a688f7a
    Source:
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         00-worker
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-worker-container-runtime
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-worker-kubelet
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-fips
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-generated-registries
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-ssh
  Machine Config Selector:
    Match Labels:
      machineconfiguration.openshift.io/role:  worker
  Node Selector:
    Match Labels:
      node-role.kubernetes.io/worker:
  Paused:  false
Status:
  Conditions:
    Last Transition Time:  2021-06-24T07:52:45Z
    Message:
    Reason:
    Status:                False
    Type:                  RenderDegraded
    Last Transition Time:  2021-06-24T07:52:50Z
    Message:
    Reason:
    Status:                False
    Type:                  NodeDegraded
    Last Transition Time:  2021-06-24T07:52:50Z
    Message:
    Reason:
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2021-06-24T11:27:02Z
    Message:
    Reason:
    Status:                False
    Type:                  Updated
    Last Transition Time:  2021-06-24T11:27:02Z
    Message:               All nodes are updating to rendered-worker-85d54b151dc3dae658b002023a688f7a
    Reason:
    Status:                True
    Type:                  Updating
  Configuration:
    Name:  rendered-worker-85d54b151dc3dae658b002023a688f7a
    Source:
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         00-worker
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-worker-container-runtime
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-worker-kubelet
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-fips
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-generated-registries
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-ssh
  Degraded Machine Count:     0
  Machine Count:              5
  Observed Generation:        4
  Ready Machine Count:        4
  Unavailable Machine Count:  1
  Updated Machine Count:      5
Events:  <none>

$ oc describe node schoudha241528-06240728-rhel-0
Name:               schoudha241528-06240728-rhel-0
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=Standard_D4s_v3
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=centralus
                    failure-domain.beta.kubernetes.io/zone=0
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=schoudha241528-06240728-rhel-0
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=Standard_D4s_v3
                    node.openshift.io/os_id=rhel
                    topology.kubernetes.io/region=centralus
                    topology.kubernetes.io/zone=0
Annotations:        machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-85d54b151dc3dae658b002023a688f7a
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-85d54b151dc3dae658b002023a688f7a
                    machineconfiguration.openshift.io/reason:
                    machineconfiguration.openshift.io/ssh: accessed
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 24 Jun 2021 14:40:20 +0530
Taints:             node.kubernetes.io/unschedulable:NoSchedule
Unschedulable:      true
Lease:
  HolderIdentity:  schoudha241528-06240728-rhel-0
  AcquireTime:     <unset>
  RenewTime:       Thu, 24 Jun 2021 22:29:21 +0530
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Thu, 24 Jun 2021 22:28:51 +0530   Thu, 24 Jun 2021 17:06:36 +0530   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Thu, 24 Jun 2021 22:28:51 +0530   Thu, 24 Jun 2021 17:06:36 +0530   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Thu, 24 Jun 2021 22:28:51 +0530   Thu, 24 Jun 2021 17:06:36 +0530   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Thu, 24 Jun 2021 22:28:51 +0530   Thu, 24 Jun 2021 17:06:46 +0530   KubeletReady                 kubelet is posting ready status
Addresses:
  Hostname:    schoudha241528-06240728-rhel-0
  InternalIP:  10.0.1.7
Capacity:
  attachable-volumes-azure-disk:  8
  cpu:                            4
  ephemeral-storage:              28662Mi
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  memory:                         16265940Ki
  pods:                           250
Allocatable:
  attachable-volumes-azure-disk:  8
  cpu:                            3500m
  ephemeral-storage:              27048856737
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  memory:                         15114964Ki
  pods:                           250
System Info:
  Machine ID:                 72d5f8ee97d141a2bd5151b18a1b1c57
  System UUID:                62B4B769-FE93-406F-A14E-DE561431C2FE
  Boot ID:                    5b33eaaa-f05d-4ea2-9797-d2abcea7b397
  Kernel Version:             3.10.0-1160.31.1.el7.x86_64
  OS Image:                   Red Hat Enterprise Linux Server 7.9 (Maipo)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.21.1-10.rhaos4.8.gitd3e59a4.el7
  Kubelet Version:            v1.21.0-rc.0+766a5fe
  Kube-Proxy Version:         v1.21.0-rc.0+766a5fe
ProviderID:  azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/schoudha241528-06240728-rg/providers/Microsoft.Compute/virtualMachines/schoudha241528-06240728-rhel-0
Non-terminated Pods:  (12 in total)
  Namespace                                Name                                  CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                                ----                                  ------------  ----------  ---------------  -------------  ---
  openshift-cluster-node-tuning-operator   tuned-hv4sp                           10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         7h14m
  openshift-dns                            dns-default-spqqg                     60m (1%)      0 (0%)      110Mi (0%)       0 (0%)         6h57m
  openshift-dns                            node-resolver-4hwc9                   5m (0%)       0 (0%)      21Mi (0%)        0 (0%)         6h59m
  openshift-image-registry                 node-ca-mhh4l                         10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         7h14m
  openshift-ingress-canary                 ingress-canary-vb8j4                  10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         7h14m
  openshift-machine-config-operator        machine-config-daemon-5bq6w           40m (1%)      0 (0%)      100Mi (0%)       0 (0%)         6h2m
  openshift-monitoring                     node-exporter-vfdzq                   9m (0%)       0 (0%)      47Mi (0%)        0 (0%)         7h15m
  openshift-multus                         multus-additional-cni-plugins-fwsxn   10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         7h8m
  openshift-multus                         multus-rmp8q                          10m (0%)      0 (0%)      65Mi (0%)        0 (0%)         7h3m
  openshift-multus                         network-metrics-daemon-9gzdb          20m (0%)      0 (0%)      120Mi (0%)       0 (0%)         7h8m
  openshift-network-diagnostics            network-check-target-sxbhm            10m (0%)      0 (0%)      15Mi (0%)        0 (0%)         7h4m
  openshift-sdn                            sdn-nbwmf                             115m (3%)     0 (0%)      240Mi (1%)       0 (0%)         7h7m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests    Limits
  --------                       --------    ------
  cpu                            309m (8%)   0 (0%)
  memory                         808Mi (5%)  0 (0%)
  ephemeral-storage              0 (0%)      0 (0%)
  hugepages-1Gi                  0 (0%)      0 (0%)
  hugepages-2Mi                  0 (0%)      0 (0%)
  attachable-volumes-azure-disk  0           0
Events:  <none>
Reproduced when upgrading from 4.7.18 to a 4.8 nightly.

$ oc get node
NAME                                         STATUS                     ROLES    AGE     VERSION
minmli25111228-06250313-master-0             Ready                      master   5h22m   v1.21.0-rc.0+766a5fe
minmli25111228-06250313-master-1             Ready                      master   5h22m   v1.21.0-rc.0+766a5fe
minmli25111228-06250313-master-2             Ready                      master   5h22m   v1.21.0-rc.0+766a5fe
minmli25111228-06250313-rhel-0               Ready,SchedulingDisabled   worker   4h      v1.21.0-rc.0+766a5fe
minmli25111228-06250313-rhel-1               Ready                      worker   4h1m    v1.20.0+87cc9a4
minmli25111228-06250313-worker-centralus-1   Ready                      worker   5h6m    v1.21.0-rc.0+766a5fe
minmli25111228-06250313-worker-centralus-2   Ready                      worker   5h6m    v1.21.0-rc.0+766a5fe
minmli25111228-06250313-worker-centralus-3   Ready                      worker   5h6m    v1.21.0-rc.0+766a5fe

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-233762ad75c31053efe877cec0214894   True      False      False      3              3                   3                     0                      5h21m
worker   rendered-worker-c75b8d54476674a8f9124f786e8bfd20   False     True       False      5              4                   5                     0                      5h21m

From the currentConfig and desiredConfig lines below, the node thinks it has rolled out to the desiredConfig. But the worker mcp says otherwise.

$ oc get node minmli25111228-06250313-rhel-0 -o yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
    machineconfiguration.openshift.io/currentConfig: rendered-worker-c75b8d54476674a8f9124f786e8bfd20 // ***
    machineconfiguration.openshift.io/desiredConfig: rendered-worker-c75b8d54476674a8f9124f786e8bfd20 // ***
    machineconfiguration.openshift.io/reason: ""
    machineconfiguration.openshift.io/ssh: accessed
    machineconfiguration.openshift.io/state: Done
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
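For anyone retracing this, the mismatch can be cross-checked directly from the CLI with standard oc/jsonpath queries (the node name here is from this cluster; adjust for yours):

```
# What the node thinks it is running:
oc get node minmli25111228-06250313-rhel-0 \
  -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}'

# What the pool wants, and why it is still Updating:
oc get mcp worker -o jsonpath='{.spec.configuration.name}'
oc get mcp worker -o jsonpath='{.status.unavailableMachineCount}'
```

If the first two values match but unavailableMachineCount is non-zero, the node has applied the config and something else (here, the lingering cordon) is keeping the pool from reporting Updated.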
If this is an openshift-ansible problem we need the verbose Ansible logs of the upgrade.yml playbook run. If this is happening before openshift-ansible is run to complete the upgrade, then another component is responsible for the unschedulable state of the RHEL node.
Looking at the MCD log for machine-config-daemon-hbz4g, the MCD cordoned and drained the node and applied the config, but did not uncordon the node:

2021-07-01T06:20:45.832769808Z I0701 06:20:45.832609   63608 update.go:1874] Node has been successfully cordoned

Moving this back to the MCO for further investigation into why the MCD did not uncordon the node and why the MCP rollout is not progressing.
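As a stopgap for clusters stuck in this state (assuming the node really has finished applying the rendered config, as the annotations above indicate), the usual manual recovery is to clear the cordon by hand and watch the pool settle; this is offered as a workaround suggestion, not a fix for the underlying bug:

```
# Confirm the node already reports the desired rendered config:
oc get node schoudha241528-06240728-rhel-0 \
  -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}'

# Manually clear the cordon the MCD left behind, then watch the pool:
oc adm uncordon schoudha241528-06240728-rhel-0
oc get mcp worker -w
```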
One thing to note: the MCO doesn't perform OS updates on RHEL nodes; it only applies file and systemd unit updates. Should the RHEL node be updated through the Ansible playbooks first? Also, who updates the kubelet and other key components on RHEL nodes?
As requested in comment 16, please attach the Ansible log from the upgrade.yml playbook. Moving back to openshift-ansible.
Let's also make sure we can get access to the stuck node, in case this can only be debugged by looking at logs on the node itself.
Looking at the openshift-ansible upgrade.yml log, I see the task failed waiting for the node to come back after reboot. However, the node is actually reporting Ready, so the node appears to be up.

Given that this is a proxy environment, the issue is likely that the task waiting for the reboot does not use proxy vars.

During scaleup, proxy vars are used here:
https://github.com/openshift/openshift-ansible/blob/24d5991b20a414133d819eb3c86b50c4c76b1591/roles/openshift_node/tasks/config.yml#L192

During upgrade, proxy vars are not used here:
https://github.com/openshift/openshift-ansible/blob/24d5991b20a414133d819eb3c86b50c4c76b1591/roles/openshift_node/tasks/apply_machine_config.yml#L84

Are there QE jobs that test proxy in other environments? I can open a PR to add the proxy vars to the upgrade path, but I would like to confirm whether proxy has been tested and found to be working in other environments.
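If the proxy theory holds, the fix would presumably mirror what the scaleup task already does. A rough sketch of what that change might look like in apply_machine_config.yml (hypothetical; the variable names below are placeholders and would need to match whatever the linked config.yml task actually uses):

```yaml
# Hypothetical sketch: pass the cluster proxy settings into the reboot task,
# mirroring the environment block used on the scaleup path in config.yml.
- name: Reboot the host and wait for it to come back
  reboot:
    reboot_timeout: 600
  environment:
    HTTP_PROXY: "{{ http_proxy | default(omit) }}"
    HTTPS_PROXY: "{{ https_proxy | default(omit) }}"
    NO_PROXY: "{{ no_proxy | default(omit) }}"
```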
> the issue is likely related to the fact the task waiting for reboot does not use proxy vars.
>
> During scaleup, proxy vars are used here:
> https://github.com/openshift/openshift-ansible/blob/24d5991b20a414133d819eb3c86b50c4c76b1591/roles/openshift_node/tasks/config.yml#L192
>
> During upgrade, proxy vars are not used here:
> https://github.com/openshift/openshift-ansible/blob/24d5991b20a414133d819eb3c86b50c4c76b1591/roles/openshift_node/tasks/apply_machine_config.yml#L84

Does the Ansible 'reboot' module need a proxy? My understanding is that proxy vars are only needed when a playbook task has to reach external endpoints, and the 'reboot' module does not, so I would guess no. I would even argue we should remove the proxy vars from the scaleup code. If I am wrong, please correct me.
> Are there QE jobs that test proxy in other environments?

Yes. Here is a job link for an upgrade from 4.7.19-x86_64 --> 4.8.0-0.nightly-2021-07-01-185624 with profile 21_Disconnected IPI on GCP with RHCOS & RHEL7.9 & FIPS on & http_proxy & Etcd Encryption on:
https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/upgrade_CI/15499/console

We can see that the RHEL nodes get upgraded and are running cri-o 1.21, and all of the operator states look good.

07-03 08:20:53.263 Post action: #oc get node:
NAME   STATUS   ROLES   AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE   KERNEL-VERSION   CONTAINER-RUNTIME
07-03 08:20:53.263 tsze03035008-fv6rf-master-0.c.openshift-qe.internal   Ready   master   4h14m   v1.21.0-rc.0+1622f87   10.0.0.5   <none>   Red Hat Enterprise Linux CoreOS 48.84.202106301921-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
07-03 08:20:53.263 tsze03035008-fv6rf-master-1.c.openshift-qe.internal   Ready   master   4h14m   v1.21.0-rc.0+1622f87   10.0.0.4   <none>   Red Hat Enterprise Linux CoreOS 48.84.202106301921-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
07-03 08:20:53.263 tsze03035008-fv6rf-master-2.c.openshift-qe.internal   Ready   master   4h14m   v1.21.0-rc.0+1622f87   10.0.0.6   <none>   Red Hat Enterprise Linux CoreOS 48.84.202106301921-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
07-03 08:20:53.263 tsze03035008-fv6rf-w-a-l-rhel-0   Ready   worker   177m   v1.21.1+66b664d   10.0.32.5   <none>   Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.31.1.el7.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el7
07-03 08:20:53.263 tsze03035008-fv6rf-w-a-l-rhel-1   Ready   worker   177m   v1.21.1+66b664d   10.0.32.6   <none>   Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.31.1.el7.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el7
07-03 08:20:53.263 tsze03035008-fv6rf-worker-a-zs4nz.c.openshift-qe.internal   Ready   worker   3h57m   v1.21.0-rc.0+1622f87   10.0.32.4   <none>   Red Hat Enterprise Linux CoreOS 48.84.202106301921-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
07-03 08:20:53.263 tsze03035008-fv6rf-worker-b-sdgrh.c.openshift-qe.internal   Ready   worker   3h57m   v1.21.0-rc.0+1622f87   10.0.32.2   <none>   Red Hat Enterprise Linux CoreOS 48.84.202106301921-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
07-03 08:20:53.263 tsze03035008-fv6rf-worker-c-7j9zf.c.openshift-qe.internal   Ready   worker   3h57m   v1.21.0-rc.0+1622f87   10.0.32.3   <none>   Red Hat Enterprise Linux CoreOS 48.84.202106301921-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8

07-03 08:20:53.263 Post action: #oc get co:
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
07-03 08:20:53.263 authentication                             4.8.0-0.nightly-2021-07-01-185624   True   False   False   37m
07-03 08:20:53.263 baremetal                                  4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h13m
07-03 08:20:53.263 cloud-credential                           4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h18m
07-03 08:20:53.263 cluster-autoscaler                         4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h12m
07-03 08:20:53.263 config-operator                            4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h13m
07-03 08:20:53.263 console                                    4.8.0-0.nightly-2021-07-01-185624   True   False   False   42m
07-03 08:20:53.263 csi-snapshot-controller                    4.8.0-0.nightly-2021-07-01-185624   True   False   False   3h44m
07-03 08:20:53.264 dns                                        4.8.0-0.nightly-2021-07-01-185624   True   False   False   127m
07-03 08:20:53.264 etcd                                       4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h11m
07-03 08:20:53.264 image-registry                             4.8.0-0.nightly-2021-07-01-185624   True   False   False   3h56m
07-03 08:20:53.264 ingress                                    4.8.0-0.nightly-2021-07-01-185624   True   False   False   142m
07-03 08:20:53.264 insights                                   4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h5m
07-03 08:20:53.264 kube-apiserver                             4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h10m
07-03 08:20:53.264 kube-controller-manager                    4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h11m
07-03 08:20:53.264 kube-scheduler                             4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h10m
07-03 08:20:53.264 kube-storage-version-migrator              4.8.0-0.nightly-2021-07-01-185624   True   False   False   21m
07-03 08:20:53.264 machine-api                                4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h2m
07-03 08:20:53.264 machine-approver                           4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h12m
07-03 08:20:53.264 machine-config                             4.8.0-0.nightly-2021-07-01-185624   True   False   False   37m
07-03 08:20:53.264 marketplace                                4.8.0-0.nightly-2021-07-01-185624   True   False   False   3h40m
07-03 08:20:53.264 monitoring                                 4.8.0-0.nightly-2021-07-01-185624   True   False   False   140m
07-03 08:20:53.264 network                                    4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h12m
07-03 08:20:53.264 node-tuning                                4.8.0-0.nightly-2021-07-01-185624   True   False   False   142m
07-03 08:20:53.264 openshift-apiserver                        4.8.0-0.nightly-2021-07-01-185624   True   False   False   37m
07-03 08:20:53.264 openshift-controller-manager               4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h12m
07-03 08:20:53.264 openshift-samples                          4.8.0-0.nightly-2021-07-01-185624   True   False   False   142m
07-03 08:20:53.264 operator-lifecycle-manager                 4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h12m
07-03 08:20:53.264 operator-lifecycle-manager-catalog         4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h12m
07-03 08:20:53.264 operator-lifecycle-manager-packageserver   4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h5m
07-03 08:20:53.264 service-ca                                 4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h13m
07-03 08:20:53.264 storage                                    4.8.0-0.nightly-2021-07-01-185624   True   False   False   49m
Given that 4.7.20 will be the minimum version offered via the upgrade graph, that we believe this was fixed via other unspecified fixes to the 4.7 MCO and/or kubelet changes, and that we've been unable to reproduce this when upgrading from 4.7.19 or higher to 4.8, I'm closing this bug as CLOSED CURRENTRELEASE. If we can reproduce it when upgrading from 4.7.19 or higher, let's re-open it. Ryan or others are welcome to provide details on the suspected MCO changes that fixed the problem here.
I reran the upgrade job with 4.7.20 -> 4.8.0-rc.3 and it failed (twice).

Post action: #oc get node:
NAME   STATUS   ROLES   AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE   KERNEL-VERSION   CONTAINER-RUNTIME
07-09 15:17:37.858 tsze09230733-07091508-master-0   Ready   master   3h48m   v1.21.1+f36aa36   10.0.0.7   <none>   Red Hat Enterprise Linux CoreOS 48.84.202107040900-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
07-09 15:17:37.858 tsze09230733-07091508-master-1   Ready   master   3h48m   v1.21.1+f36aa36   10.0.0.8   <none>   Red Hat Enterprise Linux CoreOS 48.84.202107040900-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
07-09 15:17:37.858 tsze09230733-07091508-master-2   Ready   master   3h48m   v1.21.1+f36aa36   10.0.0.6   <none>   Red Hat Enterprise Linux CoreOS 48.84.202107040900-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
07-09 15:17:37.858 tsze09230733-07091508-rhel-0   Ready,SchedulingDisabled   worker   154m   v1.21.1+f36aa36   10.0.1.8   <none>   Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.31.1.el7.x86_64   cri-o://1.21.1-13.rhaos4.8.git8d20153.el7
07-09 15:17:37.858 tsze09230733-07091508-rhel-1   Ready   worker   154m   v1.20.0+bd7b30d   10.0.1.7   <none>   Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.31.1.el7.x86_64   cri-o://1.20.3-7.rhaos4.7.git41925ef.el7
07-09 15:17:37.858 tsze09230733-07091508-worker-centralus-1   Ready   worker   3h33m   v1.21.1+f36aa36   10.0.1.4   <none>   Red Hat Enterprise Linux CoreOS 48.84.202107040900-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
07-09 15:17:37.858 tsze09230733-07091508-worker-centralus-2   Ready   worker   3h32m   v1.21.1+f36aa36   10.0.1.6   <none>   Red Hat Enterprise Linux CoreOS 48.84.202107040900-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
07-09 15:17:37.858 tsze09230733-07091508-worker-centralus-3   Ready   worker   3h33m   v1.21.1+f36aa36   10.0.1.5   <none>   Red Hat Enterprise Linux CoreOS 48.84.202107040900-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8

Profile used: 03_Disconnected UPI on Azure with RHCOS & RHEL7.9 & FIPS on & http_proxy & Etcd Encryption on
job/upgrade_CI/15669/

must-gather says:
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information.
ClusterID: e73eff19-f35e-459d-be20-ee9c66247b96
ClusterVersion: Stable at "4.8.0-rc.3"
ClusterOperators:
	clusteroperator/machine-config is not upgradeable because One or more machine config pools are updating, please see `oc get mcp` for further details
The must-gather is too big to attach here but is available on request.
Reproduced with a regular Azure cluster behind a proxy with RHEL nodes. Upgrading from 4.7.20 -> 4.8.0, the RHEL node still failed to reboot.

TASK [openshift_node : Reboot the host and wait for it to come back] ***********
Wednesday 14 July 2021  16:36:47 +0800 (0:00:00.558)       0:15:19.258 ********
fatal: [10.0.1.8]: FAILED! => {"changed": false, "elapsed": 613, "msg": "Timed out waiting for last boot time check (timeout=600)", "rebooted": true}

# oc get node -owide
NAME   STATUS   ROLES   AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE   KERNEL-VERSION   CONTAINER-RUNTIME
yangyang-bz-07140153-master-0   Ready   master   7h51m   v1.21.1+f36aa36   10.0.0.6   <none>   Red Hat Enterprise Linux CoreOS 48.84.202107040900-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
yangyang-bz-07140153-master-1   Ready   master   7h51m   v1.21.1+f36aa36   10.0.0.8   <none>   Red Hat Enterprise Linux CoreOS 48.84.202107040900-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
yangyang-bz-07140153-master-2   Ready   master   7h51m   v1.21.1+f36aa36   10.0.0.7   <none>   Red Hat Enterprise Linux CoreOS 48.84.202107040900-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
yangyang-bz-07140153-rhel-0   Ready   worker   5h36m   v1.21.1+f36aa36   10.0.1.7   <none>   Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.31.1.el7.x86_64   cri-o://1.21.1-13.rhaos4.8.git8d20153.el7
yangyang-bz-07140153-rhel-1   Ready,SchedulingDisabled   worker   5h37m   v1.21.1+f36aa36   10.0.1.8   <none>   Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.31.1.el7.x86_64   cri-o://1.21.1-13.rhaos4.8.git8d20153.el7
yangyang-bz-07140153-worker-northcentralus-1   Ready   worker   7h35m   v1.21.1+f36aa36   10.0.1.4   <none>   Red Hat Enterprise Linux CoreOS 48.84.202107040900-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
yangyang-bz-07140153-worker-northcentralus-2   Ready   worker   7h35m   v1.21.1+f36aa36   10.0.1.5   <none>   Red Hat Enterprise Linux CoreOS 48.84.202107040900-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
yangyang-bz-07140153-worker-northcentralus-3   Ready   worker   7h35m   v1.21.1+f36aa36   10.0.1.6   <none>   Red Hat Enterprise Linux CoreOS 48.84.202107040900-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8

# oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0     True        False         False      107m
baremetal                                  4.8.0     True        False         False      7h48m
cloud-credential                           4.8.0     True        False         False      7h51m
cluster-autoscaler                         4.8.0     True        False         False      7h46m
config-operator                            4.8.0     True        False         False      7h48m
console                                    4.8.0     True        False         False      108m
csi-snapshot-controller                    4.8.0     True        False         False      7h42m
dns                                        4.8.0     True        False         False      130m
etcd                                       4.8.0     True        False         False      7h46m
image-registry                             4.8.0     True        False         False      7h33m
ingress                                    4.8.0     True        False         False      145m
insights                                   4.8.0     True        False         False      7h41m
kube-apiserver                             4.8.0     True        False         False      7h44m
kube-controller-manager                    4.8.0     True        False         False      7h44m
kube-scheduler                             4.8.0     True        False         False      7h46m
kube-storage-version-migrator              4.8.0     True        False         False      118m
machine-api                                4.8.0     True        False         False      7h42m
machine-approver                           4.8.0     True        False         False      7h47m
machine-config                             4.8.0     True        False         False      7h41m
marketplace                                4.8.0     True        False         False      7h46m
monitoring                                 4.8.0     True        False         False      143m
network                                    4.8.0     True        False         False      7h47m
node-tuning                                4.8.0     True        False         False      145m
openshift-apiserver                        4.8.0     True        False         False      107m
openshift-controller-manager               4.8.0     True        False         False      143m
openshift-samples                          4.8.0     True        False         False      145m
operator-lifecycle-manager                 4.8.0     True        False         False      7h47m
operator-lifecycle-manager-catalog         4.8.0     True        False         False      7h47m
operator-lifecycle-manager-packageserver   4.8.0     True        False         False      7h42m
service-ca                                 4.8.0     True        False         False      7h48m
storage                                    4.8.0     True        False         False      7h48m

The cluster is up and running and can be accessed using the kubeconfig:
https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/29826/artifact/workdir/install-dir/auth/kubeconfig
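One low-risk experiment to rule out a merely slow reboot: the failure message shows the Ansible reboot module's default 600-second limit (timeout=600), and that limit is tunable. A hypothetical tweak to the task, offered only to help bisect the failure (it would not help if SSH itself is broken):

```yaml
# Hypothetical: raise the reboot module's timeout above the 600s default
# seen in "Timed out waiting for last boot time check (timeout=600)".
- name: Reboot the host and wait for it to come back
  reboot:
    reboot_timeout: 1200
```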
I'm summarizing the reports of passing/failing cluster configurations; please correct me if I have these wrong. It appears that the same configuration works on IPI GCP but fails on UPI Azure. Are there other passing test jobs on UPI Azure? I'm trying to narrow down which combinations pass or fail to focus on potential issues. Since we know the code works on IPI GCP, it leads me to believe there is a platform component to the problem on Azure.

Passed:
  comment 34: 21_Disconnected IPI on GCP with RHCOS & RHEL7.9 & FIPS on & http_proxy & Etcd Encryption on

Failed:
  comment 37: 03_Disconnected UPI on Azure with RHCOS & RHEL7.9 & FIPS on & http_proxy & Etcd Encryption on
  comment 42: "regular azure cluster behind proxy with RHEL nodes"

From comment 42, were the nodes reporting Ready while the Reboot task was still retrying? I'm unable to access the must-gather in comment 44.
Per QE's CI test history, similar issues started happening with 4.6.38 as the upgrade target version, and only on Azure.
The problem with SSH access to the Azure nodes is now proven to happen before the upgrade. I am assigning the QA contact.
The ssh issue [1] should be fixed in [2]. With the ssh issue resolved, the upgrade should complete successfully. Please retest.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1984449
[2] https://amd64.ocp.releases.ci.openshift.org/releasestream/4.9.0-0.nightly/release/4.9.0-0.nightly-2021-07-30-090713
The problem is solved with the latest 4.9: I can ssh into Azure nodes without hitting the "PTY allocation request failed on channel 0" error, and the 4.8 -> 4.9 upgrade also worked. The 4.7 -> 4.8 upgrade still fails:
https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/upgrade_CI/16355/console
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it's always been like this, we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1
I think these questions should be asked on the linked bug [1], as that is the actual issue. This bug should probably just be closed NOTABUG or closed as a DUPLICATE of the linked bug because the issue identified here was just a result of ssh being broken on all nodes. The assignee on the linked bug would be in a better position to answer the questions as the fix is in that bug. [1] https://bugzilla.redhat.com/show_bug.cgi?id=1984449
(In reply to Russell Teague from comment #62)
> I think these questions should be asked on the linked bug [1]...
> ...
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1984449

Done [1].

> This bug should probably just be closed NOTABUG or closed as a DUPLICATE of the linked bug...

I'm not clear enough on what's going on to be able to make that call myself, so for now I'm just leaving UpgradeBlocker on here and adding ImpactStatementRequested. If someone more familiar with this series thinks it's appropriate to close it out, or just remove UpgradeBlocker, that's fine with me.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1984449#c10
*** This bug has been marked as a duplicate of bug 1984449 ***
Since this is closed as a dup, I'm dropping UpgradeBlocker, and we can sort that all out in bug 1984449.