Created attachment 1793996 [details]
must-gather

A RHEL node is stuck in SchedulingDisabled state after upgrading from 4.7.17 to 4.8.0-0.nightly-2021-06-22-192915.

Profile: UPI on Azure, HTTP Proxy, FIPS, etcd encryption.

The worker machine config pool is stuck with UPDATING=True.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-06-22-192915   True        False         5h32m   Cluster version is 4.8.0-0.nightly-2021-06-22-192915

$ oc get nodes -o wide
NAME   STATUS   ROLES   AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE   KERNEL-VERSION   CONTAINER-RUNTIME
schoudha241528-06240728-master-0   Ready   master   9h   v1.21.0-rc.0+120883f   10.0.0.8   <none>   Red Hat Enterprise Linux CoreOS 48.84.202106220017-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-10.rhaos4.8.gitd3e59a4.el8
schoudha241528-06240728-master-1   Ready   master   9h   v1.21.0-rc.0+120883f   10.0.0.7   <none>   Red Hat Enterprise Linux CoreOS 48.84.202106220017-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-10.rhaos4.8.gitd3e59a4.el8
schoudha241528-06240728-master-2   Ready   master   9h   v1.21.0-rc.0+120883f   10.0.0.6   <none>   Red Hat Enterprise Linux CoreOS 48.84.202106220017-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-10.rhaos4.8.gitd3e59a4.el8
schoudha241528-06240728-rhel-0   Ready,SchedulingDisabled   worker   7h48m   v1.21.0-rc.0+766a5fe   10.0.1.7   <none>   Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.31.1.el7.x86_64   cri-o://1.21.1-10.rhaos4.8.gitd3e59a4.el7
schoudha241528-06240728-rhel-1   Ready   worker   7h48m   v1.20.0+87cc9a4   10.0.1.8   <none>   Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.31.1.el7.x86_64   cri-o://1.20.3-4.rhaos4.7.gitbaade70.el7
schoudha241528-06240728-worker-centralus-1   Ready   worker   8h   v1.21.0-rc.0+120883f   10.0.1.5   <none>   Red Hat Enterprise Linux CoreOS 48.84.202106220017-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-10.rhaos4.8.gitd3e59a4.el8
schoudha241528-06240728-worker-centralus-2   Ready   worker   8h   v1.21.0-rc.0+120883f   10.0.1.4   <none>   Red Hat Enterprise Linux CoreOS 48.84.202106220017-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-10.rhaos4.8.gitd3e59a4.el8
schoudha241528-06240728-worker-centralus-3   Ready   worker   8h   v1.21.0-rc.0+120883f   10.0.1.6   <none>   Red Hat Enterprise Linux CoreOS 48.84.202106220017-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-10.rhaos4.8.gitd3e59a4.el8

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-69913347f90de5cdcb6775b82a9ca3b4   True      False      False      3              3                   3                     0                      9h
worker   rendered-worker-85d54b151dc3dae658b002023a688f7a   False     True       False      5              4                   5                     0                      9h

$ oc describe mcp worker
Name:         worker
Namespace:
Labels:       machineconfiguration.openshift.io/mco-built-in=
              pools.operator.machineconfiguration.openshift.io/worker=
Annotations:  <none>
API Version:  machineconfiguration.openshift.io/v1
Kind:         MachineConfigPool
Metadata:
  Creation Timestamp:  2021-06-24T07:51:56Z
  Generation:          4
  Managed Fields:
    API Version:  machineconfiguration.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .:
          f:machineconfiguration.openshift.io/mco-built-in:
          f:pools.operator.machineconfiguration.openshift.io/worker:
      f:spec:
        .:
        f:configuration:
          .:
          f:source:
        f:machineConfigSelector:
          .:
          f:matchLabels:
            .:
            f:machineconfiguration.openshift.io/role:
        f:nodeSelector:
          .:
          f:matchLabels:
            .:
            f:node-role.kubernetes.io/worker:
        f:paused:
    Manager:      machine-config-operator
    Operation:    Update
    Time:         2021-06-24T07:51:56Z
    API Version:  machineconfiguration.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:configuration:
          f:name:
          f:source:
      f:status:
        .:
        f:conditions:
        f:configuration:
          .:
          f:name:
          f:source:
        f:degradedMachineCount:
        f:machineCount:
        f:observedGeneration:
        f:readyMachineCount:
        f:unavailableMachineCount:
        f:updatedMachineCount:
    Manager:      machine-config-controller
    Operation:    Update
    Time:         2021-06-24T07:52:50Z
  Resource Version:  124500
  UID:               13a6ba4a-1766-42ad-ac72-4d88215d6de2
Spec:
  Configuration:
    Name:  rendered-worker-85d54b151dc3dae658b002023a688f7a
    Source:
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         00-worker
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-worker-container-runtime
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-worker-kubelet
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-fips
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-generated-registries
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-ssh
  Machine Config Selector:
    Match Labels:
      machineconfiguration.openshift.io/role:  worker
  Node Selector:
    Match Labels:
      node-role.kubernetes.io/worker:
  Paused:  false
Status:
  Conditions:
    Last Transition Time:  2021-06-24T07:52:45Z
    Message:
    Reason:
    Status:                False
    Type:                  RenderDegraded
    Last Transition Time:  2021-06-24T07:52:50Z
    Message:
    Reason:
    Status:                False
    Type:                  NodeDegraded
    Last Transition Time:  2021-06-24T07:52:50Z
    Message:
    Reason:
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2021-06-24T11:27:02Z
    Message:
    Reason:
    Status:                False
    Type:                  Updated
    Last Transition Time:  2021-06-24T11:27:02Z
    Message:               All nodes are updating to rendered-worker-85d54b151dc3dae658b002023a688f7a
    Reason:
    Status:                True
    Type:                  Updating
  Configuration:
    Name:  rendered-worker-85d54b151dc3dae658b002023a688f7a
    Source:
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         00-worker
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-worker-container-runtime
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-worker-kubelet
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-fips
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-generated-registries
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-ssh
  Degraded Machine Count:     0
  Machine Count:              5
  Observed Generation:        4
  Ready Machine Count:        4
  Unavailable Machine Count:  1
  Updated Machine Count:      5
Events:  <none>

$ oc describe node schoudha241528-06240728-rhel-0
Name:               schoudha241528-06240728-rhel-0
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=Standard_D4s_v3
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=centralus
                    failure-domain.beta.kubernetes.io/zone=0
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=schoudha241528-06240728-rhel-0
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=Standard_D4s_v3
                    node.openshift.io/os_id=rhel
                    topology.kubernetes.io/region=centralus
                    topology.kubernetes.io/zone=0
Annotations:        machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-85d54b151dc3dae658b002023a688f7a
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-85d54b151dc3dae658b002023a688f7a
                    machineconfiguration.openshift.io/reason:
                    machineconfiguration.openshift.io/ssh: accessed
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 24 Jun 2021 14:40:20 +0530
Taints:             node.kubernetes.io/unschedulable:NoSchedule
Unschedulable:      true
Lease:
  HolderIdentity:  schoudha241528-06240728-rhel-0
  AcquireTime:     <unset>
  RenewTime:       Thu, 24 Jun 2021 22:29:21 +0530
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Thu, 24 Jun 2021 22:28:51 +0530   Thu, 24 Jun 2021 17:06:36 +0530   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Thu, 24 Jun 2021 22:28:51 +0530   Thu, 24 Jun 2021 17:06:36 +0530   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Thu, 24 Jun 2021 22:28:51 +0530   Thu, 24 Jun 2021 17:06:36 +0530   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Thu, 24 Jun 2021 22:28:51 +0530   Thu, 24 Jun 2021 17:06:46 +0530   KubeletReady                 kubelet is posting ready status
Addresses:
  Hostname:    schoudha241528-06240728-rhel-0
  InternalIP:  10.0.1.7
Capacity:
  attachable-volumes-azure-disk:  8
  cpu:                            4
  ephemeral-storage:              28662Mi
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  memory:                         16265940Ki
  pods:                           250
Allocatable:
  attachable-volumes-azure-disk:  8
  cpu:                            3500m
  ephemeral-storage:              27048856737
  hugepages-1Gi:                  0
  hugepages-2Mi:                  0
  memory:                         15114964Ki
  pods:                           250
System Info:
  Machine ID:                 72d5f8ee97d141a2bd5151b18a1b1c57
  System UUID:                62B4B769-FE93-406F-A14E-DE561431C2FE
  Boot ID:                    5b33eaaa-f05d-4ea2-9797-d2abcea7b397
  Kernel Version:             3.10.0-1160.31.1.el7.x86_64
  OS Image:                   Red Hat Enterprise Linux Server 7.9 (Maipo)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.21.1-10.rhaos4.8.gitd3e59a4.el7
  Kubelet Version:            v1.21.0-rc.0+766a5fe
  Kube-Proxy Version:         v1.21.0-rc.0+766a5fe
ProviderID:  azure:///subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/schoudha241528-06240728-rg/providers/Microsoft.Compute/virtualMachines/schoudha241528-06240728-rhel-0
Non-terminated Pods:  (12 in total)
  Namespace                                Name                                  CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                                ----                                  ------------  ----------  ---------------  -------------  ---
  openshift-cluster-node-tuning-operator   tuned-hv4sp                           10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         7h14m
  openshift-dns                            dns-default-spqqg                     60m (1%)      0 (0%)      110Mi (0%)       0 (0%)         6h57m
  openshift-dns                            node-resolver-4hwc9                   5m (0%)       0 (0%)      21Mi (0%)        0 (0%)         6h59m
  openshift-image-registry                 node-ca-mhh4l                         10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         7h14m
  openshift-ingress-canary                 ingress-canary-vb8j4                  10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         7h14m
  openshift-machine-config-operator        machine-config-daemon-5bq6w           40m (1%)      0 (0%)      100Mi (0%)       0 (0%)         6h2m
  openshift-monitoring                     node-exporter-vfdzq                   9m (0%)       0 (0%)      47Mi (0%)        0 (0%)         7h15m
  openshift-multus                         multus-additional-cni-plugins-fwsxn   10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         7h8m
  openshift-multus                         multus-rmp8q                          10m (0%)      0 (0%)      65Mi (0%)        0 (0%)         7h3m
  openshift-multus                         network-metrics-daemon-9gzdb          20m (0%)      0 (0%)      120Mi (0%)       0 (0%)         7h8m
  openshift-network-diagnostics            network-check-target-sxbhm            10m (0%)      0 (0%)      15Mi (0%)        0 (0%)         7h4m
  openshift-sdn                            sdn-nbwmf                             115m (3%)     0 (0%)      240Mi (1%)       0 (0%)         7h7m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests    Limits
  --------                       --------    ------
  cpu                            309m (8%)   0 (0%)
  memory                         808Mi (5%)  0 (0%)
  ephemeral-storage              0 (0%)      0 (0%)
  hugepages-1Gi                  0 (0%)      0 (0%)
  hugepages-2Mi                  0 (0%)      0 (0%)
  attachable-volumes-azure-disk  0           0
Events:  <none>
Reproduced when upgrading from 4.7.18 to a 4.8 nightly.

$ oc get node
NAME                                         STATUS                     ROLES    AGE     VERSION
minmli25111228-06250313-master-0             Ready                      master   5h22m   v1.21.0-rc.0+766a5fe
minmli25111228-06250313-master-1             Ready                      master   5h22m   v1.21.0-rc.0+766a5fe
minmli25111228-06250313-master-2             Ready                      master   5h22m   v1.21.0-rc.0+766a5fe
minmli25111228-06250313-rhel-0               Ready,SchedulingDisabled   worker   4h      v1.21.0-rc.0+766a5fe
minmli25111228-06250313-rhel-1               Ready                      worker   4h1m    v1.20.0+87cc9a4
minmli25111228-06250313-worker-centralus-1   Ready                      worker   5h6m    v1.21.0-rc.0+766a5fe
minmli25111228-06250313-worker-centralus-2   Ready                      worker   5h6m    v1.21.0-rc.0+766a5fe
minmli25111228-06250313-worker-centralus-3   Ready                      worker   5h6m    v1.21.0-rc.0+766a5fe

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-233762ad75c31053efe877cec0214894   True      False      False      3              3                   3                     0                      5h21m
worker   rendered-worker-c75b8d54476674a8f9124f786e8bfd20   False     True       False      5              4                   5                     0                      5h21m

From the currentConfig and desiredConfig lines below, the node thinks it has rolled out to the desiredConfig. But the worker mcp says otherwise.

$ oc get node minmli25111228-06250313-rhel-0 -o yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
    machineconfiguration.openshift.io/currentConfig: rendered-worker-c75b8d54476674a8f9124f786e8bfd20 // ***
    machineconfiguration.openshift.io/desiredConfig: rendered-worker-c75b8d54476674a8f9124f786e8bfd20 // ***
    machineconfiguration.openshift.io/reason: ""
    machineconfiguration.openshift.io/ssh: accessed
    machineconfiguration.openshift.io/state: Done
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
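For anyone retracing this, the mismatch can be cross-checked directly from the CLI with standard oc/jsonpath queries (the node name here is from this cluster; adjust for yours):

```
# What the node thinks it is running:
oc get node minmli25111228-06250313-rhel-0 \
  -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}'

# What the pool wants, and why it is still Updating:
oc get mcp worker -o jsonpath='{.spec.configuration.name}'
oc get mcp worker -o jsonpath='{.status.unavailableMachineCount}'
```

If the first two values match but unavailableMachineCount is non-zero, the node has applied the config and something else (here, the lingering cordon) is keeping the pool from reporting Updated.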
If this is an openshift-ansible problem we need the verbose Ansible logs of the upgrade.yml playbook run. If this is happening before openshift-ansible is run to complete the upgrade, then another component is responsible for the unschedulable state of the RHEL node.
Looking at the MCD log for machine-config-daemon-hbz4g, the MCD cordoned and drained the node and applied the config, but did not uncordon the node:

2021-07-01T06:20:45.832769808Z I0701 06:20:45.832609   63608 update.go:1874] Node has been successfully cordoned

Moving this back to the MCO for further investigation into why the MCD did not uncordon the node and why the MCP rollout is not progressing.
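As a stopgap for clusters stuck in this state (assuming the node really has finished applying the rendered config, as the annotations above indicate), the usual manual recovery is to clear the cordon by hand and watch the pool settle; this is offered as a workaround suggestion, not a fix for the underlying bug:

```
# Confirm the node already reports the desired rendered config:
oc get node schoudha241528-06240728-rhel-0 \
  -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}'

# Manually clear the cordon the MCD left behind, then watch the pool:
oc adm uncordon schoudha241528-06240728-rhel-0
oc get mcp worker -w
```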
One thing to note: the MCO doesn't perform OS updates on RHEL nodes; it only applies file and systemd unit updates. Should the RHEL node be updated through the Ansible playbooks first? Also, who updates the kubelet and other key components on RHEL nodes?
As requested in comment 16, please attach the Ansible log from the upgrade.yml playbook. Moving back to openshift-ansible.
Let's also make sure we can get access to the stuck node, in case this can only be debugged by looking at logs on the node itself.
Looking at the openshift-ansible upgrade.yml log, I see the task failed waiting for the node to come back after reboot. However, the node is actually reporting Ready, so the node appears to be up.

Given that this is a proxy environment, the issue is likely that the task waiting for the reboot does not use proxy vars.

During scaleup, proxy vars are used here:
https://github.com/openshift/openshift-ansible/blob/24d5991b20a414133d819eb3c86b50c4c76b1591/roles/openshift_node/tasks/config.yml#L192

During upgrade, proxy vars are not used here:
https://github.com/openshift/openshift-ansible/blob/24d5991b20a414133d819eb3c86b50c4c76b1591/roles/openshift_node/tasks/apply_machine_config.yml#L84

Are there QE jobs that test proxy in other environments? I can open a PR to add the proxy vars to the upgrade path, but I would like to confirm whether proxy has been tested and found to be working in other environments.
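If the proxy theory holds, the fix would presumably mirror what the scaleup task already does. A rough sketch of what that change might look like in apply_machine_config.yml (hypothetical; the variable names below are placeholders and would need to match whatever the linked config.yml task actually uses):

```yaml
# Hypothetical sketch: pass the cluster proxy settings into the reboot task,
# mirroring the environment block used on the scaleup path in config.yml.
- name: Reboot the host and wait for it to come back
  reboot:
    reboot_timeout: 600
  environment:
    HTTP_PROXY: "{{ http_proxy | default(omit) }}"
    HTTPS_PROXY: "{{ https_proxy | default(omit) }}"
    NO_PROXY: "{{ no_proxy | default(omit) }}"
```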
> the issue is likely related to the fact the task waiting for reboot does not use proxy vars.
>
> During scaleup, proxy vars are used here:
> https://github.com/openshift/openshift-ansible/blob/24d5991b20a414133d819eb3c86b50c4c76b1591/roles/openshift_node/tasks/config.yml#L192
>
> During upgrade, proxy vars are not used here:
> https://github.com/openshift/openshift-ansible/blob/24d5991b20a414133d819eb3c86b50c4c76b1591/roles/openshift_node/tasks/apply_machine_config.yml#L84

Does the Ansible 'reboot' module need a proxy? My understanding is that proxy vars are only needed when a playbook task has to reach external endpoints, and the 'reboot' module does not, so I would guess no. I would even argue we should remove the proxy vars from the scaleup code. If I am wrong, please correct me.
> Are there QE jobs that test proxy in other environments?

Yes. Here is a job link for an upgrade from 4.7.19-x86_64 --> 4.8.0-0.nightly-2021-07-01-185624 with profile 21_Disconnected IPI on GCP with RHCOS & RHEL7.9 & FIPS on & http_proxy & Etcd Encryption on:
https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/upgrade_CI/15499/console

We can see that the RHEL nodes get upgraded and are running cri-o 1.21, and all of the operator states look good.

07-03 08:20:53.263 Post action: #oc get node:
NAME   STATUS   ROLES   AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE   KERNEL-VERSION   CONTAINER-RUNTIME
07-03 08:20:53.263 tsze03035008-fv6rf-master-0.c.openshift-qe.internal   Ready   master   4h14m   v1.21.0-rc.0+1622f87   10.0.0.5   <none>   Red Hat Enterprise Linux CoreOS 48.84.202106301921-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
07-03 08:20:53.263 tsze03035008-fv6rf-master-1.c.openshift-qe.internal   Ready   master   4h14m   v1.21.0-rc.0+1622f87   10.0.0.4   <none>   Red Hat Enterprise Linux CoreOS 48.84.202106301921-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
07-03 08:20:53.263 tsze03035008-fv6rf-master-2.c.openshift-qe.internal   Ready   master   4h14m   v1.21.0-rc.0+1622f87   10.0.0.6   <none>   Red Hat Enterprise Linux CoreOS 48.84.202106301921-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
07-03 08:20:53.263 tsze03035008-fv6rf-w-a-l-rhel-0   Ready   worker   177m   v1.21.1+66b664d   10.0.32.5   <none>   Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.31.1.el7.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el7
07-03 08:20:53.263 tsze03035008-fv6rf-w-a-l-rhel-1   Ready   worker   177m   v1.21.1+66b664d   10.0.32.6   <none>   Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.31.1.el7.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el7
07-03 08:20:53.263 tsze03035008-fv6rf-worker-a-zs4nz.c.openshift-qe.internal   Ready   worker   3h57m   v1.21.0-rc.0+1622f87   10.0.32.4   <none>   Red Hat Enterprise Linux CoreOS 48.84.202106301921-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
07-03 08:20:53.263 tsze03035008-fv6rf-worker-b-sdgrh.c.openshift-qe.internal   Ready   worker   3h57m   v1.21.0-rc.0+1622f87   10.0.32.2   <none>   Red Hat Enterprise Linux CoreOS 48.84.202106301921-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
07-03 08:20:53.263 tsze03035008-fv6rf-worker-c-7j9zf.c.openshift-qe.internal   Ready   worker   3h57m   v1.21.0-rc.0+1622f87   10.0.32.3   <none>   Red Hat Enterprise Linux CoreOS 48.84.202106301921-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8

07-03 08:20:53.263 Post action: #oc get co:
NAME   VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
07-03 08:20:53.263 authentication                             4.8.0-0.nightly-2021-07-01-185624   True   False   False   37m
07-03 08:20:53.263 baremetal                                  4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h13m
07-03 08:20:53.263 cloud-credential                           4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h18m
07-03 08:20:53.263 cluster-autoscaler                         4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h12m
07-03 08:20:53.263 config-operator                            4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h13m
07-03 08:20:53.263 console                                    4.8.0-0.nightly-2021-07-01-185624   True   False   False   42m
07-03 08:20:53.263 csi-snapshot-controller                    4.8.0-0.nightly-2021-07-01-185624   True   False   False   3h44m
07-03 08:20:53.264 dns                                        4.8.0-0.nightly-2021-07-01-185624   True   False   False   127m
07-03 08:20:53.264 etcd                                       4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h11m
07-03 08:20:53.264 image-registry                             4.8.0-0.nightly-2021-07-01-185624   True   False   False   3h56m
07-03 08:20:53.264 ingress                                    4.8.0-0.nightly-2021-07-01-185624   True   False   False   142m
07-03 08:20:53.264 insights                                   4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h5m
07-03 08:20:53.264 kube-apiserver                             4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h10m
07-03 08:20:53.264 kube-controller-manager                    4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h11m
07-03 08:20:53.264 kube-scheduler                             4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h10m
07-03 08:20:53.264 kube-storage-version-migrator              4.8.0-0.nightly-2021-07-01-185624   True   False   False   21m
07-03 08:20:53.264 machine-api                                4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h2m
07-03 08:20:53.264 machine-approver                           4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h12m
07-03 08:20:53.264 machine-config                             4.8.0-0.nightly-2021-07-01-185624   True   False   False   37m
07-03 08:20:53.264 marketplace                                4.8.0-0.nightly-2021-07-01-185624   True   False   False   3h40m
07-03 08:20:53.264 monitoring                                 4.8.0-0.nightly-2021-07-01-185624   True   False   False   140m
07-03 08:20:53.264 network                                    4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h12m
07-03 08:20:53.264 node-tuning                                4.8.0-0.nightly-2021-07-01-185624   True   False   False   142m
07-03 08:20:53.264 openshift-apiserver                        4.8.0-0.nightly-2021-07-01-185624   True   False   False   37m
07-03 08:20:53.264 openshift-controller-manager               4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h12m
07-03 08:20:53.264 openshift-samples                          4.8.0-0.nightly-2021-07-01-185624   True   False   False   142m
07-03 08:20:53.264 operator-lifecycle-manager                 4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h12m
07-03 08:20:53.264 operator-lifecycle-manager-catalog         4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h12m
07-03 08:20:53.264 operator-lifecycle-manager-packageserver   4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h5m
07-03 08:20:53.264 service-ca                                 4.8.0-0.nightly-2021-07-01-185624   True   False   False   4h13m
07-03 08:20:53.264 storage                                    4.8.0-0.nightly-2021-07-01-185624   True   False   False   49m
Given that 4.7.20 will be the minimum version offered via the upgrade graph, that we believe this was fixed via other unspecified fixes to the 4.7 MCO and/or kubelet changes, and that we've been unable to reproduce this when upgrading from 4.7.19 or higher to 4.8, I'm closing this bug as CLOSED CURRENTRELEASE. If we can reproduce it when upgrading from 4.7.19 or higher, let's re-open it. Ryan or others are welcome to provide details on the suspected MCO changes that fixed the problem here.
I reran the upgrade job with 4.7.20 -> 4.8.0-rc.3 and it failed (twice).

Post action: #oc get node:
NAME   STATUS   ROLES   AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE   KERNEL-VERSION   CONTAINER-RUNTIME
07-09 15:17:37.858 tsze09230733-07091508-master-0   Ready   master   3h48m   v1.21.1+f36aa36   10.0.0.7   <none>   Red Hat Enterprise Linux CoreOS 48.84.202107040900-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
07-09 15:17:37.858 tsze09230733-07091508-master-1   Ready   master   3h48m   v1.21.1+f36aa36   10.0.0.8   <none>   Red Hat Enterprise Linux CoreOS 48.84.202107040900-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
07-09 15:17:37.858 tsze09230733-07091508-master-2   Ready   master   3h48m   v1.21.1+f36aa36   10.0.0.6   <none>   Red Hat Enterprise Linux CoreOS 48.84.202107040900-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
07-09 15:17:37.858 tsze09230733-07091508-rhel-0   Ready,SchedulingDisabled   worker   154m   v1.21.1+f36aa36   10.0.1.8   <none>   Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.31.1.el7.x86_64   cri-o://1.21.1-13.rhaos4.8.git8d20153.el7
07-09 15:17:37.858 tsze09230733-07091508-rhel-1   Ready   worker   154m   v1.20.0+bd7b30d   10.0.1.7   <none>   Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.31.1.el7.x86_64   cri-o://1.20.3-7.rhaos4.7.git41925ef.el7
07-09 15:17:37.858 tsze09230733-07091508-worker-centralus-1   Ready   worker   3h33m   v1.21.1+f36aa36   10.0.1.4   <none>   Red Hat Enterprise Linux CoreOS 48.84.202107040900-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
07-09 15:17:37.858 tsze09230733-07091508-worker-centralus-2   Ready   worker   3h32m   v1.21.1+f36aa36   10.0.1.6   <none>   Red Hat Enterprise Linux CoreOS 48.84.202107040900-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
07-09 15:17:37.858 tsze09230733-07091508-worker-centralus-3   Ready   worker   3h33m   v1.21.1+f36aa36   10.0.1.5   <none>   Red Hat Enterprise Linux CoreOS 48.84.202107040900-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8

Profile used: 03_Disconnected UPI on Azure with RHCOS & RHEL7.9 & FIPS on & http_proxy & Etcd Encryption on
job/upgrade_CI/15669/

must-gather says:
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information.
ClusterID: e73eff19-f35e-459d-be20-ee9c66247b96
ClusterVersion: Stable at "4.8.0-rc.3"
ClusterOperators:
	clusteroperator/machine-config is not upgradeable because One or more machine config pools are updating, please see `oc get mcp` for further details
The must-gather is too big to attach here but is available on request.
Reproduced with a regular Azure cluster behind a proxy with RHEL nodes. Upgrading from 4.7.20 -> 4.8.0, the RHEL node still failed to reboot.

TASK [openshift_node : Reboot the host and wait for it to come back] ***********
Wednesday 14 July 2021  16:36:47 +0800 (0:00:00.558)       0:15:19.258 ********
fatal: [10.0.1.8]: FAILED! => {"changed": false, "elapsed": 613, "msg": "Timed out waiting for last boot time check (timeout=600)", "rebooted": true}

# oc get node -owide
NAME   STATUS   ROLES   AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE   KERNEL-VERSION   CONTAINER-RUNTIME
yangyang-bz-07140153-master-0   Ready   master   7h51m   v1.21.1+f36aa36   10.0.0.6   <none>   Red Hat Enterprise Linux CoreOS 48.84.202107040900-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
yangyang-bz-07140153-master-1   Ready   master   7h51m   v1.21.1+f36aa36   10.0.0.8   <none>   Red Hat Enterprise Linux CoreOS 48.84.202107040900-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
yangyang-bz-07140153-master-2   Ready   master   7h51m   v1.21.1+f36aa36   10.0.0.7   <none>   Red Hat Enterprise Linux CoreOS 48.84.202107040900-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
yangyang-bz-07140153-rhel-0   Ready   worker   5h36m   v1.21.1+f36aa36   10.0.1.7   <none>   Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.31.1.el7.x86_64   cri-o://1.21.1-13.rhaos4.8.git8d20153.el7
yangyang-bz-07140153-rhel-1   Ready,SchedulingDisabled   worker   5h37m   v1.21.1+f36aa36   10.0.1.8   <none>   Red Hat Enterprise Linux Server 7.9 (Maipo)   3.10.0-1160.31.1.el7.x86_64   cri-o://1.21.1-13.rhaos4.8.git8d20153.el7
yangyang-bz-07140153-worker-northcentralus-1   Ready   worker   7h35m   v1.21.1+f36aa36   10.0.1.4   <none>   Red Hat Enterprise Linux CoreOS 48.84.202107040900-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
yangyang-bz-07140153-worker-northcentralus-2   Ready   worker   7h35m   v1.21.1+f36aa36   10.0.1.5   <none>   Red Hat Enterprise Linux CoreOS 48.84.202107040900-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8
yangyang-bz-07140153-worker-northcentralus-3   Ready   worker   7h35m   v1.21.1+f36aa36   10.0.1.6   <none>   Red Hat Enterprise Linux CoreOS 48.84.202107040900-0 (Ootpa)   4.18.0-305.7.1.el8_4.x86_64   cri-o://1.21.1-12.rhaos4.8.git30ca719.el8

# oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0     True        False         False      107m
baremetal                                  4.8.0     True        False         False      7h48m
cloud-credential                           4.8.0     True        False         False      7h51m
cluster-autoscaler                         4.8.0     True        False         False      7h46m
config-operator                            4.8.0     True        False         False      7h48m
console                                    4.8.0     True        False         False      108m
csi-snapshot-controller                    4.8.0     True        False         False      7h42m
dns                                        4.8.0     True        False         False      130m
etcd                                       4.8.0     True        False         False      7h46m
image-registry                             4.8.0     True        False         False      7h33m
ingress                                    4.8.0     True        False         False      145m
insights                                   4.8.0     True        False         False      7h41m
kube-apiserver                             4.8.0     True        False         False      7h44m
kube-controller-manager                    4.8.0     True        False         False      7h44m
kube-scheduler                             4.8.0     True        False         False      7h46m
kube-storage-version-migrator              4.8.0     True        False         False      118m
machine-api                                4.8.0     True        False         False      7h42m
machine-approver                           4.8.0     True        False         False      7h47m
machine-config                             4.8.0     True        False         False      7h41m
marketplace                                4.8.0     True        False         False      7h46m
monitoring                                 4.8.0     True        False         False      143m
network                                    4.8.0     True        False         False      7h47m
node-tuning                                4.8.0     True        False         False      145m
openshift-apiserver                        4.8.0     True        False         False      107m
openshift-controller-manager               4.8.0     True        False         False      143m
openshift-samples                          4.8.0     True        False         False      145m
operator-lifecycle-manager                 4.8.0     True        False         False      7h47m
operator-lifecycle-manager-catalog         4.8.0     True        False         False      7h47m
operator-lifecycle-manager-packageserver   4.8.0     True        False         False      7h42m
service-ca                                 4.8.0     True        False         False      7h48m
storage                                    4.8.0     True        False         False      7h48m

The cluster is up and running and can be accessed using the kubeconfig:
https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/29826/artifact/workdir/install-dir/auth/kubeconfig
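One low-risk experiment to rule out a merely slow reboot: the failure message shows the Ansible reboot module's default 600-second limit (timeout=600), and that limit is tunable. A hypothetical tweak to the task, offered only to help bisect the failure (it would not help if SSH itself is broken):

```yaml
# Hypothetical: raise the reboot module's timeout above the 600s default
# seen in "Timed out waiting for last boot time check (timeout=600)".
- name: Reboot the host and wait for it to come back
  reboot:
    reboot_timeout: 1200
```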
I'm summarizing the reports of passing/failing cluster configurations; please correct me if I have these wrong. It appears that the same configuration works on IPI GCP but fails on UPI Azure. Are there other passing test jobs on UPI Azure? I'm trying to narrow down which combinations pass or fail to focus on potential issues. Since we know the code works on IPI GCP, it leads me to believe there is a platform component to the problem on Azure.

Passed:
  comment 34: 21_Disconnected IPI on GCP with RHCOS & RHEL7.9 & FIPS on & http_proxy & Etcd Encryption on

Failed:
  comment 37: 03_Disconnected UPI on Azure with RHCOS & RHEL7.9 & FIPS on & http_proxy & Etcd Encryption on
  comment 42: "regular azure cluster behind proxy with RHEL nodes"

From comment 42, were the nodes reporting Ready while the Reboot task was still retrying? I'm unable to access the must-gather in comment 44.
Per QE's CI test history, similar issues started happening with 4.6.38 as the upgrade target version, and only on Azure.
The problem with SSH access to the Azure nodes is now proven to happen before the upgrade. I am assigning the QA contact.
The ssh issue [1] should be fixed in [2]. With the ssh issue resolved, the upgrade should complete successfully. Please retest.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1984449
[2] https://amd64.ocp.releases.ci.openshift.org/releasestream/4.9.0-0.nightly/release/4.9.0-0.nightly-2021-07-30-090713
The problem is solved with the latest 4.9: I can ssh into Azure nodes without hitting the "PTY allocation request failed on channel 0" error, and the 4.8 -> 4.9 upgrade also worked. The 4.7 -> 4.8 upgrade still fails:
https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/upgrade_CI/16355/console
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
  example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
  example: Up to 2 minute disruption in edge routing
  example: Up to 90 seconds of API downtime
  example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  example: Issue resolves itself after five minutes
  example: Admin uses oc to fix things
  example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
  example: No, it's always been like this, we just never noticed
  example: Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1
I think these questions should be asked on the linked bug [1], as that is the actual issue. This bug should probably just be closed NOTABUG or closed as a DUPLICATE of the linked bug because the issue identified here was just a result of ssh being broken on all nodes. The assignee on the linked bug would be in a better position to answer the questions as the fix is in that bug. [1] https://bugzilla.redhat.com/show_bug.cgi?id=1984449
(In reply to Russell Teague from comment #62)
> I think these questions should be asked on the linked bug [1]...
> ...
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1984449

Done [1].

> This bug should probably just be closed NOTABUG or closed as a DUPLICATE of the linked bug...

I'm not clear enough on what's going on to be able to make that call myself, so for now I'm just leaving UpgradeBlocker on here and adding ImpactStatementRequested. If someone more familiar with this series thinks it's appropriate to close it out, or just remove UpgradeBlocker, that's fine with me.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1984449#c10
*** This bug has been marked as a duplicate of bug 1984449 ***
Since this is closed as a dup, I'm dropping UpgradeBlocker, and we can sort that all out in bug 1984449.