Bug 1974962 - MCO degraded during upgrade OCP with RHEL,OVN, Proxy: client connection lost
Summary: MCO degraded during upgrade OCP with RHEL,OVN, Proxy: client connection lost
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Andrew Stoycos
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On: 1976110
Blocks: 1999867
 
Reported: 2021-06-22 20:20 UTC by Walid A.
Modified: 2021-10-14 14:14 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-10-14 14:14:23 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Walid A. 2021-06-22 20:20:49 UTC
Description of problem:
Cluster operator machine-config has not yet successfully rolled out during upgrade from OCP 4.7.0-0.nightly-2021-06-20-093308 to 4.8.0-0.nightly-2021-06-21-175537

This is seen in our QE Upgrade CI profile:
80_IPI on AWS RHCOS & RHEL7.9 & FIPS on & OVN & Etcd Encryption & http_proxy & STS, where we scale up and add two RHEL 7.9 worker nodes.
One RHEL 7.9 worker node is stuck in the Ready,SchedulingDisabled state.

[2021-06-22T06:45:19.608Z] Post action: #oc get node: NAME                                        STATUS                     ROLES    AGE     VERSION                INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
[2021-06-22T06:45:19.608Z] ip-10-0-51-35.us-east-2.compute.internal    Ready                      master   4h27m   v1.21.0-rc.0+120883f   10.0.51.35    <none>        Red Hat Enterprise Linux CoreOS 48.84.202106211517-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64    cri-o://1.21.1-9.rhaos4.8.gitdfcd2b6.el8
[2021-06-22T06:45:19.608Z] ip-10-0-51-86.us-east-2.compute.internal    Ready                      worker   4h18m   v1.20.0+87cc9a4        10.0.51.86    <none>        Red Hat Enterprise Linux CoreOS 47.83.202106200838-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.3-4.rhaos4.7.gitbaade70.el8
[2021-06-22T06:45:19.608Z] ip-10-0-54-153.us-east-2.compute.internal   Ready                      master   4h27m   v1.21.0-rc.0+120883f   10.0.54.153   <none>        Red Hat Enterprise Linux CoreOS 48.84.202106211517-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64    cri-o://1.21.1-9.rhaos4.8.gitdfcd2b6.el8
[2021-06-22T06:45:19.608Z] ip-10-0-56-175.us-east-2.compute.internal   Ready                      worker   3h6m    v1.20.0+87cc9a4        10.0.56.175   <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.31.1.el7.x86_64    cri-o://1.20.3-4.rhaos4.7.gitbaade70.el7
[2021-06-22T06:45:19.608Z] ip-10-0-58-42.us-east-2.compute.internal    Ready,SchedulingDisabled   worker   3h6m    v1.20.0+87cc9a4        10.0.58.42    <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.31.1.el7.x86_64    cri-o://1.20.3-4.rhaos4.7.gitbaade70.el7
[2021-06-22T06:45:19.608Z] ip-10-0-74-190.us-east-2.compute.internal   Ready                      worker   4h18m   v1.20.0+87cc9a4        10.0.74.190   <none>        Red Hat Enterprise Linux CoreOS 47.83.202106200838-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.3-4.rhaos4.7.gitbaade70.el8
[2021-06-22T06:45:19.608Z] ip-10-0-74-227.us-east-2.compute.internal   Ready                      worker   4h17m   v1.20.0+87cc9a4        10.0.74.227   <none>        Red Hat Enterprise Linux CoreOS 47.83.202106200838-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.3-4.rhaos4.7.gitbaade70.el8
[2021-06-22T06:45:19.608Z] ip-10-0-76-192.us-east-2.compute.internal   Ready                      master   4h28m   v1.21.0-rc.0+120883f   10.0.76.192   <none>        Red Hat Enterprise Linux CoreOS 48.84.202106211517-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64    cri-o://1.21.1-9.rhaos4.8.gitdfcd2b6.el8

[2021-06-22T06:42:17.846Z] NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
[2021-06-22T06:42:17.846Z] version   4.7.0-0.nightly-2021-06-20-093308   True        True          3h1m    Unable to apply 4.8.0-0.nightly-2021-06-21-175537: the cluster operator machine-config has not yet successfully rolled out.


Version-Release number of selected component (if applicable):
node kube version: v1.20.0+87cc9a4
Red Hat Enterprise Linux CoreOS 47.83.202106200838-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64
cri-o://1.20.3-4.rhaos4.7.gitbaade70.el8

How reproducible:
Once in our CI

Steps to Reproduce:
1. Install OCP 4.7.0-0.nightly-2021-06-20-093308 IPI on AWS with FIPS on & OVN & Etcd Encryption & http_proxy & STS
2. Scale to add 2 RHEL 7.9 worker nodes
3. Upgrade cluster to 4.8.0-0.nightly-2021-06-21-175537
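
For reference, a rough CLI sketch of step 3 outside the CI profile (the release pullspec is a placeholder for the nightly image, and --allow-explicit-upgrade/--force are only needed because nightlies are unsigned):

# oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-06-21-175537 --allow-explicit-upgrade --force
# watch -n 60 'oc get clusterversion; oc get co; oc get mcp'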

Actual results:
Machine config operator degraded
[2021-06-22T06:45:19.608Z] Post action: #oc get co:NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
[2021-06-22T06:45:19.608Z] authentication                             4.8.0-0.nightly-2021-06-21-175537   True        False         False      99m
[2021-06-22T06:45:19.608Z] baremetal                                  4.8.0-0.nightly-2021-06-21-175537   True        False         False      4h25m
[2021-06-22T06:45:19.608Z] cloud-credential                           4.8.0-0.nightly-2021-06-21-175537   True        False         False      4h23m
[2021-06-22T06:45:19.608Z] cluster-autoscaler                         4.8.0-0.nightly-2021-06-21-175537   True        False         False      4h23m
[2021-06-22T06:45:19.608Z] config-operator                            4.8.0-0.nightly-2021-06-21-175537   True        False         False      4h24m
[2021-06-22T06:45:19.608Z] console                                    4.8.0-0.nightly-2021-06-21-175537   True        False         False      101m
[2021-06-22T06:45:19.608Z] csi-snapshot-controller                    4.8.0-0.nightly-2021-06-21-175537   True        False         False      4h24m
[2021-06-22T06:45:19.608Z] dns                                        4.8.0-0.nightly-2021-06-21-175537   True        False         False      132m
[2021-06-22T06:45:19.608Z] etcd                                       4.8.0-0.nightly-2021-06-21-175537   True        False         False      4h22m
[2021-06-22T06:45:19.608Z] image-registry                             4.8.0-0.nightly-2021-06-21-175537   True        False         False      4h17m
[2021-06-22T06:45:19.609Z] ingress                                    4.8.0-0.nightly-2021-06-21-175537   True        False         False      151m
[2021-06-22T06:45:19.609Z] insights                                   4.8.0-0.nightly-2021-06-21-175537   True        False         False      4h18m
[2021-06-22T06:45:19.609Z] kube-apiserver                             4.8.0-0.nightly-2021-06-21-175537   True        False         False      4h22m
[2021-06-22T06:45:19.609Z] kube-controller-manager                    4.8.0-0.nightly-2021-06-21-175537   True        False         False      4h22m
[2021-06-22T06:45:19.609Z] kube-scheduler                             4.8.0-0.nightly-2021-06-21-175537   True        False         False      4h22m
[2021-06-22T06:45:19.609Z] kube-storage-version-migrator              4.8.0-0.nightly-2021-06-21-175537   True        False         False      102m
[2021-06-22T06:45:19.609Z] machine-api                                4.8.0-0.nightly-2021-06-21-175537   True        False         False      4h19m
[2021-06-22T06:45:19.609Z] machine-approver                           4.8.0-0.nightly-2021-06-21-175537   True        False         False      4h24m
[2021-06-22T06:45:19.609Z] machine-config                             4.7.0-0.nightly-2021-06-20-093308   False       True          True       111m
[2021-06-22T06:45:19.609Z] marketplace                                4.8.0-0.nightly-2021-06-21-175537   True        False         False      4h23m
[2021-06-22T06:45:19.609Z] monitoring                                 4.8.0-0.nightly-2021-06-21-175537   True        False         False      107m
[2021-06-22T06:45:19.609Z] network                                    4.8.0-0.nightly-2021-06-21-175537   True        False         False      4h24m
[2021-06-22T06:45:19.609Z] node-tuning                                4.8.0-0.nightly-2021-06-21-175537   True        False         False      151m
[2021-06-22T06:45:19.609Z] openshift-apiserver                        4.8.0-0.nightly-2021-06-21-175537   True        False         False      100m
[2021-06-22T06:45:19.609Z] openshift-controller-manager               4.8.0-0.nightly-2021-06-21-175537   True        False         False      150m
[2021-06-22T06:45:19.609Z] openshift-samples                          4.8.0-0.nightly-2021-06-21-175537   True        False         False      151m
[2021-06-22T06:45:19.609Z] operator-lifecycle-manager                 4.8.0-0.nightly-2021-06-21-175537   True        False         False      4h24m
[2021-06-22T06:45:19.609Z] operator-lifecycle-manager-catalog         4.8.0-0.nightly-2021-06-21-175537   True        False         False      4h24m
[2021-06-22T06:45:19.609Z] operator-lifecycle-manager-packageserver   4.8.0-0.nightly-2021-06-21-175537   True        False         False      133m
[2021-06-22T06:45:19.609Z] service-ca                                 4.8.0-0.nightly-2021-06-21-175537   True        False         False      4h24m
[2021-06-22T06:45:19.609Z] storage                                    4.8.0-0.nightly-2021-06-21-175537   True        False         False      102m


Expected results:
All cluster operators are Available, not Progressing, and not Degraded after the upgrade.

Additional info:
A link to the must-gather tarball will be in the next private comment.

Comment 3 Yang Yang 2021-06-23 02:53:24 UTC
I can reproduce it. 

Checking the MCP status, the worker pool is degraded.

# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-523fde35bd682b1ca95b43ea123ee039   True      False      False      3              3                   3                     0                      20h
worker   rendered-worker-631a04c93d1348113465fd9c4127892b   False     True       True       5              2                   2                     1                      20h

Checking the worker pool message, I found that node ip-10-0-48-245.us-east-2.compute.internal has an SSH access error (a jsonpath shortcut for pulling just this condition is sketched after the YAML below).

# oc get mcp worker -oyaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: "2021-06-22T06:27:00Z"
  generation: 3
  labels:
    machineconfiguration.openshift.io/mco-built-in: ""
    pools.operator.machineconfiguration.openshift.io/worker: ""
  managedFields:
  - apiVersion: machineconfiguration.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:machineconfiguration.openshift.io/mco-built-in: {}
          f:pools.operator.machineconfiguration.openshift.io/worker: {}
      f:spec:
        .: {}
        f:configuration:
          .: {}
          f:source: {}
        f:machineConfigSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:machineconfiguration.openshift.io/role: {}
        f:nodeSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:node-role.kubernetes.io/worker: {}
        f:paused: {}
    manager: machine-config-operator
    operation: Update
    time: "2021-06-22T06:27:00Z"
  - apiVersion: machineconfiguration.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:configuration:
          f:name: {}
          f:source: {}
      f:status:
        .: {}
        f:conditions: {}
        f:configuration:
          .: {}
          f:name: {}
          f:source: {}
        f:degradedMachineCount: {}
        f:machineCount: {}
        f:observedGeneration: {}
        f:readyMachineCount: {}
        f:unavailableMachineCount: {}
        f:updatedMachineCount: {}
    manager: machine-config-controller
    operation: Update
    time: "2021-06-22T06:28:38Z"
  name: worker
  resourceVersion: "106385"
  uid: 7a455bab-c429-47e1-a4db-795d3c5b6b8c
spec:
  configuration:
    name: rendered-worker-884c73113dfa09e7755a2ce6ccbb7fd3
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-fips
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
  paused: false
status:
  conditions:
  - lastTransitionTime: "2021-06-22T08:44:49Z"
    message: ""
    reason: ""
    status: "False"
    type: Updated
  - lastTransitionTime: "2021-06-22T08:44:49Z"
    message: All nodes are updating to rendered-worker-631a04c93d1348113465fd9c4127892b
    reason: ""
    status: "True"
    type: Updating
  - lastTransitionTime: "2021-06-22T09:02:22Z"
    message: ""
    reason: ""
    status: "False"
    type: RenderDegraded
  - lastTransitionTime: "2021-06-22T09:05:25Z"
    message: 'Node ip-10-0-48-245.us-east-2.compute.internal is reporting: "error detecting previous SSH accesses: error: cannot apply annotation for SSH access due to: unable to update node \"&Node{ObjectMeta:{      0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,},}\": unexpected error when reading response body. Please retry. Original error: http2: client connection lost"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
  - lastTransitionTime: "2021-06-22T09:05:25Z"
    message: ""
    reason: ""
    status: "True"
    type: Degraded
  configuration:
    name: rendered-worker-631a04c93d1348113465fd9c4127892b
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-fips
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
  degradedMachineCount: 1
  machineCount: 5
  observedGeneration: 3
  readyMachineCount: 2
  unavailableMachineCount: 1
  updatedMachineCount: 2
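
As a shortcut, the degraded-node condition above can be pulled out without dumping the whole pool, e.g. (illustrative jsonpath, same data as the conditions block):

# oc get mcp worker -o jsonpath='{.status.conditions[?(@.type=="NodeDegraded")].message}{"\n"}'
# oc get nodes -l node.openshift.io/os_id=rhel -o wide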

# oc logs -n openshift-machine-config-operator pod/machine-config-daemon-7999p -c machine-config-daemon | tail
W0622 08:55:51.521339    3355 reflector.go:436] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: watch of *v1.MachineConfig ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0622 08:55:51.521370    3355 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0622 09:02:38.735300    3355 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0622 09:02:38.735343    3355 reflector.go:436] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: watch of *v1.MachineConfig ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
I0622 09:03:09.701297    3355 trace.go:205] Trace[774965466]: "Reflector ListAndWatch" name:github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101 (22-Jun-2021 09:02:39.699) (total time: 30002ms):
Trace[774965466]: [30.002098681s] [30.002098681s] END
E0622 09:03:09.701425    3355 reflector.go:138] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.MachineConfig: failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?resourceVersion=98573": dial tcp 172.30.0.1:443: i/o timeout
I0622 09:03:09.711269    3355 trace.go:205] Trace[1852186258]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:134 (22-Jun-2021 09:02:39.710) (total time: 30000ms):
Trace[1852186258]: [30.000475692s] [30.000475692s] END
E0622 09:03:09.711370    3355 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://172.30.0.1:443/api/v1/nodes?resourceVersion=98506": dial tcp 172.30.0.1:443: i/o timeout
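
The i/o timeouts against 172.30.0.1:443 (the in-cluster kubernetes service VIP) suggest the node intermittently loses its path to the API server. A quick, illustrative check from the affected node, assuming it is still reachable via oc debug: a timeout reproduces the "dial tcp 172.30.0.1:443: i/o timeout" above, while any HTTP response (even a 403) means the service path itself is up.

# oc debug node/ip-10-0-48-245.us-east-2.compute.internal
# chroot /host
# curl -k -sS -m 10 https://172.30.0.1:443/healthz ; echo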

Comment 5 Wei Sun 2021-06-23 07:01:11 UTC
Per comment #3, QE could still reproduce it. Setting blocker? to request a re-check on whether this is a release blocker.

Comment 6 Yu Qi Zhang 2021-06-24 03:27:34 UTC
Looking at the provided journal logs, there are a number of odd errors early on in kubelet/cri-o and a few MCD container restarts, but the node eventually gets to the part where it's having trouble: the MCD tries the update, fails, and loops on that.

Example journal snippet:

Jun 22 14:18:57 ip-10-0-48-245.us-east-2.compute.internal root[491146]: machine-config-daemon[74888]: Starting update from rendered-worker-631a04c93d1348113465fd9c4127892b to rendered-worker-884c73113dfa09e7755a2ce6ccbb7fd3: &{osUpdate:true kargs:false fips:false passwd:false files:true units:true kernelType:false extensions:false}
Jun 22 14:18:57 ip-10-0-48-245.us-east-2.compute.internal root[491147]: machine-config-daemon[74888]: Node has been successfully cordoned
Jun 22 14:18:57 ip-10-0-48-245.us-east-2.compute.internal root[491148]: machine-config-daemon[74888]: Update prepared; beginning drain
Jun 22 14:19:54 ip-10-0-48-245.us-east-2.compute.internal crio[2654]: time="2021-06-22 14:19:54.742723030Z" level=info msg="Checking image status: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4a2f77cfc83e981b62edb84b58c6520a09b6ea11e91e635ae91b5d56cf33873a" id=69b0d967-0adb-4902-a53e-871534d71f14 name=/runtime.v1alpha2.ImageService/ImageStatus
Jun 22 14:19:54 ip-10-0-48-245.us-east-2.compute.internal crio[2654]: time="2021-06-22 14:19:54.744083065Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:b662cc564429008b0b899bcd8f7d2ea400057469f40a595bb39c41ec2c3e3b1b,RepoTags:[],RepoDigests:[quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4a2f77cfc83e981b62edb84b58c6520a09b6ea11e91e635ae91b5d56cf33873a],Size_:479281995,Uid:nil,Username:,Spec:nil,},Info:map[string]string{},}" id=69b0d967-0adb-4902-a53e-871534d71f14 name=/runtime.v1alpha2.ImageService/ImageStatus
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal dhclient[2233]: DHCPREQUEST on br-ex to 10.0.48.1 port 67 (xid=0x5ea02204)
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal dhclient[2233]: DHCPACK from 10.0.48.1 (xid=0x5ea02204)
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal NetworkManager[1357]: <info>  [1624371800.1782] dhcp4 (br-ex):   address 10.0.48.245
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal NetworkManager[1357]: <info>  [1624371800.1782] dhcp4 (br-ex):   plen 20 (255.255.240.0)
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal NetworkManager[1357]: <info>  [1624371800.1782] dhcp4 (br-ex):   gateway 10.0.48.1
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal NetworkManager[1357]: <info>  [1624371800.1783] dhcp4 (br-ex):   lease time 3600
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal NetworkManager[1357]: <info>  [1624371800.1783] dhcp4 (br-ex):   hostname 'ip-10-0-48-245'
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal NetworkManager[1357]: <info>  [1624371800.1783] dhcp4 (br-ex):   nameserver '10.0.0.2'
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal NetworkManager[1357]: <info>  [1624371800.1783] dhcp4 (br-ex):   domain name 'us-east-2.compute.internal'
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal NetworkManager[1357]: <info>  [1624371800.1783] dhcp4 (br-ex): state changed bound -> bound
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal dbus[1104]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal dhclient[2233]: bound to 10.0.48.245 -- renewal in 1783 seconds.
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal systemd[1]: Starting Network Manager Script Dispatcher Service...
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal dbus[1104]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal systemd[1]: Started Network Manager Script Dispatcher Service.
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal nm-dispatcher[496358]: req:1 'dhcp4-change' [br-ex]: new request (6 scripts)
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal nm-dispatcher[496358]: req:1 'dhcp4-change' [br-ex]: start running ordered scripts...
Jun 22 14:24:54 ip-10-0-48-245.us-east-2.compute.internal crio[2654]: time="2021-06-22 14:24:54.748633915Z" level=info msg="Checking image status: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4a2f77cfc83e981b62edb84b58c6520a09b6ea11e91e635ae91b5d56cf33873a" id=405c4a76-b4cb-4eb3-baaf-d82263bef16f name=/runtime.v1alpha2.ImageService/ImageStatus
Jun 22 14:24:54 ip-10-0-48-245.us-east-2.compute.internal crio[2654]: time="2021-06-22 14:24:54.749859655Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:b662cc564429008b0b899bcd8f7d2ea400057469f40a595bb39c41ec2c3e3b1b,RepoTags:[],RepoDigests:[quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4a2f77cfc83e981b62edb84b58c6520a09b6ea11e91e635ae91b5d56cf33873a],Size_:479281995,Uid:nil,Username:,Spec:nil,},Info:map[string]string{},}" id=405c4a76-b4cb-4eb3-baaf-d82263bef16f name=/runtime.v1alpha2.ImageService/ImageStatus

Two things from that:
1. The cri-o logs seem to indicate that we are doing an image inspect. I'm not sure whether the return values look correct, but assuming they are,
2. the node immediately sees a DHCPREQUEST, which seems abnormal. Is networking on the node unstable? Is that expected?

Looking at the job profile, maybe the OVN team can help take a look? From the MCO's perspective, something lower in the networking stack is going wrong, so I'm not sure how to debug further.
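
If it helps, a rough sketch of how the networking side could be checked on the node around the failure window (commands are illustrative and assume shell access via oc debug or SSH; ovs-vsctl assumes the host openvswitch tools are installed; timestamps match the journal snippet above):

# journalctl -u NetworkManager --since "2021-06-22 14:15" --until "2021-06-22 14:30"
# journalctl -t dhclient --since "2021-06-22 14:15" --until "2021-06-22 14:30"
# nmcli -f GENERAL,IP4 device show br-ex
# ovs-vsctl show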

Comment 8 Ori Michaeli 2021-06-28 06:10:32 UTC
This BZ does not reproduce on IPI BM

Comment 9 Ori Michaeli 2021-06-28 13:12:15 UTC
(In reply to Ori Michaeli from comment #8)
> This BZ does not reproduce on IPI BM

Clarification:
This was tested with 4.7.0-0.nightly-2021-06-26-014854, as 4.7.0-0.nightly-2021-06-20-093308 is no longer available.

Comment 11 Yang Yang 2021-06-30 09:00:59 UTC
Updating the test result for the upgrade from 4.7.0-0.nightly-2021-06-26-014854 to 4.8.0-0.nightly-2021-06-29-033219. I don't see this kind of error: "timed out waiting for the condition during syncRequiredMachineConfigPools: error pool worker is not ready".

# oc get node
NAME                                        STATUS                        ROLES    AGE     VERSION
ip-10-0-48-218.us-east-2.compute.internal   Ready                         worker   5h26m   v1.21.0-rc.0+766a5fe
ip-10-0-49-65.us-east-2.compute.internal    Ready                         worker   4h15m   v1.20.0+87cc9a4
ip-10-0-50-212.us-east-2.compute.internal   Ready                         master   5h39m   v1.21.0-rc.0+766a5fe
ip-10-0-54-58.us-east-2.compute.internal    Ready                         worker   5h26m   v1.21.0-rc.0+766a5fe
ip-10-0-59-156.us-east-2.compute.internal   NotReady,SchedulingDisabled   worker   4h15m   v1.20.0+87cc9a4
ip-10-0-71-142.us-east-2.compute.internal   Ready                         master   5h39m   v1.21.0-rc.0+766a5fe
ip-10-0-72-138.us-east-2.compute.internal   Ready                         master   5h40m   v1.21.0-rc.0+766a5fe
ip-10-0-73-231.us-east-2.compute.internal   Ready                         worker   5h26m   v1.21.0-rc.0+766a5fe


# oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.8.0-0.nightly-2021-06-29-033219   True        False         False      125m
baremetal                                  4.8.0-0.nightly-2021-06-29-033219   True        False         False      5h36m
cloud-credential                           4.8.0-0.nightly-2021-06-29-033219   True        False         False      5h36m
cluster-autoscaler                         4.8.0-0.nightly-2021-06-29-033219   True        False         False      5h35m
config-operator                            4.8.0-0.nightly-2021-06-29-033219   True        False         False      5h36m
console                                    4.8.0-0.nightly-2021-06-29-033219   True        False         False      127m
csi-snapshot-controller                    4.8.0-0.nightly-2021-06-29-033219   True        False         False      5h31m
dns                                        4.8.0-0.nightly-2021-06-29-033219   True        True          False      157m
etcd                                       4.8.0-0.nightly-2021-06-29-033219   True        False         False      5h35m
image-registry                             4.8.0-0.nightly-2021-06-29-033219   True        False         False      5h25m
ingress                                    4.8.0-0.nightly-2021-06-29-033219   True        False         False      3h1m
insights                                   4.8.0-0.nightly-2021-06-29-033219   True        False         False      5h29m
kube-apiserver                             4.8.0-0.nightly-2021-06-29-033219   True        False         False      5h34m
kube-controller-manager                    4.8.0-0.nightly-2021-06-29-033219   True        False         False      5h33m
kube-scheduler                             4.8.0-0.nightly-2021-06-29-033219   True        False         False      5h33m
kube-storage-version-migrator              4.8.0-0.nightly-2021-06-29-033219   True        False         False      129m
machine-api                                4.8.0-0.nightly-2021-06-29-033219   True        False         False      5h31m
machine-approver                           4.8.0-0.nightly-2021-06-29-033219   True        False         False      5h35m
machine-config                             4.7.0-0.nightly-2021-06-26-014854   False       True          True       138m
marketplace                                4.8.0-0.nightly-2021-06-29-033219   True        False         False      5h35m
monitoring                                 4.8.0-0.nightly-2021-06-29-033219   False       True          True       123m
network                                    4.8.0-0.nightly-2021-06-29-033219   True        True          True       5h36m
node-tuning                                4.8.0-0.nightly-2021-06-29-033219   True        False         False      3h
openshift-apiserver                        4.8.0-0.nightly-2021-06-29-033219   True        False         False      125m
openshift-controller-manager               4.8.0-0.nightly-2021-06-29-033219   True        False         False      3h1m
openshift-samples                          4.8.0-0.nightly-2021-06-29-033219   True        False         False      3h1m
operator-lifecycle-manager                 4.8.0-0.nightly-2021-06-29-033219   True        False         False      5h36m
operator-lifecycle-manager-catalog         4.8.0-0.nightly-2021-06-29-033219   True        False         False      5h36m
operator-lifecycle-manager-packageserver   4.8.0-0.nightly-2021-06-29-033219   True        False         False      145m
service-ca                                 4.8.0-0.nightly-2021-06-29-033219   True        False         False      5h36m
storage                                    4.8.0-0.nightly-2021-06-29-033219   True        True          False      137m

# oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-165f673837a072be255e8838537b90c6   True      False      False      3              3                   3                     0                      5h37m
worker   rendered-worker-6e0b285ba602438a84b3750e65ea3697   False     True       False      5              3                   3                     0                      5h37m


# oc describe co machine-config
Name:         machine-config
Namespace:    
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2021-06-30T02:56:33Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
          f:include.release.openshift.io/self-managed-high-availability:
          f:include.release.openshift.io/single-node-developer:
      f:spec:
      f:status:
        .:
        f:relatedObjects:
        f:versions:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2021-06-30T02:56:33Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:extension:
          .:
          f:master:
          f:worker:
        f:relatedObjects:
        f:versions:
    Manager:         machine-config-operator
    Operation:       Update
    Time:            2021-06-30T06:31:07Z
  Resource Version:  178512
  UID:               29895b49-21dd-4dce-8622-1d8c77fefe55
Spec:
Status:
  Conditions:
    Last Transition Time:  2021-06-30T06:12:03Z
    Message:               Working towards 4.8.0-0.nightly-2021-06-29-033219
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2021-06-30T06:16:56Z
    Message:               One or more machine config pools are updating, please see `oc get mcp` for further details
    Reason:                PoolUpdating
    Status:                False
    Type:                  Upgradeable
    Last Transition Time:  2021-06-30T06:43:35Z
    Message:               Unable to apply 4.8.0-0.nightly-2021-06-29-033219: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 8, updated: 8, ready: 7, unavailable: 1)
    Reason:                MachineConfigDaemonFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-06-30T06:26:55Z
    Message:               Cluster not available for 4.8.0-0.nightly-2021-06-29-033219
    Status:                False
    Type:                  Available
  Extension:
    Master:  all 3 nodes are at latest configuration rendered-master-165f673837a072be255e8838537b90c6
    Worker:  3 (ready 3) out of 5 nodes are updating to latest configuration rendered-worker-63fe24505cf686e519c81db70ec61ca0
  Related Objects:
    Group:     
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:      
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      
    Resource:  controllerconfigs
    Group:     machineconfiguration.openshift.io
    Name:      
    Resource:  kubeletconfigs
    Group:     machineconfiguration.openshift.io
    Name:      
    Resource:  containerruntimeconfigs
    Group:     machineconfiguration.openshift.io
    Name:      
    Resource:  machineconfigs
    Group:     
    Name:      
    Resource:  nodes
    Group:     
    Name:      openshift-kni-infra
    Resource:  namespaces
    Group:     
    Name:      openshift-openstack-infra
    Resource:  namespaces
    Group:     
    Name:      openshift-ovirt-infra
    Resource:  namespaces
    Group:     
    Name:      openshift-vsphere-infra
    Resource:  namespaces
  Versions:
    Name:     operator
    Version:  4.7.0-0.nightly-2021-06-26-014854
Events:       <none>


# oc describe node ip-10-0-59-156.us-east-2.compute.internal 
Name:               ip-10-0-59-156.us-east-2.compute.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m4.xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-2
                    failure-domain.beta.kubernetes.io/zone=us-east-2a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-59-156.us-east-2.compute.internal
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=m4.xlarge
                    node.openshift.io/os_id=rhel
                    topology.ebs.csi.aws.com/zone=us-east-2a
                    topology.kubernetes.io/region=us-east-2
                    topology.kubernetes.io/zone=us-east-2a
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0f79156d446d6eaaf"}
                    k8s.ovn.org/host-addresses: ["10.0.59.156"]
                    k8s.ovn.org/l3-gateway-config:
                      {"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-59-156.us-east-2.compute.internal","mac-address":"02:62:91:75:7d:44","ip-address...
                    k8s.ovn.org/node-chassis-id: afae5859-452c-46c0-805c-50531fff6957
                    k8s.ovn.org/node-local-nat-ip: {"default":["169.254.1.48"]}
                    k8s.ovn.org/node-mgmt-port-mac-address: e6:22:87:9d:74:ca
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.59.156/20"}
                    k8s.ovn.org/node-subnets: {"default":"10.130.2.0/23"}
                    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-6e0b285ba602438a84b3750e65ea3697
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-63fe24505cf686e519c81db70ec61ca0
                    machineconfiguration.openshift.io/ssh: accessed
                    machineconfiguration.openshift.io/state: Working
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 30 Jun 2021 00:28:49 -0400
Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule
                    node.kubernetes.io/unschedulable:NoSchedule
Unschedulable:      true
Lease:
  HolderIdentity:  ip-10-0-59-156.us-east-2.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Wed, 30 Jun 2021 02:32:06 -0400
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Wed, 30 Jun 2021 02:28:16 -0400   Wed, 30 Jun 2021 02:32:51 -0400   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Wed, 30 Jun 2021 02:28:16 -0400   Wed, 30 Jun 2021 02:32:51 -0400   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Wed, 30 Jun 2021 02:28:16 -0400   Wed, 30 Jun 2021 02:32:51 -0400   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Wed, 30 Jun 2021 02:28:16 -0400   Wed, 30 Jun 2021 02:32:51 -0400   NodeStatusUnknown   Kubelet stopped posting node status.
Addresses:
  InternalIP:   10.0.59.156
  Hostname:     ip-10-0-59-156.us-east-2.compute.internal
  InternalDNS:  ip-10-0-59-156.us-east-2.compute.internal
Capacity:
  attachable-volumes-aws-ebs:  39
  cpu:                         4
  ephemeral-storage:           31444972Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      16264956Ki
  pods:                        250
Allocatable:
  attachable-volumes-aws-ebs:  39
  cpu:                         3500m
  ephemeral-storage:           27905944324
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      15113980Ki
  pods:                        250
System Info:
  Machine ID:                             d14ea99a3d754419bdfdfa01ab77bf07
  System UUID:                            EC20084D-9DCB-C128-04CE-35C2042522E3
  Boot ID:                                deb7c00f-4ed0-4973-926e-a7c62936964c
  Kernel Version:                         3.10.0-1160.31.1.el7.x86_64
  OS Image:                               Red Hat Enterprise Linux Server 7.9 (Maipo)
  Operating System:                       linux
  Architecture:                           amd64
  Container Runtime Version:              cri-o://1.20.3-6.rhaos4.7.git0d0f863.el7
  Kubelet Version:                        v1.20.0+87cc9a4
  Kube-Proxy Version:                     v1.20.0+87cc9a4
ProviderID:                               aws:///us-east-2a/i-0f79156d446d6eaaf
Non-terminated Pods:                      (13 in total)
  Namespace                               Name                                   CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                               ----                                   ------------  ----------  ---------------  -------------  ---
  openshift-cluster-csi-drivers           aws-ebs-csi-driver-node-6kgh4          30m (0%)      0 (0%)      150Mi (1%)       0 (0%)         3h8m
  openshift-cluster-node-tuning-operator  tuned-txllf                            10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         3h9m
  openshift-dns                           dns-default-724lv                      60m (1%)      0 (0%)      110Mi (0%)       0 (0%)         165m
  openshift-dns                           node-resolver-ldj8r                    5m (0%)       0 (0%)      21Mi (0%)        0 (0%)         167m
  openshift-image-registry                node-ca-qjn99                          10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         3h10m
  openshift-ingress-canary                ingress-canary-t7vmj                   10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         3h9m
  openshift-machine-config-operator       machine-config-daemon-fs2st            40m (1%)      0 (0%)      100Mi (0%)       0 (0%)         161m
  openshift-monitoring                    node-exporter-fq6qw                    9m (0%)       0 (0%)      47Mi (0%)        0 (0%)         3h10m
  openshift-multus                        multus-additional-cni-plugins-gx6th    10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         3h4m
  openshift-multus                        multus-jb6cx                           10m (0%)      0 (0%)      65Mi (0%)        0 (0%)         179m
  openshift-multus                        network-metrics-daemon-kk5vg           20m (0%)      0 (0%)      120Mi (0%)       0 (0%)         3h4m
  openshift-network-diagnostics           network-check-target-7lq5z             10m (0%)      0 (0%)      15Mi (0%)        0 (0%)         3h3m
  openshift-ovn-kubernetes                ovnkube-node-fj8hz                     40m (1%)      0 (0%)      640Mi (4%)       0 (0%)         3h2m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests     Limits
  --------                    --------     ------
  cpu                         264m (7%)    0 (0%)
  memory                      1358Mi (9%)  0 (0%)
  ephemeral-storage           0 (0%)       0 (0%)
  hugepages-1Gi               0 (0%)       0 (0%)
  hugepages-2Mi               0 (0%)       0 (0%)
  attachable-volumes-aws-ebs  0            0
Events:
  Type    Reason              Age   From     Message
  ----    ------              ----  ----     -------
  Normal  NodeNotSchedulable  143m  kubelet  Node ip-10-0-59-156.us-east-2.compute.internal status is now: NodeNotSchedulable

The kubeconfig file to access the cluster can be found in [1] if you'd like.

[1] https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/27508/artifact/workdir/install-dir/auth/kubeconfig
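
(To use it, something along these lines works; the path is wherever the downloaded file is saved:)

# export KUBECONFIG=./kubeconfig
# oc get nodes
# oc describe node ip-10-0-59-156.us-east-2.compute.internal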

Comment 12 Yu Qi Zhang 2021-07-03 01:11:03 UTC
The cluster inspect you provided (unfortunately the cluster appears to have been torn down) shows that one of the workers never successfully rejoined the cluster. This seems to be yet another, different error case. Next time we see a node in "NotReady,SchedulingDisabled" we may need to SSH into the system, since the API is unable to reach the node, which may point to kubelet failures.

There is another proxy + RHEL7 bug ongoing: https://bugzilla.redhat.com/show_bug.cgi?id=1975907. Maybe we can wait until both of the other RHEL bugs are settled before we look into this again? Sorry for the repeated trouble, but given the instability of RHEL7 right now it's hard to gauge what is going on.
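
A minimal sketch of what that SSH check could look like next time, assuming SSH access to the RHEL worker (the user and key depend on how the scale-up playbook was run):

# ssh <user>@10.0.59.156
# sudo systemctl status kubelet crio
# sudo journalctl -u kubelet -u crio --since "2 hours ago" | tail -n 200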

Comment 14 Jian Zhang 2021-07-06 06:41:23 UTC
Since there are no logs for the RHEL upgrade job, I'm not sure whether this bug is a duplicate of bug 1975907, but based on the must-gather logs it seems they are not. Based on comments 2 and 6, transferring it to the OVN team to have a look.
Based on comment 13, I removed the `UpgradeBlocker` keyword.

Comment 15 Ke Wang 2021-07-07 06:42:51 UTC
Hit the same bug when upgrading from 4.7.13-x86_64 to 4.7.19-x86_64 with profile 50_IPI on vSphere 7.0 with RHCOS & RHEL7.9 & FIPS on & OVN & Etcd Encryption on.

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.13    True        True          15h     Unable to apply 4.7.19: the cluster operator monitoring has not yet successfully rolled out

$ oc get no
NAME                                STATUS                        ROLES    AGE   VERSION
kewang06173513-zdhhx-master-0       Ready                         master   16h   v1.20.0+87cc9a4
kewang06173513-zdhhx-master-1       Ready                         master   16h   v1.20.0+87cc9a4
kewang06173513-zdhhx-master-2       Ready                         master   16h   v1.20.0+87cc9a4
kewang06173513-zdhhx-rhel-0         NotReady,SchedulingDisabled   worker   15h   v1.20.0+87cc9a4
kewang06173513-zdhhx-rhel-1         Ready                         worker   15h   v1.20.0+87cc9a4
kewang06173513-zdhhx-worker-5m4xf   Ready                         worker   16h   v1.20.0+df9c838
kewang06173513-zdhhx-worker-hhqgw   Ready                         worker   16h   v1.20.0+df9c838

$ oc get co --no-headers | grep -v '.True.*False.*False'
machine-config                             4.7.13   False   True    True    14h
monitoring                                 4.7.19   False   True    True    13h
network                                    4.7.19   True    True    True    16h
    
Upgrade Job console output from Jenkins:
...
07-06 14:13:56.229  version   4.7.13    True        True          125m    Working towards 4.7.19: 7 of 669 done (1% complete)
07-06 14:13:56.229  
07-06 14:13:56.229  
07-06 14:13:56.229  [Debug] upgrade_ret_1=1,upgrade_ret_2=1
07-06 14:13:56.229  
07-06 14:14:54.160  Unable to connect to the server: Service Unavailable
07-06 14:14:54.160  
07-06 14:14:54.160  
07-06 14:14:54.160  
07-06 14:14:54.160  Unable to connect to the server: Service Unavailable
07-06 14:14:54.160  [Debug] upgrade_ret_1=1,upgrade_ret_2=1
...

Checked the machine-config operator, which was the first to report the error:
...
07-06 15:13:38.873  Status:
07-06 15:13:38.873    Conditions:
07-06 15:13:38.873      Last Transition Time:  2021-07-06T13:08:14Z
07-06 15:13:38.873      Message:               Working towards 4.7.19
07-06 15:13:38.873      Status:                True
07-06 15:13:38.873      Type:                  Progressing
07-06 15:13:38.873      Last Transition Time:  2021-07-06T14:12:48Z
07-06 15:13:38.873      Message:               Unable to apply 4.7.19: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 7, updated: 7, ready: 6, unavailable: 1)
07-06 15:13:38.873      Reason:                MachineConfigDaemonFailed

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-a2e8bef20b3410f5d7765df456714e66   True      False      False      3              3                   3                     0                      19h
worker   rendered-worker-2990fdac5c6746d7e6bb90f86292fbad   False     True       False      4              0                   0                     0                      19h

Comment 21 hongyan li 2021-07-12 06:54:27 UTC
Hit the same bug when triaging the upgrade CI:
original_build=4.7.13-x86_64  target_build=4.7.19-x86_64
profile_name=50_IPI on vSphere 7.0 with RHCOS & RHEL7.9 & FIPS on & OVN & Etcd Encryption on

[2021-07-09T14:28:57.626Z] [Debug] upgrade_ret_1=1,upgrade_ret_2=1
[2021-07-09T14:28:57.626Z] 
[2021-07-09T14:29:56.570Z] NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
[2021-07-09T14:29:56.570Z] version   4.7.13    True        True          3h2m    Unable to apply 4.7.19: the cluster operator monitoring has not yet successfully rolled out
[2021-07-09T14:29:56.570Z] 
[2021-07-09T14:29:56.570Z] 
[2021-07-09T14:29:57.156Z] [Debug] upgrade_ret_1=1,upgrade_ret_2=1
[2021-07-09T14:29:57.156Z] 
[2021-07-09T14:31:05.151Z] **************Post Action after upgrade fail****************
[2021-07-09T14:31:05.151Z] 
[2021-07-09T14:31:05.151Z] Post action: #oc get node: NAME                                STATUS                        ROLES    AGE     VERSION           INTERNAL-IP      EXTERNAL-IP      OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
[2021-07-09T14:31:05.151Z] rioliu09175612-kgg8w-master-0       Ready                         master   4h20m   v1.20.0+87cc9a4   172.31.249.87    172.31.249.87    Red Hat Enterprise Linux CoreOS 47.83.202106252242-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.3-6.rhaos4.7.git0d0f863.el8
[2021-07-09T14:31:05.151Z] rioliu09175612-kgg8w-master-1       Ready                         master   4h20m   v1.20.0+87cc9a4   172.31.249.110   172.31.249.110   Red Hat Enterprise Linux CoreOS 47.83.202106252242-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.3-6.rhaos4.7.git0d0f863.el8
[2021-07-09T14:31:05.152Z] rioliu09175612-kgg8w-master-2       Ready                         master   4h20m   v1.20.0+87cc9a4   172.31.249.7     172.31.249.7     Red Hat Enterprise Linux CoreOS 47.83.202106252242-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.3-6.rhaos4.7.git0d0f863.el8
[2021-07-09T14:31:05.152Z] rioliu09175612-kgg8w-rhel-0         Ready                         worker   3h6m    v1.20.0+bd7b30d   172.31.249.247   172.31.249.247   Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.31.1.el7.x86_64    cri-o://1.20.3-7.rhaos4.7.git41925ef.el7
[2021-07-09T14:31:05.152Z] rioliu09175612-kgg8w-rhel-1         NotReady,SchedulingDisabled   worker   3h6m    v1.20.0+bd7b30d   172.31.249.227   172.31.249.227   Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.31.1.el7.x86_64    cri-o://1.20.3-7.rhaos4.7.git41925ef.el7
[2021-07-09T14:31:05.152Z] rioliu09175612-kgg8w-worker-ptksq   Ready                         worker   4h8m    v1.20.0+df9c838   172.31.249.215   172.31.249.215   Red Hat Enterprise Linux CoreOS 47.83.202105220305-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.2-12.rhaos4.7.git9f7be76.el8
[2021-07-09T14:31:05.152Z] rioliu09175612-kgg8w-worker-qfkr9   Ready                         worker   4h8m    v1.20.0+df9c838   172.31.249.231   172.31.249.231   Red Hat Enterprise Linux CoreOS 47.83.202105220305-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.2-12.rhaos4.7.git9f7be76.el8
[2021-07-09T14:31:05.152Z] 
[2021-07-09T14:31:05.152Z] 
[2021-07-09T14:31:05.152Z] Post action: #oc get co:NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
[2021-07-09T14:31:05.152Z] authentication                             4.7.19    True        False         False      66m
[2021-07-09T14:31:05.152Z] baremetal                                  4.7.19    True        False         False      4h17m
[2021-07-09T14:31:05.152Z] cloud-credential                           4.7.19    True        False         False      4h20m
[2021-07-09T14:31:05.152Z] cluster-autoscaler                         4.7.19    True        False         False      4h15m
[2021-07-09T14:31:05.152Z] config-operator                            4.7.19    True        False         False      4h17m
[2021-07-09T14:31:05.152Z] console                                    4.7.19    True        False         False      73m
[2021-07-09T14:31:05.152Z] csi-snapshot-controller                    4.7.19    True        False         False      65m
[2021-07-09T14:31:05.152Z] dns                                        4.7.19    True        False         False      4h15m
[2021-07-09T14:31:05.152Z] etcd                                       4.7.19    True        False         False      4h15m
[2021-07-09T14:31:05.152Z] image-registry                             4.7.19    True        False         False      3h20m
[2021-07-09T14:31:05.152Z] ingress                                    4.7.19    True        False         False      4h6m
[2021-07-09T14:31:05.152Z] insights                                   4.7.19    True        False         False      4h10m
[2021-07-09T14:31:05.152Z] kube-apiserver                             4.7.19    True        False         False      4h13m
[2021-07-09T14:31:05.152Z] kube-controller-manager                    4.7.19    True        False         False      4h14m
[2021-07-09T14:31:05.152Z] kube-scheduler                             4.7.19    True        False         False      4h14m
[2021-07-09T14:31:05.152Z] kube-storage-version-migrator              4.7.19    True        False         False      68m
[2021-07-09T14:31:05.152Z] machine-api                                4.7.19    True        False         False      4h13m
[2021-07-09T14:31:05.152Z] machine-approver                           4.7.19    True        False         False      4h16m
[2021-07-09T14:31:05.152Z] machine-config                             4.7.13    False       True          True       140m
[2021-07-09T14:31:05.152Z] marketplace                                4.7.19    True        False         False      77m
[2021-07-09T14:31:05.152Z] monitoring                                 4.7.19    False       True          True       74m
[2021-07-09T14:31:05.152Z] network                                    4.7.19    True        True          True       4h16m
[2021-07-09T14:31:05.152Z] node-tuning                                4.7.19    True        False         False      153m
[2021-07-09T14:31:05.152Z] openshift-apiserver                        4.7.19    True        False         False      66m
[2021-07-09T14:31:05.152Z] openshift-controller-manager               4.7.19    True        False         False      151m
[2021-07-09T14:31:05.152Z] openshift-samples                          4.7.19    True        False         False      153m
[2021-07-09T14:31:05.152Z] operator-lifecycle-manager                 4.7.19    True        False         False      4h16m
[2021-07-09T14:31:05.152Z] operator-lifecycle-manager-catalog         4.7.19    True        False         False      4h16m
[2021-07-09T14:31:05.152Z] operator-lifecycle-manager-packageserver   4.7.19    True        False         False      68m
[2021-07-09T14:31:05.152Z] service-ca                                 4.7.19    True        False         False      4h17m
[2021-07-09T14:31:05.152Z] storage                                    4.7.19    True        False         False      73m


[2021-07-09T14:31:05.152Z] print detail msg for node(SchedulingDisabled) if exist:
[2021-07-09T14:31:05.152Z] 
[2021-07-09T14:32:13.333Z] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Abnormal node details~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[2021-07-09T14:32:13.333Z] 
[2021-07-09T14:32:13.333Z] 
[2021-07-09T14:32:13.333Z] Name:               rioliu09175612-kgg8w-rhel-1
[2021-07-09T14:32:13.333Z] Roles:              worker
[2021-07-09T14:32:13.333Z] Labels:             beta.kubernetes.io/arch=amd64
[2021-07-09T14:32:13.333Z]                     beta.kubernetes.io/os=linux
[2021-07-09T14:32:13.333Z]                     kubernetes.io/arch=amd64
[2021-07-09T14:32:13.333Z]                     kubernetes.io/hostname=rioliu09175612-kgg8w-rhel-1
[2021-07-09T14:32:13.333Z]                     kubernetes.io/os=linux
[2021-07-09T14:32:13.333Z]                     node-role.kubernetes.io/worker=
[2021-07-09T14:32:13.333Z]                     node.openshift.io/os_id=rhel
[2021-07-09T14:32:13.333Z] Annotations:        k8s.ovn.org/l3-gateway-config:
[2021-07-09T14:32:13.333Z]                       {"default":{"mode":"shared","interface-id":"br-ex_rioliu09175612-kgg8w-rhel-1","mac-address":"00:50:56:ac:9a:f7","ip-addresses":["172.31.2...
[2021-07-09T14:32:13.333Z]                     k8s.ovn.org/node-chassis-id: 8bc13dc6-e668-49b1-b12b-9ba16655aac1
[2021-07-09T14:32:13.333Z]                     k8s.ovn.org/node-local-nat-ip: {"default":["169.254.15.54"]}
[2021-07-09T14:32:13.333Z]                     k8s.ovn.org/node-mgmt-port-mac-address: 92:30:54:8d:79:42
[2021-07-09T14:32:13.333Z]                     k8s.ovn.org/node-primary-ifaddr: {"ipv4":"172.31.249.227/23"}
[2021-07-09T14:32:13.333Z]                     k8s.ovn.org/node-subnets: {"default":"10.129.2.0/23"}
[2021-07-09T14:32:13.333Z]                     machineconfiguration.openshift.io/currentConfig: rendered-worker-a557d8bd4e42e3b87302ed6fef0c7626
[2021-07-09T14:32:13.333Z]                     machineconfiguration.openshift.io/desiredConfig: rendered-worker-04fc66aec510739dc2cb7bc4696f80c0
[2021-07-09T14:32:13.333Z]                     machineconfiguration.openshift.io/ssh: accessed
[2021-07-09T14:32:13.333Z]                     machineconfiguration.openshift.io/state: Working
[2021-07-09T14:32:13.333Z]                     volumes.kubernetes.io/controller-managed-attach-detach: true
[2021-07-09T14:32:13.333Z] CreationTimestamp:  Fri, 09 Jul 2021 11:24:43 +0000
[2021-07-09T14:32:13.333Z] Taints:             node.kubernetes.io/unreachable:NoExecute
[2021-07-09T14:32:13.333Z]                     node.kubernetes.io/unreachable:NoSchedule
[2021-07-09T14:32:13.333Z]                     node.kubernetes.io/unschedulable:NoSchedule
[2021-07-09T14:32:13.333Z] Unschedulable:      true
[2021-07-09T14:32:13.333Z] Lease:
[2021-07-09T14:32:13.333Z]   HolderIdentity:  rioliu09175612-kgg8w-rhel-1
[2021-07-09T14:32:13.333Z]   AcquireTime:     <unset>
[2021-07-09T14:32:13.333Z]   RenewTime:       Fri, 09 Jul 2021 13:08:04 +0000
[2021-07-09T14:32:13.333Z] Conditions:
[2021-07-09T14:32:13.333Z]   Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
[2021-07-09T14:32:13.333Z]   ----             ------    -----------------                 ------------------                ------              -------
[2021-07-09T14:32:13.333Z]   MemoryPressure   Unknown   Fri, 09 Jul 2021 13:08:03 +0000   Fri, 09 Jul 2021 13:08:44 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
[2021-07-09T14:32:13.333Z]   DiskPressure     Unknown   Fri, 09 Jul 2021 13:08:03 +0000   Fri, 09 Jul 2021 13:08:44 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
[2021-07-09T14:32:13.333Z]   PIDPressure      Unknown   Fri, 09 Jul 2021 13:08:03 +0000   Fri, 09 Jul 2021 13:08:44 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
[2021-07-09T14:32:13.333Z]   Ready            Unknown   Fri, 09 Jul 2021 13:08:03 +0000   Fri, 09 Jul 2021 13:08:44 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
[2021-07-09T14:32:13.333Z] Addresses:
[2021-07-09T14:32:13.333Z]   ExternalIP:  172.31.249.227
[2021-07-09T14:32:13.333Z]   InternalIP:  172.31.249.227
[2021-07-09T14:32:13.333Z]   Hostname:    rioliu09175612-kgg8w-rhel-1
[2021-07-09T14:32:13.333Z] Capacity:
[2021-07-09T14:32:13.333Z]   cpu:                8
[2021-07-09T14:32:13.333Z]   ephemeral-storage:  19066Mi
[2021-07-09T14:32:13.333Z]   hugepages-1Gi:      0
[2021-07-09T14:32:13.333Z]   hugepages-2Mi:      0
[2021-07-09T14:32:13.333Z]   memory:             16265896Ki
[2021-07-09T14:32:13.333Z]   pods:               250
[2021-07-09T14:32:13.333Z] Allocatable:
[2021-07-09T14:32:13.333Z]   cpu:                7500m
[2021-07-09T14:32:13.333Z]   ephemeral-storage:  16919193161
[2021-07-09T14:32:13.333Z]   hugepages-1Gi:      0
[2021-07-09T14:32:13.333Z]   hugepages-2Mi:      0
[2021-07-09T14:32:13.333Z]   memory:             15114920Ki
[2021-07-09T14:32:13.333Z]   pods:               250
[2021-07-09T14:32:13.333Z] System Info:
[2021-07-09T14:32:13.333Z]   Machine ID:                             68b4570c8c2f40dfa8c34a353d7aa935
[2021-07-09T14:32:13.333Z]   System UUID:                            97AA2C42-B450-4CA9-CC9E-4675C05C9D09
[2021-07-09T14:32:13.333Z]   Boot ID:                                46a37500-282c-4f56-a934-841dd0bfc02c
[2021-07-09T14:32:13.333Z]   Kernel Version:                         3.10.0-1160.31.1.el7.x86_64
[2021-07-09T14:32:13.333Z]   OS Image:                               Red Hat Enterprise Linux Server 7.9 (Maipo)
[2021-07-09T14:32:13.333Z]   Operating System:                       linux
[2021-07-09T14:32:13.333Z]   Architecture:                           amd64
[2021-07-09T14:32:13.333Z]   Container Runtime Version:              cri-o://1.20.3-7.rhaos4.7.git41925ef.el7
[2021-07-09T14:32:13.333Z]   Kubelet Version:                        v1.20.0+bd7b30d
[2021-07-09T14:32:13.333Z]   Kube-Proxy Version:                     v1.20.0+bd7b30d
[2021-07-09T14:32:13.333Z] ProviderID:                               vsphere://422caa97-50b4-a94c-cc9e-4675c05c9d09
[2021-07-09T14:32:13.333Z] Non-terminated Pods:                      (13 in total)
[2021-07-09T14:32:13.333Z]   Namespace                               Name                                          CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
[2021-07-09T14:32:13.333Z]   ---------                               ----                                          ------------  ----------  ---------------  -------------  ---
[2021-07-09T14:32:13.333Z]   openshift-cluster-node-tuning-operator  tuned-6bhjp                                   10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         154m
[2021-07-09T14:32:13.333Z]   openshift-dns                           dns-default-bwzjf                             65m (0%)      0 (0%)      131Mi (0%)       0 (0%)         141m
[2021-07-09T14:32:13.333Z]   openshift-image-registry                node-ca-lh9br                                 10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         153m
[2021-07-09T14:32:13.333Z]   openshift-ingress-canary                ingress-canary-tzfts                          10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         154m
[2021-07-09T14:32:13.334Z]   openshift-machine-config-operator       machine-config-daemon-gb94s                   40m (0%)      0 (0%)      100Mi (0%)       0 (0%)         119m
[2021-07-09T14:32:13.334Z]   openshift-monitoring                    node-exporter-hllwn                           9m (0%)       0 (0%)      210Mi (1%)       0 (0%)         154m
[2021-07-09T14:32:13.334Z]   openshift-multus                        multus-crksd                                  10m (0%)      0 (0%)      150Mi (1%)       0 (0%)         145m
[2021-07-09T14:32:13.334Z]   openshift-multus                        network-metrics-daemon-26prg                  20m (0%)      0 (0%)      120Mi (0%)       0 (0%)         151m
[2021-07-09T14:32:13.334Z]   openshift-network-diagnostics           network-check-target-z77nd                    10m (0%)      0 (0%)      15Mi (0%)        0 (0%)         148m
[2021-07-09T14:32:13.334Z]   openshift-ovn-kubernetes                ovnkube-node-n6hf5                            30m (0%)      0 (0%)      620Mi (4%)       0 (0%)         148m
[2021-07-09T14:32:13.334Z]   openshift-vsphere-infra                 coredns-rioliu09175612-kgg8w-rhel-1           200m (2%)     0 (0%)      400Mi (2%)       0 (0%)         84m
[2021-07-09T14:32:13.334Z]   openshift-vsphere-infra                 keepalived-rioliu09175612-kgg8w-rhel-1        200m (2%)     0 (0%)      400Mi (2%)       0 (0%)         84m
[2021-07-09T14:32:13.334Z]   openshift-vsphere-infra                 mdns-publisher-rioliu09175612-kgg8w-rhel-1    100m (1%)     0 (0%)      200Mi (1%)       0 (0%)         84m
[2021-07-09T14:32:13.334Z] Allocated resources:
[2021-07-09T14:32:13.334Z]   (Total limits may be over 100 percent, i.e., overcommitted.)
[2021-07-09T14:32:13.334Z]   Resource           Requests      Limits
[2021-07-09T14:32:13.334Z]   --------           --------      ------
[2021-07-09T14:32:13.334Z]   cpu                714m (9%)     0 (0%)
[2021-07-09T14:32:13.334Z]   memory             2426Mi (16%)  0 (0%)
[2021-07-09T14:32:13.334Z]   ephemeral-storage  0 (0%)        0 (0%)
[2021-07-09T14:32:13.334Z]   hugepages-1Gi      0 (0%)        0 (0%)
[2021-07-09T14:32:13.334Z]   hugepages-2Mi      0 (0%)        0 (0%)
[2021-07-09T14:32:13.334Z] Events:
[2021-07-09T14:32:13.334Z]   Type    Reason              Age   From     Message
[2021-07-09T14:32:13.334Z]   ----    ------              ----  ----     -------
[2021-07-09T14:32:13.334Z]   Normal  NodeNotSchedulable  85m   kubelet  Node rioliu09175612-kgg8w-rhel-1 status is now: NodeNotSchedulable
[2021-07-09T14:32:14.331Z] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[2021-07-09T14:32:14.331Z] 
[2021-07-09T14:32:14.331Z] 
[2021-07-09T14:32:14.331Z] print detail msg for co(AVAILABLE != True or PROGRESSING!=False or DEGRADED!=False or version != target_version) if exist:
[2021-07-09T14:32:14.331Z] 
[2021-07-09T14:32:14.331Z] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Abnormal co details~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[2021-07-09T14:32:14.331Z] 
[2021-07-09T14:32:14.331Z] 
[2021-07-09T14:32:14.975Z] ~~~~~~~~
[2021-07-09T14:32:14.975Z]  machine-config
[2021-07-09T14:32:14.975Z] monitoring
[2021-07-09T14:32:14.975Z] network 
[2021-07-09T14:32:14.975Z] ~~~~~~~~
[2021-07-09T14:32:14.975Z] 
[2021-07-09T14:32:15.931Z] #### Quick diagnosis: The first abnormal cluster operator is often the culprit! ####
[2021-07-09T14:32:15.931Z] => Below status and logs for the different conditions of all abnormal cos are sorted by 'lastTransitionTime':
[2021-07-09T14:32:15.931Z] 2021-07-09T10:11:18Z [network] ManagementStateDegraded False - -
[2021-07-09T14:32:15.931Z] 2021-07-09T10:11:18Z [network] Upgradeable True - -
[2021-07-09T14:32:15.931Z] 2021-07-09T10:14:08Z [network] Available True - -
[2021-07-09T14:32:15.931Z] 2021-07-09T10:15:03Z [machine-config] Upgradeable True AsExpected -
[2021-07-09T14:32:15.931Z] 2021-07-09T12:10:10Z [machine-config] -->>Available False<<-- - Cluster -->>not available<<-- for 4.7.19
[2021-07-09T14:32:15.931Z] 2021-07-09T12:12:09Z [machine-config] -->>Progressing True<<-- - Working towards 4.7.19
[2021-07-09T14:32:15.931Z] 2021-07-09T13:08:44Z [network] -->>Progressing True<<-- Deploying DaemonSet "openshift-multus/multus" is -->>not available<<-- (awaiting 1 nodes)
[2021-07-09T14:32:15.931Z] 2021-07-09T13:16:21Z [monitoring] -->>Available False<<-- - -
[2021-07-09T14:32:15.931Z] 2021-07-09T13:16:36Z [machine-config] -->>Degraded True<<-- MachineConfigDaemonFailed Unable to apply 4.7.19: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 7, updated: 7, ready: 6, unavailable: 1)
[2021-07-09T14:32:15.931Z] 2021-07-09T13:29:43Z [monitoring] -->>Degraded True<<-- UpdatingnodeExporterFailed Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: got 1 unavailable nodes
[2021-07-09T14:32:15.931Z] 2021-07-09T13:35:15Z [network] -->>Degraded True<<-- RolloutHung DaemonSet "openshift-multus/multus" rollout is not making progress - last change 2021-07-09T13:24:42Z
[2021-07-09T14:32:15.931Z] 2021-07-09T14:29:45Z [monitoring] -->>Progressing True<<-- RollOutInProgress Rolling out the stack.
[2021-07-09T14:32:15.931Z] 2021-07-09T14:29:45Z [monitoring] Upgradeable True RollOutInProgress Rollout of the monitoring stack is in progress. Please wait until it finishes.
[2021-07-09T14:32:15.931Z] DaemonSet "openshift-multus/network-metrics-daemon" is -->>not available<<-- (awaiting 1 nodes)
[2021-07-09T14:32:15.931Z] DaemonSet "openshift-network-diagnostics/network-check-target" is -->>not available<<-- (awaiting 1 nodes)
[2021-07-09T14:32:15.931Z] DaemonSet "openshift-network-diagnostics/network-check-target" rollout is not making progress - last change 2021-07-09T13:25:32Z
[2021-07-09T14:32:15.931Z] DaemonSet "openshift-ovn-kubernetes/ovnkube-node" is -->>not available<<-- (awaiting 1 nodes)
[2021-07-09T14:32:15.931Z] DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2021-07-09T13:24:43Z
[2021-07-09T14:32:15.931Z] 
[2021-07-09T14:32:15.931Z] --------------------------
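For reference, a condition listing sorted by lastTransitionTime like the one above can be reproduced with a jq one-liner along these lines (a minimal sketch, not necessarily the exact command the CI job runs; it assumes jq is available on the client and lists every cluster operator, not only the abnormal ones):

$ oc get clusteroperators -o json | jq -r '
    [ .items[]
      | .metadata.name as $co
      | .status.conditions[]
      | { co: $co, type: .type, status: .status,
          reason: (.reason // "-"), message: (.message // "-"),
          t: .lastTransitionTime } ]
    | sort_by(.t)[]
    | "\(.t) [\(.co)] \(.type) \(.status) \(.reason) \(.message)"'
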
[2021-07-09T14:32:15.931Z] 
[2021-07-09T14:32:17.875Z] Name:         machine-config
[2021-07-09T14:32:17.875Z] Namespace:    
[2021-07-09T14:32:17.875Z] Labels:       <none>
[2021-07-09T14:32:17.875Z] Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
[2021-07-09T14:32:17.875Z]               include.release.openshift.io/self-managed-high-availability: true
[2021-07-09T14:32:17.875Z]               include.release.openshift.io/single-node-developer: true
[2021-07-09T14:32:17.875Z] API Version:  config.openshift.io/v1
[2021-07-09T14:32:17.875Z] Kind:         ClusterOperator
[2021-07-09T14:32:17.875Z] Metadata:
[2021-07-09T14:32:17.875Z]   Creation Timestamp:  2021-07-09T10:08:07Z
[2021-07-09T14:32:17.876Z]   Generation:          1
[2021-07-09T14:32:17.876Z]   Managed Fields:
[2021-07-09T14:32:17.876Z]     API Version:  config.openshift.io/v1
[2021-07-09T14:32:17.876Z]     Fields Type:  FieldsV1
[2021-07-09T14:32:17.876Z]     fieldsV1:
[2021-07-09T14:32:17.876Z]       f:metadata:
[2021-07-09T14:32:17.876Z]         f:annotations:
[2021-07-09T14:32:17.876Z]           .:
[2021-07-09T14:32:17.876Z]           f:exclude.release.openshift.io/internal-openshift-hosted:
[2021-07-09T14:32:17.876Z]           f:include.release.openshift.io/self-managed-high-availability:
[2021-07-09T14:32:17.876Z]           f:include.release.openshift.io/single-node-developer:
[2021-07-09T14:32:17.876Z]       f:spec:
[2021-07-09T14:32:17.876Z]       f:status:
[2021-07-09T14:32:17.876Z]         .:
[2021-07-09T14:32:17.876Z]         f:relatedObjects:
[2021-07-09T14:32:17.876Z]         f:versions:
[2021-07-09T14:32:17.876Z]     Manager:      cluster-version-operator
[2021-07-09T14:32:17.876Z]     Operation:    Update
[2021-07-09T14:32:17.876Z]     Time:         2021-07-09T10:08:07Z
[2021-07-09T14:32:17.876Z]     API Version:  config.openshift.io/v1
[2021-07-09T14:32:17.876Z]     Fields Type:  FieldsV1
[2021-07-09T14:32:17.876Z]     fieldsV1:
[2021-07-09T14:32:17.876Z]       f:status:
[2021-07-09T14:32:17.876Z]         f:conditions:
[2021-07-09T14:32:17.876Z]         f:extension:
[2021-07-09T14:32:17.876Z]           .:
[2021-07-09T14:32:17.876Z]           f:master:
[2021-07-09T14:32:17.876Z]           f:worker:
[2021-07-09T14:32:17.876Z]         f:relatedObjects:
[2021-07-09T14:32:17.876Z]         f:versions:
[2021-07-09T14:32:17.876Z]     Manager:         machine-config-operator
[2021-07-09T14:32:17.876Z]     Operation:       Update
[2021-07-09T14:32:17.876Z]     Time:            2021-07-09T13:16:36Z
[2021-07-09T14:32:17.876Z]   Resource Version:  124620
[2021-07-09T14:32:17.876Z]   Self Link:         /apis/config.openshift.io/v1/clusteroperators/machine-config
[2021-07-09T14:32:17.876Z]   UID:               9b848845-2913-4f1b-8348-0672b17c87c9
[2021-07-09T14:32:17.876Z] Spec:
[2021-07-09T14:32:17.876Z] Status:
[2021-07-09T14:32:17.876Z]   Conditions:
[2021-07-09T14:32:17.876Z]     Last Transition Time:  2021-07-09T12:12:09Z
[2021-07-09T14:32:17.876Z]     Message:               Working towards 4.7.19
[2021-07-09T14:32:17.876Z]     Status:                True
[2021-07-09T14:32:17.876Z]     Type:                  Progressing
[2021-07-09T14:32:17.876Z]     Last Transition Time:  2021-07-09T13:16:36Z
[2021-07-09T14:32:17.876Z]     Message:               Unable to apply 4.7.19: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 7, updated: 7, ready: 6, unavailable: 1)
[2021-07-09T14:32:17.876Z]     Reason:                MachineConfigDaemonFailed
[2021-07-09T14:32:17.876Z]     Status:                True
[2021-07-09T14:32:17.876Z]     Type:                  Degraded
[2021-07-09T14:32:17.876Z]     Last Transition Time:  2021-07-09T12:10:10Z
[2021-07-09T14:32:17.876Z]     Message:               Cluster not available for 4.7.19
[2021-07-09T14:32:17.876Z]     Status:                False
[2021-07-09T14:32:17.876Z]     Type:                  Available
[2021-07-09T14:32:17.876Z]     Last Transition Time:  2021-07-09T10:15:03Z
[2021-07-09T14:32:17.876Z]     Reason:                AsExpected
[2021-07-09T14:32:17.876Z]     Status:                True
[2021-07-09T14:32:17.876Z]     Type:                  Upgradeable
[2021-07-09T14:32:17.876Z]   Extension:
[2021-07-09T14:32:17.876Z]     Master:  all 3 nodes are at latest configuration rendered-master-e327977b6a5b999cef4cbfbb711a6e90
[2021-07-09T14:32:17.876Z]     Worker:  0 (ready 0) out of 4 nodes are updating to latest configuration rendered-worker-04fc66aec510739dc2cb7bc4696f80c0
[2021-07-09T14:32:17.876Z]   Related Objects:
[2021-07-09T14:32:17.876Z]     Group:     
[2021-07-09T14:32:17.876Z]     Name:      openshift-machine-config-operator
[2021-07-09T14:32:17.876Z]     Resource:  namespaces
[2021-07-09T14:32:17.876Z]     Group:     machineconfiguration.openshift.io
[2021-07-09T14:32:17.876Z]     Name:      
[2021-07-09T14:32:17.876Z]     Resource:  machineconfigpools
[2021-07-09T14:32:17.876Z]     Group:     machineconfiguration.openshift.io
[2021-07-09T14:32:17.876Z]     Name:      
[2021-07-09T14:32:17.876Z]     Resource:  controllerconfigs
[2021-07-09T14:32:17.876Z]     Group:     machineconfiguration.openshift.io
[2021-07-09T14:32:17.876Z]     Name:      
[2021-07-09T14:32:17.876Z]     Resource:  kubeletconfigs
[2021-07-09T14:32:17.876Z]     Group:     machineconfiguration.openshift.io
[2021-07-09T14:32:17.876Z]     Name:      
[2021-07-09T14:32:17.876Z]     Resource:  containerruntimeconfigs
[2021-07-09T14:32:17.876Z]     Group:     machineconfiguration.openshift.io
[2021-07-09T14:32:17.876Z]     Name:      
[2021-07-09T14:32:17.876Z]     Resource:  machineconfigs
[2021-07-09T14:32:17.876Z]     Group:     
[2021-07-09T14:32:17.876Z]     Name:      
[2021-07-09T14:32:17.876Z]     Resource:  nodes
[2021-07-09T14:32:17.876Z]     Group:     
[2021-07-09T14:32:17.876Z]     Name:      openshift-kni-infra
[2021-07-09T14:32:17.876Z]     Resource:  namespaces
[2021-07-09T14:32:17.876Z]     Group:     
[2021-07-09T14:32:17.876Z]     Name:      openshift-openstack-infra
[2021-07-09T14:32:17.876Z]     Resource:  namespaces
[2021-07-09T14:32:17.876Z]     Group:     
[2021-07-09T14:32:17.876Z]     Name:      openshift-ovirt-infra
[2021-07-09T14:32:17.876Z]     Resource:  namespaces
[2021-07-09T14:32:17.876Z]     Group:     
[2021-07-09T14:32:17.876Z]     Name:      openshift-vsphere-infra
[2021-07-09T14:32:17.876Z]     Resource:  namespaces
[2021-07-09T14:32:17.876Z]   Versions:
[2021-07-09T14:32:17.876Z]     Name:     operator
[2021-07-09T14:32:17.876Z]     Version:  4.7.13
[2021-07-09T14:32:17.876Z] Events:       <none>
[2021-07-09T14:32:17.876Z]
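A quick way to see which worker is stuck between the two rendered configurations reported in the Extension block above is to compare the machineconfiguration.openshift.io annotations across the nodes, for example (a sketch using jq; the annotation names are the ones visible in the node details earlier in this log):

$ oc get nodes -o json | jq -r '
    .items[]
    | [ .metadata.name,
        .metadata.annotations["machineconfiguration.openshift.io/currentConfig"],
        .metadata.annotations["machineconfiguration.openshift.io/desiredConfig"],
        .metadata.annotations["machineconfiguration.openshift.io/state"] ]
    | @tsv'

A node whose currentConfig still differs from desiredConfig (as rioliu09175612-kgg8w-rhel-1 does above) is the one the worker pool is waiting on.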

Comment 26 Ke Wang 2021-08-30 09:56:31 UTC
Still hit the same bug when upgrading from 4.7.13-x86_64 to 4.7.0-0.nightly-2021-08-27-190811.

$ oc get co --no-headers | grep -v '.True.*False.*False'
machine-config                             4.7.13                              False   True    True    3h18m
monitoring                                 4.7.0-0.nightly-2021-08-27-190811   False   False   True    164m
network                                    4.7.0-0.nightly-2021-08-27-190811   True    True    True    6h2m

$ oc get no
NAME                                STATUS                        ROLES    AGE     VERSION
kewang30113609-r5nwz-master-0       Ready                         master   6h6m    v1.20.0+4593a24
kewang30113609-r5nwz-master-1       Ready                         master   6h6m    v1.20.0+4593a24
kewang30113609-r5nwz-master-2       Ready                         master   6h6m    v1.20.0+4593a24
kewang30113609-r5nwz-rhel-0         NotReady,SchedulingDisabled   worker   4h52m   v1.20.0+9689d22
kewang30113609-r5nwz-rhel-1         Ready                         worker   4h52m   v1.20.0+9689d22
kewang30113609-r5nwz-worker-5564x   Ready                         worker   5h54m   v1.20.0+df9c838
kewang30113609-r5nwz-worker-s5bqb   Ready                         worker   5h54m   v1.20.0+df9c838

...   
Status:
    Conditions:
      Last Transition Time:  2021-08-30T06:34:33Z
      Message:               Working towards 4.7.0-0.nightly-2021-08-27-190811
      Status:                True              
      Type:                  Progressing       
      Last Transition Time:  2021-08-30T07:26:36Z
      Message:               Unable to apply 4.7.0-0.nightly-2021-08-27-190811: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 7, updated: 7, ready: 6, unavailable: 1)
      Reason:                MachineConfigDaemonFailed
      Status:                True              
      Type:                  Degraded          
      Last Transition Time:  2021-08-30T06:32:33Z
      Message:               Cluster not available for 4.7.0-0.nightly-2021-08-27-190811
      Status:                False             
      Type:                  Available         
      Last Transition Time:  2021-08-30T03:49:17Z
      Reason:                AsExpected        
      Status:                True              
      Type:                  Upgradeable    
...

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-cb85aa357520053a2b25d779d251aaee   True      False      False      3              3                   3                     0                      6h4m
worker   rendered-worker-aac5d9e7df7359f63e356a255aec21ef   False     True       False      4              0                   0                     0                      6h4m
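
To narrow down which machine-config-daemon pod is holding the rollout back (the DaemonSet status above reports 1 unavailable), something along these lines can be used. This is a sketch, with the node name taken from the oc get no output above and <rhel-node> standing in for however that host is reached over SSH:

$ oc -n openshift-machine-config-operator get pods -o wide | grep machine-config-daemon
$ oc -n openshift-machine-config-operator get pods -o wide --field-selector spec.nodeName=kewang30113609-r5nwz-rhel-0
$ oc adm node-logs kewang30113609-r5nwz-rhel-0 -u kubelet | tail -n 100   # only useful while the kubelet on the node is still reachable
$ ssh <rhel-node> 'sudo journalctl -u kubelet -u crio --no-pager | tail -n 200'   # fallback when the node is NotReady and unreachable through the API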

Comment 32 Ke Wang 2021-10-14 09:02:37 UTC
The upgrade with profile 80_IPI on AWS RHCOS & RHEL7.9 & FIPS on & OVN & Etcd Encryption & http_proxy & STS passed.

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.15    True        False         3h20m   Cluster version is 4.8.15

$ oc get clusterversion -o json|jq ".items[0].status.history" 
[
  {
    "completionTime": "2021-10-14T05:37:21Z",
    "image": "quay.io/openshift-release-dev/ocp-release:4.8.15-x86_64",
    "startedTime": "2021-10-14T03:35:00Z",
    "state": "Completed",
    "verified": false,
    "version": "4.8.15"
  },
  {
    "completionTime": "2021-10-14T01:39:25Z",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:e121118406de30f9a92168808979d9363c1a576401c399bf6f528fb47c48b16c",
    "startedTime": "2021-10-14T01:04:54Z",
    "state": "Completed",
    "verified": false,
    "version": "4.7.34"
  }
]

The last successful upgrade tests:
4.7.0-0.nightly-2021-08-25-185239 -> 4.8.0-0.nightly-2021-08-25-185034 with profile 80_IPI on AWS RHCOS & RHEL7.9 & FIPS on & OVN & Etcd Encryption & http_proxy & STS

astoycos, based on the results above, I think this bug has been resolved.

Comment 33 Andrew Stoycos 2021-10-14 14:14:23 UTC
@kewang Thanks so much for taking the time to reproduce this. :) If you see it arise again, please feel free to reopen this bug.

Thanks, 
Andrew
