Description of problem:

Cluster operator machine-config has not yet successfully rolled out during upgrade from OCP 4.7.0-0.nightly-2021-06-20-093308 to 4.8.0-0.nightly-2021-06-21-175537. This is seen in our QE upgrade CI profile "80_IPI on AWS RHCOS & RHEL7.9 & FIPS on & OVN & Etcd Encryption & http_proxy & STS", where we scale up and add two RHEL 7.9 worker nodes. One RHEL 7.9 worker node is stuck in the Ready,SchedulingDisabled state.

[2021-06-22T06:45:19.608Z] Post action: # oc get node
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-10-0-51-35.us-east-2.compute.internal Ready master 4h27m v1.21.0-rc.0+120883f 10.0.51.35 <none> Red Hat Enterprise Linux CoreOS 48.84.202106211517-0 (Ootpa) 4.18.0-305.3.1.el8_4.x86_64 cri-o://1.21.1-9.rhaos4.8.gitdfcd2b6.el8
ip-10-0-51-86.us-east-2.compute.internal Ready worker 4h18m v1.20.0+87cc9a4 10.0.51.86 <none> Red Hat Enterprise Linux CoreOS 47.83.202106200838-0 (Ootpa) 4.18.0-240.22.1.el8_3.x86_64 cri-o://1.20.3-4.rhaos4.7.gitbaade70.el8
ip-10-0-54-153.us-east-2.compute.internal Ready master 4h27m v1.21.0-rc.0+120883f 10.0.54.153 <none> Red Hat Enterprise Linux CoreOS 48.84.202106211517-0 (Ootpa) 4.18.0-305.3.1.el8_4.x86_64 cri-o://1.21.1-9.rhaos4.8.gitdfcd2b6.el8
ip-10-0-56-175.us-east-2.compute.internal Ready worker 3h6m v1.20.0+87cc9a4 10.0.56.175 <none> Red Hat Enterprise Linux Server 7.9 (Maipo) 3.10.0-1160.31.1.el7.x86_64 cri-o://1.20.3-4.rhaos4.7.gitbaade70.el7
ip-10-0-58-42.us-east-2.compute.internal Ready,SchedulingDisabled worker 3h6m v1.20.0+87cc9a4 10.0.58.42 <none> Red Hat Enterprise Linux Server 7.9 (Maipo) 3.10.0-1160.31.1.el7.x86_64 cri-o://1.20.3-4.rhaos4.7.gitbaade70.el7
ip-10-0-74-190.us-east-2.compute.internal Ready worker 4h18m v1.20.0+87cc9a4 10.0.74.190 <none> Red Hat Enterprise Linux CoreOS 47.83.202106200838-0 (Ootpa) 4.18.0-240.22.1.el8_3.x86_64 cri-o://1.20.3-4.rhaos4.7.gitbaade70.el8
ip-10-0-74-227.us-east-2.compute.internal Ready worker 4h17m v1.20.0+87cc9a4 10.0.74.227 <none> Red Hat Enterprise Linux CoreOS 47.83.202106200838-0 (Ootpa) 4.18.0-240.22.1.el8_3.x86_64 cri-o://1.20.3-4.rhaos4.7.gitbaade70.el8
ip-10-0-76-192.us-east-2.compute.internal Ready master 4h28m v1.21.0-rc.0+120883f 10.0.76.192 <none> Red Hat Enterprise Linux CoreOS 48.84.202106211517-0 (Ootpa) 4.18.0-305.3.1.el8_4.x86_64 cri-o://1.21.1-9.rhaos4.8.gitdfcd2b6.el8

[2021-06-22T06:42:17.846Z] NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.7.0-0.nightly-2021-06-20-093308 True True 3h1m Unable to apply 4.8.0-0.nightly-2021-06-21-175537: the cluster operator machine-config has not yet successfully rolled out.

Version-Release number of selected component (if applicable):
node kube version: v1.20.0+87cc9a4
Red Hat Enterprise Linux CoreOS 47.83.202106200838-0 (Ootpa) 4.18.0-240.22.1.el8_3.x86_64 cri-o://1.20.3-4.rhaos4.7.gitbaade70.el8

How reproducible:
Once in our CI

Steps to Reproduce:
1. Install OCP 4.7.0-0.nightly-2021-06-20-093308 IPI on AWS with FIPS on & OVN & Etcd Encryption & http_proxy & STS
2. Scale up to add 2 RHEL 7.9 worker nodes
3. Upgrade the cluster to 4.8.0-0.nightly-2021-06-21-175537

Actual results:
Machine config operator degraded.

[2021-06-22T06:45:19.608Z] Post action: # oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.8.0-0.nightly-2021-06-21-175537 True False False 99m
baremetal 4.8.0-0.nightly-2021-06-21-175537 True False False 4h25m
cloud-credential 4.8.0-0.nightly-2021-06-21-175537 True False False 4h23m
cluster-autoscaler 4.8.0-0.nightly-2021-06-21-175537 True False False 4h23m
config-operator 4.8.0-0.nightly-2021-06-21-175537 True False False 4h24m
console 4.8.0-0.nightly-2021-06-21-175537 True False False 101m
csi-snapshot-controller 4.8.0-0.nightly-2021-06-21-175537 True False False 4h24m
dns 4.8.0-0.nightly-2021-06-21-175537 True False False 132m
etcd 4.8.0-0.nightly-2021-06-21-175537 True False False 4h22m
image-registry 4.8.0-0.nightly-2021-06-21-175537 True False False 4h17m
ingress 4.8.0-0.nightly-2021-06-21-175537 True False False 151m
insights 4.8.0-0.nightly-2021-06-21-175537 True False False 4h18m
kube-apiserver 4.8.0-0.nightly-2021-06-21-175537 True False False 4h22m
kube-controller-manager 4.8.0-0.nightly-2021-06-21-175537 True False False 4h22m
kube-scheduler 4.8.0-0.nightly-2021-06-21-175537 True False False 4h22m
kube-storage-version-migrator 4.8.0-0.nightly-2021-06-21-175537 True False False 102m
machine-api 4.8.0-0.nightly-2021-06-21-175537 True False False 4h19m
machine-approver 4.8.0-0.nightly-2021-06-21-175537 True False False 4h24m
machine-config 4.7.0-0.nightly-2021-06-20-093308 False True True 111m
marketplace 4.8.0-0.nightly-2021-06-21-175537 True False False 4h23m
monitoring 4.8.0-0.nightly-2021-06-21-175537 True False False 107m
network 4.8.0-0.nightly-2021-06-21-175537 True False False 4h24m
node-tuning 4.8.0-0.nightly-2021-06-21-175537 True False False 151m
openshift-apiserver 4.8.0-0.nightly-2021-06-21-175537 True False False 100m
openshift-controller-manager 4.8.0-0.nightly-2021-06-21-175537 True False False 150m
openshift-samples 4.8.0-0.nightly-2021-06-21-175537 True False False 151m
operator-lifecycle-manager 4.8.0-0.nightly-2021-06-21-175537 True False False 4h24m
operator-lifecycle-manager-catalog 4.8.0-0.nightly-2021-06-21-175537 True False False 4h24m
operator-lifecycle-manager-packageserver 4.8.0-0.nightly-2021-06-21-175537 True False False 133m
service-ca 4.8.0-0.nightly-2021-06-21-175537 True False False 4h24m
storage 4.8.0-0.nightly-2021-06-21-175537 True False False 102m

Expected results:
All cluster operators available, not progressing, and not degraded after the upgrade.

Additional info:
Link to must-gather tarball will be in the next private comment.
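For triaging runs like this, the unhealthy operators can be pulled out of a saved `oc get co --no-headers` snapshot with a one-liner. This is only a sketch: the snapshot path and the three abbreviated sample rows are my own, not part of the CI job.

```shell
# Sketch: list cluster operators that are not in the healthy
# Available=True / Progressing=False / Degraded=False state.
# /tmp/co-snapshot.txt stands in for `oc get co --no-headers` output;
# the sample rows below are abbreviated from the report above.
cat > /tmp/co-snapshot.txt <<'EOF'
authentication 4.8.0-0.nightly-2021-06-21-175537 True False False 99m
machine-config 4.7.0-0.nightly-2021-06-20-093308 False True True 111m
storage 4.8.0-0.nightly-2021-06-21-175537 True False False 102m
EOF
# Columns: NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
awk '$3 != "True" || $4 != "False" || $5 != "False" { print $1, $2 }' /tmp/co-snapshot.txt
# prints: machine-config 4.7.0-0.nightly-2021-06-20-093308
```

Against a live cluster the same filter would be `oc get co --no-headers | awk '...'`.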
I can reproduce it. Checking the MCP status, the worker pool is degraded.

# oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-523fde35bd682b1ca95b43ea123ee039 True False False 3 3 3 0 20h
worker rendered-worker-631a04c93d1348113465fd9c4127892b False True True 5 2 2 1 20h

Checking the worker pool message, I found that node ip-10-0-48-245.us-east-2.compute.internal is reporting an SSH access error.

# oc get mcp worker -oyaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: "2021-06-22T06:27:00Z"
  generation: 3
  labels:
    machineconfiguration.openshift.io/mco-built-in: ""
    pools.operator.machineconfiguration.openshift.io/worker: ""
  managedFields:
  - apiVersion: machineconfiguration.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:machineconfiguration.openshift.io/mco-built-in: {}
          f:pools.operator.machineconfiguration.openshift.io/worker: {}
      f:spec:
        .: {}
        f:configuration:
          .: {}
          f:source: {}
        f:machineConfigSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:machineconfiguration.openshift.io/role: {}
        f:nodeSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:node-role.kubernetes.io/worker: {}
        f:paused: {}
    manager: machine-config-operator
    operation: Update
    time: "2021-06-22T06:27:00Z"
  - apiVersion: machineconfiguration.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:configuration:
          f:name: {}
          f:source: {}
      f:status:
        .: {}
        f:conditions: {}
        f:configuration:
          .: {}
          f:name: {}
          f:source: {}
        f:degradedMachineCount: {}
        f:machineCount: {}
        f:observedGeneration: {}
        f:readyMachineCount: {}
        f:unavailableMachineCount: {}
        f:updatedMachineCount: {}
    manager: machine-config-controller
    operation: Update
    time: "2021-06-22T06:28:38Z"
  name: worker
  resourceVersion: "106385"
  uid: 7a455bab-c429-47e1-a4db-795d3c5b6b8c
spec:
  configuration:
    name: rendered-worker-884c73113dfa09e7755a2ce6ccbb7fd3
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-fips
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
  paused: false
status:
  conditions:
  - lastTransitionTime: "2021-06-22T08:44:49Z"
    message: ""
    reason: ""
    status: "False"
    type: Updated
  - lastTransitionTime: "2021-06-22T08:44:49Z"
    message: All nodes are updating to rendered-worker-631a04c93d1348113465fd9c4127892b
    reason: ""
    status: "True"
    type: Updating
  - lastTransitionTime: "2021-06-22T09:02:22Z"
    message: ""
    reason: ""
    status: "False"
    type: RenderDegraded
  - lastTransitionTime: "2021-06-22T09:05:25Z"
    message: 'Node ip-10-0-48-245.us-east-2.compute.internal is reporting: "error detecting previous SSH accesses: error: cannot apply annotation for SSH access due to: unable to update node \"&Node{ObjectMeta:{ 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,},}\": unexpected error when reading response body. Please retry. Original error: http2: client connection lost"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
  - lastTransitionTime: "2021-06-22T09:05:25Z"
    message: ""
    reason: ""
    status: "True"
    type: Degraded
  configuration:
    name: rendered-worker-631a04c93d1348113465fd9c4127892b
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-fips
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
  degradedMachineCount: 1
  machineCount: 5
  observedGeneration: 3
  readyMachineCount: 2
  unavailableMachineCount: 1
  updatedMachineCount: 2

# oc logs -n openshift-machine-config-operator pod/machine-config-daemon-7999p -c machine-config-daemon | tail
W0622 08:55:51.521339 3355 reflector.go:436] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: watch of *v1.MachineConfig ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0622 08:55:51.521370 3355 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0622 09:02:38.735300 3355 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0622 09:02:38.735343 3355 reflector.go:436] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: watch of *v1.MachineConfig ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
I0622 09:03:09.701297 3355 trace.go:205] Trace[774965466]: "Reflector ListAndWatch" name:github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101 (22-Jun-2021 09:02:39.699) (total time: 30002ms):
Trace[774965466]: [30.002098681s] [30.002098681s] END
E0622 09:03:09.701425 3355 reflector.go:138] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.MachineConfig: failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?resourceVersion=98573": dial tcp 172.30.0.1:443: i/o timeout
I0622 09:03:09.711269 3355 trace.go:205] Trace[1852186258]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:134 (22-Jun-2021 09:02:39.710) (total time: 30000ms):
Trace[1852186258]: [30.000475692s] [30.000475692s] END
E0622 09:03:09.711370 3355 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://172.30.0.1:443/api/v1/nodes?resourceVersion=98506": dial tcp 172.30.0.1:443: i/o timeout
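The watch failures above all share the "http2: client connection lost" signature, so one quick check is whether they cluster in time (for example around DHCP lease renewals). A hedged sketch over a saved MCD log; the file name and the abbreviated sample lines are mine, not from the must-gather.

```shell
# Sketch: count "http2: client connection lost" watch failures per minute in a
# saved machine-config-daemon log, to see whether the drops come in bursts.
cat > /tmp/mcd.log <<'EOF'
W0622 08:55:51.521339 3355 reflector.go:436] watch of *v1.MachineConfig ended with: http2: client connection lost
W0622 08:55:51.521370 3355 reflector.go:436] watch of *v1.Node ended with: http2: client connection lost
W0622 09:02:38.735300 3355 reflector.go:436] watch of *v1.Node ended with: http2: client connection lost
EOF
# Field 2 of the klog line is the HH:MM:SS.ffffff timestamp; keep HH:MM and tally.
grep 'client connection lost' /tmp/mcd.log |
  awk '{ print substr($2, 1, 5) }' | sort | uniq -c
```

In this report's logs the drops came in two bursts (08:55 and 09:02), which is what prompted the DHCP question below.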
Per comment #3, QE can still reproduce it. Setting blocker? to request a re-check on whether this is a release blocker.
Looking at the provided journal logs, there are a number of odd errors early on in kubelet/cri-o, and a few MCD container restarts, but the node eventually does get to the point where it's having trouble: the MCD tries the update, fails, and loops on that. Example journal snippet:

Jun 22 14:18:57 ip-10-0-48-245.us-east-2.compute.internal root[491146]: machine-config-daemon[74888]: Starting update from rendered-worker-631a04c93d1348113465fd9c4127892b to rendered-worker-884c73113dfa09e7755a2ce6ccbb7fd3: &{osUpdate:true kargs:false fips:false passwd:false files:true units:true kernelType:false extensions:false}
Jun 22 14:18:57 ip-10-0-48-245.us-east-2.compute.internal root[491147]: machine-config-daemon[74888]: Node has been successfully cordoned
Jun 22 14:18:57 ip-10-0-48-245.us-east-2.compute.internal root[491148]: machine-config-daemon[74888]: Update prepared; beginning drain
Jun 22 14:19:54 ip-10-0-48-245.us-east-2.compute.internal crio[2654]: time="2021-06-22 14:19:54.742723030Z" level=info msg="Checking image status: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4a2f77cfc83e981b62edb84b58c6520a09b6ea11e91e635ae91b5d56cf33873a" id=69b0d967-0adb-4902-a53e-871534d71f14 name=/runtime.v1alpha2.ImageService/ImageStatus
Jun 22 14:19:54 ip-10-0-48-245.us-east-2.compute.internal crio[2654]: time="2021-06-22 14:19:54.744083065Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:b662cc564429008b0b899bcd8f7d2ea400057469f40a595bb39c41ec2c3e3b1b,RepoTags:[],RepoDigests:[quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4a2f77cfc83e981b62edb84b58c6520a09b6ea11e91e635ae91b5d56cf33873a],Size_:479281995,Uid:nil,Username:,Spec:nil,},Info:map[string]string{},}" id=69b0d967-0adb-4902-a53e-871534d71f14 name=/runtime.v1alpha2.ImageService/ImageStatus
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal dhclient[2233]: DHCPREQUEST on br-ex to 10.0.48.1 port 67 (xid=0x5ea02204)
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal dhclient[2233]: DHCPACK from 10.0.48.1 (xid=0x5ea02204)
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal NetworkManager[1357]: <info> [1624371800.1782] dhcp4 (br-ex): address 10.0.48.245
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal NetworkManager[1357]: <info> [1624371800.1782] dhcp4 (br-ex): plen 20 (255.255.240.0)
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal NetworkManager[1357]: <info> [1624371800.1782] dhcp4 (br-ex): gateway 10.0.48.1
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal NetworkManager[1357]: <info> [1624371800.1783] dhcp4 (br-ex): lease time 3600
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal NetworkManager[1357]: <info> [1624371800.1783] dhcp4 (br-ex): hostname 'ip-10-0-48-245'
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal NetworkManager[1357]: <info> [1624371800.1783] dhcp4 (br-ex): nameserver '10.0.0.2'
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal NetworkManager[1357]: <info> [1624371800.1783] dhcp4 (br-ex): domain name 'us-east-2.compute.internal'
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal NetworkManager[1357]: <info> [1624371800.1783] dhcp4 (br-ex): state changed bound -> bound
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal dbus[1104]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal dhclient[2233]: bound to 10.0.48.245 -- renewal in 1783 seconds.
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal systemd[1]: Starting Network Manager Script Dispatcher Service...
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal dbus[1104]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal systemd[1]: Started Network Manager Script Dispatcher Service.
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal nm-dispatcher[496358]: req:1 'dhcp4-change' [br-ex]: new request (6 scripts)
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal nm-dispatcher[496358]: req:1 'dhcp4-change' [br-ex]: start running ordered scripts...
Jun 22 14:24:54 ip-10-0-48-245.us-east-2.compute.internal crio[2654]: time="2021-06-22 14:24:54.748633915Z" level=info msg="Checking image status: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4a2f77cfc83e981b62edb84b58c6520a09b6ea11e91e635ae91b5d56cf33873a" id=405c4a76-b4cb-4eb3-baaf-d82263bef16f name=/runtime.v1alpha2.ImageService/ImageStatus
Jun 22 14:24:54 ip-10-0-48-245.us-east-2.compute.internal crio[2654]: time="2021-06-22 14:24:54.749859655Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:b662cc564429008b0b899bcd8f7d2ea400057469f40a595bb39c41ec2c3e3b1b,RepoTags:[],RepoDigests:[quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4a2f77cfc83e981b62edb84b58c6520a09b6ea11e91e635ae91b5d56cf33873a],Size_:479281995,Uid:nil,Username:,Spec:nil,},Info:map[string]string{},}" id=405c4a76-b4cb-4eb3-baaf-d82263bef16f name=/runtime.v1alpha2.ImageService/ImageStatus

Two things from that:
1. The cri-o logs seem to indicate that we are doing an image inspect. I'm not sure whether the return values look correct, but let's assume they are.
2. The node immediately sees a DHCPREQUEST, which seems abnormal. Is networking on the node unstable? Is that expected?

Looking at the job profile, maybe the OVN team can help take a look? From the MCO's perspective, something else lower in the networking stack is going wrong, so I'm not sure how to debug further.
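To follow up on the DHCP question, the lease traffic can be pulled out of the journal and lined up against the MCD update attempts. This is a sketch over a saved journal excerpt; the file name and trimmed sample lines are mine (on the node itself, `journalctl | grep dhclient` would be the equivalent).

```shell
# Sketch: extract dhclient lease events from a saved journal excerpt. A
# DHCPREQUEST/DHCPACK pair followed by "renewal in ..." is a normal renewal
# for a 3600s lease; the point is to check whether the API-connection drops
# coincide with these renewals or happen independently.
cat > /tmp/journal.log <<'EOF'
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal dhclient[2233]: DHCPREQUEST on br-ex to 10.0.48.1 port 67 (xid=0x5ea02204)
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal dhclient[2233]: DHCPACK from 10.0.48.1 (xid=0x5ea02204)
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal dhclient[2233]: bound to 10.0.48.245 -- renewal in 1783 seconds.
EOF
grep -E 'dhclient.*(DHCPREQUEST|DHCPACK|renewal in)' /tmp/journal.log
```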
This BZ does not reproduce on IPI BM
(In reply to Ori Michaeli from comment #8)
> This BZ does not reproduce on IPI BM

Clarification: This was tested with 4.7.0-0.nightly-2021-06-26-014854, as 4.7.0-0.nightly-2021-06-20-093308 is no longer available.
Updating the test result of upgrading from 4.7.0-0.nightly-2021-06-26-014854 --> 4.8.0-0.nightly-2021-06-29-033219. This time I don't see the error "timed out waiting for the condition during syncRequiredMachineConfigPools: error pool worker is not ready".

# oc get node
NAME STATUS ROLES AGE VERSION
ip-10-0-48-218.us-east-2.compute.internal Ready worker 5h26m v1.21.0-rc.0+766a5fe
ip-10-0-49-65.us-east-2.compute.internal Ready worker 4h15m v1.20.0+87cc9a4
ip-10-0-50-212.us-east-2.compute.internal Ready master 5h39m v1.21.0-rc.0+766a5fe
ip-10-0-54-58.us-east-2.compute.internal Ready worker 5h26m v1.21.0-rc.0+766a5fe
ip-10-0-59-156.us-east-2.compute.internal NotReady,SchedulingDisabled worker 4h15m v1.20.0+87cc9a4
ip-10-0-71-142.us-east-2.compute.internal Ready master 5h39m v1.21.0-rc.0+766a5fe
ip-10-0-72-138.us-east-2.compute.internal Ready master 5h40m v1.21.0-rc.0+766a5fe
ip-10-0-73-231.us-east-2.compute.internal Ready worker 5h26m v1.21.0-rc.0+766a5fe

# oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.8.0-0.nightly-2021-06-29-033219 True False False 125m
baremetal 4.8.0-0.nightly-2021-06-29-033219 True False False 5h36m
cloud-credential 4.8.0-0.nightly-2021-06-29-033219 True False False 5h36m
cluster-autoscaler 4.8.0-0.nightly-2021-06-29-033219 True False False 5h35m
config-operator 4.8.0-0.nightly-2021-06-29-033219 True False False 5h36m
console 4.8.0-0.nightly-2021-06-29-033219 True False False 127m
csi-snapshot-controller 4.8.0-0.nightly-2021-06-29-033219 True False False 5h31m
dns 4.8.0-0.nightly-2021-06-29-033219 True True False 157m
etcd 4.8.0-0.nightly-2021-06-29-033219 True False False 5h35m
image-registry 4.8.0-0.nightly-2021-06-29-033219 True False False 5h25m
ingress 4.8.0-0.nightly-2021-06-29-033219 True False False 3h1m
insights 4.8.0-0.nightly-2021-06-29-033219 True False False 5h29m
kube-apiserver 4.8.0-0.nightly-2021-06-29-033219 True False False 5h34m
kube-controller-manager 4.8.0-0.nightly-2021-06-29-033219 True False False 5h33m
kube-scheduler 4.8.0-0.nightly-2021-06-29-033219 True False False 5h33m
kube-storage-version-migrator 4.8.0-0.nightly-2021-06-29-033219 True False False 129m
machine-api 4.8.0-0.nightly-2021-06-29-033219 True False False 5h31m
machine-approver 4.8.0-0.nightly-2021-06-29-033219 True False False 5h35m
machine-config 4.7.0-0.nightly-2021-06-26-014854 False True True 138m
marketplace 4.8.0-0.nightly-2021-06-29-033219 True False False 5h35m
monitoring 4.8.0-0.nightly-2021-06-29-033219 False True True 123m
network 4.8.0-0.nightly-2021-06-29-033219 True True True 5h36m
node-tuning 4.8.0-0.nightly-2021-06-29-033219 True False False 3h
openshift-apiserver 4.8.0-0.nightly-2021-06-29-033219 True False False 125m
openshift-controller-manager 4.8.0-0.nightly-2021-06-29-033219 True False False 3h1m
openshift-samples 4.8.0-0.nightly-2021-06-29-033219 True False False 3h1m
operator-lifecycle-manager 4.8.0-0.nightly-2021-06-29-033219 True False False 5h36m
operator-lifecycle-manager-catalog 4.8.0-0.nightly-2021-06-29-033219 True False False 5h36m
operator-lifecycle-manager-packageserver 4.8.0-0.nightly-2021-06-29-033219 True False False 145m
service-ca 4.8.0-0.nightly-2021-06-29-033219 True False False 5h36m
storage 4.8.0-0.nightly-2021-06-29-033219 True True False 137m

# oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-165f673837a072be255e8838537b90c6 True False False 3 3 3 0 5h37m
worker rendered-worker-6e0b285ba602438a84b3750e65ea3697 False True False 5 3 3 0 5h37m

# oc describe co machine-config
Name:         machine-config
Namespace:
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2021-06-30T02:56:33Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
          f:include.release.openshift.io/self-managed-high-availability:
          f:include.release.openshift.io/single-node-developer:
      f:spec:
      f:status:
        .:
        f:relatedObjects:
        f:versions:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2021-06-30T02:56:33Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:extension:
          .:
          f:master:
          f:worker:
        f:relatedObjects:
        f:versions:
    Manager:         machine-config-operator
    Operation:       Update
    Time:            2021-06-30T06:31:07Z
  Resource Version:  178512
  UID:               29895b49-21dd-4dce-8622-1d8c77fefe55
Spec:
Status:
  Conditions:
    Last Transition Time:  2021-06-30T06:12:03Z
    Message:               Working towards 4.8.0-0.nightly-2021-06-29-033219
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2021-06-30T06:16:56Z
    Message:               One or more machine config pools are updating, please see `oc get mcp` for further details
    Reason:                PoolUpdating
    Status:                False
    Type:                  Upgradeable
    Last Transition Time:  2021-06-30T06:43:35Z
    Message:               Unable to apply 4.8.0-0.nightly-2021-06-29-033219: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 8, updated: 8, ready: 7, unavailable: 1)
    Reason:                MachineConfigDaemonFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-06-30T06:26:55Z
    Message:               Cluster not available for 4.8.0-0.nightly-2021-06-29-033219
    Status:                False
    Type:                  Available
  Extension:
    Master:  all 3 nodes are at latest configuration rendered-master-165f673837a072be255e8838537b90c6
    Worker:  3 (ready 3) out of 5 nodes are updating to latest configuration rendered-worker-63fe24505cf686e519c81db70ec61ca0
  Related Objects:
    Group:
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  controllerconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  kubeletconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  containerruntimeconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigs
    Group:
    Name:
    Resource:  nodes
    Group:
    Name:      openshift-kni-infra
    Resource:  namespaces
    Group:
    Name:      openshift-openstack-infra
    Resource:  namespaces
    Group:
    Name:      openshift-ovirt-infra
    Resource:  namespaces
    Group:
    Name:      openshift-vsphere-infra
    Resource:  namespaces
  Versions:
    Name:     operator
    Version:  4.7.0-0.nightly-2021-06-26-014854
Events:  <none>

# oc describe node ip-10-0-59-156.us-east-2.compute.internal
Name:               ip-10-0-59-156.us-east-2.compute.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m4.xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-2
                    failure-domain.beta.kubernetes.io/zone=us-east-2a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-59-156.us-east-2.compute.internal
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=m4.xlarge
                    node.openshift.io/os_id=rhel
                    topology.ebs.csi.aws.com/zone=us-east-2a
                    topology.kubernetes.io/region=us-east-2
                    topology.kubernetes.io/zone=us-east-2a
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0f79156d446d6eaaf"}
                    k8s.ovn.org/host-addresses: ["10.0.59.156"]
                    k8s.ovn.org/l3-gateway-config: {"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-59-156.us-east-2.compute.internal","mac-address":"02:62:91:75:7d:44","ip-address...
                    k8s.ovn.org/node-chassis-id: afae5859-452c-46c0-805c-50531fff6957
                    k8s.ovn.org/node-local-nat-ip: {"default":["169.254.1.48"]}
                    k8s.ovn.org/node-mgmt-port-mac-address: e6:22:87:9d:74:ca
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.59.156/20"}
                    k8s.ovn.org/node-subnets: {"default":"10.130.2.0/23"}
                    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-6e0b285ba602438a84b3750e65ea3697
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-63fe24505cf686e519c81db70ec61ca0
                    machineconfiguration.openshift.io/ssh: accessed
                    machineconfiguration.openshift.io/state: Working
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 30 Jun 2021 00:28:49 -0400
Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule
                    node.kubernetes.io/unschedulable:NoSchedule
Unschedulable:      true
Lease:
  HolderIdentity:  ip-10-0-59-156.us-east-2.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Wed, 30 Jun 2021 02:32:06 -0400
Conditions:
  Type             Status   LastHeartbeatTime                 LastTransitionTime                Reason             Message
  ----             ------   -----------------                 ------------------                ------             -------
  MemoryPressure   Unknown  Wed, 30 Jun 2021 02:28:16 -0400   Wed, 30 Jun 2021 02:32:51 -0400   NodeStatusUnknown  Kubelet stopped posting node status.
  DiskPressure     Unknown  Wed, 30 Jun 2021 02:28:16 -0400   Wed, 30 Jun 2021 02:32:51 -0400   NodeStatusUnknown  Kubelet stopped posting node status.
  PIDPressure      Unknown  Wed, 30 Jun 2021 02:28:16 -0400   Wed, 30 Jun 2021 02:32:51 -0400   NodeStatusUnknown  Kubelet stopped posting node status.
  Ready            Unknown  Wed, 30 Jun 2021 02:28:16 -0400   Wed, 30 Jun 2021 02:32:51 -0400   NodeStatusUnknown  Kubelet stopped posting node status.
Addresses:
  InternalIP:   10.0.59.156
  Hostname:     ip-10-0-59-156.us-east-2.compute.internal
  InternalDNS:  ip-10-0-59-156.us-east-2.compute.internal
Capacity:
  attachable-volumes-aws-ebs:  39
  cpu:                         4
  ephemeral-storage:           31444972Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      16264956Ki
  pods:                        250
Allocatable:
  attachable-volumes-aws-ebs:  39
  cpu:                         3500m
  ephemeral-storage:           27905944324
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      15113980Ki
  pods:                        250
System Info:
  Machine ID:                 d14ea99a3d754419bdfdfa01ab77bf07
  System UUID:                EC20084D-9DCB-C128-04CE-35C2042522E3
  Boot ID:                    deb7c00f-4ed0-4973-926e-a7c62936964c
  Kernel Version:             3.10.0-1160.31.1.el7.x86_64
  OS Image:                   Red Hat Enterprise Linux Server 7.9 (Maipo)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.20.3-6.rhaos4.7.git0d0f863.el7
  Kubelet Version:            v1.20.0+87cc9a4
  Kube-Proxy Version:         v1.20.0+87cc9a4
ProviderID:  aws:///us-east-2a/i-0f79156d446d6eaaf
Non-terminated Pods:  (13 in total)
  Namespace                               Name                                 CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                               ----                                 ------------  ----------  ---------------  -------------  ---
  openshift-cluster-csi-drivers           aws-ebs-csi-driver-node-6kgh4        30m (0%)      0 (0%)      150Mi (1%)       0 (0%)         3h8m
  openshift-cluster-node-tuning-operator  tuned-txllf                          10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         3h9m
  openshift-dns                           dns-default-724lv                    60m (1%)      0 (0%)      110Mi (0%)       0 (0%)         165m
  openshift-dns                           node-resolver-ldj8r                  5m (0%)       0 (0%)      21Mi (0%)        0 (0%)         167m
  openshift-image-registry                node-ca-qjn99                        10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         3h10m
  openshift-ingress-canary                ingress-canary-t7vmj                 10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         3h9m
  openshift-machine-config-operator       machine-config-daemon-fs2st          40m (1%)      0 (0%)      100Mi (0%)       0 (0%)         161m
  openshift-monitoring                    node-exporter-fq6qw                  9m (0%)       0 (0%)      47Mi (0%)        0 (0%)         3h10m
  openshift-multus                        multus-additional-cni-plugins-gx6th  10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         3h4m
  openshift-multus                        multus-jb6cx                         10m (0%)      0 (0%)      65Mi (0%)        0 (0%)         179m
  openshift-multus                        network-metrics-daemon-kk5vg         20m (0%)      0 (0%)      120Mi (0%)       0 (0%)         3h4m
  openshift-network-diagnostics           network-check-target-7lq5z           10m (0%)      0 (0%)      15Mi (0%)        0 (0%)         3h3m
  openshift-ovn-kubernetes                ovnkube-node-fj8hz                   40m (1%)      0 (0%)      640Mi (4%)       0 (0%)         3h2m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests    Limits
  --------                    --------    ------
  cpu                         264m (7%)   0 (0%)
  memory                      1358Mi (9%) 0 (0%)
  ephemeral-storage           0 (0%)      0 (0%)
  hugepages-1Gi               0 (0%)      0 (0%)
  hugepages-2Mi               0 (0%)      0 (0%)
  attachable-volumes-aws-ebs  0           0
Events:
  Type    Reason              Age   From     Message
  ----    ------              ----  ----     -------
  Normal  NodeNotSchedulable  143m  kubelet  Node ip-10-0-59-156.us-east-2.compute.internal status is now: NodeNotSchedulable

The kubeconfig file to access the cluster can be found in [1] if you'd like.

[1] https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/27508/artifact/workdir/install-dir/auth/kubeconfig
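Since the node is stuck between two rendered configs, a quick way to confirm that from the MCO annotations (rather than eyeballing `oc describe`) is to compare currentConfig with desiredConfig. This is only a sketch with the two values hard-coded from the output above; with cluster access you would read them live, e.g. via the `oc get node -o jsonpath` command shown in the comment.

```shell
# Sketch: compare the MCO currentConfig/desiredConfig node annotations to
# confirm the node is still mid-update. The values are copied from the
# describe output above; on a live cluster they would come from, e.g.:
#   oc get node <name> -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}'
current="rendered-worker-6e0b285ba602438a84b3750e65ea3697"
desired="rendered-worker-63fe24505cf686e519c81db70ec61ca0"
if [ "$current" != "$desired" ]; then
  echo "node still working towards $desired"
fi
```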
The cluster inspect you provided shows that one of the workers never successfully rejoined the cluster (unfortunately, the cluster itself appears to have been torn down since). This seems to be yet another distinct error case. Next time we see a node stuck in "NotReady,SchedulingDisabled" we may need to ssh into the system directly, since the API is unable to reach the node, which may point to a kubelet failure. There is another proxy + RHEL7 bug ongoing: https://bugzilla.redhat.com/show_bug.cgi?id=1975907. Maybe we can wait until both of the other RHEL bugs are settled before we look into this again? Sorry for the repeated trouble, but given the instability of RHEL7 right now it's hard to gauge what is going on.
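For the next occurrence, a quick filter helps list the candidate nodes to ssh into. A minimal sketch, run here against sample rows copied from the `oc get node` output in this report (on a live cluster you would pipe `oc get node --no-headers` instead of the heredoc):

```shell
# Column 2 of `oc get node --no-headers` is STATUS; keep any node
# flagged NotReady and/or SchedulingDisabled.
cat <<'EOF' | awk '$2 ~ /NotReady|SchedulingDisabled/ {print $1, $2}'
ip-10-0-51-35.us-east-2.compute.internal   Ready                      master   4h27m   v1.21.0-rc.0+120883f
ip-10-0-56-175.us-east-2.compute.internal  Ready                      worker   3h6m    v1.20.0+87cc9a4
ip-10-0-58-42.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   3h6m    v1.20.0+87cc9a4
EOF
# -> ip-10-0-58-42.us-east-2.compute.internal Ready,SchedulingDisabled
```

Once the node is identified, checking kubelet on the host itself (journalctl, systemctl) is the next step, since the API server cannot reach it.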
Since there are no logs for the RHEL upgrade job, I'm not sure whether this bug is a duplicate of bug 1975907, but based on the must-gather logs it seems they are not. Based on comments 2 and 6, transferring this to the OVN team to have a look. Based on comment 13, I removed the `UpgradeBlocker` keyword.
Hit the same bug when upgrading from 4.7.13-x86_64 to 4.7.19-x86_64 with profile 50_IPI on vSphere 7.0 with RHCOS & RHEL7.9 & FIPS on & OVN & Etcd Encryption on

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.13    True        True          15h     Unable to apply 4.7.19: the cluster operator monitoring has not yet successfully rolled out

$ oc get no
NAME                                STATUS                        ROLES    AGE   VERSION
kewang06173513-zdhhx-master-0       Ready                         master   16h   v1.20.0+87cc9a4
kewang06173513-zdhhx-master-1       Ready                         master   16h   v1.20.0+87cc9a4
kewang06173513-zdhhx-master-2       Ready                         master   16h   v1.20.0+87cc9a4
kewang06173513-zdhhx-rhel-0         NotReady,SchedulingDisabled   worker   15h   v1.20.0+87cc9a4
kewang06173513-zdhhx-rhel-1         Ready                         worker   15h   v1.20.0+87cc9a4
kewang06173513-zdhhx-worker-5m4xf   Ready                         worker   16h   v1.20.0+df9c838
kewang06173513-zdhhx-worker-hhqgw   Ready                         worker   16h   v1.20.0+df9c838

$ oc get co --no-headers | grep -v '.True.*False.*False'
machine-config   4.7.13   False   True   True   14h
monitoring       4.7.19   False   True   True   13h
network          4.7.19   True    True   True   16h

Upgrade Job console output from Jenkins:
...
07-06 14:13:56.229 version 4.7.13 True True 125m Working towards 4.7.19: 7 of 669 done (1% complete)
07-06 14:13:56.229 [Debug] upgrade_ret_1=1,upgrade_ret_2=1
07-06 14:14:54.160 Unable to connect to the server: Service Unavailable
07-06 14:14:54.160 Unable to connect to the server: Service Unavailable
07-06 14:14:54.160 [Debug] upgrade_ret_1=1,upgrade_ret_2=1
...

Checked operator mco, which was the first to report the error, ..
07-06 15:13:38.873 Status:
07-06 15:13:38.873   Conditions:
07-06 15:13:38.873     Last Transition Time:  2021-07-06T13:08:14Z
07-06 15:13:38.873     Message:               Working towards 4.7.19
07-06 15:13:38.873     Status:                True
07-06 15:13:38.873     Type:                  Progressing
07-06 15:13:38.873     Last Transition Time:  2021-07-06T14:12:48Z
07-06 15:13:38.873     Message:               Unable to apply 4.7.19: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 7, updated: 7, ready: 6, unavailable: 1)
07-06 15:13:38.873     Reason:                MachineConfigDaemonFailed

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-a2e8bef20b3410f5d7765df456714e66   True      False      False      3              3                   3                      0                      19h
worker   rendered-worker-2990fdac5c6746d7e6bb90f86292fbad   False     True       False      4              0                   0                      0                      19h
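The `oc get mcp` output above is the key symptom: the worker pool reports 0 of 4 machines updated. A small sketch to flag pools whose updated-machine count lags the machine count, run here on the two sample rows from above (live use would pipe `oc get mcp --no-headers` instead of the heredoc):

```shell
# Columns of `oc get mcp --no-headers`:
# 1=NAME 2=CONFIG 3=UPDATED 4=UPDATING 5=DEGRADED
# 6=MACHINECOUNT 7=READYMACHINECOUNT 8=UPDATEDMACHINECOUNT 9=DEGRADEDMACHINECOUNT 10=AGE
cat <<'EOF' | awk '$8 < $6 {printf "pool %s: %d/%d machines updated\n", $1, $8, $6}'
master  rendered-master-a2e8bef20b3410f5d7765df456714e66  True   False  False  3  3  3  0  19h
worker  rendered-worker-2990fdac5c6746d7e6bb90f86292fbad  False  True   False  4  0  0  0  19h
EOF
# -> pool worker: 0/4 machines updated
```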
Hit the same bug when triaging upgrade CI: original_build=4.7.13-x86_64, target_build=4.7.19-x86_64, profile_name=50_IPI on vSphere 7.0 with RHCOS & RHEL7.9 & FIPS on & OVN & Etcd Encryption on

[2021-07-09T14:28:57.626Z] [Debug] upgrade_ret_1=1,upgrade_ret_2=1
[2021-07-09T14:29:56.570Z] NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
[2021-07-09T14:29:56.570Z] version   4.7.13    True        True          3h2m    Unable to apply 4.7.19: the cluster operator monitoring has not yet successfully rolled out
[2021-07-09T14:29:57.156Z] [Debug] upgrade_ret_1=1,upgrade_ret_2=1
[2021-07-09T14:31:05.151Z] **************Post Action after upgrade fail****************

Post action: #oc get node:
NAME                                STATUS                        ROLES    AGE     VERSION           INTERNAL-IP      EXTERNAL-IP      OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
rioliu09175612-kgg8w-master-0       Ready                         master   4h20m   v1.20.0+87cc9a4   172.31.249.87    172.31.249.87    Red Hat Enterprise Linux CoreOS 47.83.202106252242-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.3-6.rhaos4.7.git0d0f863.el8
rioliu09175612-kgg8w-master-1       Ready                         master   4h20m   v1.20.0+87cc9a4   172.31.249.110   172.31.249.110   Red Hat Enterprise Linux CoreOS 47.83.202106252242-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.3-6.rhaos4.7.git0d0f863.el8
rioliu09175612-kgg8w-master-2       Ready                         master   4h20m   v1.20.0+87cc9a4   172.31.249.7     172.31.249.7     Red Hat Enterprise Linux CoreOS 47.83.202106252242-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.3-6.rhaos4.7.git0d0f863.el8
rioliu09175612-kgg8w-rhel-0         Ready                         worker   3h6m    v1.20.0+bd7b30d   172.31.249.247   172.31.249.247   Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.31.1.el7.x86_64    cri-o://1.20.3-7.rhaos4.7.git41925ef.el7
rioliu09175612-kgg8w-rhel-1         NotReady,SchedulingDisabled   worker   3h6m    v1.20.0+bd7b30d   172.31.249.227   172.31.249.227   Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.31.1.el7.x86_64    cri-o://1.20.3-7.rhaos4.7.git41925ef.el7
rioliu09175612-kgg8w-worker-ptksq   Ready                         worker   4h8m    v1.20.0+df9c838   172.31.249.215   172.31.249.215   Red Hat Enterprise Linux CoreOS 47.83.202105220305-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.2-12.rhaos4.7.git9f7be76.el8
rioliu09175612-kgg8w-worker-qfkr9   Ready                         worker   4h8m    v1.20.0+df9c838   172.31.249.231   172.31.249.231   Red Hat Enterprise Linux CoreOS 47.83.202105220305-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64   cri-o://1.20.2-12.rhaos4.7.git9f7be76.el8

Post action: #oc get co:
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.19    True        False         False      66m
baremetal                                  4.7.19    True        False         False      4h17m
cloud-credential                           4.7.19    True        False         False      4h20m
cluster-autoscaler                         4.7.19    True        False         False      4h15m
config-operator                            4.7.19    True        False         False      4h17m
console                                    4.7.19    True        False         False      73m
csi-snapshot-controller                    4.7.19    True        False         False      65m
dns                                        4.7.19    True        False         False      4h15m
etcd                                       4.7.19    True        False         False      4h15m
image-registry                             4.7.19    True        False         False      3h20m
ingress                                    4.7.19    True        False         False      4h6m
insights                                   4.7.19    True        False         False      4h10m
kube-apiserver                             4.7.19    True        False         False      4h13m
kube-controller-manager                    4.7.19    True        False         False      4h14m
kube-scheduler                             4.7.19    True        False         False      4h14m
kube-storage-version-migrator              4.7.19    True        False         False      68m
machine-api                                4.7.19    True        False         False      4h13m
machine-approver                           4.7.19    True        False         False      4h16m
machine-config                             4.7.13    False       True          True       140m
marketplace                                4.7.19    True        False         False      77m
monitoring                                 4.7.19    False       True          True       74m
network                                    4.7.19    True        True          True       4h16m
node-tuning                                4.7.19    True        False         False      153m
openshift-apiserver                        4.7.19    True        False         False      66m
openshift-controller-manager               4.7.19    True        False         False      151m
openshift-samples                          4.7.19    True        False         False      153m
operator-lifecycle-manager                 4.7.19    True        False         False      4h16m
operator-lifecycle-manager-catalog         4.7.19    True        False         False      4h16m
operator-lifecycle-manager-packageserver   4.7.19    True        False         False      68m
service-ca                                 4.7.19    True        False         False      4h17m
storage                                    4.7.19    True        False         False      73m

print detail msg for node(SchedulingDisabled) if exist:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Abnormal node details~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Name:               rioliu09175612-kgg8w-rhel-1
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=rioliu09175612-kgg8w-rhel-1
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.openshift.io/os_id=rhel
Annotations:        k8s.ovn.org/l3-gateway-config:
                      {"default":{"mode":"shared","interface-id":"br-ex_rioliu09175612-kgg8w-rhel-1","mac-address":"00:50:56:ac:9a:f7","ip-addresses":["172.31.2...
                    k8s.ovn.org/node-chassis-id: 8bc13dc6-e668-49b1-b12b-9ba16655aac1
                    k8s.ovn.org/node-local-nat-ip: {"default":["169.254.15.54"]}
                    k8s.ovn.org/node-mgmt-port-mac-address: 92:30:54:8d:79:42
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"172.31.249.227/23"}
                    k8s.ovn.org/node-subnets: {"default":"10.129.2.0/23"}
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-a557d8bd4e42e3b87302ed6fef0c7626
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-04fc66aec510739dc2cb7bc4696f80c0
                    machineconfiguration.openshift.io/ssh: accessed
                    machineconfiguration.openshift.io/state: Working
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Fri, 09 Jul 2021 11:24:43 +0000
Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule
                    node.kubernetes.io/unschedulable:NoSchedule
Unschedulable:      true
Lease:
  HolderIdentity:  rioliu09175612-kgg8w-rhel-1
  AcquireTime:     <unset>
  RenewTime:       Fri, 09 Jul 2021 13:08:04 +0000
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Fri, 09 Jul 2021 13:08:03 +0000   Fri, 09 Jul 2021 13:08:44 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Fri, 09 Jul 2021 13:08:03 +0000   Fri, 09 Jul 2021 13:08:44 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Fri, 09 Jul 2021 13:08:03 +0000   Fri, 09 Jul 2021 13:08:44 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Fri, 09 Jul 2021 13:08:03 +0000   Fri, 09 Jul 2021 13:08:44 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
Addresses:
  ExternalIP:  172.31.249.227
  InternalIP:  172.31.249.227
  Hostname:    rioliu09175612-kgg8w-rhel-1
Capacity:
  cpu:                8
  ephemeral-storage:  19066Mi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16265896Ki
  pods:               250
Allocatable:
  cpu:                7500m
  ephemeral-storage:  16919193161
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             15114920Ki
  pods:               250
System Info:
  Machine ID:                 68b4570c8c2f40dfa8c34a353d7aa935
  System UUID:                97AA2C42-B450-4CA9-CC9E-4675C05C9D09
  Boot ID:                    46a37500-282c-4f56-a934-841dd0bfc02c
  Kernel Version:             3.10.0-1160.31.1.el7.x86_64
  OS Image:                   Red Hat Enterprise Linux Server 7.9 (Maipo)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.20.3-7.rhaos4.7.git41925ef.el7
  Kubelet Version:            v1.20.0+bd7b30d
  Kube-Proxy Version:         v1.20.0+bd7b30d
ProviderID:                   vsphere://422caa97-50b4-a94c-cc9e-4675c05c9d09
Non-terminated Pods:          (13 in total)
  Namespace                               Name                                         CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                               ----                                         ------------  ----------  ---------------  -------------  ---
  openshift-cluster-node-tuning-operator  tuned-6bhjp                                  10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         154m
  openshift-dns                           dns-default-bwzjf                            65m (0%)      0 (0%)      131Mi (0%)       0 (0%)         141m
  openshift-image-registry                node-ca-lh9br                                10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         153m
  openshift-ingress-canary                ingress-canary-tzfts                         10m (0%)      0 (0%)      20Mi (0%)        0 (0%)         154m
  openshift-machine-config-operator       machine-config-daemon-gb94s                  40m (0%)      0 (0%)      100Mi (0%)       0 (0%)         119m
  openshift-monitoring                    node-exporter-hllwn                          9m (0%)       0 (0%)      210Mi (1%)       0 (0%)         154m
  openshift-multus                        multus-crksd                                 10m (0%)      0 (0%)      150Mi (1%)       0 (0%)         145m
  openshift-multus                        network-metrics-daemon-26prg                 20m (0%)      0 (0%)      120Mi (0%)       0 (0%)         151m
  openshift-network-diagnostics           network-check-target-z77nd                   10m (0%)      0 (0%)      15Mi (0%)        0 (0%)         148m
  openshift-ovn-kubernetes                ovnkube-node-n6hf5                           30m (0%)      0 (0%)      620Mi (4%)       0 (0%)         148m
  openshift-vsphere-infra                 coredns-rioliu09175612-kgg8w-rhel-1          200m (2%)     0 (0%)      400Mi (2%)       0 (0%)         84m
  openshift-vsphere-infra                 keepalived-rioliu09175612-kgg8w-rhel-1       200m (2%)     0 (0%)      400Mi (2%)       0 (0%)         84m
  openshift-vsphere-infra                 mdns-publisher-rioliu09175612-kgg8w-rhel-1   100m (1%)     0 (0%)      200Mi (1%)       0 (0%)         84m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                714m (9%)     0 (0%)
  memory             2426Mi (16%)  0 (0%)
  ephemeral-storage  0 (0%)        0 (0%)
  hugepages-1Gi      0 (0%)        0 (0%)
  hugepages-2Mi      0 (0%)        0 (0%)
Events:
  Type    Reason              Age   From     Message
  ----    ------              ---   ----     -------
  Normal  NodeNotSchedulable  85m   kubelet  Node rioliu09175612-kgg8w-rhel-1 status is now: NodeNotSchedulable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

print detail msg for co(AVAILABLE != True or PROGRESSING != False or DEGRADED != False or version != target_version) if exist:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Abnormal co details~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

~~~~~~~~
machine-config
monitoring
network
~~~~~~~~

#### Quick diagnosis: The first abnormal cluster operator is often the culprit! ####
=> Below status and logs for the different conditions of all abnormal cos are sorted by 'lastTransitionTime':
2021-07-09T10:11:18Z [network] ManagementStateDegraded False - -
2021-07-09T10:11:18Z [network] Upgradeable True - -
2021-07-09T10:14:08Z [network] Available True - -
2021-07-09T10:15:03Z [machine-config] Upgradeable True AsExpected -
2021-07-09T12:10:10Z [machine-config] -->>Available False<<-- - Cluster -->>not available<<-- for 4.7.19
2021-07-09T12:12:09Z [machine-config] -->>Progressing True<<-- - Working towards 4.7.19
2021-07-09T13:08:44Z [network] -->>Progressing True<<-- Deploying DaemonSet "openshift-multus/multus" is -->>not available<<-- (awaiting 1 nodes)
2021-07-09T13:16:21Z [monitoring] -->>Available False<<-- - -
2021-07-09T13:16:36Z [machine-config] -->>Degraded True<<-- MachineConfigDaemonFailed Unable to apply 4.7.19: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 7, updated: 7, ready: 6, unavailable: 1)
2021-07-09T13:29:43Z [monitoring] -->>Degraded True<<-- UpdatingnodeExporterFailed Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: got 1 unavailable nodes
2021-07-09T13:35:15Z [network] -->>Degraded True<<-- RolloutHung DaemonSet "openshift-multus/multus" rollout is not making progress - last change 2021-07-09T13:24:42Z
2021-07-09T14:29:45Z [monitoring] -->>Progressing True<<-- RollOutInProgress Rolling out the stack.
2021-07-09T14:29:45Z [monitoring] Upgradeable True RollOutInProgress Rollout of the monitoring stack is in progress. Please wait until it finishes.
DaemonSet "openshift-multus/network-metrics-daemon" is -->>not available<<-- (awaiting 1 nodes)
DaemonSet "openshift-network-diagnostics/network-check-target" is -->>not available<<-- (awaiting 1 nodes)
DaemonSet "openshift-network-diagnostics/network-check-target" rollout is not making progress - last change 2021-07-09T13:25:32Z
DaemonSet "openshift-ovn-kubernetes/ovnkube-node" is -->>not available<<-- (awaiting 1 nodes)
DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2021-07-09T13:24:43Z

--------------------------

Name:         machine-config
Namespace:
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2021-07-09T10:08:07Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
          f:include.release.openshift.io/self-managed-high-availability:
          f:include.release.openshift.io/single-node-developer:
      f:spec:
      f:status:
        .:
        f:relatedObjects:
        f:versions:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2021-07-09T10:08:07Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:extension:
          .:
          f:master:
          f:worker:
        f:relatedObjects:
        f:versions:
    Manager:         machine-config-operator
    Operation:       Update
    Time:            2021-07-09T13:16:36Z
  Resource Version:  124620
  Self Link:         /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:               9b848845-2913-4f1b-8348-0672b17c87c9
Spec:
Status:
  Conditions:
    Last Transition Time:  2021-07-09T12:12:09Z
    Message:               Working towards 4.7.19
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2021-07-09T13:16:36Z
    Message:               Unable to apply 4.7.19: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 7, updated: 7, ready: 6, unavailable: 1)
    Reason:                MachineConfigDaemonFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-07-09T12:10:10Z
    Message:               Cluster not available for 4.7.19
    Status:                False
    Type:                  Available
    Last Transition Time:  2021-07-09T10:15:03Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:
    Master:  all 3 nodes are at latest configuration rendered-master-e327977b6a5b999cef4cbfbb711a6e90
    Worker:  0 (ready 0) out of 4 nodes are updating to latest configuration rendered-worker-04fc66aec510739dc2cb7bc4696f80c0
  Related Objects:
    Group:
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  controllerconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  kubeletconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  containerruntimeconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigs
    Group:
    Name:
    Resource:  nodes
    Group:
    Name:      openshift-kni-infra
    Resource:  namespaces
    Group:
    Name:      openshift-openstack-infra
    Resource:  namespaces
    Group:
    Name:      openshift-ovirt-infra
    Resource:  namespaces
    Group:
    Name:      openshift-vsphere-infra
    Resource:  namespaces
  Versions:
    Name:     operator
    Version:  4.7.13
Events:  <none>
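The CI's "quick diagnosis" section above works by sorting the abnormal conditions of all unhealthy operators by lastTransitionTime, so the earliest entry usually points at the original culprit (machine-config here). A minimal sketch of the same ordering over three sample condition lines abbreviated from the output above; on a live cluster the lines would be extracted from `oc get co -o json`:

```shell
# Each line: <lastTransitionTime> <operator> <condition>. Sorting
# RFC 3339 timestamps lexicographically is also chronological order,
# so plain `sort` suffices; `head -n 1` yields the earliest failure.
cat <<'EOF' | sort | head -n 1
2021-07-09T13:08:44Z network Progressing=True
2021-07-09T12:10:10Z machine-config Available=False
2021-07-09T13:16:21Z monitoring Available=False
EOF
# -> 2021-07-09T12:10:10Z machine-config Available=False
```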
http://10.73.131.57:9000/minio/openshift-must-gather/2021-07-09-15-09-02/must-gather.local.2206304322535983685.tar.gz
Still hit the same bug when upgrading 4.7.13-x86_64 --> 4.7.0-0.nightly-2021-08-27-190811

$ oc get co --no-headers | grep -v '.True.*False.*False'
machine-config   4.7.13                              False   True    True   3h18m
monitoring       4.7.0-0.nightly-2021-08-27-190811   False   False   True   164m
network          4.7.0-0.nightly-2021-08-27-190811   True    True    True   6h2m

$ oc get no
NAME                                STATUS                        ROLES    AGE     VERSION
kewang30113609-r5nwz-master-0       Ready                         master   6h6m    v1.20.0+4593a24
kewang30113609-r5nwz-master-1       Ready                         master   6h6m    v1.20.0+4593a24
kewang30113609-r5nwz-master-2       Ready                         master   6h6m    v1.20.0+4593a24
kewang30113609-r5nwz-rhel-0         NotReady,SchedulingDisabled   worker   4h52m   v1.20.0+9689d22
kewang30113609-r5nwz-rhel-1         Ready                         worker   4h52m   v1.20.0+9689d22
kewang30113609-r5nwz-worker-5564x   Ready                         worker   5h54m   v1.20.0+df9c838
kewang30113609-r5nwz-worker-s5bqb   Ready                         worker   5h54m   v1.20.0+df9c838

...
Status:
  Conditions:
    Last Transition Time:  2021-08-30T06:34:33Z
    Message:               Working towards 4.7.0-0.nightly-2021-08-27-190811
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2021-08-30T07:26:36Z
    Message:               Unable to apply 4.7.0-0.nightly-2021-08-27-190811: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 7, updated: 7, ready: 6, unavailable: 1)
    Reason:                MachineConfigDaemonFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2021-08-30T06:32:33Z
    Message:               Cluster not available for 4.7.0-0.nightly-2021-08-27-190811
    Status:                False
    Type:                  Available
    Last Transition Time:  2021-08-30T03:49:17Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
...

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-cb85aa357520053a2b25d779d251aaee   True      False      False      3              3                   3                      0                      6h4m
worker   rendered-worker-aac5d9e7df7359f63e356a255aec21ef   False     True       False      4              0                   0                      0                      6h4m
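The filter used above, `grep -v '.True.*False.*False'`, drops healthy operators by excluding rows that read AVAILABLE=True, PROGRESSING=False, DEGRADED=False. A self-contained sketch with sample rows taken from this report (live use pipes `oc get co --no-headers` instead of the heredoc):

```shell
# Healthy rows look like: NAME VERSION True False False AGE.
# Anything else (unavailable, progressing, or degraded) survives the filter.
cat <<'EOF' | grep -v '.True.*False.*False'
dns              4.7.19   True    False   False   4h15m
machine-config   4.7.13   False   True    True    14h
monitoring       4.7.19   False   True    True    13h
network          4.7.19   True    True    True    16h
EOF
```

Only the machine-config, monitoring, and network rows remain, matching the abnormal-operator lists shown in the comments above.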
Upgrade with profile 80_IPI on AWS RHCOS & RHEL7.9 & FIPS on & OVN & Etcd Encryption & http_proxy & STS passed.

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.15    True        False         3h20m   Cluster version is 4.8.15

$ oc get clusterversion -o json | jq ".items[0].status.history"
[
  {
    "completionTime": "2021-10-14T05:37:21Z",
    "image": "quay.io/openshift-release-dev/ocp-release:4.8.15-x86_64",
    "startedTime": "2021-10-14T03:35:00Z",
    "state": "Completed",
    "verified": false,
    "version": "4.8.15"
  },
  {
    "completionTime": "2021-10-14T01:39:25Z",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:e121118406de30f9a92168808979d9363c1a576401c399bf6f528fb47c48b16c",
    "startedTime": "2021-10-14T01:04:54Z",
    "state": "Completed",
    "verified": false,
    "version": "4.7.34"
  }
]

The last successful upgrade test before this was 4.7.0-0.nightly-2021-08-25-185239 -> 4.8.0-0.nightly-2021-08-25-185034 with profile 80_IPI on AWS RHCOS & RHEL7.9 & FIPS on & OVN & Etcd Encryption & http_proxy & STS.

astoycos, I think this bug has been resolved, based on the above results.
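The history check above can be automated: after a successful upgrade chain, every entry in `.status.history` should have `state: Completed`. A crude grep-based sketch over a sample of the JSON shown above (on a live cluster you would pipe `oc get clusterversion -o json`; `jq` would be more robust, this only assumes one key per line):

```shell
# Compare the number of history entries against the number of
# completed entries; they match only if every recorded upgrade finished.
history='
  { "state": "Completed", "version": "4.8.15" },
  { "state": "Completed", "version": "4.7.34" }
'
total=$(printf '%s\n' "$history" | grep -c '"version"')
done_count=$(printf '%s\n' "$history" | grep -c '"state": "Completed"')
[ "$total" -eq "$done_count" ] && echo "all $total upgrades completed"
# -> all 2 upgrades completed
```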
@kewang Thanks so much for taking the time to reproduce :) If you see this arise again, please feel free to reopen this bug. Thanks, Andrew