Bug 1974962
| Summary: | MCO degraded during upgrade OCP with RHEL,OVN, Proxy: client connection lost | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Walid A. <wabouham> |
| Component: | Networking | Assignee: | Andrew Stoycos <astoycos> |
| Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | aconstan, cstabler, hongyli, jerzhang, jiazha, kewang, omichael, rioliu, wsun, yanyang |
| Version: | 4.8 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-10-14 14:14:23 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1976110 | | |
| Bug Blocks: | 1999867 | | |
Description
Walid A.
2021-06-22 20:20:49 UTC
I can reproduce it.
Checking the MCP status, the worker pool is degraded:
# oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-523fde35bd682b1ca95b43ea123ee039 True False False 3 3 3 0 20h
worker rendered-worker-631a04c93d1348113465fd9c4127892b False True True 5 2 2 1 20h
Checking the worker pool message, I found that node ip-10-0-48-245.us-east-2.compute.internal has an SSH access error:
# oc get mcp worker -oyaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
creationTimestamp: "2021-06-22T06:27:00Z"
generation: 3
labels:
machineconfiguration.openshift.io/mco-built-in: ""
pools.operator.machineconfiguration.openshift.io/worker: ""
managedFields:
- apiVersion: machineconfiguration.openshift.io/v1
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:labels:
.: {}
f:machineconfiguration.openshift.io/mco-built-in: {}
f:pools.operator.machineconfiguration.openshift.io/worker: {}
f:spec:
.: {}
f:configuration:
.: {}
f:source: {}
f:machineConfigSelector:
.: {}
f:matchLabels:
.: {}
f:machineconfiguration.openshift.io/role: {}
f:nodeSelector:
.: {}
f:matchLabels:
.: {}
f:node-role.kubernetes.io/worker: {}
f:paused: {}
manager: machine-config-operator
operation: Update
time: "2021-06-22T06:27:00Z"
- apiVersion: machineconfiguration.openshift.io/v1
fieldsType: FieldsV1
fieldsV1:
f:spec:
f:configuration:
f:name: {}
f:source: {}
f:status:
.: {}
f:conditions: {}
f:configuration:
.: {}
f:name: {}
f:source: {}
f:degradedMachineCount: {}
f:machineCount: {}
f:observedGeneration: {}
f:readyMachineCount: {}
f:unavailableMachineCount: {}
f:updatedMachineCount: {}
manager: machine-config-controller
operation: Update
time: "2021-06-22T06:28:38Z"
name: worker
resourceVersion: "106385"
uid: 7a455bab-c429-47e1-a4db-795d3c5b6b8c
spec:
configuration:
name: rendered-worker-884c73113dfa09e7755a2ce6ccbb7fd3
source:
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 00-worker
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 01-worker-container-runtime
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 01-worker-kubelet
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 99-worker-fips
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 99-worker-generated-registries
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 99-worker-ssh
machineConfigSelector:
matchLabels:
machineconfiguration.openshift.io/role: worker
nodeSelector:
matchLabels:
node-role.kubernetes.io/worker: ""
paused: false
status:
conditions:
- lastTransitionTime: "2021-06-22T08:44:49Z"
message: ""
reason: ""
status: "False"
type: Updated
- lastTransitionTime: "2021-06-22T08:44:49Z"
message: All nodes are updating to rendered-worker-631a04c93d1348113465fd9c4127892b
reason: ""
status: "True"
type: Updating
- lastTransitionTime: "2021-06-22T09:02:22Z"
message: ""
reason: ""
status: "False"
type: RenderDegraded
- lastTransitionTime: "2021-06-22T09:05:25Z"
message: 'Node ip-10-0-48-245.us-east-2.compute.internal is reporting: "error detecting previous SSH accesses: error: cannot apply annotation for SSH access due to: unable to update node \"&Node{ObjectMeta:{ 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},Spec:NodeSpec{PodCIDR:,DoNotUseExternalID:,ProviderID:,Unschedulable:false,Taints:[]Taint{},ConfigSource:nil,PodCIDRs:[],},Status:NodeStatus{Capacity:ResourceList{},Allocatable:ResourceList{},Phase:,Conditions:[]NodeCondition{},Addresses:[]NodeAddress{},DaemonEndpoints:NodeDaemonEndpoints{KubeletEndpoint:DaemonEndpoint{Port:0,},},NodeInfo:NodeSystemInfo{MachineID:,SystemUUID:,BootID:,KernelVersion:,OSImage:,ContainerRuntimeVersion:,KubeletVersion:,KubeProxyVersion:,OperatingSystem:,Architecture:,},Images:[]ContainerImage{},VolumesInUse:[],VolumesAttached:[]AttachedVolume{},Config:nil,},}\": unexpected error when reading response body. Please retry. Original error: http2: client connection lost"'
reason: 1 nodes are reporting degraded status on sync
status: "True"
type: NodeDegraded
- lastTransitionTime: "2021-06-22T09:05:25Z"
message: ""
reason: ""
status: "True"
type: Degraded
configuration:
name: rendered-worker-631a04c93d1348113465fd9c4127892b
source:
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 00-worker
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 01-worker-container-runtime
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 01-worker-kubelet
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 99-worker-fips
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 99-worker-generated-registries
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 99-worker-ssh
degradedMachineCount: 1
machineCount: 5
observedGeneration: 3
readyMachineCount: 2
unavailableMachineCount: 1
updatedMachineCount: 2
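To surface just the failing condition instead of scanning the whole dump by eye, a jsonpath filter over the same pool can be used (a sketch; assumes the `oc` client supports jsonpath filter expressions, which 4.x clients do):

```shell
# Print only the NodeDegraded condition message from the worker pool,
# i.e. the "http2: client connection lost" error buried in the YAML above.
oc get mcp worker \
  -o jsonpath='{.status.conditions[?(@.type=="NodeDegraded")].message}'
```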
# oc logs -n openshift-machine-config-operator pod/machine-config-daemon-7999p -c machine-config-daemon | tail
W0622 08:55:51.521339 3355 reflector.go:436] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: watch of *v1.MachineConfig ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0622 08:55:51.521370 3355 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0622 09:02:38.735300 3355 reflector.go:436] k8s.io/client-go/informers/factory.go:134: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0622 09:02:38.735343 3355 reflector.go:436] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: watch of *v1.MachineConfig ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
I0622 09:03:09.701297 3355 trace.go:205] Trace[774965466]: "Reflector ListAndWatch" name:github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101 (22-Jun-2021 09:02:39.699) (total time: 30002ms):
Trace[774965466]: [30.002098681s] [30.002098681s] END
E0622 09:03:09.701425 3355 reflector.go:138] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.MachineConfig: failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?resourceVersion=98573": dial tcp 172.30.0.1:443: i/o timeout
I0622 09:03:09.711269 3355 trace.go:205] Trace[1852186258]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:134 (22-Jun-2021 09:02:39.710) (total time: 30000ms):
Trace[1852186258]: [30.000475692s] [30.000475692s] END
E0622 09:03:09.711370 3355 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://172.30.0.1:443/api/v1/nodes?resourceVersion=98506": dial tcp 172.30.0.1:443: i/o timeout
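The dial timeouts above all point at the in-cluster API service VIP (172.30.0.1). A quick reachability probe from the affected node would confirm whether the problem is node-local (a sketch; assumes `oc debug` can still schedule a debug pod on the node):

```shell
# Probe the kubernetes service VIP the MCD watches were failing against,
# from the host network namespace of the degraded node.
oc debug node/ip-10-0-48-245.us-east-2.compute.internal -- chroot /host \
  curl -sk --max-time 10 https://172.30.0.1:443/healthz
```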
Per comment #3, QE could still reproduce it. Setting blocker? to request a re-check of whether this is a release blocker.

Looking at the provided journal logs, there are a bunch of odd errors early on in kubelet/cri-o, and a few MCD container restarts, but the node does eventually get to the part where it's having trouble: the MCD attempts the update, fails, and loops on that.

Example journal snippet:
Jun 22 14:18:57 ip-10-0-48-245.us-east-2.compute.internal root[491146]: machine-config-daemon[74888]: Starting update from rendered-worker-631a04c93d1348113465fd9c4127892b to rendered-worker-884c73113dfa09e7755a2ce6ccbb7fd3: &{osUpdate:true kargs:false fips:false passwd:false files:true units:true kernelType:false extensions:false}
Jun 22 14:18:57 ip-10-0-48-245.us-east-2.compute.internal root[491147]: machine-config-daemon[74888]: Node has been successfully cordoned
Jun 22 14:18:57 ip-10-0-48-245.us-east-2.compute.internal root[491148]: machine-config-daemon[74888]: Update prepared; beginning drain
Jun 22 14:19:54 ip-10-0-48-245.us-east-2.compute.internal crio[2654]: time="2021-06-22 14:19:54.742723030Z" level=info msg="Checking image status: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4a2f77cfc83e981b62edb84b58c6520a09b6ea11e91e635ae91b5d56cf33873a" id=69b0d967-0adb-4902-a53e-871534d71f14 name=/runtime.v1alpha2.ImageService/ImageStatus
Jun 22 14:19:54 ip-10-0-48-245.us-east-2.compute.internal crio[2654]: time="2021-06-22 14:19:54.744083065Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:b662cc564429008b0b899bcd8f7d2ea400057469f40a595bb39c41ec2c3e3b1b,RepoTags:[],RepoDigests:[quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4a2f77cfc83e981b62edb84b58c6520a09b6ea11e91e635ae91b5d56cf33873a],Size_:479281995,Uid:nil,Username:,Spec:nil,},Info:map[string]string{},}" id=69b0d967-0adb-4902-a53e-871534d71f14 name=/runtime.v1alpha2.ImageService/ImageStatus
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal dhclient[2233]: DHCPREQUEST on br-ex to 10.0.48.1 port 67 (xid=0x5ea02204)
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal dhclient[2233]: DHCPACK from 10.0.48.1 (xid=0x5ea02204)
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal NetworkManager[1357]: <info> [1624371800.1782] dhcp4 (br-ex): address 10.0.48.245
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal NetworkManager[1357]: <info> [1624371800.1782] dhcp4 (br-ex): plen 20 (255.255.240.0)
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal NetworkManager[1357]: <info> [1624371800.1782] dhcp4 (br-ex): gateway 10.0.48.1
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal NetworkManager[1357]: <info> [1624371800.1783] dhcp4 (br-ex): lease time 3600
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal NetworkManager[1357]: <info> [1624371800.1783] dhcp4 (br-ex): hostname 'ip-10-0-48-245'
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal NetworkManager[1357]: <info> [1624371800.1783] dhcp4 (br-ex): nameserver '10.0.0.2'
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal NetworkManager[1357]: <info> [1624371800.1783] dhcp4 (br-ex): domain name 'us-east-2.compute.internal'
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal NetworkManager[1357]: <info> [1624371800.1783] dhcp4 (br-ex): state changed bound -> bound
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal dbus[1104]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal dhclient[2233]: bound to 10.0.48.245 -- renewal in 1783 seconds.
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal systemd[1]: Starting Network Manager Script Dispatcher Service...
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal dbus[1104]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal systemd[1]: Started Network Manager Script Dispatcher Service.
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal nm-dispatcher[496358]: req:1 'dhcp4-change' [br-ex]: new request (6 scripts)
Jun 22 14:23:20 ip-10-0-48-245.us-east-2.compute.internal nm-dispatcher[496358]: req:1 'dhcp4-change' [br-ex]: start running ordered scripts...
Jun 22 14:24:54 ip-10-0-48-245.us-east-2.compute.internal crio[2654]: time="2021-06-22 14:24:54.748633915Z" level=info msg="Checking image status: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4a2f77cfc83e981b62edb84b58c6520a09b6ea11e91e635ae91b5d56cf33873a" id=405c4a76-b4cb-4eb3-baaf-d82263bef16f name=/runtime.v1alpha2.ImageService/ImageStatus
Jun 22 14:24:54 ip-10-0-48-245.us-east-2.compute.internal crio[2654]: time="2021-06-22 14:24:54.749859655Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:b662cc564429008b0b899bcd8f7d2ea400057469f40a595bb39c41ec2c3e3b1b,RepoTags:[],RepoDigests:[quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4a2f77cfc83e981b62edb84b58c6520a09b6ea11e91e635ae91b5d56cf33873a],Size_:479281995,Uid:nil,Username:,Spec:nil,},Info:map[string]string{},}" id=405c4a76-b4cb-4eb3-baaf-d82263bef16f name=/runtime.v1alpha2.ImageService/ImageStatus
Two things from that:
1. The crio logs seem to indicate that we are doing an image inspect. I'm not sure whether the return values look correct, but assuming they are,
2. the node immediately sees a DHCPREQUEST, which seems abnormal. Is networking on the node unstable? Is that expected?
Looking at the job profile, maybe the OVN team can help take a look. From the MCO's perspective, something lower in the networking stack is going wrong, so I'm not sure how to debug further.
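One way to judge whether that DHCP traffic is actually abnormal is to list the renewal cadence on br-ex from the node journal (a sketch; a DHCPREQUEST roughly every half lease time, ~1800 s for the 3600 s lease above, would be ordinary renewal rather than instability):

```shell
# Show recent dhclient activity on the node; steady renewals at roughly half
# the lease time are expected, while bursts of DISCOVER/REQUEST would
# suggest the interface is flapping.
oc debug node/ip-10-0-48-245.us-east-2.compute.internal -- chroot /host \
  journalctl -t dhclient --no-pager | tail -n 20
```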
This BZ does not reproduce on IPI BM.

(In reply to Ori Michaeli from comment #8)
> This BZ does not reproduce on IPI BM

Clarification: this was tested with 4.7.0-0.nightly-2021-06-26-014854, as 4.7.0-0.nightly-2021-06-20-093308 is not available anymore.

Updating the test result of upgrading from 4.7.0-0.nightly-2021-06-26-014854 --> 4.8.0-0.nightly-2021-06-29-033219. I don't see this kind of error: "timed out waiting for the condition during syncRequiredMachineConfigPools: error pool worker is not ready".
# oc get node
NAME STATUS ROLES AGE VERSION
ip-10-0-48-218.us-east-2.compute.internal Ready worker 5h26m v1.21.0-rc.0+766a5fe
ip-10-0-49-65.us-east-2.compute.internal Ready worker 4h15m v1.20.0+87cc9a4
ip-10-0-50-212.us-east-2.compute.internal Ready master 5h39m v1.21.0-rc.0+766a5fe
ip-10-0-54-58.us-east-2.compute.internal Ready worker 5h26m v1.21.0-rc.0+766a5fe
ip-10-0-59-156.us-east-2.compute.internal NotReady,SchedulingDisabled worker 4h15m v1.20.0+87cc9a4
ip-10-0-71-142.us-east-2.compute.internal Ready master 5h39m v1.21.0-rc.0+766a5fe
ip-10-0-72-138.us-east-2.compute.internal Ready master 5h40m v1.21.0-rc.0+766a5fe
ip-10-0-73-231.us-east-2.compute.internal Ready worker 5h26m v1.21.0-rc.0+766a5fe
# oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.8.0-0.nightly-2021-06-29-033219 True False False 125m
baremetal 4.8.0-0.nightly-2021-06-29-033219 True False False 5h36m
cloud-credential 4.8.0-0.nightly-2021-06-29-033219 True False False 5h36m
cluster-autoscaler 4.8.0-0.nightly-2021-06-29-033219 True False False 5h35m
config-operator 4.8.0-0.nightly-2021-06-29-033219 True False False 5h36m
console 4.8.0-0.nightly-2021-06-29-033219 True False False 127m
csi-snapshot-controller 4.8.0-0.nightly-2021-06-29-033219 True False False 5h31m
dns 4.8.0-0.nightly-2021-06-29-033219 True True False 157m
etcd 4.8.0-0.nightly-2021-06-29-033219 True False False 5h35m
image-registry 4.8.0-0.nightly-2021-06-29-033219 True False False 5h25m
ingress 4.8.0-0.nightly-2021-06-29-033219 True False False 3h1m
insights 4.8.0-0.nightly-2021-06-29-033219 True False False 5h29m
kube-apiserver 4.8.0-0.nightly-2021-06-29-033219 True False False 5h34m
kube-controller-manager 4.8.0-0.nightly-2021-06-29-033219 True False False 5h33m
kube-scheduler 4.8.0-0.nightly-2021-06-29-033219 True False False 5h33m
kube-storage-version-migrator 4.8.0-0.nightly-2021-06-29-033219 True False False 129m
machine-api 4.8.0-0.nightly-2021-06-29-033219 True False False 5h31m
machine-approver 4.8.0-0.nightly-2021-06-29-033219 True False False 5h35m
machine-config 4.7.0-0.nightly-2021-06-26-014854 False True True 138m
marketplace 4.8.0-0.nightly-2021-06-29-033219 True False False 5h35m
monitoring 4.8.0-0.nightly-2021-06-29-033219 False True True 123m
network 4.8.0-0.nightly-2021-06-29-033219 True True True 5h36m
node-tuning 4.8.0-0.nightly-2021-06-29-033219 True False False 3h
openshift-apiserver 4.8.0-0.nightly-2021-06-29-033219 True False False 125m
openshift-controller-manager 4.8.0-0.nightly-2021-06-29-033219 True False False 3h1m
openshift-samples 4.8.0-0.nightly-2021-06-29-033219 True False False 3h1m
operator-lifecycle-manager 4.8.0-0.nightly-2021-06-29-033219 True False False 5h36m
operator-lifecycle-manager-catalog 4.8.0-0.nightly-2021-06-29-033219 True False False 5h36m
operator-lifecycle-manager-packageserver 4.8.0-0.nightly-2021-06-29-033219 True False False 145m
service-ca 4.8.0-0.nightly-2021-06-29-033219 True False False 5h36m
storage 4.8.0-0.nightly-2021-06-29-033219 True True False 137m
# oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-165f673837a072be255e8838537b90c6 True False False 3 3 3 0 5h37m
worker rendered-worker-6e0b285ba602438a84b3750e65ea3697 False True False 5 3 3 0 5h37m
# oc describe co machine-config
Name: machine-config
Namespace:
Labels: <none>
Annotations: exclude.release.openshift.io/internal-openshift-hosted: true
include.release.openshift.io/self-managed-high-availability: true
include.release.openshift.io/single-node-developer: true
API Version: config.openshift.io/v1
Kind: ClusterOperator
Metadata:
Creation Timestamp: 2021-06-30T02:56:33Z
Generation: 1
Managed Fields:
API Version: config.openshift.io/v1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.:
f:exclude.release.openshift.io/internal-openshift-hosted:
f:include.release.openshift.io/self-managed-high-availability:
f:include.release.openshift.io/single-node-developer:
f:spec:
f:status:
.:
f:relatedObjects:
f:versions:
Manager: cluster-version-operator
Operation: Update
Time: 2021-06-30T02:56:33Z
API Version: config.openshift.io/v1
Fields Type: FieldsV1
fieldsV1:
f:status:
f:conditions:
f:extension:
.:
f:master:
f:worker:
f:relatedObjects:
f:versions:
Manager: machine-config-operator
Operation: Update
Time: 2021-06-30T06:31:07Z
Resource Version: 178512
UID: 29895b49-21dd-4dce-8622-1d8c77fefe55
Spec:
Status:
Conditions:
Last Transition Time: 2021-06-30T06:12:03Z
Message: Working towards 4.8.0-0.nightly-2021-06-29-033219
Status: True
Type: Progressing
Last Transition Time: 2021-06-30T06:16:56Z
Message: One or more machine config pools are updating, please see `oc get mcp` for further details
Reason: PoolUpdating
Status: False
Type: Upgradeable
Last Transition Time: 2021-06-30T06:43:35Z
Message: Unable to apply 4.8.0-0.nightly-2021-06-29-033219: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 8, updated: 8, ready: 7, unavailable: 1)
Reason: MachineConfigDaemonFailed
Status: True
Type: Degraded
Last Transition Time: 2021-06-30T06:26:55Z
Message: Cluster not available for 4.8.0-0.nightly-2021-06-29-033219
Status: False
Type: Available
Extension:
Master: all 3 nodes are at latest configuration rendered-master-165f673837a072be255e8838537b90c6
Worker: 3 (ready 3) out of 5 nodes are updating to latest configuration rendered-worker-63fe24505cf686e519c81db70ec61ca0
Related Objects:
Group:
Name: openshift-machine-config-operator
Resource: namespaces
Group: machineconfiguration.openshift.io
Name:
Resource: machineconfigpools
Group: machineconfiguration.openshift.io
Name:
Resource: controllerconfigs
Group: machineconfiguration.openshift.io
Name:
Resource: kubeletconfigs
Group: machineconfiguration.openshift.io
Name:
Resource: containerruntimeconfigs
Group: machineconfiguration.openshift.io
Name:
Resource: machineconfigs
Group:
Name:
Resource: nodes
Group:
Name: openshift-kni-infra
Resource: namespaces
Group:
Name: openshift-openstack-infra
Resource: namespaces
Group:
Name: openshift-ovirt-infra
Resource: namespaces
Group:
Name: openshift-vsphere-infra
Resource: namespaces
Versions:
Name: operator
Version: 4.7.0-0.nightly-2021-06-26-014854
Events: <none>
# oc describe node ip-10-0-59-156.us-east-2.compute.internal
Name: ip-10-0-59-156.us-east-2.compute.internal
Roles: worker
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=m4.xlarge
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=us-east-2
failure-domain.beta.kubernetes.io/zone=us-east-2a
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-10-0-59-156.us-east-2.compute.internal
kubernetes.io/os=linux
node-role.kubernetes.io/worker=
node.kubernetes.io/instance-type=m4.xlarge
node.openshift.io/os_id=rhel
topology.ebs.csi.aws.com/zone=us-east-2a
topology.kubernetes.io/region=us-east-2
topology.kubernetes.io/zone=us-east-2a
Annotations: csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0f79156d446d6eaaf"}
k8s.ovn.org/host-addresses: ["10.0.59.156"]
k8s.ovn.org/l3-gateway-config:
{"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-59-156.us-east-2.compute.internal","mac-address":"02:62:91:75:7d:44","ip-address...
k8s.ovn.org/node-chassis-id: afae5859-452c-46c0-805c-50531fff6957
k8s.ovn.org/node-local-nat-ip: {"default":["169.254.1.48"]}
k8s.ovn.org/node-mgmt-port-mac-address: e6:22:87:9d:74:ca
k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.59.156/20"}
k8s.ovn.org/node-subnets: {"default":"10.130.2.0/23"}
machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
machineconfiguration.openshift.io/currentConfig: rendered-worker-6e0b285ba602438a84b3750e65ea3697
machineconfiguration.openshift.io/desiredConfig: rendered-worker-63fe24505cf686e519c81db70ec61ca0
machineconfiguration.openshift.io/ssh: accessed
machineconfiguration.openshift.io/state: Working
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Wed, 30 Jun 2021 00:28:49 -0400
Taints: node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unreachable:NoSchedule
node.kubernetes.io/unschedulable:NoSchedule
Unschedulable: true
Lease:
HolderIdentity: ip-10-0-59-156.us-east-2.compute.internal
AcquireTime: <unset>
RenewTime: Wed, 30 Jun 2021 02:32:06 -0400
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure Unknown Wed, 30 Jun 2021 02:28:16 -0400 Wed, 30 Jun 2021 02:32:51 -0400 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Wed, 30 Jun 2021 02:28:16 -0400 Wed, 30 Jun 2021 02:32:51 -0400 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Wed, 30 Jun 2021 02:28:16 -0400 Wed, 30 Jun 2021 02:32:51 -0400 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Wed, 30 Jun 2021 02:28:16 -0400 Wed, 30 Jun 2021 02:32:51 -0400 NodeStatusUnknown Kubelet stopped posting node status.
Addresses:
InternalIP: 10.0.59.156
Hostname: ip-10-0-59-156.us-east-2.compute.internal
InternalDNS: ip-10-0-59-156.us-east-2.compute.internal
Capacity:
attachable-volumes-aws-ebs: 39
cpu: 4
ephemeral-storage: 31444972Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 16264956Ki
pods: 250
Allocatable:
attachable-volumes-aws-ebs: 39
cpu: 3500m
ephemeral-storage: 27905944324
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 15113980Ki
pods: 250
System Info:
Machine ID: d14ea99a3d754419bdfdfa01ab77bf07
System UUID: EC20084D-9DCB-C128-04CE-35C2042522E3
Boot ID: deb7c00f-4ed0-4973-926e-a7c62936964c
Kernel Version: 3.10.0-1160.31.1.el7.x86_64
OS Image: Red Hat Enterprise Linux Server 7.9 (Maipo)
Operating System: linux
Architecture: amd64
Container Runtime Version: cri-o://1.20.3-6.rhaos4.7.git0d0f863.el7
Kubelet Version: v1.20.0+87cc9a4
Kube-Proxy Version: v1.20.0+87cc9a4
ProviderID: aws:///us-east-2a/i-0f79156d446d6eaaf
Non-terminated Pods: (13 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
openshift-cluster-csi-drivers aws-ebs-csi-driver-node-6kgh4 30m (0%) 0 (0%) 150Mi (1%) 0 (0%) 3h8m
openshift-cluster-node-tuning-operator tuned-txllf 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 3h9m
openshift-dns dns-default-724lv 60m (1%) 0 (0%) 110Mi (0%) 0 (0%) 165m
openshift-dns node-resolver-ldj8r 5m (0%) 0 (0%) 21Mi (0%) 0 (0%) 167m
openshift-image-registry node-ca-qjn99 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 3h10m
openshift-ingress-canary ingress-canary-t7vmj 10m (0%) 0 (0%) 20Mi (0%) 0 (0%) 3h9m
openshift-machine-config-operator machine-config-daemon-fs2st 40m (1%) 0 (0%) 100Mi (0%) 0 (0%) 161m
openshift-monitoring node-exporter-fq6qw 9m (0%) 0 (0%) 47Mi (0%) 0 (0%) 3h10m
openshift-multus multus-additional-cni-plugins-gx6th 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 3h4m
openshift-multus multus-jb6cx 10m (0%) 0 (0%) 65Mi (0%) 0 (0%) 179m
openshift-multus network-metrics-daemon-kk5vg 20m (0%) 0 (0%) 120Mi (0%) 0 (0%) 3h4m
openshift-network-diagnostics network-check-target-7lq5z 10m (0%) 0 (0%) 15Mi (0%) 0 (0%) 3h3m
openshift-ovn-kubernetes ovnkube-node-fj8hz 40m (1%) 0 (0%) 640Mi (4%) 0 (0%) 3h2m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 264m (7%) 0 (0%)
memory 1358Mi (9%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NodeNotSchedulable 143m kubelet Node ip-10-0-59-156.us-east-2.compute.internal status is now: NodeNotSchedulable
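Since the kubelet has stopped posting status, the API side can't say much more about this node; the remaining option is to collect state on the host itself, e.g. (a sketch; the SSH user and key depend on how the RHEL workers were provisioned):

```shell
# Gather kubelet/cri-o state directly from the NotReady RHEL worker.
ssh <user>@10.0.59.156 \
  'sudo systemctl --no-pager status kubelet crio; \
   sudo journalctl -u kubelet --since "1 hour ago" --no-pager | tail -n 50'
```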
The kubeconfig file to access the cluster can be found in [1] if you'd like.
[1] https://mastern-jenkins-csb-openshift-qe.apps.ocp4.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/27508/artifact/workdir/install-dir/auth/kubeconfig
The cluster inspect you gave (unfortunately the cluster appears to have been torn down) shows that one of the workers never successfully rejoined the cluster. Unfortunately, this seems to be yet another different error case. We may need to SSH into the system next time we see a "NotReady,SchedulingDisabled" node, since the API is unable to access the node, which may point to kubelet failures.

There is another proxy + RHEL7 bug ongoing: https://bugzilla.redhat.com/show_bug.cgi?id=1975907. Maybe we can wait until both other RHEL bugs are settled before we look into this again? Sorry for the repeated trouble, but given the instability of RHEL7 right now it's hard to gauge what is going on.

Since there are no logs for the RHEL upgrade job, I'm not sure if this bug is a duplicate of bug 1975907, but based on the must-gather logs it seems they are not. Based on comments 2 and 6, transferring it to the OVN team to have a look.

Based on comment 13, I removed the `UpgradeBlocker` keyword.

Hit the same bug when upgrading from 4.7.13-x86_64 to 4.7.19-x86_64 with profile 50_IPI on vSphere 7.0 with RHCOS & RHEL7.9 & FIPS on & OVN & Etcd Encryption on:
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.7.13 True True 15h Unable to apply 4.7.19: the cluster operator monitoring has not yet successfully rolled out
$ oc get no
NAME STATUS ROLES AGE VERSION
kewang06173513-zdhhx-master-0 Ready master 16h v1.20.0+87cc9a4
kewang06173513-zdhhx-master-1 Ready master 16h v1.20.0+87cc9a4
kewang06173513-zdhhx-master-2 Ready master 16h v1.20.0+87cc9a4
kewang06173513-zdhhx-rhel-0 NotReady,SchedulingDisabled worker 15h v1.20.0+87cc9a4
kewang06173513-zdhhx-rhel-1 Ready worker 15h v1.20.0+87cc9a4
kewang06173513-zdhhx-worker-5m4xf Ready worker 16h v1.20.0+df9c838
kewang06173513-zdhhx-worker-hhqgw Ready worker 16h v1.20.0+df9c838
$ oc get co --no-headers | grep -v '.True.*False.*False'
machine-config 4.7.13 False True True 14h
monitoring 4.7.19 False True True 13h
network 4.7.19 True True True 16h
Upgrade Job console output from Jenkins:
...
07-06 14:13:56.229 version 4.7.13 True True 125m Working towards 4.7.19: 7 of 669 done (1% complete)
07-06 14:13:56.229
07-06 14:13:56.229
07-06 14:13:56.229 [Debug] upgrade_ret_1=1,upgrade_ret_2=1
07-06 14:13:56.229
07-06 14:14:54.160 Unable to connect to the server: Service Unavailable
07-06 14:14:54.160
07-06 14:14:54.160
07-06 14:14:54.160
07-06 14:14:54.160 Unable to connect to the server: Service Unavailable
07-06 14:14:54.160 [Debug] upgrade_ret_1=1,upgrade_ret_2=1
...
Checked the mco operator, which was the first to report the error:
...
07-06 15:13:38.873 Status:
07-06 15:13:38.873 Conditions:
07-06 15:13:38.873 Last Transition Time: 2021-07-06T13:08:14Z
07-06 15:13:38.873 Message: Working towards 4.7.19
07-06 15:13:38.873 Status: True
07-06 15:13:38.873 Type: Progressing
07-06 15:13:38.873 Last Transition Time: 2021-07-06T14:12:48Z
07-06 15:13:38.873 Message: Unable to apply 4.7.19: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 7, updated: 7, ready: 6, unavailable: 1)
07-06 15:13:38.873 Reason: MachineConfigDaemonFailed
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-a2e8bef20b3410f5d7765df456714e66 True False False 3 3 3 0 19h
worker rendered-worker-2990fdac5c6746d7e6bb90f86292fbad False True False 4 0 0 0 19h
Hit the same bug when triaging an upgrade CI run:
original_build=4.7.13-x86_64 target_build=4.7.19-x86_64
profile_name=50_IPI on vSphere 7.0 with RHCOS & RHEL7.9 & FIPS on & OVN & Etcd Encryption on
[2021-07-09T14:28:57.626Z] [Debug] upgrade_ret_1=1,upgrade_ret_2=1
[2021-07-09T14:28:57.626Z]
[2021-07-09T14:29:56.570Z] NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
[2021-07-09T14:29:56.570Z] version 4.7.13 True True 3h2m Unable to apply 4.7.19: the cluster operator monitoring has not yet successfully rolled out
[2021-07-09T14:29:56.570Z]
[2021-07-09T14:29:56.570Z]
[2021-07-09T14:29:57.156Z] [Debug] upgrade_ret_1=1,upgrade_ret_2=1
[2021-07-09T14:29:57.156Z]
[2021-07-09T14:31:05.151Z] **************Post Action after upgrade fail****************
[2021-07-09T14:31:05.151Z]
[2021-07-09T14:31:05.151Z] Post action: #oc get node: NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
[2021-07-09T14:31:05.151Z] rioliu09175612-kgg8w-master-0 Ready master 4h20m v1.20.0+87cc9a4 172.31.249.87 172.31.249.87 Red Hat Enterprise Linux CoreOS 47.83.202106252242-0 (Ootpa) 4.18.0-240.22.1.el8_3.x86_64 cri-o://1.20.3-6.rhaos4.7.git0d0f863.el8
[2021-07-09T14:31:05.151Z] rioliu09175612-kgg8w-master-1 Ready master 4h20m v1.20.0+87cc9a4 172.31.249.110 172.31.249.110 Red Hat Enterprise Linux CoreOS 47.83.202106252242-0 (Ootpa) 4.18.0-240.22.1.el8_3.x86_64 cri-o://1.20.3-6.rhaos4.7.git0d0f863.el8
[2021-07-09T14:31:05.152Z] rioliu09175612-kgg8w-master-2 Ready master 4h20m v1.20.0+87cc9a4 172.31.249.7 172.31.249.7 Red Hat Enterprise Linux CoreOS 47.83.202106252242-0 (Ootpa) 4.18.0-240.22.1.el8_3.x86_64 cri-o://1.20.3-6.rhaos4.7.git0d0f863.el8
[2021-07-09T14:31:05.152Z] rioliu09175612-kgg8w-rhel-0 Ready worker 3h6m v1.20.0+bd7b30d 172.31.249.247 172.31.249.247 Red Hat Enterprise Linux Server 7.9 (Maipo) 3.10.0-1160.31.1.el7.x86_64 cri-o://1.20.3-7.rhaos4.7.git41925ef.el7
[2021-07-09T14:31:05.152Z] rioliu09175612-kgg8w-rhel-1 NotReady,SchedulingDisabled worker 3h6m v1.20.0+bd7b30d 172.31.249.227 172.31.249.227 Red Hat Enterprise Linux Server 7.9 (Maipo) 3.10.0-1160.31.1.el7.x86_64 cri-o://1.20.3-7.rhaos4.7.git41925ef.el7
[2021-07-09T14:31:05.152Z] rioliu09175612-kgg8w-worker-ptksq Ready worker 4h8m v1.20.0+df9c838 172.31.249.215 172.31.249.215 Red Hat Enterprise Linux CoreOS 47.83.202105220305-0 (Ootpa) 4.18.0-240.22.1.el8_3.x86_64 cri-o://1.20.2-12.rhaos4.7.git9f7be76.el8
[2021-07-09T14:31:05.152Z] rioliu09175612-kgg8w-worker-qfkr9 Ready worker 4h8m v1.20.0+df9c838 172.31.249.231 172.31.249.231 Red Hat Enterprise Linux CoreOS 47.83.202105220305-0 (Ootpa) 4.18.0-240.22.1.el8_3.x86_64 cri-o://1.20.2-12.rhaos4.7.git9f7be76.el8
[2021-07-09T14:31:05.152Z]
[2021-07-09T14:31:05.152Z]
[2021-07-09T14:31:05.152Z] Post action: #oc get co:NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
[2021-07-09T14:31:05.152Z] authentication 4.7.19 True False False 66m
[2021-07-09T14:31:05.152Z] baremetal 4.7.19 True False False 4h17m
[2021-07-09T14:31:05.152Z] cloud-credential 4.7.19 True False False 4h20m
[2021-07-09T14:31:05.152Z] cluster-autoscaler 4.7.19 True False False 4h15m
[2021-07-09T14:31:05.152Z] config-operator 4.7.19 True False False 4h17m
[2021-07-09T14:31:05.152Z] console 4.7.19 True False False 73m
[2021-07-09T14:31:05.152Z] csi-snapshot-controller 4.7.19 True False False 65m
[2021-07-09T14:31:05.152Z] dns 4.7.19 True False False 4h15m
[2021-07-09T14:31:05.152Z] etcd 4.7.19 True False False 4h15m
[2021-07-09T14:31:05.152Z] image-registry 4.7.19 True False False 3h20m
[2021-07-09T14:31:05.152Z] ingress 4.7.19 True False False 4h6m
[2021-07-09T14:31:05.152Z] insights 4.7.19 True False False 4h10m
[2021-07-09T14:31:05.152Z] kube-apiserver 4.7.19 True False False 4h13m
[2021-07-09T14:31:05.152Z] kube-controller-manager 4.7.19 True False False 4h14m
[2021-07-09T14:31:05.152Z] kube-scheduler 4.7.19 True False False 4h14m
[2021-07-09T14:31:05.152Z] kube-storage-version-migrator 4.7.19 True False False 68m
[2021-07-09T14:31:05.152Z] machine-api 4.7.19 True False False 4h13m
[2021-07-09T14:31:05.152Z] machine-approver 4.7.19 True False False 4h16m
[2021-07-09T14:31:05.152Z] machine-config 4.7.13 False True True 140m
[2021-07-09T14:31:05.152Z] marketplace 4.7.19 True False False 77m
[2021-07-09T14:31:05.152Z] monitoring 4.7.19 False True True 74m
[2021-07-09T14:31:05.152Z] network 4.7.19 True True True 4h16m
[2021-07-09T14:31:05.152Z] node-tuning 4.7.19 True False False 153m
[2021-07-09T14:31:05.152Z] openshift-apiserver 4.7.19 True False False 66m
[2021-07-09T14:31:05.152Z] openshift-controller-manager 4.7.19 True False False 151m
[2021-07-09T14:31:05.152Z] openshift-samples 4.7.19 True False False 153m
[2021-07-09T14:31:05.152Z] operator-lifecycle-manager 4.7.19 True False False 4h16m
[2021-07-09T14:31:05.152Z] operator-lifecycle-manager-catalog 4.7.19 True False False 4h16m
[2021-07-09T14:31:05.152Z] operator-lifecycle-manager-packageserver 4.7.19 True False False 68m
[2021-07-09T14:31:05.152Z] service-ca 4.7.19 True False False 4h17m
[2021-07-09T14:31:05.152Z] storage 4.7.19 True False False 73m
[2021-07-09T14:31:05.152Z] print detail msg for node(SchedulingDisabled) if exist:
[2021-07-09T14:31:05.152Z]
[2021-07-09T14:32:13.333Z] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Abnormal node details~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[2021-07-09T14:32:13.333Z]
[2021-07-09T14:32:13.333Z]
[2021-07-09T14:32:13.333Z] Name: rioliu09175612-kgg8w-rhel-1
[2021-07-09T14:32:13.333Z] Roles: worker
[2021-07-09T14:32:13.333Z] Labels: beta.kubernetes.io/arch=amd64
[2021-07-09T14:32:13.333Z] beta.kubernetes.io/os=linux
[2021-07-09T14:32:13.333Z] kubernetes.io/arch=amd64
[2021-07-09T14:32:13.333Z] kubernetes.io/hostname=rioliu09175612-kgg8w-rhel-1
[2021-07-09T14:32:13.333Z] kubernetes.io/os=linux
[2021-07-09T14:32:13.333Z] node-role.kubernetes.io/worker=
[2021-07-09T14:32:13.333Z] node.openshift.io/os_id=rhel
[2021-07-09T14:32:13.333Z] Annotations: k8s.ovn.org/l3-gateway-config:
[2021-07-09T14:32:13.333Z] {"default":{"mode":"shared","interface-id":"br-ex_rioliu09175612-kgg8w-rhel-1","mac-address":"00:50:56:ac:9a:f7","ip-addresses":["172.31.2...
[2021-07-09T14:32:13.333Z] k8s.ovn.org/node-chassis-id: 8bc13dc6-e668-49b1-b12b-9ba16655aac1
[2021-07-09T14:32:13.333Z] k8s.ovn.org/node-local-nat-ip: {"default":["169.254.15.54"]}
[2021-07-09T14:32:13.333Z] k8s.ovn.org/node-mgmt-port-mac-address: 92:30:54:8d:79:42
[2021-07-09T14:32:13.333Z] k8s.ovn.org/node-primary-ifaddr: {"ipv4":"172.31.249.227/23"}
[2021-07-09T14:32:13.333Z] k8s.ovn.org/node-subnets: {"default":"10.129.2.0/23"}
[2021-07-09T14:32:13.333Z] machineconfiguration.openshift.io/currentConfig: rendered-worker-a557d8bd4e42e3b87302ed6fef0c7626
[2021-07-09T14:32:13.333Z] machineconfiguration.openshift.io/desiredConfig: rendered-worker-04fc66aec510739dc2cb7bc4696f80c0
[2021-07-09T14:32:13.333Z] machineconfiguration.openshift.io/ssh: accessed
[2021-07-09T14:32:13.333Z] machineconfiguration.openshift.io/state: Working
[2021-07-09T14:32:13.333Z] volumes.kubernetes.io/controller-managed-attach-detach: true
[2021-07-09T14:32:13.333Z] CreationTimestamp: Fri, 09 Jul 2021 11:24:43 +0000
[2021-07-09T14:32:13.333Z] Taints: node.kubernetes.io/unreachable:NoExecute
[2021-07-09T14:32:13.333Z] node.kubernetes.io/unreachable:NoSchedule
[2021-07-09T14:32:13.333Z] node.kubernetes.io/unschedulable:NoSchedule
[2021-07-09T14:32:13.333Z] Unschedulable: true
[2021-07-09T14:32:13.333Z] Lease:
[2021-07-09T14:32:13.333Z] HolderIdentity: rioliu09175612-kgg8w-rhel-1
[2021-07-09T14:32:13.333Z] AcquireTime: <unset>
[2021-07-09T14:32:13.333Z] RenewTime: Fri, 09 Jul 2021 13:08:04 +0000
[2021-07-09T14:32:13.333Z] Conditions:
[2021-07-09T14:32:13.333Z] Type Status LastHeartbeatTime LastTransitionTime Reason Message
[2021-07-09T14:32:13.333Z] ---- ------ ----------------- ------------------ ------ -------
[2021-07-09T14:32:13.333Z] MemoryPressure Unknown Fri, 09 Jul 2021 13:08:03 +0000 Fri, 09 Jul 2021 13:08:44 +0000 NodeStatusUnknown Kubelet stopped posting node status.
[2021-07-09T14:32:13.333Z] DiskPressure Unknown Fri, 09 Jul 2021 13:08:03 +0000 Fri, 09 Jul 2021 13:08:44 +0000 NodeStatusUnknown Kubelet stopped posting node status.
[2021-07-09T14:32:13.333Z] PIDPressure Unknown Fri, 09 Jul 2021 13:08:03 +0000 Fri, 09 Jul 2021 13:08:44 +0000 NodeStatusUnknown Kubelet stopped posting node status.
[2021-07-09T14:32:13.333Z] Ready Unknown Fri, 09 Jul 2021 13:08:03 +0000 Fri, 09 Jul 2021 13:08:44 +0000 NodeStatusUnknown Kubelet stopped posting node status.
[2021-07-09T14:32:13.333Z] Addresses:
[2021-07-09T14:32:13.333Z] ExternalIP: 172.31.249.227
[2021-07-09T14:32:13.333Z] InternalIP: 172.31.249.227
[2021-07-09T14:32:13.333Z] Hostname: rioliu09175612-kgg8w-rhel-1
[2021-07-09T14:32:13.333Z] Capacity:
[2021-07-09T14:32:13.333Z] cpu: 8
[2021-07-09T14:32:13.333Z] ephemeral-storage: 19066Mi
[2021-07-09T14:32:13.333Z] hugepages-1Gi: 0
[2021-07-09T14:32:13.333Z] hugepages-2Mi: 0
[2021-07-09T14:32:13.333Z] memory: 16265896Ki
[2021-07-09T14:32:13.333Z] pods: 250
[2021-07-09T14:32:13.333Z] Allocatable:
[2021-07-09T14:32:13.333Z] cpu: 7500m
[2021-07-09T14:32:13.333Z] ephemeral-storage: 16919193161
[2021-07-09T14:32:13.333Z] hugepages-1Gi: 0
[2021-07-09T14:32:13.333Z] hugepages-2Mi: 0
[2021-07-09T14:32:13.333Z] memory: 15114920Ki
[2021-07-09T14:32:13.333Z] pods: 250
[2021-07-09T14:32:13.333Z] System Info:
[2021-07-09T14:32:13.333Z] Machine ID: 68b4570c8c2f40dfa8c34a353d7aa935
[2021-07-09T14:32:13.333Z] System UUID: 97AA2C42-B450-4CA9-CC9E-4675C05C9D09
[2021-07-09T14:32:13.333Z] Boot ID: 46a37500-282c-4f56-a934-841dd0bfc02c
[2021-07-09T14:32:13.333Z] Kernel Version: 3.10.0-1160.31.1.el7.x86_64
[2021-07-09T14:32:13.333Z] OS Image: Red Hat Enterprise Linux Server 7.9 (Maipo)
[2021-07-09T14:32:13.333Z] Operating System: linux
[2021-07-09T14:32:13.333Z] Architecture: amd64
[2021-07-09T14:32:13.333Z] Container Runtime Version: cri-o://1.20.3-7.rhaos4.7.git41925ef.el7
[2021-07-09T14:32:13.333Z] Kubelet Version: v1.20.0+bd7b30d
[2021-07-09T14:32:13.333Z] Kube-Proxy Version: v1.20.0+bd7b30d
[2021-07-09T14:32:13.333Z] ProviderID: vsphere://422caa97-50b4-a94c-cc9e-4675c05c9d09
[2021-07-09T14:32:13.333Z] Non-terminated Pods: (13 in total)
[2021-07-09T14:32:13.333Z] Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
[2021-07-09T14:32:13.333Z] --------- ---- ------------ ---------- --------------- ------------- ---
[2021-07-09T14:32:13.333Z] openshift-cluster-node-tuning-operator tuned-6bhjp 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 154m
[2021-07-09T14:32:13.333Z] openshift-dns dns-default-bwzjf 65m (0%) 0 (0%) 131Mi (0%) 0 (0%) 141m
[2021-07-09T14:32:13.333Z] openshift-image-registry node-ca-lh9br 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 153m
[2021-07-09T14:32:13.333Z] openshift-ingress-canary ingress-canary-tzfts 10m (0%) 0 (0%) 20Mi (0%) 0 (0%) 154m
[2021-07-09T14:32:13.334Z] openshift-machine-config-operator machine-config-daemon-gb94s 40m (0%) 0 (0%) 100Mi (0%) 0 (0%) 119m
[2021-07-09T14:32:13.334Z] openshift-monitoring node-exporter-hllwn 9m (0%) 0 (0%) 210Mi (1%) 0 (0%) 154m
[2021-07-09T14:32:13.334Z] openshift-multus multus-crksd 10m (0%) 0 (0%) 150Mi (1%) 0 (0%) 145m
[2021-07-09T14:32:13.334Z] openshift-multus network-metrics-daemon-26prg 20m (0%) 0 (0%) 120Mi (0%) 0 (0%) 151m
[2021-07-09T14:32:13.334Z] openshift-network-diagnostics network-check-target-z77nd 10m (0%) 0 (0%) 15Mi (0%) 0 (0%) 148m
[2021-07-09T14:32:13.334Z] openshift-ovn-kubernetes ovnkube-node-n6hf5 30m (0%) 0 (0%) 620Mi (4%) 0 (0%) 148m
[2021-07-09T14:32:13.334Z] openshift-vsphere-infra coredns-rioliu09175612-kgg8w-rhel-1 200m (2%) 0 (0%) 400Mi (2%) 0 (0%) 84m
[2021-07-09T14:32:13.334Z] openshift-vsphere-infra keepalived-rioliu09175612-kgg8w-rhel-1 200m (2%) 0 (0%) 400Mi (2%) 0 (0%) 84m
[2021-07-09T14:32:13.334Z] openshift-vsphere-infra mdns-publisher-rioliu09175612-kgg8w-rhel-1 100m (1%) 0 (0%) 200Mi (1%) 0 (0%) 84m
[2021-07-09T14:32:13.334Z] Allocated resources:
[2021-07-09T14:32:13.334Z] (Total limits may be over 100 percent, i.e., overcommitted.)
[2021-07-09T14:32:13.334Z] Resource Requests Limits
[2021-07-09T14:32:13.334Z] -------- -------- ------
[2021-07-09T14:32:13.334Z] cpu 714m (9%) 0 (0%)
[2021-07-09T14:32:13.334Z] memory 2426Mi (16%) 0 (0%)
[2021-07-09T14:32:13.334Z] ephemeral-storage 0 (0%) 0 (0%)
[2021-07-09T14:32:13.334Z] hugepages-1Gi 0 (0%) 0 (0%)
[2021-07-09T14:32:13.334Z] hugepages-2Mi 0 (0%) 0 (0%)
[2021-07-09T14:32:13.334Z] Events:
[2021-07-09T14:32:13.334Z] Type Reason Age From Message
[2021-07-09T14:32:13.334Z] ---- ------ ---- ---- -------
[2021-07-09T14:32:13.334Z] Normal NodeNotSchedulable 85m kubelet Node rioliu09175612-kgg8w-rhel-1 status is now: NodeNotSchedulable
[2021-07-09T14:32:14.331Z] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[2021-07-09T14:32:14.331Z]
[2021-07-09T14:32:14.331Z]
[2021-07-09T14:32:14.331Z] print detail msg for co(AVAILABLE != True or PROGRESSING!=False or DEGRADED!=False or version != target_version) if exist:
[2021-07-09T14:32:14.331Z]
[2021-07-09T14:32:14.331Z] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Abnormal co details==~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[2021-07-09T14:32:14.331Z]
[2021-07-09T14:32:14.331Z]
[2021-07-09T14:32:14.975Z] ~~~~~~~~
[2021-07-09T14:32:14.975Z] machine-config
[2021-07-09T14:32:14.975Z] monitoring
[2021-07-09T14:32:14.975Z] network
[2021-07-09T14:32:14.975Z] ~~~~~~~~
[2021-07-09T14:32:14.975Z]
[2021-07-09T14:32:15.931Z] #### Quick diagnosis: The first abnormal cluster operator is often the culprit! ####
[2021-07-09T14:32:15.931Z] => Below status and logs for the different conditions of all abnormal cos are sorted by 'lastTransitionTime':
[2021-07-09T14:32:15.931Z] 2021-07-09T10:11:18Z [network] ManagementStateDegraded False - -
[2021-07-09T14:32:15.931Z] 2021-07-09T10:11:18Z [network] Upgradeable True - -
[2021-07-09T14:32:15.931Z] 2021-07-09T10:14:08Z [network] Available True - -
[2021-07-09T14:32:15.931Z] 2021-07-09T10:15:03Z [machine-config] Upgradeable True AsExpected -
[2021-07-09T14:32:15.931Z] 2021-07-09T12:10:10Z [machine-config] -->>Available False<<-- - Cluster -->>not available<<-- for 4.7.19
[2021-07-09T14:32:15.931Z] 2021-07-09T12:12:09Z [machine-config] -->>Progressing True<<-- - Working towards 4.7.19
[2021-07-09T14:32:15.931Z] 2021-07-09T13:08:44Z [network] -->>Progressing True<<-- Deploying DaemonSet "openshift-multus/multus" is -->>not available<<-- (awaiting 1 nodes)
[2021-07-09T14:32:15.931Z] 2021-07-09T13:16:21Z [monitoring] -->>Available False<<-- - -
[2021-07-09T14:32:15.931Z] 2021-07-09T13:16:36Z [machine-config] -->>Degraded True<<-- MachineConfigDaemonFailed Unable to apply 4.7.19: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 7, updated: 7, ready: 6, unavailable: 1)
[2021-07-09T14:32:15.931Z] 2021-07-09T13:29:43Z [monitoring] -->>Degraded True<<-- UpdatingnodeExporterFailed Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of openshift-monitoring/node-exporter: got 1 unavailable nodes
[2021-07-09T14:32:15.931Z] 2021-07-09T13:35:15Z [network] -->>Degraded True<<-- RolloutHung DaemonSet "openshift-multus/multus" rollout is not making progress - last change 2021-07-09T13:24:42Z
[2021-07-09T14:32:15.931Z] 2021-07-09T14:29:45Z [monitoring] -->>Progressing True<<-- RollOutInProgress Rolling out the stack.
[2021-07-09T14:32:15.931Z] 2021-07-09T14:29:45Z [monitoring] Upgradeable True RollOutInProgress Rollout of the monitoring stack is in progress. Please wait until it finishes.
[2021-07-09T14:32:15.931Z] DaemonSet "openshift-multus/network-metrics-daemon" is -->>not available<<-- (awaiting 1 nodes)
[2021-07-09T14:32:15.931Z] DaemonSet "openshift-network-diagnostics/network-check-target" is -->>not available<<-- (awaiting 1 nodes)
[2021-07-09T14:32:15.931Z] DaemonSet "openshift-network-diagnostics/network-check-target" rollout is not making progress - last change 2021-07-09T13:25:32Z
[2021-07-09T14:32:15.931Z] DaemonSet "openshift-ovn-kubernetes/ovnkube-node" is -->>not available<<-- (awaiting 1 nodes)
[2021-07-09T14:32:15.931Z] DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2021-07-09T13:24:43Z
[2021-07-09T14:32:15.931Z]
[2021-07-09T14:32:15.931Z] --------------------------
[2021-07-09T14:32:15.931Z]
[2021-07-09T14:32:17.875Z] Name: machine-config
[2021-07-09T14:32:17.875Z] Namespace:
[2021-07-09T14:32:17.875Z] Labels: <none>
[2021-07-09T14:32:17.875Z] Annotations: exclude.release.openshift.io/internal-openshift-hosted: true
[2021-07-09T14:32:17.875Z] include.release.openshift.io/self-managed-high-availability: true
[2021-07-09T14:32:17.875Z] include.release.openshift.io/single-node-developer: true
[2021-07-09T14:32:17.875Z] API Version: config.openshift.io/v1
[2021-07-09T14:32:17.875Z] Kind: ClusterOperator
[2021-07-09T14:32:17.875Z] Metadata:
[2021-07-09T14:32:17.875Z] Creation Timestamp: 2021-07-09T10:08:07Z
[2021-07-09T14:32:17.876Z] Generation: 1
[2021-07-09T14:32:17.876Z] Managed Fields:
[2021-07-09T14:32:17.876Z] API Version: config.openshift.io/v1
[2021-07-09T14:32:17.876Z] Fields Type: FieldsV1
[2021-07-09T14:32:17.876Z] fieldsV1:
[2021-07-09T14:32:17.876Z] f:metadata:
[2021-07-09T14:32:17.876Z] f:annotations:
[2021-07-09T14:32:17.876Z] .:
[2021-07-09T14:32:17.876Z] f:exclude.release.openshift.io/internal-openshift-hosted:
[2021-07-09T14:32:17.876Z] f:include.release.openshift.io/self-managed-high-availability:
[2021-07-09T14:32:17.876Z] f:include.release.openshift.io/single-node-developer:
[2021-07-09T14:32:17.876Z] f:spec:
[2021-07-09T14:32:17.876Z] f:status:
[2021-07-09T14:32:17.876Z] .:
[2021-07-09T14:32:17.876Z] f:relatedObjects:
[2021-07-09T14:32:17.876Z] f:versions:
[2021-07-09T14:32:17.876Z] Manager: cluster-version-operator
[2021-07-09T14:32:17.876Z] Operation: Update
[2021-07-09T14:32:17.876Z] Time: 2021-07-09T10:08:07Z
[2021-07-09T14:32:17.876Z] API Version: config.openshift.io/v1
[2021-07-09T14:32:17.876Z] Fields Type: FieldsV1
[2021-07-09T14:32:17.876Z] fieldsV1:
[2021-07-09T14:32:17.876Z] f:status:
[2021-07-09T14:32:17.876Z] f:conditions:
[2021-07-09T14:32:17.876Z] f:extension:
[2021-07-09T14:32:17.876Z] .:
[2021-07-09T14:32:17.876Z] f:master:
[2021-07-09T14:32:17.876Z] f:worker:
[2021-07-09T14:32:17.876Z] f:relatedObjects:
[2021-07-09T14:32:17.876Z] f:versions:
[2021-07-09T14:32:17.876Z] Manager: machine-config-operator
[2021-07-09T14:32:17.876Z] Operation: Update
[2021-07-09T14:32:17.876Z] Time: 2021-07-09T13:16:36Z
[2021-07-09T14:32:17.876Z] Resource Version: 124620
[2021-07-09T14:32:17.876Z] Self Link: /apis/config.openshift.io/v1/clusteroperators/machine-config
[2021-07-09T14:32:17.876Z] UID: 9b848845-2913-4f1b-8348-0672b17c87c9
[2021-07-09T14:32:17.876Z] Spec:
[2021-07-09T14:32:17.876Z] Status:
[2021-07-09T14:32:17.876Z] Conditions:
[2021-07-09T14:32:17.876Z] Last Transition Time: 2021-07-09T12:12:09Z
[2021-07-09T14:32:17.876Z] Message: Working towards 4.7.19
[2021-07-09T14:32:17.876Z] Status: True
[2021-07-09T14:32:17.876Z] Type: Progressing
[2021-07-09T14:32:17.876Z] Last Transition Time: 2021-07-09T13:16:36Z
[2021-07-09T14:32:17.876Z] Message: Unable to apply 4.7.19: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 7, updated: 7, ready: 6, unavailable: 1)
[2021-07-09T14:32:17.876Z] Reason: MachineConfigDaemonFailed
[2021-07-09T14:32:17.876Z] Status: True
[2021-07-09T14:32:17.876Z] Type: Degraded
[2021-07-09T14:32:17.876Z] Last Transition Time: 2021-07-09T12:10:10Z
[2021-07-09T14:32:17.876Z] Message: Cluster not available for 4.7.19
[2021-07-09T14:32:17.876Z] Status: False
[2021-07-09T14:32:17.876Z] Type: Available
[2021-07-09T14:32:17.876Z] Last Transition Time: 2021-07-09T10:15:03Z
[2021-07-09T14:32:17.876Z] Reason: AsExpected
[2021-07-09T14:32:17.876Z] Status: True
[2021-07-09T14:32:17.876Z] Type: Upgradeable
[2021-07-09T14:32:17.876Z] Extension:
[2021-07-09T14:32:17.876Z] Master: all 3 nodes are at latest configuration rendered-master-e327977b6a5b999cef4cbfbb711a6e90
[2021-07-09T14:32:17.876Z] Worker: 0 (ready 0) out of 4 nodes are updating to latest configuration rendered-worker-04fc66aec510739dc2cb7bc4696f80c0
[2021-07-09T14:32:17.876Z] Related Objects:
[2021-07-09T14:32:17.876Z] Group:
[2021-07-09T14:32:17.876Z] Name: openshift-machine-config-operator
[2021-07-09T14:32:17.876Z] Resource: namespaces
[2021-07-09T14:32:17.876Z] Group: machineconfiguration.openshift.io
[2021-07-09T14:32:17.876Z] Name:
[2021-07-09T14:32:17.876Z] Resource: machineconfigpools
[2021-07-09T14:32:17.876Z] Group: machineconfiguration.openshift.io
[2021-07-09T14:32:17.876Z] Name:
[2021-07-09T14:32:17.876Z] Resource: controllerconfigs
[2021-07-09T14:32:17.876Z] Group: machineconfiguration.openshift.io
[2021-07-09T14:32:17.876Z] Name:
[2021-07-09T14:32:17.876Z] Resource: kubeletconfigs
[2021-07-09T14:32:17.876Z] Group: machineconfiguration.openshift.io
[2021-07-09T14:32:17.876Z] Name:
[2021-07-09T14:32:17.876Z] Resource: containerruntimeconfigs
[2021-07-09T14:32:17.876Z] Group: machineconfiguration.openshift.io
[2021-07-09T14:32:17.876Z] Name:
[2021-07-09T14:32:17.876Z] Resource: machineconfigs
[2021-07-09T14:32:17.876Z] Group:
[2021-07-09T14:32:17.876Z] Name:
[2021-07-09T14:32:17.876Z] Resource: nodes
[2021-07-09T14:32:17.876Z] Group:
[2021-07-09T14:32:17.876Z] Name: openshift-kni-infra
[2021-07-09T14:32:17.876Z] Resource: namespaces
[2021-07-09T14:32:17.876Z] Group:
[2021-07-09T14:32:17.876Z] Name: openshift-openstack-infra
[2021-07-09T14:32:17.876Z] Resource: namespaces
[2021-07-09T14:32:17.876Z] Group:
[2021-07-09T14:32:17.876Z] Name: openshift-ovirt-infra
[2021-07-09T14:32:17.876Z] Resource: namespaces
[2021-07-09T14:32:17.876Z] Group:
[2021-07-09T14:32:17.876Z] Name: openshift-vsphere-infra
[2021-07-09T14:32:17.876Z] Resource: namespaces
[2021-07-09T14:32:17.876Z] Versions:
[2021-07-09T14:32:17.876Z] Name: operator
[2021-07-09T14:32:17.876Z] Version: 4.7.13
[2021-07-09T14:32:17.876Z] Events: <none>
[2021-07-09T14:32:17.876Z]
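For anyone triaging a similar failure, the Degraded reason can be pulled out of a saved `oc describe co machine-config` dump with plain grep. A minimal sketch, run here against a canned excerpt mirroring the conditions above (not a live cluster):

```shell
# Canned excerpt of the Degraded condition block from the describe output above.
cat > /tmp/mco-co.txt <<'EOF'
  Last Transition Time:  2021-07-09T13:16:36Z
  Message:               Unable to apply 4.7.19: timed out waiting for the condition
  Reason:                MachineConfigDaemonFailed
  Status:                True
  Type:                  Degraded
EOF
# Grab the 4 lines before the "Type: Degraded" line, then keep the Reason line.
grep -B4 'Type:.*Degraded' /tmp/mco-co.txt | grep 'Reason:'
```

On a live cluster the same pipeline would read from `oc describe co machine-config` instead of the canned file.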
Still hit the same bug when upgrading 4.7.13-x86_64 --> 4.7.0-0.nightly-2021-08-27-190811:
$ oc get co --no-headers | grep -v '.True.*False.*False'
machine-config 4.7.13 False True True 3h18m
monitoring 4.7.0-0.nightly-2021-08-27-190811 False False True 164m
network 4.7.0-0.nightly-2021-08-27-190811 True True True 6h2m
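For reference, the `grep -v '.True.*False.*False'` filter above drops every healthy operator row (AVAILABLE=True, PROGRESSING=False, DEGRADED=False) and keeps only abnormal ones. A minimal sketch of the same filter against canned sample rows (not live cluster output):

```shell
# Sample `oc get co --no-headers` rows: one healthy, two abnormal.
cat > /tmp/co-sample.txt <<'EOF'
authentication   4.7.19   True    False   False   66m
machine-config   4.7.13   False   True    True    140m
monitoring       4.7.19   False   False   True    74m
EOF
# Drop rows matching Available=True, Progressing=False, Degraded=False.
grep -v '.True.*False.*False' /tmp/co-sample.txt
```

Only the machine-config and monitoring rows survive the filter; the healthy authentication row is dropped.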
$ oc get no
NAME STATUS ROLES AGE VERSION
kewang30113609-r5nwz-master-0 Ready master 6h6m v1.20.0+4593a24
kewang30113609-r5nwz-master-1 Ready master 6h6m v1.20.0+4593a24
kewang30113609-r5nwz-master-2 Ready master 6h6m v1.20.0+4593a24
kewang30113609-r5nwz-rhel-0 NotReady,SchedulingDisabled worker 4h52m v1.20.0+9689d22
kewang30113609-r5nwz-rhel-1 Ready worker 4h52m v1.20.0+9689d22
kewang30113609-r5nwz-worker-5564x Ready worker 5h54m v1.20.0+df9c838
kewang30113609-r5nwz-worker-s5bqb Ready worker 5h54m v1.20.0+df9c838
...
Status:
Conditions:
Last Transition Time: 2021-08-30T06:34:33Z
Message: Working towards 4.7.0-0.nightly-2021-08-27-190811
Status: True
Type: Progressing
Last Transition Time: 2021-08-30T07:26:36Z
Message: Unable to apply 4.7.0-0.nightly-2021-08-27-190811: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon is not ready. status: (desired: 7, updated: 7, ready: 6, unavailable: 1)
Reason: MachineConfigDaemonFailed
Status: True
Type: Degraded
Last Transition Time: 2021-08-30T06:32:33Z
Message: Cluster not available for 4.7.0-0.nightly-2021-08-27-190811
Status: False
Type: Available
Last Transition Time: 2021-08-30T03:49:17Z
Reason: AsExpected
Status: True
Type: Upgradeable
...
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-cb85aa357520053a2b25d779d251aaee True False False 3 3 3 0 6h4m
worker rendered-worker-aac5d9e7df7359f63e356a255aec21ef False True False 4 0 0 0 6h4m
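The worker pool above shows UPDATED=False with 0 of 4 machines updated, which is the stuck state. A small sketch for flagging any pool that is not fully updated, run against canned `oc get mcp --no-headers` rows mirroring the table above:

```shell
# Canned rows: NAME, CONFIG, UPDATED, UPDATING, DEGRADED, counts, AGE.
cat > /tmp/mcp.txt <<'EOF'
master   rendered-master-cb85aa357520053a2b25d779d251aaee   True    False   False   3   3   3   0   6h4m
worker   rendered-worker-aac5d9e7df7359f63e356a255aec21ef   False   True    False   4   0   0   0   6h4m
EOF
# Print the name of every pool whose UPDATED column (field 3) is not True.
awk '$3 != "True" {print $1}' /tmp/mcp.txt
```

On a live cluster the awk filter would be fed from `oc get mcp --no-headers` instead of the canned file.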
Upgrading with profile 80_IPI on AWS RHCOS & RHEL7.9 & FIPS on & OVN & Etcd Encryption & http_proxy & STS passed:
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.8.15 True False 3h20m Cluster version is 4.8.15
$ oc get clusterversion -o json|jq ".items[0].status.history"
[
{
"completionTime": "2021-10-14T05:37:21Z",
"image": "quay.io/openshift-release-dev/ocp-release:4.8.15-x86_64",
"startedTime": "2021-10-14T03:35:00Z",
"state": "Completed",
"verified": false,
"version": "4.8.15"
},
{
"completionTime": "2021-10-14T01:39:25Z",
"image": "quay.io/openshift-release-dev/ocp-release@sha256:e121118406de30f9a92168808979d9363c1a576401c399bf6f528fb47c48b16c",
"startedTime": "2021-10-14T01:04:54Z",
"state": "Completed",
"verified": false,
"version": "4.7.34"
}
]
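A quick way to confirm the upgrade landed is to check that the newest history entry reports the target version as Completed. A sketch using jq (as in the command above) against a canned JSON dump shaped like the output shown, rather than a live `oc get clusterversion -o json`:

```shell
# Canned clusterversion dump with the same history shape as above.
cat > /tmp/cv.json <<'EOF'
{"items":[{"status":{"history":[
  {"state":"Completed","version":"4.8.15"},
  {"state":"Completed","version":"4.7.34"}
]}}]}
EOF
# history[0] is the most recent entry; print its version and state.
jq -r '.items[0].status.history[0] | "\(.version) \(.state)"' /tmp/cv.json
```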
The last successful upgrade tests:
4.7.0-0.nightly-2021-08-25-185239 -> 4.8.0-0.nightly-2021-08-25-185034 with profile 80_IPI on AWS RHCOS & RHEL7.9 & FIPS on & OVN & Etcd Encryption & http_proxy & STS
astoycos, I think this bug is resolved, based on the results above.
@kewang Thanks so much for taking the time to reproduce :). If you see this arise again, please feel free to reopen this bug. Thanks, Andrew