Description of problem:
Upgrading a UPI/bare metal cluster from v4.4 to v4.5 failed.

NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.6     True        True          26m     Unable to apply 4.5.0-0.nightly-2020-06-01-081609: the cluster operator machine-config has not yet successfully rolled out
...
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.6     True        True          118m    Unable to apply 4.5.0-0.nightly-2020-06-01-081609: the cluster operator openshift-apiserver is degraded

One of the nodes (the first one the machine-config operator updated when applying the machineconfig) went into SchedulingDisabled status due to "unexpected on-disk state validating against rendered-master-cb46ac55ab43168a2cd273b9db70f4db".

# oc get node
NAME                                  STATUS                     ROLES           AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
ugdci01210623-gkgdr-control-plane-0   Ready                      master,worker   3h17m   v1.17.1   10.0.99.122   <none>        Red Hat Enterprise Linux CoreOS 44.81.202005250830-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.17.4-12.dev.rhaos4.4.git2be4d9c.el8
ugdci01210623-gkgdr-control-plane-1   Ready                      master,worker   3h17m   v1.17.1   10.0.99.141   <none>        Red Hat Enterprise Linux CoreOS 44.81.202005250830-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.17.4-12.dev.rhaos4.4.git2be4d9c.el8
ugdci01210623-gkgdr-control-plane-2   Ready,SchedulingDisabled   master,worker   3h18m   v1.17.1   10.0.99.101   <none>        Red Hat Enterprise Linux CoreOS 44.81.202005250830-0 (Ootpa)   4.18.0-147.8.1.el8_1.x86_64   cri-o://1.17.4-12.dev.rhaos4.4.git2be4d9c.el8

# ./oc describe co machine-config
...
Status:
  Conditions:
    Last Transition Time:  2020-06-01T15:05:48Z
    Message:               Cluster not available for 4.5.0-0.nightly-2020-06-01-081609
    Status:                False
    Type:                  Available
    Last Transition Time:  2020-06-01T14:53:03Z
    Message:               Working towards 4.5.0-0.nightly-2020-06-01-081609
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-06-01T15:05:48Z
    Message:               Unable to apply 4.5.0-0.nightly-2020-06-01-081609: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-cb46ac55ab43168a2cd273b9db70f4db expected 293f78b64d86f9f0491f6baa991e3f0c8fe1b046 has 8af4f709c4ba9c0afff3408ecc99c8fce61dd314, retrying
    Reason:                RequiredPoolsFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-06-01T13:26:10Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
...

The SchedulingDisabled node info:

Name:               ugdci01210623-gkgdr-control-plane-2
Roles:              master,worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ugdci01210623-gkgdr-control-plane-2
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/master=
                    node-role.kubernetes.io/worker=
                    node.openshift.io/os_id=rhcos
Annotations:        machineconfiguration.openshift.io/currentConfig: rendered-master-cb46ac55ab43168a2cd273b9db70f4db
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-475ccecc469d7f1a6ecc77487a545501
                    machineconfiguration.openshift.io/reason: unexpected on-disk state validating against rendered-master-cb46ac55ab43168a2cd273b9db70f4db
                    machineconfiguration.openshift.io/state: Degraded
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 01 Jun 2020 09:17:35 -0400
Taints:             node.kubernetes.io/unschedulable:NoSchedule
Unschedulable:      true
Lease:
  HolderIdentity:  ugdci01210623-gkgdr-control-plane-2
  AcquireTime:     <unset>
  RenewTime:       Mon, 01 Jun 2020 12:35:35 -0400
Conditions:
  Type            Status  LastHeartbeatTime                 LastTransitionTime                Reason                      Message
  ----            ------  -----------------                 ------------------                ------                      -------
  MemoryPressure  False   Mon, 01 Jun 2020 12:32:21 -0400   Mon, 01 Jun 2020 11:02:04 -0400   KubeletHasSufficientMemory  kubelet has sufficient memory available
  DiskPressure    False   Mon, 01 Jun 2020 12:32:21 -0400   Mon, 01 Jun 2020 11:02:04 -0400   KubeletHasNoDiskPressure    kubelet has no disk pressure
  PIDPressure     False   Mon, 01 Jun 2020 12:32:21 -0400   Mon, 01 Jun 2020 11:02:04 -0400   KubeletHasSufficientPID     kubelet has sufficient PID available
  Ready           True    Mon, 01 Jun 2020 12:32:21 -0400   Mon, 01 Jun 2020 11:02:14 -0400   KubeletReady                kubelet is posting ready status
Addresses:
  InternalIP:  10.0.99.101
  Hostname:    ugdci01210623-gkgdr-control-plane-2
Capacity:
  cpu:                8
  ephemeral-storage:  83334124Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16420564Ki
  pods:               250
Allocatable:
  cpu:                7500m
  ephemeral-storage:  75726986728
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             15269588Ki
  pods:               250
System Info:
  Machine ID:                 d8ead127ad884ca1974be63103dad0c4
  System UUID:                d8ead127-ad88-4ca1-974b-e63103dad0c4
  Boot ID:                    459dfedb-dce7-494e-9248-4d2518813a83
  Kernel Version:             4.18.0-147.8.1.el8_1.x86_64
  OS Image:                   Red Hat Enterprise Linux CoreOS 44.81.202005250830-0 (Ootpa)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.17.4-12.dev.rhaos4.4.git2be4d9c.el8
  Kubelet Version:            v1.17.1
  Kube-Proxy Version:         v1.17.1
PodCIDR:   10.128.2.0/24
PodCIDRs:  10.128.2.0/24
Non-terminated Pods: (20 in total)
  Namespace                                     Name                                                          CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                                     ----                                                          ------------  ----------  ---------------  -------------  ---
  node-upgrade                                  hello-daemonset-8mnr9                                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         126m
  openshift-cluster-node-tuning-operator        tuned-rqgg5                                                   10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         111m
  openshift-controller-manager                  controller-manager-7dxhm                                      100m (1%)     0 (0%)      100Mi (0%)       0 (0%)         110m
  openshift-dns                                 dns-default-49q4j                                             65m (0%)      0 (0%)      110Mi (0%)       512Mi (3%)     103m
  openshift-etcd                                etcd-ugdci01210623-gkgdr-control-plane-2                      430m (5%)     0 (0%)      860Mi (5%)       0 (0%)         118m
  openshift-image-registry                      node-ca-7hfpg                                                 10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         111m
  openshift-kube-apiserver                      kube-apiserver-ugdci01210623-gkgdr-control-plane-2            330m (4%)     0 (0%)      1174Mi (7%)      0 (0%)         111m
  openshift-kube-controller-manager             kube-controller-manager-ugdci01210623-gkgdr-control-plane-2   100m (1%)     0 (0%)      500Mi (3%)       0 (0%)         111m
  openshift-kube-scheduler                      openshift-kube-scheduler-ugdci01210623-gkgdr-control-plane-2  20m (0%)      0 (0%)      100Mi (0%)       0 (0%)         112m
  openshift-machine-config-operator             machine-config-daemon-sxkxl                                   40m (0%)      0 (0%)      100Mi (0%)       0 (0%)         102m
  openshift-machine-config-operator             machine-config-server-k8s4h                                   20m (0%)      0 (0%)      50Mi (0%)        0 (0%)         99m
  openshift-monitoring                          node-exporter-2c7bm                                           9m (0%)       0 (0%)      210Mi (1%)       0 (0%)         112m
  openshift-multus                              multus-admission-controller-q725b                             20m (0%)      0 (0%)      20Mi (0%)        0 (0%)         108m
  openshift-multus                              multus-vtpzp                                                  10m (0%)      0 (0%)      150Mi (1%)       0 (0%)         105m
  openshift-sdn                                 ovs-8wl6f                                                     100m (1%)     0 (0%)      400Mi (2%)       0 (0%)         108m
  openshift-sdn                                 sdn-5dj9s                                                     100m (1%)     0 (0%)      200Mi (1%)       0 (0%)         108m
  openshift-sdn                                 sdn-controller-fhxxn                                          10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         108m
  openshift-service-catalog-apiserver           apiserver-rgq5x                                               0 (0%)        0 (0%)      200Mi (1%)       0 (0%)         179m
  openshift-service-catalog-controller-manager  controller-manager-47k85                                      100m (1%)     0 (0%)      100Mi (0%)       0 (0%)         179m
  ui-upgrade                                    hello-daemonset-fb7v4                                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         128m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                1474m (19%)   0 (0%)
  memory             4384Mi (29%)  512Mi (3%)
  ephemeral-storage  0 (0%)        0 (0%)
  hugepages-1Gi      0 (0%)        0 (0%)
  hugepages-2Mi      0 (0%)        0 (0%)
Events:
  Type     Reason                   Age                    From                                          Message
  ----     ------                   ----                   ----                                          -------
  Normal   NodeHasSufficientMemory  3h18m (x8 over 3h18m)  kubelet, ugdci01210623-gkgdr-control-plane-2  Node ugdci01210623-gkgdr-control-plane-2 status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    3h18m (x8 over 3h18m)  kubelet, ugdci01210623-gkgdr-control-plane-2  Node ugdci01210623-gkgdr-control-plane-2 status is now: NodeHasNoDiskPressure
  Normal   NodeReady                3h13m                  kubelet, ugdci01210623-gkgdr-control-plane-2  Node ugdci01210623-gkgdr-control-plane-2 status is now: NodeReady
  Normal   Starting                 3h9m                   kubelet, ugdci01210623-gkgdr-control-plane-2  Starting kubelet.
  Normal   NodeAllocatableEnforced  3h9m                   kubelet, ugdci01210623-gkgdr-control-plane-2  Updated Node Allocatable limit across pods
  Normal   NodeHasSufficientMemory  3h9m (x2 over 3h9m)    kubelet, ugdci01210623-gkgdr-control-plane-2  Node ugdci01210623-gkgdr-control-plane-2 status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    3h9m (x2 over 3h9m)    kubelet, ugdci01210623-gkgdr-control-plane-2  Node ugdci01210623-gkgdr-control-plane-2 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     3h9m (x2 over 3h9m)    kubelet, ugdci01210623-gkgdr-control-plane-2  Node ugdci01210623-gkgdr-control-plane-2 status is now: NodeHasSufficientPID
  Warning  Rebooted                 3h9m                   kubelet, ugdci01210623-gkgdr-control-plane-2  Node ugdci01210623-gkgdr-control-plane-2 has been rebooted, boot id: ad866e97-bd4f-42c5-a0d7-8ccd1f7ab847
  Normal   NodeReady                3h9m                   kubelet, ugdci01210623-gkgdr-control-plane-2  Node ugdci01210623-gkgdr-control-plane-2 status is now: NodeReady
  Normal   NodeNotSchedulable       99m                    kubelet, ugdci01210623-gkgdr-control-plane-2  Node ugdci01210623-gkgdr-control-plane-2 status is now: NodeNotSchedulable
  Normal   Starting                 93m                    kubelet, ugdci01210623-gkgdr-control-plane-2  Starting kubelet.
  Normal   NodeHasSufficientMemory  93m (x2 over 93m)      kubelet, ugdci01210623-gkgdr-control-plane-2  Node ugdci01210623-gkgdr-control-plane-2 status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    93m (x2 over 93m)      kubelet, ugdci01210623-gkgdr-control-plane-2  Node ugdci01210623-gkgdr-control-plane-2 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     93m (x2 over 93m)      kubelet, ugdci01210623-gkgdr-control-plane-2  Node ugdci01210623-gkgdr-control-plane-2 status is now: NodeHasSufficientPID
  Warning  Rebooted                 93m                    kubelet, ugdci01210623-gkgdr-control-plane-2  Node ugdci01210623-gkgdr-control-plane-2 has been rebooted, boot id: 459dfedb-dce7-494e-9248-4d2518813a83
  Normal   NodeAllocatableEnforced  93m                    kubelet, ugdci01210623-gkgdr-control-plane-2  Updated Node Allocatable limit across pods
  Normal   NodeNotReady             93m                    kubelet, ugdci01210623-gkgdr-control-plane-2  Node ugdci01210623-gkgdr-control-plane-2 status is now: NodeNotReady
  Normal   NodeNotSchedulable       93m                    kubelet, ugdci01210623-gkgdr-control-plane-2  Node ugdci01210623-gkgdr-control-plane-2 status is now: NodeNotSchedulable
  Normal   NodeReady                93m                    kubelet, ugdci01210623-gkgdr-control-plane-2  Node ugdci01210623-gkgdr-control-plane-2 status is now: NodeReady

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Version-Release number of selected component (if applicable):
v4.4.6 to v4.5.0-0.nightly-2020-06-01-081609

How reproducible:
always

Steps to Reproduce:
1. Upgrade a UPI/bare metal cluster with the following command:
   ./oc adm upgrade --to-image=registry.svc.ci.openshift.org/ocp/release:4.5.0-0.nightly-2020-06-01-081609 --force=true --allow-explicit-upgrade=true

Actual results:
Upgrade failed.

Expected results:
Upgrade succeeds.

Additional info:
QE caught the issue in a CI test against an all-in-one cluster (3 nodes that are both masters and workers). We could later reproduce it on a normal cluster with 3 master nodes + 2 worker nodes.
During the reproduction on the 3 masters + 2 workers cluster, both one master and one worker went into SchedulingDisabled status.

# ./oc get node
NAME                                 STATUS                     ROLES    AGE   VERSION
jliu-ci2945a-tt5ww-compute-0         Ready,SchedulingDisabled   worker   15h   v1.17.1
jliu-ci2945a-tt5ww-compute-1         Ready                      worker   15h   v1.17.1
jliu-ci2945a-tt5ww-control-plane-0   Ready,SchedulingDisabled   master   15h   v1.17.1
jliu-ci2945a-tt5ww-control-plane-1   Ready                      master   15h   v1.17.1
jliu-ci2945a-tt5ww-control-plane-2   Ready                      master   15h   v1.17.1

# ./oc describe mcp | grep Message
Message:
Message:
Message:  All nodes are updating to rendered-master-1d6adae106548d9f2190b3985eb3e3d0
Message:  Node jliu-ci2945a-tt5ww-control-plane-0 is reporting: "unexpected on-disk state validating against rendered-master-4490f79d9acb502b82ac97eb27f2df40"
Message:
Message:
Message:
Message:  All nodes are updating to rendered-worker-e22f30af8021f3b72185c8d7ddaf089e
Message:  Node jliu-ci2945a-tt5ww-compute-0 is reporting: "unexpected on-disk state validating against rendered-worker-1a594bc02fe11be6f9e8b538565c6ea9"
Message:
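The `oc describe mcp | grep Message` output above can be narrowed to just the degraded-node reports. A small helper (a sketch; it assumes the message format shown in this report) extracts each failing node and the rendered config it could not validate against:

```shell
# extract_degraded: reads "oc describe mcp" output on stdin and prints
# one "<node> <rendered-config>" pair per "unexpected on-disk state" message.
extract_degraded() {
  sed -n 's/.*Node \([^ ]*\) is reporting: "unexpected on-disk state validating against \([^"]*\)".*/\1 \2/p'
}
```

Usage against a live cluster would be `oc describe mcp | extract_degraded`.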
Are there any pending CSRs?
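For reference, one way to script that check (a generic sketch, assuming the usual `oc get csr` column layout with CONDITION last):

```shell
# pending_csrs: reads "oc get csr" output on stdin and prints the names
# of requests whose CONDITION column is "Pending".
pending_csrs() {
  awk 'NR > 1 && $NF == "Pending" {print $1}'
}
```

Against a live cluster: `oc get csr | pending_csrs`, then `oc adm certificate approve <name>` for any that should be approved.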
(In reply to Ryan Phillips from comment #2)
> Are there any pending CSRs?

No. I also hit the same issue here.

[root@preserve-jialiu-ansible ~]# oc describe co machine-config
Name:         machine-config
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-06-03T12:09:35Z
  Generation:          1
  Resource Version:    64602
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:                 a73ea77e-d63a-4653-bc4b-2dd654b9614b
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-06-03T13:36:12Z
    Message:               Cluster not available for 4.5.0-0.nightly-2020-06-03-013823
    Status:                False
    Type:                  Available
    Last Transition Time:  2020-06-03T13:22:41Z
    Message:               Working towards 4.5.0-0.nightly-2020-06-03-013823
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-06-03T13:36:12Z
    Message:               Unable to apply 4.5.0-0.nightly-2020-06-03-013823: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-8e0e11b397529df2a61694e0a8e66f81 expected 293f78b64d86f9f0491f6baa991e3f0c8fe1b046 has 8af4f709c4ba9c0afff3408ecc99c8fce61dd314, retrying
    Reason:                RequiredPoolsFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-06-03T12:17:16Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:
  Related Objects:
    Group:
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:      master
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      worker
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      machine-config-controller
    Resource:  controllerconfigs
  Versions:
    Name:     operator
    Version:  4.4.6
Events:  <none>

From QE downstream CI history, another upgrade on AWS passed.
quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-9734e6d304b88b243df5a1720570bf63ab684b2317755e0ed00de0aa32b05ce8/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-sxkxl/machine-config-daemon/machine-config-daemon/logs/current.log

2020-06-01T16:35:54.217898171Z I0601 16:35:54.217585    2387 daemon.go:768] Current config: rendered-master-cb46ac55ab43168a2cd273b9db70f4db
2020-06-01T16:35:54.217898171Z I0601 16:35:54.217647    2387 daemon.go:769] Desired config: rendered-master-475ccecc469d7f1a6ecc77487a545501
2020-06-01T16:35:54.217898171Z I0601 16:35:54.217653    2387 daemon.go:783] Pending config: rendered-master-cb46ac55ab43168a2cd273b9db70f4db
2020-06-01T16:35:54.2267214Z I0601 16:35:54.226653    2387 daemon.go:992] Validating against pending config rendered-master-cb46ac55ab43168a2cd273b9db70f4db
2020-06-01T16:35:54.5174904Z E0601 16:35:54.517375    2387 daemon.go:1378] could not stat file: "/etc/crio/crio.conf", error: lstat /etc/crio/crio.conf: no such file or directory
2020-06-01T16:35:54.5174904Z E0601 16:35:54.517460    2387 writer.go:135] Marking Degraded due to: unexpected on-disk state validating against rendered-master-cb46ac55ab43168a2cd273b9db70f4db

/etc/crio/crio.conf is missing
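The log above shows the daemon degrading the node because `lstat /etc/crio/crio.conf` failed. A quick way to spot-check a node for this symptom (a sketch; the file list passed in is an assumption based on this one log, not the full set the rendered config manages) is:

```shell
# report_missing: prints each path from its arguments that does not exist
# on disk, mirroring the daemon's "could not stat file" check.
report_missing() {
  for f in "$@"; do
    [ -e "$f" ] || echo "missing: $f"
  done
}
```

On the affected node this could be run as `report_missing /etc/crio/crio.conf`.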
I am unable to reproduce this with AWS. Can I please get access to a reproducer environment before the upgrade is run?
This is likely similar to https://bugzilla.redhat.com/show_bug.cgi?id=1842906; the main difference is the payload date (06/01 vs. 06/02). Let's keep both open while we investigate both clusters.
(In reply to Urvashi Mohnani from comment #5)
> I am unable to reproduce this with aws. Can I please get access to a
> reproducer environment before the upgrade is run?

We did not hit the issue on AWS either; we hit it on UPI/bare metal during CI testing. I will set up a fresh cluster matching the reproduced cluster above, leave it un-upgraded, and share it with you.
Issue is also reproduced on UPI/vSphere:

$ oc describe co machine-config
Name:         machine-config
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-06-08T01:16:27Z
  Generation:          1
  Resource Version:    171901
  Self Link:           /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:                 bbca3610-843d-4541-b331-c3d25f238f05
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-06-08T06:55:14Z
    Message:               Cluster not available for 4.5.0-0.nightly-2020-06-08-031520
    Status:                False
    Type:                  Available
    Last Transition Time:  2020-06-08T06:38:38Z
    Message:               Working towards 4.5.0-0.nightly-2020-06-08-031520
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2020-06-08T06:55:14Z
    Message:               Unable to apply 4.5.0-0.nightly-2020-06-08-031520: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-47719088b0ef9d158c3ec4883d58396e expected 3341f6f8cd92f44df9398570c1601bb8349879e3 has e3f4e2596eaf47a0081a4df04607eec9acd88e05, retrying
    Reason:                RequiredPoolsFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-06-08T01:24:40Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:
  Related Objects:
    Group:
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:      master
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      worker
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:      machine-config-controller
    Resource:  controllerconfigs
  Versions:
    Name:     operator
    Version:  4.4.0-0.nightly-2020-06-01-211921
Events:  <none>