Description of problem:

Installed disconnected OCP with 4.5.0-0.nightly-2020-09-27-230429 and upgraded to 4.6.0-0.nightly-2020-09-28-171716. The upgrade is stuck at "Working towards 4.6.0-0.nightly-2020-09-28-171716: 11% complete".

```
# ./oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-09-27-230429   True        True          9h      Working towards 4.6.0-0.nightly-2020-09-28-171716: 11% complete
```

One master node is in the "Ready,SchedulingDisabled" state:

```
# ./oc get nodes
NAME                          STATUS                     ROLES    AGE   VERSION
jima4529-mpmvv-master-0       Ready                      master   10h   v1.19.0+fff8183
jima4529-mpmvv-master-1       Ready,SchedulingDisabled   master   10h   v1.18.3+970c1b3
jima4529-mpmvv-master-2       Ready                      master   10h   v1.18.3+970c1b3
jima4529-mpmvv-worker-gwpgk   Ready                      worker   10h   v1.19.0+fff8183
jima4529-mpmvv-worker-zt4lq   Ready                      worker   10h   v1.19.0+fff8183
```

The master machine config pool is degraded:

```
# ./oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-4972ec85a2023290770f5f4ae8c6cdba   False     True       True       3              1                   1                      1                      10h
worker   rendered-worker-a3186aae6c8f5175f7d2e9813803d837   True      False      False      2              2                   2                      0                      10h
```

with the following message:

```
message: 'Node jima4529-mpmvv-master-1 is reporting: "failed to drain node (5 tries): timed out waiting for the condition: error when evicting pod \"etcd-quorum-guard-74dc799696-fmzxr\": global timeout reached: 1m30s"'
reason: 1 nodes are reporting degraded status on sync
```

The must-gather log is available at http://10.66.147.15/jima/must-gather.tar.gz.

Version-Release number of selected component (if applicable):

How reproducible:
Not sure; hit the issue on disconnected UPI on vSphere.

Steps to Reproduce:
1. Install 4.5.0-0.nightly-2020-09-27-230429
2. Upgrade to 4.6.0-0.nightly-2020-09-28-171716

Actual results:
The upgrade failed.

Expected results:
The upgrade succeeds.

Additional info:
This appears to be a race resulting from the migration of quorum-guard from `machine-config-operator` to `cluster-etcd-operator`. In the logs we see:

> namespaces/openshift-machine-config-operator/pods/machine-config-daemon-9t7v5/machine-config-daemon/machine-config-daemon/logs/current.log:9531:2020-09-29T09:33:55.054581019Z E0929 09:33:55.054515 4588 daemon.go:344] error when evicting pod "etcd-quorum-guard-74dc799696-fmzxr" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

> namespaces/openshift-machine-config-operator/pods/machine-config-daemon-9t7v5/machine-config-daemon/machine-config-daemon/logs/current.log:9533:2020-09-29T09:34:00.061796753Z I0929 09:34:00.061747 4588 update.go:150] Draining failed with: error when evicting pod "etcd-quorum-guard-74dc799696-fmzxr": global timeout reached: 1m30s, retrying

Then also:

```
2Z I0929 01:25:16.559295 1 garbagecollector.go:461] object garbagecollector.objectReference{OwnerReference:v1.OwnerReference{APIVersion:"apps/v1", Kind:"ReplicaSet", Name:"etcd-quorum-guard-74dc799696", UID:"ca993874-1a03-43e1-845e-164e658e85dc", Controller:(*bool)(0xc004228107), BlockOwnerDeletion:(*bool)(0xc004228108)}, Namespace:"openshift-etcd"} has at least one existing owner: []v1.OwnerReference{v1.OwnerReference{APIVersion:"apps/v1", Kind:"Deployment", Name:"etcd-quorum-guard", UID:"3fb009df-d5a3-4cbf-80a4-0b6a6baf010f", Controller:(*bool)(0xc009fc6bd7), BlockOwnerDeletion:(*bool)(0xc009fc6bd8)}}, will not garbage collect
```

```
namespaces/openshift-kube-scheduler/pods/openshift-kube-scheduler-jima4529-mpmvv-master-2/kube-scheduler/kube-scheduler/logs/previous.log:157:2020-09-29T01:24:45.482610509Z I0929 01:24:45.482555 1 factory.go:445] "Unable to schedule pod; no fit; waiting" pod="openshift-etcd/etcd-quorum-guard-74dc799696-2ct7h" err="0/5 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't match pod anti-affinity rules."
```

My understanding here is that quorum-guard can become wedged in this case if one node goes unschedulable or kubelet is unreachable, and that the solution could be removing the toleration timeouts [1]. I have begun the backport [2] process in case this is true. I am going to ask the Cloud Team to clarify my theory.

[1] https://github.com/openshift/cluster-etcd-operator/pull/426
[2] https://github.com/openshift/machine-config-operator/pull/2133
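To illustrate the theory above, here is a sketch of what a toleration with a timeout looks like on a pod spec. This is a hypothetical fragment, not the actual cluster-etcd-operator manifest; the point is that with `tolerationSeconds` set, an unreachable node causes the pod to be evicted, and anti-affinity then leaves the replacement unschedulable, which matches the "no fit; waiting" scheduler log above:

```yaml
# Hypothetical toleration fragment (not taken from the real manifest).
# With tolerationSeconds set, the pod is evicted ~120s after the node
# goes unreachable; removing the field keeps the pod bound in place,
# which is the behavior change proposed in [1].
tolerations:
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 120
```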
etcd-quorum-guard-74dc799696-fmzxr is on jima4529-mpmvv-master-1. That must be the host MCO is currently trying to drain. etcd-quorum-guard-74dc799696-fmzxr itself is healthy, so it's not the pod blocking the drain.

etcd-quorum-guard-74dc799696-fgdvx is on jima4529-mpmvv-master-0. Presumably, this is the host that was already drained and upgraded. etcd-quorum-guard-74dc799696-fgdvx is NOT ready. This will properly prevent etcd-quorum-guard from being drained on other hosts.

namespaces/openshift-etcd/pods/etcd-quorum-guard-74dc799696-fgdvx/ logs are empty. The container is listed as running according to pod.yaml.

etcd itself looks like it's not part of the cluster; there are lots of these messages, and it's trying to propose itself as leader:

```
cat namespaces/openshift-etcd/pods/etcd-jima4529-mpmvv-master-0/etcd/etcd/logs/current.log
2020-09-29T10:11:35.849943686Z 2020-09-29 10:11:35.849874 E | etcdserver: publish error: etcdserver: request timed out
2020-09-29T10:11:36.516667391Z 2020-09-29 10:11:36.516592 W | rafthttp: health check for peer 55cc1518dab92a20 could not connect: EOF
2020-09-29T10:11:36.567173335Z 2020-09-29 10:11:36.567072 W | rafthttp: health check for peer 55cc1518dab92a20 could not connect: EOF
2020-09-29T10:11:36.863466319Z 2020-09-29 10:11:36.863396 W | rafthttp: health check for peer a47d6736324309a8 could not connect: EOF
2020-09-29T10:11:36.886144039Z 2020-09-29 10:11:36.884921 W | rafthttp: health check for peer a47d6736324309a8 could not connect: EOF
```

So it looks like etcd itself failed on the first master after the upgrade. etcd-quorum-guard is not ready, but maybe it would be helpful to have some logs in that pod. etcd-quorum-guard worked as intended and prevented a disaster.
I would like to say that my first glance at this was not sound :) sorry for the confusion.

> 2020-09-29T09:02:28.035839826Z 2020-09-29 09:02:28.033761 I | embed: rejected connection from "10.0.0.180:58380" (error "tls: \"10.0.0.180\" does not match any of DNSNames [\"jima4529.qe.devcluster.openshift.com\" \"localhost\" \"10.0.0.233\"] (lookup jima4529.qe.devcluster.openshift.com on 10.0.0.250:53: no such host)", ServerName "", IPAddresses ["10.0.0.233"], DNSNames ["jima4529.qe.devcluster.openshift.com" "localhost" "10.0.0.233"])

So what this tells me is that when the node rebooted during the upgrade, it came up with a different IP. As the TLS certs for etcd have the IP baked into the SAN, the failure is expected. Nodes must have static IPs; closing as not a bug. As Michael very kindly pointed out, quorum-guard did its job.
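The SAN pinning can be demonstrated outside the cluster. The sketch below (the IP 10.0.0.233 is just the example value from the log, and the file paths are arbitrary) generates a throwaway cert with an IP SAN the way the etcd serving cert pins the node address, then inspects it; any peer arriving from a different IP, like 10.0.0.180 above, fails TLS verification:

```shell
# Generate a throwaway key and cert whose SAN pins one IP address
# (requires OpenSSL 1.1.1+ for -addext).
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/etcd-demo.key -out /tmp/etcd-demo.crt \
  -subj "/CN=etcd-demo" \
  -addext "subjectAltName=DNS:localhost,IP:10.0.0.233"

# Inspect the SAN: connections that don't match an entry here are
# rejected, exactly like the "does not match any of DNSNames" error.
openssl x509 -in /tmp/etcd-demo.crt -noout -ext subjectAltName
```

If the node's IP changes after the cert is issued, nothing short of reissuing the cert can make verification succeed again, which is why static addressing matters here.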
More failure cases that need some explanation.

> 2020-09-29T10:12:10.763396335Z 2020-09-29 10:12:10.763347 W | rafthttp: health check for peer f62fd232d108aa69 could not connect: dial tcp 10.0.0.233:2380: connect: no route to host

I guess it's a fair question to ask how the host network got broken.
I am going to reopen this to review https://bugzilla.redhat.com/show_bug.cgi?id=1883521#c4. In speaking with Joseph Callen, he said that there were previous issues around networking segfaulting. I am unsure how this would brick hostNetwork, but I think it's worth looking at.
This is an ovn-kubernetes deployment not openshift-sdn
OVN upgrade is currently broken: https://bugzilla.redhat.com/show_bug.cgi?id=1880591. Some logs are missing in this must-gather, such as the previous logs for the ovs-node pods from when the system was running 4.5. I'm going to dupe this bug to that one; if we see this same failure again after 1880591 is fixed, feel free to open it back up.

*** This bug has been marked as a duplicate of bug 1880591 ***
Sorry, my bad. I pulled the wrong must-gather and thought this was ovn-kubernetes; it's openshift-sdn. Moving back.
Looking at the *right* must-gather this time: OVS is incorrectly running in container mode after the upgrade. I don't see any system journal logs in the must-gather, but it's definitely likely that system OVS is running at the same time as the container OVS. We were tracking that here: https://bugzilla.redhat.com/show_bug.cgi?id=1874696

Running the two at the same time can result in a segfault: https://bugzilla.redhat.com/show_bug.cgi?id=1875534

There's another case where we may run OVS in a container instead of using the host OVS, tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1884101

I'm not sure how that would have any impact on host networking; openshift-sdn doesn't move the NIC to OVS. Regardless, it would be good to see if you can reproduce after 1884101 is fixed.
Jinyun could you please try to reproduce with https://bugzilla.redhat.com/show_bug.cgi?id=1884101 ?
(In reply to Tim Rozet from comment #11) > Jinyun could you please try to reproduce with > https://bugzilla.redhat.com/show_bug.cgi?id=1884101 ? With this fix I mean for that bug: https://github.com/openshift/cluster-network-operator/pull/816
Very likely a duplicate of Bug 1884101 *** This bug has been marked as a duplicate of bug 1884101 ***
https://bugzilla.redhat.com/show_bug.cgi?id=1885365
Ross now that the fix for upgrade (tracked by 1880591) is merged, can you please see if this bug is reproducible?
This looks like a problem with the DHCP server setup. As Ross noticed, the IP on the master changes; in his deployment I can see master-0 change from 10.0.0.164 to 10.0.0.228, violating quorum guard. Then in the journal:

```
Oct 08 10:36:27 localhost dhclient[856]: bound to 10.0.0.228 -- renewal in 122 seconds.
Oct 08 10:39:25 rbrattai081840-zgsjw-master-0 dhclient[1753]: bound to 10.0.0.164 -- renewal in 117 seconds
```

The IP changes on the host, and it's pure chance which one kubelet and other processes then start with and use:

```
Oct 08 12:14:46 rbrattai081840-zgsjw-master-0 hyperkube[1889]: I1008 12:14:46.243139 1889 flags.go:59] FLAG: --address="10.0.0.228"
```
It looks like the errors listed in comment 21 are due to an incorrect ignition file:

```
Oct 13 01:08:54 localhost ignition[1122]: INFO : files: createFilesystemsFiles: createFiles: op(a): [started] writing file "/etc/dhcp/dhclient.conf"
```

```
[root@jimatest17-m6fn7-master-0 ~]# cat /etc/dhcp/dhclient.conf
supersede domain-search "jimatest17.qe.devcluster.openshift.com";
prepend domain-name-servers ;
```

There are no IPs listed after `prepend domain-name-servers`, which results in:

```
Oct 13 15:30:39 jimatest17-m6fn7-master-0 dhclient[422208]: /etc/dhcp/dhclient.conf line 2: semicolon expected.
```

For your DHCP server, I think you should also use static entries for your nodes. I don't see how this is a networking bug.
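A minimal check for the malformed file, reconstructed from the contents quoted above (the temp path is arbitrary; a real check would point at /etc/dhcp/dhclient.conf on the node):

```shell
# Reproduce the broken dhclient.conf from the node.
conf=/tmp/dhclient-demo.conf
cat > "$conf" <<'EOF'
supersede domain-search "jimatest17.qe.devcluster.openshift.com";
prepend domain-name-servers ;
EOF

# Flag a "prepend domain-name-servers" directive rendered with no IPs,
# which is what trips dhclient's "semicolon expected" parse error.
if grep -Eq '^prepend domain-name-servers[[:space:]]*;' "$conf"; then
  echo "broken: prepend directive rendered with no server IPs"
fi
```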
I tested again according to comment 22:

1. Install OCP 4.5.14
2. Reserve the control-plane node IPs (add the IPs into /etc/dhcp/dhcpd.conf on the DHCP server and restart dhcpd)
3. Upgrade to 4.6.0-rc.4

The upgrade is successful:

```
# ./oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-rc.4   True        False         26m     Cluster version is 4.6.0-rc.4
```

So we will add a comment to the 4.6 release notes asking customers to set their control plane nodes to static reservations after they have been provisioned.

I am reopening the bug. Since it is no longer a release blocker, I am removing the UpgradeBlocker keyword and the target release.

@Joseph, the default-lease-time on the DHCP server is 30 min. I noticed that the IP change only happened twice:

1. Installing 4.5: when booting up, the node's DHCP IP is different from the one used to fetch the ignition file.
2. Upgrading to 4.6: when updating the RHCOS OS and rebooting the node.

The IP does not change when just rebooting the node. From the DHCP log, rebooting the node in the two scenarios above took less than 5 min, while default-lease-time is 30 min, which I think should be enough, right? So it's really strange, since the upgrade was successful three weeks ago with the same environment and the same DHCP configuration. The issue only started happening recently; maybe some change introduced it?
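For reference, the static reservation used in step 2 takes roughly this shape in /etc/dhcp/dhcpd.conf (the MAC and IP below are placeholders, not values from this environment):

```
# One host block per control-plane node pins its lease to the NIC's MAC.
host jima4529-mpmvv-master-0 {
  hardware ethernet 00:50:56:aa:bb:cc;   # placeholder MAC
  fixed-address 10.0.0.233;              # placeholder IP
}
```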
Setting target release to the active development branch (4.7.0). For any fixes, where required and requested, cloned BZs will be created for those release maintenance streams where appropriate once they are identified.
@Jima, I don't have access to the QE environment. Do you know if there was more activity on it at the time of the original upgrade? Maybe more DHCP requests than usual?

I asked the kni-networking folks about the prepend because that was a thing in 4.5: https://coreos.slack.com/archives/CG6252WBY/p1602607901407100

The guess was that this was not being rendered correctly at some point:

```
prepend domain-name-servers {{ .Infra.Status.PlatformStatus.VSphere.NodeDNSIP }};
```

https://github.com/openshift/machine-config-operator/blob/release-4.5/templates/master/00-master/vsphere/files/dhcp-dhclient-conf.yaml
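That guess is easy to sketch: if `.Infra.Status.PlatformStatus.VSphere.NodeDNSIP` were empty at render time, the template line would produce exactly the malformed directive found on the node. Simulated here with a plain shell variable rather than the real MCO template renderer:

```shell
# Stand-in for .Infra.Status.PlatformStatus.VSphere.NodeDNSIP being
# empty when the template was rendered.
NODE_DNS_IP=""
echo "prepend domain-name-servers ${NODE_DNS_IP};"
# → prepend domain-name-servers ;   (the broken line from dhclient.conf)
```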
Joseph, I see that all the commits in the dhclient.conf are yours: https://github.com/openshift/machine-config-operator/commits/release-4.5/templates/master/00-master/vsphere/files/dhcp-dhclient-conf.yaml I'm not familiar with where the .Infra.Status.PlatformStatus.VSphere.NodeDNSIP is defined, can I assign this bug to you?
Over to MCO then, since that's where the erroneous file lives.
That file got dropped on the way to 4.6: 4.5 has it and it's configured, 4.6 doesn't. https://github.com/openshift/machine-config-operator/pull/1840/files

On the upgrade, the MCO will remove that file (the dhcp config), so whoever was/is in charge of it has to provide some sort of replacement support. I'm not sure what's happening here network-wise, but the rendering issue doesn't seem to be related: once the new render happens we're on 4.6, which won't render that file at all since it's not in 4.6, so whatever is happening there, something else is doing it.

I think it's also worth noting https://bugzilla.redhat.com/show_bug.cgi?id=1883521#c19, where it says that this was working some time ago. Moving to Ben (and adjusting priority).
If I understand correctly, this is solved by creating a static DHCP assignment in the DHCP server, which is something deployers should be doing anyway, right? Is there anything left to fix here?

As for why the address changes without the static assignment, my guess would be that it has something to do with the removal of dhclient from the image in 4.6. That means any leases from 4.5 will be lost on upgrade, so the system has to request a new one. I'm not sure why the DHCP server wouldn't just give back the active lease, but the fact that forcing it to do so fixes the problem makes me think this isn't an issue on the client side (for example, if the MAC address the DHCP request is sent from changed, that could cause the same issue, but it wouldn't be fixed by a static assignment on the server side).