Bug 1883521 - [IPI on vSphere 6.5/6.7] upgrade from 4.5 to 4.6 fails at the step of updating the MCO; master MCP is degraded
Summary: [IPI on vSphere 6.5/6.7] upgrade from 4.5 to 4.6 fails at the step of updating...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Ben Nemec
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks: 1917186
 
Reported: 2020-09-29 13:31 UTC by jima
Modified: 2024-03-25 16:36 UTC
CC List: 19 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-17 23:05:58 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1900784 0 unspecified CLOSED [DOCS] Static DHCP assignment needed in IPI vSphere installations 2023-12-15 20:08:40 UTC
Red Hat Knowledge Base (Solution) 5595281 0 None None None 2020-12-04 13:20:22 UTC

Internal Links: 1900784

Description jima 2020-09-29 13:31:47 UTC
Description of problem:
Installed disconnected OCP with 4.5.0-0.nightly-2020-09-27-230429 and upgraded to 4.6.0-0.nightly-2020-09-28-171716; the upgrade is stuck at "Working towards 4.6.0-0.nightly-2020-09-28-171716: 11% complete".

# ./oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-09-27-230429   True        True          9h      Working towards 4.6.0-0.nightly-2020-09-28-171716: 11% complete

And found that one master node is in the "Ready,SchedulingDisabled" state.
# ./oc get nodes
NAME                          STATUS                     ROLES    AGE   VERSION
jima4529-mpmvv-master-0       Ready                      master   10h   v1.19.0+fff8183
jima4529-mpmvv-master-1       Ready,SchedulingDisabled   master   10h   v1.18.3+970c1b3
jima4529-mpmvv-master-2       Ready                      master   10h   v1.18.3+970c1b3
jima4529-mpmvv-worker-gwpgk   Ready                      worker   10h   v1.19.0+fff8183
jima4529-mpmvv-worker-zt4lq   Ready                      worker   10h   v1.19.0+fff8183

# ./oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-4972ec85a2023290770f5f4ae8c6cdba   False     True       True       3              1                   1                     1                      10h
worker   rendered-worker-a3186aae6c8f5175f7d2e9813803d837   True      False      False      2              2                   2                     0                      10h

The master machine config pool is degraded with the following message:

    message: 'Node jima4529-mpmvv-master-1 is reporting: "failed to drain node (5 tries): timed out waiting for the condition: error when evicting pod \"etcd-quorum-guard-74dc799696-fmzxr\": global timeout reached: 1m30s"'
    reason: 1 nodes are reporting degraded status on sync
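
A quick way to see which PodDisruptionBudget covers etcd-quorum-guard and where its pods are scheduled (just a sketch, assuming cluster-admin access with ./oc; the quorum-guard pods may live in openshift-etcd or openshift-machine-config-operator depending on version, so search all namespaces):

```
# list the PDB protecting etcd-quorum-guard and the pods it covers
./oc get pdb -A | grep etcd-quorum-guard
./oc get pods -A -o wide | grep etcd-quorum-guard
```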

Please get the must-gather log from http://10.66.147.15/jima/must-gather.tar.gz.

Version-Release number of selected component (if applicable):


How reproducible:
Not sure; hit the issue on a disconnected UPI-on-vSphere install.

Steps to Reproduce:
1. Install 4.5.0-0.nightly-2020-09-27-230429
2. Upgrade to 4.6.0-0.nightly-2020-09-28-171716
3.

Actual results:
The upgrade failed.

Expected results:
The upgrade is successful.

Additional info:

Comment 1 Sam Batschelet 2020-10-01 12:25:45 UTC
This appears to be a race resulting from the migration of quorum-guard from `machine-config-operator` to `cluster-etcd-operator`. In the logs we see:


>namespaces/openshift-machine-config-operator/pods/machine-config-daemon-9t7v5/machine-config-daemon/machine-config-daemon/logs/current.log:9531:2020-09-29T09:33:55.054581019Z E0929 09:33:55.054515    4588 daemon.go:344] error when evicting pod "etcd-quorum-guard-74dc799696-fmzxr" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.


>namespaces/openshift-machine-config-operator/pods/machine-config-daemon-9t7v5/machine-config-daemon/machine-config-daemon/logs/current.log:9533:2020-09-29T09:34:00.061796753Z I0929 09:34:00.061747    4588 update.go:150] Draining failed with: error when evicting pod "etcd-quorum-guard-74dc799696-fmzxr": global timeout reached: 1m30s, retrying

Then also 

```
2Z I0929 01:25:16.559295       1 garbagecollector.go:461] object garbagecollector.objectReference{OwnerReference:v1.OwnerReference{APIVersion:"apps/v1", Kind:"ReplicaSet", Name:"etcd-quorum-guard-74dc799696", UID:"ca993874-1a03-43e1-845e-164e658e85dc", Controller:(*bool)(0xc004228107), BlockOwnerDeletion:(*bool)(0xc004228108)}, Namespace:"openshift-etcd"} has at least one existing owner: []v1.OwnerReference{v1.OwnerReference{APIVersion:"apps/v1", Kind:"Deployment", Name:"etcd-quorum-guard", UID:"3fb009df-d5a3-4cbf-80a4-0b6a6baf010f", Controller:(*bool)(0xc009fc6bd7), BlockOwnerDeletion:(*bool)(0xc009fc6bd8)}}, will not garbage collect
```

```
namespaces/openshift-kube-scheduler/pods/openshift-kube-scheduler-jima4529-mpmvv-master-2/kube-scheduler/kube-scheduler/logs/previous.log:157:2020-09-29T01:24:45.482610509Z I0929 01:24:45.482555       1 factory.go:445] "Unable to schedule pod; no fit; waiting" pod="openshift-etcd/etcd-quorum-guard-74dc799696-2ct7h" err="0/5 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match node selector, 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't match pod anti-affinity rules."
```

My understanding here is that quorum-guard can become wedged in this case if one node goes unschedulable or the kubelet is unreachable, and that the solution could be removing the toleration timeouts[1]. I have begun the backport[2] process in case this holds true. I am going to ask the Cloud Team to clarify my theory.

[1] https://github.com/openshift/cluster-etcd-operator/pull/426
[2] https://github.com/openshift/machine-config-operator/pull/2133

Comment 2 Michael Gugino 2020-10-01 13:28:32 UTC
etcd-quorum-guard-74dc799696-fmzxr is on jima4529-mpmvv-master-1. That must be the host the MCO is currently trying to drain. etcd-quorum-guard-74dc799696-fmzxr itself is healthy, so it's not the pod blocking the drain.

etcd-quorum-guard-74dc799696-fgdvx is on jima4529-mpmvv-master-0.  Presumably, this is the host that was already drained and upgraded.  etcd-quorum-guard-74dc799696-fgdvx is NOT ready.  This will properly prevent etcd-quorum-guard from being drained on other hosts.


namespaces/openshift-etcd/pods/etcd-quorum-guard-74dc799696-fgdvx/  Logs are empty.  Container is listed as running according to pod.yaml.

etcd itself looks like it's not part of the cluster; there are lots of these messages and it's trying to propose itself as leader.

cat namespaces/openshift-etcd/pods/etcd-jima4529-mpmvv-master-0/etcd/etcd/logs/current.log

2020-09-29T10:11:35.849943686Z 2020-09-29 10:11:35.849874 E | etcdserver: publish error: etcdserver: request timed out
2020-09-29T10:11:36.516667391Z 2020-09-29 10:11:36.516592 W | rafthttp: health check for peer 55cc1518dab92a20 could not connect: EOF
2020-09-29T10:11:36.567173335Z 2020-09-29 10:11:36.567072 W | rafthttp: health check for peer 55cc1518dab92a20 could not connect: EOF
2020-09-29T10:11:36.863466319Z 2020-09-29 10:11:36.863396 W | rafthttp: health check for peer a47d6736324309a8 could not connect: EOF
2020-09-29T10:11:36.886144039Z 2020-09-29 10:11:36.884921 W | rafthttp: health check for peer a47d6736324309a8 could not connect: EOF


So it looks like etcd itself failed on the first master after the upgrade. etcd-quorum-guard is not ready, but maybe it would be helpful to have some logs in that pod. etcd-quorum-guard worked as intended and averted disaster.
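
A quick way to confirm the quorum-guard/etcd state described above on a live cluster (a sketch, assuming ./oc access; the pod name is the one from the must-gather):

```
./oc get clusteroperator etcd
./oc -n openshift-etcd get pods -o wide
./oc -n openshift-etcd describe pod etcd-quorum-guard-74dc799696-fgdvx
```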

Comment 3 Sam Batschelet 2020-10-01 14:16:34 UTC
I would like to say that my first glance at this was not sound :) sorry for the confusion.

>  2020-09-29T09:02:28.035839826Z 2020-09-29 09:02:28.033761 I | embed: rejected connection from "10.0.0.180:58380" (error "tls: \"10.0.0.180\" does not match any of DNSNames [\"jima4529.qe.devcluster.openshift.com\" \"localhost\" \"10.0.0.233\"] (lookup jima4529.qe.devcluster.openshift.com on 10.0.0.250:53: no such host)", ServerName "", IPAddresses ["10.0.0.233"], DNSNames ["jima4529.qe.devcluster.openshift.com" "localhost" "10.0.0.233"])

So what this tells me is that when the node upgraded, after the reboot it came up with a different IP. As the TLS certs for etcd have the IP baked into the SAN, the failure is expected. Nodes must have static IPs; closing as not a bug. As Michael very kindly pointed out, quorum-guard did its job.
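
To double-check which names and IPs are actually baked into the serving cert's SAN, something like this can be run from a host that can reach the node (a sketch; 10.0.0.233 is taken from the log above, and the handshake will fail without a client cert, but the server certificate is usually still printed):

```
echo | openssl s_client -connect 10.0.0.233:2379 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 "Subject Alternative Name"
```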

Comment 4 Sam Batschelet 2020-10-01 14:53:40 UTC
More failure cases that need some explanation.

> 2020-09-29T10:12:10.763396335Z 2020-09-29 10:12:10.763347 W | rafthttp: health check for peer f62fd232d108aa69 could not connect: dial tcp 10.0.0.233:2380: connect: no route to host

I guess it's a fair question to ask how the host network got broken.

Comment 5 Sam Batschelet 2020-10-01 15:51:08 UTC
I am going to reopen this to review https://bugzilla.redhat.com/show_bug.cgi?id=1883521#c4. In speaking with Joseph Callen, he said that there were previous issues around networking segfaulting. Now I am unsure how this would brick hostNetwork, but I think it's worth looking at.

Comment 7 Tim Rozet 2020-10-01 17:40:10 UTC
This is an ovn-kubernetes deployment, not openshift-sdn.

Comment 8 Tim Rozet 2020-10-01 18:06:24 UTC
OVN upgrade is currently broken: https://bugzilla.redhat.com/show_bug.cgi?id=1880591. Some logs are missing in this must-gather, like the previous logs for the ovs-node pods from when the system was running 4.5. I'm going to dupe this bug to that one; if we see this same failure again after 1880591 is fixed, feel free to open it back up.

*** This bug has been marked as a duplicate of bug 1880591 ***

Comment 9 Tim Rozet 2020-10-01 18:14:45 UTC
Sorry, my bad. I pulled the wrong must-gather and thought this was ovn-kubernetes; it's openshift-sdn. Moving back.

Comment 10 Tim Rozet 2020-10-01 18:52:13 UTC
Looking at the *right* must-gather this time. OVS is incorrectly running in container mode after the upgrade. I don't see any system journal logs in the must-gather, but it's quite likely that system OVS is running at the same time as the container OVS. We were tracking that here:

https://bugzilla.redhat.com/show_bug.cgi?id=1874696

A consequence of running the two at the same time can be a segfault, tracked here:
https://bugzilla.redhat.com/show_bug.cgi?id=1875534

There's another case where we may run OVS in a container instead of using the host OVS. Tracking that here:
https://bugzilla.redhat.com/show_bug.cgi?id=1884101

I'm not sure how that would have any impact on host networking; openshift-sdn doesn't move the NIC to OVS. Regardless, it would be good to see if you can reproduce after 1884101 is fixed.
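
A sketch for checking whether host OVS and containerized OVS are running at the same time on a node (the unit names ovs-vswitchd/openvswitch and the availability of crictl on the node are assumptions; <node> is a placeholder):

```
./oc debug node/<node> -- chroot /host sh -c \
  'systemctl is-active ovs-vswitchd openvswitch 2>/dev/null; crictl ps --name ovs 2>/dev/null'
```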

Comment 11 Tim Rozet 2020-10-02 01:44:26 UTC
Jinyun could you please try to reproduce with https://bugzilla.redhat.com/show_bug.cgi?id=1884101 ?

Comment 12 Tim Rozet 2020-10-02 01:45:37 UTC
(In reply to Tim Rozet from comment #11)
> Jinyun could you please try to reproduce with
> https://bugzilla.redhat.com/show_bug.cgi?id=1884101 ?

By "this fix" I mean the fix for that bug:
https://github.com/openshift/cluster-network-operator/pull/816

Comment 13 Dan Williams 2020-10-02 04:48:06 UTC
Very likely a duplicate of Bug 1884101

*** This bug has been marked as a duplicate of bug 1884101 ***

Comment 16 Ross Brattain 2020-10-06 11:52:44 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1885365

Comment 17 Tim Rozet 2020-10-08 02:20:42 UTC
Ross, now that the fix for the upgrade (tracked by 1880591) is merged, can you please see if this bug is reproducible?

Comment 18 Tim Rozet 2020-10-08 13:58:53 UTC
This looks like a problem with the DHCP server setup. As Ross noticed, the IP on the master changes, and in his deployment I can see:
master-0 changed from 10.0.0.164 to 10.0.0.228, violating the quorum guard


Then in journal:

Oct 08 10:36:27 localhost dhclient[856]: bound to 10.0.0.228 -- renewal in 122 seconds.
Oct 08 10:39:25 rbrattai081840-zgsjw-master-0 dhclient[1753]: bound to 10.0.0.164 -- renewal in 117 seconds

The IP changes on the host, and it's pure chance which address the kubelet and other processes then start with and use:

Oct 08 12:14:46 rbrattai081840-zgsjw-master-0 hyperkube[1889]: I1008 12:14:46.243139    1889 flags.go:59] FLAG: --address="10.0.0.228"
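
A sketch for spotting address flapping from the node's own journal (assumes access via oc debug; dhclient is the logger seen above, and if the DHCP client differs between 4.5 and 4.6 a broader grep on 'dhcp' still catches it):

```
./oc debug node/rbrattai081840-zgsjw-master-0 -- chroot /host \
  sh -c "journalctl -b --no-pager | grep -Ei 'dhclient.*bound to|dhcp'"
```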

Comment 23 Tim Rozet 2020-10-13 16:12:42 UTC
It looks like the errors listed in comment 21 are due to an incorrect ignition file:
Oct 13 01:08:54 localhost ignition[1122]: INFO     : files: createFilesystemsFiles: createFiles: op(a): [started]  writing file "/etc/dhcp/dhclient.conf"

[root@jimatest17-m6fn7-master-0 ~]# cat /etc/dhcp/dhclient.conf 
supersede domain-search "jimatest17.qe.devcluster.openshift.com";
prepend domain-name-servers ;

^ there are no IPs listed

which results in:
Oct 13 15:30:39 jimatest17-m6fn7-master-0 dhclient[422208]: /etc/dhcp/dhclient.conf line 2: semicolon expected.

For your DHCP server, I think you should also use static entries for your nodes (a sketch of such a reservation is below). I don't see how this is a networking bug.
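
A sketch of a static reservation in ISC dhcpd (/etc/dhcp/dhcpd.conf on the DHCP server); the MAC and IP are placeholders, use each node's real NIC MAC and the address it was originally installed with:

```
host jimatest17-m6fn7-master-0 {
  hardware ethernet 00:50:56:aa:bb:cc;   # placeholder: the VM NIC's MAC
  fixed-address 10.0.0.233;              # placeholder: the node's install-time address
}
# one stanza per control-plane node, then restart dhcpd on the server
```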

Comment 24 jima 2020-10-14 05:32:37 UTC
I tested again according to comment 22:
1. Install OCP 4.5.14
2. Reserve the control-plane node IPs (add the IPs to /etc/dhcp/dhcpd.conf on the DHCP server and restart dhcpd)
3. Upgrade to 4.6.0-rc.4

The upgrade is successful.
# ./oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-rc.4   True        False         26m     Cluster version is 4.6.0-rc.4

So we will add a note to the 4.6 release notes asking customers to set static reservations for their control-plane nodes after they have been provisioned.

I am reopening the bug; since it is not a release blocker now, I removed the UpgradeBlocker keyword and the target release.

@Joseph, the default-lease-time on the DHCP server is 30 min. I noticed that the IP change only happened twice:
1. Installing 4.5: when booting up the node, the DHCP IP is different from the one used to fetch the ignition file
2. Upgrading to 4.6: when updating the RHCOS OS and rebooting the node

The IP does not change if the node is simply rebooted.

From the DHCP log, rebooting the node in the above two scenarios took less than 5 min, while the default-lease-time is 30 min; I think that should be enough, right?
So it's really strange, since the upgrade was successful three weeks ago with the same env and the same DHCP configuration.
The issue only started happening recently; maybe some change introduced it?

Comment 25 Stephen Cuppett 2020-10-14 11:44:04 UTC
Setting target release to the active development branch (4.7.0). For any fixes, where required and requested, cloned BZs will be created for those release maintenance streams where appropriate once they are identified.

Comment 26 Joseph Callen 2020-10-14 13:36:55 UTC
@Jima, I don't have access to the QE environment. Do you know if there was more activity on it at the time of the original upgrade?
Maybe more DHCP requests than usual?

I asked the kni-networking folks about the prepend because that was a thing in 4.5:
https://coreos.slack.com/archives/CG6252WBY/p1602607901407100

The guess was that this was not being rendered correctly at some point:     

prepend domain-name-servers {{ .Infra.Status.PlatformStatus.VSphere.NodeDNSIP }};

https://github.com/openshift/machine-config-operator/blob/release-4.5/templates/master/00-master/vsphere/files/dhcp-dhclient-conf.yaml
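
For illustration only: with an empty NodeDNSIP, that template line renders into exactly the broken file quoted in comment 23.

```
# template in the 4.5 MCO (see the link above):
prepend domain-name-servers {{ .Infra.Status.PlatformStatus.VSphere.NodeDNSIP }};
# rendered when NodeDNSIP is empty:
prepend domain-name-servers ;
# which is the line dhclient rejects with "semicolon expected."
```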

Comment 27 Juan Luis de Sousa-Valadas 2020-10-14 14:29:19 UTC
Joseph, I see that all the commits to dhclient.conf are yours:
https://github.com/openshift/machine-config-operator/commits/release-4.5/templates/master/00-master/vsphere/files/dhcp-dhclient-conf.yaml

I'm not familiar with where .Infra.Status.PlatformStatus.VSphere.NodeDNSIP is defined; can I assign this bug to you?

Comment 29 Dan Williams 2020-10-14 14:59:37 UTC
Over to MCO then, since that's where the erroneous file lives.

Comment 32 Antonio Murdaca 2020-10-22 10:16:21 UTC
That file got dropped on the way to 4.6, so 4.5 has it and it's configured, while 4.6 doesn't: https://github.com/openshift/machine-config-operator/pull/1840/files

On the upgrade, the MCO will remove that file (the dhcp piece), so whoever was/is in charge of that has to create some sort of support instead. I'm not sure what's happening here network-wise, but the rendering issue doesn't seem to be related: we're on 4.6 once the new render happens, and it won't render that file since it's not in 4.6, so whatever happens there, something else is doing it.

I think it's also worth noting https://bugzilla.redhat.com/show_bug.cgi?id=1883521#c19, where it says that this was working some time ago. Moving to Ben (and adjusting priority).

Comment 34 Ben Nemec 2020-10-22 20:49:49 UTC
If I understand correctly, this is solved by creating a static DHCP assignment in the DHCP server, which is something deployers should be doing anyway, right? Is there anything to fix here?

As far as why the address changes without the static assignment, my guess would be that it has something to do with the removal of dhclient from the image in 4.6. That means any leases from 4.5 will be lost on upgrade, so the system will have to request a new one. I'm not sure why the DHCP server wouldn't just give back the active lease, but the fact that forcing it to do so fixes the problem makes me think this isn't an issue on the client side (for example, if the MAC address the DHCP request is sent from changed, that could cause the same issue, but it wouldn't be fixed by a static assignment on the server side).
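
A sketch for checking the client side after the upgrade (the <master> name is a placeholder, and the lease paths /var/lib/dhclient and /var/lib/NetworkManager are assumptions that may differ by RHCOS version):

```
./oc debug node/<master> -- chroot /host sh -c \
  'command -v dhclient || echo "dhclient not present"; ls /var/lib/dhclient /var/lib/NetworkManager 2>/dev/null'
```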

