Bug 1895024
| Summary: | 4.6.1 Install Fails with ovs Pods in CrashLoopBackOff | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Chris Keller <ckeller> |
| Component: | Networking | Assignee: | Juan Luis de Sousa-Valadas <jdesousa> |
| Networking sub component: | openshift-sdn | QA Contact: | Ross Brattain <rbrattai> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | aaustin, aconstan, anbhat, bugzilla, fshaikh, jdesousa, jkho, jocolema, maupadhy, mjahangi, mwasher, openshift-bugs-escalate, pchavva, rbost, rheinzma, rjamadar, simore, sjayapra, skanakal, sthaha, xialhuan, ychoukse, zzhao |
| Version: | 4.6.z | | |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-01-20 13:50:56 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Chris Keller
2020-11-05 15:25:51 UTC
I am also running into this issue upgrading a UPI bare metal environment from 4.5.18 to 4.6.4 in the fast channel. When a node reboots from RHCOS 4.5 to 4.6 as part of the upgrade, the ovs pod on the node enters a CrashLoopBackOff state because it is trying to launch OVS inside the container, as in Chris's environment. I noticed that the ovs-configuration systemd unit was not starting (and therefore not creating the /var/run/ovs-config-executed file) because ovs-vswitchd was not starting, and ovs-vswitchd was not starting because the ovs pod was already trying to run openvswitch in the container. I tried Chris's workaround of deleting the PID file, but that results in conflicting copies of ovsdb-server running on the node.

I was able to work around the issue in a saner manner with the procedure below (commands are sketched after this comment). This must be done after every node reboot.

1. SSH into the node.
2. touch /var/run/ovs-config-executed
3. Delete the ovs-* pod on the node and verify that it restarts and logs something like "openvswitch is running in systemd".
4. systemctl start ovs-configuration (this starts all dependent OVS services as well).

Linking customer case 02810779 that saw this issue. The customer in 02810779 continues to have this problem after a reboot of a node. Below is the state of the services after a reboot. If we manually restart openvswitch.service and ovs-configuration.service, all is well until the next reboot.

* ovs-configuration.service - Configures OVS with proper host networking configuration
  Loaded: loaded (/etc/systemd/system/ovs-configuration.service; enabled; vendor preset: disabled)
  Active: inactive (dead)
  Dec 03 06:39:57 localhost systemd[1]: Dependency failed for Configures OVS with proper host networking configuration.
  Dec 03 06:39:57 localhost systemd[1]: ovs-configuration.service: Job ovs-configuration.service/start failed with result 'dependency'.

* openvswitch.service - Open vSwitch
  Loaded: loaded (/usr/lib/systemd/system/openvswitch.service; enabled; vendor preset: disabled)
  Active: inactive (dead)
  Dec 03 06:39:57 localhost systemd[1]: Dependency failed for Open vSwitch.
  Dec 03 06:39:57 localhost systemd[1]: openvswitch.service: Job openvswitch.service/start failed with result 'dependency'.

* ovs-vswitchd.service - Open vSwitch Forwarding Unit
  Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
  Drop-In: /etc/systemd/system/ovs-vswitchd.service.d
           `-10-ovs-vswitchd-restart.conf
  Active: inactive (dead)
  Dec 03 06:39:57 localhost systemd[1]: Dependency failed for Open vSwitch Forwarding Unit.
  Dec 03 06:39:57 localhost systemd[1]: ovs-vswitchd.service: Job ovs-vswitchd.service/start failed with result 'dependency'.

* ovsdb-server.service - Open vSwitch Database Unit
  Loaded: loaded (/usr/lib/systemd/system/ovsdb-server.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/ovsdb-server.service.d
           `-10-ovsdb-restart.conf
  Active: active (running) since Thu 2020-12-03 06:40:30 UTC; 18h ago

(-> signifies "Requires" in the service unit below)

- ovs-configuration.service -> openvswitch.service
- openvswitch.service -> ovsdb-server.service
- openvswitch.service -> ovs-vswitchd.service
- ovs-vswitchd.service -> ovsdb-server.service
- ovsdb-server.service is the only one without a Requires directive

Note that ovsdb-server is the only service in the group that is running. BUT, note the timestamp: it became active AFTER the services that depend on it had already failed. I would have expected the Restart directive on ovs-vswitchd.service to kick in, but I don't fully understand it.
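For reference, a rough sketch of the per-reboot commands behind the steps above. The `<node>` placeholder and the `app=ovs` label selector for the openshift-sdn ovs DaemonSet pods are assumptions; adjust them to your environment.

~~~
# Hedged sketch of the per-reboot workaround described in the comment above.
NODE=<node>

# Steps 1-2: on the node, mark ovs-configuration as already executed
ssh core@"$NODE" 'sudo touch /var/run/ovs-config-executed'

# Step 3: delete the ovs pod on that node; the replacement should log
# something like "openvswitch is running in systemd"
oc -n openshift-sdn delete pod -l app=ovs --field-selector "spec.nodeName=$NODE"

# Step 4: start ovs-configuration, which pulls in the dependent OVS units
ssh core@"$NODE" 'sudo systemctl start ovs-configuration.service'
~~~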
Unfortunately, we don't have logs from Dec 03 06:00 to review (the journal was truncated), but we have another customer in case 02816167 whom we're asking to gather information. I see two customers (both linked to this bug) running in vSphere who see this problem where the openvswitch and ovs-configuration services require a manual restart after a reboot. This is the only workaround at the moment. Is there any way we can make these OVS-related services more resilient to a timeout?

Log Analysis: In the sosreports for both nodes that I asked customers to reboot, I see the following behavior:

Dec 07 08:50:16 localhost systemd[1]: Starting Open vSwitch Database Unit...
...
Dec 07 08:51:46 localhost systemd[1]: sssd.service: Start operation timed out. Terminating.
Dec 07 08:51:46 localhost systemd[1]: ovsdb-server.service: Start operation timed out. Terminating.
Dec 07 08:51:46 localhost systemd[1]: ovsdb-server.service: Failed with result 'timeout'.
Dec 07 08:51:46 localhost systemd[1]: Failed to start Open vSwitch Database Unit.
Dec 07 08:51:46 localhost systemd[1]: Dependency failed for Open vSwitch.
Dec 07 08:51:46 localhost systemd[1]: Dependency failed for Configures OVS with proper host networking configuration.
Dec 07 08:51:46 localhost systemd[1]: ovs-configuration.service: Job ovs-configuration.service/start failed with result 'dependency'.
Dec 07 08:51:46 localhost systemd[1]: openvswitch.service: Job openvswitch.service/start failed with result 'dependency'.
Dec 07 08:51:46 localhost systemd[1]: Dependency failed for Open vSwitch Forwarding Unit.
Dec 07 08:51:46 localhost systemd[1]: ovs-vswitchd.service: Job ovs-vswitchd.service/start failed with result 'dependency'.
Dec 07 08:51:46 localhost systemd[1]: ovsdb-server.service: Consumed 82ms CPU time
Dec 07 08:51:46 localhost systemd[1]: ovs-delete-transient-ports.service: Starting requested but asserts failed.
Dec 07 08:51:46 localhost systemd[1]: Assertion failed for Open vSwitch Delete Transient Ports.

After these services time out initially, only ovsdb-server recovers:

Dec 07 08:51:47 localhost systemd[1]: ovsdb-server.service: Service RestartSec=100ms expired, scheduling restart.
Dec 07 08:51:47 localhost systemd[1]: ovsdb-server.service: Scheduled restart job, restart counter is at 1.
Dec 07 08:52:20 localhost ovs-ctl[1405]: Starting ovsdb-server.
Dec 07 08:52:20 localhost systemd[1]: Started Open vSwitch Database Unit.

Both customers also have low entropy (under 1,000) reported in the sosreports, and I can see crng init doesn't complete until more than a minute after boot (a quick way to confirm this on a node is sketched after this comment). It looks like this is a contributor to the slow boot and the eventual timeout of the OVS-related services. However, as mentioned in my last comment, I'd still expect the services to be more resilient to this.

Hi Dev team, I flagged this bug "customer escalation" as the customer has raised concerns about the progress. It would be appreciated if the priority could be raised in this situation. Thank you. Eva Huang, Escalation Manager

Hi Dev team, are you working through the Christmas and New Year period? Please advise. Thank you. Best regards, Eva Huang, Escalation Manager

Hi, I have a customer (02816539) who has the same issue. Regards, selim

I've seen a few people commenting on this bug who are touching /var/run/ovs-config-executed. I don't think anyone should be doing this, as it would indicate that ovs-configuration.service ran successfully, which it clearly hasn't if you had to touch the file! @Juan Luis, can you confirm that users should *not* touch /var/run/ovs-config-executed?
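A minimal sketch of how the low-entropy symptom and the start timeout described above can be confirmed on an affected node (read-only checks, no changes are made):

~~~
# Run on the affected node (e.g. via "oc debug node/<node>" or SSH).
cat /proc/sys/kernel/random/entropy_avail                  # available entropy pool
grep -c rdrand /proc/cpuinfo                               # 0 => CPU does not expose rdrand
journalctl -k | grep -i 'crng init done'                   # when the kernel RNG finished initializing
systemctl show -p TimeoutStartUSec ovsdb-server.service    # the start timeout that was exceeded
~~~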
I created https://access.redhat.com/solutions/5635011 last week and it's linked to this BZ. Also, note https://bugzilla.redhat.com/show_bug.cgi?id=1895024#c10 where I mention a workaround that helped 2 of the 3 linked customers at the time.

Looks like the current timeouts are:

openvswitch.service       TimeoutStartUSec=infinity
ovs-vswitchd.service      TimeoutStartUSec=5min
ovsdb-server.service      TimeoutStartUSec=1min 30s
ovs-configuration.service TimeoutStartUSec=infinity

We might be able to just bump up the start timeout for now:

cat > /etc/systemd/system/ovsdb-server.service.d/20-ovsdb-timeout.conf
[Service]
TimeoutStartSec=20min
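A sketch of applying and verifying that override on a node; the 20-minute value is the one floated above, not a validated recommendation:

~~~
# Hedged sketch: write the drop-in suggested above and confirm systemd picked it up.
sudo mkdir -p /etc/systemd/system/ovsdb-server.service.d
sudo tee /etc/systemd/system/ovsdb-server.service.d/20-ovsdb-timeout.conf <<'EOF' >/dev/null
[Service]
TimeoutStartSec=20min
EOF
sudo systemctl daemon-reload
systemctl show -p TimeoutStartUSec ovsdb-server.service   # should now report 20min
~~~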
Robert,

> I've seen a few people commenting on this bug who are touching /var/run/ovs-config-executed. I don't think anyone should be doing this, as it would indicate that ovs-configuration.service ran successfully, which it clearly hasn't if you had to touch the file!
If you look at the script, all it does is clean up a bunch of things, and if those things aren't there it isn't necessary to clean them up first. So it's not necessarily wrong to touch the file if you are aware of what you're not running and understand the consequences of skipping it, but as a rule of thumb it's not a good idea; the proper workaround is to run the script or start the ovs-configuration service manually.
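A minimal sketch of that recommended workaround, assuming the flag file has not already been touched by hand:

~~~
# Start the real unit instead of faking its completion marker, then confirm it ran.
sudo systemctl start ovs-configuration.service
systemctl status ovs-configuration.service --no-pager
ls -l /var/run/ovs-config-executed    # per the comments above, created when ovs-configuration completes
~~~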
Ross,
About increasing the timeout to 20 minutes or something similarly large: I'm wary because I don't know what unintended side effects we might face.
Chris,
I believe you can reproduce this easily. Can you reproduce it every time, or only during the first boot during installation? If you can reproduce it, is there any chance you could give me access to a cluster with this issue? I need cluster-admin.
Juan,

Yes, I can reproduce this every time I do an installation. I will ping you on chat.

Hello,

The last reported version to fail is 4.6.9 and we're unable to reproduce this with 4.6.11. All affected people, can you please try to reproduce with 4.6.11? If you cannot, I'll move this to the OVS or RHCOS people so that they can figure out what in the changelog fixed it.

Thanks,
Juan

I ran a quick test with my reproducer. With 4.6.9 I am able to reproduce the issue. The core reason is low entropy on the VMs (no rdrand extension):

~~~
[stack@osp16amd ocp-test4]$ oc get node
NAME                 STATUS     ROLES    AGE     VERSION
ocp-gm466-master-0   NotReady   master   7m18s   v1.19.0+7070803
ocp-gm466-master-1   NotReady   master   7m16s   v1.19.0+7070803
ocp-gm466-master-2   NotReady   master   7m20s   v1.19.0+7070803

[core@ocp-gm466-bootstrap ~]$ cat /proc/sys/kernel/random/entropy_avail
152
[core@ocp-gm466-bootstrap ~]$ cat /proc/cpuinfo | grep rdrand | wc -l
0
[core@ocp-gm466-bootstrap ~]$ uname -a
Linux ocp-gm466-bootstrap 4.18.0-193.29.1.el8_2.x86_64 #1 SMP Thu Oct 22 10:09:53 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
~~~

The reason is again a timeout of the ovsdb-server service:

~~~
[core@ocp-gm466-master-0 ~]$ sudo journalctl -u ovsdb-server
Jan 20 09:29:47 ocp-gm466-master-0 systemd[1]: ovsdb-server.service: Failed with result 'timeout'.
~~~

I retried with 4.6.12 and here the cluster deploys. It looks like 4.6.12 includes the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1884682 (kernel 4.18.0-193.40.1.el8_2.x86_64), which improves entropy:

~~~
[stack@osp16amd ocp-test4]$ oc get node
NAME                 STATUS   ROLES    AGE   VERSION
ocp-8cfxz-master-0   Ready    master   10m   v1.19.0+9c69bdc
ocp-8cfxz-master-1   Ready    master   10m   v1.19.0+9c69bdc
ocp-8cfxz-master-2   Ready    master   10m   v1.19.0+9c69bdc

[core@ocp-8cfxz-master-2 ~]$ cat /proc/sys/kernel/random/entropy_avail
2040
[core@ocp-8cfxz-master-2 ~]$ cat /proc/cpuinfo | grep rdrand | wc -l
0
[core@ocp-8cfxz-master-2 ~]$ uname -a
Linux ocp-8cfxz-master-2 4.18.0-193.40.1.el8_2.x86_64 #1 SMP Wed Jan 6 10:54:57 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
~~~

So the entropy reason for this issue is gone; however, I still consider the missing dependency from ovs-configuration.service to the kubelet the root cause, as various reasons can trigger this timeout or start issue.

Hello Robert,

First of all, thank you for your excellent analysis. We very rarely see such a deep analysis.

Regarding:

> However it seems that the kubelet does not have a hard systemd dependency to the ovs-configuration.service, allowing the kubelet to start while OVS is not yet initialized, causing all this mess. So the external OVS stack should be more stable.

and

> So the entropy reason for this issue is gone, however I still consider the missing dependency from ovs-configuration.service to kubelet the root cause as various reasons can trigger this timeout or start issue.

That wouldn't actually solve the problem. The ovs-configuration service is a oneshot service, and if ovsdb-server fails to start, the ovs-configuration service won't start at all (in newer systemd releases we could set it to Restart=on-abnormal so that wouldn't be an issue, but we cannot do that in RHEL). So if it failed to start and we had kubelet.service requiring ovs-configuration.service, the kubelet wouldn't start at all. We could make it a hard dependency on OVS, but then the network operator would still report degraded: the OVS pod would fail to start, and my expectation is that everything else would work but the network operator would still report as degraded and the installation would still fail.
Also, not having the kubelet start is a big issue because it massively affects the debuggability of the node: oc debug node/<node> and oc adm must-gather wouldn't work, among other things. At least with the kubelet running on a broken node we can run hostNetwork: false pods.

Regarding what actually changed, that can't be it: 4.6.12 ships machine-os 46.82.202101131942-0 (Red Hat Enterprise Linux CoreOS), which ships kernel 4.18.0-193.37.1.el8_2, which includes the fix in BZ#1884682.
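For illustration only, the hard dependency discussed above (and argued against) would look roughly like the following kubelet drop-in. This is a sketch of the proposal, not a shipped or recommended change; the file name is hypothetical.

~~~
# Hypothetical drop-in expressing "kubelet requires ovs-configuration".
# As noted above, this would keep the kubelet from starting at all whenever
# ovs-configuration fails, which is why the idea was rejected.
sudo mkdir -p /etc/systemd/system/kubelet.service.d
sudo tee /etc/systemd/system/kubelet.service.d/99-require-ovs-configuration.conf <<'EOF' >/dev/null
[Unit]
Requires=ovs-configuration.service
After=ovs-configuration.service
EOF
sudo systemctl daemon-reload
~~~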