Bug 1616440 - Upgrading control plane from 3.7.44 to 3.7.54 occasionally causes ovsdb-server.service start operation to timeout
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.7.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.7.z
Assignee: Patrick Dillon
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks: 1636234
Reported: 2018-08-15 22:42 UTC by emahoney
Modified: 2021-12-10 17:01 UTC
CC: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: systemd has a default service start timeout of 90 seconds, and in some cases ovsdb-server does not start within that time. Consequence: openvswitch, which depends on ovsdb-server, cannot start and the installation fails. Fix: Increase the systemd start timeout for ovsdb-server to five minutes. Result: ovsdb-server no longer times out and openvswitch starts successfully.
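The drop-in that implements this fix, as reflected in the Ansible `ini_file` output in comment 38 below, amounts to a single systemd override file:

```
# /etc/systemd/system/ovsdb-server.service.d/timeout.conf
[Service]
TimeoutStartSec=300
```

systemd merges this drop-in with the vendor unit after a `systemctl daemon-reload`, so the base ovsdb-server.service file itself is left untouched.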
Clone Of:
: 1636234 (view as bug list)
Environment:
Last Closed: 2019-08-01 14:33:30 UTC
Target Upstream Version:
Embargoed:



Description emahoney 2018-08-15 22:42:04 UTC
Description of problem: When upgrading the control plane from 3.7.44 to 3.7.54, the ovsdb-server.service fails with the following error:

    Aug 15 12:15:07 node.example.com systemd[1]: ovsdb-server.service start operation timed out. Terminating.

I am attaching the node logs during this upgrade as well as the playbook output. 

Version-Release number of the following components:
rpm -q openshift-ansible
rpm -q ansible
ansible --version

How reproducible:
Reproducible only in this environment.

Steps to Reproduce:
1. Run the control plane upgrade from 3.7.44 to 3.7.54
2.
3.

Actual results:
Please include the entire output from the last TASK line through the end of output if an error is generated

Expected results:

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 3 emahoney 2018-08-15 22:51:01 UTC
Here is the ansible versioning from during the run:

~~~
ansible.noarch                   2.4.5.0-1.el7ae             @ocp39-2018Q4-A0.el7.x86_64
openshift-ansible.noarch         3.7.57-1.git.33.cf01e48.el7 @ocp37-2018Q4-A0.el7.x86_64
~~~

Thanks,
Evan Mahoney

Comment 4 Scott Dodson 2018-08-16 02:55:55 UTC
Can you make sure we have a sosreport or something else that would show all package versions before and after the upgrade was performed on a failing host?

Moving to networking.

Comment 5 Casey Callendrello 2018-08-16 09:35:26 UTC
It looks like the error is printed to the logs:

Aug 15 12:15:23 ... ovs-ctl[20774]: PMD: net_mlx4: cannot load glue library: libibverbs.so.1: cannot open shared object file: No such file or directory
Aug 15 12:15:23 ... ovs-ctl[20774]: PMD: net_mlx4: cannot initialize PMD due to missing run-time dependency on rdma-core libraries (libibverbs, libmlx4)


Is the customer using Infiniband?

Comment 9 Casey Callendrello 2018-08-21 17:01:45 UTC
@Weibin, do you think you could try and reproduce this?

Comment 10 Weibin Liang 2018-08-22 14:57:54 UTC
After upgrading v3.7.44 to the latest v3.7.62, "systemctl status ovsdb-server.service" shows ovsdb-server running as active without any failure on both master and node.

Comment 31 Mark Michelson 2018-09-27 16:43:28 UTC
I took a look into this and want to provide some analysis.

Based on the journalctl output for the ovsdb-server unit, I put the following timeline together. The timestamps here are relative to the start of the timeline:

00:00 systemd stops ovsdb-server. 
00:31 systemd starts ovsdb-server.
02:01 systemd reports that ovsdb-server start timed out.
02:01 systemd starts ovsdb-server again.
02:36 ovsdb-server starts successfully.

At 00:00, when systemd stops ovsdb-server, the messages in the logs indicate that the process exited gracefully. While that suggests a clean shutdown, it is worth checking a couple of pieces of data to be certain:
  * Using `ps`, see if there are any ovsdb-server processes still running.
  * Look for a .pid file in the OVS run directory. By default, I believe this will be /var/run/openvswitch. I'm not sure if OpenShift uses a different directory.

The other thing that comes to mind is that the TCP port used by the ovsdb-server may still be in use when ovsdb-server is restarted. For instance, there may be connections in TIME_WAIT or some other state. It's possible that until those connections clear up, ovsdb-server can't bind to the TCP port. It is worth using netstat (or ss) when this issue occurs to see if there are any TCP connections still in place.
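The post-stop checks above could be scripted roughly as follows. This is a sketch only: the pidfile path is the OVS default mentioned above (OpenShift may use a different run directory), and 6640 is the conventional OVSDB TCP port, which may also differ.

```shell
# Report whether an ovsdb-server pidfile is absent, stale, or still live.
# /var/run/openvswitch/ovsdb-server.pid is the assumed default path.
check_stale_pid() {
    pidfile="$1"
    [ -f "$pidfile" ] || { echo "no pidfile"; return 0; }
    pid=$(cat "$pidfile")
    if kill -0 "$pid" 2>/dev/null; then
        echo "process $pid still running"
    else
        echo "stale pidfile (pid $pid gone)"
    fi
}

check_stale_pid /var/run/openvswitch/ovsdb-server.pid

# Also look for lingering TCP state (e.g. TIME_WAIT) on the OVSDB port:
#   ss -tan | grep 6640    # adjust the port if OpenShift uses another
```

If the pidfile is stale or connections are stuck in TIME_WAIT when the timeout fires, that would point at the restart racing against cleanup of the previous instance.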

The thing that's throwing me off most right now is that the ovsdb-server.log file doesn't contain any errors. If the new process were unable to bind, I would expect to see that logged.

Either way, I'm in agreement with Scott right now that a viable workaround is to increase the timeouts on systemd since the problem appears to right itself after a while. I think it's only a workaround though, because we would like to know exactly what is causing the delay and how we can correct it if possible.

Comment 33 Scott Dodson 2018-10-02 14:20:05 UTC
Sure, looking at the logs from the original report the task that fails is 

2018-08-15 12:13:37,112 p=5022 u=ab41553 |  TASK [openshift_node_upgrade : Start openvswitch service] ******************************************************************************
2018-08-15 12:13:37,112 p=5022 u=ab41553 |  task path: /usr/share/ansible/openshift-ansible/roles/openshift_node_upgrade/tasks/restart.yml:29

Immediately prior to this we should create the drop-in directory, copy the drop-in file, and, if that task changed anything, reload systemd before we start OVS.
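A rough shell equivalent of those steps (the path and value match the `ini_file` output in comment 38; this is a hand sketch, not the actual role tasks, and a temp dir stands in for the real drop-in directory so it can run anywhere):

```shell
# On a real node DROPIN_DIR would be
# /etc/systemd/system/ovsdb-server.service.d (run as root);
# a temp dir is used here so the sketch is runnable anywhere.
DROPIN_DIR=$(mktemp -d)
mkdir -p "$DROPIN_DIR"
printf '[Service]\nTimeoutStartSec=300\n' > "$DROPIN_DIR/timeout.conf"
cat "$DROPIN_DIR/timeout.conf"

# On the node, if the file changed, follow with:
#   systemctl daemon-reload && systemctl start openvswitch
```

The `daemon-reload` is the step that matters for ordering: without it, systemd starts openvswitch (and hence ovsdb-server) under the old 90-second timeout.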

My question would be do we need to implement this for 3.11, 3.10, 3.9, 3.7 or is this something unique to 3.7?

Comment 34 Scott Dodson 2018-10-02 14:52:09 UTC
Discussed w/ Ben on IRC, this is needed in 3.7 and 3.9. 3.10 and later run ovs in a pod and this change is not applicable there.

Comment 35 Patrick Dillon 2018-10-04 20:21:11 UTC
PR: https://github.com/openshift/openshift-ansible/pull/10326

Comment 36 Scott Dodson 2018-10-08 14:11:34 UTC
PR merged.

Comment 37 Scott Dodson 2018-10-16 13:51:37 UTC
In openshift-ansible-3.7.68-1

Comment 38 Meng Bo 2018-10-18 10:06:09 UTC
Tested with openshift-ansible-3.7.68.

Did not encounter the ovsdb-server error.

The newly added steps were executed during the upgrade.

TASK [openshift_node_upgrade : Create ovsdb-server dropin config] **************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node_upgrade/tasks/openvswitch.yml:7
Using module file /usr/lib/python2.7/site-packages/ansible/modules/files/ini_file.py
changed: [REDACTED] => {
    "changed": true, 
    "diff": {
        "after": "", 
        "after_header": "/etc/systemd/system/ovsdb-server.service.d/timeout.conf (content)", 
        "before": "", 
        "before_header": "/etc/systemd/system/ovsdb-server.service.d/timeout.conf (content)"
    }, 
    "gid": 0, 
    "group": "root", 
    "invocation": {
        "module_args": {
            "attributes": null, 
            "backup": false, 
            "content": null, 
            "create": true, 
            "delimiter": null, 
            "directory_mode": null, 
            "follow": false, 
            "force": null, 
            "group": null, 
            "mode": null, 
            "no_extra_spaces": false, 
            "option": "TimeoutStartSec", 
            "owner": null, 
            "path": "/etc/systemd/system/ovsdb-server.service.d/timeout.conf", 
            "regexp": null, 
            "remote_src": null, 
            "section": "Service", 
            "selevel": null, 
            "serole": null, 
            "setype": null, 
            "seuser": null, 
            "src": null, 
            "state": "present", 
            "unsafe_writes": null, 
            "value": "300"
        }
    }, 
    "mode": "0644", 
    "msg": "section and option added", 
    "owner": "root", 
    "path": "/etc/systemd/system/ovsdb-server.service.d/timeout.conf", 
    "secontext": "unconfined_u:object_r:systemd_unit_file_t:s0", 
    "size": 33, 
    "state": "file", 
    "uid": 0
}

Comment 39 Scott Dodson 2019-08-01 14:33:30 UTC
Fixed in openshift-ansible-3.7.72-1.git.0.5c45a8a.el7 and later.

