Bug 1403945

Summary: ovsdb-server is dead after upgrade from 2.5.0-14 to 2.5.0-22
Product: Red Hat Enterprise Linux 7 Reporter: Ihar Hrachyshka <ihrachys>
Component: openvswitchAssignee: Open vSwitch development team <ovs-team>
Status: CLOSED CANTFIX QA Contact: Network QE <network-qe>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 7.3CC: aconole, atragler, dbecker, emacchi, ihrachys, mandreou, mburns, mcornea, morazi, ohochman, rhel-osp-director-maint, sathlang, sclewis
Target Milestone: rcKeywords: Triaged, ZStream
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1403080 Environment:
Last Closed: 2016-12-12 17:06:57 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1403080    
Bug Blocks:    

Comment 1 Ihar Hrachyshka 2016-12-12 16:59:33 UTC
Cloned the original bug to openvswitch component because it does not seem OSP related.

The reduced steps to reproduce are (starting with RHEL 7.3):

- yum install openvswitch-2.5.0-14.git20160727.el7fdp.x86_64
- systemctl start openvswitch
- check that both ovsdb-server and vswitchd processes are up
- yum install openvswitch-2.5.0-22.git20160727.el7fdp.x86_64

Observe that only vswitchd process is up.

Before package update:

[root@localhost ~]# ps ax|grep ovs
 3363 ?        S<s    0:00 ovsdb-server: monitoring pid 3364 (healthy)
 3364 ?        S<     0:00 ovsdb-server /etc/openvswitch/conf.db -vconsole:emer -vsyslog:err -vfile:info --remote=punix:/var/run/openvswitch/db.sock --private-key=db:Open_vSwitch,SSL,private_key --certificate=db:Open_vSwitch,SSL,certificate --bootstrap-ca-cert=db:Open_vSwitch,SSL,ca_cert --no-chdir --log-file=/var/log/openvswitch/ovsdb-server.log --pidfile=/var/run/openvswitch/ovsdb-server.pid --detach --monitor
 3373 ?        S<s    0:00 ovs-vswitchd: monitoring pid 3374 (healthy)
 3374 ?        S<Ll   0:00 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach --monitor
 3424 pts/1    S+     0:00 grep --color=auto ovs

After package update:

[root@localhost ~]# ps ax|grep ovs
 3563 ?        S<s    0:00 ovs-vswitchd: monitoring pid 3564 (healthy)
 3564 ?        S<Ll   0:00 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach --monitor
 3666 pts/1    S+     0:00 grep --color=auto ovs

In ovsdb-server.log, the only message after update is:
2016-12-12T16:44:09.528Z|00002|daemon_unix(monitor)|INFO|pid 3554 died, exit status 0, exiting

In vswitchd log:

2016-12-12T16:44:09.528Z|00036|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connection closed by peer
2016-12-12T16:44:10.528Z|00037|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2016-12-12T16:44:10.528Z|00038|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connection attempt failed (No such file or directory)
2016-12-12T16:44:10.528Z|00039|reconnect|INFO|unix:/var/run/openvswitch/db.sock: waiting 2 seconds before reconnect
2016-12-12T16:44:12.528Z|00040|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2016-12-12T16:44:12.528Z|00041|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connection attempt failed (No such file or directory)
2016-12-12T16:44:12.528Z|00042|reconnect|INFO|unix:/var/run/openvswitch/db.sock: waiting 4 seconds before reconnect
2016-12-12T16:44:16.529Z|00043|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2016-12-12T16:44:16.530Z|00044|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connection attempt failed (No such file or directory)
2016-12-12T16:44:16.530Z|00045|reconnect|INFO|unix:/var/run/openvswitch/db.sock: waiting 8 seconds before reconnect
2016-12-12T16:44:24.530Z|00046|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2016-12-12T16:44:24.530Z|00047|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connection attempt failed (No such file or directory)
2016-12-12T16:44:24.530Z|00048|reconnect|INFO|unix:/var/run/openvswitch/db.sock: waiting 8 seconds before reconnect
2016-12-12T16:44:32.530Z|00049|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2016-12-12T16:44:32.530Z|00050|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connection attempt failed (No such file or directory)
2016-12-12T16:44:32.530Z|00051|reconnect|INFO|unix:/var/run/openvswitch/db.sock: waiting 8 seconds before reconnect
...

In systemctl status openvswitch:

● openvswitch.service - Open vSwitch
   Loaded: loaded (/usr/lib/systemd/system/openvswitch.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
...
Dec 12 11:44:09 localhost.localdomain systemd[1]: Stopping Open vSwitch...
Dec 12 11:44:09 localhost.localdomain systemd[1]: Starting Open vSwitch...
Dec 12 11:44:09 localhost.localdomain systemd[1]: Started Open vSwitch.
Dec 12 11:44:09 localhost.localdomain systemd[1]: Stopping Open vSwitch...
Dec 12 11:44:09 localhost.localdomain systemd[1]: Stopped Open vSwitch.

^^ NOTE THE ORDER OF MESSAGES

Does it suggest that we stop the service after it's started? Or in the middle of it? It's not completely clear.

The state can be fixed by 'systemctl restart openvswitch' that gets ovsdb-server back.

Environment: Red Hat Enterprise Linux Server release 7.3 Beta (Maipo)

Comment 2 Aaron Conole 2016-12-12 17:01:18 UTC
Is it possible to try with -15?  If you don't have access, the RPMs are available at:

http://download-node-02.eng.bos.redhat.com/brewroot/packages/openvswitch/2.5.0/15.git20160727.el7fdb/

-15 introduced a change whereby we don't restart openvswitch on upgrade due to dhcp interface acquisition errors.

Comment 3 Ihar Hrachyshka 2016-12-12 17:02:09 UTC
Also in /var/log/messages, I see:

Dec 12 11:44:09 localhost yum[3581]: Updated: openvswitch-2.5.0-22.git20160727.el7fdp.x86_64
Dec 12 11:44:09 localhost systemd: Reloading.
Dec 12 11:44:09 localhost systemd: Stopping Open vSwitch...
Dec 12 11:44:09 localhost systemd: Starting Open vSwitch Database Unit...
Dec 12 11:44:09 localhost systemd: Starting Open vSwitch...
Dec 12 11:44:09 localhost systemd: Started Open vSwitch.
Dec 12 11:44:09 localhost ovs-ctl: ovsdb-server is already running.
Dec 12 11:44:09 localhost ovs-ctl: Enabling remote OVSDB managers [  OK  ]
Dec 12 11:44:09 localhost systemd: Stopping Open vSwitch...
Dec 12 11:44:09 localhost systemd: Stopped Open vSwitch.
Dec 12 11:44:09 localhost ovs-ctl: Exiting ovsdb-server (3554) [  OK  ]
Dec 12 11:44:09 localhost systemd: Stopped Open vSwitch Database Unit.

Comment 4 Jiri Benc 2016-12-12 17:06:57 UTC
When reading bug description, it should be clear what the problem is and what are the relevant information. After hitting "clone", the text offered by bugzilla should be taken only as a base and should be hand edited to be useful for the future bug assignee.

If you look at the bug description of this bug, it is anything but helpful. Lots of poorly formatted blobs with no information whatsoever, lots of unrelated noise, etc. Remember, several people are going to read this (devel, QE. PM, team leads, etc.) and would have to painfully go through all of this churn.

Please do not do this. I'm closing this bug, please file a comprehensive clean bug report.

Comment 5 Ihar Hrachyshka 2016-12-12 17:17:41 UTC
OK, checked with -15 (needed to use 'downgrade' instead of just 'install' since it's with an Epoch).

Not sure which case you wanted to check, so I checked several things.

1. If I upgrade -14 to -15 then to -22, then everything works fine.
2. If I upgrade -14 straight to -22, then ovsdb-server dies.

So I suspect there is something in between -15 and -22 that is unsafe with post-update restart still being present (that I believe was present in -14).

Comment 6 Ihar Hrachyshka 2016-12-12 17:24:46 UTC
Ack, reported a hopefully better one: https://bugzilla.redhat.com/show_bug.cgi?id=1403958