Description of problem:

"OVS is dead. OVSNeutronAgent will keep running and checking OVS status periodically."

After an idle period with no instances, the /var/run/openvswitch directory disappears on the controllers.

[root@controller-0 ~]# ll /var/run/openvswitch
ls: cannot access /var/run/openvswitch: No such file or directory
[root@controller-0 ~]#
[root@controller-0 ~]# ovs-vsctl show
ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)

2018-06-11 10:56:42.594 102859 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: send error: Connection refused
2018-06-11 10:56:42.595 102859 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: connection dropped (Connection refused)
2018-06-11 10:56:50.597 102859 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: send error: Connection refused
2018-06-11 10:56:50.597 102859 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: connection dropped (Connection refused)
2018-06-11 10:56:58.600 102859 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: send error: Connection refused
2018-06-11 10:56:58.601 102859 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: connection dropped (Connection refused)
2018-06-11 10:57:06.249 102859 ERROR neutron.agent.linux.async_process [-] Error received from [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json]: ovsdb-client: failed to connect to "tcp:127.0.0.1:6640" (Connection refused)
2018-06-11 10:57:06.249 102859 ERROR neutron.agent.linux.async_process [-] Process [ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json] dies due to the error: ovsdb-client: failed to connect to "tcp:127.0.0.1:6640" (Connection refused)
2018-06-11 10:57:06.603 102859 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: send error: Connection refused
2018-06-11 10:57:06.604 102859 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: connection dropped (Connection refused)
2018-06-11 10:57:07.778 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-c76421a3-0d63-4284-85f9-a7cfffa743ff - - - - -] Switch connection timeout
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int [req-c76421a3-0d63-4284-85f9-a7cfffa743ff - - - - -] Failed to communicate with the switch: RuntimeError: Switch connection timeout
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int Traceback (most recent call last):
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/br_int.py", line 52, in check_canary_table
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     flows = self.dump_flows(constants.CANARY_TABLE)
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py", line 141, in dump_flows
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     (dp, ofp, ofpp) = self._get_dp()
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ovs_bridge.py", line 67, in _get_dp
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     self._cached_dpid = new_dpid
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     self.force_reraise()
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     six.reraise(self.type_, self.value, self.tb)
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ovs_bridge.py", line 50, in _get_dp
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     dp = self._get_dp_by_dpid(self._cached_dpid)
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int   File "/usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py", line 69, in _get_dp_by_dpid
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int     raise RuntimeError(m)
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int RuntimeError: Switch connection timeout
2018-06-11 10:57:07.780 102859 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.br_int
2018-06-11 10:57:07.781 102859 WARNING neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-c76421a3-0d63-4284-85f9-a7cfffa743ff - - - - -] OVS is dead. OVSNeutronAgent will keep running and checking OVS status periodically.

Version-Release number of selected component (if applicable):
OSP 13 Puddle 2018-06-06.3

How reproducible:
Deploy an environment. Leave the environment for a day without any instances existing. Try to boot an instance and check the status of ovs and the /var/run/openvswitch directory.

SOS reports attached.
Created attachment 1450036 [details] sosreport_undercloud-0_part_1
Created attachment 1450037 [details] sosreport_undercloud-0_part_2
Created attachment 1450038 [details] sosreport_controller-0_part_1
Created attachment 1450039 [details] sosreport_controller-0_part_2
Created attachment 1450040 [details] sosreport_controller-1_part_1
Created attachment 1450041 [details] sosreport_controller-1_part_2
Created attachment 1450042 [details] sosreport_controller-2_part_1
Created attachment 1450043 [details] sosreport_controller-2_part_2
Created attachment 1450044 [details] sosreport_computeovsdpdk-0
Created attachment 1450045 [details] sosreport_computeovsdpdk-1
Sosreport logs divided into two parts should be combined as follows:

cat sosreport-undercloud-0.redhat.local-20180611080635.tar.xz* > sosreport-undercloud-0.redhat.local-20180611080635.tar.xz
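For reference, the reassembly can be exercised end to end with a throwaway archive (a sketch with placeholder filenames, not the actual sosreport names; gzip is used only so the sketch is self-contained, while the attached sosreports are xz-compressed, but the cat technique is identical):

```shell
set -e
cd "$(mktemp -d)"

# Build a small tarball, split it into parts the way the sosreports were
# split, then reassemble with cat and verify the result byte for byte.
echo "sample log line" > sos.txt
tar -czf sosreport.tar.gz sos.txt
split -b 64 sosreport.tar.gz sosreport.tar.gz.part

cat sosreport.tar.gz.part* > combined.tar.gz   # same reassembly as above
cmp sosreport.tar.gz combined.tar.gz           # byte-identical
tar -tzf combined.tar.gz                       # lists the archive contents
```

The shell glob expands the part files in lexicographic order, which is why a plain `cat` of the split pieces reproduces the original archive.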
ovs-vswitchd.log
----------------
2018-06-10T14:44:02.200Z|00379|connmgr|INFO|br-int<->tcp:127.0.0.1:6633: 11 flow_mods in the 2 s starting 57 s ago (11 deletes)
2018-06-11T03:25:01.887Z|00380|vlog|WARN|failed to open /var/log/openvswitch/ovs-vswitchd.log for logging: Permission denied

[root@controller-0 heat-admin]# ll /var/log/openvswitch/
total 48
-rw-r-----. 1 openvswitch hugetlbfs     0 Jun 12 03:26 ovsdb-server.log
-rw-r-----. 1 openvswitch hugetlbfs   484 Jun 10 11:11 ovsdb-server.log-20180611
-rw-r-----. 1 openvswitch hugetlbfs     0 Jun 11 03:25 ovsdb-server.log-20180612
-rw-r-----. 1 openvswitch hugetlbfs     0 Jun 12 03:26 ovs-vswitchd.log
-rw-r-----. 1 openvswitch hugetlbfs 42895 Jun 10 14:44 ovs-vswitchd.log-20180611
-rw-r-----. 1 openvswitch hugetlbfs     0 Jun 11 03:25 ovs-vswitchd.log-20180612
[root@controller-0 heat-admin]# ll /var/log/openvswitch/ -d
drwxr-x---. 2 root root 186 Jun 12 08:08 /var/log/openvswitch/

As the logs show, all tests passed once the deployment completed, which implies the permissions were fine during deployment. But when the environment is left overnight, the ownership of the log folder /var/log/openvswitch changes to root:root. This results in "Permission denied" errors in the ovsdb-server and ovs-vswitchd processes, and those processes fail, resulting in this error. Looking into the possibility of log management tools affecting the ownership.
The ownership of the directory /var/log/openvswitch is never changed to openvswitch:hugetlbfs; it is always root:root, as the directory is created by systemd. After the log files are rotated, the new log files cannot be opened by the ovs daemons. This results in a "Permission denied" entry in the older log file (after logrotate):
----
2018-06-11T03:25:01.887Z|00380|vlog|WARN|failed to open /var/log/openvswitch/ovs-vswitchd.log for logging: Permission denied
----
Though this permission denied error is an issue to be looked at, it does not clarify why the ovsdb-server and ovs-vswitchd services are going into the error state.
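If the directory really is created by systemd, one way to pin its ownership would be a tmpfiles.d entry along these lines (a hypothetical sketch for illustration; whether openvswitch or the deployment ships such a file is an assumption, not confirmed by this report):

```
# /etc/tmpfiles.d/openvswitch.conf (hypothetical sketch)
# type  path                   mode  user         group        age
d       /var/log/openvswitch   0750  openvswitch  openvswitch  -
```

systemd-tmpfiles would then (re)create the directory with the stated owner and group at boot instead of defaulting to root:root.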
Here are the steps with which I can reproduce it:
-------------
[root@computeovsdpdk-0 heat-admin]# logrotate -f /etc/logrotate.conf
[root@computeovsdpdk-0 heat-admin]# ll /var/log/openvswitch/
total 296
-rw-r-----. 1 openvswitch hugetlbfs      0 Jun 12 12:55 ovsdb-server.log
-rw-r-----. 1 openvswitch hugetlbfs    352 Jun 10 11:25 ovsdb-server.log-20180611.gz
-rw-r-----. 1 openvswitch hugetlbfs    168 Jun 12 12:49 ovsdb-server.log-20180612.gz
-rw-r-----. 1 openvswitch hugetlbfs      0 Jun 12 12:55 ovs-vswitchd.log
-rw-r-----. 1 openvswitch hugetlbfs 289322 Jun 10 14:43 ovs-vswitchd.log-20180611
-rw-r-----. 1 openvswitch hugetlbfs    167 Jun 12 12:49 ovs-vswitchd.log-20180612.gz
[root@computeovsdpdk-0 heat-admin]# ps aux | grep ovs
openvsw+  1465  0.0  0.0   55088   2944 ?  S<s   12:38  0:00 ovsdb-server /etc/openvswitch/conf.db -vconsole:emer -vsyslog:err -vfile:info --remote=punix:/var/run/openvswitch/db.sock --private-key=db:Open_vSwitch,SSL,private_key --certificate=db:Open_vSwitch,SSL,certificate --bootstrap-ca-cert=db:Open_vSwitch,SSL,ca_cert --user openvswitch:hugetlbfs --no-chdir --log-file=/var/log/openvswitch/ovsdb-server.log --pidfile=/var/run/openvswitch/ovsdb-server.pid --detach
openvsw+  1554   198  0.1 2904664 160420 ? S<Lsl 12:38 35:14 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --user openvswitch:hugetlbfs --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach
42435     3358  0.0  0.0   11684   1392 ?  Ss    12:39  0:00 /bin/bash /neutron_ovs_agent_launcher.sh
42435     4002  0.0  0.0   52656   2836 ?  S     12:39  0:00 ovsdb-client monitor tcp:127.0.0.1:6640 Interface name,ofport,external_ids --format=json
[root@computeovsdpdk-0 heat-admin]# docker exec -it logrotate_crond bash
()[root@01b72ee362c4 /]# /sbin/lsof -nPs +L1 +D /var/log 2>&1 | awk '/\S+\s+[0-9]+\s.*\/var\/log\/.*\(deleted\)/ {print $2}' | /bin/xargs -n1 -r -t kill -HUP
kill -HUP 1465
kill -HUP 1554
()[root@01b72ee362c4 /]#
------------------------------
To explain the behavior: since the /var/log/openvswitch directory has root:root ownership, the ovs processes are not able to open the new files after logrotate. Because of this, the lsof command lists the rotated files as "(deleted)":
---------
[root@computeovsdpdk-0 heat-admin]# /sbin/lsof -nPs +L1 +D /var/log
ovsdb-ser 1465 openvswitch 22w REG 8,2 220 0 14127046 /var/log/openvswitch/ovsdb-server.log-20180612 (deleted)
ovs-vswit 1554 openvswitch 83w REG 8,2 220 0 14127063 /var/log/openvswitch/ovs-vswitchd.log-20180612 (deleted)
--------
Now the "logrotate_crond" container, which runs periodically to kill processes that hold "(deleted)" log files open, kills the ovs processes, which results in the ovsdb-server and ovs-vswitchd service error.

There are 2 issues here:
1) After logrotate, ovs logs are not captured because of the permission denied error - Aaron, could you comment on it?
2) The logrotate_crond container is killing the ovs processes, which is unexpected even though there is an error in logging - I will continue to look into it.
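The "(deleted)" state that the crond container's lsof pipeline matches on can be reproduced in isolation on any Linux box (a minimal sketch using a shell file descriptor as a stand-in for the daemon's open log file):

```shell
set -e
cd "$(mktemp -d)"

# A process keeps writing to an open file descriptor even after the path
# is unlinked - exactly like the ovs daemons kept their rotated log files
# open after logrotate removed them.
exec 3>> daemon.log        # hold the "log" open on fd 3, like a daemon
rm daemon.log              # logrotate-style unlink of the old file
echo "still logging" >&3   # writes still succeed into the deleted inode

# /proc shows the open-but-deleted file; this is the condition the
# logrotate_crond lsof/awk/xargs pipeline keys off before sending kill:
readlink /proc/$$/fd/3     # on Linux, ends with "daemon.log (deleted)"
```

The inode (and its disk space) is only released once the last file descriptor is closed, which is why the proper fix is to have the daemon reopen its log file rather than to kill it.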
For 1): In the Open vSwitch spec file, we do set the log directory ownership:

%attr(750,openvswitch,openvswitch) /var/log/openvswitch

Which RPM version is this?
> Which RPM version is this?

From the bz description: OSP 13 Puddle 2018-06-06.3
http://download.devel.redhat.com/rcm-guest/puddles/OpenStack/13.0-RHEL-7/2018-06-06.3
which includes the cross-tagged openvswitch-2.9.0-19.el7fdp.1 build.
This is strange, since bz1583744 is marked as verified with the correct openvswitch:openvswitch owner.
I have not looked at the directory ownership right after the deployment, as the issue happened after a couple of days. If it is correct after deployment, then is there a possibility that it gets changed overnight? Re-deploying now.
Checked on another OSP13 deployment (with the latest puddle) and found that the directory ownership is still root:root.
--------------
[root@overcloud-controller-0 heat-admin]# ll /var/log/openvswitch/ -d
drwxr-x---. 2 root root 54 Jun 13 08:56 /var/log/openvswitch/
[root@overcloud-controller-0 heat-admin]# yum list openvswitch
Loaded plugins: product-id, search-disabled-repos, subscription-manager
This system is not registered with an entitlement server. You can use subscription-manager to register.
Installed Packages
openvswitch.x86_64    2.9.0-19.el7fdp.1    @rhos-13.0-signed
[root@overcloud-controller-0 heat-admin]# ll /var/log/openvswitch/
total 12
-rw-r-----. 1 openvswitch hugetlbfs  359 Jun 13 08:56 ovsdb-server.log
-rw-r-----. 1 openvswitch hugetlbfs 4747 Jun 13 09:23 ovs-vswitchd.log
---------------
Checking on the overcloud-full image (where openvswitch is pre-installed), the directory ownership is root:root. Does this happen when openvswitch is updated rather than installed?
-------
[root@undercloud01 images]# ll mount_image/var/log/openvswitch/ -d
drwxr-x---. 2 root root 6 May 30 09:59 mount_image/var/log/openvswitch/
-------
This sounds related to a bug we had to fix for /var/log/lastlog in overcloud-full images. It seems to be either a virt-sysprep interaction issue or an issue in the logrotate config: we need to make sure the process is bumped (SIGHUP?) when log files are rotated, so it starts logging to the new file instead of the old, deleted one. Fixing the permissions will give the correct ownership, but the log rotation configuration has to be clean, or this will continue to happen.
The problem isn't /var/log/lastlog directly - the virt-sysprep line is killing everything in /var/log; we then have hacks to put back lastlog. If this works outside of deployment images, then it's likely a virt-sysprep issue and I know how to fix it.
The logrotate configuration we ship with openvswitch is OK: since ovsdb-server and ovs-vswitchd run as the openvswitch user, they can create the new logs inside /var/log/openvswitch (with the correct owner:group) after the logs are rotated (/etc/logrotate.d/openvswitch).
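For context, a rotation config that tells the daemons to reopen their log files would avoid the deleted-inode state entirely. The fragment below is a sketch, not the verbatim shipped /etc/logrotate.d/openvswitch; `ovs-appctl vlog/reopen` is the OVS mechanism for reopening log files, but the exact shipped postrotate command may differ:

```
# /etc/logrotate.d/openvswitch (sketch, not the verbatim shipped file)
/var/log/openvswitch/*.log {
    daily
    compress
    missingok
    notifempty
    sharedscripts
    postrotate
        # Ask the running daemons to reopen their log files so they never
        # keep a deleted inode open after rotation.
        ovs-appctl -t ovs-vswitchd vlog/reopen 2>/dev/null || true
        ovs-appctl -t ovsdb-server vlog/reopen 2>/dev/null || true
    endscript
}
```

With this in place, nothing in /var/log would show up as "(deleted)" in lsof, so the logrotate_crond kill path would have nothing to match.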
Reproduced.

[root@newplan-controller-0 ~]# ovs-vsctl show
ovs-vsctl: unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)

After rebooting the controllers it works again.

[root@newplan-controller-1 ~]# ls -l /var/run/ | grep openvswitch
drwxr-xr-x. 2 openvswitch hugetlbfs 240 Jun 14 17:57 openvswitch
(In reply to Saravanan KR from comment #15) > To explain the behavior, since the /var/log/openvswitch directory has > root:root ownership, after logrotate, ovs processes is not able to open the > new files. Because of which lsof command lists those rotated files as > "(deleted)". > > --------- > [root@computeovsdpdk-0 heat-admin]# /sbin/lsof -nPs +L1 +D /var/log > ovsdb-ser 1465 openvswitch 22w REG 8,2 220 0 14127046 > /var/log/openvswitch/ovsdb-server.log-20180612 (deleted) > ovs-vswit 1554 openvswitch 83w REG 8,2 220 0 14127063 > /var/log/openvswitch/ovs-vswitchd.log-20180612 (deleted) > -------- > > Now the "logrotate_crond" container, which runs periodically to kill the > processes which have the "(deleted)" log files, kill the ovs processes, > which results in the ovsdb-server and ovs-vswitchd service error. > > There are 2 issue here: > 1) after logrotate, ovs logs are not capture because of the permission > denied error - Aaron, could you comment on it > 2) logrotate_crond container is killing ovs process, which is unexpected > even though error in logging - I will continue to look in to it If indeed 2) is one of the root causes for this problem, then the fix for that is being tracked at https://bugzilla.redhat.com/show_bug.cgi?id=1584946
(In reply to Michele Baldessari from comment #26)
> (In reply to Saravanan KR from comment #15)
> > 2) logrotate_crond container is killing ovs process, which is unexpected
> > even though error in logging - I will continue to look in to it
>
> If indeed 2) is one of the root causes for this problem, then the fix for
> that is being tracked at https://bugzilla.redhat.com/show_bug.cgi?id=1584946

Indeed, (2) is the cause of the ovs processes being killed, resulting in the ovs service error and the disappearance of /var/run/openvswitch. Thanks for pointing that out.

And (1) should also be fixed, because logrotate of the ovs logs will no longer collect logs after a day, and these are among the main debugging logs for an OVS-DPDK deployment.
Might be helped with https://review.openstack.org/#/c/575525 in upstream stable/queens which should send SIGHUP when logrotate is run
(In reply to Saravanan KR from comment #27)
> (In reply to Michele Baldessari from comment #26)
> > (In reply to Saravanan KR from comment #15)
> > > 2) logrotate_crond container is killing ovs process, which is unexpected
> > > even though error in logging - I will continue to look in to it
> >
> > If indeed 2) is one of the root causes for this problem, then the fix for
> > that is being tracked at https://bugzilla.redhat.com/show_bug.cgi?id=1584946
>
> Indeed, (2) is the cause of killing the ovs processes resulting in ovs
> service error and disappearing of /var/run/openvswitch. Thanks for pointing
> out.
>
> And (1) also should be fixed, because logrotate of ovs logs will no more
> collect logs after a day, which is one of the main debugging log for
> OVS-DPDK deployment.

As per https://bugzilla.redhat.com/show_bug.cgi?id=1584946#c19, it seems that the ovs service no longer stops due to log rotation once https://review.openstack.org/#/c/575525 is applied.
There's an ordering problem with the spec file. When creating images, the package is installed once; however, the group/user for ovs doesn't exist yet:

Preparing...                          ################################# [100%]
Updating / installing...
   1:openvswitch-2.9.0-19.el7fdp.1
warning: user openvswitch does not exist - using root
warning: group openvswitch does not exist - using root
warning: user openvswitch does not exist - using root
warning: group openvswitch does not exist - using root
                                      ################################# [100%]
warning: user openvswitch does not exist - using root
warning: group openvswitch does not exist - using root

[root@localhost ~]# ls -ld /var/log/openvswitch/
drwxr-x---. 2 root root 6 May 30 09:59 /var/log/openvswitch/

Removing the RPM and reinstalling it without deleting the user works:

Preparing...                          ################################# [100%]
Updating / installing...
   1:openvswitch-2.9.0-19.el7fdp.1    ################################# [100%]

[root@localhost ~]# ls -ld /var/log/openvswitch/
drwxr-x---. 2 openvswitch openvswitch 6 May 30 09:59 /var/log/openvswitch/
[root@localhost ~]#
The 19.1 spec file has the useradd/groupadd in %post, as Jon noticed, which is the reason for the ordering problem.
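The ordering issue can be illustrated with a spec fragment: user/group creation has to happen in %pre, before RPM lays down the %files payload, so the %attr ownership can resolve. This is a sketch of the general RPM packaging pattern, not the actual openvswitch spec:

```
# Sketch: create the user/group in %pre so that the %attr ownership on
# /var/log/openvswitch can be applied at file-install time. Putting the
# useradd/groupadd in %post runs it *after* the files are installed,
# which is exactly the "user openvswitch does not exist - using root"
# warning seen above.
%pre
getent group openvswitch >/dev/null || groupadd -r openvswitch
getent passwd openvswitch >/dev/null || \
    useradd -r -g openvswitch -d / -s /sbin/nologin \
        -c "Open vSwitch daemon user" openvswitch
exit 0

%files
%attr(750,openvswitch,openvswitch) %dir /var/log/openvswitch
```

This matches the observed behavior: reinstalling the RPM while the user already exists produces the correct openvswitch:openvswitch ownership.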
Workaround (chown openvswitch.openvswitch /var/log/openvswitch) added to image build step.
[root@computeovsdpdk-0 ~]# ls -ltrd /var/log/openvswitch
drwxr-x---. 2 openvswitch openvswitch 54 Jun 21 10:05 /var/log/openvswitch
*** Bug 1593770 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2083