Bug 1316283

Summary: If there is a /var/lib/neutron/ha_confs/<router-id>.pid then l3 agent fails to spawn a keepalived process for that router
Product: Red Hat OpenStack Reporter: Jeremy <jmelvin>
Component: openstack-neutron Assignee: Assaf Muller <amuller>
Status: CLOSED ERRATA QA Contact: Alexander Stafeyev <astafeye>
Severity: medium Docs Contact:
Priority: medium    
Version: 7.0 (Kilo) CC: amuller, bperkins, chrisw, jmelvin, jruzicka, majopela, nchandek, nyechiel, srevivo, tfreger
Target Milestone: async Keywords: Reopened, ZStream
Target Release: 7.0 (Kilo)   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: openstack-neutron-2015.1.2-13.el7ost Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1545481 Environment:
Last Closed: 2018-02-14 15:39:57 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1545481    

Description Jeremy 2016-03-09 21:31:35 UTC
Description of problem:
The problem we are seeing is that if the .pid file for the previous keepalived process (located in /var/lib/neutron/ha_confs/<router_id>.pid) already exists, then the L3 agent fails to spawn a keepalived process for that router.

More precisely: if the process ID recorded in the pid file is already in use by another process, then the keepalived process does not get spawned.  For example, the <router_id>.pid file contains process ID 4315, and that pid has already been taken by /usr/bin/python /usr/bin/keystone-all.  So the keepalived process never gets created for this router.


 grep "respawning keepalived for uuid a14e9f90-b5ee-420b-b89b-76c18b0a3a0a" /var/log/neutron/l3-agent.log

We find that when we reboot a neutron node without first deleting the entire contents of /var/lib/neutron/ha_confs/, the 'old' pids contained in <router_id>.pid and <router_id>.pid-vrrp get checked by the L3 agent, and if those pids already exist then the L3 agent does not spawn a keepalived for that router.  The L3 agent should check whether the process IDs in those files belong to keepalived processes and, if they do not, go ahead and create one and rewrite the process IDs in the .pid and .pid-vrrp files.
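As a rough illustration of the check requested above, here is a minimal Python sketch. It is not the neutron implementation: the function names (read_cmdline, is_keepalived_for_router, find_stale_pid_files) and the proc_root parameter are hypothetical, and it assumes that both the keepalived parent and its vrrp child carry the router_id somewhere in their command line.

```python
import os


def read_cmdline(pid, proc_root="/proc"):
    """Return the command line of `pid` as one string, or "" if it is gone."""
    try:
        with open(os.path.join(proc_root, str(pid), "cmdline"), "rb") as f:
            # Arguments in /proc/<pid>/cmdline are separated by NUL bytes.
            return f.read().replace(b"\0", b" ").decode(errors="replace").strip()
    except (IOError, OSError):
        return ""


def is_keepalived_for_router(pid, router_id, proc_root="/proc"):
    """True only if `pid` looks like a keepalived started for `router_id`."""
    cmdline = read_cmdline(pid, proc_root)
    return "keepalived" in cmdline and router_id in cmdline


def find_stale_pid_files(conf_dir, proc_root="/proc"):
    """List pid files whose recorded pid is not that router's keepalived."""
    stale = []
    for name in sorted(os.listdir(conf_dir)):
        if not (name.endswith(".pid") or name.endswith(".pid-vrrp")):
            continue
        # "<router_id>.pid" and "<router_id>.pid-vrrp" share the same prefix.
        router_id = name.split(".pid")[0]
        path = os.path.join(conf_dir, name)
        try:
            with open(path) as f:
                pid = int(f.read().strip())
        except (IOError, OSError, ValueError):
            stale.append(path)  # unreadable or malformed pid file
            continue
        if not is_keepalived_for_router(pid, router_id, proc_root):
            stale.append(path)
    return stale
```

On a network node this could be called as find_stale_pid_files("/var/lib/neutron/ha_confs") to list the pid files that point at a dead or unrelated process, such as the keystone-all case described above.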


Version-Release number of selected component (if applicable):
openstack-neutron-2015.1.2-9.el7ost.noarch 

How reproducible:


Steps to Reproduce:
1).  Pick a router on which you want to reproduce this issue; record the router_id
2).  Kill the two processes denoted in these two files:
  /var/lib/neutron/ha_confs/<router_id>.pid
  /var/lib/neutron/ha_confs/<router_id>.pid-vrrp
3).  Make sure that no keepalived process comes back for this router:
  ps --ppid 1 -f|grep -- keepalived.-P| grep <router_id>
4).  Now pick an existing process ID - anything that's really running - and edit the <router_id>.pid file to contain that process ID.  For example, I had a keystone-all process running as pid 4340 and I put that into my <router_id>.pid and <router_id>.pid-vrrp files.
5).  Wait a few minutes and you will find that your router_id is now showing up with this error message and that there is no keepalived for it:
Actual results:


Expected results:


Additional info:

Comment 2 Assaf Muller 2016-03-09 21:42:46 UTC
Assigned to Miguel for triaging. Miguel, could this be a bug with ProcessManager? It's odd: looking at the 'active' method implementation in Kilo code, it checks (as it does on master) that the resource's UUID is present in the cmdline of the pid in question, and the UUID for keepalived_manager is the router_id, so the scenario in this RHBZ "should" not be happening.

Comment 3 Miguel Angel Ajo 2016-03-10 10:57:33 UTC
It could be. I would debug around here [1]. As you said:

This bug should have been prevented by that logic: we look for the UUID of the process in its command line (/proc/<pid>/cmdline), and otherwise 'active' returns False, triggering the respawn in the ProcessMonitor.
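To make the intended behaviour concrete, here is a small Python sketch of the check as described in this comment. It is not the code at [1]: TrackedProcess, check_once and the proc_root parameter are hypothetical names used only for illustration.

```python
import os


class TrackedProcess(object):
    """Hypothetical stand-in for a pid-file-tracked process; not neutron code."""

    def __init__(self, uuid, pid_file, proc_root="/proc"):
        self.uuid = uuid
        self.pid_file = pid_file
        self.proc_root = proc_root

    @property
    def pid(self):
        """Pid recorded in the pid file, or None if missing or malformed."""
        try:
            with open(self.pid_file) as f:
                return int(f.read().strip())
        except (IOError, OSError, ValueError):
            return None

    @property
    def active(self):
        """True only if the recorded pid's cmdline mentions our uuid."""
        pid = self.pid
        if pid is None:
            return False
        cmdline_path = os.path.join(self.proc_root, str(pid), "cmdline")
        try:
            with open(cmdline_path, "rb") as f:
                return self.uuid.encode() in f.read()
        except (IOError, OSError):
            return False  # no process with that pid


def check_once(process, respawn):
    """One monitor pass: call `respawn(process)` if the process is not active."""
    if process.active:
        return False
    respawn(process)
    return True
```

With a check like this, a pid file pointing at keystone-all is treated as inactive, because the router UUID does not appear in that process's command line, so the monitor pass would trigger a respawn instead of skipping the router.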


Please note that the logic for spawning keepalived and tracking the vrrp child is in [2]

I'm sending the bug to @hmlnarik as suggested on ping.





[1] https://github.com/openstack/neutron/blob/2768da320d7fb1630f2ffa32ec6485b279ba37e8/neutron/agent/linux/external_process.py#L134

[2] https://github.com/openstack/neutron/blob/stable/kilo/neutron/agent/linux/keepalived.py#L332

Comment 9 Assaf Muller 2016-03-28 22:28:25 UTC
Bad bot! Bad! Go away.

Comment 10 Mike McCune 2016-03-28 22:41:35 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation. Please see mmccune with any questions.

Comment 17 Jeremy 2016-04-12 13:09:19 UTC
SandboxB:[root@ttbossbxbmgmt0001 neutron1596344-keepalive]# yum localupdate openstack-neutron-2015.1.2-13.el7ost.noarch.rpm openstack-neutron-common-2015.1.2-13.el7ost.noarch.rpm openstack-neutron-ml2-2015.1.2-13.el7ost.noarch.rpm python-neutron-2015.1.2-13.el7ost.noarch.rpm
Loaded plugins: langpacks, priorities, product-id, rhnplugin, search-disabled-repos, subscription-manager
This system is not registered with RHN Classic or Red Hat Satellite.
You can use rhn_register to register.
Red Hat Satellite or RHN Classic support will be disabled.
Examining openstack-neutron-2015.1.2-13.el7ost.noarch.rpm: openstack-neutron-2015.1.2-13.el7ost.noarch
Marking openstack-neutron-2015.1.2-13.el7ost.noarch.rpm as an update to openstack-neutron-2015.1.2-11.el7ost.noarch
Examining openstack-neutron-common-2015.1.2-13.el7ost.noarch.rpm: openstack-neutron-common-2015.1.2-13.el7ost.noarch
Marking openstack-neutron-common-2015.1.2-13.el7ost.noarch.rpm as an update to openstack-neutron-common-2015.1.2-11.el7ost.noarch
Examining openstack-neutron-ml2-2015.1.2-13.el7ost.noarch.rpm: openstack-neutron-ml2-2015.1.2-13.el7ost.noarch
Marking openstack-neutron-ml2-2015.1.2-13.el7ost.noarch.rpm as an update to openstack-neutron-ml2-2015.1.2-11.el7ost.noarch
Examining python-neutron-2015.1.2-13.el7ost.noarch.rpm: python-neutron-2015.1.2-13.el7ost.noarch
Marking python-neutron-2015.1.2-13.el7ost.noarch.rpm as an update to python-neutron-2015.1.2-11.el7ost.noarch
Resolving Dependencies
--> Running transaction check
---> Package openstack-neutron.noarch 0:2015.1.2-11.el7ost will be updated
---> Package openstack-neutron.noarch 0:2015.1.2-13.el7ost will be an update
---> Package openstack-neutron-common.noarch 0:2015.1.2-11.el7ost will be updated
--> Processing Dependency: openstack-neutron-common = 2015.1.2-11.el7ost for package: openstack-neutron-openvswitch-2015.1.2-11.el7ost.noarch
datadog                                                                                                    |  951 B  00:00:00
enterprise-cloud_ECE_Custom_Packages_ECE_Custom_Packages                                                   | 2.1 kB  00:00:00
rhel-7-server-openstack-7.0-director-rpms                                                                  | 2.1 kB  00:00:00
rhel-7-server-openstack-7.0-optools-rpms                                                                   | 2.1 kB  00:00:00
rhel-7-server-openstack-7.0-rpms                                                                           | 2.1 kB  00:00:00
rhel-7-server-rh-common-rpms                                                                               | 2.1 kB  00:00:00
rhel-7-server-rhceph-1.3-tools-rpms                                                                        | 2.1 kB  00:00:00
rhel-7-server-rpms                                                                                         | 2.1 kB  00:00:00
rhel-7-server-satellite-tools-6.1-rpms                                                                     | 2.1 kB  00:00:00
rhel-ha-for-rhel-7-server-rpms                                                                             | 2.1 kB  00:00:00
treasure-data                                                                                              | 2.9 kB  00:00:00
---> Package openstack-neutron-common.noarch 0:2015.1.2-13.el7ost will be an update
---> Package openstack-neutron-ml2.noarch 0:2015.1.2-11.el7ost will be updated
---> Package openstack-neutron-ml2.noarch 0:2015.1.2-13.el7ost will be an update
---> Package python-neutron.noarch 0:2015.1.2-11.el7ost will be updated
---> Package python-neutron.noarch 0:2015.1.2-13.el7ost will be an update
--> Finished Dependency Resolution
Error: Package: openstack-neutron-openvswitch-2015.1.2-11.el7ost.noarch (@/openstack-neutron-openvswitch-2015.1.2-11.el7ost.noarch)
           Requires: openstack-neutron-common = 2015.1.2-11.el7ost
           Removing: openstack-neutron-common-2015.1.2-11.el7ost.noarch (@/openstack-neutron-common-2015.1.2-11.el7ost.noarch)
               openstack-neutron-common = 2015.1.2-11.el7ost
           Updated By: openstack-neutron-common-2015.1.2-13.el7ost.noarch (/openstack-neutron-common-2015.1.2-13.el7ost.noarch)
               openstack-neutron-common = 2015.1.2-13.el7ost
           Available: openstack-neutron-common-2015.1.0-12.el7ost.noarch (rhel-7-server-openstack-7.0-rpms)
               openstack-neutron-common = 2015.1.0-12.el7ost
           Available: openstack-neutron-common-2015.1.0-16.el7ost.noarch (rhel-7-server-openstack-7.0-rpms)
               openstack-neutron-common = 2015.1.0-16.el7ost
           Available: openstack-neutron-common-2015.1.1-6.el7ost.noarch (rhel-7-server-openstack-7.0-rpms)
               openstack-neutron-common = 2015.1.1-6.el7ost
           Available: openstack-neutron-common-2015.1.1-7.el7ost.noarch (rhel-7-server-openstack-7.0-rpms)
               openstack-neutron-common = 2015.1.1-7.el7ost
           Available: openstack-neutron-common-2015.1.2-3.el7ost.noarch (rhel-7-server-openstack-7.0-rpms)
               openstack-neutron-common = 2015.1.2-3.el7ost
           Available: openstack-neutron-common-2015.1.2-6.el7ost.noarch (rhel-7-server-openstack-7.0-rpms)
               openstack-neutron-common = 2015.1.2-6.el7ost
           Available: openstack-neutron-common-2015.1.2-7.el7ost.noarch (rhel-7-server-openstack-7.0-rpms)
               openstack-neutron-common = 2015.1.2-7.el7ost
           Available: openstack-neutron-common-2015.1.2-9.el7ost.noarch (rhel-7-server-openstack-7.0-rpms)
               openstack-neutron-common = 2015.1.2-9.el7ost
**********************************************************************
yum can be configured to try to resolve such errors by temporarily enabling
disabled repos and searching for missing dependencies.
To enable this functionality please set 'notify_only=0' in /etc/yum/pluginconf.d/search-disabled-repos.conf
**********************************************************************

Error: Package: openstack-neutron-openvswitch-2015.1.2-11.el7ost.noarch (@/openstack-neutron-openvswitch-2015.1.2-11.el7ost.noarch)
           Requires: openstack-neutron-common = 2015.1.2-11.el7ost
           Removing: openstack-neutron-common-2015.1.2-11.el7ost.noarch (@/openstack-neutron-common-2015.1.2-11.el7ost.noarch)
               openstack-neutron-common = 2015.1.2-11.el7ost
           Updated By: openstack-neutron-common-2015.1.2-13.el7ost.noarch (/openstack-neutron-common-2015.1.2-13.el7ost.noarch)
               openstack-neutron-common = 2015.1.2-13.el7ost
           Available: openstack-neutron-common-2015.1.0-12.el7ost.noarch (rhel-7-server-openstack-7.0-rpms)
               openstack-neutron-common = 2015.1.0-12.el7ost
           Available: openstack-neutron-common-2015.1.0-16.el7ost.noarch (rhel-7-server-openstack-7.0-rpms)
               openstack-neutron-common = 2015.1.0-16.el7ost
           Available: openstack-neutron-common-2015.1.1-6.el7ost.noarch (rhel-7-server-openstack-7.0-rpms)
               openstack-neutron-common = 2015.1.1-6.el7ost
           Available: openstack-neutron-common-2015.1.1-7.el7ost.noarch (rhel-7-server-openstack-7.0-rpms)
               openstack-neutron-common = 2015.1.1-7.el7ost
           Available: openstack-neutron-common-2015.1.2-3.el7ost.noarch (rhel-7-server-openstack-7.0-rpms)
               openstack-neutron-common = 2015.1.2-3.el7ost
           Available: openstack-neutron-common-2015.1.2-6.el7ost.noarch (rhel-7-server-openstack-7.0-rpms)
               openstack-neutron-common = 2015.1.2-6.el7ost
           Available: openstack-neutron-common-2015.1.2-7.el7ost.noarch (rhel-7-server-openstack-7.0-rpms)
               openstack-neutron-common = 2015.1.2-7.el7ost
           Available: openstack-neutron-common-2015.1.2-9.el7ost.noarch (rhel-7-server-openstack-7.0-rpms)
               openstack-neutron-common = 2015.1.2-9.el7ost
 You could try using --skip-broken to work around the problem
 You could try running: rpm -Va --nofiles --nodigest
SandboxB:[root@ttbossbxbmgmt0001 neutron1596344-keepalive]#

SandboxB:[root@ttbossbxbmgmt0001 neutron1596344-keepalive]# yum repolist
Loaded plugins: langpacks, priorities, product-id, rhnplugin, search-disabled-repos, subscription-manager
This system is not registered with RHN Classic or Red Hat Satellite.
You can use rhn_register to register.
Red Hat Satellite or RHN Classic support will be disabled.
repo id                                                   repo name                                                       status
datadog                                                   Datadog, Inc.                                                         53
!enterprise-cloud_ECE_Custom_Packages_ECE_Custom_Packages ECE Custom Packages                                                    5
!rhel-7-server-openstack-7.0-director-rpms/7Server/x86_64 Red Hat Enterprise Linux OpenStack Platform 7.0 director for RH       78
!rhel-7-server-openstack-7.0-optools-rpms/7Server/x86_64  Red Hat Enterprise Linux OpenStack Platform 7.0 Operational Too       91
!rhel-7-server-openstack-7.0-rpms/7Server/x86_64          Red Hat Enterprise Linux OpenStack Platform 7.0 for RHEL 7 (RPM    1,094
!rhel-7-server-rh-common-rpms/7Server/x86_64              Red Hat Enterprise Linux 7 Server - RH Common (RPMs)                 168
!rhel-7-server-rhceph-1.3-tools-rpms/7Server/x86_64       Red Hat Ceph Storage Tools 1.3 for Red Hat Enterprise Linux 7 S       71
!rhel-7-server-rpms/7Server/x86_64                        Red Hat Enterprise Linux 7 Server (RPMs)                        20,259+2
!rhel-7-server-satellite-tools-6.1-rpms/x86_64            Red Hat Satellite Tools 6.1 (for RHEL 7 Server) (RPMs)                83
!rhel-ha-for-rhel-7-server-rpms/7Server/x86_64            Red Hat Enterprise Linux High Availability (for RHEL 7 Server)       190
treasure-data/7Server/x86_64                              Ye Ole Rpm Repo                                                        9
repolist: 22,101
SandboxB:[root@ttbossbxbmgmt0001 neutron1596344-keepalive]#

Comment 20 Alexander Stafeyev 2016-04-21 13:33:17 UTC
Verified. 

[root@overcloud-controller-0 ~]# rpm -qa | grep openstack-neutron-2015
openstack-neutron-2015.1.2-13.el7ost.noarch


saw the following in l3-agent.log: 

2016-04-21 13:15:44.778 22415 ERROR neutron.agent.linux.external_process [-] default-service for router with uuid cf2abf13-4ad2-4f0e-9f5e-9bc8539d990d not found. The process should not have died
2016-04-21 13:15:44.779 22415 ERROR neutron.agent.linux.external_process [-] respawning keepalived for uuid cf2abf13-4ad2-4f0e-9f5e-9bc8539d990d
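
For anyone checking several routers at once, the respawn lines can also be pulled out of l3-agent.log with a short script. This is only a sketch: the regular expression is inferred from the two log lines above, and RESPAWN_RE and respawn_events are hypothetical names.

```python
import re

# Pattern inferred from the log lines quoted in this comment; other
# neutron versions may format these messages differently.
RESPAWN_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) .*"
    r"respawning (?P<service>\S+) for uuid (?P<uuid>[0-9a-f-]{36})"
)


def respawn_events(lines):
    """Return (timestamp, service, uuid) for every respawn line in `lines`."""
    events = []
    for line in lines:
        match = RESPAWN_RE.search(line)
        if match:
            events.append(
                (match.group("ts"), match.group("service"), match.group("uuid"))
            )
    return events
```

It can be fed an open file, for example respawn_events(open("/var/log/neutron/l3-agent.log")), and returns one tuple per respawn message.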


________________________________________________

The pid was changed. 

The manually set pid (the keystone pid):

[root@overcloud-controller-0 ~]# cat /var/lib/neutron/ha_confs/cf2abf13-4ad2-4f0e-9f5e-9bc8539d990d.pid
8708

After the log showed the error the pid changed: 

[root@overcloud-controller-0 ~]# cat /var/lib/neutron/ha_confs/cf2abf13-4ad2-4f0e-9f5e-9bc8539d990d.pid
26096

Comment 22 errata-xmlrpc 2016-05-12 16:02:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1062.html