Bug 1444883 - Neutron agent is dead after undercloud upgrade from Newton to Ocata
Summary: Neutron agent is dead after undercloud upgrade from Newton to Ocata
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 11.0 (Ocata)
Assignee: Sofer Athlan-Guyot
QA Contact: Raviv Bar-Tal
URL:
Whiteboard:
Depends On:
Blocks: 1445355
 
Reported: 2017-04-24 13:26 UTC by Raviv Bar-Tal
Modified: 2017-06-12 12:34 UTC (History)
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1445355 (view as bug list)
Environment:
Last Closed: 2017-06-12 12:34:10 UTC
Target Upstream Version:
Embargoed:



Description Raviv Bar-Tal 2017-04-24 13:26:52 UTC
Description of problem:
After performing the manual upgrade procedure, the Neutron agent is dead on the undercloud.

Version-Release number of selected component (if applicable):


How reproducible:
Manually run the undercloud upgrade procedure,
then check the Neutron agent status with neutron agent-list.

Steps to Reproduce:
yum repolist -v enabled
sudo systemctl list-units 'openstack-*'

# get the latest rhos-release and set up the OSP 11 repos
sudo yum localinstall -y http://download.lab.bos.redhat.com/rcm-guest/puddles/OpenStack/rhos-release/rhos-release-latest.noarch.rpm
sudo rhos-release 11

# disable OSP10 repos
sudo yum-config-manager --disable 'rhelosp-10.0*'
yum repolist -v enabled

# stop services as per [bz-1372040](https://bugzilla.redhat.com/show_bug.cgi?id=1372040#c6)
# this will be in the OSP10 upgrade docs.
sudo systemctl stop 'openstack-*'
sudo systemctl stop 'neutron-*'
sudo systemctl stop httpd

# if you are going to do a backwards-compatibility install, save the old tht dir
# cp -r /usr/share/openstack-tripleo-heat-templates ~/tht

# update instack-undercloud and friends before running the upgrade
sudo yum -y update instack-undercloud openstack-puppet-modules openstack-tripleo-common python-tripleoclient

# UPGRADE 
openstack undercloud upgrade
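
# (a quick post-upgrade check sketch; sourcing ~/stackrc is an assumption about the
#  environment, the other commands are the ones used later in this report)
source ~/stackrc
neutron agent-list                                # Open vSwitch agent should show ':-)' under 'alive'
sudo systemctl status neutron-openvswitch-agent   # should be active (running)
ps -f --ppid 1 | grep neutron-rootwrap-daemon     # should print nothing (no orphaned rootwrap daemon)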

Actual results:
Neutron agent is dead 

Expected results:

Neutron agent is alive
Additional info:

Comment 1 Sofer Athlan-Guyot 2017-04-24 14:18:48 UTC
This is a follow-up of https://bugzilla.redhat.com/show_bug.cgi?id=1436729 .  We created a new one to avoid confusion with the mixed-version issue there, as it's unrelated.

Comment 3 Sofer Athlan-Guyot 2017-04-24 16:22:09 UTC
So, looking at an env that fails to reap all the neutron rootwrap daemons, we have all the post scriptlets in place:


rpm -q --scripts openstack-neutron-openvswitch:


postinstall scriptlet (using /bin/sh):

if [ $1 -eq 1 ] ; then 
        # Initial installation 
        systemctl preset neutron-openvswitch-agent.service >/dev/null 2>&1 || : 
fi 

oldconf=/etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini
newconf=/etc/neutron/plugins/ml2/openvswitch_agent.ini
if [ $1 -gt 1 ]; then
    if [ -e $oldconf ]; then
        # Imitate noreplace
        cp $newconf ${newconf}.rpmnew
        cp $oldconf $newconf
    fi
fi

if [ $1 -ge 2 ]; then
    # We're upgrading

    # Detect if the neutron-openvswitch-agent is running
    ovs_agent_running=0
    systemctl status neutron-openvswitch-agent > /dev/null 2>&1 && ovs_agent_running=1 || :

    # If agent is running, stop it
    [ $ovs_agent_running -eq 1 ] && systemctl stop neutron-openvswitch-agent > /dev/null 2>&1 || :

    # Search all orphaned neutron-rootwrap-daemon processes and since all are triggered by sudo,
    # get the actual rootwrap-daemon process.
    
for pid in $(ps -f --ppid 1 | awk '/.*neutron-rootwrap-daemon/ { print $2 }'); do 
   kill $(ps --ppid $pid -o pid=) 
done 


    # If agent was running, start it back with new code
    [ $ovs_agent_running -eq 1 ] && systemctl start neutron-openvswitch-agent > /dev/null 2>&1 || :
fi
preuninstall scriptlet (using /bin/sh):

if [ $1 -eq 0 ] ; then 
        # Package removal, not upgrade 
        systemctl --no-reload disable neutron-openvswitch-agent.service > /dev/null 2>&1 || : 
        systemctl stop neutron-openvswitch-agent.service > /dev/null 2>&1 || : 
fi


So the scriptlet looks OK (in RPM scriptlets, $1 is the number of package instances installed after the transaction, so $1 -ge 2 in %post means an upgrade is in progress).

Running the command on the platform:

[root@undercloud-0 ~]# ps -f --ppid 1 | awk '/.*neutron-rootwrap-daemon/ { print $2 }'
12453


root     12453  0.0  0.0 193392  2820 ?        S    08:34   0:00 sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf

So we indeed have a rogue rootwrap daemon.
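
For reference, the cleanup the %post scriptlet is supposed to do can be replayed by hand, run as root (a minimal sketch based on the scriptlet above; the PID will differ per environment):

# find rootwrap daemons that were reparented to init (PID 1)
for pid in $(ps -f --ppid 1 | awk '/.*neutron-rootwrap-daemon/ { print $2 }'); do
    kill $(ps --ppid $pid -o pid=)   # the sudo wrapper's child is the actual rootwrap daemon
done
systemctl restart neutron-openvswitch-agent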

The ovs-vswitchd is not working properly:


[root@undercloud-0 ~]# systemctl status ovs-vswitchd
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: active (running) since Mon 2017-04-24 08:34:54 EDT; 3h 27min ago
 Main PID: 12332 (ovs-vswitchd)
   CGroup: /system.slice/ovs-vswitchd.service
           └─12332 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach

Apr 24 08:34:54 undercloud-0.redhat.local systemd[1]: Starting Open vSwitch Forwarding Unit...
Apr 24 08:34:54 undercloud-0.redhat.local ovs-ctl[12308]: Starting ovs-vswitchd [  OK  ]
Apr 24 08:34:54 undercloud-0.redhat.local ovs-ctl[12308]: Enabling remote OVSDB managers [  OK  ]
Apr 24 08:34:54 undercloud-0.redhat.local systemd[1]: Started Open vSwitch Forwarding Unit.
Apr 24 08:46:56 undercloud-0.redhat.local ovs-vswitchd[12332]: ovs|00052|rconn|ERR|br-int<->tcp:127.0.0.1:6633: no response to inactivity probe after 5 seconds, disconnecting
Apr 24 08:46:57 undercloud-0.redhat.local ovs-vswitchd[12332]: ovs|00054|rconn|ERR|br-ctlplane<->tcp:127.0.0.1:6633: no response to inactivity probe after 5 seconds, disconnecting


and neutron agent-list is not happy:

+--------------------------------------+--------------------+---------------------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                      | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+---------------------------+-------------------+-------+----------------+---------------------------+
| a036287f-4031-44a8-8454-b2bba1620bc6 | Open vSwitch agent | undercloud-0.redhat.local |                   | xxx   | True           | neutron-openvswitch-agent |
| f41a2dff-be90-4fe3-9038-82caf6e7c025 | DHCP agent         | undercloud-0.redhat.local | nova              | :-)   | True           | neutron-dhcp-agent        |
+--------------------------------------+--------------------+---------------------------+-------------------+-------+----------------+---------------------------+
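
The failing OpenFlow connection can also be checked directly on the OVS side (a sketch; these commands were not run in the report):

sudo ovs-vsctl show                                              # lists bridges and their controllers
sudo ovs-vsctl list controller | grep -E 'target|is_connected'   # controller target and connection state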


In the env we have:

[root@undercloud-0 ~]# grep neutron /var/log/yum.log 
Apr 24 06:52:49 Installed: python-neutronclient-6.0.0-2.el7ost.noarch
Apr 24 06:52:54 Installed: puppet-neutron-9.5.0-2.el7ost.noarch
Apr 24 06:55:39 Installed: python-neutron-lib-0.4.0-1.el7ost.noarch
Apr 24 06:55:41 Installed: 1:python-neutron-9.2.0-8.el7ost.noarch
Apr 24 06:55:41 Installed: 1:openstack-neutron-common-9.2.0-8.el7ost.noarch
Apr 24 06:55:42 Installed: 1:openstack-neutron-9.2.0-8.el7ost.noarch
Apr 24 06:59:31 Installed: 1:openstack-neutron-ml2-9.2.0-8.el7ost.noarch
Apr 24 06:59:42 Installed: 1:openstack-neutron-openvswitch-9.2.0-8.el7ost.noarch
Apr 24 08:30:39 Updated: python-neutronclient-6.1.0-1.el7ost.noarch
Apr 24 08:31:07 Updated: python-neutron-lib-1.1.0-1.el7ost.noarch
Apr 24 08:31:08 Updated: 1:python-neutron-10.0.1-1.el7ost.noarch
Apr 24 08:31:09 Updated: 1:openstack-neutron-common-10.0.1-1.el7ost.noarch
Apr 24 08:31:09 Updated: 1:openstack-neutron-openvswitch-10.0.1-1.el7ost.noarch
Apr 24 08:31:09 Updated: 1:openstack-neutron-10.0.1-1.el7ost.noarch
Apr 24 08:31:09 Updated: 1:openstack-neutron-ml2-10.0.1-1.el7ost.noarch
Apr 24 08:32:04 Updated: puppet-neutron-10.3.0-2.el7ost.noarch


I had an IRC discussion with Jakub about getting more information on the process states before and during the upgrade.

Jakub was planning to provide a neutron-openvswitch-agent package with the post scriptlet modified to run some ps -ef at the right time.

Asking Jakub if I got that right :)
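
For illustration only, the kind of extra debug output being discussed could look roughly like this inside the %post scriptlet (a hypothetical sketch with an assumed log path, not the actual scratch build):

echo "=== rootwrap processes before cleanup ===" >> /tmp/neutron-ovs-post.log
ps -ef | grep '[n]eutron-rootwrap-daemon' >> /tmp/neutron-ovs-post.log
# ... existing 'for pid in ...' kill loop from the scriptlet ...
echo "=== rootwrap processes after cleanup ===" >> /tmp/neutron-ovs-post.log
ps -ef | grep '[n]eutron-rootwrap-daemon' >> /tmp/neutron-ovs-post.log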

Comment 4 Jakub Libosvar 2017-04-25 09:44:18 UTC
(In reply to Sofer Athlan-Guyot from comment #3)
...
> 
> Asking Jakub if I got that right :)

That's correct, the scratch-build packages are at https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=13076113

Comment 5 Sofer Athlan-Guyot 2017-04-25 10:30:50 UTC
Thanks Jakub, I think that's going to help us get to the bottom of it.  My idea here is that maybe the init process had not yet adopted the orphaned process at the time the post scriptlet ran, making this line:

   ps -f --ppid 1 | awk '/.*neutron-rootwrap-daemon/ { print $2 }'

return empty.
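
If that is what happens, a check that does not rely on the process having been reparented to PID 1 would still find it (a sketch for illustration, not what the package does today):

pgrep -af neutron-rootwrap-daemon                       # any rootwrap daemon, with its full command line
ps -eo pid,ppid,cmd | grep '[n]eutron-rootwrap-daemon'  # shows the current parent as well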

Comment 7 Jakub Libosvar 2017-04-25 10:42:26 UTC
(In reply to Sofer Athlan-Guyot from comment #5)
> Thanks Jakub, I think that's going to help us get to the bottom of it.  My
> idea here is that maybe the init process had not yet adopted the orphaned
> process at the time the post scriptlet ran, making this line:
> 
>    ps -f --ppid 1 | awk '/.*neutron-rootwrap-daemon/ { print $2 }'
> 
> return empty.

You told me that it reproduces even if you run 'yum update' 10 minutes after the agent was stopped.

Comment 8 Sofer Athlan-Guyot 2017-04-25 13:02:00 UTC
(In reply to Jakub Libosvar from comment #7)
> (In reply to Sofer Athlan-Guyot from comment #5)
> > Thanks Jakub, I think that's going to help us get to the bottom of it.  My
> > idea here is that maybe the init process had not yet adopted the orphaned
> > process at the time the post scriptlet ran, making this line:
> > 
> >    ps -f --ppid 1 | awk '/.*neutron-rootwrap-daemon/ { print $2 }'
> > 
> > return empty.
> 
> You told me that it reproduces even if you run 'yum update' 10 minutes after
> the agent was stopped.

Hum... yep, that's correct; the logs showed us that the package was upgraded well after the agent was stopped, but it still produced the orphaned process we saw on the platform together.

Anyway, that was just an "idea"; we will see where it's going with the debug output from the package.

Comment 9 Jaromir Coufal 2017-04-25 14:07:26 UTC
For OSP11, let's make sure we have documented that a reboot is needed after the undercloud upgrade. (Dan, can you make sure we have it in the docs?) Since this is part of the standard procedure, we don't need to backport the "fix" to an async later on.

I'd clone this bug to OSP12 to make sure the user experience is improved in the following releases.

Comment 10 Dan Macpherson 2017-04-25 14:54:54 UTC
No prob, Jarda. We've already got the reboot procedure included after the undercloud upgrade specifically for kernel or ovs package updates. I'll modify the text to say the reboot is compulsory.

Comment 12 Jakub Libosvar 2017-05-04 16:50:34 UTC
(In reply to Jaromir Coufal from comment #9)
> For OSP11, let's make sure we have documented that a reboot is needed after
> the undercloud upgrade. (Dan, can you make sure we have it in the docs?)
> Since this is part of the standard procedure, we don't need to backport the
> "fix" to an async later on.
> 
> I'd clone this bug to OSP12 to make sure the user experience is improved in
> the following releases.

Based on that, I'm also postponing this to the z-stream.

BTW, has anybody tried the upgrade with the scratch builds I provided? It would help identify the root cause.

Comment 13 Raviv Bar-Tal 2017-05-08 06:37:13 UTC
Hi Jakub, All.
As the fix is postponed to the z-stream, we can test your private build after the release. I will contact you and we will debug it together.

Comment 14 Sofer Athlan-Guyot 2017-06-12 12:34:10 UTC
Hi Raviv, I'm going to close this one as the need to reboot is now in the documentation.  Don't hesitate to re-open it if there is any update.

