Description of problem:
After performing the manual upgrade procedure, the neutron agent is dead on the undercloud.

Version-Release number of selected component (if applicable):

How reproducible:
Manually run the undercloud upgrade procedure, then check the neutron agent with: neutron agent-list

Steps to Reproduce:

yum repolist -v enabled
sudo systemctl list-units 'openstack-*'

# get latest rhos-release & setup 11 repos
sudo yum localinstall -y http://download.lab.bos.redhat.com/rcm-guest/puddles/OpenStack/rhos-release/rhos-release-latest.noarch.rpm
sudo rhos-release 11

# disable OSP10 repos
sudo yum-config-manager --disable 'rhelosp-10.0*'
yum repolist -v enabled

# stop services as per [bz-1372040](https://bugzilla.redhat.com/show_bug.cgi?id=1372040#c6)
# this will exist in the OSP10 upgrades docs.
sudo systemctl stop 'openstack-*'
sudo systemctl stop 'neutron-*'
sudo systemctl stop httpd

# if you are going to do a backwards-compatibility install, save the old tht dir
# cp -r /usr/share/openstack-tripleo-heat-templates ~/tht

# update instack-undercloud and friends before running the upgrade
sudo yum -y update instack-undercloud openstack-puppet-modules openstack-tripleo-common python-tripleoclient

# UPGRADE
openstack undercloud upgrade

Actual results:
The neutron agent is dead.

Expected results:
The neutron agent is alive.

Additional info:
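To catch the failure earlier, a small sanity check between the stop step and the upgrade step might help (a sketch, not part of the official procedure; it simply counts surviving neutron processes):

```shell
# Sketch: after `systemctl stop 'openstack-*'` / `systemctl stop 'neutron-*'`,
# confirm no neutron processes are left before running the upgrade.
# The [n] bracket trick keeps grep from matching its own command line.
leftover=$(ps -ef | grep '[n]eutron-' | wc -l)
echo "leftover neutron processes: $leftover"
```

On a properly stopped undercloud this should report 0; a non-zero count before the upgrade would already point at a rogue process such as the rootwrap daemon discussed below.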
This is a follow-up of https://bugzilla.redhat.com/show_bug.cgi?id=1436729. We created a new one to avoid confusion with the mixed-version issue, as it's unrelated.
So, looking at an environment that fails to reap all the neutron rootwrap daemons, the post scriptlets are all in place:

rpm -q --scripts openstack-neutron-openvswitch:

postinstall scriptlet (using /bin/sh):

if [ $1 -eq 1 ] ; then
    # Initial installation
    systemctl preset neutron-openvswitch-agent.service >/dev/null 2>&1 || :
fi

oldconf=/etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini
newconf=/etc/neutron/plugins/ml2/openvswitch_agent.ini
if [ $1 -gt 1 ]; then
    if [ -e $oldconf ]; then
        # Imitate noreplace
        cp $newconf ${newconf}.rpmnew
        cp $oldconf $newconf
    fi
fi

if [ $1 -ge 2 ]; then
    # We're upgrading
    # Detect if the neutron-openvswitch-agent is running
    ovs_agent_running=0
    systemctl status neutron-openvswitch-agent > /dev/null 2>&1 && ovs_agent_running=1 || :
    # If agent is running, stop it
    [ $ovs_agent_running -eq 1 ] && systemctl stop neutron-openvswitch-agent > /dev/null 2>&1 || :
    # Search all orphaned neutron-rootwrap-daemon processes and since all are triggered by sudo,
    # get the actual rootwrap-daemon process.
    for pid in $(ps -f --ppid 1 | awk '/.*neutron-rootwrap-daemon/ { print $2 }'); do
        kill $(ps --ppid $pid -o pid=)
    done
    # If agent was running, start it back with new code
    [ $ovs_agent_running -eq 1 ] && systemctl start neutron-openvswitch-agent > /dev/null 2>&1 || :
fi

preuninstall scriptlet (using /bin/sh):

if [ $1 -eq 0 ] ; then
    # Package removal, not upgrade
    systemctl --no-reload disable neutron-openvswitch-agent.service > /dev/null 2>&1 || :
    systemctl stop neutron-openvswitch-agent.service > /dev/null 2>&1 || :
fi

So it looks OK. Running the command on the platform:

[root@undercloud-0 ~]# ps -f --ppid 1 | awk '/.*neutron-rootwrap-daemon/ { print $2 }'
12453

root     12453  0.0  0.0 193392  2820 ?  S  08:34  0:00 sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf

So we indeed have a rogue rootwrap daemon.
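For reference, the scriptlet's detection line relies on field 2 of `ps -f` output being the PID column (the columns are UID PID PPID C STIME TTY TIME CMD). A sketch replaying the extraction against a canned sample built from the output above (the sample text is an assumption about the exact column spacing, not a capture from the platform):

```shell
# Sketch: reproduce the scriptlet's PID extraction on a canned
# `ps -f --ppid 1` sample; in `ps -f` output, awk's $2 is the PID.
sample='UID        PID  PPID  C STIME TTY          TIME CMD
root     12453     1  0 08:34 ?        00:00:00 sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf'
pid=$(printf '%s\n' "$sample" | awk '/.*neutron-rootwrap-daemon/ { print $2 }')
echo "$pid"   # prints 12453
```

This only matches rootwrap daemons that have already been reparented to PID 1, which is why the timing of the reparenting matters to the analysis below.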
The ovs-vswitchd service is not working properly:

[root@undercloud-0 ~]# systemctl status ovs-vswitchd
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: active (running) since Mon 2017-04-24 08:34:54 EDT; 3h 27min ago
 Main PID: 12332 (ovs-vswitchd)
   CGroup: /system.slice/ovs-vswitchd.service
           └─12332 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openvswitch/ovs-vswitchd.pid --detach

Apr 24 08:34:54 undercloud-0.redhat.local systemd[1]: Starting Open vSwitch Forwarding Unit...
Apr 24 08:34:54 undercloud-0.redhat.local ovs-ctl[12308]: Starting ovs-vswitchd [ OK ]
Apr 24 08:34:54 undercloud-0.redhat.local ovs-ctl[12308]: Enabling remote OVSDB managers [ OK ]
Apr 24 08:34:54 undercloud-0.redhat.local systemd[1]: Started Open vSwitch Forwarding Unit.
Apr 24 08:46:56 undercloud-0.redhat.local ovs-vswitchd[12332]: ovs|00052|rconn|ERR|br-int<->tcp:127.0.0.1:6633: no response to inactivity probe after 5 seconds, disconnecting
Apr 24 08:46:57 undercloud-0.redhat.local ovs-vswitchd[12332]: ovs|00054|rconn|ERR|br-ctlplane<->tcp:127.0.0.1:6633: no response to inactivity probe after 5 seconds, disconnecting

and neutron agent-list is not happy:

+--------------------------------------+--------------------+---------------------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                      | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+---------------------------+-------------------+-------+----------------+---------------------------+
| a036287f-4031-44a8-8454-b2bba1620bc6 | Open vSwitch agent | undercloud-0.redhat.local |                   | xxx   | True           | neutron-openvswitch-agent |
| f41a2dff-be90-4fe3-9038-82caf6e7c025 | DHCP agent         | undercloud-0.redhat.local | nova              | :-)   | True           | neutron-dhcp-agent        |
+--------------------------------------+--------------------+---------------------------+-------------------+-------+----------------+---------------------------+

In the env we have:

[root@undercloud-0 ~]# grep neutron /var/log/yum.log
Apr 24 06:52:49 Installed: python-neutronclient-6.0.0-2.el7ost.noarch
Apr 24 06:52:54 Installed: puppet-neutron-9.5.0-2.el7ost.noarch
Apr 24 06:55:39 Installed: python-neutron-lib-0.4.0-1.el7ost.noarch
Apr 24 06:55:41 Installed: 1:python-neutron-9.2.0-8.el7ost.noarch
Apr 24 06:55:41 Installed: 1:openstack-neutron-common-9.2.0-8.el7ost.noarch
Apr 24 06:55:42 Installed: 1:openstack-neutron-9.2.0-8.el7ost.noarch
Apr 24 06:59:31 Installed: 1:openstack-neutron-ml2-9.2.0-8.el7ost.noarch
Apr 24 06:59:42 Installed: 1:openstack-neutron-openvswitch-9.2.0-8.el7ost.noarch
Apr 24 08:30:39 Updated: python-neutronclient-6.1.0-1.el7ost.noarch
Apr 24 08:31:07 Updated: python-neutron-lib-1.1.0-1.el7ost.noarch
Apr 24 08:31:08 Updated: 1:python-neutron-10.0.1-1.el7ost.noarch
Apr 24 08:31:09 Updated: 1:openstack-neutron-common-10.0.1-1.el7ost.noarch
Apr 24 08:31:09 Updated: 1:openstack-neutron-openvswitch-10.0.1-1.el7ost.noarch
Apr 24 08:31:09 Updated: 1:openstack-neutron-10.0.1-1.el7ost.noarch
Apr 24 08:31:09 Updated: 1:openstack-neutron-ml2-10.0.1-1.el7ost.noarch
Apr 24 08:32:04 Updated: puppet-neutron-10.3.0-2.el7ost.noarch

In an IRC discussion with Jakub about getting more information on the process states before and during the upgrade, Jakub said he was planning to provide a neutron-openvswitch-agent package with the post scriptlet modified to trigger some ps -ef at the right time. Asking Jakub if I got that right :)
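As an aside, checking agent liveness doesn't have to be done by eyeballing the table; a sketch that extracts dead agents from canned `neutron agent-list` output (the field positions are taken from the table above; the canned rows are an assumption about exact spacing):

```shell
# Sketch: find dead agents in a pipe-delimited `neutron agent-list` table.
# With -F'|' the line's leading '|' makes $1 empty, so the "alive"
# column is $6 ('xxx' = dead) and the binary name is $8.
table='| a036287f-4031-44a8-8454-b2bba1620bc6 | Open vSwitch agent | undercloud-0.redhat.local |      | xxx | True | neutron-openvswitch-agent |
| f41a2dff-be90-4fe3-9038-82caf6e7c025 | DHCP agent | undercloud-0.redhat.local | nova | :-) | True | neutron-dhcp-agent |'
dead=$(printf '%s\n' "$table" | awk -F'|' '$6 ~ /xxx/ { print $8 }' | tr -d ' ')
echo "$dead"   # prints neutron-openvswitch-agent
```

A check like this could make the "agent is dead" symptom scriptable in an upgrade CI job.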
(In reply to Sofer Athlan-Guyot from comment #3)
> ...
> Asking Jakub if I got that right :)

That's correct, the scratch-build packages are at https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=13076113
Thanks Jakub, I think that's going to help us get to the bottom of it. My idea here is that maybe init had not yet adopted the orphaned process at the time the post scriptlet ran, making this line:

ps -f --ppid 1 | awk '/.*neutron-rootwrap-daemon/ { print $2 }'

return empty.
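The reparenting this hypothesis depends on can be sketched as follows (the 1-second adoption window and use of `sleep` as a stand-in for the rootwrap daemon are assumptions; on Linux the adoption is effectively immediate once the intermediate parent is reaped):

```shell
# Sketch: an intermediate shell backgrounds a sleep and exits at once,
# orphaning it; the orphan is then adopted by init (or the nearest
# subreaper), so its PPID no longer matches its original parent.
out=$(sh -c 'echo $$; sleep 3 >/dev/null 2>&1 & echo $!')
mid_pid=$(printf '%s\n' "$out" | sed -n 1p)       # the short-lived parent
orphan_pid=$(printf '%s\n' "$out" | sed -n 2p)    # the backgrounded sleep
sleep 1   # give the reaper a moment to adopt the orphan
new_ppid=$(ps -o ppid= -p "$orphan_pid" | tr -d ' ')
echo "orphan $orphan_pid was started by $mid_pid, now parented by $new_ppid"
```

If the scriptlet ran in the window before adoption completed, the orphan would still not show up under `--ppid 1`, which is what the hypothesis suggests.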
(In reply to Sofer Athlan-Guyot from comment #5)
> Thanks Jakub I think that's going to help to get to the bottom of it. My
> idea here is that maybe the init process didn't yet took the orphaned
> process in at the time the post script is run making this line:
>
> ps -f --ppid 1 | awk '/.*neutron-rootwrap-daemon/ { print $2 }'
>
> returns empty.

You told me that it reproduces even if you run 'yum update' 10 minutes after the agent was stopped.
(In reply to Jakub Libosvar from comment #7)
> (In reply to Sofer Athlan-Guyot from comment #5)
> > Thanks Jakub I think that's going to help to get to the bottom of it. My
> > idea here is that maybe the init process didn't yet took the orphaned
> > process in at the time the post script is run making this line:
> >
> > ps -f --ppid 1 | awk '/.*neutron-rootwrap-daemon/ { print $2 }'
> >
> > returns empty.
>
> You told me that it reproduces even if you run 'yum update' 10 minutes after
> agent was stopped.

Hum... yep, that's correct: the logs showed the package was upgraded well after the agent was stopped, yet it still produced the orphaned process we saw together on the platform. Anyway, that was just an "idea"; we will see where it's going with the debug output from the package.
For OSP11, let's make sure we document that a reboot is needed after the undercloud upgrade. (Dan, can you make sure we have it in the docs?) Since this is part of the standard procedure, we don't need to backport the "fix" to an async later on.

I'd clone this bug to OSP12 to make sure the user experience is improved in the following releases.
No prob, Jarda. We've already got the reboot procedure included after the undercloud upgrade specifically for kernel or ovs package updates. I'll modify the text to say the reboot is compulsory.
(In reply to Jaromir Coufal from comment #9)
> For OSP11, let's make sure we have documented that reboot is needed after
> undercloud upgrade. (Dan can you make sure we have it in the docs?). Since
> this is part of standard procedure, we don't need to backport the "fix" to
> async later on.
>
> I'd clone this bug to OSP12 to make sure it is improved user experience in
> the following releases.

Based on that, I'm also postponing this to the Z-stream. BTW, has anybody tried the upgrade with the scratch builds I had provided? That would be helpful in identifying the root cause.
Hi Jakub, all. As the fix is postponed to the z-stream, we can test your private build after the release; I will contact you and we will debug it together.
Hi Raviv, I'm going to close this one, as the need to reboot is now in the documentation. Don't hesitate to reopen it if there is any update.