Bug 1348700 - rhel-osp-director: 9.0 different services failed to load after reboot of undercloud. (instack-undercloud)
Summary: rhel-osp-director: 9.0 different services failed to load after reboot of unde...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: instack-undercloud
Version: 9.0 (Mitaka)
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ga
: 9.0 (Mitaka)
Assignee: Emilien Macchi
QA Contact: Arik Chernetsky
URL:
Whiteboard:
: 1264901 1364587 1367931 (view as bug list)
Depends On: 1368104
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-06-21 19:53 UTC by Alexander Chuzhoy
Modified: 2016-10-28 13:48 UTC (History)
29 users (show)

Fixed In Version: instack-undercloud-4.0.0-13.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1349452 1368104 (view as bug list)
Environment:
Last Closed: 2016-08-24 13:30:51 UTC
Target Upstream Version:
Embargoed:
tvignaud: needinfo+
tvignaud: needinfo+


Attachments (Terms of Use)
neutron and rabbitmq logs (545.82 KB, application/x-gzip)
2016-06-21 19:55 UTC, Alexander Chuzhoy
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1612789 0 None None None 2016-08-12 19:56:38 UTC
Launchpad 1614475 0 None None None 2016-08-18 12:24:07 UTC
OpenStack gerrit 355612 0 None None None 2016-08-17 13:37:43 UTC
OpenStack gerrit 356372 0 None None None 2016-08-17 10:43:19 UTC
OpenStack gerrit 356754 0 None None None 2016-08-17 21:43:23 UTC
Red Hat Product Errata RHBA-2016:1766 0 normal SHIPPED_LIVE Red Hat OpenStack Platform 9 director 0day Advisory 2016-08-24 17:29:37 UTC

Description Alexander Chuzhoy 2016-06-21 19:53:17 UTC
rhel-osp-director: 9.0 rabbitmq and neutron-server failed to load after reboot of undercloud.

Environment:
instack-undercloud-4.0.0-4.el7ost.noarch
python-neutron-8.1.0-6.el7ost.noarch
python-neutron-lib-0.0.2-1.el7ost.noarch
openstack-neutron-8.1.0-6.el7ost.noarch
openstack-neutron-openvswitch-8.1.0-6.el7ost.noarch
python-neutronclient-4.1.1-2.el7ost.noarch
openstack-neutron-common-8.1.0-6.el7ost.noarch
openstack-neutron-ml2-8.1.0-6.el7ost.noarch
rabbitmq-server-3.6.2-3.el7ost.noarch


Steps to reproduce:
1. Deploy overcloud with:
openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --swift-storage-scale 0 --block-storage-scale 0 --neutron-tunnel-types vxlan,gre --neutron-network-type vxlan,gre --neutron-network-vlan-ranges datacentre:118:143 --neutron-bridge-mappings datacentre:br-ex --ntp-server clock.redhat.com --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml -e /home/stack/ssl-heat-templates/environments/enable-tls.yaml -e /home/stack/ssl-heat-templates/environments/inject-trust-anchor.yaml --ceph-storage-scale 1

2. Populate the overcloud
3. Reboot the undercloud
4. After reboot attempt to run "nova list" on the undercloud.


Result:

No handlers could be found for logger "keystoneauth.identity.generic.base"

[stack@instack ~]$ sudo systemctl|grep -i fail
● neutron-server.service  loaded failed     failed   OpenStack Neutron Server
● ovirt-guest-agent.service loaded failed     failed          oVirt Guest Agent
● postfix.service   loaded failed     failed          Postfix Mail Transport Agent
● rabbitmq-server.service   loaded failed     failed          RabbitMQ broker


Manually starting rabbitmq and neutron-server works.

Workaround:
systemctl start rabbitmq-server.service
systemctl start neutron-server.service
nova list

Comment 1 Alexander Chuzhoy 2016-06-21 19:55:10 UTC
Created attachment 1170384 [details]
neutron and rabbitmq logs

Comment 2 Alexander Chuzhoy 2016-06-21 20:49:20 UTC
On another deployment/attempt, only the neutron-server was failed after rebooting the undercloud.
Manually started it.

Comment 3 Mike Burns 2016-06-23 13:18:45 UTC
Splitting this into separate bugs for neutron and rabbit

Comment 4 Mike Burns 2016-06-23 13:22:02 UTC
Can you please provide rabbit logs?

Comment 5 Andrew Beekhof 2016-06-28 23:39:40 UTC
Peter, can you resolve/follow this from a rabbit perspective please?
If there are still issues after that we can bounce it to the neutron team.

Comment 6 Peter Lemenkov 2016-06-29 12:58:54 UTC
Few things - you shouldn't use systemctl for RabbitMQ cluster management. Please use pcs for that.

We found recently that RabbitMQ might not join cluster in some cases (see bug 1343905). If possible please upgrade resource-agents up to resource-agents-3.9.5-76.el7

Comment 7 Jay Dobies 2016-06-30 13:11:17 UTC
Peter - On the undercloud, we're not HA enabled yet, so it will be using systemctl.

Comment 12 Peter Lemenkov 2016-07-08 13:25:07 UTC
I really don't see anything related to RabbitMQ in the logs (or log snippets) posted above. I suspect some systemd-related failure.

Do we have anything in journald logs?

Comment 13 Dan Macpherson 2016-07-11 00:44:44 UTC
Just what I posted in comment #8.

Comment 18 Andrew Beekhof 2016-07-26 01:19:33 UTC
> I suspect some systemd-related failure.

Is a version of the rabbit agent that uses systemctl to perform the actual starts and stops?

Comment 19 Fabio Massimo Di Nitto 2016-07-26 03:23:20 UTC
(In reply to Andrew Beekhof from comment #18)
> > I suspect some systemd-related failure.
> 
> Is a version of the rabbit agent that uses systemctl to perform the actual
> starts and stops?

this is on the undercloud, there is no pacemaker or ras involved yet.

Comment 20 Michele Baldessari 2016-07-26 07:57:51 UTC
Rasca and I took a deeper look at this. This is not a specific rabbit issue but it has to do with systemd/network configuration because we found a bunch of services not started after a reboot:

galera:
160726  7:11:06 [Note] Server socket created on IP: '192.0.2.1'.
160726  7:11:06 [ERROR] Can't start server: Bind on TCP/IP port. Got error: 99: Cannot assign requested address
160726  7:11:06 [ERROR] Do you already have another mysqld server running on port: 3306 ?
160726  7:11:06 [ERROR] Aborting

rabbitmq:
=ERROR REPORT==== 26-Jul-2016::07:11:10 ===
Failed to start Ranch listener {acceptor,{192,0,2,1},5672} in ranch_tcp:listen([{port,
  5672},{ip,{192,0,2,1}},inet,binary,{packet,raw},{reuseaddr,true},{backlog,128},{nodelay,true},{exit_on_close,false}]) for reason eaddrnotavail (can't assign requested address)

swift:
lug 26 07:11:07 haa-01.ha.lab.eng.bos.redhat.com swift-proxy-server[11476]: Traceback (most recent call last):
lug 26 07:11:07 haa-01.ha.lab.eng.bos.redhat.com swift-proxy-server[11476]: File "/usr/bin/swift-proxy-server", line 23, in <module>
lug 26 07:11:07 haa-01.ha.lab.eng.bos.redhat.com swift-proxy-server[11476]: sys.exit(run_wsgi(conf_file, 'proxy-server', **options))
lug 26 07:11:07 haa-01.ha.lab.eng.bos.redhat.com swift-proxy-server[11476]: File "/usr/lib/python2.7/site-packages/swift/common/wsgi.py", line 883, in run_wsgi
lug 26 07:11:07 haa-01.ha.lab.eng.bos.redhat.com swift-proxy-server[11476]: error_msg = strategy.bind_ports()
lug 26 07:11:07 haa-01.ha.lab.eng.bos.redhat.com swift-proxy-server[11476]: File "/usr/lib/python2.7/site-packages/swift/common/wsgi.py", line 481, in bind_ports
lug 26 07:11:07 haa-01.ha.lab.eng.bos.redhat.com swift-proxy-server[11476]: self.sock = get_socket(self.conf)
lug 26 07:11:07 haa-01.ha.lab.eng.bos.redhat.com swift-proxy-server[11476]: File "/usr/lib/python2.7/site-packages/swift/common/wsgi.py", line 189, in get_socket
lug 26 07:11:07 haa-01.ha.lab.eng.bos.redhat.com swift-proxy-server[11476]: family=address_family)
lug 26 07:11:07 haa-01.ha.lab.eng.bos.redhat.com swift-proxy-server[11476]: File "/usr/lib/python2.7/site-packages/eventlet/convenience.py", line 43, in listen
lug 26 07:11:07 haa-01.ha.lab.eng.bos.redhat.com swift-proxy-server[11476]: sock.bind(addr)
lug 26 07:11:07 haa-01.ha.lab.eng.bos.redhat.com swift-proxy-server[11476]: File "/usr/lib64/python2.7/socket.py", line 224, in meth
lug 26 07:11:07 haa-01.ha.lab.eng.bos.redhat.com swift-proxy-server[11476]: return getattr(self._sock,name)(*args)
lug 26 07:11:07 haa-01.ha.lab.eng.bos.redhat.com swift-proxy-server[11476]: socket.error: [Errno 99] Cannot assign requested address

The problem seems to be that the 192.0.2.1 which sits on br-ctlplane is configured too late and so all services that bind to the ip are failed:

lug 26 07:11:12 haa-01.ha.lab.eng.bos.redhat.com NetworkManager[11561]: <info>  (br-ctlplane): link connected
lug 26 07:11:14 haa-01.ha.lab.eng.bos.redhat.com network[13791]: Bringing up interface br-ctlplane:  [  OK  ]
lug 26 03:11:14 haa-01.ha.lab.eng.bos.redhat.com network[13791]: Bringing up interface eth0:  [  OK  ]
lug 26 03:11:14 haa-01.ha.lab.eng.bos.redhat.com systemd[1]: Starting Network Time Service...
lug 26 03:11:14 haa-01.ha.lab.eng.bos.redhat.com systemd[1]: Started Network Time Service.
lug 26 03:11:14 haa-01.ha.lab.eng.bos.redhat.com NetworkManager[11561]: <info>  (eth1): enslaved to non-master-type device ovs-system; ignoring
lug 26 03:11:14 haa-01.ha.lab.eng.bos.redhat.com network[13791]: Bringing up interface eth1:  [  OK  ]
lug 26 03:11:14 haa-01.ha.lab.eng.bos.redhat.com systemd[1]: Started LSB: Bring up/down networking.
lug 26 03:11:18 haa-01.ha.lab.eng.bos.redhat.com NetworkManager[11561]: <info>  (eth1): enslaved to non-master-type device ovs-system; ignoring
lug 26 03:11:18 haa-01.ha.lab.eng.bos.redhat.com NetworkManager[11561]: <info>  (eth1): link connected

Comment 21 Michele Baldessari 2016-07-26 08:03:53 UTC
According to https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/ one course of action (since it seems we do not use NetworkManager to configure 192.0.2.1 on the ctlplane) is to make sure all services use:
"""
Alternatively, you can change your service that needs the network to be up, to include After=network-online.target and Wants=network-online.target.
"""

Comment 22 Michele Baldessari 2016-07-26 08:52:06 UTC
Another possible approach (since adding network-online is probably way too cumbersome and a quick test did not show good results at least with rabbit), is
to enable the sysctl ip_nonlocal_bind which will let the bind() syscall 
proceed even without the ip address.

A quick test showed that setting net.ipv4.ip_nonlocal_bind makes all services
start up correctly, except for:
Jul 26 07:11:03 haa-01.ha.lab.eng.bos.redhat.com systemd[1]: Starting PXE boot dnsmasq service for Ironic Inspector...
Jul 26 07:11:03 haa-01.ha.lab.eng.bos.redhat.com dnsmasq[12297]: dnsmasq: unknown interface br-ctlplane
Jul 26 07:11:03 haa-01.ha.lab.eng.bos.redhat.com systemd[1]: openstack-ironic-inspector-dnsmasq.service: control process exited, code=exited status=2
Jul 26 07:11:03 haa-01.ha.lab.eng.bos.redhat.com systemd[1]: Failed to start PXE boot dnsmasq service for Ironic Inspector.
Jul 26 07:11:03 haa-01.ha.lab.eng.bos.redhat.com systemd[1]: Unit openstack-ironic-inspector-dnsmasq.service entered failed state.
Jul 26 07:11:03 haa-01.ha.lab.eng.bos.redhat.com systemd[1]: openstack-ironic-inspector-dnsmasq.service failed.


Likely because dnsmasq does not call bind() but literally checks for an interface called br-ctlplane :/

Comment 23 Raoul Scarazzini 2016-07-26 14:48:22 UTC
So I tried to reproduce the problem and even if, as Michele was writing initially, we found something (not just related to rabbitmq but also with galera and swift services), with the latest puddle or one version earlier I was not able to have the same behavior.
The undercloud always reboots correctly with all the services started in place.
So maybe something was fixed in the meantime, but not in the puddles, instead this could be a matter of operating system packages (NetworkManager?).
So in the end, if we can have something reproducible then we can give further investigation on the problem, otherwise we'll close this bug.
In any case this is not a rabbit-only problem and the bug assignment needs to be changed.
But first of all let's reproduce it.

Comment 24 Fabio Massimo Di Nitto 2016-07-26 15:01:07 UTC
As for comment #23, can you please try to reproduce the issue with the latest puddle?

Comment 25 Fabio Massimo Di Nitto 2016-07-26 15:03:13 UTC
Changing to a generic component, dropping the blocker flags at this stage. Will keep the DFG:PIDONE for a bit longer, but if it's still a bug it needs to find a new home.

Comment 26 Fabio Massimo Di Nitto 2016-07-27 09:08:55 UTC
Moving back to OSP9 and reassigning to DFG:DF-OSP9 for further investigation.

Comment 27 Mike Orazi 2016-07-27 21:05:36 UTC
Need to re-triage this and assign the correct owner.

Comment 28 Alexander Chuzhoy 2016-07-27 22:12:37 UTC
The issue doesn't reproduce. Will re-open if it will.

Comment 29 Fabio Massimo Di Nitto 2016-08-09 10:39:21 UTC
*** Bug 1364587 has been marked as a duplicate of this bug. ***

Comment 34 John Eckersberg 2016-08-09 15:12:31 UTC
So here's what I think is wrong.

We've got...

[root@undercloud ~]# grep After /usr/lib/systemd/system/rabbitmq-server.service 
After=network.target epmd.0.0.socket

After=network.target is pretty standard.  All sorts of stuff uses it, openstack services and core services alike:

[root@undercloud ~]# grep -l 'After=.*network.target' /usr/lib/systemd/system/*.service | wc -l
128

So what's the deal?  Relevant excerpts from https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/ :

"network.target has very little meaning during start-up. It only indicates that the network management stack is up after it has been reached. Whether any network interfaces are already configured when it is reached is undefined. It's primary purpose is for ordering things properly at shutdown: since the shutdown ordering of units in systemd is the reverse of the startup ordering, any unit that is order After=network.target can be sure that it is stopped before the network is shut down if the system is powered off."

OK, not what I would expect.

"network-online.target is a target that actively waits until the nework is "up", where the definition of "up" is defined by the network management software. Usually it indicates a configured, routable IP address of some kind. [...] It is strongly recommended not to pull in this target too liberally: for example network server software should generally not pull this in (since server software generally is happy to accept local connections even before any routable network interface is up), it's primary purpose is network client software that cannot operate without network."

OK, so in theory we should not need that for our servers.  But in practice, I think we do.  Here's why.  We explicitly bind our services to non-loopback/non-wildcard addresses:

[root@undercloud ~]# grep '.*listen.*192' /etc/nova/nova.conf
ec2_listen=192.0.2.1
osapi_compute_listen=192.0.2.1
metadata_listen=192.0.2.1
osapi_volume_listen=192.0.2.1
[root@undercloud ~]# grep 192 /etc/rabbitmq/rabbitmq-env.conf 
NODE_IP_ADDRESS=192.0.2.1

etc.

And most importantly, on the undercloud, we do *not* set ip_nonlocal_bind:

[root@undercloud ~]# cat /proc/sys/net/ipv4/ip_nonlocal_bind 
0

Whereas we *do* set that option on the overcloud so haproxy can bind to the VIP addresses even when the host does not currently have the VIP.

So the race that happens on the undercloud is roughly:

(1) systemd reaches network.target, interface not yet configured with address
(2) service such as rabbitmq starts, tries to bind to 192.0.2.1, fails, unit fails
(3) NetworkManager configures address on interface

Note that on the overcloud (2) cannot fail due to ip_nonlocal_bind.

So, how to fix?

Option 1 is to set ip_nonlocal_bind on the undercloud.

Option 2 from the systemd docs is to enable NetworkManager-wait-online.service, which causes network.target to behave the same as network-online.target.  This will ensure that the interfaces are up and addresses configured before starting the rest of the services.

I don't really have an opinion either way.

Comment 35 Mark McLoughlin 2016-08-10 06:48:44 UTC
(In reply to John Eckersberg from comment #34)

> So the race that happens on the undercloud is roughly:
> 
> (1) systemd reaches network.target, interface not yet configured with address
> (2) service such as rabbitmq starts, tries to bind to 192.0.2.1, fails, unit
> fails
> (3) NetworkManager configures address on interface
> 
> Note that on the overcloud (2) cannot fail due to ip_nonlocal_bind.

Thanks, eck. Makes sense.

Is there any reason to suspect this is a regression in OSP-9?

> So, how to fix?
> 
> Option 1 is to set ip_nonlocal_bind on the undercloud.
> 
> Option 2 from the systemd docs is to enable
> NetworkManager-wait-online.service, which causes network.target to behave
> the same as network-online.target.  This will ensure that the interfaces are
> up and addresses configured before starting the rest of the services.
> 
> I don't really have an opinion either way.

In terms of consequences for the OSP-9 release schedule, I'd say either option would put us at risk of serious unintended regressions - significant retesting would be required.

Comment 39 John Eckersberg 2016-08-10 13:54:19 UTC
(In reply to Mark McLoughlin from comment #35)
> 
> Is there any reason to suspect this is a regression in OSP-9?
> 

I don't think so.  I've (rarely) seen similar behavior since OSP-5.

Comment 44 Mike Burns 2016-08-17 21:23:51 UTC
This appears to be causing a significant regression (bug 1367931), so reverting the fix.

Comment 45 Mike Burns 2016-08-17 21:24:26 UTC
*** Bug 1367931 has been marked as a duplicate of this bug. ***

Comment 46 Emilien Macchi 2016-08-17 21:40:57 UTC
don't revert it please, here's the fix:
https://review.openstack.org/356754

Comment 47 Emilien Macchi 2016-08-17 21:44:10 UTC
in fact, we'll need both https://review.openstack.org/356755 and https://review.openstack.org/356754 patches to close the whole bug and make sure we cover all use cases.

Comment 48 Mike Burns 2016-08-18 02:11:03 UTC
Thanks Emilien.  I did revert the patch internally, but can restore it if/when these patches land upstream.

Comment 51 Thierry Vignaud 2016-08-18 12:04:25 UTC
See comment #49

Comment 52 Mike Burns 2016-08-18 12:32:57 UTC
Cloned this to bug 1368104 because it needs a fix in openstack-puppet-modules

Comment 54 Omri Hochman 2016-08-19 13:54:23 UTC
Unable to reproduce with: 

instack-undercloud-4.0.0-13.el7ost.noarch
puppet-3.6.2-4.el7sat.noarch
openstack-puppet-modules-8.1.8-2.el7ost.noarch
openstack-tripleo-puppet-elements-2.0.0-4.el7ost.noarch

Comment 55 Michele Baldessari 2016-08-23 09:24:00 UTC
Not a big deal, but it is worth noting that we got support for 
net.ipv6.ip_nonlocal_bind in kernel kernel-3.10.0-327.21.1.el7 (via bz https://bugzilla.redhat.com/show_bug.cgi?id=1324502). So any undercloud deployment on a kernel previous to that one will fail because that sysctl key does not exist.

Comment 57 errata-xmlrpc 2016-08-24 13:30:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-1766.html

Comment 58 Fabio Massimo Di Nitto 2016-08-30 12:58:56 UTC
*** Bug 1264901 has been marked as a duplicate of this bug. ***


Note You need to log in before you can comment on or make changes to this bug.