rhel-osp-director: 9.0 rabbitmq and neutron-server failed to load after reboot of undercloud.

Environment:
instack-undercloud-4.0.0-4.el7ost.noarch
python-neutron-8.1.0-6.el7ost.noarch
python-neutron-lib-0.0.2-1.el7ost.noarch
openstack-neutron-8.1.0-6.el7ost.noarch
openstack-neutron-openvswitch-8.1.0-6.el7ost.noarch
python-neutronclient-4.1.1-2.el7ost.noarch
openstack-neutron-common-8.1.0-6.el7ost.noarch
openstack-neutron-ml2-8.1.0-6.el7ost.noarch
rabbitmq-server-3.6.2-3.el7ost.noarch

Steps to reproduce:
1. Deploy overcloud with:
openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --swift-storage-scale 0 --block-storage-scale 0 --neutron-tunnel-types vxlan,gre --neutron-network-type vxlan,gre --neutron-network-vlan-ranges datacentre:118:143 --neutron-bridge-mappings datacentre:br-ex --ntp-server clock.redhat.com --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml -e /home/stack/ssl-heat-templates/environments/enable-tls.yaml -e /home/stack/ssl-heat-templates/environments/inject-trust-anchor.yaml --ceph-storage-scale 1
2. Populate the overcloud
3. Reboot the undercloud
4. After the reboot, attempt to run "nova list" on the undercloud.

Result:
No handlers could be found for logger "keystoneauth.identity.generic.base"

[stack@instack ~]$ sudo systemctl|grep -i fail
● neutron-server.service loaded failed failed OpenStack Neutron Server
● ovirt-guest-agent.service loaded failed failed oVirt Guest Agent
● postfix.service loaded failed failed Postfix Mail Transport Agent
● rabbitmq-server.service loaded failed failed RabbitMQ broker

Manually starting rabbitmq and neutron-server works.

Workaround:
systemctl start rabbitmq-server.service
systemctl start neutron-server.service
nova list
Created attachment 1170384 [details] neutron and rabbitmq logs
On another deployment/attempt, only the neutron-server was failed after rebooting the undercloud. Manually started it.
Splitting this into separate bugs for neutron and rabbit
Can you please provide rabbit logs?
Peter, can you resolve/follow this from a rabbit perspective please? If there are still issues after that we can bounce it to the neutron team.
A few things: you shouldn't use systemctl for RabbitMQ cluster management; please use pcs for that. We recently found that RabbitMQ might not join the cluster in some cases (see bug 1343905). If possible, please upgrade resource-agents to resource-agents-3.9.5-76.el7.
Peter - On the undercloud, we're not HA enabled yet, so it will be using systemctl.
I really don't see anything related to RabbitMQ in the logs (or log snippets) posted above. I suspect some systemd-related failure. Do we have anything in journald logs?
Just what I posted in comment #8.
> I suspect some systemd-related failure.

Is this a version of the rabbit agent that uses systemctl to perform the actual starts and stops?
(In reply to Andrew Beekhof from comment #18)
> > I suspect some systemd-related failure.
>
> Is this a version of the rabbit agent that uses systemctl to perform the
> actual starts and stops?

This is on the undercloud; there is no pacemaker or resource agent involved yet.
Rasca and I took a deeper look at this. This is not a specific rabbit issue; it has to do with systemd/network configuration, because we found a bunch of services not started after a reboot:

galera:

160726  7:11:06 [Note] Server socket created on IP: '192.0.2.1'.
160726  7:11:06 [ERROR] Can't start server: Bind on TCP/IP port. Got error: 99: Cannot assign requested address
160726  7:11:06 [ERROR] Do you already have another mysqld server running on port: 3306 ?
160726  7:11:06 [ERROR] Aborting

rabbitmq:

=ERROR REPORT==== 26-Jul-2016::07:11:10 ===
Failed to start Ranch listener {acceptor,{192,0,2,1},5672} in ranch_tcp:listen([{port, 5672},{ip,{192,0,2,1}},inet,binary,{packet,raw},{reuseaddr,true},{backlog,128},{nodelay,true},{exit_on_close,false}]) for reason eaddrnotavail (can't assign requested address)

swift:

lug 26 07:11:07 haa-01.ha.lab.eng.bos.redhat.com swift-proxy-server[11476]: Traceback (most recent call last):
lug 26 07:11:07 haa-01.ha.lab.eng.bos.redhat.com swift-proxy-server[11476]: File "/usr/bin/swift-proxy-server", line 23, in <module>
lug 26 07:11:07 haa-01.ha.lab.eng.bos.redhat.com swift-proxy-server[11476]: sys.exit(run_wsgi(conf_file, 'proxy-server', **options))
lug 26 07:11:07 haa-01.ha.lab.eng.bos.redhat.com swift-proxy-server[11476]: File "/usr/lib/python2.7/site-packages/swift/common/wsgi.py", line 883, in run_wsgi
lug 26 07:11:07 haa-01.ha.lab.eng.bos.redhat.com swift-proxy-server[11476]: error_msg = strategy.bind_ports()
lug 26 07:11:07 haa-01.ha.lab.eng.bos.redhat.com swift-proxy-server[11476]: File "/usr/lib/python2.7/site-packages/swift/common/wsgi.py", line 481, in bind_ports
lug 26 07:11:07 haa-01.ha.lab.eng.bos.redhat.com swift-proxy-server[11476]: self.sock = get_socket(self.conf)
lug 26 07:11:07 haa-01.ha.lab.eng.bos.redhat.com swift-proxy-server[11476]: File "/usr/lib/python2.7/site-packages/swift/common/wsgi.py", line 189, in get_socket
lug 26 07:11:07 haa-01.ha.lab.eng.bos.redhat.com swift-proxy-server[11476]: family=address_family)
lug 26 07:11:07 haa-01.ha.lab.eng.bos.redhat.com swift-proxy-server[11476]: File "/usr/lib/python2.7/site-packages/eventlet/convenience.py", line 43, in listen
lug 26 07:11:07 haa-01.ha.lab.eng.bos.redhat.com swift-proxy-server[11476]: sock.bind(addr)
lug 26 07:11:07 haa-01.ha.lab.eng.bos.redhat.com swift-proxy-server[11476]: File "/usr/lib64/python2.7/socket.py", line 224, in meth
lug 26 07:11:07 haa-01.ha.lab.eng.bos.redhat.com swift-proxy-server[11476]: return getattr(self._sock,name)(*args)
lug 26 07:11:07 haa-01.ha.lab.eng.bos.redhat.com swift-proxy-server[11476]: socket.error: [Errno 99] Cannot assign requested address

The problem seems to be that 192.0.2.1, which sits on br-ctlplane, is configured too late, so all services that bind to that IP fail:

lug 26 07:11:12 haa-01.ha.lab.eng.bos.redhat.com NetworkManager[11561]: <info> (br-ctlplane): link connected
lug 26 07:11:14 haa-01.ha.lab.eng.bos.redhat.com network[13791]: Bringing up interface br-ctlplane: [ OK ]
lug 26 03:11:14 haa-01.ha.lab.eng.bos.redhat.com network[13791]: Bringing up interface eth0: [ OK ]
lug 26 03:11:14 haa-01.ha.lab.eng.bos.redhat.com systemd[1]: Starting Network Time Service...
lug 26 03:11:14 haa-01.ha.lab.eng.bos.redhat.com systemd[1]: Started Network Time Service.
lug 26 03:11:14 haa-01.ha.lab.eng.bos.redhat.com NetworkManager[11561]: <info> (eth1): enslaved to non-master-type device ovs-system; ignoring
lug 26 03:11:14 haa-01.ha.lab.eng.bos.redhat.com network[13791]: Bringing up interface eth1: [ OK ]
lug 26 03:11:14 haa-01.ha.lab.eng.bos.redhat.com systemd[1]: Started LSB: Bring up/down networking.
lug 26 03:11:18 haa-01.ha.lab.eng.bos.redhat.com NetworkManager[11561]: <info> (eth1): enslaved to non-master-type device ovs-system; ignoring
lug 26 03:11:18 haa-01.ha.lab.eng.bos.redhat.com NetworkManager[11561]: <info> (eth1): link connected
According to https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/, one course of action (since it seems we do not use NetworkManager to configure 192.0.2.1 on the ctlplane) is to make sure all services use:

"""
Alternatively, you can change your service that needs the network to be up, to include After=network-online.target and Wants=network-online.target.
"""
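For each affected unit that would mean a systemd drop-in, roughly like the following sketch (the drop-in filename is hypothetical; the [Unit] directives are the standard systemd ones quoted above):

```ini
# /etc/systemd/system/rabbitmq-server.service.d/wait-for-network.conf  (hypothetical path)
[Unit]
Wants=network-online.target
After=network-online.target
```

A `systemctl daemon-reload` would be needed after adding the drop-in, and an equivalent fragment would be required for every service that binds to 192.0.2.1.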
Another possible approach (since adding network-online is probably way too cumbersome, and a quick test did not show good results, at least with rabbit) is to enable the ip_nonlocal_bind sysctl, which lets the bind() syscall proceed even when the IP address is not yet configured. A quick test showed that setting net.ipv4.ip_nonlocal_bind makes all services start up correctly, except for:

Jul 26 07:11:03 haa-01.ha.lab.eng.bos.redhat.com systemd[1]: Starting PXE boot dnsmasq service for Ironic Inspector...
Jul 26 07:11:03 haa-01.ha.lab.eng.bos.redhat.com dnsmasq[12297]: dnsmasq: unknown interface br-ctlplane
Jul 26 07:11:03 haa-01.ha.lab.eng.bos.redhat.com systemd[1]: openstack-ironic-inspector-dnsmasq.service: control process exited, code=exited status=2
Jul 26 07:11:03 haa-01.ha.lab.eng.bos.redhat.com systemd[1]: Failed to start PXE boot dnsmasq service for Ironic Inspector.
Jul 26 07:11:03 haa-01.ha.lab.eng.bos.redhat.com systemd[1]: Unit openstack-ironic-inspector-dnsmasq.service entered failed state.
Jul 26 07:11:03 haa-01.ha.lab.eng.bos.redhat.com systemd[1]: openstack-ironic-inspector-dnsmasq.service failed.

Likely because dnsmasq does not call bind() on the address but literally checks for an interface called br-ctlplane :/
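The failure mode the services hit is easy to demonstrate outside of OpenStack. A minimal sketch: on a host where net.ipv4.ip_nonlocal_bind is 0 (the kernel default) and 192.0.2.1 is not configured on any interface, a plain bind() to that address fails with errno 99 (EADDRNOTAVAIL), which is exactly the error galera, rabbitmq and swift logged above:

```python
import errno
import socket

def try_bind(ip, port=0):
    """Try to bind a TCP socket to ip; return 0 on success, else the errno."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((ip, port))
        return 0
    except OSError as e:
        return e.errno
    finally:
        s.close()

# Loopback is always locally assigned, so this succeeds.
print(try_bind("127.0.0.1"))   # 0

# 192.0.2.1 (TEST-NET-1) is normally not assigned to any local interface,
# so with ip_nonlocal_bind=0 this fails with EADDRNOTAVAIL (errno 99 on Linux).
print(try_bind("192.0.2.1"))
```

With net.ipv4.ip_nonlocal_bind set to 1, the second bind() would succeed even though the address is not yet configured, which is why that sysctl hides the ordering race for everything except dnsmasq.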
So I tried to reproduce the problem. Even though, as Michele wrote initially, we did originally see failures (not just with rabbitmq but also with the galera and swift services), with the latest puddle, or one version earlier, I was not able to reproduce the behavior: the undercloud always reboots correctly with all services started. So maybe something was fixed in the meantime; if not in the puddles, this could be a matter of operating system packages (NetworkManager?). In the end, if we can get something reproducible we can investigate further, otherwise we'll close this bug. In any case this is not a rabbit-only problem and the bug assignment needs to be changed. But first of all let's reproduce it.
As for comment #23, can you please try to reproduce the issue with the latest puddle?
Changing to a generic component, dropping the blocker flags at this stage. Will keep the DFG:PIDONE for a bit longer, but if it's still a bug it needs to find a new home.
Moving back to OSP9 and reassigning to DFG:DF-OSP9 for further investigation.
Need to re-triage this and assign the correct owner.
The issue doesn't reproduce. Will re-open if it does.
*** Bug 1364587 has been marked as a duplicate of this bug. ***
So here's what I think is wrong. We've got:

[root@undercloud ~]# grep After /usr/lib/systemd/system/rabbitmq-server.service
After=network.target epmd.0.0.socket

After=network.target is pretty standard. All sorts of stuff uses it, OpenStack services and core services alike:

[root@undercloud ~]# grep -l 'After=.*network.target' /usr/lib/systemd/system/*.service | wc -l
128

So what's the deal? Relevant excerpts from https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/ :

"network.target has very little meaning during start-up. It only indicates that the network management stack is up after it has been reached. Whether any network interfaces are already configured when it is reached is undefined. It's primary purpose is for ordering things properly at shutdown: since the shutdown ordering of units in systemd is the reverse of the startup ordering, any unit that is order After=network.target can be sure that it is stopped before the network is shut down if the system is powered off."

OK, not what I would expect.

"network-online.target is a target that actively waits until the nework is "up", where the definition of "up" is defined by the network management software. Usually it indicates a configured, routable IP address of some kind. [...] It is strongly recommended not to pull in this target too liberally: for example network server software should generally not pull this in (since server software generally is happy to accept local connections even before any routable network interface is up), it's primary purpose is network client software that cannot operate without network."

OK, so in theory we should not need that for our servers. But in practice, I think we do. Here's why.
We explicitly bind our services to non-loopback/non-wildcard addresses:

[root@undercloud ~]# grep '.*listen.*192' /etc/nova/nova.conf
ec2_listen=192.0.2.1
osapi_compute_listen=192.0.2.1
metadata_listen=192.0.2.1
osapi_volume_listen=192.0.2.1

[root@undercloud ~]# grep 192 /etc/rabbitmq/rabbitmq-env.conf
NODE_IP_ADDRESS=192.0.2.1

etc.

And most importantly, on the undercloud, we do *not* set ip_nonlocal_bind:

[root@undercloud ~]# cat /proc/sys/net/ipv4/ip_nonlocal_bind
0

Whereas we *do* set that option on the overcloud so haproxy can bind to the VIP addresses even when the host does not currently have the VIP.

So the race that happens on the undercloud is roughly:

(1) systemd reaches network.target, interface not yet configured with address
(2) service such as rabbitmq starts, tries to bind to 192.0.2.1, fails, unit fails
(3) NetworkManager configures address on interface

Note that on the overcloud (2) cannot fail due to ip_nonlocal_bind.

So, how to fix?

Option 1 is to set ip_nonlocal_bind on the undercloud.

Option 2 from the systemd docs is to enable NetworkManager-wait-online.service, which causes network.target to behave the same as network-online.target. This will ensure that the interfaces are up and addresses configured before starting the rest of the services.

I don't really have an opinion either way.
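For concreteness, option 1 would amount to a persistent sysctl fragment along these lines (the filename is hypothetical; the key is the standard Linux sysctl):

```ini
# /etc/sysctl.d/98-undercloud-nonlocal-bind.conf  (hypothetical filename)
net.ipv4.ip_nonlocal_bind = 1
```

Option 2 would be a `systemctl enable NetworkManager-wait-online.service`, at the cost of delaying the rest of boot until NetworkManager reports the network as online.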
(In reply to John Eckersberg from comment #34)
> So the race that happens on the undercloud is roughly:
>
> (1) systemd reaches network.target, interface not yet configured with address
> (2) service such as rabbitmq starts, tries to bind to 192.0.2.1, fails, unit fails
> (3) NetworkManager configures address on interface
>
> Note that on the overcloud (2) cannot fail due to ip_nonlocal_bind.

Thanks, eck. Makes sense. Is there any reason to suspect this is a regression in OSP-9?

> So, how to fix?
>
> Option 1 is to set ip_nonlocal_bind on the undercloud.
>
> Option 2 from the systemd docs is to enable NetworkManager-wait-online.service,
> which causes network.target to behave the same as network-online.target. This
> will ensure that the interfaces are up and addresses configured before starting
> the rest of the services.
>
> I don't really have an opinion either way.

In terms of consequences for the OSP-9 release schedule, I'd say either option would put us at risk of serious unintended regressions; significant retesting would be required.
(In reply to Mark McLoughlin from comment #35)
> Is there any reason to suspect this is a regression in OSP-9?

I don't think so. I've (rarely) seen similar behavior since OSP-5.
This appears to be causing a significant regression (bug 1367931), so reverting the fix.
*** Bug 1367931 has been marked as a duplicate of this bug. ***
don't revert it please, here's the fix: https://review.openstack.org/356754
in fact, we'll need both https://review.openstack.org/356755 and https://review.openstack.org/356754 patches to close the whole bug and make sure we cover all use cases.
Thanks Emilien. I did revert the patch internally, but can restore it if/when these patches land upstream.
See comment #49
Cloned this to bug 1368104 because it needs a fix in openstack-puppet-modules
Unable to reproduce with: instack-undercloud-4.0.0-13.el7ost.noarch puppet-3.6.2-4.el7sat.noarch openstack-puppet-modules-8.1.8-2.el7ost.noarch openstack-tripleo-puppet-elements-2.0.0-4.el7ost.noarch
Not a big deal, but it is worth noting that support for net.ipv6.ip_nonlocal_bind arrived in kernel-3.10.0-327.21.1.el7 (via bz https://bugzilla.redhat.com/show_bug.cgi?id=1324502). So any undercloud deployment on a kernel older than that will fail because that sysctl key does not exist.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-1766.html
*** Bug 1264901 has been marked as a duplicate of this bug. ***