Bug 1434009

Summary:	neutron: OSP10 -> OSP11 upgrade fails due to neutron-server fails to start when rabbit or db is inaccessible
Product:	Red Hat OpenStack	Reporter:	Marius Cornea <mcornea>
Component:	openstack-neutron	Assignee:	Jakub Libosvar <jlibosva>
Status:	CLOSED ERRATA	QA Contact:	Marius Cornea <mcornea>
Severity:	urgent	Docs Contact:
Priority:	urgent
Version:	11.0 (Ocata)	CC:	amuller, aschultz, chrisw, dbecker, jcoufal, jlibosva, jschluet, mburns, morazi, nyechiel, oblaut, rhel-osp-director-maint, sathlang, srevivo, twilson
Target Milestone:	rc	Keywords:	Triaged
Target Release:	11.0 (Ocata)
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	openstack-neutron-10.0.0-8.el7ost	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-05-17 20:09:48 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Marius Cornea 2017-03-20 14:24:10 UTC

Description of problem:
OSP10 -> OSP11 upgrade fails during major-upgrade-converge in IPv6 deployment. The deployment is running the NeutronAPI service on a custom role called ServiceApi:

stdout: overcloud.AllNodesDeploySteps.ServiceApiDeployment_Step4.0:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: 61df3a53-bb6f-4420-98d3-3fd76d96651a
  status: CREATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6
  deploy_stdout: |
    ...
    Notice: /Stage[main]/Gnocchi::Wsgi::Apache/Openstacklib::Wsgi::Apache[gnocchi_wsgi]/Apache::Vhost[gnocchi_wsgi]/Concat[10-gnocchi_wsgi.conf]/File[/etc/httpd/conf.d/10-gnocchi_wsgi.conf]/ensure: defined content as '{md5}221a6da9b094e622419cde1242660a7f'
    Notice: /Stage[main]/Aodh::Wsgi::Apache/Openstacklib::Wsgi::Apache[aodh_wsgi]/Apache::Vhost[aodh_wsgi]/Concat[10-aodh_wsgi.conf]/File[/etc/httpd/conf.d/10-aodh_wsgi.conf]/ensure: defined content as '{md5}f10fec407d63a6ffa887fec69f77dbca'
    Notice: /Stage[main]/Apache/Concat[/etc/httpd/conf/ports.conf]/File[/etc/httpd/conf/ports.conf]/content: content changed '{md5}49bacec5e46b17ef1ac23107638101c1' to '{md5}a62bf86d22650349b101bdde1b3a79c0'
    Notice: /Stage[main]/Keystone::Deps/Anchor[keystone::config::end]: Triggered 'refresh' from 5 events
    Notice: /Stage[main]/Keystone::Db::Sync/Exec[keystone-manage db_sync]: Triggered 'refresh' from 1 events
    Notice: /Stage[main]/Keystone::Deps/Anchor[keystone::dbsync::end]: Triggered 'refresh' from 1 events
    Notice: /Stage[main]/Keystone/Exec[keystone-manage bootstrap]: Triggered 'refresh' from 1 events
    Notice: /Stage[main]/Keystone::Deps/Anchor[keystone::service::begin]: Triggered 'refresh' from 3 events
    Notice: /Stage[main]/Apache::Service/Service[httpd]: Triggered 'refresh' from 3 events
    Notice: Applied catalog in 244.44 seconds
    (truncated, view all with --long)
  deploy_stderr: |
    ...
    Mar 20 14:03:07 serviceapi-0.localdomain neutron-server[201359]: Option "notification_driver" from group "DEFAULT" is deprecated. Use option "driver" from group "oslo_messaging_notifications".
    Mar 20 14:03:07 serviceapi-0.localdomain neutron-server[201359]: Could not load neutron.openstack.common.notifier.rpc_notifier
    Mar 20 14:03:09 serviceapi-0.localdomain neutron-server[201359]: /usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py:200: FutureWarning: The access_policy argument is changing its default value to <class 'oslo_messaging.rpc.dispatcher.DefaultRPCAccessPolicy'> in version '?', please update the code to explicitly set None as the value: access_policy defaults to LegacyRPCAccessPolicy which exposes private methods. Explicitly set access_policy to DefaultRPCAccessPolicy or ExplicitRPCAccessPolicy.
    Mar 20 14:03:09 serviceapi-0.localdomain neutron-server[201359]: access_policy)
    Mar 20 14:03:40 serviceapi-0.localdomain neutron-server[201359]: ERROR: Could not bind to fd00:fd00:fd00:2000::10:9696 after trying for 30 seconds
    Mar 20 14:03:40 serviceapi-0.localdomain systemd[1]: neutron-server.service: main process exited, code=exited, status=1/FAILURE
    Mar 20 14:03:40 serviceapi-0.localdomain systemd[1]: Failed to start OpenStack Neutron Server.
    Mar 20 14:03:40 serviceapi-0.localdomain systemd[1]: Unit neutron-server.service entered failed state.
    Mar 20 14:03:40 serviceapi-0.localdomain systemd[1]: neutron-server.service failed.
    
    (truncated, view all with --long)

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-6.0.0-0.20170307170102.3134785.0rc2.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP10 overcloud with custom role running systemd managed services
2. Upgrade the environment to OSP11

Actual results:
Upgrade fails during major-upgrade-converge.

Expected results:
Upgrade succeeds.

Additional info:
Attaching sosreports from the serviceapi nodes.

Comment 2 Sofer Athlan-Guyot 2017-03-20 17:18:12 UTC

Hi,

So the neutron-server eventually started:

    Mar 20 14:05:21 serviceapi-0.localdomain systemd[1]: Starting OpenStack Neutron Server..

It tooks systemd ~2min:

    Mar 20 14:03:40 serviceapi-0.localdomain neutron-server[201359]: ERROR: Could not bind to fd00:fd00:fd00:2000::10:9696 after trying for 30 seconds
    Mar 20 14:03:40 serviceapi-0.localdomain systemd[1]: neutron-server.service: main process exited, code=exited, status=1/FAILURE
    Mar 20 14:03:40 serviceapi-0.localdomain systemd[1]: Failed to start OpenStack Neutron Server.

The error suggests that the port is still in used and listening:

    Mar 20 14:03:40 serviceapi-0.localdomain neutron-server[201359]: ERROR: Could not bind to fd00:fd00:fd00:2000::10:9696 after trying for 30 seconds

I could indeed repropduce the same error by trying to start the
neutron server while another was running.

All in all this exclude any parsing of ipv6 address, and it's really a
system issue.  We can't rebind to it because a process is still
listening.  After a while the socket is closed and systemd can restart
neutron-server.

In the log we can see that:

    2017-03-20 14:03:06.625 189782 INFO oslo_service.service [-] Parent process has died unexpectedly, exiting
    2017-03-20 14:03:06.625 189784 INFO oslo_service.service [-] Parent process has died unexpectedly, exiting
    2017-03-20 14:03:06.625 189782 DEBUG neutron.service [-] calling RpcWorker stop() stop /usr/lib/python2.7/site-packages/neutron/service.py:137
    2017-03-20 14:03:06.625 189784 DEBUG neutron.service [-] calling RpcWorker stop() stop /usr/lib/python2.7/site-packages/neutron/service.py:137
    2017-03-20 14:03:06.625 189783 INFO oslo_service.service [-] Parent process has died unexpectedly, exiting

I think that messes up with a proper stop of neutron leading to this error.

So first, it doesn't look ipv6 related (but I may be wrong), but more
like a transient error.

Second, there is no easy way around this problem and beside finding
why one of the process didn't die properly.

Asking for some help from networking.

Comment 3 Terry Wilson 2017-03-21 16:18:56 UTC

I took a look and didn't see anything obvious.

There seem to be quite a few DuplicateMessageErrors in the logs. I don't normally see those, but I also don't usually run with multiple rabbit servers. When doing systemctl stop neutron-server when I first logged in, it seemed to take a couple of minutes to finish. Successive tries went much faster. It looked like the logs showed all of the child processes exiting normally--though the RpcWorkers actually log when their stop() method is called. It might be nice to have the API processes do the same. This will take some more investigation. I'm going on PTO for the next couple of weeks, so Assaf will make sure someone else picks this up.

Comment 4 Jakub Libosvar 2017-03-23 10:02:22 UTC

The error here appears that systemd fails to start Neutron because Neutron lacks connectivity to its depending services (like rabbit or database). This should have been solved by setting TimeoutStartSec="infinity" according https://www.freedesktop.org/software/systemd/man/systemd.service.html#TimeoutStartSec=

The problem is that el based systems don't support "infinity" as per man page:
       TimeoutStartSec=
           Configures the time to wait for start-up. If a daemon service does not signal start-up completion within the configured time, the service will be considered failed and
           will be shut down again. Takes a unit-less value in seconds, or a time span value such as "5min 20s". Pass "0" to disable the timeout logic. Defaults to
           DefaultTimeoutStartSec= from the manager configuration file, except when Type=oneshot is used, in which case the timeout is disabled by default (see systemd-
           system.conf(5)).

We fixed this in RDO but I didn't backport it to stable branches: https://review.rdoproject.org/r/#/c/5640/

Comment 5 Jon Schlueter 2017-03-23 17:16:48 UTC

patch proposed to ocata-rdo branch not yet merged

Comment 8 errata-xmlrpc 2017-05-17 20:09:48 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1245