Description of problem:
OSP10 -> OSP11 upgrade fails during major-upgrade-converge in an IPv6 deployment. The deployment is running the NeutronAPI service on a custom role called ServiceApi:

stdout:
overcloud.AllNodesDeploySteps.ServiceApiDeployment_Step4.0:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: 61df3a53-bb6f-4420-98d3-3fd76d96651a
  status: CREATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 6
  deploy_stdout: |
    ...
    Notice: /Stage[main]/Gnocchi::Wsgi::Apache/Openstacklib::Wsgi::Apache[gnocchi_wsgi]/Apache::Vhost[gnocchi_wsgi]/Concat[10-gnocchi_wsgi.conf]/File[/etc/httpd/conf.d/10-gnocchi_wsgi.conf]/ensure: defined content as '{md5}221a6da9b094e622419cde1242660a7f'
    Notice: /Stage[main]/Aodh::Wsgi::Apache/Openstacklib::Wsgi::Apache[aodh_wsgi]/Apache::Vhost[aodh_wsgi]/Concat[10-aodh_wsgi.conf]/File[/etc/httpd/conf.d/10-aodh_wsgi.conf]/ensure: defined content as '{md5}f10fec407d63a6ffa887fec69f77dbca'
    Notice: /Stage[main]/Apache/Concat[/etc/httpd/conf/ports.conf]/File[/etc/httpd/conf/ports.conf]/content: content changed '{md5}49bacec5e46b17ef1ac23107638101c1' to '{md5}a62bf86d22650349b101bdde1b3a79c0'
    Notice: /Stage[main]/Keystone::Deps/Anchor[keystone::config::end]: Triggered 'refresh' from 5 events
    Notice: /Stage[main]/Keystone::Db::Sync/Exec[keystone-manage db_sync]: Triggered 'refresh' from 1 events
    Notice: /Stage[main]/Keystone::Deps/Anchor[keystone::dbsync::end]: Triggered 'refresh' from 1 events
    Notice: /Stage[main]/Keystone/Exec[keystone-manage bootstrap]: Triggered 'refresh' from 1 events
    Notice: /Stage[main]/Keystone::Deps/Anchor[keystone::service::begin]: Triggered 'refresh' from 3 events
    Notice: /Stage[main]/Apache::Service/Service[httpd]: Triggered 'refresh' from 3 events
    Notice: Applied catalog in 244.44 seconds
    (truncated, view all with --long)
  deploy_stderr: |
    ...
    Mar 20 14:03:07 serviceapi-0.localdomain neutron-server[201359]: Option "notification_driver" from group "DEFAULT" is deprecated. Use option "driver" from group "oslo_messaging_notifications".
    Mar 20 14:03:07 serviceapi-0.localdomain neutron-server[201359]: Could not load neutron.openstack.common.notifier.rpc_notifier
    Mar 20 14:03:09 serviceapi-0.localdomain neutron-server[201359]: /usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py:200: FutureWarning: The access_policy argument is changing its default value to <class 'oslo_messaging.rpc.dispatcher.DefaultRPCAccessPolicy'> in version '?', please update the code to explicitly set None as the value: access_policy defaults to LegacyRPCAccessPolicy which exposes private methods. Explicitly set access_policy to DefaultRPCAccessPolicy or ExplicitRPCAccessPolicy.
    Mar 20 14:03:09 serviceapi-0.localdomain neutron-server[201359]: access_policy)
    Mar 20 14:03:40 serviceapi-0.localdomain neutron-server[201359]: ERROR: Could not bind to fd00:fd00:fd00:2000::10:9696 after trying for 30 seconds
    Mar 20 14:03:40 serviceapi-0.localdomain systemd[1]: neutron-server.service: main process exited, code=exited, status=1/FAILURE
    Mar 20 14:03:40 serviceapi-0.localdomain systemd[1]: Failed to start OpenStack Neutron Server.
    Mar 20 14:03:40 serviceapi-0.localdomain systemd[1]: Unit neutron-server.service entered failed state.
    Mar 20 14:03:40 serviceapi-0.localdomain systemd[1]: neutron-server.service failed.
    (truncated, view all with --long)

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-6.0.0-0.20170307170102.3134785.0rc2.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy an OSP10 overcloud with a custom role running systemd-managed services
2. Upgrade the environment to OSP11

Actual results:
Upgrade fails during major-upgrade-converge.

Expected results:
Upgrade succeeds.

Additional info:
Attaching sosreports from the serviceapi nodes.
Hi,

So the neutron-server eventually started:

Mar 20 14:05:21 serviceapi-0.localdomain systemd[1]: Starting OpenStack Neutron Server..

It took systemd ~2 min:

Mar 20 14:03:40 serviceapi-0.localdomain neutron-server[201359]: ERROR: Could not bind to fd00:fd00:fd00:2000::10:9696 after trying for 30 seconds
Mar 20 14:03:40 serviceapi-0.localdomain systemd[1]: neutron-server.service: main process exited, code=exited, status=1/FAILURE
Mar 20 14:03:40 serviceapi-0.localdomain systemd[1]: Failed to start OpenStack Neutron Server.

The error suggests that the port is still in use and being listened on:

Mar 20 14:03:40 serviceapi-0.localdomain neutron-server[201359]: ERROR: Could not bind to fd00:fd00:fd00:2000::10:9696 after trying for 30 seconds

I could indeed reproduce the same error by trying to start the neutron server while another instance was running. All in all, this rules out any mis-parsing of the IPv6 address; it is really a system issue: we can't rebind to the port because a process is still listening on it. After a while the socket is closed and systemd can restart neutron-server.

In the log we can see that:

2017-03-20 14:03:06.625 189782 INFO oslo_service.service [-] Parent process has died unexpectedly, exiting
2017-03-20 14:03:06.625 189784 INFO oslo_service.service [-] Parent process has died unexpectedly, exiting
2017-03-20 14:03:06.625 189782 DEBUG neutron.service [-] calling RpcWorker stop() stop /usr/lib/python2.7/site-packages/neutron/service.py:137
2017-03-20 14:03:06.625 189784 DEBUG neutron.service [-] calling RpcWorker stop() stop /usr/lib/python2.7/site-packages/neutron/service.py:137
2017-03-20 14:03:06.625 189783 INFO oslo_service.service [-] Parent process has died unexpectedly, exiting

I think that interferes with a proper stop of neutron, leading to this error.

So first, it doesn't look IPv6-related (but I may be wrong), but more like a transient error. Second, there is no easy way around this problem besides finding out why one of the processes didn't die properly.
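For reference, the rebind failure is easy to demonstrate outside of neutron with plain sockets. This is a minimal sketch: the loopback address and port 19696 are stand-ins for the real fd00:fd00:fd00:2000::10:9696 endpoint (which, per the analysis above, behaves the same regardless of address family):

```python
import socket

# Stand-ins for the real endpoint fd00:fd00:fd00:2000::10:9696.
ADDR, PORT = "127.0.0.1", 19696

# First "server": binds and listens, like the neutron-server worker
# that never released the socket.
s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.bind((ADDR, PORT))
s1.listen(1)

# Second "server": a fresh bind to the same address fails with
# EADDRINUSE, matching "Could not bind to ... after trying for 30 seconds".
s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s2.bind((ADDR, PORT))
    rebind_failed = False
except OSError:
    rebind_failed = True
finally:
    s2.close()
    s1.close()  # once the first socket closes, a new bind would succeed

print(rebind_failed)
```

Once the stale listener goes away (as it eventually did here), the same bind succeeds, which is why systemd could restart neutron-server two minutes later.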
Asking for some help from networking.
I took a look and didn't see anything obvious. There seem to be quite a few DuplicateMessageErrors in the logs. I don't normally see those, but I also don't usually run with multiple rabbit servers. When I first logged in, running systemctl stop neutron-server seemed to take a couple of minutes to finish; successive tries went much faster. The logs appeared to show all of the child processes exiting normally, though only the RpcWorkers actually log when their stop() method is called. It might be nice to have the API processes do the same. This will take some more investigation. I'm going on PTO for the next couple of weeks, so Assaf will make sure someone else picks this up.
The error here appears to be that systemd gives up starting Neutron because Neutron lacks connectivity to the services it depends on (like RabbitMQ or the database). This should have been solved by setting TimeoutStartSec="infinity", per https://www.freedesktop.org/software/systemd/man/systemd.service.html#TimeoutStartSec=

The problem is that EL-based systems don't support "infinity"; as per the man page:

TimeoutStartSec=
    Configures the time to wait for start-up. If a daemon service does not signal start-up completion within the configured time, the service will be considered failed and will be shut down again. Takes a unit-less value in seconds, or a time span value such as "5min 20s". Pass "0" to disable the timeout logic. Defaults to DefaultTimeoutStartSec= from the manager configuration file, except when Type=oneshot is used, in which case the timeout is disabled by default (see systemd-system.conf(5)).

We fixed this in RDO but I didn't backport it to the stable branches: https://review.rdoproject.org/r/#/c/5640/
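In practice the EL7-compatible fix amounts to passing "0" instead of "infinity". As an illustration (the drop-in path and file name below are hypothetical, not the exact contents of the RDO patch), the equivalent override on a running system would look like:

```ini
# /etc/systemd/system/neutron-server.service.d/timeout.conf (illustrative path)
[Service]
# "0" disables the start-up timeout entirely; EL7's systemd (219)
# does not yet accept the "infinity" keyword for this setting.
TimeoutStartSec=0
```

followed by systemctl daemon-reload for the override to take effect.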
Patch proposed to the ocata-rdo branch; not yet merged.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1245