Bug 1434009
Summary: | neutron: OSP10 -> OSP11 upgrade fails due to neutron-server fails to start when rabbit or db is inaccessible | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Marius Cornea <mcornea> |
Component: | openstack-neutron | Assignee: | Jakub Libosvar <jlibosva> |
Status: | CLOSED ERRATA | QA Contact: | Marius Cornea <mcornea> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 11.0 (Ocata) | CC: | amuller, aschultz, chrisw, dbecker, jcoufal, jlibosva, jschluet, mburns, morazi, nyechiel, oblaut, rhel-osp-director-maint, sathlang, srevivo, twilson |
Target Milestone: | rc | Keywords: | Triaged |
Target Release: | 11.0 (Ocata) | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | openstack-neutron-10.0.0-8.el7ost | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2017-05-17 20:09:48 UTC | Type: | Bug |
Description
Marius Cornea
2017-03-20 14:24:10 UTC
Hi,

So the neutron-server eventually started:

Mar 20 14:05:21 serviceapi-0.localdomain systemd[1]: Starting OpenStack Neutron Server..

It took systemd ~2 min:

Mar 20 14:03:40 serviceapi-0.localdomain neutron-server[201359]: ERROR: Could not bind to fd00:fd00:fd00:2000::10:9696 after trying for 30 seconds
Mar 20 14:03:40 serviceapi-0.localdomain systemd[1]: neutron-server.service: main process exited, code=exited, status=1/FAILURE
Mar 20 14:03:40 serviceapi-0.localdomain systemd[1]: Failed to start OpenStack Neutron Server.

The error suggests that the port is still in use and being listened on:

Mar 20 14:03:40 serviceapi-0.localdomain neutron-server[201359]: ERROR: Could not bind to fd00:fd00:fd00:2000::10:9696 after trying for 30 seconds

I could indeed reproduce the same error by trying to start neutron-server while another instance was running. All in all, this rules out a problem with parsing the IPv6 address; it really is a system issue. We can't rebind to the address because a process is still listening on it. After a while the socket is closed and systemd can restart neutron-server.

In the log we can see:

2017-03-20 14:03:06.625 189782 INFO oslo_service.service [-] Parent process has died unexpectedly, exiting
2017-03-20 14:03:06.625 189784 INFO oslo_service.service [-] Parent process has died unexpectedly, exiting
2017-03-20 14:03:06.625 189782 DEBUG neutron.service [-] calling RpcWorker stop() stop /usr/lib/python2.7/site-packages/neutron/service.py:137
2017-03-20 14:03:06.625 189784 DEBUG neutron.service [-] calling RpcWorker stop() stop /usr/lib/python2.7/site-packages/neutron/service.py:137
2017-03-20 14:03:06.625 189783 INFO oslo_service.service [-] Parent process has died unexpectedly, exiting

I think that interferes with a clean stop of neutron, leading to this error.

So first, it doesn't look IPv6 related (but I may be wrong), but more like a transient error. Second, there is no easy way around this problem besides finding out why one of the processes didn't die properly.
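The "Could not bind" failure described above can be reproduced in isolation, without neutron. The following is only a minimal sketch: the helper name `can_bind` is hypothetical, and it uses an IPv4 loopback address with a kernel-chosen port instead of the deployment's IPv6 address and port 9696.

```python
import errno
import socket


def can_bind(host: str, port: int, family: int = socket.AF_INET) -> bool:
    """Return True if (host, port) can be bound right now, False if it is in use.

    Hypothetical helper for illustration; not part of neutron or oslo.
    """
    probe = socket.socket(family, socket.SOCK_STREAM)
    try:
        probe.bind((host, port))
        return True
    except OSError as exc:
        if exc.errno == errno.EADDRINUSE:
            return False
        raise
    finally:
        probe.close()


if __name__ == "__main__":
    # Hold a listening socket, as a still-running server process would.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    print("bindable while listener is open:", can_bind("127.0.0.1", port))
    listener.close()
    print("bindable after listener closed:", can_bind("127.0.0.1", port))
```

Running a check like this against the real address (with `family=socket.AF_INET6`) just before a restart would show whether an old process still holds the socket.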
Asking for some help from networking.

I took a look and didn't see anything obvious. There seem to be quite a few DuplicateMessageErrors in the logs. I don't normally see those, but I also don't usually run with multiple rabbit servers. When I ran systemctl stop neutron-server after first logging in, it seemed to take a couple of minutes to finish. Successive tries went much faster. The logs appeared to show all of the child processes exiting normally, though only the RpcWorkers actually log when their stop() method is called. It might be nice to have the API processes do the same.

This will take some more investigation. I'm going on PTO for the next couple of weeks, so Assaf will make sure someone else picks this up.

The error here appears to be that systemd fails to start Neutron because Neutron lacks connectivity to the services it depends on (such as rabbit or the database). This should have been solved by setting TimeoutStartSec="infinity" according to https://www.freedesktop.org/software/systemd/man/systemd.service.html#TimeoutStartSec=

The problem is that EL-based systems don't support "infinity", as per the man page:

TimeoutStartSec= Configures the time to wait for start-up. If a daemon service does not signal start-up completion within the configured time, the service will be considered failed and will be shut down again. Takes a unit-less value in seconds, or a time span value such as "5min 20s". Pass "0" to disable the timeout logic. Defaults to DefaultTimeoutStartSec= from the manager configuration file, except when Type=oneshot is used, in which case the timeout is disabled by default (see systemd-system.conf(5)).

We fixed this in RDO but I didn't backport it to stable branches: https://review.rdoproject.org/r/#/c/5640/

Patch proposed to the ocata-rdo branch; not yet merged.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1245
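For readers revisiting the TimeoutStartSec discussion in the comments above, the choice between "infinity" and "0" can be sketched as follows. This is not the RDO patch, whose contents are not shown in this report. The function names are hypothetical, and the version threshold of 229 is an assumption about when systemd began accepting "infinity"; the report itself only states that EL-based systems require "0".

```python
def timeout_start_value(systemd_version: int) -> str:
    """Pick the value that disables the start-up timeout.

    Assumption (not from the bug report): "infinity" is understood
    from systemd 229 onward; older releases disable the timeout with "0",
    as the quoted man page describes.
    """
    return "infinity" if systemd_version >= 229 else "0"


def render_dropin(systemd_version: int) -> str:
    """Render the body of a systemd drop-in that disables the start timeout."""
    return "[Service]\nTimeoutStartSec={}\n".format(
        timeout_start_value(systemd_version)
    )


if __name__ == "__main__":
    # 219 stands in for an older systemd release (illustrative value only).
    print(render_dropin(219), end="")
```

The rendered text would go into a drop-in file for neutron-server.service, followed by `systemctl daemon-reload`; the exact file location depends on the deployment.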