Bug 1635665 - nova_placement container stays unhealthy (never recovers), problem connecting to MySQL after many retries
Summary: nova_placement container stays unhealthy (never recovers), problem connecting to MySQL after many retries
Keywords:
Status: CLOSED DUPLICATE of bug 1630069
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 14.0 (Rocky)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Lee Yarwood
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-10-03 13:03 UTC by Waldemar Znoinski
Modified: 2023-03-21 19:04 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-18 20:42:36 UTC
Target Upstream Version:
Embargoed:




Links:
System ID: Red Hat Issue Tracker OSP-23384 | Private: 0 | Priority: None | Status: None | Summary: None | Last Updated: 2023-03-21 19:04:15 UTC

Description Waldemar Znoinski 2018-10-03 13:03:22 UTC
Description of problem:
OSP14 + OpenDaylight deployment (OpenDaylight is probably not relevant here).


Can't spawn new VMs on the overcloud. The deployment had passed a few tempest tests, but when trying to spawn a VM manually (a few hours after tempest) the following shows up:


in nova-scheduler.log:

2018-10-03 12:47:09.796 24 DEBUG oslo_concurrency.lockutils [req-d0aa38c6-5579-40a3-a3f5-9706b6240242 a8deb6344f3a4546b786fd40da85c012 873e0d69004c4b0381c701bd2ac85bf6 - default default] Lock "placement_client" released by "nova.scheduler.client.report._create_client" :: held 0.016s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285
2018-10-03 12:47:10.677 24 ERROR nova.scheduler.client.report [req-d0aa38c6-5579-40a3-a3f5-9706b6240242 a8deb6344f3a4546b786fd40da85c012 873e0d69004c4b0381c701bd2ac85bf6 - default default] Failed to retrieve allocation candidates from placement API for filters: RequestGroup(use_same_provider=False, resources={DISK_GB:1, MEMORY_MB:512, VCPU:1}, traits=[], aggregates=[])
Got 500: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>500 Internal Server Error</title>
</head><body>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error or
misconfiguration and was unable to complete
your request.</p>
<p>Please contact the server administrator at 
 [no address given] to inform them of the time this error occurred,
 and the actions you performed just before this error.</p>
<p>More information about this error may be available
in the server error log.</p>
</body></html>
.
2018-10-03 12:47:10.678 24 DEBUG nova.scheduler.manager [req-d0aa38c6-5579-40a3-a3f5-9706b6240242 a8deb6344f3a4546b786fd40da85c012 873e0d69004c4b0381c701bd2ac85bf6 - default default] Got no allocation candidates from the Placement API. This could be due to insufficient resources or a temporary occurrence as compute nodes start up. select_destinations /usr/lib/python2.7/site-packages/nova/scheduler/manager.py:150
2018-10-03 12:47:10.706 24 ERROR oslo.messaging._drivers.impl_rabbit [req-d0aa38c6-5579-40a3-a3f5-9706b6240242 a8deb6344f3a4546b786fd40da85c012 873e0d69004c4b0381c701bd2ac85bf6 - default default] [3d3b34ca-779a-4eef-ac40-5a87c12c02ea] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.: error: [Errno 111] ECONNREFUSED
2018-10-03 12:47:11.727 24 ERROR oslo.messaging._drivers.impl_rabbit [req-d0aa38c6-5579-40a3-a3f5-9706b6240242 a8deb6344f3a4546b786fd40da85c012 873e0d69004c4b0381c701bd2ac85bf6 - default default] [3d3b34ca-779a-4eef-ac40-5a87c12c02ea] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.: error: [Errno 111] ECONNREFUSED


### NOTE THE LONG-LASTING UNHEALTHY STATE
[root@controller-0 ~]# docker ps -a | grep -i nova_placement 
ef1df61022c5        192.168.24.1:8787/rhosp14/openstack-nova-placement-api:2018-09-26.1            "kolla_start"            23 hours ago        Up 14 hours (unhealthy)                       nova_placement
8dac25359c8c        192.168.24.1:8787/rhosp14/openstack-nova-placement-api:2018-09-26.1            "/bin/bash -c 'cho..."   23 hours ago        Exited (0) 23 hours ago                       nova_placement_init_log


docker inspect nova_placement shows:
           "Health": {
                "Status": "unhealthy",
                "FailingStreak": 1742,
                "Log": [
                    {
                        "Start": "2018-10-03T13:27:13.152906462+01:00",
                        "End": "2018-10-03T13:27:13.453833088+01:00",
                        "ExitCode": 1,
                        "Output": "\n500 172.17.1.22:8778 0.010 seconds\ncurl: (22) The requested URL returned error: 500 Internal Server Error\n"
                    },
                    {
                        "Start": "2018-10-03T13:27:43.454237889+01:00",
                        "End": "2018-10-03T13:27:43.849069205+01:00",
                        "ExitCode": 1,
                        "Output": "\n500 172.17.1.22:8778 0.009 seconds\ncurl: (22) The requested URL returned error: 500 Internal Server Error\n"
                    },
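
The healthcheck itself is just an HTTP probe run every ~30 seconds against the local placement endpoint (the Output field above shows it is curl-based), so as long as the API keeps answering 500 the container is re-marked unhealthy and FailingStreak keeps climbing. A minimal sketch in Python of what the check effectively does; the URL is copied from the output above, everything else is illustrative and not the actual script shipped in the image:

# Minimal sketch of the container healthcheck logic (illustrative only;
# the real check in the image is a curl-based script).
import sys
import urllib2  # Python 2, matching the container runtime seen in the logs

PLACEMENT_URL = "http://172.17.1.22:8778/"

try:
    code = urllib2.urlopen(PLACEMENT_URL, timeout=10).getcode()
except urllib2.HTTPError as exc:
    code = exc.code            # 4xx/5xx responses arrive as exceptions
except Exception:
    sys.exit(1)                # connection failure -> unhealthy

# Any non-2xx answer (such as the 500 above) marks the container unhealthy,
# so the streak grows until the API itself starts answering 200 again.
sys.exit(0 if 200 <= code < 300 else 1)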



It looks like nova placement 'died' after many failed attempts to connect to MySQL and never recovered. From the placement log:

2018-10-02 22:52:15.303 16 DEBUG nova.api.openstack.placement.wsgi [-] key_manager.fixed_key          = None log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:3031
2018-10-02 22:52:15.304 16 DEBUG nova.api.openstack.placement.wsgi [-] osapi_v21.project_id_regex     = None log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:3031
2018-10-02 22:52:15.305 16 DEBUG nova.api.openstack.placement.wsgi [-] ******************************************************************************** log_opt_values /usr/lib/python2.7/site-packages/oslo_config/cfg.py:3033
2018-10-02 22:52:15.340 16 WARNING keystonemiddleware.auth_token [-] AuthToken middleware is set with keystone_authtoken.service_token_roles_required set to False. This is backwards compatible but deprecated behaviour. Please set this to True.
2018-10-02 22:52:15.350 16 DEBUG oslo_concurrency.lockutils [-] Acquired semaphore "trait_sync" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:212
2018-10-02 22:52:15.375 16 WARNING oslo_db.sqlalchemy.engines [-] SQL connection failed. 10 attempts left.: DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: http://sqlalche.me/e/e3q8)
2018-10-02 22:52:15.378 17 WARNING oslo_db.sqlalchemy.engines [-] SQL connection failed. 2 attempts left.: DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: http://sqlalche.me/e/e3q8)
2018-10-02 22:52:24.279 14 WARNING oslo_db.sqlalchemy.engines [-] SQL connection failed. 5 attempts left.: DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: http://sqlalche.me/e/e3q8)
2018-10-02 22:52:25.386 17 WARNING oslo_db.sqlalchemy.engines [-] SQL connection failed. 1 attempts left.: DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: http://sqlalche.me/e/e3q8)
2018-10-02 22:52:25.389 16 WARNING oslo_db.sqlalchemy.engines [-] SQL connection failed. 9 attempts left.: DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: http://sqlalche.me/e/e3q8)
2018-10-02 22:52:34.291 14 WARNING oslo_db.sqlalchemy.engines [-] SQL connection failed. 4 attempts left.: DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: http://sqlalche.me/e/e3q8)
2018-10-02 22:52:35.397 17 DEBUG oslo_concurrency.lockutils [-] Releasing semaphore "trait_sync" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:228
2018-10-02 22:52:35.400 16 WARNING oslo_db.sqlalchemy.engines [-] SQL connection failed. 8 attempts left.: DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: http://sqlalche.me/e/e3q8)
2018-10-02 22:52:44.304 14 WARNING oslo_db.sqlalchemy.engines [-] SQL connection failed. 3 attempts left.: DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: http://sqlalche.me/e/e3q8)
2018-10-02 22:52:45.412 16 WARNING oslo_db.sqlalchemy.engines [-] SQL connection failed. 7 attempts left.: DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: http://sqlalche.me/e/e3q8)
2018-10-02 22:52:54.310 14 WARNING oslo_db.sqlalchemy.engines [-] SQL connection failed. 2 attempts left.: DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: http://sqlalche.me/e/e3q8)
2018-10-02 22:52:55.425 16 WARNING oslo_db.sqlalchemy.engines [-] SQL connection failed. 6 attempts left.: DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: http://sqlalche.me/e/e3q8)
2018-10-02 22:53:04.322 14 WARNING oslo_db.sqlalchemy.engines [-] SQL connection failed. 1 attempts left.: DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: http://sqlalche.me/e/e3q8)
2018-10-02 22:53:05.440 16 WARNING oslo_db.sqlalchemy.engines [-] SQL connection failed. 5 attempts left.: DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: http://sqlalche.me/e/e3q8)
2018-10-02 22:53:14.333 14 DEBUG oslo_concurrency.lockutils [-] Releasing semaphore "trait_sync" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:228
2018-10-02 22:53:15.453 16 WARNING oslo_db.sqlalchemy.engines [-] SQL connection failed. 4 attempts left.: DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '172.17.1.16' ([Errno 111] Connection refused)") (Background on this error at: http://sqlalche.me/e/e3q8)
2018-10-02 22:53:26.463 16 WARNING oslo_db.sqlalchemy.engines [-] SQL connection failed. 3 attempts left.: DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '172.17.1.16' ([Errno 113] No route to host)") (Background on this error at: http://sqlalche.me/e/e3q8)
2018-10-02 22:53:38.487 16 WARNING oslo_db.sqlalchemy.engines [-] SQL connection failed. 2 attempts left.: DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '172.17.1.16' ([Errno 113] No route to host)") (Background on this error at: http://sqlalche.me/e/e3q8)
2018-10-02 22:53:50.511 16 WARNING oslo_db.sqlalchemy.engines [-] SQL connection failed. 1 attempts left.: DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '172.17.1.16' ([Errno 113] No route to host)") (Background on this error at: http://sqlalche.me/e/e3q8)
2018-10-02 22:54:00.521 16 DEBUG oslo_concurrency.lockutils [-] Releasing semaphore "trait_sync" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:228


NOTE: there was a serious networking issue (tracked in another bug) between the overcloud controllers, hence the 'no route to host' failures, but I think nova placement should recover from this instead of staying in an unhealthy state indefinitely.
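
For context on why it never comes back: the log above shows oslo.db's bounded startup retry. Each worker burns through a fixed budget of connection attempts, spaced roughly 10 seconds apart, and then gives up for good. A rough sketch of that behaviour in Python; this is not the actual oslo.db code and the function/parameter names are made up for illustration:

# Rough sketch of the bounded retry behaviour visible in the log above.
# NOT the oslo.db implementation; names here are illustrative only.
import time

def connect_with_retries(connect, max_retries=10, retry_interval=10):
    """Call connect() until it succeeds or the retry budget runs out.

    oslo.db exposes similar knobs as [database]/max_retries and
    [database]/retry_interval; max_retries = -1 means retry forever.
    """
    attempt = 0
    while True:
        try:
            return connect()
        except Exception:
            attempt += 1
            if max_retries >= 0 and attempt >= max_retries:
                raise  # budget exhausted: this worker stays broken until restarted
            time.sleep(retry_interval)

If that budget runs out during the "trait_sync" startup step shown above, the wsgi workers are left in a broken state, which would explain the permanently unhealthy container; whether raising max_retries would paper over this particular failure is only an assumption.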


Version-Release number of selected component (if applicable):
python2-novaclient-11.0.0-0.20180809174649.f1005ce.el7ost.noarch
python-nova-18.0.1-0.20180922012453.518fdc7.el7ost.noarch
openstack-nova-placement-api-18.0.1-0.20180922012453.518fdc7.el7ost.noarch
puppet-nova-13.3.1-0.20180917153244.6fdb591.el7ost.noarch
openstack-nova-common-18.0.1-0.20180922012453.518fdc7.el7ost.noarch


How reproducible:
100%



Steps to Reproduce:
1.
2.
3.

Actual results:
The nova_placement container does not recover from the 'unhealthy' state; the healthcheck keeps failing indefinitely.

Expected results:
nova_placement recovers and goes back to a healthy state once MySQL is reachable again

Additional info:

Comment 2 Waldemar Znoinski 2018-10-18 20:42:36 UTC

*** This bug has been marked as a duplicate of bug 1630069 ***

