Bug 1464588
Summary: | [osp11][minor update] update of non-HA overcloud stuck on controller update stage | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | Artem Hrechanychenko <ahrechan>
Component: | rhosp-director | Assignee: | Sofer Athlan-Guyot <sathlang>
Status: | CLOSED NOTABUG | QA Contact: | Amit Ugol <augol>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 11.0 (Ocata) | CC: | dbecker, lbezdick, mburns, morazi, ohochman, rhel-osp-director-maint, sasha, sathlang, shardy, tvignaud
Target Milestone: | rc | Keywords: | Triaged
Target Release: | 12.0 (Pike) | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2017-08-18 09:09:32 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Artem Hrechanychenko
2017-06-23 20:49:59 UTC
This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.

Update failed after 4 hours:

    cmd: source ~/stackrc ; openstack stack failures list overcloud
    start: 2017-06-25 22:57:03.016532
    end: 2017-06-25 22:57:19.857321
    delta: 0:00:16.840789
    stdout:
    overcloud.Controller.0.UpdateDeployment:
      resource_type: OS::Heat::SoftwareDeployment
      physical_resource_id: 1951898e-23d8-4eb1-8c85-bbe787f9e4c0
      status: UPDATE_FAILED
      status_reason: |
        UPDATE aborted
      deploy_stdout: |
        Started yum_update.sh on server 39200262-621f-427d-98cf-229b49140c83 at Sun Jun 25 17:50:56 EDT 2017
        Not running due to unset update_identifier
      deploy_stderr: |
    overcloud.Compute.0:
      resource_type: OS::TripleO::Compute
      physical_resource_id: 1a9fa9f1-0049-4231-afc2-cf091e0c2f27
      status: UPDATE_FAILED
      status_reason: |
        UPDATE aborted

    [[ previous task time: 4:01:33.617169 = 14493.62s / 14808.38s ]]

I had a look at the logs, and it seems rabbit crashed early in the update (or perhaps even before the update was started), with errors like:

    =ERROR REPORT==== 25-Jun-2017::22:59:31 ===
    Error on AMQP connection <0.4605.0> (172.17.1.13:47914 -> 172.17.1.13:5672 - neutron-server:109031:f8f7b06a-abd1-463e-928a-406867f0a948, vhost: '/', user: 'guest', state: running), channel 0:
    operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"

The update then got stuck because the yum update triggers service restarts, which failed because rabbit wasn't working. So the question is why did rabbit fail, and was it working before the update was attempted?

This is quite common for OpenStack services; we should be able to handle a rabbitmq failure and die gracefully too.

Before the update, rabbit worked as expected. During the update it was shut down and did not recover: http://pastebin.test.redhat.com/497588

The deployment is with pacemaker, which means we shut down the pacemaker services on the controller node that runs yum, but since it is a single controller this effectively takes down rabbitmq and mysql:

    [root@controller-0 ~]# pcs status
    Error: cluster is not currently running on this node

Doing so causes the OpenStack services to loop on AMQP and mariadb, and even though they get the shutdown request they neither stop nor restart. pcs cluster start and pcs resource cleanup could fix this, but I'm pretty sure galera won't survive it anyway. Is this supported by the HA/pacemaker team?

The seventh attempt at redeployment and update succeeded.

Stuck again; no workaround.
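To answer the "was it working before the update was attempted?" question, the state of the pacemaker-managed services on the single controller can be checked before and during the yum update. A minimal sketch, assuming it is run as root on controller-0 and that rabbitmq and galera are managed by pacemaker as in this deployment:

```
# Cluster view: during the failed update this reported
# "Error: cluster is not currently running on this node".
pcs status

# RabbitMQ health on this node; the broker must be up for the service
# restarts triggered by the yum update to succeed.
rabbitmqctl cluster_status

# Galera/MariaDB state; "Synced" is the healthy value for this variable.
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"
```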
This blocked us in testing the non-HA upgrade from osp11->osp112:

    2017-06-27 21:04:47.617 80982 ERROR heat.engine.service [req-2701d8db-688b-4696-be24-6199b41380c2 - - - - -] Service 62b80531-5a95-4554-9339-ef8d51b1be85 update failed: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '172.17.1.15' ([Errno 113] EHOSTUNREACH)")
    2017-06-27 21:04:47.617 80979 ERROR heat.engine.service [req-aa1cfb5a-07a8-4bfb-a9cb-77832046e1bc - - - - -] Service 70739846-253a-4816-bde7-8fa639862eae update failed: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '172.17.1.15' ([Errno 113] EHOSTUNREACH)")
    2017-06-27 21:04:47.618 80981 ERROR heat.engine.service [req-99cf4795-e6c6-46d0-8255-baae368813b7 - - - - -] Service 8eb3f8ef-06d0-4d8b-9686-62f737815ce5 update failed: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '172.17.1.15' ([Errno 113] EHOSTUNREACH)")
    2017-06-27 21:04:50.621 80984 ERROR heat.engine.service [req-75774a56-aca8-4346-bf85-d3d43c4dec6c - - - - -] Service d50eafaa-2331-4ac7-bafe-db344f13b2b2 update failed: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '172.17.1.15' ([Errno 113] EHOSTUNREACH)")
    2017-06-27 21:04:50.621 80980 ERROR heat.engine.service [req-431f04f5-28d2-4425-921b-f2a40c9f9edd - - - - -] Service 64658e0c-1d87-4d87-ac9b-16b3b1b31b91 update failed: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '172.17.1.15' ([Errno 113] EHOSTUNREACH)")
    2017-06-27 21:04:50.621 80985 ERROR heat.engine.service [req-cba05be3-4214-494b-9b68-9b488467442f - - - - -] Service adf56641-803d-49f0-a264-ba88356dfa27 update failed: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '172.17.1.15' ([Errno 113] EHOSTUNREACH)")
    2017-06-27 21:04:50.621 80978 ERROR heat.engine.service [req-2a580256-f07c-4d30-80fd-7850e663bdfa - - - - -] Service 1035c73f-d520-478b-8e66-8c6ac3ccf4d3 update failed: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '172.17.1.15' ([Errno 113] EHOSTUNREACH)")
    2017-06-27 21:04:50.621 80983 ERROR heat.engine.service [req-e4615cf4-798c-4104-a70e-afaf259d8916 - - - - -] Service 6dafb60a-54ee-4cec-b1eb-1430441b9b9b update failed: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '172.17.1.15' ([Errno 113] EHOSTUNREACH)")
    2017-06-27 21:04:59.785 80978 ERROR oslo.messaging._drivers.impl_rabbit [-] [eea6961b-ffb3-45a8-a4b7-23fda988df3e] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: 45358
    2017-06-27 21:04:59.786 80978 ERROR oslo.messaging._drivers.impl_rabbit [-] [6f5c7f3a-d0be-4b61-aa37-1058f12b7353] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: None
    2017-06-27 21:04:59.787 80978 ERROR oslo.messaging._drivers.impl_rabbit [-] [1cd0b13e-86b8-47e9-861f-6cff12f3149f] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: 45386
    2017-06-27 21:04:59.792 80983 ERROR oslo.messaging._drivers.impl_rabbit [-] [bd3be3e0-6288-47ea-bde9-ee01b5ec3832] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: 45384
    2017-06-27 21:04:59.793 80983 ERROR oslo.messaging._drivers.impl_rabbit [-] [a533e0e3-6577-40e5-93fd-c7d74c649b1c] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: 45364
    2017-06-27 21:04:59.794 80983 ERROR oslo.messaging._drivers.impl_rabbit [-] [350b7603-6228-465e-bec8-012bdbc6131b] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: 45350
    2017-06-27 21:04:59.836 80984 ERROR oslo.messaging._drivers.impl_rabbit [-] [71f28f53-986c-49de-b8e6-1b1f26a3fcf0] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: 45348
    2017-06-27 21:04:59.836 80984 ERROR oslo.messaging._drivers.impl_rabbit [-] [c8e685ec-894a-45d6-95f8-e56372372b86] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: 45368
    2017-06-27 21:04:59.837 80984 ERROR oslo.messaging._drivers.impl_rabbit [-] [6e5eb076-0e91-4a76-a6e2-669335cf40c3] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: 45382
    2017-06-27 21:04:59.901 80982 ERROR oslo.messaging._drivers.impl_rabbit [-] [6da7814b-aebf-4503-9bf1-c4d89b65b0f8] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: 45374
    2017-06-27 21:04:59.902 80980 ERROR oslo.messaging._drivers.impl_rabbit [-] [a47deabb-c688-4c91-88a2-5249b2ea7b8c] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: 45344
    2017-06-27 21:04:59.902 80982 ERROR oslo.messaging._drivers.impl_rabbit [-] [32b02e87-6008-42b9-b0af-fe20ebb06212] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: 45360
    2017-06-27 21:04:59.902 80980 ERROR oslo.messaging._drivers.impl_rabbit [-] [d80d5bb1-675e-446c-b9df-37005bbba66b] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: None
    2017-06-27 21:04:59.903 80982 ERROR oslo.messaging._drivers.impl_rabbit [-] [f1ae847e-d26f-4f50-b82d-93607d63fdc5] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: 45346
    2017-06-27 21:04:59.903 80980 ERROR oslo.messaging._drivers.impl_rabbit [-] [e3584be9-76b4-4081-9e90-bbd79febdb7c] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: 45362
    2017-06-27 21:04:59.907 80981 ERROR oslo.messaging._drivers.impl_rabbit [-] [4aff3f62-64d6-4235-b4f3-55102e87b8c0] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: None
    2017-06-27 21:04:59.921 80981 ERROR oslo.messaging._drivers.impl_rabbit [-] [a9b8fea3-88b8-4f1c-850a-26889d2163d0] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: None
    2017-06-27 21:04:59.922 80981 ERROR oslo.messaging._drivers.impl_rabbit [-] [ad9d7ac4-a016-47ac-82eb-84f015bfdfd5] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: 45342
    2017-06-27 21:05:00.038 80985 ERROR oslo.messaging._drivers.impl_rabbit [-] [d0376174-50da-41e7-a298-f3d09fafb1f4] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: 45352
    2017-06-27 21:05:00.039 80979 ERROR oslo.messaging._drivers.impl_rabbit [-] [2fd77107-0f57-4700-b09f-e00ea9e61366] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: 45366
    2017-06-27 21:05:00.047 80985 ERROR oslo.messaging._drivers.impl_rabbit [-] [6f1dba1a-906f-4405-885a-8cce351acc7b] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: None
    2017-06-27 21:05:00.047 80979 ERROR oslo.messaging._drivers.impl_rabbit [-] [d4a2ac11-8b30-4c0a-9e5b-901e5e3c5d06] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: None
    2017-06-27 21:05:00.055 80979 ERROR oslo.messaging._drivers.impl_rabbit [-] [2503334f-a607-4b04-8f26-698debe70e1d] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: None
    2017-06-27 21:05:00.055 80985 ERROR oslo.messaging._drivers.impl_rabbit [-] [fc6d1799-4576-4890-9a65-10a2c9289845] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: None

See https://bugzilla.redhat.com/show_bug.cgi?id=1464588#c6. Last time it passed only because pcs cluster start and pcs resource cleanup were run during the yum update on the controller; it will _always_ fail otherwise.

Confirmed: running pcs cluster start and pcs resource cleanup during the yum update on the controller helps.

Hi Artem, as Lukas pointed out, non-HA with pacemaker is not supported unless you apply a manual workaround; it's more of a quick dev platform. I'm closing this as not a bug, but if you still think this should be supported, we can have it as an RFE for the next release, tracked in its own bz. Thanks,

*** Bug 1463287 has been marked as a duplicate of this bug. ***
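For reference, the manual workaround confirmed above amounts to restoring the cluster on the single controller while the yum update is still in progress. A rough sketch (run as root on controller-0 in a second shell; the exact timing relative to yum_update.sh is an assumption, and as noted above this is not a supported configuration):

```
# yum_update.sh stops pacemaker on the node it updates; on a single-controller
# deployment this also takes down rabbitmq and galera, so the service restarts
# triggered by the update hang. Bring the cluster stack back up on this node.
pcs cluster start

# Wait until resources are reported again, then clear any failed actions so
# rabbitmq and galera are retried and can recover while the update continues.
pcs status
pcs resource cleanup
```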