Description of problem:
While trying to create a large number of VMs, some of them get stuck in the BUILD state. The only relevant error we can see comes from nova-conductor:

Unable to connect to AMQP server on HOSTNAME:5672 after None tries: 'NoneType' object has no attribute '__getitem__'

Restarting all Nova services seemed to resolve the issue. However, it recurred overnight, and we saw the same errors in heat-engine.log.

Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform release 13.0.7 (Queens)
python2-oslo-messaging-5.35.4-1.el7ost.noarch

How reproducible:
Tough to reproduce.

Steps to Reproduce:
Initially, before restarting Nova, we could reproduce it by running:

for i in {10..60}; do openstack server create --image a3806e23-1df5-40b0-baf3-f78ef7b1bf4c --nic net-id=54d8b69b-e8a0-4ce7-87ca-37b34fc368a2 --flavor VM1-flavor --availability-zone nova:compute-$i; done

Adding --wait made it work a lot better, but we still observed one or two failures that way.

Actual results:
Some instances get stuck in the BUILD state and never move on.

Expected results:
Instances either complete or go to ERROR.

Additional info:
After restarting all nova_* containers, we were able to execute:

for i in {10..100}; do openstack server create --image a3806e23-1df5-40b0-baf3-f78ef7b1bf4c --nic net-id=54d8b69b-e8a0-4ce7-87ca-37b34fc368a2 --flavor VM1-flavor; done

and have all instances build successfully.
This BZ is a clone of https://bugzilla.redhat.com/show_bug.cgi?id=1747226

I created this one in order to cross-tag the package between OSP13 and OSP14, as is already the case for python-amqp. For further reading about this bug, please take a look at the comments of https://bugzilla.redhat.com/show_bug.cgi?id=1747226

My py-amqp fix is now merged [1].

The python-amqp packages for OSP13 and OSP14 are cross-tagged, so I'll use this duplicate BZ to backport my fix via OSP14 first and then cross-tag that version with OSP13 in a second step.

As the title of this BZ says, its main topic is the python-amqp issue fixed by my patch [1]. For the SSL part, I have already described how to hotfix it in my previous comments (cf. https://bugzilla.redhat.com/show_bug.cgi?id=1747226#c16 and https://bugzilla.redhat.com/show_bug.cgi?id=1733930#c29).

If you want to hotfix the customer env with my py-amqp patch without bumping the package version, follow the instructions below.

How to hotfix the "NoneType __getitem__" part
=============================================

Apply the following patch to `/usr/lib/python2.7/site-packages/amqp/connection.py`:

```
500,502c500,504
<         return self.channels[channel_id].dispatch_method(
<             method_sig, payload, content,
<         )
---
>         if self.channels is not None:
>             return self.channels[channel_id].dispatch_method(
>                 method_sig, payload, content,
>             )
>         raise RecoverableConnectionError('Connection already closed')
```

After applying the patch, your file should look similar to:

```
$ vi /usr/lib/python2.7/site-packages/amqp/connection.py +499
    def on_inbound_method(self, channel_id, method_sig, payload, content):
        if self.channels is not None:
            return self.channels[channel_id].dispatch_method(
                method_sig, payload, content,
            )
        raise RecoverableConnectionError('Connection already closed')
```

To apply this patch to all your containers, you can follow the same approach that I described in my comment https://bugzilla.redhat.com/show_bug.cgi?id=1747226#c16 (cf. the docker part, etc.).

I think you need to patch every service that uses python-amqp on the controllers and computes, since this kind of issue can occur after a network problem and so any of the services can be affected (cf. your logs with nova, heat, neutron, etc.).

[1] https://github.com/celery/py-amqp/pull/289
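
For anyone who wants to see the effect of the guard in isolation, here is a minimal sketch. It does not import py-amqp: `FakeConnection`, `FakeChannel` and the local `RecoverableConnectionError` are hypothetical stand-ins, and the sketch assumes that the channel table is set to None once the connection has been torn down, which is what the error message suggests.

```python
class RecoverableConnectionError(Exception):
    """Local stand-in for the py-amqp exception of the same name."""


class FakeChannel(object):
    """Hypothetical channel that only echoes what it was asked to dispatch."""

    def dispatch_method(self, method_sig, payload, content):
        return ('dispatched', method_sig)


class FakeConnection(object):
    """Hypothetical, stripped-down model of the connection object."""

    def __init__(self):
        self.channels = {1: FakeChannel()}

    def close(self):
        # Assumption: teardown drops the channel table by setting it to None.
        self.channels = None

    def on_inbound_method_unpatched(self, channel_id, method_sig, payload, content):
        # Original behaviour: indexes self.channels unconditionally.
        return self.channels[channel_id].dispatch_method(
            method_sig, payload, content,
        )

    def on_inbound_method_patched(self, channel_id, method_sig, payload, content):
        # Patched behaviour: only dispatch while the channel table exists.
        if self.channels is not None:
            return self.channels[channel_id].dispatch_method(
                method_sig, payload, content,
            )
        raise RecoverableConnectionError('Connection already closed')


conn = FakeConnection()
conn.close()

# Unpatched: indexing None raises TypeError. On Python 2 the message is
# "'NoneType' object has no attribute '__getitem__'"; Python 3 words it
# differently.
try:
    conn.on_inbound_method_unpatched(1, (60, 60), None, None)
    unpatched_error = None
except TypeError as exc:
    unpatched_error = type(exc).__name__

# Patched: the caller gets an explicit, connection-level error instead.
try:
    conn.on_inbound_method_patched(1, (60, 60), None, None)
    patched_error = None
except RecoverableConnectionError as exc:
    patched_error = type(exc).__name__

print(unpatched_error, patched_error)
```

The point of the change is that a frame arriving on an already closed connection surfaces as a connection error that the caller can treat as recoverable, rather than as an unrelated TypeError.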
Fixed in version python-amqp-2.3.2-5.el7ost https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=23466544
Verified.

(undercloud) [stack@undercloud-0 ~]$ rhos-release -L
Installed repositories (rhel-7.7):
  14
  ceph-3
  ceph-osd-3
  rhel-7.7
(undercloud) [stack@undercloud-0 ~]$ cat core_puddle_version
2019-10-21.1
(undercloud) [stack@undercloud-0 ~]$ rpm -qa | grep amqp
python2-amqp-2.3.2-5.el7ost.noarch

More than 60 VMs were created, the computes were hard-rebooted while the VMs were being created, and the env didn't show the bug's characteristics:

(overcloud) [stack@undercloud-0 ~]$ ansible controller -mshell -b -a'grep -ir NoneType /var/log/containers/nova/||echo "no NoneType errors found in nova"'
 [WARNING]: Found both group and host with same name: undercloud

controller-1 | SUCCESS | rc=0 >>
no NoneType errors found in nova

controller-2 | SUCCESS | rc=0 >>
no NoneType errors found in nova

controller-0 | SUCCESS | rc=0 >>
no NoneType errors found in nova

#test instance creation with disruptions
$ for i in {1..86}; do openstack server create --image cirros --flavor m1.nano compute-$i; done

Meanwhile, ssh to the computes and run:
echo b >/proc/sysrq-trigger

Check the number of created instances:
(overcloud) [stack@undercloud-0 ~]$ nova list|wc -l
86
(overcloud) [stack@undercloud-0 ~]$ nova list|grep BUILD||echo "no instances stuck in BUILD mode"
no instances stuck in BUILD mode

(overcloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+------------+--------+------------+-------------+----------+
| ID                                   | Name       | Status | Task State | Power State | Networks |
+--------------------------------------+------------+--------+------------+-------------+----------+
| 0a25b50c-9482-42fa-92e2-01fc3939e83a | compute-10 | ACTIVE | -          | Running     |          |
| 2f99780e-aba7-43b0-950e-c865b9830bb2 | compute-10 | ACTIVE | -          | Running     |          |
| 7dd82f33-4e9e-4427-acbe-2ba4708322e0 | compute-10 | ACTIVE | -          | Running     |          |
| e9046cec-55b7-44fe-a429-b900502b97c9 | compute-10 | ACTIVE | -          | Running     |          |
| 22db217d-0f02-4b6e-8fa7-4f0af999717b | compute-11 | ACTIVE | -          | Running     |          |
| 8d5fcc35-6d96-45e8-8ac4-87940235eb8e | compute-11 | ACTIVE | -          | Running     |          |
| 9972df49-bf73-4bb0-861a-fc26d0409cfb | compute-11 | ERROR  | -          | NOSTATE     |          |
| b1b7fffb-7494-4e03-b076-0e52646b6fe0 | compute-11 | ACTIVE | -          | Running     |          |
| 0e439eae-5207-48ab-9b10-2f68c05fa6fc | compute-12 | ACTIVE | -          | Running     |          |
| 682d8532-e7ba-455f-9172-5e74946876d0 | compute-12 | ACTIVE | -          | Running     |          |
| aaeb6364-e75e-439c-8839-93044141f82f | compute-12 | ACTIVE | -          | Running     |          |
| d01d3f3b-4ccf-4af3-b400-31e91f10987a | compute-12 | ACTIVE | -          | Running     |          |
| 2f69751f-723d-4300-a2a2-2bf17c568931 | compute-13 | ACTIVE | -          | Running     |          |
| 5cd5c640-19a1-4f9d-ae78-99be667c6cf1 | compute-13 | ACTIVE | -          | Running     |          |
| 90b2f0c2-d767-4bfa-b6f8-05ba269a3d82 | compute-13 | ACTIVE | -          | Running     |          |
| cf5088cb-2eed-40a8-adec-0596092d3583 | compute-13 | ACTIVE | -          | Running     |          |
| 2b9ce7bb-ac0a-447c-8741-bca3e7363820 | compute-14 | ACTIVE | -          | Running     |          |
| 85ad57af-531d-4b26-805f-0133344fd870 | compute-14 | ACTIVE | -          | Running     |          |
| a447621d-8f6b-47b0-935e-4001b82fcd9c | compute-14 | ACTIVE | -          | Running     |          |
| eeea258c-3b05-4b25-8db4-e1912bcbf234 | compute-14 | ACTIVE | -          | Running     |          |
| 07db7ca9-0f13-4cc7-91ba-f3d198632fae | compute-15 | ACTIVE | -          | Running     |          |
| a550d027-c3d0-46ca-aac6-f54c36156acb | compute-15 | ACTIVE | -          | Running     |          |
| b0002e0b-97d3-44df-93ec-5012a15d06e3 | compute-15 | ACTIVE | -          | Running     |          |
| b24b6faa-5f54-4133-aa3b-043133df8afe | compute-15 | ACTIVE | -          | Running     |          |
| 120361e6-8729-484f-bbff-c32668f26402 | compute-16 | ACTIVE | -          | Running     |          |
| 2ae4d1c5-976e-424d-a1fd-0c32da007184 | compute-16 | ACTIVE | -          | Running     |          |
| 8c157eff-1cd1-4d3a-937b-7fef62c10502 | compute-16 | ACTIVE | -          | Running     |          |
| aa15b5ad-6185-4986-866c-c808c19b741c | compute-16 | ACTIVE | -          | Running     |          |
| 89f68275-3099-4bda-bf17-b01b4104cd85 | compute-17 | ACTIVE | -          | Running     |          |
| 8f8fb2b7-a927-4058-90ac-c1c9e5e01cc3 | compute-17 | ACTIVE | -          | Running     |          |
| b2cfad18-5fc2-46d0-8739-c5eabb756115 | compute-17 | ACTIVE | -          | Running     |          |
| e40f14fb-5ee1-491a-957a-1e3e2cf17e73 | compute-17 | ACTIVE | -          | Running     |          |
| 21a3f47b-87e7-4e5b-b8a3-d4f529eefdf6 | compute-18 | ACTIVE | -          | Running     |          |
| 2c44ce44-edc8-4f3d-9b9f-30d8c9fc0125 | compute-18 | ACTIVE | -          | Running     |          |
| 45240480-1c8a-450c-974a-0647dc6ef9ff | compute-18 | ACTIVE | -          | Running     |          |
| 4df70bf2-6e40-49fc-8bce-d929059e6837 | compute-18 | ACTIVE | -          | Running     |          |
| 37c3b44a-4be4-475f-a28c-52d81c224299 | compute-19 | ACTIVE | -          | Running     |          |
| e6d43b7c-e0d3-43bd-a0d5-556c4773f51d | compute-19 | ACTIVE | -          | Running     |          |
| f0f0a66e-0153-42b5-b3fa-7c976a78d8f7 | compute-19 | ACTIVE | -          | Running     |          |
| f725b389-00c0-4189-aa92-0bb0b7a3817b | compute-19 | ACTIVE | -          | Running     |          |
| 290430f6-6147-4ae9-97a8-1438627a6769 | compute-20 | ACTIVE | -          | Running     |          |
| 5024727c-9e58-40ec-924d-d50af8a4a42b | compute-20 | ACTIVE | -          | Running     |          |
| 9fbfaed3-ad8d-4580-9d34-ab40f9f66188 | compute-21 | ACTIVE | -          | Running     |          |
| bd893c95-4922-467f-a976-88ec37e08773 | compute-21 | ACTIVE | -          | Running     |          |
| 8ed1dd80-310d-4560-99d2-01876c9bc1b8 | compute-22 | ACTIVE | -          | Running     |          |
| cfa2a0a2-5e64-41bf-af88-f4587fb6ec3e | compute-22 | ACTIVE | -          | Running     |          |
| 4f487283-7c5f-41d8-a859-897ebcfeb05d | compute-23 | ACTIVE | -          | Running     |          |
| 6db33352-b91b-4128-8bb3-b77f1d7590f9 | compute-23 | ACTIVE | -          | Running     |          |
| 941b56e3-75e6-4fe4-a490-888307ddfe1d | compute-24 | ACTIVE | -          | Running     |          |
| bccc0273-8395-4624-a226-294dacae2a39 | compute-24 | ACTIVE | -          | Running     |          |
| 39a8d9bb-e5ec-4f45-8031-373256d3289c | compute-25 | ACTIVE | -          | Running     |          |
| fe771534-8bf0-4da8-9060-048f3177c09c | compute-25 | ACTIVE | -          | Running     |          |
| 097ed0e9-021d-47c3-ae33-7b12c47ff93f | compute-26 | ACTIVE | -          | Running     |          |
| ec0b2521-d740-4e37-9925-de6314f37b18 | compute-26 | ACTIVE | -          | Running     |          |
| 47643512-4f96-43e1-9d3f-b5591285ed50 | compute-27 | ACTIVE | -          | Running     |          |
| ec0bb841-c526-4bd6-9d37-94fda5daa822 | compute-27 | ACTIVE | -          | Running     |          |
| 07d1dce9-5b68-43f5-8b28-b10aa75f9ff4 | compute-28 | ACTIVE | -          | Running     |          |
| 6459bbc2-ad4f-4ea6-9dc8-3e3f13e73755 | compute-28 | ACTIVE | -          | Running     |          |
| 6413ec10-6364-47cf-8a01-2957cfc6f446 | compute-29 | ACTIVE | -          | Running     |          |
| 87c0ef18-f758-405b-b4a9-5ae6c7abb2ee | compute-29 | ACTIVE | -          | Running     |          |
| 94868ba0-296b-4ef4-86f9-f8ff9bffbc31 | compute-30 | ACTIVE | -          | Running     |          |
| ec3b7fdb-614e-4f23-bc4b-eba57d176e3d | compute-30 | ACTIVE | -          | Running     |          |
| 47ada1bb-84e7-47ef-9655-e41c721edcea | compute-31 | ACTIVE | -          | Running     |          |
| 9c7d7efe-6326-4f34-9c94-3f89cc05cdae | compute-31 | ACTIVE | -          | Running     |          |
| 58b66775-9d3a-43e4-a981-6dc3e82714ef | compute-32 | ACTIVE | -          | Running     |          |
| 6175d213-8a7e-40a3-8fa1-3a0727e19a9f | compute-32 | ACTIVE | -          | Running     |          |
| 4b1c4c94-dba1-4cf9-801a-da781967ad57 | compute-33 | ACTIVE | -          | Running     |          |
| c5784af8-b2ca-4f23-9d0b-4ba162cf725c | compute-33 | ACTIVE | -          | Running     |          |
| 242c1e08-5e00-465a-b369-5e801246f4da | compute-34 | ACTIVE | -          | Running     |          |
| e7062ec4-d16e-445c-9daa-1d628a49d87b | compute-34 | ACTIVE | -          | Running     |          |
| 4556ba64-7853-4b8b-99cb-3b84f521375a | compute-35 | ACTIVE | -          | Running     |          |
| 5ba6f46f-3b87-4512-80b4-0a289e19dbe3 | compute-35 | ACTIVE | -          | Running     |          |
| 6e337292-7db6-40f9-b2fe-f8a9364cd8ae | compute-36 | ACTIVE | -          | Running     |          |
| 88f39712-9fd1-4424-abed-af37997c2557 | compute-36 | ACTIVE | -          | Running     |          |
| b4dbcfae-ac32-4c71-8552-fb572235c34e | compute-37 | ACTIVE | -          | Running     |          |
| da00a078-2cbf-4744-a45e-98c2feaef70d | compute-37 | ACTIVE | -          | Running     |          |
| 255db0e3-85e7-41e5-a8a1-b4b330111664 | compute-38 | ACTIVE | -          | Running     |          |
| 4e7386e2-4bfa-41a9-b05a-5497ace37219 | compute-38 | ACTIVE | -          | Running     |          |
| 8d2c9691-8e93-4b06-91ed-71498e3a38e4 | compute-39 | ACTIVE | -          | Running     |          |
| da2e98aa-fb63-49d9-8933-2b69a290663a | compute-39 | ACTIVE | -          | Running     |          |
| 17b7b048-1cf3-46ba-87d7-de14d8c80420 | compute-40 | ACTIVE | -          | Running     |          |
| 910910c6-64c8-402b-a91a-4a6266a923e5 | compute-40 | ACTIVE | -          | Running     |          |
+--------------------------------------+------------+--------+------------+-------------+----------+
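
As a complement to the grep for BUILD above, the Status column of the `nova list` output can be tallied with a small script. This is only a sketch: `count_statuses` is a hypothetical helper, it assumes the ASCII table layout shown above with Status as the third column, and the IDs in the sample are made up for illustration.

```python
from collections import Counter


def count_statuses(table_text):
    """Tally the Status column of `nova list`-style ASCII table output."""
    counts = Counter()
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith('|'):
            continue  # skip +----+ borders and blank lines
        cells = [cell.strip() for cell in line.strip('|').split('|')]
        if len(cells) < 3 or cells[0] == 'ID':
            continue  # skip the header row and malformed lines
        counts[cells[2]] += 1  # assumption: Status is the third column
    return counts


# Made-up sample in the same layout; in practice the text would come from
# the saved output of `nova list`.
sample = """
+----+------------+--------+
| ID | Name       | Status |
+----+------------+--------+
| a1 | compute-10 | ACTIVE |
| b2 | compute-11 | ERROR  |
| c3 | compute-12 | ACTIVE |
+----+------------+--------+
"""

counts = count_statuses(sample)
stuck_in_build = counts.get('BUILD', 0)
print(dict(counts), 'stuck in BUILD:', stuck_in_build)
```

A non-zero count for BUILD after the disruption test has settled would point to the original symptom.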
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:3747