Description of problem:
After a minor update, it looks like the cloud is not functional. It is not possible to boot instances:

+--------------------------------------+-------------+--------+----------+--------+-----------+
| ID                                   | Name        | Status | Networks | Image  | Flavor    |
+--------------------------------------+-------------+--------+----------+--------+-----------+
| e765c4cc-aeb0-4179-a6e3-4fd587dc8beb | VM_net-64-1 | BUILD  |          | rhel75 | m1.medium |
| 32d8839a-7f2e-4824-a035-41557f7bdf0c | VM_net-64-2 | BUILD  |          | rhel74 | m1.medium |
+--------------------------------------+-------------+--------+----------+--------+-----------+

nova-scheduler log excerpt:

"host_instance" released by "nova.scheduler.host_manager.update_instance_info" :: held 0.001s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285
2018-07-12 06:10:12.618 1 DEBUG oslo_concurrency.lockutils [req-953f865f-36e7-4f84-8294-e58af94512a3 - - - - -] Lock "host_instance" acquired by "nova.scheduler.host_manager.sync_instance_info" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
2018-07-12 06:10:12.628 1 DEBUG oslo_concurrency.lockutils [req-953f865f-36e7-4f84-8294-e58af94512a3 - - - - -] Lock "94565c36-5f96-40ca-88d6-b789b91da1e8" acquired by "nova.context.get_or_set_cached_cell_and_set_connections" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
2018-07-12 06:10:12.628 1 DEBUG oslo_concurrency.lockutils [req-953f865f-36e7-4f84-8294-e58af94512a3 - - - - -] Lock "94565c36-5f96-40ca-88d6-b789b91da1e8" released by "nova.context.get_or_set_cached_cell_and_set_connections" :: held 0.001s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285
2018-07-12 06:10:12.668 1 INFO nova.scheduler.host_manager [req-953f865f-36e7-4f84-8294-e58af94512a3 - - - - -] The instance sync for host 'compute-0.localdomain' did not match. Re-created its InstanceList.
2018-07-12 06:10:12.669 1 DEBUG oslo_concurrency.lockutils [req-953f865f-36e7-4f84-8294-e58af94512a3 - - - - -] Lock "host_instance" released by "nova.scheduler.host_manager.sync_instance_info" :: held 0.051s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285

neutron server log excerpt:

2018-07-12 06:12:27.910 37 DEBUG neutron.db.agents_db [req-7b641e2c-41c1-478e-abf4-b47d67921c17 - - - - -] Agent healthcheck: found 0 active agents agent_health_check /usr/lib/python2.7/site-packages/neutron/db/agents_db.py:264
2018-07-12 06:12:09.154 29 INFO neutron.wsgi [-] 172.17.1.26 "OPTIONS / HTTP/1.0" status: 200 len: 248 time: 0.0008988
2018-07-12 06:12:09.761 26 DEBUG neutron.wsgi [-] (26) accepted ('172.17.1.15', 47868) server /usr/lib/python2.7/site-packages/eventlet/wsgi.py:883
2018-07-12 06:12:09.764 26 INFO neutron.wsgi [-] 172.17.1.15 "OPTIONS / HTTP/1.0" status: 200 len: 248 time: 0.0014350
2018-07-12 06:12:10.174 36 DEBUG futurist.periodics [-] Submitting periodic callback 'networking_ovn.common.maintenance.DBInconsistenciesPeriodics.check_for_inconsistencies' _process_scheduled /usr/lib/python2.7/site-packages/futurist/periodics.py:639
2018-07-12 06:12:10.578 28 DEBUG neutron.wsgi [-] (28) accepted ('172.17.1.15', 47908) server /usr/lib/python2.7/site-packages/eventlet/wsgi.py:883
2018-07-12 06:12:10.956 26 DEBUG neutron.wsgi [-] (26) accepted ('172.17.1.26', 54022) server /usr/lib/python2.7/site-packages/eventlet/wsgi.py:883
2018-07-12 06:12:10.960 26 INFO neutron.wsgi [-] 172.17.1.26 "GET / HTTP/1.1" status: 200 len: 229 time: 0.0019550
2018-07-12 06:12:11.153 27 DEBUG neutron.wsgi [-] (27) accepted ('172.17.1.18', 47412) server /usr/lib/python2.7/site-packages/eventlet/wsgi.py:883
2018-07-12 06:12:11.154 29 DEBUG neutron.wsgi [-] (29) accepted ('172.17.1.26', 54032) server /usr/lib/python2.7/site-packages/eventlet/wsgi.py:883
2018-07-12 06:12:11.156 27 INFO neutron.wsgi [-] 172.17.1.18 "OPTIONS / HTTP/1.0" status: 200 len: 248 time: 0.0013189
2018-07-12 06:12:11.156 29 INFO neutron.wsgi [-] 172.17.1.26 "OPTIONS / HTTP/1.0" status: 200 len: 248 time: 0.0011420
2018-07-12 06:12:11.457 28 INFO neutron.wsgi [req-dda0757f-fa80-47c6-aa0e-b10d37c95b2f a76ae3493a9948b3ae3e48fdc9c05915 875abe74b8074fff98e4ed6132083b94 - default default] 172.17.1.15 "GET /v2.0/ports?device_id=e765c4cc-aeb0-4179-a6e3-4fd587dc8beb&device_id=32d8839a-7f2e-4824-a035-41557f7bdf0c HTTP/1.1" status: 200 len: 186 time: 0.8782690
2018-07-12 06:12:11.764 28 DEBUG neutron.wsgi [-] (28) accepted ('172.17.1.15', 48324) server /usr/lib/python2.7/site-packages/eventlet/wsgi.py:883
2018-07-12 06:12:11.766 28 INFO neutron.wsgi [-] 172.17.1.15 "OPTIONS / HTTP/1.0" status: 200 len: 248 time: 0.0012259

ovsdb-server log excerpt (OVN NB DB, ptcp:6641 on 172.17.1.16):

2018-07-12T06:14:08.206Z|19953|socket_util|ERR|6641:172.17.1.16: bind: Cannot assign requested address
2018-07-12T06:14:10.594Z|19954|socket_util|ERR|6641:172.17.1.16: bind: Cannot assign requested address
2018-07-12T06:14:13.096Z|19955|socket_util|ERR|6641:172.17.1.16: bind: Cannot assign requested address
2018-07-12T06:14:13.207Z|19956|socket_util|ERR|6641:172.17.1.16: bind: Cannot assign requested address
2018-07-12T06:14:13.208Z|19957|socket_util|ERR|6641:172.17.1.16: bind: Cannot assign requested address
2018-07-12T06:14:15.599Z|19958|socket_util|ERR|6641:172.17.1.16: bind: Cannot assign requested address
2018-07-12T06:14:15.599Z|19959|ovsdb_jsonrpc_server|ERR|Dropped 51 log messages in last 60 seconds (most recently, 3 seconds ago) due to excessive rate
2018-07-12T06:14:15.599Z|19960|ovsdb_jsonrpc_server|ERR|ptcp:6641:172.17.1.16: listen failed: Cannot assign requested address

oslo.messaging (AMQP) log excerpt:

2018-07-12 06:17:23.608 1 INFO oslo.messaging._drivers.impl_rabbit [req-17dc2377-930a-49a7-9ab0-b31b33db0aef 84d1c3f3bb2840cebd562c9518a3a31d a88b57c748ab477a90052a31ea1bc48e - default default] [ff32bd29-3dbd-4ed4-b8ad-202255fff77f] Reconnected to AMQP server on controller-1.internalapi.localdomain:5672 via [amqp] client with port 38912.
2018-07-12 06:17:36.168 1 ERROR oslo.messaging._drivers.impl_rabbit [req-17dc2377-930a-49a7-9ab0-b31b33db0aef 84d1c3f3bb2840cebd562c9518a3a31d a88b57c748ab477a90052a31ea1bc48e - default default] [341b1bfb-dbd7-4399-9127-2b4aaebd3254] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: None: timeout: timed out
2018-07-12 06:17:37.193 1 INFO oslo.messaging._drivers.impl_rabbit [req-17dc2377-930a-49a7-9ab0-b31b33db0aef 84d1c3f3bb2840cebd562c9518a3a31d a88b57c748ab477a90052a31ea1bc48e - default default] [341b1bfb-dbd7-4399-9127-2b4aaebd3254] Reconnected to AMQP server on controller-1.internalapi.localdomain:5672 via [amqp] client with port 38916.
2018-07-12 06:17:44.118 1 ERROR oslo.messaging._drivers.impl_rabbit [req-17dc2377-930a-49a7-9ab0-b31b33db0aef 84d1c3f3bb2840cebd562c9518a3a31d a88b57c748ab477a90052a31ea1bc48e - default default] [1a9a1cc8-dffe-48b0-a61f-c96f70650ac8] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: None: timeout: timed out
2018-07-12 06:17:45.136 1 INFO oslo.messaging._drivers.impl_rabbit [req-17dc2377-930a-49a7-9ab0-b31b33db0aef 84d1c3f3bb2840cebd562c9518a3a31d a88b57c748ab477a90052a31ea1bc48e - default default] [1a9a1cc8-dffe-48b0-a61f-c96f70650ac8] Reconnected to AMQP server on controller-2.internalapi.localdomain:5672 via [amqp] client with port 38676.

Exited containers (docker listing excerpt):

b8c111a1e33    192.168.24.1:8787/rhosp13/openstack-neutron-metadata-agent-ovn:2018-07-11.1   "/docker_puppet_ap..."   10 hours ago   Exited (0) 10 hours ago   setup_ovs_manager
b6d1fef20f16   192.168.24.1:8787/rhosp13/openstack-ovn-controller:2018-07-11.1               "/bin/bash -c 'CMS..."   10 hours ago   Exited (0) 10 hours ago   configure_cms_options

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Run the minor update job:
   https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-13_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/
2. We get 42 failures on scenario tests that try to boot instances.
3.

Actual results:
failed

Expected results:
All tests should pass.

Additional info:
sos-report in the first comment
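A couple of checks that may help with triage (hedged suggestions; the address, port and instance IDs are taken from the output above, and the exact node to run them on depends on which controller is supposed to own 172.17.1.16):

  # Is the Internal API address the OVN DB server keeps failing to bind actually configured on this node?
  $ ip addr | grep 172.17.1.16

  # Is anything listening on the OVN NB DB port?
  $ ss -tlnp | grep 6641

  # Inspect one of the stuck instances for more detail:
  $ openstack server show e765c4cc-aeb0-4179-a6e3-4fd587dc8beb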
It looks like the nova-compute service is down:

(overcloud) [stack@undercloud-0 ~]$ openstack compute service list
+----+------------------+--------------------------+----------+---------+-------+----------------------------+
| ID | Binary           | Host                     | Zone     | Status  | State | Updated At                 |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+
| 2  | nova-scheduler   | controller-1.localdomain | internal | enabled | up    | 2018-07-12T10:20:46.000000 |
| 5  | nova-consoleauth | controller-1.localdomain | internal | enabled | up    | 2018-07-12T10:20:41.000000 |
| 8  | nova-scheduler   | controller-2.localdomain | internal | enabled | up    | 2018-07-12T10:20:46.000000 |
| 11 | nova-conductor   | controller-1.localdomain | internal | enabled | up    | 2018-07-12T10:20:40.000000 |
| 14 | nova-consoleauth | controller-2.localdomain | internal | enabled | up    | 2018-07-12T10:20:47.000000 |
| 17 | nova-compute     | compute-1.localdomain    | nova     | enabled | up    | 2018-07-12T10:20:44.000000 |
| 20 | nova-conductor   | controller-2.localdomain | internal | enabled | up    | 2018-07-12T10:20:43.000000 |
| 26 | nova-compute     | compute-0.localdomain    | nova     | enabled | down  | 2018-07-12T08:20:12.000000 |
| 41 | nova-scheduler   | controller-0.localdomain | internal | enabled | up    | 2018-07-12T10:20:41.000000 |
| 50 | nova-consoleauth | controller-0.localdomain | internal | enabled | up    | 2018-07-12T10:20:41.000000 |
| 65 | nova-conductor   | controller-0.localdomain | internal | enabled | up    | 2018-07-12T10:20:44.000000 |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+

I went to check the logs of the nova_compute container on compute-0 and found the following traceback:

Running command: '/usr/bin/nova-compute '
/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py:332: NotSupportedWarning: Configuration option(s) ['use_tpool'] not supported
  exception.NotSupportedWarning
+ sudo -E kolla_set_configs
INFO:__main__:Loading config file at /var/lib/kolla/config_files/config.json
INFO:__main__:Validating config file
INFO:__main__:Kolla config strategy set to: COPY_ALWAYS
INFO:__main__:Copying service configuration files
INFO:__main__:Deleting /etc/libvirt/libvirtd.conf
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/libvirt/libvirtd.conf to /etc/libvirt/libvirtd.conf
INFO:__main__:Deleting /etc/libvirt/passwd.db
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/libvirt/passwd.db to /etc/libvirt/passwd.db
INFO:__main__:Deleting /etc/libvirt/qemu.conf
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/libvirt/qemu.conf to /etc/libvirt/qemu.conf
INFO:__main__:Deleting /etc/my.cnf.d/tripleo.cnf
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/my.cnf.d/tripleo.cnf to /etc/my.cnf.d/tripleo.cnf
INFO:__main__:Deleting /etc/nova/migration/authorized_keys
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/nova/migration/authorized_keys to /etc/nova/migration/authorized_keys
INFO:__main__:Deleting /etc/nova/migration/identity
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/nova/migration/identity to /etc/nova/migration/identity
INFO:__main__:Deleting /etc/nova/nova.conf
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/nova/nova.conf to /etc/nova/nova.conf
INFO:__main__:Deleting /etc/sasl2/libvirt.conf
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/sasl2/libvirt.conf to /etc/sasl2/libvirt.conf
INFO:__main__:Deleting /etc/ssh/sshd_config
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/ssh/sshd_config to /etc/ssh/sshd_config
INFO:__main__:Deleting /var/lib/nova/.ssh/config
ERROR:__main__:Unexpected error:
Traceback (most recent call last):
  File "/usr/local/bin/kolla_set_configs", line 411, in main
    execute_config_strategy(config)
  File "/usr/local/bin/kolla_set_configs", line 377, in execute_config_strategy
    copy_config(config)
  File "/usr/local/bin/kolla_set_configs", line 306, in copy_config
    config_file.copy()
  File "/usr/local/bin/kolla_set_configs", line 150, in copy
    self._merge_directories(source, dest)
  File "/usr/local/bin/kolla_set_configs", line 97, in _merge_directories
    os.path.join(dest, to_copy))
  File "/usr/local/bin/kolla_set_configs", line 97, in _merge_directories
    os.path.join(dest, to_copy))
  File "/usr/local/bin/kolla_set_configs", line 97, in _merge_directories
    os.path.join(dest, to_copy))
  File "/usr/local/bin/kolla_set_configs", line 97, in _merge_directories
    os.path.join(dest, to_copy))
  File "/usr/local/bin/kolla_set_configs", line 99, in _merge_directories
    self._copy_file(source, dest)
  File "/usr/local/bin/kolla_set_configs", line 75, in _copy_file
    self._delete_path(dest)
  File "/usr/local/bin/kolla_set_configs", line 108, in _delete_path
    os.remove(path)
OSError: [Errno 2] No such file or directory: '/var/lib/nova/.ssh/config'
+ sudo -E kolla_set_configs

However, judging by the nova-compute logs, the tempest tests did manage to create and delete some other instances:

2018-07-11 23:47:50.839 1 INFO nova.compute.manager [req-b0b534ed-48f8-413e-a3c2-2d9bafd579fa 84d1c3f3bb2840cebd562c9518a3a31d a88b57c748ab477a90052a31ea1bc48e - default default] [instance: 3bc365c4-c8fd-4cdc-be69-2431b1944f1e] VM Started (Lifecycle Event)
2018-07-11 23:48:05.947 1 INFO nova.compute.manager [req-88904359-c26a-479c-bd86-7cd5699a5603 84d1c3f3bb2840cebd562c9518a3a31d a88b57c748ab477a90052a31ea1bc48e - default default] [instance: 3bc365c4-c8fd-4cdc-be69-2431b1944f1e] VM Resumed (Lifecycle Event)
2018-07-11 23:49:07.760 1 INFO nova.scheduler.client.report [req-70ec49ce-1342-4527-b13b-9fe0f82e8c10 84d1c3f3bb2840cebd562c9518a3a31d a88b57c748ab477a90052a31ea1bc48e - default default] Deleted allocation for instance 3bc365c4-c8fd-4cdc-be69-2431b1944f1e

For that reason, I wonder whether this is just a side effect of the running tempest tests. Anyway, I will continue debugging the issue.
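A couple of checks that might help around the kolla_set_configs failure above (hedged; "nova_compute" is the container name used in this deployment, and whether a plain restart is safe depends on how the container is being managed on compute-0):

  # Does the path the traceback complains about still exist, or is it a dangling symlink?
  $ sudo ls -l /var/lib/nova/.ssh/

  # Re-run the container start sequence (which re-executes kolla_set_configs) and watch its output:
  $ sudo docker restart nova_compute
  $ sudo docker logs -f nova_compute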
According to the Neutron CI update jobs, the cloud is not functional after a minor update even in ML2/OVS environments, so the issue is not specific to OVN.

ML2/OVS job:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/neutron/job/DFG-network-neutron-update-13_director-rhel-virthost-3cont_2comp_2net-ipv4-vlan-composable/

ML2/OVN job:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-13_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/24/testReport/
There are a lot of DuplicateMessageError messages on the controllers in /var/log/containers/nova/nova-scheduler.log. That is one of the main symptoms of bug 1592528, so I wonder if we are hitting that one.
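A quick way to confirm that symptom on each controller (same log path as above):

  $ sudo grep -c DuplicateMessageError /var/log/containers/nova/nova-scheduler.log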
After restarting the rabbitmq-bundle resource, the issue disappeared. It looks like this is a duplicate of bug 1592528.

*** This bug has been marked as a duplicate of bug 1592528 ***
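For reference, the restart mentioned above is of the pacemaker-managed RabbitMQ bundle; assuming the default OSP13 resource name (rabbitmq-bundle, as used in this environment), it would be something like, run on one controller:

  $ sudo pcs resource restart rabbitmq-bundle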