Bug 1600373
| Summary: | after a minor update it's not possible to boot instances | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Eran Kuris <ekuris> |
| Component: | python-networking-ovn | Assignee: | Assaf Muller <amuller> |
| Status: | CLOSED DUPLICATE | QA Contact: | Eran Kuris <ekuris> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 13.0 (Queens) | CC: | apevec, jfrancoa, lhh, majopela, nyechiel |
| Target Milestone: | z1 | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-07-12 13:08:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
It looks like the nova-compute service is down:
(overcloud) [stack@undercloud-0 ~]$ openstack compute service list
+----+------------------+--------------------------+----------+---------+-------+----------------------------+
| ID | Binary | Host | Zone | Status | State | Updated At |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+
| 2 | nova-scheduler | controller-1.localdomain | internal | enabled | up | 2018-07-12T10:20:46.000000 |
| 5 | nova-consoleauth | controller-1.localdomain | internal | enabled | up | 2018-07-12T10:20:41.000000 |
| 8 | nova-scheduler | controller-2.localdomain | internal | enabled | up | 2018-07-12T10:20:46.000000 |
| 11 | nova-conductor | controller-1.localdomain | internal | enabled | up | 2018-07-12T10:20:40.000000 |
| 14 | nova-consoleauth | controller-2.localdomain | internal | enabled | up | 2018-07-12T10:20:47.000000 |
| 17 | nova-compute | compute-1.localdomain | nova | enabled | up | 2018-07-12T10:20:44.000000 |
| 20 | nova-conductor | controller-2.localdomain | internal | enabled | up | 2018-07-12T10:20:43.000000 |
| 26 | nova-compute | compute-0.localdomain | nova | enabled | down | 2018-07-12T08:20:12.000000 |
| 41 | nova-scheduler | controller-0.localdomain | internal | enabled | up | 2018-07-12T10:20:41.000000 |
| 50 | nova-consoleauth | controller-0.localdomain | internal | enabled | up | 2018-07-12T10:20:41.000000 |
| 65 | nova-conductor | controller-0.localdomain | internal | enabled | up | 2018-07-12T10:20:44.000000 |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+
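A service shows "down" when its last reported heartbeat (the Updated At column) is older than nova's service_down_time, which defaults to 60 seconds (with a 10-second report_interval). The staleness check that explains the output above can be sketched as follows; the threshold is the upstream default and the timestamps are copied from the listing, so this is an illustration, not nova's actual implementation:

```python
from datetime import datetime, timedelta

# Assumption: upstream nova default service_down_time = 60s;
# deployments may override this in nova.conf.
SERVICE_DOWN_TIME = timedelta(seconds=60)

def is_service_down(updated_at: str, now: datetime) -> bool:
    """Return True if the service heartbeat is older than service_down_time."""
    last_seen = datetime.strptime(updated_at, "%Y-%m-%dT%H:%M:%S.%f")
    return (now - last_seen) > SERVICE_DOWN_TIME

# compute-0's heartbeat is two hours older than everyone else's:
now = datetime(2018, 7, 12, 10, 20, 50)
print(is_service_down("2018-07-12T08:20:12.000000", now))  # compute-0 -> True
print(is_service_down("2018-07-12T10:20:44.000000", now))  # compute-1 -> False
```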
I went to check the logs of the nova_compute container on compute-0 and found the following traceback:
Running command: '/usr/bin/nova-compute '
/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py:332: NotSupportedWarning: Configuration option(s) ['use_tpool'] not supported
exception.NotSupportedWarning
+ sudo -E kolla_set_configs
INFO:__main__:Loading config file at /var/lib/kolla/config_files/config.json
INFO:__main__:Validating config file
INFO:__main__:Kolla config strategy set to: COPY_ALWAYS
INFO:__main__:Copying service configuration files
INFO:__main__:Deleting /etc/libvirt/libvirtd.conf
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/libvirt/libvirtd.conf to /etc/libvirt/libvirtd.conf
INFO:__main__:Deleting /etc/libvirt/passwd.db
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/libvirt/passwd.db to /etc/libvirt/passwd.db
INFO:__main__:Deleting /etc/libvirt/qemu.conf
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/libvirt/qemu.conf to /etc/libvirt/qemu.conf
INFO:__main__:Deleting /etc/my.cnf.d/tripleo.cnf
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/my.cnf.d/tripleo.cnf to /etc/my.cnf.d/tripleo.cnf
INFO:__main__:Deleting /etc/nova/migration/authorized_keys
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/nova/migration/authorized_keys to /etc/nova/migration/authorized_keys
INFO:__main__:Deleting /etc/nova/migration/identity
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/nova/migration/identity to /etc/nova/migration/identity
INFO:__main__:Deleting /etc/nova/nova.conf
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/nova/nova.conf to /etc/nova/nova.conf
INFO:__main__:Deleting /etc/sasl2/libvirt.conf
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/sasl2/libvirt.conf to /etc/sasl2/libvirt.conf
INFO:__main__:Deleting /etc/ssh/sshd_config
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/ssh/sshd_config to /etc/ssh/sshd_config
INFO:__main__:Deleting /var/lib/nova/.ssh/config
ERROR:__main__:Unexpected error:
Traceback (most recent call last):
File "/usr/local/bin/kolla_set_configs", line 411, in main
execute_config_strategy(config)
File "/usr/local/bin/kolla_set_configs", line 377, in execute_config_strategy
copy_config(config)
File "/usr/local/bin/kolla_set_configs", line 306, in copy_config
config_file.copy()
File "/usr/local/bin/kolla_set_configs", line 150, in copy
self._merge_directories(source, dest)
File "/usr/local/bin/kolla_set_configs", line 97, in _merge_directories
os.path.join(dest, to_copy))
File "/usr/local/bin/kolla_set_configs", line 97, in _merge_directories
os.path.join(dest, to_copy))
File "/usr/local/bin/kolla_set_configs", line 97, in _merge_directories
os.path.join(dest, to_copy))
File "/usr/local/bin/kolla_set_configs", line 97, in _merge_directories
os.path.join(dest, to_copy))
File "/usr/local/bin/kolla_set_configs", line 99, in _merge_directories
self._copy_file(source, dest)
File "/usr/local/bin/kolla_set_configs", line 75, in _copy_file
self._delete_path(dest)
File "/usr/local/bin/kolla_set_configs", line 108, in _delete_path
os.remove(path)
OSError: [Errno 2] No such file or directory: '/var/lib/nova/.ssh/config'
+ sudo -E kolla_set_configs
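For context on the traceback: an `OSError` with errno 2 (`ENOENT`) is what `os.remove` raises when the target no longer exists, for example if the file disappeared between the tool's directory scan and the delete. A minimal sketch of a delete that tolerates an already-missing path follows; it illustrates the failure mode and is not the actual kolla_set_configs fix:

```python
import errno
import os
import tempfile

def delete_path(path):
    """Remove a file, tolerating the case where it is already gone.

    os.remove() raises OSError with errno ENOENT if the path does not
    exist, which is exactly the failure seen in the traceback above.
    """
    try:
        os.remove(path)
    except OSError as exc:
        if exc.errno != errno.ENOENT:
            raise  # re-raise anything other than "No such file or directory"

# Demonstration against a path that does not exist:
missing = os.path.join(tempfile.mkdtemp(), "config")
delete_path(missing)      # tolerated silently
try:
    os.remove(missing)    # the unguarded call fails, as in the bug
except OSError as exc:
    print(exc.errno)      # -> 2 (ENOENT)
```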
However, the tempest test cases managed to create some other instances and delete them, from what I could see in the nova-compute logs:
2018-07-11 23:47:50.839 1 INFO nova.compute.manager [req-b0b534ed-48f8-413e-a3c2-2d9bafd579fa 84d1c3f3bb2840cebd562c9518a3a31d a88b57c748ab477a90052a31ea1bc48e - default default] [instance: 3bc365c4-c8fd-4cdc-be69-2431b1944f1e] VM Started (Lifecycle Event)
2018-07-11 23:48:05.947 1 INFO nova.compute.manager [req-88904359-c26a-479c-bd86-7cd5699a5603 84d1c3f3bb2840cebd562c9518a3a31d a88b57c748ab477a90052a31ea1bc48e - default default] [instance: 3bc365c4-c8fd-4cdc-be69-2431b1944f1e] VM Resumed (Lifecycle Event)
2018-07-11 23:49:07.760 1 INFO nova.scheduler.client.report [req-70ec49ce-1342-4527-b13b-9fe0f82e8c10 84d1c3f3bb2840cebd562c9518a3a31d a88b57c748ab477a90052a31ea1bc48e - default default] Deleted allocation for instance 3bc365c4-c8fd-4cdc-be69-2431b1944f1e
For that reason, I wonder whether this is just a side effect of the running tempest tests. In any case, I will continue debugging the issue.
According to the Neutron CI update job, it looks like the cloud is not functional after a minor update even in ML2/OVS environments, so the issue is not related to OVN.

ML2/OVS job: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/neutron/job/DFG-network-neutron-update-13_director-rhel-virthost-3cont_2comp_2net-ipv4-vlan-composable/
ML2/OVN job: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-13_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/24/testReport/

There are a lot of DuplicateMessageError messages on the controllers in /var/log/containers/nova/nova-scheduler.log. That is one of the main symptoms of bug 1592528, so I wonder if we're hitting that one.

After restarting the rabbitmq-bundle the issue disappeared, so it looks like a duplicate of bug 1592528.

*** This bug has been marked as a duplicate of bug 1592528 ***
Description of problem:

After a minor update, it looks like the cloud is not functional: it is not possible to boot instances.

+--------------------------------------+-------------+--------+----------+--------+-----------+
| ID                                   | Name        | Status | Networks | Image  | Flavor    |
+--------------------------------------+-------------+--------+----------+--------+-----------+
| e765c4cc-aeb0-4179-a6e3-4fd587dc8beb | VM_net-64-1 | BUILD  |          | rhel75 | m1.medium |
| 32d8839a-7f2e-4824-a035-41557f7bdf0c | VM_net-64-2 | BUILD  |          | rhel74 | m1.medium |
+--------------------------------------+-------------+--------+----------+--------+-----------+

"host_instance" released by "nova.scheduler.host_manager.update_instance_info" :: held 0.001s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285
2018-07-12 06:10:12.618 1 DEBUG oslo_concurrency.lockutils [req-953f865f-36e7-4f84-8294-e58af94512a3 - - - - -] Lock "host_instance" acquired by "nova.scheduler.host_manager.sync_instance_info" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
2018-07-12 06:10:12.628 1 DEBUG oslo_concurrency.lockutils [req-953f865f-36e7-4f84-8294-e58af94512a3 - - - - -] Lock "94565c36-5f96-40ca-88d6-b789b91da1e8" acquired by "nova.context.get_or_set_cached_cell_and_set_connections" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
2018-07-12 06:10:12.628 1 DEBUG oslo_concurrency.lockutils [req-953f865f-36e7-4f84-8294-e58af94512a3 - - - - -] Lock "94565c36-5f96-40ca-88d6-b789b91da1e8" released by "nova.context.get_or_set_cached_cell_and_set_connections" :: held 0.001s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285
2018-07-12 06:10:12.668 1 INFO nova.scheduler.host_manager [req-953f865f-36e7-4f84-8294-e58af94512a3 - - - - -] The instance sync for host 'compute-0.localdomain' did not match. Re-created its InstanceList.
2018-07-12 06:10:12.669 1 DEBUG oslo_concurrency.lockutils [req-953f865f-36e7-4f84-8294-e58af94512a3 - - - - -] Lock "host_instance" released by "nova.scheduler.host_manager.sync_instance_info" :: held 0.051s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285
2018-07-12 06:12:27.910 37 DEBUG neutron.db.agents_db [req-7b641e2c-41c1-478e-abf4-b47d67921c17 - - - - -] Agent healthcheck: found 0 active agents agent_health_check /usr/lib/python2.7/site-packages/neutron/db/agents_db.py:264
2018-07-12 06:12:09.154 29 INFO neutron.wsgi [-] 172.17.1.26 "OPTIONS / HTTP/1.0" status: 200 len: 248 time: 0.0008988
2018-07-12 06:12:09.761 26 DEBUG neutron.wsgi [-] (26) accepted ('172.17.1.15', 47868) server /usr/lib/python2.7/site-packages/eventlet/wsgi.py:883
2018-07-12 06:12:09.764 26 INFO neutron.wsgi [-] 172.17.1.15 "OPTIONS / HTTP/1.0" status: 200 len: 248 time: 0.0014350
2018-07-12 06:12:10.174 36 DEBUG futurist.periodics [-] Submitting periodic callback 'networking_ovn.common.maintenance.DBInconsistenciesPeriodics.check_for_inconsistencies' _process_scheduled /usr/lib/python2.7/site-packages/futurist/periodics.py:639
2018-07-12 06:12:10.578 28 DEBUG neutron.wsgi [-] (28) accepted ('172.17.1.15', 47908) server /usr/lib/python2.7/site-packages/eventlet/wsgi.py:883
2018-07-12 06:12:10.956 26 DEBUG neutron.wsgi [-] (26) accepted ('172.17.1.26', 54022) server /usr/lib/python2.7/site-packages/eventlet/wsgi.py:883
2018-07-12 06:12:10.960 26 INFO neutron.wsgi [-] 172.17.1.26 "GET / HTTP/1.1" status: 200 len: 229 time: 0.0019550
2018-07-12 06:12:11.153 27 DEBUG neutron.wsgi [-] (27) accepted ('172.17.1.18', 47412) server /usr/lib/python2.7/site-packages/eventlet/wsgi.py:883
2018-07-12 06:12:11.154 29 DEBUG neutron.wsgi [-] (29) accepted ('172.17.1.26', 54032) server /usr/lib/python2.7/site-packages/eventlet/wsgi.py:883
2018-07-12 06:12:11.156 27 INFO neutron.wsgi [-] 172.17.1.18 "OPTIONS / HTTP/1.0" status: 200 len: 248 time: 0.0013189
2018-07-12 06:12:11.156 29 INFO neutron.wsgi [-] 172.17.1.26 "OPTIONS / HTTP/1.0" status: 200 len: 248 time: 0.0011420
2018-07-12 06:12:11.457 28 INFO neutron.wsgi [req-dda0757f-fa80-47c6-aa0e-b10d37c95b2f a76ae3493a9948b3ae3e48fdc9c05915 875abe74b8074fff98e4ed6132083b94 - default default] 172.17.1.15 "GET /v2.0/ports?device_id=e765c4cc-aeb0-4179-a6e3-4fd587dc8beb&device_id=32d8839a-7f2e-4824-a035-41557f7bdf0c HTTP/1.1" status: 200 len: 186 time: 0.8782690
2018-07-12 06:12:11.764 28 DEBUG neutron.wsgi [-] (28) accepted ('172.17.1.15', 48324) server /usr/lib/python2.7/site-packages/eventlet/wsgi.py:883
2018-07-12 06:12:11.766 28 INFO neutron.wsgi [-] 172.17.1.15 "OPTIONS / HTTP/1.0" status: 200 len: 248 time: 0.0012259
2018-07-12T06:14:08.206Z|19953|socket_util|ERR|6641:172.17.1.16: bind: Cannot assign requested address
2018-07-12T06:14:10.594Z|19954|socket_util|ERR|6641:172.17.1.16: bind: Cannot assign requested address
2018-07-12T06:14:13.096Z|19955|socket_util|ERR|6641:172.17.1.16: bind: Cannot assign requested address
2018-07-12T06:14:13.207Z|19956|socket_util|ERR|6641:172.17.1.16: bind: Cannot assign requested address
2018-07-12T06:14:13.208Z|19957|socket_util|ERR|6641:172.17.1.16: bind: Cannot assign requested address
2018-07-12T06:14:15.599Z|19958|socket_util|ERR|6641:172.17.1.16: bind: Cannot assign requested address
2018-07-12T06:14:15.599Z|19959|ovsdb_jsonrpc_server|ERR|Dropped 51 log messages in last 60 seconds (most recently, 3 seconds ago) due to excessive rate
2018-07-12T06:14:15.599Z|19960|ovsdb_jsonrpc_server|ERR|ptcp:6641:172.17.1.16: listen failed: Cannot assign requested address
2018-07-12 06:17:23.608 1 INFO oslo.messaging._drivers.impl_rabbit [req-17dc2377-930a-49a7-9ab0-b31b33db0aef 84d1c3f3bb2840cebd562c9518a3a31d a88b57c748ab477a90052a31ea1bc48e - default default] [ff32bd29-3dbd-4ed4-b8ad-202255fff77f] Reconnected to AMQP server on controller-1.internalapi.localdomain:5672 via [amqp] client with port 38912.
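The repeated `bind: Cannot assign requested address` lines above mean ovsdb-server is trying to listen on 172.17.1.16:6641 while no local interface currently holds that address (errno `EADDRNOTAVAIL`), which can happen if the internal-API VIP moved away from the node during the update. The same errno can be reproduced with a plain socket; 192.0.2.1 is a documentation-only (TEST-NET-1) address assumed not to be configured on the machine running the sketch:

```python
import errno
import socket

# Binding a listener to an IP that no local interface holds fails with
# EADDRNOTAVAIL, the same condition ovsdb-server reports as
# "bind: Cannot assign requested address".
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    sock.bind(("192.0.2.1", 6641))
except OSError as exc:
    print(exc.errno == errno.EADDRNOTAVAIL)  # -> True on Linux
finally:
    sock.close()
```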
2018-07-12 06:17:36.168 1 ERROR oslo.messaging._drivers.impl_rabbit [req-17dc2377-930a-49a7-9ab0-b31b33db0aef 84d1c3f3bb2840cebd562c9518a3a31d a88b57c748ab477a90052a31ea1bc48e - default default] [341b1bfb-dbd7-4399-9127-2b4aaebd3254] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: None: timeout: timed out
2018-07-12 06:17:37.193 1 INFO oslo.messaging._drivers.impl_rabbit [req-17dc2377-930a-49a7-9ab0-b31b33db0aef 84d1c3f3bb2840cebd562c9518a3a31d a88b57c748ab477a90052a31ea1bc48e - default default] [341b1bfb-dbd7-4399-9127-2b4aaebd3254] Reconnected to AMQP server on controller-1.internalapi.localdomain:5672 via [amqp] client with port 38916.
2018-07-12 06:17:44.118 1 ERROR oslo.messaging._drivers.impl_rabbit [req-17dc2377-930a-49a7-9ab0-b31b33db0aef 84d1c3f3bb2840cebd562c9518a3a31d a88b57c748ab477a90052a31ea1bc48e - default default] [1a9a1cc8-dffe-48b0-a61f-c96f70650ac8] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: None: timeout: timed out
2018-07-12 06:17:45.136 1 INFO oslo.messaging._drivers.impl_rabbit [req-17dc2377-930a-49a7-9ab0-b31b33db0aef 84d1c3f3bb2840cebd562c9518a3a31d a88b57c748ab477a90052a31ea1bc48e - default default] [1a9a1cc8-dffe-48b0-a61f-c96f70650ac8] Reconnected to AMQP server on controller-2.internalapi.localdomain:5672 via [amqp] client with port 38676.

b8c111a1e33 192.168.24.1:8787/rhosp13/openstack-neutron-metadata-agent-ovn:2018-07-11.1 "/docker_puppet_ap..." 10 hours ago Exited (0) 10 hours ago setup_ovs_manager
b6d1fef20f16 192.168.24.1:8787/rhosp13/openstack-ovn-controller:2018-07-11.1 "/bin/bash -c 'CMS..." 10 hours ago Exited (0) 10 hours ago configure_cms_options

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Run the minor update job: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-13_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/
2. Observe 42 failures on scenario tests that try to boot instances.

Actual results:
Failed.

Expected results:
All tests should pass.

Additional info:
sos-report in the first comment