Bug 1600373 - after a minor update it's not possible to boot instances
Summary: after a minor update it's not possible to boot instances
Keywords:
Status: CLOSED DUPLICATE of bug 1592528
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-networking-ovn
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Priority: urgent
Target Milestone: z1
Target Release: ---
Assignee: Assaf Muller
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-07-12 06:19 UTC by Eran Kuris
Modified: 2019-09-09 14:21 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-12 13:08:58 UTC
Target Upstream Version:
Embargoed:



Description Eran Kuris 2018-07-12 06:19:06 UTC
Description of problem:

After a minor update, the cloud does not appear to be functional: it is not possible to boot instances, and newly created instances remain stuck in BUILD status:

+--------------------------------------+-------------+--------+----------+--------+-----------+
| ID                                   | Name        | Status | Networks | Image  | Flavor    |
+--------------------------------------+-------------+--------+----------+--------+-----------+
| e765c4cc-aeb0-4179-a6e3-4fd587dc8beb | VM_net-64-1 | BUILD  |          | rhel75 | m1.medium |
| 32d8839a-7f2e-4824-a035-41557f7bdf0c | VM_net-64-2 | BUILD  |          | rhel74 | m1.medium |
+--------------------------------------+-------------+--------+----------+--------+-----------+
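For reference, a minimal sketch of how this state can be captured from the undercloud (assuming the standard overcloudrc credentials; the table above is presumably trimmed output of a similar command):

(overcloud) [stack@undercloud-0 ~]$ openstack server list --long          # Status stays BUILD, Networks column stays empty
(overcloud) [stack@undercloud-0 ~]$ openstack server show e765c4cc-aeb0-4179-a6e3-4fd587dc8beb   # per-instance detail for one of the stuck VMs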

"host_instance" released by "nova.scheduler.host_manager.update_instance_info" :: held 0.001s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285
2018-07-12 06:10:12.618 1 DEBUG oslo_concurrency.lockutils [req-953f865f-36e7-4f84-8294-e58af94512a3 - - - - -] Lock "host_instance" acquired by "nova.scheduler.host_manager.sync_instance_info" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
2018-07-12 06:10:12.628 1 DEBUG oslo_concurrency.lockutils [req-953f865f-36e7-4f84-8294-e58af94512a3 - - - - -] Lock "94565c36-5f96-40ca-88d6-b789b91da1e8" acquired by "nova.context.get_or_set_cached_cell_and_set_connections" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:273
2018-07-12 06:10:12.628 1 DEBUG oslo_concurrency.lockutils [req-953f865f-36e7-4f84-8294-e58af94512a3 - - - - -] Lock "94565c36-5f96-40ca-88d6-b789b91da1e8" released by "nova.context.get_or_set_cached_cell_and_set_connections" :: held 0.001s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285
2018-07-12 06:10:12.668 1 INFO nova.scheduler.host_manager [req-953f865f-36e7-4f84-8294-e58af94512a3 - - - - -] The instance sync for host 'compute-0.localdomain' did not match. Re-created its InstanceList.
2018-07-12 06:10:12.669 1 DEBUG oslo_concurrency.lockutils [req-953f865f-36e7-4f84-8294-e58af94512a3 - - - - -] Lock "host_instance" released by "nova.scheduler.host_manager.sync_instance_info" :: held 0.051s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:285


2018-07-12 06:12:27.910 37 DEBUG neutron.db.agents_db [req-7b641e2c-41c1-478e-abf4-b47d67921c17 - - - - -] Agent healthcheck: found 0 active agents agent_health_check /usr/lib/python2.7/site-packages/neutron/db/agents_db.py:264
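Neutron reports no live agents. A hedged cross-check (note that with networking-ovn the agent list differs from an ML2/OVS deployment, so an empty or short list here is not by itself conclusive):

(overcloud) [stack@undercloud-0 ~]$ openstack network agent list          # API-side view of agent liveness
$ sudo docker ps --filter name=ovn_controller                             # node-side check; the container name filter is an assumption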


2018-07-12 06:12:09.154 29 INFO neutron.wsgi [-] 172.17.1.26 "OPTIONS / HTTP/1.0" status: 200  len: 248 time: 0.0008988
2018-07-12 06:12:09.761 26 DEBUG neutron.wsgi [-] (26) accepted ('172.17.1.15', 47868) server /usr/lib/python2.7/site-packages/eventlet/wsgi.py:883
2018-07-12 06:12:09.764 26 INFO neutron.wsgi [-] 172.17.1.15 "OPTIONS / HTTP/1.0" status: 200  len: 248 time: 0.0014350
2018-07-12 06:12:10.174 36 DEBUG futurist.periodics [-] Submitting periodic callback 'networking_ovn.common.maintenance.DBInconsistenciesPeriodics.check_for_inconsistencies' _process_scheduled /usr/lib/python2.7/site-packages/futurist/periodics.py:639
2018-07-12 06:12:10.578 28 DEBUG neutron.wsgi [-] (28) accepted ('172.17.1.15', 47908) server /usr/lib/python2.7/site-packages/eventlet/wsgi.py:883
2018-07-12 06:12:10.956 26 DEBUG neutron.wsgi [-] (26) accepted ('172.17.1.26', 54022) server /usr/lib/python2.7/site-packages/eventlet/wsgi.py:883
2018-07-12 06:12:10.960 26 INFO neutron.wsgi [-] 172.17.1.26 "GET / HTTP/1.1" status: 200  len: 229 time: 0.0019550
2018-07-12 06:12:11.153 27 DEBUG neutron.wsgi [-] (27) accepted ('172.17.1.18', 47412) server /usr/lib/python2.7/site-packages/eventlet/wsgi.py:883
2018-07-12 06:12:11.154 29 DEBUG neutron.wsgi [-] (29) accepted ('172.17.1.26', 54032) server /usr/lib/python2.7/site-packages/eventlet/wsgi.py:883
2018-07-12 06:12:11.156 27 INFO neutron.wsgi [-] 172.17.1.18 "OPTIONS / HTTP/1.0" status: 200  len: 248 time: 0.0013189
2018-07-12 06:12:11.156 29 INFO neutron.wsgi [-] 172.17.1.26 "OPTIONS / HTTP/1.0" status: 200  len: 248 time: 0.0011420
2018-07-12 06:12:11.457 28 INFO neutron.wsgi [req-dda0757f-fa80-47c6-aa0e-b10d37c95b2f a76ae3493a9948b3ae3e48fdc9c05915 875abe74b8074fff98e4ed6132083b94 - default default] 172.17.1.15 "GET /v2.0/ports?device_id=e765c4cc-aeb0-4179-a6e3-4fd587dc8beb&device_id=32d8839a-7f2e-4824-a035-41557f7bdf0c HTTP/1.1" status: 200  len: 186 time: 0.8782690
2018-07-12 06:12:11.764 28 DEBUG neutron.wsgi [-] (28) accepted ('172.17.1.15', 48324) server /usr/lib/python2.7/site-packages/eventlet/wsgi.py:883
2018-07-12 06:12:11.766 28 INFO neutron.wsgi [-] 172.17.1.15 "OPTIONS / HTTP/1.0" status: 200  len: 248 time: 0.0012259


2018-07-12T06:14:08.206Z|19953|socket_util|ERR|6641:172.17.1.16: bind: Cannot assign requested address
2018-07-12T06:14:10.594Z|19954|socket_util|ERR|6641:172.17.1.16: bind: Cannot assign requested address
2018-07-12T06:14:13.096Z|19955|socket_util|ERR|6641:172.17.1.16: bind: Cannot assign requested address
2018-07-12T06:14:13.207Z|19956|socket_util|ERR|6641:172.17.1.16: bind: Cannot assign requested address
2018-07-12T06:14:13.208Z|19957|socket_util|ERR|6641:172.17.1.16: bind: Cannot assign requested address
2018-07-12T06:14:15.599Z|19958|socket_util|ERR|6641:172.17.1.16: bind: Cannot assign requested address
2018-07-12T06:14:15.599Z|19959|ovsdb_jsonrpc_server|ERR|Dropped 51 log messages in last 60 seconds (most recently, 3 seconds ago) due to excessive rate
2018-07-12T06:14:15.599Z|19960|ovsdb_jsonrpc_server|ERR|ptcp:6641:172.17.1.16: listen failed: Cannot assign requested address
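"Cannot assign requested address" means that 172.17.1.16 (presumably the internal API address/VIP used for the OVN NB database) is not configured on the node where ovsdb-server is trying to listen on port 6641. A hedged set of node-local checks:

$ ip -o addr | grep 172.17.1.16              # is the OVN DB bind address present on this node?
$ sudo ss -lntp | grep 6641                  # is anything already listening on the NB DB port?
$ sudo pcs status | grep -i -A2 ovn          # where pacemaker currently places the OVN DB resource (ovn-dbs-bundle as the resource name is an assumption)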


2018-07-12 06:17:23.608 1 INFO oslo.messaging._drivers.impl_rabbit [req-17dc2377-930a-49a7-9ab0-b31b33db0aef 84d1c3f3bb2840cebd562c9518a3a31d a88b57c748ab477a90052a31ea1bc48e - default default] [ff32bd29-3dbd-4ed4-b8ad-202255fff77f] Reconnected to AMQP server on controller-1.internalapi.localdomain:5672 via [amqp] client with port 38912.
2018-07-12 06:17:36.168 1 ERROR oslo.messaging._drivers.impl_rabbit [req-17dc2377-930a-49a7-9ab0-b31b33db0aef 84d1c3f3bb2840cebd562c9518a3a31d a88b57c748ab477a90052a31ea1bc48e - default default] [341b1bfb-dbd7-4399-9127-2b4aaebd3254] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: None: timeout: timed out
2018-07-12 06:17:37.193 1 INFO oslo.messaging._drivers.impl_rabbit [req-17dc2377-930a-49a7-9ab0-b31b33db0aef 84d1c3f3bb2840cebd562c9518a3a31d a88b57c748ab477a90052a31ea1bc48e - default default] [341b1bfb-dbd7-4399-9127-2b4aaebd3254] Reconnected to AMQP server on controller-1.internalapi.localdomain:5672 via [amqp] client with port 38916.
2018-07-12 06:17:44.118 1 ERROR oslo.messaging._drivers.impl_rabbit [req-17dc2377-930a-49a7-9ab0-b31b33db0aef 84d1c3f3bb2840cebd562c9518a3a31d a88b57c748ab477a90052a31ea1bc48e - default default] [1a9a1cc8-dffe-48b0-a61f-c96f70650ac8] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: None: timeout: timed out
2018-07-12 06:17:45.136 1 INFO oslo.messaging._drivers.impl_rabbit [req-17dc2377-930a-49a7-9ab0-b31b33db0aef 84d1c3f3bb2840cebd562c9518a3a31d a88b57c748ab477a90052a31ea1bc48e - default default] [1a9a1cc8-dffe-48b0-a61f-c96f70650ac8] Reconnected to AMQP server on controller-2.internalapi.localdomain:5672 via [amqp] client with port 38676.
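The scheduler keeps losing and re-establishing its AMQP connections. A hedged way to check RabbitMQ health on a controller (the container name filter is an assumption):

$ sudo pcs status | grep -i rabbit                                                    # pacemaker view of the rabbitmq bundle
$ sudo docker exec $(sudo docker ps -q -f name=rabbitmq) rabbitmqctl cluster_status   # cluster membership/partitions as seen from inside the container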


b8c111a1e33        192.168.24.1:8787/rhosp13/openstack-neutron-metadata-agent-ovn:2018-07-11.1   "/docker_puppet_ap..."   10 hours ago        Exited (0) 10 hours ago                       setup_ovs_manager
b6d1fef20f16        192.168.24.1:8787/rhosp13/openstack-ovn-controller:2018-07-11.1               "/bin/bash -c 'CMS..."   10 hours ago        Exited (0) 10 hours ago                       configure_cms_options


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Run the minor update job:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-13_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/
2. Observe that 42 scenario tests that try to boot instances fail.

Actual results:
42 scenario tests that try to boot instances fail; the instances remain in BUILD status.

Expected results:
All tests should pass and instances should boot successfully.

Additional info:
The sos-report is provided in the first comment.

Comment 4 Jose Luis Franco 2018-07-12 10:52:11 UTC
It looks like the nova-compute service is down:

(overcloud) [stack@undercloud-0 ~]$ openstack compute service list
+----+------------------+--------------------------+----------+---------+-------+----------------------------+
| ID | Binary           | Host                     | Zone     | Status  | State | Updated At                 |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+
|  2 | nova-scheduler   | controller-1.localdomain | internal | enabled | up    | 2018-07-12T10:20:46.000000 |
|  5 | nova-consoleauth | controller-1.localdomain | internal | enabled | up    | 2018-07-12T10:20:41.000000 |
|  8 | nova-scheduler   | controller-2.localdomain | internal | enabled | up    | 2018-07-12T10:20:46.000000 |
| 11 | nova-conductor   | controller-1.localdomain | internal | enabled | up    | 2018-07-12T10:20:40.000000 |
| 14 | nova-consoleauth | controller-2.localdomain | internal | enabled | up    | 2018-07-12T10:20:47.000000 |
| 17 | nova-compute     | compute-1.localdomain    | nova     | enabled | up    | 2018-07-12T10:20:44.000000 |
| 20 | nova-conductor   | controller-2.localdomain | internal | enabled | up    | 2018-07-12T10:20:43.000000 |
| 26 | nova-compute     | compute-0.localdomain    | nova     | enabled | down  | 2018-07-12T08:20:12.000000 |
| 41 | nova-scheduler   | controller-0.localdomain | internal | enabled | up    | 2018-07-12T10:20:41.000000 |
| 50 | nova-consoleauth | controller-0.localdomain | internal | enabled | up    | 2018-07-12T10:20:41.000000 |
| 65 | nova-conductor   | controller-0.localdomain | internal | enabled | up    | 2018-07-12T10:20:44.000000 |
+----+------------------+--------------------------+----------+---------+-------+----------------------------+
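compute-0 is reported down. A hedged sketch of how its container state and logs can be pulled on the node itself (the nova_compute container name matches the one referenced below):

[heat-admin@compute-0 ~]$ sudo docker ps -a --filter name=nova_compute    # container state on the affected compute
[heat-admin@compute-0 ~]$ sudo docker logs --tail 200 nova_compute        # source of the traceback shown below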


I went to check the logs of the nova_compute container on compute-0 and found the following traceback:

Running command: '/usr/bin/nova-compute '
/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py:332: NotSupportedWarning: Configuration option(s) ['use_tpool'] not supported
  exception.NotSupportedWarning
+ sudo -E kolla_set_configs
INFO:__main__:Loading config file at /var/lib/kolla/config_files/config.json
INFO:__main__:Validating config file
INFO:__main__:Kolla config strategy set to: COPY_ALWAYS
INFO:__main__:Copying service configuration files
INFO:__main__:Deleting /etc/libvirt/libvirtd.conf
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/libvirt/libvirtd.conf to /etc/libvirt/libvirtd.conf
INFO:__main__:Deleting /etc/libvirt/passwd.db
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/libvirt/passwd.db to /etc/libvirt/passwd.db
INFO:__main__:Deleting /etc/libvirt/qemu.conf
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/libvirt/qemu.conf to /etc/libvirt/qemu.conf
INFO:__main__:Deleting /etc/my.cnf.d/tripleo.cnf
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/my.cnf.d/tripleo.cnf to /etc/my.cnf.d/tripleo.cnf
INFO:__main__:Deleting /etc/nova/migration/authorized_keys
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/nova/migration/authorized_keys to /etc/nova/migration/authorized_keys
INFO:__main__:Deleting /etc/nova/migration/identity
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/nova/migration/identity to /etc/nova/migration/identity
INFO:__main__:Deleting /etc/nova/nova.conf
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/nova/nova.conf to /etc/nova/nova.conf
INFO:__main__:Deleting /etc/sasl2/libvirt.conf
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/sasl2/libvirt.conf to /etc/sasl2/libvirt.conf
INFO:__main__:Deleting /etc/ssh/sshd_config
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/ssh/sshd_config to /etc/ssh/sshd_config
INFO:__main__:Deleting /var/lib/nova/.ssh/config
ERROR:__main__:Unexpected error:
Traceback (most recent call last):
  File "/usr/local/bin/kolla_set_configs", line 411, in main
    execute_config_strategy(config)
  File "/usr/local/bin/kolla_set_configs", line 377, in execute_config_strategy
    copy_config(config)
  File "/usr/local/bin/kolla_set_configs", line 306, in copy_config
    config_file.copy()
  File "/usr/local/bin/kolla_set_configs", line 150, in copy
    self._merge_directories(source, dest)
  File "/usr/local/bin/kolla_set_configs", line 97, in _merge_directories
    os.path.join(dest, to_copy))
  File "/usr/local/bin/kolla_set_configs", line 97, in _merge_directories
    os.path.join(dest, to_copy))
  File "/usr/local/bin/kolla_set_configs", line 97, in _merge_directories
    os.path.join(dest, to_copy))
  File "/usr/local/bin/kolla_set_configs", line 97, in _merge_directories
    os.path.join(dest, to_copy))
  File "/usr/local/bin/kolla_set_configs", line 99, in _merge_directories
    self._copy_file(source, dest)
  File "/usr/local/bin/kolla_set_configs", line 75, in _copy_file
    self._delete_path(dest)
  File "/usr/local/bin/kolla_set_configs", line 108, in _delete_path
    os.remove(path)
OSError: [Errno 2] No such file or directory: '/var/lib/nova/.ssh/config'
+ sudo -E kolla_set_configs
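The traceback shows kolla_set_configs failing with ENOENT while deleting /var/lib/nova/.ssh/config before copying the new version into place. In shell terms, the failure and a tolerant variant look roughly like this (a sketch only, not the actual kolla code; the source path just mirrors the pattern of the copy lines above):

$ rm /var/lib/nova/.ssh/config        # what os.remove() does: fails if the file is already gone
$ rm -f /var/lib/nova/.ssh/config     # a tolerant delete ignores a missing path (ENOENT)
$ cp /var/lib/kolla/config_files/src/var/lib/nova/.ssh/config /var/lib/nova/.ssh/config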


However, from what I can see in the nova-compute logs, the tempest test cases managed to create and delete some other instances:

2018-07-11 23:47:50.839 1 INFO nova.compute.manager [req-b0b534ed-48f8-413e-a3c2-2d9bafd579fa 84d1c3f3bb2840cebd562c9518a3a31d a88b57c748ab477a90052a31ea1bc48e - default default] [instance: 3bc365c4-c8fd-4cdc-be69-2431b1944f1e] VM Started (Lifecycle Event)

2018-07-11 23:48:05.947 1 INFO nova.compute.manager [req-88904359-c26a-479c-bd86-7cd5699a5603 84d1c3f3bb2840cebd562c9518a3a31d a88b57c748ab477a90052a31ea1bc48e - default default] [instance: 3bc365c4-c8fd-4cdc-be69-2431b1944f1e] VM Resumed (Lifecycle Event)

2018-07-11 23:49:07.760 1 INFO nova.scheduler.client.report [req-70ec49ce-1342-4527-b13b-9fe0f82e8c10 84d1c3f3bb2840cebd562c9518a3a31d a88b57c748ab477a90052a31ea1bc48e - default default] Deleted allocation for instance 3bc365c4-c8fd-4cdc-be69-2431b1944f1e

For that reason, I wonder whether this is a side effect of the running tempest tests. In any case, I will continue debugging the issue.

Comment 6 Jiri Stransky 2018-07-12 10:54:22 UTC
There are a lot of DuplicateMessageError messages on the controllers in /var/log/containers/nova/nova-scheduler.log. That is one of the main symptoms of bug 1592528, so I wonder if we're hitting that one.
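A hedged way to quantify this on a controller:

$ sudo grep -c DuplicateMessageError /var/log/containers/nova/nova-scheduler.log   # rough count of duplicate-message errors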

Comment 7 Eran Kuris 2018-07-12 13:08:58 UTC
After restarting the rabbitmq-bundle, the issue disappeared.
It looks like this is a duplicate of bug 1592528.
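For the record, the restart was presumably done through the pacemaker-managed bundle resource (the exact invocation is an assumption):

$ sudo pcs resource restart rabbitmq-bundle    # run on one controller; restarts the bundle's replicas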

*** This bug has been marked as a duplicate of bug 1592528 ***

