Bug 2112909 - [OVS] VM status ERROR
Summary: [OVS] VM status ERROR
Keywords:
Status: CLOSED DUPLICATE of bug 2115383
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 17.0 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Slawek Kaplonski
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-08-01 13:32 UTC by Fiorella Yanac
Modified: 2022-08-04 14:55 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-04 14:55:13 UTC
Target Upstream Version:
Embargoed:
fyanac: needinfo-




Links
Red Hat Issue Tracker OSP-17933 (last updated 2022-08-01 13:38:24 UTC)

Description Fiorella Yanac 2022-08-01 13:32:18 UTC
Description of problem:


Version-Release number of selected component (if applicable):
core_puddle: RHOS-17.0-RHEL-9-20220727.n.0
core_puddle: RHOS-17.0-RHEL-9-20220721.n.1

How reproducible:
After running the job: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/network/view/neutron/job/DFG-network-neutron-17.0_director-rhel-virthost-3cont_2comp-ml2ovs-ipv4-vxlan-tobiko-tempest-dvr/

Steps to Reproduce:
1. Manually create a VM on compute-1; the VM status ends up in ERROR.
2. After rebooting compute-1, the VM can be created without problems.

Both compute nodes (0 and 1) show state ACTIVE when running metalsmith list.
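
For reference, a minimal sketch of step 1 using the openstack CLI (flavor, image and network names are placeholders); the resulting fault is shown below:

# Create the VM and read back its status and fault field.
openstack server create --flavor m1.small --image cirros --network private test-vm
openstack server show test-vm -c status -c fault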

VM fault from the server description:

 | {'code': 500, 'created': '2022-08-02T08:09:16Z',
   'message': 'Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance d18c95a5-05f0-4924-bad0-90ed0e26c4cc.',
   'details': 'Traceback (most recent call last):\n  File "/usr/lib/python3.9/site-packages/nova/conductor/manager.py", line 665, in build_instances\n    raise exception.MaxRetriesExceeded(reason=msg)\nnova.exception.MaxRetriesExceeded: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance d18c95a5-05f0-4924-bad0-90ed0e26c4cc.\n'}


In the logs:
INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Server unexpectedly closed connection

Jenkins job, compute-1 logs:

http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-neutron-17.0_director-rhel-virthost-3cont_2comp-ml2ovs-ipv4-vxlan-tobiko-tempest-dvr/45/compute-1/var/log/containers/nova/nova-compute.log.gz

Comment 1 Slawek Kaplonski 2022-08-02 14:53:21 UTC
I took a look at the env today and it seems to me that it may be some issue with e.g. podman (or something else in the operating system on the host). On the env I got access to, neutron-ovs-agent on compute-0 and compute-1 was stuck, and nova-compute on compute-1 was stuck as well.
By "stuck" I mean: the process was visible in the "ps aux" output, the container was shown as running and healthy in the "podman ps" output, and "systemctl status tripleo_neutron_ovs_agent" showed the process as Active. But there was nothing in the log file and the process wasn't sending any messages to the server over RPC, so it was marked as down by Neutron (or Nova in the case of nova-compute).
We tried strace on that process and it was stuck on something like:

strace: Process 25934 attached
futex(0x7fb984000b60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY^Cstrace: Process 25934 detached                                                                             
 <detached ...>

and nothing more happened there.

When we restarted the container with the "podman restart neutron_ovs_agent" command, it immediately started working properly.

Comment 2 Artom Lifshitz 2022-08-02 15:17:48 UTC
> After run job: https://rhos-ci-jenkins.lab.eng.tlv2.redhat.com/view/DFG/view/network/view/neutron/job/DFG-network-neutron-17.0_director-rhel-virthost-3cont_2comp-ml2ovs-ipv4-vxlan-tobiko-tempest-dvr/

That's just the overall job that failed; is there a specific run # you can link to where the test failure is apparent?

In general, the triage process would look something like this:

1. Identify the tempest (or tobiko) test that failed in a specific CI run.
2. Look through the code (unfortunate, but Tempest sucks in that sense) to see which REST API request failed. It'll have a request ID.
3. Track that request ID through the Nova logs (all the way from the API down to the compute if necessary) to understand why it failed. There'll usually be a traceback somewhere towards the end. The tool `os-log-merger` can help display all the nova logs in a chronological fashion (see the sketch below).
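
A minimal sketch of steps 2-3, assuming the job logs were downloaded locally (directory names are placeholders based on the job's log layout) and using a placeholder request ID:

# Placeholder request ID taken from the failed Tempest/Tobiko API call.
REQ_ID=req-00000000-0000-0000-0000-000000000000

# Follow the request from the API down to the compute node.
grep -r "$REQ_ID" controller-0/var/log/containers/nova/ compute-1/var/log/containers/nova/

# Merge the relevant nova logs chronologically with os-log-merger.
os-log-merger controller-0/var/log/containers/nova/nova-api.log \
    controller-0/var/log/containers/nova/nova-conductor.log \
    compute-1/var/log/containers/nova/nova-compute.log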

Comment 4 Slawek Kaplonski 2022-08-03 10:54:30 UTC
Thanks Artom and Sean for taking a look at it. I also don't think it is really a nova or neutron issue. It is also not related to one specific test which I can point to. It's more like (probably) after the tobiko faults tests run, some containers are stuck (sometimes it's nova-compute and sometimes some neutron-related agents); because of that, those services are later reported as down and spawning new VMs is not possible.
I'm still investigating that issue. I tried strace on the stuck container's process and I see only something like:

strace -p 26653
strace: Process 26653 attached
restart_syscall(<... resuming interrupted read ...>^Cstrace: Process 26653 detached
 <detached ...>


And nothing more.

I also checked the stdout log from the failed container(s), and in each case the last thing I saw there was:

2022-08-02T17:57:09.057380740+00:00 stderr F Traceback (most recent call last):
2022-08-02T17:57:09.057380740+00:00 stderr F   File "/usr/lib/python3.9/site-packages/eventlet/hubs/hub.py", line 476, in fire_timers
2022-08-02T17:57:09.057380740+00:00 stderr F     timer()
2022-08-02T17:57:09.057380740+00:00 stderr F   File "/usr/lib/python3.9/site-packages/eventlet/hubs/timer.py", line 59, in __call__
2022-08-02T17:57:09.057380740+00:00 stderr F     cb(*args, **kw)
2022-08-02T17:57:09.057380740+00:00 stderr F   File "/usr/lib/python3.9/site-packages/eventlet/semaphore.py", line 152, in _do_acquire
2022-08-02T17:57:09.057380740+00:00 stderr F     waiter.switch()
2022-08-02T17:57:09.057380740+00:00 stderr F greenlet.error: cannot switch to a different thread
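
For reference, a rough sketch of checking those container stdout logs directly on the compute node (exact file names are assumed from the log links in this bug):

# Container stdout/stderr streams are written under /var/log/containers/stdouts/.
sudo tail -n 50 /var/log/containers/stdouts/nova_compute.log
sudo tail -n 50 /var/log/containers/stdouts/neutron_ovs_agent.log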

Comment 5 smooney 2022-08-03 13:34:31 UTC
One thing I did notice is that there is at least one uncaught exception in nova from a failed live migration due to post-copy.
I don't think that would cause the issues we are seeing; just noting that the env is trying to test live migration without any of the workarounds or the fix that is in flight.

http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-neutron-17.0_director-rhel-virthost-3cont_2comp-ml2ovs-ipv4-vxlan-tobiko-tempest-dvr/45/compute-1/var/log/containers/stdouts/nova_compute.log.gz

One thing you mentioned on IRC was that you restarted the container with Podman to try to fix it.
That is generally unsafe and can cause deadlocks.

Containers must be restarted using the systemd service files only, and never via Podman or Paunch.
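
For illustration, the safe restart of the agent from comment 1 would go through its systemd unit, roughly:

# Safe: let the systemd unit manage the container lifecycle.
sudo systemctl restart tripleo_neutron_ovs_agent

# Unsafe in a TripleO deployment: bypasses systemd and can leak the pid file.
# sudo podman restart neutron_ovs_agent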

I'm not sure if tobiko is using Podman directly, but that could cause issues if it was.

For example, restarting the nova_libvirt containers with Podman can cause them to fail to start, as it will leak a reference to the pid file:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-neutron-17.0_director-rhel-virthost-3cont_2comp-ml2ovs-ipv4-vxlan-tobiko-tempest-dvr/45/compute-1/etc/systemd/system/tripleo_nova_virtqemud.service.gz
The neutron L2 agent also uses a pid file the same way, so it is unsafe to restart it with Podman:
http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/rcj/DFG-network-neutron-17.0_director-rhel-virthost-3cont_2comp-ml2ovs-ipv4-vxlan-tobiko-tempest-dvr/45/compute-1/etc/systemd/system/tripleo_neutron_ovs_agent.service.gz

Comment 6 Slawek Kaplonski 2022-08-03 14:24:13 UTC
Thanks Sean. That is very helpful, and indeed, given that this seems to happen only in jobs which run tobiko tests, and the fact that tobiko is indeed using "podman restart ..." to restart containers, that may be the case here.
So I'm moving it to the Tobiko Trello board (https://trello.com/c/fP6osOu4), as it seems to be a Tobiko bug for now, and closing this BZ.

