comment 62 strongly suggests that something occurring in the instance loop in LibvirtDriver._get_disk_over_committed_size_total() is preventing the db servicegroup's report_state() from being executed. The fact that report_state() runs in its own greenthread leads us back more strongly to the theory that something is starving other greenthreads.
Interestingly, there's an explicit scheduling point in that loop which was added by: https://review.opendev.org/#/c/12335/ It looks like somebody has hit this before.
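That explicit scheduling point matters because greenthreads only switch cooperatively. Here's a toy illustration of the starvation effect using plain Python generators as a stand-in round-robin scheduler (this is not eventlet itself; disk_scan, heartbeat, and run are invented names for the sketch, with the per-iteration yield playing the role of greenthread.sleep(0)):

```python
def disk_scan(events, n, cooperative):
    for i in range(n):
        events.append("scan")  # stand-in for expensive per-instance work
        if cooperative:
            yield  # explicit scheduling point, like greenthread.sleep(0)
    yield

def heartbeat(events):
    # Stand-in for the servicegroup's periodic report_state().
    while True:
        events.append("beat")
        yield

def run(tasks, steps):
    # Naive round-robin scheduler standing in for eventlet's hub:
    # each task runs until it yields.
    for _ in range(steps):
        for t in tasks:
            next(t, None)

# Without the scheduling point, the entire scan runs before the first beat:
ev = []
run([disk_scan(ev, 5, cooperative=False), heartbeat(ev)], 3)
print(ev[:6])  # ['scan', 'scan', 'scan', 'scan', 'scan', 'beat']

# With it, the heartbeat interleaves with the scan:
ev = []
run([disk_scan(ev, 5, cooperative=True), heartbeat(ev)], 3)
print(ev[:4])  # ['scan', 'beat', 'scan', 'beat']
```

The same logic applies in nova: if the loop body blocks without ever yielding, the heartbeat greenthread never gets a turn until the loop finishes.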
As we're currently suspecting a nova/libvirt interaction issue, I took a look at how the libvirt threadpool works. It's set up here:

    return tpool.proxy_call(
        (libvirt.virDomain, libvirt.virConnect),
        libvirt.openAuth, uri, auth, flags)

This translates to: using a threadpool, execute libvirt.openAuth(uri, auth, flags). If the returned object is a libvirt.virDomain or libvirt.virConnect, wrap it in a threadpool Proxy with the same autowrap rules.

However, we are using more libvirt object types than this. After discussion with Dan Berrangé, we think we need to add at least libvirt.virNodeDevice, libvirt.virSecret, and libvirt.NWFilter. This whitelist is obviously fragile.

The problem appears to be worse than this, though. I note that the loop I identified above as the probable source of the pause iterates over a list of instances generated by list_instance_domains, which was ultimately returned by libvirt's listAllDomains(). Note that this returns a *list* of virDomain objects, and the tpool proxy won't match a list against our autowrap list: https://github.com/eventlet/eventlet/blob/ca8dd0748a1985a409e9a9a517690f46e05cae99/eventlet/tpool.py#L149

This means we're iterating over a bunch of un-proxied libvirt domains. A quick test of libvirt and tpool confirms this to be the case. At this point I'm almost certain this is the root cause of the issue.
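A minimal, self-contained sketch of that autowrap gap, using stand-in classes rather than the real eventlet or libvirt objects (Domain, Proxy, proxy_call, and list_all_domains below are simplified imitations, not the actual implementations):

```python
class Domain:
    """Stand-in for libvirt.virDomain."""

class Proxy:
    """Stand-in for eventlet.tpool.Proxy: marks an object as proxied."""
    def __init__(self, obj, autowrap):
        self.obj = obj
        self.autowrap = autowrap

def proxy_call(autowrap, func, *args, **kwargs):
    # Simplified version of the check at eventlet/tpool.py#L149:
    # only the returned object itself is tested against autowrap,
    # never the elements of a returned container.
    rv = func(*args, **kwargs)
    if isinstance(rv, autowrap):
        rv = Proxy(rv, autowrap)
    return rv

def list_all_domains():
    # Stand-in for virConnect.listAllDomains(): returns a *list* of domains.
    return [Domain(), Domain()]

result = proxy_call((Domain,), list_all_domains)

# The list itself is not a Domain, so nothing gets wrapped...
assert not isinstance(result, Proxy)
# ...and every element we iterate over is an un-proxied Domain.
assert all(isinstance(d, Domain) and not isinstance(d, Proxy) for d in result)
```

So any caller iterating over the returned list ends up making native libvirt calls directly from the event-loop thread, which is exactly the blocking behaviour suspected above.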
*** Bug 1749447 has been marked as a duplicate of this bug. ***
Hi, Given that 1749447 is a duplicate of this case, and given that my customer needs visibility into this BZ, I made it public. I'm going through the relevant comments on this BZ that can be made public and don't expose sensitive customer information, so that other customers can follow the progress. Please keep this in mind when posting follow-ups and data here. Thanks, Andreas
Hi, Do we have a rough ETA for this for my customer from BZ 1749447? Thanks, Andreas
Or would it make sense to confirm this with a test build?
Hi, Do we have a test build and/or ETA and/or some other way of verifying that my customer is indeed hitting this issue? Thanks, Andreas
Hi, Do we have a rough ETA for when this fix will be released? The customer is still hitting this issue and I'd like to send them a test build for verification, or ideally a hotfix. Thanks, Andreas
First thing I see worth mentioning in the latest sosreport... It looks like this change is halfway implemented: https://review.opendev.org/#/c/503788/

I can see the net_ticktime option is present in rabbitmq.config:

    # grep ticktime ./var/lib/config-data/puppet-generated/rabbitmq/etc/rabbitmq/rabbitmq.config
    {net_ticktime, 15}

However, I also see the low-level tuning parameters are still present in rabbitmq-env.conf:

    # grep raw,6,18 ./var/lib/config-data/puppet-generated/rabbitmq/etc/rabbitmq/rabbitmq-env.conf
    RABBITMQ_SERVER_ERL_ARGS="+K true +P 1048576 -kernel inet_default_connect_options [{nodelay,true},{raw,6,18,<<30000:64/native>>}] -kernel inet_default_listen_options [{raw,6,18,<<30000:64/native>>}]"

That line should be changed (per the upstream patch linked above) to read:

    RABBITMQ_SERVER_ERL_ARGS="+K true +P 1048576 -kernel inet_default_connect_options [{nodelay,true}]"

I suspect this may be exacerbating problems in conjunction with the blocking behavior of nova-compute. Those flags set the TCP_USER_TIMEOUT option on the socket to 30 seconds. This causes the connection to be reset if a packet is sent but no acknowledgement is received within 30 seconds. It *also* triggers if a packet cannot be sent because it has been queued on the sending side for 30 seconds. A blocked client on the other end of the connection can cause this timeout to trigger.
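For reference, the {raw,6,18,<<30000:64/native>>} Erlang tuple is a raw setsockopt: protocol 6 is IPPROTO_TCP and option 18 is TCP_USER_TIMEOUT on Linux, with a value of 30000 ms. A sketch of the equivalent in Python, assuming a Linux host (the numeric constant 18 comes from <linux/tcp.h>; newer Pythons also expose it as socket.TCP_USER_TIMEOUT):

```python
import socket

TCP_USER_TIMEOUT = 18  # from <linux/tcp.h>; Linux-only option number

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Matches the {raw,6,18,<<30000:64/native>>} tuple: 30000 ms user timeout.
s.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, 30000)

# Unacknowledged (or unsendable, e.g. zero-window) data older than 30 s
# will now cause the kernel to reset the connection.
timeout_ms = s.getsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT)
print(timeout_ms)  # 30000
s.close()
```

Removing the raw tuples from RABBITMQ_SERVER_ERL_ARGS falls back to the kernel default (no user timeout), so a temporarily blocked peer no longer gets its connection reset after 30 seconds.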
The sequence is:

- nova-compute gets blocked for $reasons
- the event loop in nova-compute no longer triggers task switching to call recv on the socket from rabbitmq
- rabbitmq is still sending messages down the wire to nova-compute on the other end
- the kernel receives and acknowledges the packets and stores them in the recv buffer for the nova-compute socket
- eventually the recv buffer for the nova-compute process fills up, so the tcp window goes to 0 and no more packets can be sent from rabbitmq
- 30 seconds pass and the socket is reset on the rabbitmq side of the connection by the kernel

I doubt this change will significantly improve the stability of the environment, but it should help cut down on some of the noise in the logs and reduce the number of disconnect/reconnect cycles. I see 49 instances in the rabbitmq logs that match "{inet_error,etimedout}", which is how connections log when they fail in this manner. This change should make those errors stop.
Hi, With regards to: https://bugs.launchpad.net/nova/+bug/1840912 I forwarded a container with the build from: https://bugzilla.redhat.com/show_bug.cgi?id=1733930#c81 to the customer and it seems to have fixed the issue. I'm going to attach the files for building the container. The customer would like to know when this fix will be available; I see it's still upstream-only. Thanks, Andreas
The change is already upstream, working on backports
Any update on the hotfix?
(In reply to jhardee from comment #99) > Any update on the hotfix? This has been built since 15th October!
Hi team, The patch worked. Thank you.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0759
*** Bug 1894699 has been marked as a duplicate of this bug. ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days