Bug 1733930 - [OSP13] stack creation struck in CREATE_IN_PROGRESS and its deleting after timeout [NEEDINFO]
Summary: [OSP13] stack creation struck in CREATE_IN_PROGRESS and its deleting after ti...
Keywords:
Status: VERIFIED
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 13.0 (Queens)
Hardware: All
OS: All
urgent
urgent
Target Milestone: z11
: 13.0 (Queens)
Assignee: Matthew Booth
QA Contact: Paras Babbar
URL:
Whiteboard:
: 1749447 (view as bug list)
Depends On: 1754479 1754482
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-07-29 08:57 UTC by Nilesh
Modified: 2020-02-20 18:11 UTC (History)
36 users (show)

Fixed In Version: openstack-nova-17.0.12-2.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1754479 1754482 (view as bug list)
Environment:
Last Closed:
Target Upstream Version:
akaris: needinfo? (jhardee)


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Launchpad 1840912 None None None 2019-08-21 11:48:21 UTC
OpenStack gerrit 677736 'None' MERGED libvirt: Fix service-wide pauses caused by un-proxied libvirt calls 2020-02-17 17:41:09 UTC
OpenStack gerrit 683930 'None' NEW libvirt: Fix service-wide pauses caused by un-proxied libvirt calls 2020-02-17 17:41:09 UTC
Red Hat Bugzilla 1740681 'high' 'CLOSED' 'heartbeats missed and connection timeout' 2019-12-06 02:08:57 UTC

Comment 63 Matthew Booth 2019-08-20 15:22:12 UTC
comment 62 strongly suggests that something occurring in the instance loop in LibvirtDriver._get_disk_over_committed_size_total() is preventing the db servicegroup's report_state() from being executed. This is executed in its own greenthread leads us back more strongly to a theory about something starving other greenthreads.

Comment 64 Matthew Booth 2019-08-20 15:23:01 UTC
Interestingly, there's an explicit scheduling point in that loop which was added by:

https://review.opendev.org/#/c/12335/

It looks like somebody has hit this before.

Comment 67 Matthew Booth 2019-08-21 09:47:55 UTC
As we're currently suspecting a nova/libvirt interaction issue, I took a look into how the libvirt threadpool works. It's done here:

        return tpool.proxy_call(
            (libvirt.virDomain, libvirt.virConnect),
            libvirt.openAuth, uri, auth, flags)

This translates to: Using a threadpool, execute libvirt.openAuth(uri, auth, flags). If the returned object is a libvirt.virDomain or libvirt.virConnect, wrap the returned object in a threadpool Proxy with the same autowrap rules.

However, we are using more libvirt objects than this. After discussion with Dan Berrangé, we think we need to add at least: libvirt.virNodeDevice, libvirt.virSecret, libvirt.NWFilter. Given that this whitelist is obviously fragile, we 

However, the problem appears be worse than this. I note that the loop I identified above as being the probable source of the pause iterates over a list of instances generated by list_instance_domains, which was ultimately returned by libvirt's listAllDomains(). Note that this returns a *list* of virDomain. However, the tpool proxy won't match this from our autowrap list:

https://github.com/eventlet/eventlet/blob/ca8dd0748a1985a409e9a9a517690f46e05cae99/eventlet/tpool.py#L149

This means we're iterating over a bunch of un-proxied libvirt domains. A quick test of libvirt and tpool confirms this to be the case. At this point I'm almost certain this is the root cause of the issue.

Comment 75 Matthew Booth 2019-09-06 13:35:16 UTC
*** Bug 1749447 has been marked as a duplicate of this bug. ***

Comment 76 Andreas Karis 2019-09-06 16:43:26 UTC
Hi,

Given that 1749447 is a duplicate of this case and given that my customer needs visibility into this BZ, I made it public. I'm going through relevant comments on this very BZ which can be public and don't expose sensitive customer information so that other customers can follow the progress. Please keep this in mind when posting follow-ups and data, here.

THanks,

Andreas

Comment 77 Andreas Karis 2019-09-06 16:47:39 UTC
Hi,

Do we have a rough ETA for this for my customer from BZ 1749447?

Thanks,

Andreas

Comment 78 Andreas Karis 2019-09-06 16:48:08 UTC
Or would it make sense to confirm this with a test build?

Comment 79 Andreas Karis 2019-09-09 23:00:29 UTC
Hi,

Do we have a test build and or ETA and/or other way of verifying that my customer indeed hits this issue?

Thanks,

Andreas

Comment 80 Andreas Karis 2019-09-13 19:15:31 UTC
Hi,

Do we have a rough ETA for when this fix will be released? The customer is still hitting this issue and I'd like to send them a test build for verification, or ideally a hotfix.

Thanks,

Andreas

Comment 88 John Eckersberg 2019-09-26 21:17:27 UTC
First thing I see worth mentioning in the latest sosreport...

It looks like this change is halfway implemented:

https://review.opendev.org/#/c/503788/

I can see the net_ticktime option is present in rabbitmq.config:

# grep ticktime ./var/lib/config-data/puppet-generated/rabbitmq/etc/rabbitmq/rabbitmq.config
    {net_ticktime, 15}

However, I also see the low-level tuning parameters are still present in rabbitmq-env.conf:

# grep raw,6,18 ./var/lib/config-data/puppet-generated/rabbitmq/etc/rabbitmq/rabbitmq-env.conf 
RABBITMQ_SERVER_ERL_ARGS="+K true +P 1048576 -kernel inet_default_connect_options [{nodelay,true},{raw,6,18,<<30000:64/native>>}] -kernel inet_default_listen_options [{raw,6,18,<<30000:64/native>>}]"

That line should be changed (per the upstream patch linked above) to read:

RABBITMQ_SERVER_ERL_ARGS="+K true +P 1048576 -kernel inet_default_connect_options [{nodelay,true}]"

I suspect this may be exacerbating problems in conjunction with the blocking behavior of nova-compute.

Those flags set the TCP_USER_TIMEOUT option on the socket to 30 seconds.  This causes the connection to be reset if a packet is sent but no acknowledgement is received within 30 seconds.  It *also* triggers if a packet cannot be sent because it is queued on the sending side for 30 seconds.

A blocked client on the other end of the connection can cause this timeout to trigger.  The sequence is:

- nova-compute gets blocked for $reasons
- the event loop in nova-compute no longer triggers task switching to call recv on the socket from rabbitmq
- rabbitmq is still sending messages down the wire to nova-compute on the other end
- the kernel receives and acknowledges the packets and stores them in the recv buffer for the nova-compute socket
- eventually the recv buffer for the nova-compute process fills up so the tcp window goes to 0 and no more packets can be sent from rabbitmq
- 30 seconds pass and the socket is reset on the rabbitmq side of the connection by the kernel

I doubt this change will significantly improve the stability of the environment, but it should help cut down on some of the noise in the logs and reduce the number of disconnect/reconnect cycles.

I see 49 instances in the rabbitmq logs that match "{inet_error,etimedout}" which is how connections log when they fail in this manner.  This should make those errors stop.

Comment 93 Andreas Karis 2019-10-07 15:44:38 UTC
Hi,

With regards to: https://bugs.launchpad.net/nova/+bug/1840912

I forwarded a container with the build from: https://bugzilla.redhat.com/show_bug.cgi?id=1733930#c81 to the customer and it seems to have fixed the issue. 

I'm going to attach the files for building the container. 

The customer would like to know when this fix will be available. It's still on upstream, I see.

Thanks,

Andreas

Comment 96 Brian 2019-10-10 18:45:12 UTC
The change is already upstream, working on backports

Comment 99 jhardee 2019-11-13 14:12:51 UTC
Any update on the hotfix?

Comment 100 Matthew Booth 2019-11-19 15:40:37 UTC
(In reply to jhardee from comment #99)
> Any update on the hotfix?

This has been built since 15th October!

Comment 104 jhardee 2020-01-16 15:53:03 UTC
Hi team,

The patched worked.

Thank you.


Note You need to log in before you can comment on or make changes to this bug.