comment 62 strongly suggests that something occurring in the instance loop in LibvirtDriver._get_disk_over_committed_size_total() is preventing the db servicegroup's report_state() from being executed. The fact that report_state() runs in its own greenthread leads us back more strongly to the theory that something is starving other greenthreads.
Interestingly, there's an explicit scheduling point in that loop which was added by: https://review.opendev.org/#/c/12335/ It looks like somebody has hit this before.
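That explicit scheduling point matters because greenthreads only switch cooperatively. Here's a toy illustration of the starvation effect using plain Python generators as a stand-in round-robin scheduler (this is not eventlet itself; disk_scan, heartbeat, and run are invented names for the sketch, with the per-iteration yield playing the role of greenthread.sleep(0)):

```python
def disk_scan(events, n, cooperative):
    for i in range(n):
        events.append("scan")  # stand-in for expensive per-instance work
        if cooperative:
            yield  # explicit scheduling point, like greenthread.sleep(0)
    yield

def heartbeat(events):
    # Stand-in for the servicegroup's periodic report_state().
    while True:
        events.append("beat")
        yield

def run(tasks, steps):
    # Naive round-robin scheduler standing in for eventlet's hub:
    # each task runs until it yields.
    for _ in range(steps):
        for t in tasks:
            next(t, None)

# Without the scheduling point, the entire scan runs before the first beat:
ev = []
run([disk_scan(ev, 5, cooperative=False), heartbeat(ev)], 3)
print(ev[:6])  # ['scan', 'scan', 'scan', 'scan', 'scan', 'beat']

# With it, the heartbeat interleaves with the scan:
ev = []
run([disk_scan(ev, 5, cooperative=True), heartbeat(ev)], 3)
print(ev[:4])  # ['scan', 'beat', 'scan', 'beat']
```

The same logic applies in nova: if the loop body blocks without ever yielding, the heartbeat greenthread never gets a turn until the loop finishes.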
As we're currently suspecting a nova/libvirt interaction issue, I took a look at how the libvirt threadpool works. It's set up here:

    return tpool.proxy_call(
        (libvirt.virDomain, libvirt.virConnect),
        libvirt.openAuth, uri, auth, flags)

This translates to: using a threadpool, execute libvirt.openAuth(uri, auth, flags). If the returned object is a libvirt.virDomain or libvirt.virConnect, wrap it in a threadpool Proxy with the same autowrap rules.

However, we are using more libvirt object types than this. After discussion with Dan Berrangé, we think we need to add at least libvirt.virNodeDevice, libvirt.virSecret, and libvirt.NWFilter. This whitelist is obviously fragile.

The problem appears to be worse than this, though. I note that the loop I identified above as the probable source of the pause iterates over a list of instances generated by list_instance_domains, which was ultimately returned by libvirt's listAllDomains(). Note that this returns a *list* of virDomain objects, and the tpool proxy won't match a list against our autowrap list: https://github.com/eventlet/eventlet/blob/ca8dd0748a1985a409e9a9a517690f46e05cae99/eventlet/tpool.py#L149

This means we're iterating over a bunch of un-proxied libvirt domains. A quick test of libvirt and tpool confirms this to be the case. At this point I'm almost certain this is the root cause of the issue.
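A minimal, self-contained sketch of that autowrap gap, using stand-in classes rather than the real eventlet or libvirt objects (Domain, Proxy, proxy_call, and list_all_domains below are simplified imitations, not the actual implementations):

```python
class Domain:
    """Stand-in for libvirt.virDomain."""

class Proxy:
    """Stand-in for eventlet.tpool.Proxy: marks an object as proxied."""
    def __init__(self, obj, autowrap):
        self.obj = obj
        self.autowrap = autowrap

def proxy_call(autowrap, func, *args, **kwargs):
    # Simplified version of the check at eventlet/tpool.py#L149:
    # only the returned object itself is tested against autowrap,
    # never the elements of a returned container.
    rv = func(*args, **kwargs)
    if isinstance(rv, autowrap):
        rv = Proxy(rv, autowrap)
    return rv

def list_all_domains():
    # Stand-in for virConnect.listAllDomains(): returns a *list* of domains.
    return [Domain(), Domain()]

result = proxy_call((Domain,), list_all_domains)

# The list itself is not a Domain, so nothing gets wrapped...
assert not isinstance(result, Proxy)
# ...and every element we iterate over is an un-proxied Domain.
assert all(isinstance(d, Domain) and not isinstance(d, Proxy) for d in result)
```

So any caller iterating over the returned list ends up making native libvirt calls directly from the event-loop thread, which is exactly the blocking behaviour suspected above.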
*** Bug 1749447 has been marked as a duplicate of this bug. ***
Hi, Given that 1749447 is a duplicate of this case, and given that my customer needs visibility into this BZ, I made it public. I'm going through the relevant comments on this BZ that can be made public and don't expose sensitive customer information, so that other customers can follow the progress. Please keep this in mind when posting follow-ups and data here. Thanks, Andreas
Hi, Do we have a rough ETA for this for my customer from BZ 1749447? Thanks, Andreas
Or would it make sense to confirm this with a test build?
Hi, Do we have a test build and/or ETA and/or some other way of verifying that my customer is indeed hitting this issue? Thanks, Andreas
Hi, Do we have a rough ETA for when this fix will be released? The customer is still hitting this issue and I'd like to send them a test build for verification, or ideally a hotfix. Thanks, Andreas
First thing I see worth mentioning in the latest sosreport... It looks like this change is halfway implemented: https://review.opendev.org/#/c/503788/

I can see the net_ticktime option is present in rabbitmq.config:

    # grep ticktime ./var/lib/config-data/puppet-generated/rabbitmq/etc/rabbitmq/rabbitmq.config
    {net_ticktime, 15}

However, I also see the low-level tuning parameters are still present in rabbitmq-env.conf:

    # grep raw,6,18 ./var/lib/config-data/puppet-generated/rabbitmq/etc/rabbitmq/rabbitmq-env.conf
    RABBITMQ_SERVER_ERL_ARGS="+K true +P 1048576 -kernel inet_default_connect_options [{nodelay,true},{raw,6,18,<<30000:64/native>>}] -kernel inet_default_listen_options [{raw,6,18,<<30000:64/native>>}]"

That line should be changed (per the upstream patch linked above) to read:

    RABBITMQ_SERVER_ERL_ARGS="+K true +P 1048576 -kernel inet_default_connect_options [{nodelay,true}]"

I suspect this may be exacerbating problems in conjunction with the blocking behavior of nova-compute. Those flags set the TCP_USER_TIMEOUT option on the socket to 30 seconds. This causes the connection to be reset if a packet is sent but no acknowledgement is received within 30 seconds. It *also* triggers if a packet cannot be sent because it has been queued on the sending side for 30 seconds. A blocked client on the other end of the connection can cause this timeout to trigger.
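For reference, the {raw,6,18,<<30000:64/native>>} Erlang tuple is a raw setsockopt: protocol 6 is IPPROTO_TCP and option 18 is TCP_USER_TIMEOUT on Linux, with a value of 30000 ms. A sketch of the equivalent in Python, assuming a Linux host (the numeric constant 18 comes from <linux/tcp.h>; newer Pythons also expose it as socket.TCP_USER_TIMEOUT):

```python
import socket

TCP_USER_TIMEOUT = 18  # from <linux/tcp.h>; Linux-only option number

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Matches the {raw,6,18,<<30000:64/native>>} tuple: 30000 ms user timeout.
s.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, 30000)

# Unacknowledged (or unsendable, e.g. zero-window) data older than 30 s
# will now cause the kernel to reset the connection.
timeout_ms = s.getsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT)
print(timeout_ms)  # 30000
s.close()
```

Removing the raw tuples from RABBITMQ_SERVER_ERL_ARGS falls back to the kernel default (no user timeout), so a temporarily blocked peer no longer gets its connection reset after 30 seconds.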
The sequence is:

- nova-compute gets blocked for $reasons
- the event loop in nova-compute no longer triggers task switching to call recv on the socket from rabbitmq
- rabbitmq is still sending messages down the wire to nova-compute on the other end
- the kernel receives and acknowledges the packets and stores them in the recv buffer for the nova-compute socket
- eventually the recv buffer for the nova-compute process fills up, so the tcp window goes to 0 and no more packets can be sent from rabbitmq
- 30 seconds pass and the socket is reset on the rabbitmq side of the connection by the kernel

I doubt this change will significantly improve the stability of the environment, but it should help cut down on some of the noise in the logs and reduce the number of disconnect/reconnect cycles. I see 49 instances in the rabbitmq logs that match "{inet_error,etimedout}", which is how connections log when they fail in this manner. This change should make those errors stop.
Hi, With regards to: https://bugs.launchpad.net/nova/+bug/1840912 I forwarded a container with the build from: https://bugzilla.redhat.com/show_bug.cgi?id=1733930#c81 to the customer and it seems to have fixed the issue. I'm going to attach the files for building the container. The customer would like to know when this fix will be available; I see it's still upstream-only. Thanks, Andreas
The change is already upstream, working on backports
Any update on the hotfix?
(In reply to jhardee from comment #99) > Any update on the hotfix? This has been built since 15th October!
Hi team, The patch worked. Thank you.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0759
*** Bug 1894699 has been marked as a duplicate of this bug. ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days