Bug 617310

Summary: dbus_connection_send_with_reply_and_block() takes longer than expected in a virtualized environment
Product: Red Hat Enterprise Linux 6
Reporter: Andrew Beekhof <abeekhof>
Component: dbus
Assignee: David King <dking>
Status: CLOSED WONTFIX
QA Contact: Desktop QE <desktop-qa-list>
Severity: medium
Priority: low
Version: 6.0
Keywords: RHELNAK
Target Milestone: rc
Hardware: All
OS: Linux
Last Closed: 2017-12-06 11:51:42 UTC

Description Andrew Beekhof 2010-07-22 18:44:04 UTC
Description of problem:

ricci uses dbus to perform various jobs. The jobs themselves take approx. 3 ms to complete, and on bare metal the call to dbus_connection_send_with_reply_and_block() returns after 10-20 ms.

We've noticed, however, that the same functionality takes much longer when performed on a virtual machine. The actions still take 3 ms or so, but it seems dbus_connection_send_with_reply_and_block() is slow to find out that the job is done.

I rebuilt dbus with --enable-verbose-mode, added timestamps to the logs, and added a few extra log points.

As far as I can tell, the client is in socket_do_iteration() calling:
      poll_res = _dbus_poll (&poll_fd, 1, poll_timeout);

and the dbus server is also blocked in poll:

  time in ms    pid            log message
4205254211 6380: unlock socket_do_iteration pre poll
4205254211 6380:   UNLOCK: _dbus_connection_unlock
4205254211 6380: _dbus_poll real polling
... more messages from 5592...
4205254222 5592: _dbus_poll real polling
4205255238 5592: _dbus_connection_handle_watch start
... more messages from 5592...
4205255254 6380: lock socket_do_iteration post poll

An strace (using the -Tv options) of the server would also seem to confirm this:

poll([{fd=4, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}], 6, -1) = 1 ([{fd=10, revents=POLLIN}]) <1.012872>
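The verbose-mode timestamps and the strace output both point at roughly a second spent blocked in poll() between the request being sent and the reply being noticed. A minimal sketch of the same measurement, using a plain socketpair in Python rather than dbus itself (the names and delay here are illustrative, not taken from the dbus code):

```python
import select
import socket
import threading
import time

# One side blocks in poll() waiting for a "reply"; the other side writes it
# after a delay.  We time how long poll() waited -- the same interval the
# verbose-mode log brackets around _dbus_poll() above.
a, b = socket.socketpair()

def delayed_reply(delay_s):
    time.sleep(delay_s)
    b.sendall(b"done")

threading.Thread(target=delayed_reply, args=(0.1,)).start()

p = select.poll()
p.register(a.fileno(), select.POLLIN)

start = time.monotonic()
events = p.poll()  # blocks until the reply arrives, like _dbus_poll
waited_ms = (time.monotonic() - start) * 1000.0

print(f"poll waited {waited_ms:.0f} ms, events={events}")
a.close()
b.close()
```

In the healthy bare-metal case the wait tracks the actual job time; the bug here is that the wait stretches to ~1 s even though the job itself finishes in ~3 ms.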

"help?"
 
Version-Release number of selected component (if applicable):

dbus-libs-1.2.24-2.el6.x86_64
dbus-1.2.24-2.el6.x86_64

How reproducible:

Every time

Steps to Reproduce:
1.  Install strace, ricci and modcluster in a virtual machine running rhel6

2. Run: service messagebus start

3. Run: service ricci start

4.  Paste the following into /tmp/job

<batch status="1"><module name="cluster"><request API_version="1.0"><function_call name="get_cluster.conf"></function_call></request></module></batch>

5.  strace -Tv -p $pidofdbus

6.  Run /usr/libexec/ricci/ricci-worker -f /tmp/job
  
Actual results:

One call to poll will take > 1s. Look for the pattern "<1." in the strace output.
For example:
poll([{fd=4, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}], 6, -1) = 1 ([{fd=10, revents=POLLIN}]) <1.012872>
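Scanning for that pattern can be automated. A small hypothetical helper (the slow_calls name and threshold are mine, not part of strace or ricci) that parses the trailing "<seconds>" annotation produced by strace -T:

```python
import re

# strace -T appends the elapsed time of each syscall as "<seconds>".
DURATION_RE = re.compile(r"<(\d+\.\d+)>\s*$")

def slow_calls(strace_lines, threshold_s=1.0):
    """Yield (duration, line) for each syscall slower than threshold_s."""
    for line in strace_lines:
        m = DURATION_RE.search(line)
        if m and float(m.group(1)) > threshold_s:
            yield float(m.group(1)), line

sample = [
    'poll([{fd=4, events=POLLIN}], 1, -1) = 1 ([{fd=4, revents=POLLIN}]) <0.000042>',
    'poll([{fd=10, events=POLLIN}], 1, -1) = 1 ([{fd=10, revents=POLLIN}]) <1.012872>',
]
for duration, line in slow_calls(sample):
    print(f"{duration:.6f}s  {line}")
```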


Expected results:

No call to poll takes anywhere near 1s

Additional info:

Comment 1 RHEL Program Management 2010-07-22 18:58:30 UTC
This issue was proposed while only blocker issues are being
considered for the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **

Comment 2 Andrew Beekhof 2010-07-22 20:07:58 UTC
Sorry, I forgot to mention that /etc/cluster/cluster.conf needs to exist before
executing step 6.
Try pasting the following...

[root@pcmk-4 ~]# cat /etc/cluster/cluster.conf 
<cluster config_version="12" name="blahblah">  
  <fence_daemon clean_start="0" post_fail_delay="1" post_join_delay="3"/>  
  <clusternodes>    
    <clusternode name="pcmk-2.beekhof.net" nodeid="1" votes="1"/>      
    <clusternode name="pcmk-3.beekhof.net" nodeid="2" votes="1"/>      
  </clusternodes>  
  <cman expected_votes="1" two_node="1"/>  
  <fencedevices>    
    <fencedevice agent="fence_xvm" name="virt-fence"/>    
  </fencedevices>  
  <rm>    
    <failoverdomains/>      
  </rm>  
</cluster>

Comment 6 Colin Walters 2010-08-06 16:06:03 UTC
First, don't call send_with_reply_and_block if you don't actually need the reply immediately.  Instead do asynchronous calls with _send_with_reply, or if you don't need the reply at all, just don't request one and use _send.

There's some unavoidable latency for round trips with dbus because it requires a few context switches (message sender, bus daemon, message recipient, bus daemon, message sender); in a virtualized environment this is likely to be magnified because the entire virtual OS needs to compete for scheduling with other OSes.

But again, your dbus experience is going to be significantly better if you avoid blocking while waiting for a response and instead do as many requests in parallel as possible.  For example, if you need to call a GetStatus() method on 15 objects, this should be obviously parallelizable.
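The effect of parallelizing round trips can be illustrated with simulated calls (this is a Python sketch, not the dbus C API; with libdbus the parallel variant corresponds to queueing calls via dbus_connection_send_with_reply() and collecting the pending replies). The GetStatus stand-in and its ~20 ms latency are assumptions for illustration:

```python
import time
from concurrent.futures import ThreadPoolExecutor

LATENCY_S = 0.02  # stand-in for one bus round trip

def get_status(obj_id):
    time.sleep(LATENCY_S)  # simulated round-trip latency
    return f"object {obj_id}: ok"

objects = range(15)

# Serial: each call blocks for the full round trip before the next starts.
start = time.monotonic()
serial = [get_status(o) for o in objects]
serial_s = time.monotonic() - start

# Parallel: all 15 requests are in flight at once.
start = time.monotonic()
with ThreadPoolExecutor(max_workers=15) as pool:
    parallel = list(pool.map(get_status, objects))
parallel_s = time.monotonic() - start

print(f"serial: {serial_s*1000:.0f} ms, parallel: {parallel_s*1000:.0f} ms")
```

Serial cost grows as 15 x latency; the parallel version pays the latency roughly once.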

Comment 7 Andrew Beekhof 2010-09-06 08:52:54 UTC
(In reply to comment #6)
> First, don't call send_with_reply_and_block if you don't actually need the
> reply immediately. 

But we do.  There's no parallelism in the caller; it's a single process with a single job to execute.

> There's some unavoidable latency for round trips with dbus because it requires
> a few context switches (message sender, bus daemon, message recipient, bus
> daemon, message sender); in a virtualized environment this is likely to be
> magnified because the entire virtual OS needs to compete for scheduling with
> other OSes.

This was on a 4 core machine with a single KVM guest running.

> But again, your dbus experience is going to be significantly better if you
> avoid blocking while waiting for a response and instead do as many requests in
> parallel as possible.  For example, if you need to call a GetStatus() method on
> 15 objects, this should be obviously parallelizable.

Not really relevant to our use-case I'm afraid.

Comment 8 RHEL Program Management 2011-01-07 15:37:22 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.

Comment 9 Jan Kurik 2017-12-06 11:51:42 UTC
Red Hat Enterprise Linux 6 is in the Production 3 Phase. During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available.

The official life cycle policy can be reviewed here:

http://redhat.com/rhel/lifecycle

This issue does not meet the inclusion criteria for the Production 3 Phase and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification. Note that a strong business justification will be required for re-evaluation. Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL:

https://access.redhat.com/