787196 – libqb spin in qb_loop with timerfd

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	libqb
Sub Component:
Version:	rawhide
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Assignee:	Angus Salkeld
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2012-02-03 13:37 UTC by Fabio Massimo Di Nitto
Modified:	2012-02-07 07:58 UTC
CC List:	2 users
Fixed In Version:	libqb-0.9.0-2.fc16
Clone Of:
Environment:
Last Closed:	2012-02-07 07:58:08 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
strace from node2 (3.24 MB, text/plain) 2012-02-03 13:38 UTC, Fabio Massimo Di Nitto

Description Fabio Massimo Di Nitto 2012-02-03 13:37:15 UTC

As we discussed on IRC, this is always reproducible in rawhide.

corosync spins at 100% CPU under certain conditions when timerfd is used.

After I built libqb without timerfd, the spinning seen at startup with logging.debug is also gone.

The following is a consistent reproducer, using dlm_controld.

Install libqb master/corosync master rpms and the dlm rpm from Fedora rawhide.

Relevant bits of corosync.conf:

compatibility: whitetank

quorum {
    provider: corosync_votequorum
    two_node: 1
    wait_for_all: 0
    last_man_standing: 0
    auto_tie_breaker: 0
}

nodelist {
        node {
                ring0_addr: 192.168.2.193
                nodeid: 1
        }
        node {
                ring0_addr: 192.168.2.194
                nodeid: 2
        }
}

logging {
        # Log the source file and line where messages are being
        # generated. When in doubt, leave off. Potentially useful for
        # debugging.
        fileline: off
        # Log to standard error. When in doubt, set to no. Useful when
        # running in the foreground (when invoking "corosync -f")
        to_stderr: yes
        # Log to a log file. When set to "no", the "logfile" option
        # must not be set.
        to_logfile: yes
        logfile: /var/log/cluster/corosync.log
        # Log to the system log daemon. When in doubt, set to yes.
        to_syslog: yes
        # Log debug messages (very verbose). When in doubt, leave off.
        debug: on
        # Log messages with time stamps. When in doubt, set to on
        # (unless you are only logging to syslog, where double
        # timestamps can be annoying).
        timestamp: on
        logger_subsys {
                subsys: QUORUM
                debug: on
        }
        logger_subsys {
                subsys: VOTEQ
                debug: on
        }
}

On both nodes:
modprobe dlm
cd /dev
ln -sf . misc

Start corosync on both nodes. I use "corosync -f" to see logging on stderr.

On node1: start dlm_controld -f0 -D

And now the spinning:

Allow dlm_controld to settle on node1.

Start dlm_controld -f0 -D on node2.

corosync on node2 will start spinning at 100% CPU.

Killing dlm_controld on node2 will NOT solve the problem.

Comment 1 Fabio Massimo Di Nitto 2012-02-03 13:38:48 UTC

Created attachment 559294
strace from node2

Comment 2 Fabio Massimo Di Nitto 2012-02-06 12:32:31 UTC

Tested with the new patches; so far I haven't been able to trigger the spinning.

Comment 3 Angus Salkeld 2012-02-06 22:18:44 UTC

Should be fixed in 0.9.0-2.

Comment 4 Fedora Update System 2012-02-06 22:32:10 UTC

libqb-0.9.0-2.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/libqb-0.9.0-2.fc16

Comment 5 Fedora Update System 2012-02-07 07:58:08 UTC

libqb-0.9.0-2.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.
