Bug 787196

Summary: libqb spin in qb_loop with timerfd
Product: [Fedora] Fedora Reporter: Fabio Massimo Di Nitto <fdinitto>
Component: libqb Assignee: Angus Salkeld <asalkeld>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent Docs Contact:
Priority: urgent    
Version: rawhide CC: asalkeld, sdake
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: libqb-0.9.0-2.fc16 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-02-07 07:58:08 UTC Type: ---
Attachments:
Description Flags
strace from node2 none

Description Fabio Massimo Di Nitto 2012-02-03 13:37:15 UTC
As we discussed on IRC, this is always reproducible in rawhide.

corosync spins at 100% CPU under certain conditions when timerfd is used.

After I rebuilt libqb without timerfd, the spinning seen at startup with logging.debug enabled is gone as well.
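This report does not say what the root cause inside qb_loop is. As a generic illustration only, here is a minimal Python sketch of one well-known way a poll loop with a timerfd can end up busy-looping: once the timer has expired, the descriptor stays readable until its 8-byte expiration counter is read, so a loop that never drains it wakes up on every iteration. The helper names (make_timerfd, wakeups_after_expiry) are hypothetical, the ctypes struct layout assumes 64-bit Linux with glibc, and none of this is taken from libqb.

```python
import ctypes
import os
import select

CLOCK_MONOTONIC = 1  # value from <time.h> on Linux


class Timespec(ctypes.Structure):
    # Assumes 64-bit Linux, where time_t and long are both 8 bytes.
    _fields_ = [("tv_sec", ctypes.c_long), ("tv_nsec", ctypes.c_long)]


class Itimerspec(ctypes.Structure):
    _fields_ = [("it_interval", Timespec), ("it_value", Timespec)]


libc = ctypes.CDLL(None, use_errno=True)


def make_timerfd(delay_ns):
    """Create a one-shot timerfd that expires after delay_ns nanoseconds."""
    fd = libc.timerfd_create(CLOCK_MONOTONIC, 0)
    if fd < 0:
        raise OSError(ctypes.get_errno(), "timerfd_create failed")
    spec = Itimerspec(Timespec(0, 0), Timespec(0, delay_ns))
    if libc.timerfd_settime(fd, 0, ctypes.byref(spec), None) < 0:
        err = ctypes.get_errno()
        os.close(fd)
        raise OSError(err, "timerfd_settime failed")
    return fd


def wakeups_after_expiry(drain, rounds=1000):
    """Count how many of `rounds` non-blocking polls report the fd readable."""
    fd = make_timerfd(1_000_000)  # one-shot, 1 ms
    try:
        poller = select.poll()
        poller.register(fd, select.POLLIN)
        poller.poll(1000)  # block until the timer has expired
        wakeups = 0
        for _ in range(rounds):
            if poller.poll(0):
                wakeups += 1
                if drain:
                    os.read(fd, 8)  # reading the counter clears readiness
        return wakeups
    finally:
        os.close(fd)


if __name__ == "__main__":
    print("undrained:", wakeups_after_expiry(drain=False))
    print("drained:  ", wakeups_after_expiry(drain=True))
```

With a one-shot timer, the undrained loop should report a wakeup on each of the 1000 polls, while the drained loop should report a single one. Whether the libqb bug had this shape is an assumption, not something stated here.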

This is a consistent reproducer, using dlm_controld.

Install libqb master and corosync master rpms, and the dlm rpm from Fedora rawhide.

corosync.conf relevant bits:

compatibility: whitetank

quorum {
    provider: corosync_votequorum
    two_node: 1
    wait_for_all: 0
    last_man_standing: 0
    auto_tie_breaker: 0
}

nodelist {
        node {
                ring0_addr: 192.168.2.193
                nodeid: 1
        }
        node {
                ring0_addr: 192.168.2.194
                nodeid: 2
        }
}

logging {
        # Log the source file and line where messages are being
        # generated. When in doubt, leave off. Potentially useful for
        # debugging.
        fileline: off
        # Log to standard error. When in doubt, set to no. Useful when
        # running in the foreground (when invoking "corosync -f")
        to_stderr: yes
        # Log to a log file. When set to "no", the "logfile" option
        # must not be set.
        to_logfile: yes
        logfile: /var/log/cluster/corosync.log
        # Log to the system log daemon. When in doubt, set to yes.
        to_syslog: yes
        # Log debug messages (very verbose). When in doubt, leave off.
        debug: on
        # Log messages with time stamps. When in doubt, set to on
        # (unless you are only logging to syslog, where double
        # timestamps can be annoying).
        timestamp: on
        logger_subsys {
                subsys: QUORUM
                debug: on
        }
        logger_subsys {
                subsys: VOTEQ
                debug: on
        }
}

On both nodes:
modprobe dlm
cd /dev
ln -sf . misc

Start corosync on both nodes. I use "corosync -f" to see logging on stderr.

On node1, start dlm_controld -f0 -D

And now the spinning:

Allow dlm_controld to settle on node1.

Start dlm_controld -f0 -D on node2.

corosync on node2 will start spinning 100% CPU.

Killing dlm_controld on node2 will NOT solve the problem.
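To confirm the spin without watching top, the CPU usage of the corosync process can be sampled from /proc. The following is a minimal sketch, assuming a Linux /proc filesystem; the function names (cpu_ticks, cpu_percent) are hypothetical helpers and not part of corosync or libqb.

```python
import os
import sys
import time


def cpu_ticks(pid):
    """Return utime + stime (in clock ticks) for pid, from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The comm field is wrapped in parentheses and may itself contain
    # spaces, so split on the last ')' before tokenizing the rest.
    fields = data.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state); utime is field 14, stime is field 15.
    return int(fields[11]) + int(fields[12])


def cpu_percent(pid, interval=1.0):
    """Approximate CPU usage of pid, in percent of one core, over interval."""
    hz = os.sysconf("SC_CLK_TCK")
    start_ticks = cpu_ticks(pid)
    start_time = time.monotonic()
    time.sleep(interval)
    elapsed = time.monotonic() - start_time
    used_seconds = (cpu_ticks(pid) - start_ticks) / hz
    return 100.0 * used_seconds / elapsed


if __name__ == "__main__":
    # Usage: pass the PID of corosync, for example the output of `pidof corosync`.
    print(f"{cpu_percent(int(sys.argv[1])):.1f}%")
```

A value that stays near 100% on node2 after dlm_controld has been killed would match the behaviour described above.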

Comment 1 Fabio Massimo Di Nitto 2012-02-03 13:38:48 UTC
Created attachment 559294
strace from node2


Comment 2 Fabio Massimo Di Nitto 2012-02-06 12:32:31 UTC
Tested with the new patches, so far I haven't been able to trigger the spinning.

Comment 3 Angus Salkeld 2012-02-06 22:18:44 UTC
Should be fixed in 0.9.0-2

Comment 4 Fedora Update System 2012-02-06 22:32:10 UTC
libqb-0.9.0-2.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/libqb-0.9.0-2.fc16

Comment 5 Fedora Update System 2012-02-07 07:58:08 UTC
libqb-0.9.0-2.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.