| Summary: | epoll appears to be unfair in rawhide | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Steven Dake <sdake> |
| Component: | libqb | Assignee: | Ryan McCabe <rmccabe> |
| Status: | CLOSED EOL | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | urgent | Priority: | urgent |
| Version: | 19 | CC: | fdinitto, jbrouer, jfriesse, jpokorny, knoel, lhh |
| Hardware: | Unspecified | OS: | Unspecified |
| Doc Type: | Bug Fix | Clone Of: | 800865 |
| Bug Depends On: | 800865 | Last Closed: | 2015-02-17 14:10:00 UTC |
Description
Steven Dake
2012-03-10 03:27:22 UTC
Ryan, if you have some interest in getting involved in libqb and more systems programming, getting to the root cause of this bug (is the implementation in libqb broken, or is kernel.org broken, and if so, in what way?) would be of great help. Regards -steve

Hi Jason, I'm adding you to this case, as it seems you have more insight into epoll, and have some recent upstream commits/fixes that might solve this epoll starvation issue? --Jesper Brouer (new RH network kernel guy)

Perhaps this upstream commit fixes the issue?
UPSTREAM commit 28d82dc1c4edbc352129f97f4ca22624d1fe61de
Author: Jason Baron <jbaron>
Date: Thu Jan 12 17:17:43 2012 -0800
epoll: limit paths
The current epoll code can be tickled to run basically indefinitely in
both loop detection path check (on ep_insert()), and in the wakeup paths.
The programs that tickle this behavior set up deeply linked networks of
epoll file descriptors that cause the epoll algorithms to traverse them
indefinitely. A couple of these sample programs have been previously
posted in this thread: https://lkml.org/lkml/2011/2/25/297.
[...cut...]
git describe --contains 28d82dc1c4edbc352129f97f4ca22624d1fe61de
v3.3-rc1~81^2~80
Or perhaps it's a regression introduced in upstream commit 0884d7aa24 (v3.2-rc5~56^2~23), which was fixed by Eric Dumazet in upstream commit 6f01fd6e6f680 (v3.3-rc2~6^2).
commit 6f01fd6e6f6809061b56e78f1e8d143099716d70
Author: Eric Dumazet <eric.dumazet>
Date: Sat Jan 28 16:11:03 2012 +0000
af_unix: fix EPOLLET regression for stream sockets
Commit 0884d7aa24 (AF_UNIX: Fix poll blocking problem when reading from
a stream socket) added a regression for epoll() in Edge Triggered mode
(EPOLLET)
Appropriate fix is to use skb_peek()/skb_unlink() instead of
skb_dequeue(), and only call skb_unlink() when skb is fully consumed.
This removes the need to requeue a partial skb into sk_receive_queue head
and the extra sk->sk_data_ready() calls that added the regression.
This is safe because once skb is given to sk_receive_queue, it is not
modified by a writer, and readers are serialized by u->readlock mutex.
This also reduces the number of spinlock acquisitions for small reads or
MSG_PEEK users, so it should improve overall performance.
Reported-by: Nick Mathewson <nickm>
Signed-off-by: Eric Dumazet <eric.dumazet>
Cc: Alexey Moiseytsev <himeraster>
Signed-off-by: David S. Miller <davem>
I tested with kernel 3.3.0-rc7+ (HEAD: fde7d9049e55ab85a390be7f415d74c9f62dd0f9), which appears to include the above fixes, and I can still reproduce the problem. libqb is libqb-0.11.1-1.fc18. The version included in Fedora builds without epoll support and it works. Rebuilding the same source with epoll triggers the issue again. corosync is at git head: c00502a70a8a0b4907f9daf370e3ff1e54c7100a. Contrary to what Angus saw in his tests, starting corosync with real-time priority does not help.

Further investigation:

I tested kernel 3.2.9-x from F16 on rawhide and I can still reproduce the problem. Running the same test in pure F16 does not trigger the issue. At this point I suspect there are either multiple issues in the whole epoll chain or something is going wrong with the toolchain. This kind of debugging is far too advanced for me.

Angus, do we have a simple test case to reproduce the problem that does not involve running multiple nodes and corosync?

(In reply to comment #6)
> Angus do we have a simple test case to reproduce the problem that does not
> involve running multiple nodes and corosync?

Nope, sorry.

Re comment 6: Fabio, did you use the respective fresh versions of the modules as well? I mean, e.g., virtio_net, if you used a VM for your test.

(In reply to comment #8)
> Re comment 6:
> Fabio, did you used respective fresh version of modules as well?
> I mean. e.g. virtio_net, if you used VM for your test.

The host is running F16 and the VM Fedora rawhide, but the problem is not on the network.
I've tried to cover one of many possibilities to reduce the search space by trying rawhide kernel 3.3.0-0.rc7.git0.5 [1] built on my F16 box (details below) back on rawhide VMs, and I cannot reproduce either of the two testcases from 2012-03-09 above (asalkeld, fdinitto). You can test it on your own; please find the built kernel at a temporary location [2] (I rebooted back to the old anaconda kernel, uninstalled the current kernel via yum, installed all these packages via rpm, then rebooted on top of it [at least as per uname]). If this kernel (meaning the complete one, incl. modules, see comment 8; I have no idea whether, e.g., drivers can affect epoll) proves to work well, then the rawhide toolchain may suffer from a defect.

Details: toolchain used to build [2] on the F16 box

    $ rpm -q gcc glibc binutils
    gcc-4.6.3-2.fc16.x86_64
    glibc-2.14.90-24.fc16.6.x86_64
    binutils-2.21.53.0.1-6.fc16.x86_64

Details: rawhide packages used for testing

    $ rpm -q corosync libqb corosynclib corosynclib corosynclib-devel \
        corosynclib-devel libqb-devel
    corosync-1.99.8-1.fc18.x86_64
    libqb-0.11.1-1.fc18.x86_64
    corosynclib-1.99.8-1.fc18.x86_64
    corosynclib-1.99.8-1.fc18.x86_64
    corosynclib-devel-1.99.8-1.fc18.x86_64
    corosynclib-devel-1.99.8-1.fc18.x86_64
    libqb-devel-0.11.1-1.fc18.x86_64

Details: testing rawhide VMs

- uname -r -v: 3.3.0-0.rc7.git0.5.fc16.x86_64 #1 SMP Fri Mar 16 23:29:59 CET 2012
- single CPU, 1024 MB RAM, virtio NIC
- corosync: corosync.conf.example with adjusted bindnetaddr (to match the network), mcastaddr and mcastport

[1] http://kojipkgs.fedoraproject.org/packages/kernel/3.3.0/0.rc7.git0.5.fc18/src/kernel-3.3.0-0.rc7.git0.5.fc18.src.rpm
[2] http://jpokorny.fedorapeople.org/temp/kernel_F16_toolchain/

Created attachment 571214 [details]
corosync_glibc_backtrace_VM_quick_pause_resume.txt
Aside: when cleaning up the environment on the VMs, I paused one of them
briefly, realized I wanted to continue, and resumed it shortly after;
in the terminal of the respective node running "corosync -f",
I got "double free or corruption" (see attached backtrace).
Clearly, pausing and resuming a VM in quick succession is probably quite
careless, but one might expect better robustness from whichever side
caused this.
Re comment 11: these are /var/log/messages records from around that time (haven't checked the exact time of the incident, sorry):

    [TOTEM ] A processor failed, forming new configuration.
    [QB    ] poll_mod : can't modify entry already deleted
    [TOTEM ] A processor joined or left the membership and a new membership was formed.
    [MAIN  ] Completed service synchronization, ready to provide service.
    [QB    ] poll_mod : can't modify entry already deleted
    (after 19 seconds)
    [SERV  ] Unloading all Corosync service engines.
    [QB    ] withdrawing server sockets
    [SERV  ] Service engine unloaded: corosync configuration map access
    [QB    ] Could not accept client connection from fd:11: Bad file descriptor (9)
    [QB    ] withdrawing server sockets
    [SERV  ] Service engine unloaded: corosync configuration service
    [QB    ] Could not accept client connection from fd:12: Bad file descriptor (9)

I've tried the same as in comment 10 with the original 3.3.0-0.rc7.git0.5.fc18.x86_64 kernel Fedora build [3], this time with both one and two CPUs per VM, and still was not able to reproduce the issue. The newly initiated node always joined successfully and immediately. The only thing I noticed was that while running cpgbench, 100x more UDP datagrams were interchanged per second as per tcpdump (I have no idea whether this is expected or whether these were forced retransmits due to timeouts). Probably cannot help more here. Maybe the "double free" as per comment 11 would deserve a check on whether it is by any chance apparent in the code (qb_ipcs_disconnect).

[3] http://koji.fedoraproject.org/koji/buildinfo?buildID=306760

This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle. Changing version to '19'. (As we did not run this process for some time, it could also affect pre-Fedora 19 development cycle bugs. We are very sorry. It will help us with cleanup during the Fedora 19 End Of Life. Thank you.)
More information and the reason for this action is here: https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19

This message is a notice that Fedora 19 is now at end of life. Fedora has stopped maintaining and issuing updates for Fedora 19. It is Fedora's policy to close all bug reports from releases that are no longer maintained. Approximately four weeks from now this bug will be closed as EOL if it remains open with a Fedora 'version' of '19'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 19 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

Fedora 19 changed to end-of-life (EOL) status on 2015-01-06. Fedora 19 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result, we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora, please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug.

Thank you for reporting this bug and we are sorry it could not be fixed.