+++ This bug was initially created as a clone of Bug #800865 +++

Description of problem:

This is a regression from flatiron. I can't say for sure where the problem is located, but it appears that when there is a high IPC load, network packets are not processed until the load ends.

Version-Release number of selected component (if applicable):

master

How reproducible:

2 nodes are enough to reproduce this problem.

config:

quorum {
    provider: corosync_votequorum
    expected_votes: 2
    votes: 1
    two_node: 0
    wait_for_all: 0
    last_man_standing: 0
    auto_tie_breaker: 0
}

Steps to Reproduce:
1. start corosync -f on both nodes
2. check corosync-quorumtool -s

Membership information
----------------------
    Nodeid      Votes  Name
3238176960          1  fedora-master-node1.int.fabbione.net
3254954176          1  fedora-master-node2.int.fabbione.net

3. stop one node

Membership information
----------------------
    Nodeid      Votes  Name
3254954176          1  fedora-master-node2.int.fabbione.net

4. start cpgbench or cpgverify on the remaining node
5. start corosync -f on the other node

Actual results:

The node running test/cpg* will never see the other node until cpgverify stops.

Expected results:

The other node should be able to join at any given time.

Additional info:

--- Additional comment from fdinitto on 2012-03-07 09:47:19 EST ---

This appears to be a problem in libqb. With libqb 0.11.0 I can see that sometimes the nodes rejoin; other times they just can't see each other. I believe this could be poll- or timer-related, since the problem doesn't show in RHEL6 with the same builds.

--- Additional comment from fdinitto on 2012-03-08 00:08:42 EST ---

After N times testing with latest and greatest, I noticed that:

The nodes rejoin about 60% of the time after 5-10 seconds. The remaining 40% they don't rejoin at all until the IPC load is off.

It smells as if the timeout in polling the fd is not working properly. This is still rawhide/latest and greatest.

--- Additional comment from asalkeld on 2012-03-08 02:41:51 EST ---

Fabio, I have 2 nodes f17 (latest), one-CPU VMs, and can't reproduce on that.

I am getting a lot of:

Mar 08 14:18:51 warning [QB ] conn:0x1 Nothing in q but got POLLIN on fd:17 (res2:1)

Which is not good - I'll look into that.

1) Do you have your usual 2-CPU VMs?
2) Do you get the log above?

--- Additional comment from fdinitto on 2012-03-08 03:48:33 EST ---

(In reply to comment #3)
> Fabio I have 2 nodes f17 (latest) one cpu'ed VMs and can't reproduce on that.

rawhide != f17.

> I am getting a lot of:
> Mar 08 14:18:51 warning [QB ] conn:0x1 Nothing in q but got POLLIN on fd:17
> (res2:1
>
> Which is not good - I'll look into that.
>
> 1) Do you have your usual 2 cpu'ed vm's?

Yes, 2 CPUs per VM with 4GB of RAM each.

> 2) Do you get the log above?

Yes, several of those too.

--- Additional comment from asalkeld on 2012-03-09 06:26:51 EST ---

I don't think this is a libqb bug (or corosync). Try the following with 2 nodes (f1 and f2):

on f1 - in different shells:
corosync -f
tcpdump -v host f2
cpgbench

on f2, after the above has started:
corosync -f

Now, as soon as cpgbench exits (or you kill it), tcpdump will burst into life showing incoming packets.

It seems to me that the messages are not getting delivered during high CPU usage.

Note: I have found some unrelated libqb bugs from this exercise, so I'll update libqb soon with those fixes.

--- Additional comment from fdinitto on 2012-03-09 09:02:10 EST ---

What I can observe with tcpdump is slightly different, but it appears to be something happening within the kernel under load.
I start tcpdump on node1 and node2
start corosync -f on both nodes (so they join the mcast group)
stop corosync on one node
quickly start cpgbench on the remaining node
start corosync -f on the other node

I can see mcast packets entering the interface on the node that is running cpgbench, but those packets are not delivered to corosync. At this point I also suspect an issue with the kernel.

--- Additional comment from asalkeld on 2012-03-09 18:32:56 EST ---

corosync -r seems to cause corosync to get these messages. -r sets real-time priority. (Not saying that this is the correct solution, just FYI.)

--- Additional comment from asalkeld on 2012-03-09 18:54:42 EST ---

I configured libqb to use poll (and not epoll), and all works perfectly:

./configure ac_cv_func_epoll_create1=no ac_cv_func_epoll_create=no

Really snappy responses to joins/leaves. I'll investigate the epoll usage in libqb.

--- Additional comment from sdake on 2012-03-09 21:58:54 EST ---

Fabio, can you confirm comment #1, that libqb/corosync on F16 or RHEL6 don't exhibit this problem?

Since this issue has been identified with the epoll system call, this may be a kernel epoll problem in F17. I would like to try F17 on bare metal to see if it is virtualization-related, but beaker doesn't support F17. Anyone have some bare metal to try?
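To make the poll-versus-epoll comparison above concrete: poll(2) is always level-triggered, and epoll without EPOLLET is supposed to be level-triggered too, so the two waits below should report identical readiness for any fd. A minimal sketch (the helper names are illustrative, not libqb's API):

/* Hypothetical helpers, not libqb code: the same one-fd readiness
 * wait written with poll(2) and with level-triggered epoll(7). */
#include <poll.h>
#include <sys/epoll.h>
#include <unistd.h>

/* poll(2): always level-triggered */
int wait_poll(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    return poll(&pfd, 1, timeout_ms);  /* >0 while data is pending */
}

/* epoll(7) without EPOLLET: also level-triggered */
int wait_epoll(int fd, int timeout_ms)
{
    int epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    struct epoll_event out;
    int rc = -1;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0)
        rc = epoll_wait(epfd, &out, 1, timeout_ms);
    close(epfd);
    return rc;
}

Assuming libqb drives epoll in this level-triggered fashion, "poll works, epoll doesn't" on the same descriptor would point below the event loop logic, at the kernel's epoll implementation.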
Ryan,

If you have some interest in getting involved in libqb and more systems programming, getting to the root cause of this bug (is the implementation in libqb broken, or is the kernel.org kernel broken, and if so, in what way?) would be of great help.

Regards
-steve
Hi Jason,

I'm adding you to this case, as it seems you have more insight into epoll and have some recent upstream commits/fixes that might solve this epoll starvation issue?

--Jesper Brouer (new RH network kernel guy)
Perhaps this upstream commit fixes the issue?

UPSTREAM commit 28d82dc1c4edbc352129f97f4ca22624d1fe61de
Author: Jason Baron <jbaron>
Date:   Thu Jan 12 17:17:43 2012 -0800

    epoll: limit paths

    The current epoll code can be tickled to run basically indefinitely in
    both loop detection path check (on ep_insert()), and in the wakeup
    paths. The programs that tickle this behavior set up deeply linked
    networks of epoll file descriptors that cause the epoll algorithms to
    traverse them indefinitely. A couple of these sample programs have
    been previously posted in this thread:
    https://lkml.org/lkml/2011/2/25/297.

    [...cut...]

git describe --contains 28d82dc1c4edbc352129f97f4ca22624d1fe61de
v3.3-rc1~81^2~80
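For context on what "limit paths" bounds: epoll instances can watch other epoll instances, and the pathological programs build deep graphs of such links. A minimal, purely illustrative example of one level of nesting:

/* Illustrative only: one epoll instance registered inside another,
 * the kind of linkage the "limit paths" commit bounds. */
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

int main(void)
{
    int pipefd[2];
    if (pipe(pipefd) < 0)
        return 1;

    int inner = epoll_create1(0);
    int outer = epoll_create1(0);

    /* inner watches the pipe's read end */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipefd[0] };
    epoll_ctl(inner, EPOLL_CTL_ADD, pipefd[0], &ev);

    /* outer watches the inner epoll fd: it becomes readable whenever
     * inner has pending events; long chains of such links are what
     * the commit limits */
    ev.data.fd = inner;
    epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev);

    write(pipefd[1], "x", 1);

    struct epoll_event out;
    int n = epoll_wait(outer, &out, 1, 1000);
    printf("outer epoll reported %d event(s)\n", n); /* expect 1 */
    return 0;
}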
Or perhaps it's a regression introduced in upstream commit 0884d7aa24 (v3.2-rc5~56^2~23), which got fixed by Eric Dumazet in upstream commit 6f01fd6e6f680 (v3.3-rc2~6^2):

commit 6f01fd6e6f6809061b56e78f1e8d143099716d70
Author: Eric Dumazet <eric.dumazet>
Date:   Sat Jan 28 16:11:03 2012 +0000

    af_unix: fix EPOLLET regression for stream sockets

    Commit 0884d7aa24 (AF_UNIX: Fix poll blocking problem when reading from
    a stream socket) added a regression for epoll() in Edge Triggered mode
    (EPOLLET)

    Appropriate fix is to use skb_peek()/skb_unlink() instead of
    skb_dequeue(), and only call skb_unlink() when skb is fully consumed.

    This remove the need to requeue a partial skb into sk_receive_queue
    head and the extra sk->sk_data_ready() calls that added the regression.

    This is safe because once skb is given to sk_receive_queue, it is not
    modified by a writer, and readers are serialized by u->readlock mutex.

    This also reduce number of spinlock acquisition for small reads or
    MSG_PEEK users so should improve overall performance.

    Reported-by: Nick Mathewson <nickm>
    Signed-off-by: Eric Dumazet <eric.dumazet>
    Cc: Alexey Moiseytsev <himeraster>
    Signed-off-by: David S. Miller <davem>
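For context, this is the edge-triggered contract that the regressed callers rely on: one wakeup per state change, none for a re-poll after a partial read. A hypothetical userspace demo on an AF_UNIX stream socketpair (illustrating the semantics, not the kernel-side fix):

/* Hypothetical demo of EPOLLET delivery on an AF_UNIX stream socket. */
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return 1;

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = sv[0] };
    epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev);

    write(sv[1], "hello", 5);

    struct epoll_event out;
    int n = epoll_wait(epfd, &out, 1, 1000);
    printf("first wait:  %d event(s)\n", n);  /* 1: new data arrived */

    char buf[2];
    read(sv[0], buf, sizeof(buf));            /* partial read, 3 bytes remain */

    n = epoll_wait(epfd, &out, 1, 1000);
    printf("second wait: %d event(s)\n", n);  /* 0: no new edge, data still queued */
    return 0;
}

Extra sk_data_ready() calls from requeuing a partial skb, as described in the commit message, break exactly this kind of consumer.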
I tested with kernel 3.3.0-rc7+ (HEAD: fde7d9049e55ab85a390be7f415d74c9f62dd0f9), which appears to include the above fixes, and I can still reproduce the problem.

libqb is libqb-0.11.1-1.fc18. The version included in Fedora is built without epoll support, and it works. Rebuilding the same source with epoll triggers the issue again.

corosync is at git head: c00502a70a8a0b4907f9daf370e3ff1e54c7100a

Contrary to what Angus saw in his tests, starting corosync with real-time priority does not help.
Further investigation:

I tested kernel 3.2.9-x from F16 on rawhide and I can still reproduce the problem.

Running the same test in pure F16 does not trigger the issue.

At this point I suspect there are either multiple issues in the whole epoll chain, or something is going wrong with the toolchain. This kind of debugging is far too advanced for me.

Angus, do we have a simple test case to reproduce the problem that does not involve running multiple nodes and corosync?
(In reply to comment #6)
> Further investigation:
>
> I tested kernel 3.2.9-x from F16 on rawhide and I can still reproduce the
> problem.
>
> Running the same test in pure F16 does not trigger the issue.
>
> At this point i suspect there are either multiple issues in the whole epoll
> chain or something is going wrong with the toolchain. This kind of debugging
> is far too advanced for me.
>
> Angus do we have a simple test case to reproduce the problem that does not
> involve running multiple nodes and corosync?

Nope, sorry.
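One possible starting point for such a standalone test case, sketched here under stated assumptions (multicast group 239.192.0.1 and port 5405 are placeholders, and this is not an existing test in any tree): burn CPU in one thread while epoll-waiting on a multicast UDP socket, and compare against tcpdump on the same host.

/* Rough standalone probe, not the project's test case: one thread
 * burns CPU while the main thread epoll-waits on a multicast UDP
 * socket. Send datagrams to the placeholder group/port from another
 * host; if tcpdump shows packets arriving but this loop prints
 * "no event", the symptom reproduces without corosync.
 * Build with: cc probe.c -o probe -lpthread */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

static void *burn(void *arg)
{
    (void)arg;
    for (;;)                      /* stand-in for the cpgbench load */
        ;
    return NULL;
}

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in sa;
    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(5405);                            /* placeholder port */
    sa.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(fd, (struct sockaddr *)&sa, sizeof(sa));

    struct ip_mreq mreq;
    memset(&mreq, 0, sizeof(mreq));
    mreq.imr_multiaddr.s_addr = inet_addr("239.192.0.1"); /* placeholder group */
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

    pthread_t t;
    pthread_create(&t, NULL, burn, NULL);

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);

    char buf[1500];
    for (;;) {
        struct epoll_event out;
        int n = epoll_wait(epfd, &out, 1, 1000);
        if (n > 0)
            printf("got %zd bytes\n", recv(fd, buf, sizeof(buf), 0));
        else
            printf("no event within 1s\n");               /* a stall here is the symptom */
    }
}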
Re comment 6: Fabio, did you use the respective fresh versions of the modules as well? I mean, e.g., virtio_net, if you used a VM for your test.
(In reply to comment #8)
> Re comment 6:
> Fabio, did you used respective fresh version of modules as well?
> I mean. e.g. virtio_net, if you used VM for your test.

The host is running F16 and the VM Fedora rawhide, but the problem is not on the network.
I've tried to cover one of many possibilities, to reduce the search space, by trying rawhide kernel 3.3.0-0.rc7.git0.5 [1] built on my F16 box (details below) back on rawhide VMs, and I cannot reproduce either of the two testcases from 2012-03-09 above (asalkeld, fdinitto).

You can test it on your own; please find such a built kernel at a temporary location [2] (I've rebooted back to the old anaconda kernel, uninstalled the current kernel via yum, installed all these packages via rpm, then rebooted on top of it [at least as per uname]).

If this kernel (meaning complete, incl. modules, see comment 8; I have no idea if, e.g., drivers can affect epoll) is proved to work well, then the rawhide toolchain may suffer from a defect.

Details: toolchain used to build [2] on the F16 box

$ rpm -q gcc glibc binutils
gcc-4.6.3-2.fc16.x86_64
glibc-2.14.90-24.fc16.6.x86_64
binutils-2.21.53.0.1-6.fc16.x86_64

Details: rawhide packages used for testing

$ rpm -q corosync libqb corosynclib corosynclib corosynclib-devel \
  corosynclib-devel libqb-devel
corosync-1.99.8-1.fc18.x86_64
libqb-0.11.1-1.fc18.x86_64
corosynclib-1.99.8-1.fc18.x86_64
corosynclib-1.99.8-1.fc18.x86_64
corosynclib-devel-1.99.8-1.fc18.x86_64
corosynclib-devel-1.99.8-1.fc18.x86_64
libqb-devel-0.11.1-1.fc18.x86_64

Details: testing rawhide VMs
- uname -r -v: 3.0-0.rc7.git0.5.fc16.x86_64 #1 SMP Fri Mar 16 23:29:59 CET 2012
- single CPU, 1024 MB RAM, virtio NIC
- corosync: corosync.conf.example with adjusted bindnetaddr (to match the network), mcastaddr and mcastport

[1] http://kojipkgs.fedoraproject.org/packages/kernel/3.3.0/0.rc7.git0.5.fc18/src/kernel-3.3.0-0.rc7.git0.5.fc18.src.rpm
[2] http://jpokorny.fedorapeople.org/temp/kernel_F16_toolchain/
Created attachment 571214 [details]
corosync_glibc_backtrace_VM_quick_pause_resume.txt

Aside: when cleaning up the environment on the VMs, I paused one of them briefly, realized I wanted to continue, and resumed it a short time later; in the terminal for the respective node running "corosync -f", I got "double free or corruption" (see attached backtrace). Clearly, pausing and resuming a VM in quick succession is probably quite careless, but one might expect better robustness from whatever side caused this.
Re comment 11: these are /var/log/messages records from around that time (haven't checked the exact time of the incident, sorry):

[TOTEM ] A processor failed, forming new configuration.
[QB ] poll_mod : can't modify entry already deleted
[TOTEM ] A processor joined or left the membership and a new membership was formed.
[MAIN ] Completed service synchronization, ready to provide service.
[QB ] poll_mod : can't modify entry already deleted

(after 19 seconds)

[SERV ] Unloading all Corosync service engines.
[QB ] withdrawing server sockets
[SERV ] Service engine unloaded: corosync configuration map access
[QB ] Could not accept client connection from fd:11: Bad file descriptor (9)
[QB ] withdrawing server sockets
[SERV ] Service engine unloaded: corosync configuration service
[QB ] Could not accept client connection from fd:12: Bad file descriptor (9)
I've tried the same as in comment 10 with the original 3.3.0-0.rc7.git0.5.fc18.x86_64 kernel Fedora build [3], this time with both one and two CPUs per VM, and still was not able to reproduce the issue. The newly initiated node always joined successfully and immediately.

The only thing I noticed was that while running cpgbench, 100x more UDP datagrams were exchanged per second as per tcpdump (I have no idea if this is expected or whether these were forced retransmits due to timeouts).

Probably cannot help more here. Maybe the "double free" as per comment 11 would deserve a check to see if it is by any chance apparent in the code (qb_ipcs_disconnect).

[3] http://koji.fedoraproject.org/koji/buildinfo?buildID=306760
This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle. Changing version to '19'.

(As we did not run this process for some time, it could affect also pre-Fedora 19 development cycle bugs. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.)

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19
This message is a notice that Fedora 19 is now at end of life. Fedora has stopped maintaining and issuing updates for Fedora 19. It is Fedora's policy to close all bug reports from releases that are no longer maintained. Approximately 4 (four) weeks from now this bug will be closed as EOL if it remains open with a Fedora 'version' of '19'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 19 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version prior to this bug being closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 19 changed to end-of-life (EOL) status on 2015-01-06. Fedora 19 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug.

Thank you for reporting this bug and we are sorry it could not be fixed.