Bug 800865 - [master/regression] Unable to process membership/network packets on high IPC load
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: libqb
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Angus Salkeld
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: 801987
 
Reported: 2012-03-07 12:20 UTC by Fabio Massimo Di Nitto
Modified: 2012-04-12 02:40 UTC
CC List: 4 users

Fixed In Version: libqb-0.11.1-1.fc17
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 801987
Environment:
Last Closed: 2012-04-12 02:40:46 UTC
Type: ---
Embargoed:


Description Fabio Massimo Di Nitto 2012-03-07 12:20:20 UTC
Description of problem:

This is a regression from flatiron.

I can't say for sure where the problem is located, but it appears that under high IPC load, network packets are not processed until the load ends.

Version-Release number of selected component (if applicable):

master

How reproducible:

2 nodes are enough to reproduce this problem.

config:

quorum {
    provider: corosync_votequorum
    expected_votes: 2
    votes: 1
    two_node: 0
    wait_for_all: 0
    last_man_standing: 0
    auto_tie_breaker: 0
}

Steps to Reproduce:
1. start corosync -f on both nodes
2. check corosync-quorumtool -s

Membership information
----------------------
    Nodeid      Votes Name
3238176960          1 fedora-master-node1.int.fabbione.net
3254954176          1 fedora-master-node2.int.fabbione.net

3. stop one node

Membership information
----------------------
    Nodeid      Votes Name
3254954176          1 fedora-master-node2.int.fabbione.net

4. start cpgbench or cpgverify on the remaining node

5. start corosync -f on the other node
  
Actual results:

The node running test/cpg* will never see the other node until cpgverify stops.

Expected results:

The other node should be able to join at any given time.

Additional info:

Comment 1 Fabio Massimo Di Nitto 2012-03-07 14:47:19 UTC
This appears to be a problem in libqb.

With libqb 0.11.0 I can see that sometimes the nodes rejoin; other times they just can't see each other.

I believe this could be poll- or timer-related, since the problem doesn't show up on RHEL6 with the same builds.

Comment 2 Fabio Massimo Di Nitto 2012-03-08 05:08:42 UTC
After N rounds of testing with the latest and greatest, I noticed that:

The nodes rejoin about 60% of the time, after 5-10 seconds.

The remaining 40% of the time they don't rejoin at all until the IPC load stops.

It smells as if the timeout when polling the fd is not working properly.

This is still rawhide/latest and greatest.

Comment 3 Angus Salkeld 2012-03-08 07:41:51 UTC
Fabio, I have two f17 (latest) single-CPU VMs and can't reproduce on them.

I am getting a lot of:
Mar 08 14:18:51 warning [QB    ] conn:0x1 Nothing in q but got POLLIN on fd:17 (res2:1)

Which is not good - I'll look into that.

1) Do you have your usual 2-CPU VMs?
2) Do you get the log above?

Comment 4 Fabio Massimo Di Nitto 2012-03-08 08:48:33 UTC
(In reply to comment #3)
> Fabio, I have two f17 (latest) single-CPU VMs and can't reproduce on them.

rawhide != f17.

> 
> I am getting a lot of:
> Mar 08 14:18:51 warning [QB    ] conn:0x1 Nothing in q but got POLLIN on fd:17 (res2:1)

> 
> Which is not good - I'll look into that.
> 
> 1) Do you have your usual 2-CPU VMs?

Yes, 2 CPUs per VM with 4 GB of RAM each.

> 2) Do you get the log above?

Yes, several of those too.

Comment 5 Angus Salkeld 2012-03-09 11:26:51 UTC
I don't think this is a libqb bug (or corosync):

Try the following:
2 nodes (f1 and f2)

on f1 - in different shells:
corosync -f
tcpdump -v host f2
cpgbench

on f2 after the above has started:
corosync -f

Now, as soon as cpgbench exits (or you kill it), tcpdump will burst into
life showing incoming packets.

It seems to me that the messages are not getting delivered during
high CPU usage.

Note: I found some unrelated libqb bugs during this exercise,
so I'll update libqb soon with those fixes.

Comment 6 Fabio Massimo Di Nitto 2012-03-09 14:02:10 UTC
What I can observe with tcpdump is slightly different, but it appears to be something happening within the kernel under load.

I start tcpdump on node1 and node2.

Start corosync -f on both nodes (so they join the mcast group).

Stop corosync on one node.

Quickly start cpgbench on the remaining node.

Start corosync -f on the other node.

I can see mcast packets entering the interface on the node that is running cpgbench, but those packets are not delivered to corosync.

At this point I also suspect an issue with the kernel.
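
To narrow the capture, a filter like the one below can confirm that the multicast traffic really is arriving on the interface. The interface name eth0 and port 5405 (corosync's default mcastport) are assumptions here; adjust them to your setup:

tcpdump -n -i eth0 'ip multicast and udp port 5405'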

Comment 7 Angus Salkeld 2012-03-09 23:32:56 UTC
Running corosync with -r seems to make corosync receive these messages.

-r sets real-time priority.

(not saying that this is the correct solution, just FYI).
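
For quick experiments, a roughly equivalent external way to get the same effect is to launch corosync under a real-time scheduling policy with chrt. The priority value 50 below is an arbitrary illustrative choice; corosync -r picks its own priority internally:

chrt --rr 50 corosync -f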

Comment 8 Angus Salkeld 2012-03-09 23:54:42 UTC
I configured libqb to use poll (and not epoll) and everything works perfectly.

./configure ac_cv_func_epoll_create1=no ac_cv_func_epoll_create=no

Really snappy responses to joins/leaves.

I'll investigate the epoll usage in libqb.
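
To verify which syscall a running build actually ends up using, strace can be attached to the daemon (this assumes pidof returns a single corosync PID):

strace -f -e trace=epoll_create,epoll_create1,epoll_wait,poll -p $(pidof corosync)

With the cache variables above, configure treats epoll_create/epoll_create1 as unavailable, so libqb falls back to its poll implementation.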

Comment 9 Steven Dake 2012-03-10 02:58:54 UTC
Fabio, can you confirm Comment #1, that libqb/corosync on f16 or RHEL6 don't exhibit this problem?  Since this issue has been identified with the epoll system call, this may be a kernel epoll problem in f17.  I would like to try f17 on bare metal to see if it is virtualization-related, but beaker doesn't support f17.  Anyone have some bare metal to try?

Comment 10 Fabio Massimo Di Nitto 2012-03-10 03:33:00 UTC
(In reply to comment #9)
> Fabio, can you confirm Comment #1, that libqb/corosync on f16 or RHEL6 don't
> exhibit this problem?  Since this issue has been identified with the epoll
> system call, this may be a kernel epoll problem in f17.  I would like to try
> f17 on bare metal to see if it is virtualization-related, but beaker doesn't
> support f17.  Anyone have some bare metal to try?

Yes I tested those two and they work.

Comment 11 Fabio Massimo Di Nitto 2012-03-10 03:34:19 UTC
I don't have F17 bare metal to test.

Comment 12 Steven Dake 2012-03-10 03:37:01 UTC
Angus,

Can you use this bz to make libqb just use poll in rawhide for the moment, until Ryan root-causes Bug #801987?

If epoll is broken in the rawhide kernel, others will probably point it out as well :)

Comment 13 Fabio Massimo Di Nitto 2012-03-10 03:37:41 UTC
(In reply to comment #7)
> Running corosync with -r seems to make corosync receive these messages.
> 
> -r sets real-time priority.
> 
> (not saying that this is the correct solution, just FYI).

I think we all agree that it is a kernel scheduling issue at this point.

I have already spoken to Linda, and I will have a kernel networking guy passing by my office next week to look at it. If nothing else, we should be able to provide proper info for a kernel bugzilla.

Comment 14 Fedora Update System 2012-03-11 12:05:39 UTC
libqb-0.11.1-1.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/libqb-0.11.1-1.fc17

Comment 15 Fedora Update System 2012-03-11 18:25:59 UTC
Package libqb-0.11.1-1.fc17:
* should fix your issue,
* was pushed to the Fedora 17 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing libqb-0.11.1-1.fc17'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-3562/libqb-0.11.1-1.fc17
then log in and leave karma (feedback).

Comment 16 Jesper Brouer 2012-03-12 08:41:19 UTC
(In reply to comment #13)

> I think we all agree that it is a kernel scheduling issue at this point.
> 
> I have already spoken to Linda, and I will have a kernel networking guy passing
> by my office next week to look at it. If nothing else, we should be able to
> provide proper info for a kernel bugzilla.

Looking into the case, but I don't have the machines/setup to reproduce the bug.
So, I'll start by reviewing the recent epoll kernel code changes.

What kernel version is "rawhide" using?
(I would appreciate a link to the git tree.)

Comment 17 Fabio Massimo Di Nitto 2012-03-12 08:46:32 UTC
(In reply to comment #16)
> (In reply to comment #13)
> 
> > I think we all agree that it is a kernel scheduling issue at this point.
> > 
> > I have already spoken to Linda, and I will have a kernel networking guy passing
> > by my office next week to look at it. If nothing else, we should be able to
> > provide proper info for a kernel bugzilla.
> 
> Looking into the case, but I don't have the machines/setup to reproduce the bug.

That's why it's easier to come here sometime this week ;)

> So, I'll start by reviewing the recent epoll kernel code changes.
> 
> What kernel version is "rawhide" using?
> (I would appreciate a link to the git tree.)

Linux fedora-master-node2.int.fabbione.net 3.3.0-0.rc6.git2.2.fc18.x86_64 #1 SMP Wed Mar 7 06:26:50 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

fedpkg co kernel

The master branch, of course.

Comment 20 Fedora Update System 2012-04-12 02:40:46 UTC
libqb-0.11.1-1.fc17 has been pushed to the Fedora 17 stable repository.  If problems still persist, please make note of it in this bug report.

