Bug 582313
| Field | Value |
|---|---|
| Summary | stuck on sem_timedwait |
| Product | [Fedora] Fedora |
| Component | corosync |
| Version | rawhide |
| Status | CLOSED UPSTREAM |
| Severity | medium |
| Priority | low |
| Hardware | All |
| OS | Linux |
| Reporter | David Teigland <teigland> |
| Assignee | Jan Friesse <jfriesse> |
| QA Contact | Fedora Extras Quality Assurance <extras-qa> |
| CC | agk, fdinitto, sdake |
| Doc Type | Bug Fix |
| Bug Blocks | 582326 |
| Last Closed | 2010-04-26 16:18:02 UTC |

Description
David Teigland
2010-04-14 15:48:08 UTC
My cluster is busy with something else at the moment, but to reproduce this I'd first try:

1. service cman start on node1 and node2
2. cman_tool kill -n node1 from node2
3. check if fenced is stuck on node1

If that doesn't do it, I'd try:

1. service cman start on node1, node2, node3
2. create network partition: node1 | node2, node3
3. remove network partition
4. node2 or node3 should kill node1
5. check if fenced is stuck on node1

Verified that cman_tool kill will reproduce the problem (the only difference is that I have four nodes in my cluster). I repeated the test twice; the problem reproduced on the second try. Also note that I'm using the latest cpg patch adding the totem/ringid callbacks.

Jan, good point, the fenced I'm using is updated to use the new cpg_model_initialize API. I'll send you a patch with the fenced changes.

Created attachment 407079
Proposed patch
Patch which handles POLLNVAL. The return value of poll is now also handled better.

Thanks, I'll try the patch. Sorry I didn't get you the fenced patch I'm using; I was too busy debugging it and forgot.

Honza, using the patch, I've tried both tests above a couple of times and have not seen fenced get stuck. I'll try a few more times next week and let you know.

Created attachment 407208
fenced patch using new cpg api
Here's the fenced version that I was seeing trouble with, in case you'd like to try it.

Dave, I tried to reproduce the bug (without the patch I sent and WITH the fenced patch you sent), without success. Are you using Fedora rawhide? If so, it looks to me like an incompatibility in how poll works and what it returns with a newer kernel, glibc, or something else. Anyway, that part of coroipcc was not very well written, so the patch should be included in corosync.

Tried this several more times using the patch and haven't seen the hang, so I suggest we call it a fix. (Using F12 with a recent devel kernel.)

The patch is now included upstream as svn revision 2789, so I'm closing this bug.