| Summary: | RHCS cannot handle node restart, when cluster timeouts are high | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Michal Markowski <markowski> |
| Component: | openais | Assignee: | Jan Friesse <jfriesse> |
| Status: | CLOSED WONTFIX | QA Contact: | Cluster QE <mspqa-list> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | 5.6 | CC: | cluster-maint, edamato, ekuric, grimme, hlawatschek, jwest, markowski, sdake |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-10-05 14:39:55 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Michal Markowski
2011-03-15 08:14:18 UTC
Clarification: "restart a node" means a hard reset of the machine, such as after a kernel panic or a short power outage.

We encountered the problem on a real-life production system. We reproduced the same behavior on a small cluster, using a hard reset to simulate a typical failure case.

Timeout settings are based on SAN timeouts and Red Hat recommendations, such as:
https://access.redhat.com/kb/docs/DOC-37204
https://access.redhat.com/kb/docs/DOC-35071

The full process of reproducing the issue takes ca. 450 seconds.

Created attachment 484459
Log from the "stable" node

Created attachment 484461
Log from the node that has been reset
Any luck in reproducing / analyzing the problem? If you need more info, just ask - the issue is prio-1 for me.

Hello Lon,

The bug is flagged as [NEEDINFO]. Is there any info I can provide?

[I don't see any answers on this bug. I see my description and my 4 comments. Is this right?]

Best regards,
Michal Markowski
ATIX AG

Is this related to 645299?

Not related to 545299, more likely Bug #533369.

Honza, I noticed you did all the work on this bug - can you take a look at what it would take to backport.

Thanks

(In reply to comment #9)
> Not related to 545299, more likely Bug #533369.
>
> Honza, I noticed you did all the work on this bug - can you take a look at what
> it would take to backport.
>
> Thanks

Steve,
Bug #533369 is kernel "irq 9: nobody cared" after suspend to RAM, so this is a typo. Can you please send me the correct number? After a brief look at that bug, it doesn't seem familiar to me.

The correct bug ID is Bug 553369.

Steve,
a backport doesn't seem to be totally impossible, but it would mean changing or adding simply too much code. I would rather not take that risk, especially this late in the product lifecycle (5.8).