Bug 360541
| Summary: | On a two-node cluster, when a member reboots, services do not switch to the active node |
| --- | --- |
| Product: | [Retired] Red Hat Cluster Suite |
| Component: | clumanager |
| Version: | 3 |
| Hardware: | i686 |
| OS: | Linux |
| Severity: | high |
| Priority: | low |
| Status: | CLOSED ERRATA |
| Reporter: | Eric MAGALLON <eric.magallon-external> |
| Assignee: | Lon Hohberger <lhh> |
| QA Contact: | Cluster QE <mspqa-list> |
| CC: | bmr, cluster-maint, rkenna |
| Fixed In Version: | RHBA-2008-0091 |
| Doc Type: | Bug Fix |
| Last Closed: | 2008-02-07 18:31:42 UTC |
Description: Eric MAGALLON, 2007-10-31 16:48:07 UTC
```
Oct 31 14:59:44 magenta clusvcmgrd[2150]: <debug> Evaluating service foo6, state started, owner member1
Oct 31 14:59:44 magenta clusvcmgrd[2150]: <debug> Service foo6 request 3
Oct 31 14:59:44 magenta clusvcmgrd[28362]: <debug> [C] Pid 28362 handling start request for service foo6
Oct 31 14:59:44 magenta clusvcmgrd[28362]: <debug> Starting service 6 - flags 0x00000000
Oct 31 14:59:44 magenta clusvcmgrd[28362]: <debug> Handling start request for service foo6
Oct 31 14:59:44 magenta clusvcmgrd[2150]: <debug> [M] Pid 28362 -> start for service foo6
Oct 31 14:59:44 magenta clusvcmgrd[28362]: <debug> Service is running on member member1.
Oct 31 14:59:44 magenta clusvcmgrd[2150]: <debug> Evaluating service foo7, state started, owner member1
Oct 31 14:59:44 magenta clusvcmgrd[2150]: <debug> Service foo7 request 3
Oct 31 14:59:44 magenta clusvcmgrd[28363]: <debug> [C] Pid 28363 handling start request for service foo7
Oct 31 14:59:44 magenta clusvcmgrd[2150]: <debug> [M] Pid 28363 -> start for service foo7
Oct 31 14:59:44 magenta clusvcmgrd[2150]: <info> State change: member0 DOWN
Oct 31 14:59:44 magenta clusvcmgrd[28363]: <debug> Starting service 7 - flags 0x00000000
Oct 31 14:59:44 magenta clusvcmgrd[28363]: <debug> Handling start request for service foo7
Oct 31 14:59:44 magenta clusvcmgrd[28363]: <debug> Service is running on member member1.
```

So, it evaluated the service incorrectly: it looks like the previously-recorded panic mask is used, and it needs to be cleared in this case after a real node-down event arrives from the quorum daemon.

```
Oct 31 15:07:32 magenta clumembd[2097]: <info> Membership View #33:0x00000001
Oct 31 15:07:33 magenta cluquorumd[2079]: <warning> Membership reports #1 as down, but disk reports as up: State uncertain!
```
```
Oct 31 15:07:33 magenta cluquorumd[2079]: <warning> --> Commencing STONITH <--
Oct 31 15:07:34 magenta clusvcmgrd[2125]: <info> Quorum Event: View #74 0x00000001
Oct 31 15:07:34 magenta clusvcmgrd[2125]: <warning> Member member1's state is uncertain: Some services may be unavailable!
Oct 31 15:07:39 magenta kernel: e100: eth1: e100_watchdog: link down
Oct 31 15:07:54 magenta cluquorumd[2079]: <warning> --> Commencing STONITH <--
Oct 31 15:07:56 magenta cluquorumd[2079]: Power to NPS outlet(s) 6 turned /Off.
Oct 31 15:07:56 magenta cluquorumd[2079]: <notice> STONITH: member1 has been fenced!
Oct 31 15:07:57 magenta cluquorumd[2079]: Power to NPS outlet(s) 6 turned /On.
Oct 31 15:07:57 magenta cluquorumd[2079]: <notice> STONITH: member1 is no longer fenced off.
Oct 31 15:07:58 magenta clusvcmgrd[2125]: <info> Quorum Event: View #76 0x00000001
Oct 31 15:07:58 magenta clusvcmgrd[3362]: <notice> Taking over service foo3 from down member member1
Oct 31 15:07:58 magenta clusvcmgrd[3374]: <notice> Taking over service foo4 from down member member1
Oct 31 15:07:58 magenta clusvcmgrd: [3363]: <notice> service notice: Starting service foo3 ...
Oct 31 15:07:58 magenta clusvcmgrd[3415]: <notice> Taking over service foo5 from down member member1
Oct 31 15:07:58 magenta clusvcmgrd[3440]: <notice> Taking over service foo6 from down member member1
Oct 31 15:07:58 magenta clusvcmgrd: [3404]: <notice> service notice: Starting service foo4 ...
Oct 31 15:07:58 magenta clusvcmgrd[2125]: <info> State change: member0 DOWN
Oct 31 15:07:58 magenta clusvcmgrd[3473]: <notice> Taking over service foo7 from down member member1
Oct 31 15:07:59 magenta clusvcmgrd: [3425]: <notice> service notice: Starting service foo5 ...
Oct 31 15:07:59 magenta clusvcmgrd: [3470]: <notice> service notice: Starting service foo6 ...
Oct 31 15:07:59 magenta clusvcmgrd: [3479]: <notice> service notice: Starting service foo7 ...
```
```
Oct 31 15:07:59 magenta clusvcmgrd: [3363]: <notice> service notice: Started service foo3 ...
Oct 31 15:07:59 magenta clusvcmgrd: [3404]: <notice> service notice: Started service foo4 ...
Oct 31 15:07:59 magenta clusvcmgrd: [3425]: <notice> service notice: Started service foo5 ...
Oct 31 15:07:59 magenta clusvcmgrd: [3479]: <notice> service notice: Started service foo7 ...
Oct 31 15:08:00 magenta clusvcmgrd: [3470]: <notice> service notice: Started service foo6 ...
```

(Logs with patch.)

Note that my reproduction case was not tied to any normal reboot case: I had to first kill the node's ethernet link, then either kill -9 all of the cluster daemons or do 'reboot -fn'.

Created attachment 244881 [details]: Fix.
Unofficial test packages are here (they fix 2 other bugs too): http://people.redhat.com/lhh/packages.html http://people.redhat.com/lhh/patches.html

Thanks, Lon, for your quick fix :) I will install and test it next Monday on our test platform and report back with the test results.

Debuginfo packages are there, too, but not on the page: http://people.redhat.com/lhh/clumanager-debuginfo-1.2.36-1.i386.rpm http://people.redhat.com/lhh/clumanager-debuginfo-1.2.36-1.x86_64.rpm

Lon, the link to the i686 package is broken: http://people.redhat.com/lhh/clumanager-1.2.36-1.i686.rpm Can you provide it to me? Thanks.

Lon, I finally managed to download the package; its actual name is clumanager-1.2.36-1.i386.rpm. After installation, the tests are successful for both reboot and shutdown. Can you tell us when an official release will be available from Red Hat? Regards.

Sorry about that; I fixed the link. Because Red Hat Enterprise Linux 3 and associated layered products (including Red Hat Cluster Suite 3) are in maintenance mode, you need to make a request through Red Hat Support if you need official packages: http://www.redhat.com/apps/support/ This will be going out as an erratum shortly. The package will be 1.2.37-1.1; watch for it on RHN.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0091.html