Bug 728041

Summary: ccs --start/--stop should not change chkconfig services
Product: Red Hat Enterprise Linux 6
Reporter: Etsuji Nakai <enakai>
Component: ricci
Assignee: Chris Feist <cfeist>
Status: CLOSED NOTABUG
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Priority: low
Version: 6.1
CC: cluster-maint
Target Milestone: rc
Hardware: Unspecified
OS: Linux
Doc Type: Bug Fix
Last Closed: 2011-08-08 21:56:58 UTC

Description Etsuji Nakai 2011-08-04 01:32:04 UTC
Description of problem:
When the cluster is started or stopped with ccs --start/--stop, the chkconfig settings of the cluster services are switched on/off as well. It would be better not to touch the chkconfig status: these options should only start or stop the services.
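
For illustration, this is roughly what the two behaviours amount to on a node (a sketch only, assuming a minimal cluster using the cman and rgmanager services; "node1" is a placeholder hostname, and the actual set of services ccs manages may be larger):

  # what "ccs -h node1 --start" effectively does today (per this report):
  chkconfig cman on
  chkconfig rgmanager on
  service cman start
  service rgmanager start

  # what this report asks --start to do: start only, leave chkconfig alone
  service cman start
  service rgmanager start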

The problem arises in the following scenario.

In the customer's cluster:

- They start/stop the cluster by starting/stopping the services directly (not using ccs at the moment), as sketched below.
- They set chkconfig off for the cluster services (cman, rgmanager, etc.).
- They force-reboot a failed node with the fence device.
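
Their manual procedure looks roughly like this (a sketch, assuming the standard cman/rgmanager init scripts; "node2" is a placeholder for the failed node):

  # one-time setup: keep the cluster services out of the boot sequence
  chkconfig cman off
  chkconfig rgmanager off

  # normal operation: start/stop the services by hand on each node
  service cman start
  service rgmanager start
  service rgmanager stop
  service cman stop

  # a failed node is force-rebooted through the fence device, e.g.
  fence_node node2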

In this setting, when a node is force-rebooted because of some problem such as a kernel panic, it does not automatically rejoin the cluster. The customer then logs in to the node and investigates the problem. Once they are sure the problem is resolved, they start the cluster services on that node again.

Now, the problem is that this customer cannot adopt the new ccs tool for cluster operation. Under ccs, when the failed node is force-rebooted, it automatically tries to rejoin the cluster because chkconfig is on, even though the underlying problem has not yet been investigated and resolved by the customer. Having a failed node rejoin the cluster automatically is not desirable.

Comment 3 Chris Feist 2011-08-08 21:56:58 UTC
Unfortunately, this works as designed (and I believe this is exactly how luci does it as well).  When you stop a node, you're permanently stopping it until you want to manually bring it back into service.

When you manually stop a node, you need to manually start it back up again.

There may be room for some documentation improvement.  I'll take a look and see whether I need to file a bug to document that --start & --stop also enable/disable cluster nodes.

Comment 4 Etsuji Nakai 2011-08-08 22:14:08 UTC
No, the problem is not that I need to manually start the node.

The problem is that once I started the node with --start, it automatically starts again when it is force-rebooted by the fence device. Don't you think it is a bad design that a failed (force-rebooted) node automatically joins the cluster again?

Comment 5 Chris Feist 2011-08-08 22:58:57 UTC
I think the issue is that --start/--stop should really be called --enable&start/--disable&stop.  ccs just calls ricci/clustermon, and ricci/clustermon executes the command.  Currently there is no command in ricci/clustermon for only starting or stopping a service without also turning it on/off.
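
Until such a command exists, a start without an enable has to be done with the init scripts on the node itself, e.g. (a rough sketch, assuming cman/rgmanager):

  service cman start
  service rgmanager start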

This will definitely work differently on RHEL7, but right now this behavior is consistent with how things worked in RHEL4/5.

Comment 6 Etsuji Nakai 2011-08-08 23:50:55 UTC
Yes, that makes sense. Logically, there could be the following combinations (a rough command-level sketch follows the list).

--enable&start / --disable&stop => service start/stop & chkconfig on/off
--enable / --disable => chkconfig on/off
--start / --stop => service start/stop
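
In plain service/chkconfig terms that would mean roughly (a sketch, assuming the cman and rgmanager services; the stop side is symmetric):

  # --enable&start
  chkconfig cman on; chkconfig rgmanager on; service cman start; service rgmanager start

  # --enable
  chkconfig cman on; chkconfig rgmanager on

  # --start
  service cman start; service rgmanager start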

I hope RHEL7 will allow customers to choose their preferred operation. Thanks.