490449 – domU's restart after cluster.conf update

Summary: domU's restart after cluster.conf update

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	rgmanager
Sub Component:
Version:	5.3
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Lon Hohberger
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	491654

Reported:	2009-03-16 14:27 UTC by Shane Bradley
Modified:	2018-10-20 01:59 UTC
CC List:	12 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-09-02 11:03:37 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
Patch ported from stable3 which disables status checks during reconfiguration (4.10 KB, patch) 2009-03-17 22:10 UTC, Lon Hohberger
debug log from ncldl38076 (54.10 KB, text/plain) 2009-03-18 16:38 UTC, Samuel Kielek
Patch encompassing 3 additional patches: (6.34 KB, patch) 2009-03-18 21:25 UTC, Lon Hohberger
debug log from ncldl38077 (26.19 KB, text/plain) 2009-03-18 22:26 UTC, Samuel Kielek
debug log from ncldl38077 (33.60 KB, text/plain) 2009-03-19 00:05 UTC, Samuel Kielek
Fake virtual machine agent which can be used to reproduce/test issue (10.76 KB, text/plain) 2009-03-20 17:27 UTC, Lon Hohberger
Simple utility to bump configuration version and print out the new version (1.13 KB, text/plain) 2009-03-20 17:28 UTC, Lon Hohberger
Script which can be used to reproduce above behavior; edit as necessary (182 bytes, text/plain) 2009-03-20 17:29 UTC, Lon Hohberger

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2009:1339	0	normal	SHIPPED_LIVE	Low: rgmanager security, bug fix, and enhancement update	2009-09-01 10:42:29 UTC

Description Shane Bradley 2009-03-16 14:27:52 UTC

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.7) Gecko/2009030503 Fedora/3.0.7-1.fc10 Firefox/3.0.7

It's a 3-node cluster used exclusively for dom0 clustering. All nodes
are running RHEL 5.3 AP. The servers are HP DL-580s with 4 quad-core
sockets (16 cores total), 64 GB of RAM, and two 4 Gb/s Emulex HBAs. All
the VMs use block device pass-through (no file-backed guests).

If you look at the logs from one of the nodes (NCLDL58016) you will
see that the reconfig occurred on March 4th at 23:09:26

Mar 4 23:09:26 ncldl58016 ccsd[19819]: Update of cluster.conf complete
(version 1 -> 2).

And then 33 seconds later the vifs and block devices all start coming
down.

And finally at 23:11:09 the VM services start recovering.

In our test cluster, which very closely mirrors the production cluster
except for the number of guests, we have found that we can reproduce
this issue by simply incrementing the config_version in cluster.conf
and then updating the cluster with that config. So *any* cluster
reconfig will cause the VMs to get restarted.
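
For anyone trying to reproduce this, the version bump itself can be sketched roughly as below. This is only an illustration, assuming the usual layout where the root <cluster> element of cluster.conf carries the config_version attribute; bump_config_version is a made-up helper name, and the updated file still has to be pushed out to the cluster by the normal means.

```python
import xml.etree.ElementTree as ET


def bump_config_version(xml_text):
    """Return (new_version, new_xml) with config_version incremented by one.

    Assumes the root element is <cluster> and has a config_version attribute.
    """
    root = ET.fromstring(xml_text)
    if root.tag != "cluster":
        raise ValueError("expected <cluster> as the root element")
    new_version = int(root.get("config_version")) + 1
    root.set("config_version", str(new_version))
    return new_version, ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Tiny stand-in for a real cluster.conf; contents are illustrative only.
    sample = '<cluster name="test" config_version="1"><rm/></cluster>'
    version, updated = bump_config_version(sample)
    print(version)
    print(updated)
```

Nothing else in the file changes, which matches the observation that a bare version bump is enough to trigger the restarts.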



Reproducible: Always

Steps to Reproduce:
1. Update cluster.conf.
2. Save the changes and send them to the nodes.
3. The VM will then restart.
Actual Results:  
The vm service will restart

Expected Results:  
The cluster.conf should be saved and sent to all nodes, and VMs that are running should not restart.

Comment 4 Lon Hohberger 2009-03-17 21:41:00 UTC

Ok...

Here's something worth noting:

Mar  4 23:09:41 ncldl58016 clurgmgrd[21222]: <info> Stopping changed resources.
[VMs stop]
Mar  4 23:12:15 ncldl58016 clurgmgrd[21222]: <info> Restarting changed resources.
Mar  4 23:12:15 ncldl58016 clurgmgrd[21222]: <info> Starting changed resources.
Mar  4 23:12:15 ncldl58016 clurgmgrd[21222]: <notice> Initializing vm:lnxp0006
Mar  4 23:12:15 ncldl58016 clurgmgrd[21222]: <notice> vm:lnxp0006 was added to the config, but I am not initializing it.

That's 2.5 minutes of 'idling'.  If there's nothing to do in the stop-phase after a configuration update, rgmanager immediately moves to 'restarting' followed by 'starting'.
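
As a sanity check on the 2.5 minutes figure: the gap between "Stopping changed resources" (23:09:41) and "Restarting changed resources" (23:12:15) in the log above works out to 154 seconds. A throwaway sketch of the arithmetic follows; elapsed_seconds is just an illustrative helper and assumes both timestamps fall on the same day.

```python
from datetime import datetime


def elapsed_seconds(start, end, fmt="%H:%M:%S"):
    """Seconds between two same-day wall-clock timestamps."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


if __name__ == "__main__":
    # Timestamps copied from the clurgmgrd log lines quoted above.
    print(elapsed_seconds("23:09:41", "23:12:15"))  # -> 154
```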

rgmanager *must* have gotten confused about the configuration update -- the question is how and why, and why does rg_test work?

Comment 5 Lon Hohberger 2009-03-17 22:09:44 UTC

There's a possibility that this particular issue was caused by a status-check vs. reconfiguration ordering problem, which I have fixed in the master and stable3 branches already.

Comment 6 Lon Hohberger 2009-03-17 22:10:42 UTC

Created attachment 335614 [details]
Patch ported from stable3 which disables status checks during reconfiguration

Comment 8 Lon Hohberger 2009-03-18 13:33:51 UTC

http://git.fedorahosted.org/git/?p=rgmanager.git;a=commit;h=6f0d24469715359fd592e9bbd674cf6f3fb79eb9

^^ Explanation of previously-attached patch.

There is another patch which is in stable3 (and I think this customer already has it) which also helped somewhat:

http://git.fedorahosted.org/git/?p=rgmanager.git;a=commit;h=15e48b31f2d7e3108f9b899a501a3015821e1e2c

Comment 9 Lon Hohberger 2009-03-18 13:36:17 UTC

Previous link was for master branches.  STABLE3 links follow:

Patch 1:

http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=6f0d24469715359fd592e9bbd674cf6f3fb79eb9

Patch 2:

http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=b4939e9e1ded08594c3b86237b24318f3d23414f

Comment 10 Samuel Kielek 2009-03-18 16:38:04 UTC

Created attachment 335728 [details]
debug log from ncldl38076

Comment 13 Lon Hohberger 2009-03-18 21:25:36 UTC

Created attachment 335766 [details]
Patch encompassing 3 additional patches:

* Fix rare segfault in -USR1 dump code
* Ensure we don't reconfig on the same new config version twice
* Ensure we always call init_resource_groups() with the right # of parameters
* Block signals in worker threads so SIGHUP/SIGINT/etc. do not abort status checks erroneously.
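
The last item is the usual pattern of masking signals in worker threads so that only the main thread ever sees them. rgmanager is C, where this would normally be done with pthread_sigmask(); what follows is only a rough Python analogue of the idea on a Unix system, not the patch itself. The worker function, the result dictionary, and the choice of signals are all illustrative. (In Python the handlers run in the main thread anyway, so this only demonstrates the per-thread mask.)

```python
import signal
import threading

# Signals the worker should never be interrupted by (illustrative choice).
BLOCKED = {signal.SIGHUP, signal.SIGINT, signal.SIGTERM}


def status_check_worker(result):
    """Stand-in for a status-check thread; masks signals before doing work."""
    signal.pthread_sigmask(signal.SIG_BLOCK, BLOCKED)
    # SIG_BLOCK with an empty set changes nothing and returns the current mask.
    result["mask"] = signal.pthread_sigmask(signal.SIG_BLOCK, set())
    result["status"] = 0  # pretend the status check succeeded


def run_worker():
    """Run one worker thread to completion and return what it recorded."""
    result = {}
    thread = threading.Thread(target=status_check_worker, args=(result,))
    thread.start()
    thread.join()
    return result


if __name__ == "__main__":
    outcome = run_worker()
    print(BLOCKED <= outcome["mask"])
```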

Comment 14 Lon Hohberger 2009-03-18 21:49:25 UTC

Buildable .src.rpm:

http://people.redhat.com/lhh/rgmanager-2.0.46-4lhh.src.rpm

Comment 15 Samuel Kielek 2009-03-18 22:26:49 UTC

Created attachment 335774 [details]
debug log from ncldl38077

Comment 16 Lon Hohberger 2009-03-18 23:31:31 UTC

Updated SRPM

http://people.redhat.com/lhh/rgmanager-2.0.46-5lhh.src.rpm

This one includes a scratch debugging patch to see if rgmanager actually believed it should restart/stop a VM after a reconfiguration.

Comment 17 Samuel Kielek 2009-03-19 00:05:54 UTC

Created attachment 335786 [details]
debug log from ncldl38077

Comment 18 Lon Hohberger 2009-03-19 01:46:27 UTC

Ok, so the current status here is as follows:

With all the above (working) patches, rgmanager still occasionally thinks it needs to stop a VM (or VMs).  It literally thinks that the VM has changed configuration after the configuration update, and for some reason flags it as needing a restart (or worse: just a stop).

Comment 20 Lon Hohberger 2009-03-19 12:14:00 UTC

I added a patch here which has debugging log messages.

Additionally, the patch zaps all flags during the tree delta in case any were set that we weren't aware of.  This -should- prevent the last instance of the VMs restarting, but why the 'RF_NEEDSTOP' flag was set in the first place is still not known.

http://people.redhat.com/lhh/rgmanager-2.0.46-8lhh.src.rpm

Comment 21 Lon Hohberger 2009-03-19 14:31:54 UTC

I'm pretty sure the NEEDSTOP flag is getting set because of this:

* node A starts vm:foo.  Before starting vm:foo, it asks the rest of the cluster if they have seen vm:foo

* node B receives a status inquiry request from node A.  It then executes a status check on that VM to see if it is running.  It's not, so status returns 1.  At this point, node B sets a NEEDSTOP flag.

* Suppose you disable the VM on node A and start it on node B now.  At this point, the NEEDSTOP flag is still persisted on node B, but is ignored by the start/status checks.

* If you then do a configuration update, the NEEDSTOP flag is -still- there.  After a configuration update (or during a special "recover" operation), the NEEDSTOP flag is used by rgmanager to decide which resources need to be stopped.  Presence of this flag does NOT alter service state.

* Rgmanager does its reconfiguration, sees the NEEDSTOP flag, and stops the virtual machine.  Because the state has not actually changed according to rgmanager (NEEDSTOP is succeeded by NEEDSTART if a resource's parameters have changed, for example), the next status check causes a recovery of the VM and then the VM is restarted.

The previous patch masks this issue, but in the wrong way - it clears the NEEDSTOP flag during reconfiguration.  The NEEDSTOP flag should either be cleared on resource start or not set during the special status inquiry operation.
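
To make the sequence above easier to follow, here is a toy model of the flag lifecycle on node B. It is a sketch under stated assumptions, not rgmanager code: the Node and Resource classes, the bit value of RF_NEEDSTOP, and the clear_on_start switch are all invented for illustration, and the switch models only the first of the two possible fixes mentioned (clearing the flag on resource start).

```python
from dataclasses import dataclass

RF_NEEDSTOP = 0x1  # hypothetical bit value, chosen for the sketch


@dataclass
class Resource:
    name: str
    flags: int = 0
    running: bool = False


class Node:
    def __init__(self, clear_on_start):
        self.clear_on_start = clear_on_start
        self.resources = {}

    def _res(self, name):
        return self.resources.setdefault(name, Resource(name))

    def status_inquiry(self, name):
        """Another node asks whether we have seen this resource."""
        res = self._res(name)
        if not res.running:
            res.flags |= RF_NEEDSTOP  # stale flag is left behind here
            return 1
        return 0

    def start(self, name):
        res = self._res(name)
        res.running = True
        if self.clear_on_start:
            res.flags &= ~RF_NEEDSTOP  # modelled fix: drop the stale flag

    def reconfigure(self):
        """Stop every running resource still flagged; return their names."""
        stopped = []
        for res in self.resources.values():
            if res.flags & RF_NEEDSTOP and res.running:
                res.running = False
                stopped.append(res.name)
        return stopped


def scenario(clear_on_start):
    node_b = Node(clear_on_start)
    node_b.status_inquiry("vm:foo")  # node A asks before starting vm:foo
    node_b.start("vm:foo")           # later the VM is enabled on node B
    return node_b.reconfigure()      # config update arrives


if __name__ == "__main__":
    print(scenario(clear_on_start=False))  # -> ['vm:foo']
    print(scenario(clear_on_start=True))   # -> []
```

With the flag left in place the reconfiguration stops a VM whose configuration never changed; clearing it on start leaves the VM alone.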

Comment 22 Lon Hohberger 2009-03-19 15:16:12 UTC

http://people.redhat.com/lhh/rgmanager-2.0.46-9lhh.src.rpm

Comment 23 Lon Hohberger 2009-03-19 15:19:36 UTC

http://people.redhat.com/lhh/rgmanager-2.0.46-9minimal.src.rpm

Same package which *only* fixes the problem described in comment #21

Comment 24 Samuel Kielek 2009-03-19 15:45:55 UTC

confirming that the minimal patch did resolve the issue on our test cluster

Comment 25 Lon Hohberger 2009-03-19 15:50:14 UTC

This is a general problem which can occur on any cluster following the steps in comment #21.

Comment 26 Lon Hohberger 2009-03-19 16:01:55 UTC

http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=47b2d48660c7c70d41c479b45e33be94ed8b3e9a

Pushed.

Comment 29 Lon Hohberger 2009-03-20 17:27:03 UTC

Created attachment 336081 [details]
Fake virtual machine agent which can be used to reproduce/test issue

This agent fakes out rgmanager to make it think it's controlling a virtual machine.  This is not for use in production environments.  The only use for this attachment is testing the errant behavior present within the internals of rgmanager.

You can install this test utility in the following way:

  chmod -x /usr/share/cluster/vm.sh
  cp vm-test.sh /usr/share/cluster
  chmod +x /usr/share/cluster/vm-test.sh

You must distribute ssh keys between all hosts in the cluster in order for 'migrate' to work.

How To Test:

* Create a virtual machine on a cluster of at least 2 nodes (or use the fake script provided).
* Start rgmanager on all nodes.  The virtual machine will get started on one node.
* Disable the virtual machine using 'clusvcadm -d'.
* Enable the virtual machine explicitly on another node using 'clusvcadm -e'.
* Change the configuration file by incrementing the config_version.
* Distribute the configuration file using 'ccs_tool update'.
* Ensure configuration version consistency in cman by running 'cman_tool version -r <new_config_version>'.

Old (broken) behavior:

* Virtual machine which was disabled and enabled on another node will restart.

Corrected behavior:

* Virtual machine which was disabled and enabled on another node will not restart.
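
The test steps above can be strung together roughly as follows. This is a dry-run sketch that only builds and prints the command lines; it is not the attached reproducer script. The service name, node name, the '-m <member>' option used to pick the target node, and the /etc/cluster/cluster.conf path are assumptions not taken from this report, so adjust them for the cluster at hand.

```python
def repro_commands(service, target_node, new_version,
                   conf="/etc/cluster/cluster.conf"):
    """Return the argv lists for the test steps, in order.

    The config_version in `conf` must be incremented to `new_version`
    by hand (or by a helper) before the ccs_tool step is run.
    """
    return [
        ["clusvcadm", "-d", service],                     # disable the VM
        ["clusvcadm", "-e", service, "-m", target_node],  # enable elsewhere
        ["ccs_tool", "update", conf],                     # distribute config
        ["cman_tool", "version", "-r", str(new_version)],  # sync cman version
    ]


if __name__ == "__main__":
    # Hypothetical names; nothing is executed, the commands are only printed.
    for argv in repro_commands("vm:foo", "node2", 3):
        print(" ".join(argv))
```

After running the real commands, the VM enabled on the second node should stay up on a fixed rgmanager and restart on a broken one.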

Comment 30 Lon Hohberger 2009-03-20 17:28:30 UTC

Created attachment 336082 [details]
Simple utility to bump configuration version and print out the new version

Comment 31 Lon Hohberger 2009-03-20 17:29:10 UTC

Created attachment 336083 [details]
Script which can be used to reproduce above behavior; edit as necessary

Comment 32 Lon Hohberger 2009-03-20 17:30:12 UTC

To compile attachment noted in comment #30, perform the following:

   gcc -o config-bump-xml config-bump-xml.c -I/usr/include/libxml2 -lxml2 -ggdb

You will need the libxml2-devel package installed.

Comment 34 Marc Grimme 2009-04-08 07:49:11 UTC

I can also confirm that the minimal rpm seems to fix this problem.

Will this bug fix get a hotfix or go into the z-stream, or is it targeted for 5.4?

It's not yet flagged in any such way.

Marc.

Comment 35 Lon Hohberger 2009-04-09 21:29:44 UTC

There's a z-stream bugzilla here:

https://bugzilla.redhat.com/show_bug.cgi?id=491654

Comment 38 Chris Ward 2009-07-03 18:27:17 UTC

~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.

Comment 40 errata-xmlrpc 2009-09-02 11:03:37 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1339.html
