From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.10) Gecko/20070313 Fedora/1.5.0.10-5.fc6 Firefox/1.5.0.10

Description of problem:
Working with a customer that requires a service to restart locally, but only a limited number of times. If the service is restarted locally X times within time frame Y (for example, 3 times in 1 hour), the service should then be relocated to a different node in the cluster.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Service has a problem, possibly due to hardware, and is restarted.
2. The problem happens again and the service is restarted again.
3. This creates a loop where the service is restarted over and over again but never relocated to a different machine.

Actual Results:
The service just keeps restarting and never relocates to a new server. This could be damaging, as it requires manual intervention.

Expected Results:
You should be able to set values in the services tab specifying that if the service is restarted X times within time frame Y, it is relocated to a separate machine.

Additional info:
Both the X (number of restarts) and Y (time frame) values are a must: some customers might find that 3 restarts in an hour is a good limit, while others might only want 2 restarts in a 24-hour period, or possibly in a week-long period.
Restarts are not currently tracked in rgmanager; that is, a restart is not recorded long-term. It is handled and never worried about again. To implement a time-based limit on X restarts, we would need to do one of the following: store more information in VF (such as an ancillary data block to record restart histories), store the information locally (other nodes shouldn't care about this information, since they're not running the service), or alter the semantics of how parts of the rg_state_t structure are used:

typedef struct {
        char            rs_name[64];    /**< Service name */
        uint64_t        rs_owner;       /**< Member ID running service. */
        uint64_t        rs_last_owner;  /**< Last member to run the service. */
        uint32_t        rs_state;       /**< State of service. */
        uint32_t        rs_restarts;    /**< Number of cluster-induced restarts */
        uint64_t        rs_transition;  /**< Last service transition time */
        uint32_t        rs_id;          /**< Service ID */
        uint32_t        rs_pad;         /**< pad to 64-bit boundary */
} rg_state_t;

(and utilize the rs_pad field for something...). Basically, the size of the above structure cannot be changed; doing so would break rolling upgrades.
With a node-local recording of cluster-induced restarts, it is very easy to throttle restarts based on X in Y time.
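As a rough illustration of the node-local approach, here is a minimal sketch of an "X restarts in Y seconds" throttle. It is not taken from the attached patches; restart_history_t, restart_history_init(), restart_allowed() and MAX_TRACKED_RESTARTS are hypothetical names, and the sketch assumes the caller supplies the current time in seconds.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

#define MAX_TRACKED_RESTARTS 16 /* hypothetical upper bound on X */

/* Hypothetical node-local restart history; not an rgmanager structure. */
typedef struct {
    time_t stamps[MAX_TRACKED_RESTARTS]; /* ring buffer of restart times */
    size_t count;                        /* valid entries in stamps[] */
    size_t next;                         /* slot the next restart overwrites */
    size_t max_restarts;                 /* X: restarts tolerated per window */
    time_t window;                       /* Y: window length in seconds */
} restart_history_t;

void restart_history_init(restart_history_t *h, size_t max_restarts,
                          time_t window)
{
    assert(max_restarts > 0 && max_restarts <= MAX_TRACKED_RESTARTS);
    memset(h, 0, sizeof(*h));
    h->max_restarts = max_restarts;
    h->window = window;
}

/*
 * Call when the service fails at time `now`.
 * Returns 1 if a local restart is still allowed (and records it),
 * or 0 if X restarts already happened within the last Y seconds,
 * in which case the caller should relocate the service instead.
 */
int restart_allowed(restart_history_t *h, time_t now)
{
    if (h->count == h->max_restarts) {
        /* Buffer is full, so stamps[next] is the oldest of the last X. */
        time_t oldest = h->stamps[h->next];
        if (now - oldest < h->window)
            return 0;
    }
    h->stamps[h->next] = now;
    h->next = (h->next + 1) % h->max_restarts;
    if (h->count < h->max_restarts)
        h->count++;
    return 1;
}
```

With X = 3 and Y = 3600, three failures a few minutes apart are restarted locally, and a fourth inside the same hour returns 0. Because only the last X timestamps are kept, the memory cost per service is fixed, and nothing has to be shared with other nodes.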
Created attachment 162009 [details]
Patch, pass 1.
Created attachment 162010 [details]
Pass 2; adds support to the resource agent so that the limit is picked up via the config.

Note: does not do time-based throttling; only a hard limit.
The rs_id and rs_pad fields are not used by rgmanager. We could use these fields as a "first-start" time.
(It's not even endian-swapped in reslist.h)
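One way the rs_id/rs_pad idea could look, as a sketch only: the two unused 32-bit fields hold the high and low halves of a 64-bit "first-start" time, and rs_restarts is counted within a fixed window that begins at that time. The struct is copied from the comment above; set_first_start(), get_first_start() and note_restart() are hypothetical helpers, not rgmanager functions. Resetting rs_restarts when the window expires is an assumption about how the field's semantics might be altered, and byte-order handling for these fields is not addressed here.

```c
#include <assert.h>
#include <stdint.h>

/* Structure as quoted earlier in this report; its size is unchanged. */
typedef struct {
    char     rs_name[64];   /* Service name */
    uint64_t rs_owner;      /* Member ID running service */
    uint64_t rs_last_owner; /* Last member to run the service */
    uint32_t rs_state;      /* State of service */
    uint32_t rs_restarts;   /* Number of cluster-induced restarts */
    uint64_t rs_transition; /* Last service transition time */
    uint32_t rs_id;         /* Reused here: high 32 bits of first-start */
    uint32_t rs_pad;        /* Reused here: low 32 bits of first-start */
} rg_state_t;

/* Hypothetical helper: split a 64-bit time across the two spare fields. */
static void set_first_start(rg_state_t *rs, uint64_t t)
{
    rs->rs_id  = (uint32_t)(t >> 32);
    rs->rs_pad = (uint32_t)(t & 0xffffffffu);
}

/* Hypothetical helper: reassemble the 64-bit first-start time. */
static uint64_t get_first_start(const rg_state_t *rs)
{
    return ((uint64_t)rs->rs_id << 32) | rs->rs_pad;
}

/*
 * Call when the service fails at time `now` (seconds).
 * Returns 1 if a local restart is allowed (and counts it), or 0 if
 * max_restarts were already used in the current window, meaning the
 * service should be relocated.
 */
int note_restart(rg_state_t *rs, uint64_t now, uint32_t max_restarts,
                 uint64_t window)
{
    uint64_t first = get_first_start(rs);

    assert(window > 0);
    if (first == 0 || now - first >= window) {
        /* No window yet, or the old one expired: start a new one. */
        set_first_start(rs, now);
        rs->rs_restarts = 0;
    }
    if (rs->rs_restarts >= max_restarts)
        return 0;
    rs->rs_restarts++;
    return 1;
}
```

This is a fixed window rather than a sliding one, so a burst of restarts that straddles a window boundary can exceed X within Y seconds. The trade-off is that no extra storage is needed and the structure keeps its size, which is the constraint that matters for rolling upgrades.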
Pushed to RHEL4 git branch
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0791.html