Description of problem:

This occurred during cluster tests with the resource agent (RA) script oracledb, which we used unmodified in this cluster to start up an Oracle DB instance. For test purposes we temporarily renamed the instance's pfile (i.e. the init.ora) so that the oracledb RA was doomed to fail when bringing the instance back up after we had shut it down manually.

We expected that, after a limited number of failed restart attempts on the local node, rgmanager would stop the whole service on the local node and try to fail it over to the failover node, which at the time was ready and enabled to take over the service's resources. However, this didn't happen. Instead the cluster service hung in an infinite loop in which the RA kept retrying to bring the DB instance up locally.

To my understanding this violates the atomic nature of a resource group/service: if you cannot make the whole lot available on the current node, go and try to fail it over to another ready node in the failover domain.

I then tried, without much success, to add an attribute such as "max_restarts" to the oracledb resource tag, similar to the one you may add to a service tag. This attribute was of course unknown to the XML parser in the oracledb resource context, which is why the "ccs_tool update" command failed.
I mean service tag attributes of this kind concerning restart attempts:

[root@aruba:~] # grep service.*lola /etc/cluster/cluster.conf
<service name="lola" autostart="0" domain="baros-fod" exclusive="0" max_restarts="1" recovery="restart" restart_expire_time="0">

Here is the excerpt of the oracledb tag from our cluster.conf:

<oracledb home="/app/oracle/product/11.2.0" name="LOLA" type="base" user="oracle" listener_name="L_LOLA">
    <script name="oracle_em" file="/etc/cluster/itdz/script_oracle_em.sh"/>
</oracledb>

Here's the resource hierarchy of the affected service during start-up:

[root@aruba:~] # /usr/sbin/rg_test noop /etc/cluster/cluster.conf start service lola
Running in test mode.
Starting lola...
[start] service:lola
[start] lvm:VG lola dbf
[start] fs:FS lola data01
[start] fs:FS lola data02
[start] fs:FS lola data03
[start] fs:FS lola data04
[start] lvm:VG lola log
[start] fs:FS lola data05
[start] fs:FS lola data06
[start] fs:FS lola data07
[start] fs:FS lola reorg
[start] ip:10.25.128.120
[start] oracledb:LOLA
[start] script:oracle_em
Start of lola complete

rgmanager-2.0.52-28.el5

How reproducible: always

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

The testing patch was already provided by jrummy and successfully tested by the customer, but we need to have this in a supported fashion.

In oracledb.sh we're looking for the ORA-XXXX error:

    grep -q "^ORA-" $logfile

And this is the output we get:

    SQL> ORA-01078: failure in processing system parameters

"ORA-" is not at the start of the line (^), so this doesn't match. We need to account for the possibility of "SQL> " at the start of the line.
Better regex:

    grep -qE "^(SQL>)?\s*ORA-"

Private branch:
################
private-jruemker-case711279

Scratch Build:
################
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=4993514

Patch:
#######
diff -up rgmanager-2.0.52/src/resources/oracledb.sh.case711279 rgmanager-2.0.52/src/resources/oracledb.sh
--- rgmanager-2.0.52/src/resources/oracledb.sh.case711279 2012-10-18 10:42:12.078366058 -0400
+++ rgmanager-2.0.52/src/resources/oracledb.sh 2012-10-18 10:45:34.310620633 -0400
@@ -299,7 +299,7 @@ start_db()
 	#
 	rm -f $tmpfile
-	grep -q "^ORA-" $logfile
+	grep -qE "^(SQL>)?\s*ORA-" $logfile
 	if [ $? -eq 0 ]; then
 		rm -f $tmpfile
 		echo "ORACLE_SID Incorrectly set?"
@@ -348,7 +348,7 @@ stop_db()
 	# If we see 'failure' in the log, we're done.
 	#
 	rm -f $tmpfile
-	grep -q "^ORA-" $logfile
+	grep -qE "^(SQL>)?\s*ORA-" $logfile
 	if [ $? -eq 0 ]; then
 		echo_failure
 		echo
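
To see the difference between the two patterns outside the cluster, the following is a minimal sketch that feeds the sqlplus output line quoted above to both checks. The temporary file is only a stand-in for the RA's $logfile, and the variable names old, new and posix are made up for this illustration. The third check is not part of the patch: it is an assumed alternative that spells \s as [[:space:]], because \s is a GNU grep extension rather than POSIX ERE syntax.

```shell
#!/bin/sh
# Stand-in for the RA's $logfile (hypothetical temp file); the content is
# the sqlplus output line quoted in this report.
logfile=$(mktemp)
printf '%s\n' 'SQL> ORA-01078: failure in processing system parameters' > "$logfile"

# Original check: "ORA-" must be at the very start of the line, so the
# leading "SQL> " prompt hides the error.
if grep -q "^ORA-" "$logfile"; then old=match; else old=nomatch; fi

# Patched check: optional "SQL>" prompt, optional whitespace, then "ORA-".
# \s is a GNU grep extension.
if grep -qE "^(SQL>)?\s*ORA-" "$logfile"; then new=match; else new=nomatch; fi

# Assumed portable spelling of the same pattern (not part of the patch).
if grep -qE "^(SQL>)?[[:space:]]*ORA-" "$logfile"; then posix=match; else posix=nomatch; fi

echo "old=$old new=$new posix=$posix"
rm -f "$logfile"
```

With the original pattern the error line goes undetected, which is why start_db() never reports the failure. Both extended patterns also still match a line that starts directly with "ORA-", since the prompt group and the whitespace are optional.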
This should be fixed as a side effect of fixing Bug 670024.

Please reopen if you still have problems with rgmanager-2.0.52-41.el5 or later.

*** This bug has been marked as a duplicate of bug 670024 ***