Red Hat Bugzilla – Bug 1276052
[GSS](6.4.z) account for additional DB2 FATAL connection errors
Last modified: 2017-05-19 04:05:16 EDT
Description of problem:
Various versions of pre-11.x DB2 drivers use the -99999 error code for a SQLException. Not all -99999 errors are fatal. For those variations that are known to be fatal, a check should be added to treat them as such.
One example would be the -99999 error that indicates "Connection is closed"
This could be accomplished similar to the checks in the OracleExceptionSorter:
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Hi, DB2 issues error code -99999 when the IBM Driver for JDBC has issued an error that does not yet have an error code. It is a sort of wildcard for the DB2 JDBC driver.
I'm not sure whether you have a list of the errors you want to consider "fatal", but it would be helpful to provide that list and not just an example. As it stands, you are proposing some sort of heuristic to determine whether a given error is fatal.
Due to issues in getting a definitive list of FATAL error messages associated with the -99999 error code, research and development has suggested a new exception-sorter be created: DB2With99999ExceptionSorter
DB2With99999ExceptionSorter will include a check treating any -99999 error as FATAL.
To be clear, DB2With99999ExceptionSorter will also include the same checks as the existing DB2ExceptionSorter. :)
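For illustration, the proposal above could be sketched roughly as follows. This is a minimal sketch and not the shipped class: the class name `Db2With99999SorterSketch` and the `existingDb2Checks` stub are hypothetical, and the checks of the existing DB2ExceptionSorter are not reproduced in this thread, so the stub simply returns false.

```java
import java.sql.SQLException;

// Hypothetical sketch of the proposed behavior; names are illustrative only.
final class Db2With99999SorterSketch {
    static final int WILDCARD_CODE = -99999;

    // Stand-in for the checks already done by the existing DB2ExceptionSorter.
    // Those checks are not listed in this bug, so this stub always returns false.
    static boolean existingDb2Checks(SQLException e) {
        return false;
    }

    // Treat every -99999 error as fatal, in addition to the existing checks.
    static boolean isExceptionFatal(SQLException e) {
        return e.getErrorCode() == WILDCARD_CODE || existingDb2Checks(e);
    }

    public static void main(String[] args) {
        SQLException closed =
            new SQLException("Invalid operation: Connection is closed.", null, -99999);
        SQLException other = new SQLException("some other error", null, -204);
        System.out.println(isExceptionFatal(closed)); // true
        System.out.println(isExceptionFatal(other));  // false
    }
}
```

Note that this sketch marks any -99999 error as fatal regardless of its message, which is exactly the breadth being debated in this bug.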
I was asked in a support portal case to update this Bugzilla. From my perspective, RedHat claims to support DB2 integration. I believe there are several aspects to doing that properly, including testing the integrations, working with the vendor and providing "glue" classes where appropriate for proper container integration. From our perspective RedHat bearing that risk and responsibility is one of the big reasons we pay you for support.
I believe the exception sorter is one of those "glue" classes and that RedHat should stand behind it so that it functions as correctly as possible. I understand there is no consensus as to what the correct behavior is, but that is simply because we don't yet understand the DB2 JDBC driver well enough. There is in fact a canonical state of correct operation for the DB2 exception sorter that could be determined with appropriate information. Both proposed implementations are a known-incorrect compromise. Our expectation is that RedHat will invest appropriately to correct that, or else not make the claim of supporting DB2 integrations.
I propose instead of two known-incorrect DB2 exception sorters that additional investment is made with IBM to determine the most correct behavior of the DB2 exception sorter and implement that once.
Had a discussion with the customer and they would like to see the issue addressed. They are concerned about how well the EAP product supports IBM DB2. Connection exception handling should follow the third-party vendor's definition of whether a given return value is a fatal event.
The -99999 error did not conclusively indicate a FATAL connection. It also seems IBM has indicated this inconclusive behavior will not change with older versions of the driver.
Changing the default behavior for the existing DB2ExceptionSorter will affect any installations where the current behavior is expected with no confidence that the -99999 is actually FATAL.
Either adding DB2With99999ExceptionSorter (suggested here earlier) or using the now-available ListExceptionSorter would be a recommended option.
If the suggested DB2With99999ExceptionSorter is not a good alternative then I recommend this be set to Closed/Wontfix.
Talking with Rich, it sounds like there was some discussion on this ticket to which I was not privy and that the case is being considered for closing without a fix. My impression is that I was asked to participate in this Bugzilla to have an open discussion on the merits of the actual issue, and unfortunately it seems that has not happened. I am not sure why RedHat is not interested in an open discussion on the technical merits of this issue. Regardless, I will try to provide some counterpoint to issues that I understand have been raised but that I cannot see.
To clarify the need for a change, in EAP 5.x we had numerous production outages related to DB2 connectivity during mainframe IPL activity. Upon the mainframe restart, some DB2 connections would be in a fatal state where, when they attempted to perform actual activity, a "-99999" exception code would be returned with:
"com.ibm.db2.jcc.b.SqlException: [ibm][db2][jcc] Invalid operation: Connection is closed."
The connection pool would not mark this exception as fatal and would therefore re-issue the connection causing multiple worker threads to fail over and over. The only recourse for resolving this was to restart the JBoss JVM.
We worked through the diagnostics of this issue through several support cases (01518315, 01177217 and 01533840 for starters, and the cases linked to those), and were issued patch JBPAPP-11232. That patch provided a simple and overly-broad solution to our production outages by treating all -99999 DB2 exceptions as fatal. That solution worked for us and stabilized our production systems during and after mainframe IPL activity, but I believe was recognized by all parties as not the ideal long term solution.
We (both the support customer I represent and RedHat) have since followed up with IBM support to request a comprehensive list of -99999 exceptions such that a complete solution could be developed. For various reasons, IBM support was not able to provide that list. They have told us that:
* -99999 error codes are not necessarily all fatal
* in the latest DB2 JDBC driver there are no longer any -99999 error codes
* they could not rule out the possibility of -99999 error codes in the future
Based on that, unless RedHat is going to withdraw support for all DB2 JDBC drivers except the latest, the best information we have is that there is the possibility of at least one -99999 error that is fatal, and that error has the text "Invalid operation: Connection is closed.".
If nothing is done in EAP 6.x and 7.x and the stock DB2ExceptionSorter is shipped, then when/if we migrate critical applications we will again encounter the nefarious persistent connection failures. Conversely, adding an exception sorter that treats all -99999 DB2 exceptions as fatal would be overly broad and could impact other customers in unanticipated ways.
It seems to me that the most correct behavior would be to code the DB2 exception sorter to specifically handle the -99999 error code with "Connection is closed" text and treat that as fatal. There is existing precedent for this sort of text parsing in the OracleExceptionSorter, and so I do not understand the hesitation to implement this in the DB2ExceptionSorter.
Alternatively, from Rich I understand that there may be an administrative option to configure specific exceptions as fatal for a given exception sorter. If that provides a facility to filter by error code and exception text, then that provides a reasonable path forward for this case. However, in that case RedHat is externalizing the cost of "DB2 support", i.e. figuring out the necessary configuration in order to make DB2 integration stable, to your customers and is in essence not actually supporting DB2 much beyond providing a generic configurable integration facility. Unfortunately, figuring out and providing the hard parts of the integration is a major part of the value of our support contract, and your DB2 support was a major factor in our purchase decision. As a result that direction would be disappointing at best.
As a result, rather than shipping one or two known-incomplete/broken DB2ExceptionSorter implementations, I propose coding a solution with the best information we have on hand. That is, a DB2ExceptionSorter that treats -99999 error codes with "Connection is closed" text as fatal.
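As a rough illustration of the check proposed above, a minimal sketch might look like the following. The class and method names are hypothetical, and the case-insensitive match is an assumption borrowed from the upper-casing done in the OracleExceptionSorter rather than anything confirmed for DB2.

```java
import java.sql.SQLException;
import java.util.Locale;

// Hypothetical sketch; not the actual DB2ExceptionSorter code.
final class Db2ClosedConnectionCheck {
    static final int WILDCARD_CODE = -99999;
    // Upper-cased so the comparison is case-insensitive (an assumption).
    static final String FATAL_TEXT = "CONNECTION IS CLOSED";

    // Fatal only when the code is -99999 AND the message mentions a closed connection.
    static boolean isFatal(SQLException e) {
        if (e.getErrorCode() != WILDCARD_CODE) {
            return false;
        }
        String message = e.getMessage();
        return message != null
            && message.toUpperCase(Locale.ROOT).contains(FATAL_TEXT);
    }

    public static void main(String[] args) {
        SQLException closed = new SQLException(
            "[ibm][db2][jcc] Invalid operation: Connection is closed.", null, -99999);
        SQLException unknown = new SQLException("Some other driver error", null, -99999);
        System.out.println(isFatal(closed));  // true
        System.out.println(isFatal(unknown)); // false
    }
}
```

Requiring both the code and the text keeps other -99999 errors non-fatal, which avoids the overly broad behavior of the earlier patch.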
My understanding, again from Rich and not any meaningful direct communication, is that there was further discussion on this issue and that RedHat is not willing to implement a combination of "-99999" error with "Connection is closed" text as a fatal exception, despite our definitive direct experience being that that exception represents a connection in a fatal condition. I would expect some consistency in decision making and in pursuit of that refer you to the org.jboss.resource.adapter.jdbc.vendor.OracleExceptionSorter, which includes the following code that casts an even wider net, across all non-user-defined error codes for certain strings:
    final String error_text = (e.getMessage()).toUpperCase();
    // Exclude oracle user defined error codes (20000 through 20999) from consideration when looking for
    // certain strings.
    if( ( error_code < 20000 || error_code >= 21000 ) &&
        ( (error_text.indexOf("SOCKET") > -1) //for control socket error
          || (error_text.indexOf("CONNECTION HAS ALREADY BEEN CLOSED") > -1)
          || (error_text.indexOf("BROKEN PIPE") > -1) ) )
Can you clarify that you have "definitive" information from Oracle that, for instance, any non-user-defined error codes containing the case-insensitive string "SOCKET" are fatal? If not, why the inconsistent decision making?
We would like to propose that we go with a combination of your recommendation of the "most correct behavior" and engineering's suggestion to use some configuration properties to allow for potential adaptations later.
The implementation of DB2ExceptionSorter would be changed to check for a -99999 ERROR with an associated "Connection is closed" message to identify a FATAL connection. Two configuration properties would be introduced that could modify this new behavior. Neither would need to be set; they exist only to allow further changes in behavior specific to the -99999 ERROR.
CONSIDER_9999_FATAL (default FALSE): setting to TRUE would have all -99999 ERRORs return as FATAL regardless of the associated message.
99999_MESSAGES_FATAL (default "Connection is closed"): a comma separated list of error messages that would indicate the -99999 ERROR return as FATAL.
These configuration properties could be set similar to:
<datasource jndi-name="java:jboss/datasources/ExampleDS" pool-name="ExampleDS" enabled="true" use-java-context="true">
<config-property name="99999_MESSAGES_FATAL">Connection is closed</config-property>
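To make the intended semantics of the two properties concrete, here is a minimal sketch of how they might drive the decision. Everything in it is an assumption for illustration: the class name, the setter names, the substring matching, and the omission of the sorter's other DB2 checks. It is not the code that shipped.

```java
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the two proposed properties; names are illustrative.
final class ConfigurableDb2SorterSketch {
    private static final int WILDCARD_CODE = -99999;

    // Models CONSIDER_9999_FATAL (default FALSE).
    private boolean considerAllFatal = false;
    // Models 99999_MESSAGES_FATAL (default "Connection is closed").
    private List<String> fatalMessages = parse("Connection is closed");

    void setConsiderAllFatal(boolean value) {
        this.considerAllFatal = value;
    }

    void setFatalMessages(String commaSeparated) {
        this.fatalMessages = parse(commaSeparated);
    }

    // Split a comma separated list, trimming whitespace and dropping empty entries.
    private static List<String> parse(String commaSeparated) {
        List<String> result = new ArrayList<>();
        for (String part : commaSeparated.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    boolean isExceptionFatal(SQLException e) {
        if (e.getErrorCode() != WILDCARD_CODE) {
            return false; // the sorter's other DB2 checks are omitted in this sketch
        }
        if (considerAllFatal) {
            return true;
        }
        String message = e.getMessage();
        if (message == null) {
            return false;
        }
        for (String fatal : fatalMessages) {
            if (message.contains(fatal)) { // substring match is an assumption
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ConfigurableDb2SorterSketch sorter = new ConfigurableDb2SorterSketch();
        SQLException closed = new SQLException(
            "[ibm][db2][jcc] Invalid operation: Connection is closed.", null, -99999);
        System.out.println(sorter.isExceptionFatal(closed)); // true
    }
}
```

With neither property set, only the "Connection is closed" variant of -99999 is fatal; setting the first property restores the broad behavior of the earlier patch, and the second lets an administrator add further messages without a code change.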
That seems like a good solution.
What do you think about https://bugzilla.redhat.com/show_bug.cgi?id=1276052#c18 ?
It's a solution I discussed privately with John before he posted it here, so I'm good with this.
Hi, the customer is still looking for an update.
Verified with EAP 6.4.15.CP.CR2
Released on May 18 as part of EAP 6.4.15.