1026088 – Deploying storage node fails if announcement of another node failed

Summary: Deploying storage node fails if announcement of another node failed

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	RHQ Project
Classification:	Other
Component:	Core Server
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	GA
Target Release:	RHQ 4.10
Assignee:	John Sanda
QA Contact:	Mike Foley
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1026108

Reported:	2013-11-03 15:10 UTC by John Sanda
Modified:	2014-04-23 12:29 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Clones:	1026108
Environment:
Last Closed:	2014-04-23 12:29:44 UTC
Embargoed:

Attachments
storageAnnounceError (131.05 KB, image/png) 2013-11-08 13:55 UTC, Armine Hovsepyan	no flags
storageNodesAfterAnnounceFailed (131.32 KB, image/png) 2013-11-08 13:55 UTC, Armine Hovsepyan	no flags
storageNewNodesAfterAnnounceFailed (137.68 KB, image/png) 2013-11-08 13:56 UTC, Armine Hovsepyan	no flags
alert-addToMaintFailure (127.77 KB, image/png) 2013-11-08 13:57 UTC, Armine Hovsepyan	no flags
storageNewNodesAfterAddToMainFailure (132.14 KB, image/png) 2013-11-08 13:58 UTC, Armine Hovsepyan	no flags

Description John Sanda 2013-11-03 15:10:04 UTC

Description of problem:
Start with a single-node cluster. Deploy a second node, N2. Suppose the deployment fails during the announce phase, which means N2 is left with an operation mode of ANNOUNCE. Now try to deploy a third node, N3. Deployment of N3 will fail as well, with an exception like:

09:54:52,310 ERROR [org.rhq.enterprise.server.storage.StorageNodeOperationsHandlerBean] (EJB default - 3) Aborting storage node deployment due to unexpected error while announcing cluster nodes.: javax.ejb.EJBException: javax.persistence.NonUniqueResultException: result returns more than one elements
.
.
.
at org.rhq.enterprise.server.storage.StorageNodeOperationsHandlerLocal$$$view137.handleAnnounce(Unknown Source) [rhq-server.jar:4.10.0-SNAPSHOT]
at org.rhq.enterprise.server.storage.StorageNodeOperationsHandlerBean.handleOperationUpdateIfNecessary(StorageNodeOperationsHandlerBean.java:304) [rhq-server.jar:4.10.0-SNAPSHOT]
.
.
.
Caused by: javax.persistence.NonUniqueResultException: result returns more than one elements
at org.hibernate.ejb.QueryImpl.getSingleResult(QueryImpl.java:293) [hibernate-entitymanager-4.2.0.CR1.jar:4.2.0.CR1]
at org.rhq.enterprise.server.storage.StorageNodeOperationsHandlerBean.findStorageNodeByMode(StorageNodeOperationsHandlerBean.java:769) [rhq-server.jar:4.10.0-SNAPSHOT]
at org.rhq.enterprise.server.storage.StorageNodeOperationsHandlerBean.handleAnnounce(StorageNodeOperationsHandlerBean.java:391) [rhq-server.jar:4.10.0-SNAPSHOT]

When the announce operation finishes on N1 and the agent reports the results to the server, the server fetches N3 using JPA's Query.getSingleResult() and the result set includes both N2 and N3. The filtering in the query is too broad.
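
To illustrate the failure mode, here is a minimal, self-contained sketch in plain Java. It is not the RHQ code: the class FindByModeSketch, the StorageNode type, the NORMAL mode, the IP addresses, and the singleResult() helper (a stand-in for the "exactly one row or throw" contract of Query.getSingleResult()) are all assumptions made for illustration. It only shows why a lookup keyed on operation mode breaks once two nodes share the ANNOUNCE mode.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FindByModeSketch {
    // Hypothetical subset of operation modes, for illustration only.
    enum OperationMode { NORMAL, ANNOUNCE }

    // Hypothetical stand-in for a storage node entity.
    static final class StorageNode {
        final String address;
        final OperationMode mode;

        StorageNode(String address, OperationMode mode) {
            this.address = address;
            this.mode = mode;
        }
    }

    // Stand-in for getSingleResult(): exactly one match, or an exception.
    static StorageNode singleResult(List<StorageNode> matches) {
        if (matches.size() != 1) {
            throw new IllegalStateException(
                "expected exactly one result, got " + matches.size());
        }
        return matches.get(0);
    }

    // Filters on operation mode only, which is the overly broad predicate.
    static StorageNode findByMode(List<StorageNode> nodes, OperationMode mode) {
        List<StorageNode> matches = new ArrayList<>();
        for (StorageNode n : nodes) {
            if (n.mode == mode) {
                matches.add(n);
            }
        }
        return singleResult(matches);
    }

    // N1 is healthy, N2 is stuck after a failed announce, N3 is being deployed.
    static List<StorageNode> cluster() {
        return Arrays.asList(
            new StorageNode("10.0.0.1", OperationMode.NORMAL),
            new StorageNode("10.0.0.2", OperationMode.ANNOUNCE),
            new StorageNode("10.0.0.3", OperationMode.ANNOUNCE));
    }

    public static void main(String[] args) {
        try {
            findByMode(cluster(), OperationMode.ANNOUNCE);
        } catch (IllegalStateException e) {
            // Both N2 and N3 match, so the single-result lookup fails.
            System.out.println("lookup failed: " + e.getMessage());
        }
    }
}
```

In this sketch the lookup for ANNOUNCE matches both N2 and N3 and throws, while a lookup for NORMAL still succeeds because only N1 matches.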

The (un)deployment code was designed and intended to allow deployments in scenarios like this where a previous deployment of another node may have failed and it is easier/faster to deploy a completely different node rather than retrying the failed deployment.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

Comment 1 John Sanda 2013-11-03 22:42:04 UTC

This issue could also occur during the UNANNOUNCE and REMOVE_MAINTENANCE phases of undeployment and the ADD_MAINTENANCE phase of deployment.

Comment 2 John Sanda 2013-11-04 01:27:23 UTC

I have committed a fix to master. Code has been refactored to query for the storage node being deployed by address (instead of operation mode) to ensure we avoid a NonUniqueResultException.

master commit hash: f330eb0d
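
The idea behind the fix can be sketched in plain Java as follows. This is not the committed code: the class FindByAddressSketch, the StorageNode type, the NORMAL mode, the IP addresses, and the use of Optional are assumptions for illustration, and the sketch assumes each node has a unique address. It shows that keying the lookup on address still identifies the node under deployment when another node is stuck in the same operation mode.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class FindByAddressSketch {
    // Hypothetical subset of operation modes, for illustration only.
    enum OperationMode { NORMAL, ANNOUNCE }

    // Hypothetical stand-in for a storage node entity.
    static final class StorageNode {
        final String address;
        final OperationMode mode;

        StorageNode(String address, OperationMode mode) {
            this.address = address;
            this.mode = mode;
        }
    }

    // Looks the node up by address, assumed unique per node, so nodes
    // that share an operation mode cannot produce a duplicate match.
    static Optional<StorageNode> findByAddress(List<StorageNode> nodes, String address) {
        return nodes.stream()
            .filter(n -> n.address.equals(address))
            .findFirst();
    }

    // N1 is healthy, N2 is stuck after a failed announce, N3 is being deployed.
    static List<StorageNode> cluster() {
        return Arrays.asList(
            new StorageNode("10.0.0.1", OperationMode.NORMAL),
            new StorageNode("10.0.0.2", OperationMode.ANNOUNCE),
            new StorageNode("10.0.0.3", OperationMode.ANNOUNCE));
    }

    public static void main(String[] args) {
        // N3 is found even though N2 is also in ANNOUNCE mode.
        findByAddress(cluster(), "10.0.0.3")
            .ifPresent(n -> System.out.println("found " + n.address + " in mode " + n.mode));
    }
}
```

Returning an Optional here is a design choice of the sketch: a lookup for an address that is not in the list yields an empty result instead of an exception.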

Comment 3 John Sanda 2013-11-06 02:44:51 UTC

The commit cited in comment 2 included a change that could result in querying for the node under deployment too soon, resulting in a NoResultException. It would not fail the deployment, but it added a lot of noise to the logs that can and should be avoided. I have gone ahead and fixed when the query is executed.

master commit hash: 7035640df

Comment 4 Armine Hovsepyan 2013-11-08 13:54:23 UTC

Verified in master: d3ea23b

Verification scenarios:

* ANNOUNCE Fail
During the installation of N1, remove the rhq-storage-auth config so that N2 cannot connect to N1.
Then restore rhq-storage-auth and install N3, which connects to N1.

* BOOTSTRAP Fail
I was unable to reproduce a scenario in which bootstrap would fail. It takes about 1-2 seconds to start and finish bootstrap after announce.

* ADD_MAINTENANCE Fail
Kill nodes during the add maintenance phase, so that N2 fails.
Then install N3 and connect it to N1.

Please see the corresponding screenshots attached.

Comment 5 Armine Hovsepyan 2013-11-08 13:55:03 UTC

Created attachment 821634 [details]
storageAnnounceError

Comment 6 Armine Hovsepyan 2013-11-08 13:55:43 UTC

Created attachment 821635 [details]
storageNodesAfterAnnounceFailed

Comment 7 Armine Hovsepyan 2013-11-08 13:56:11 UTC

Created attachment 821636 [details]
storageNewNodesAfterAnnounceFailed

Comment 8 Armine Hovsepyan 2013-11-08 13:57:13 UTC

Created attachment 821637 [details]
alert-addToMaintFailure

Comment 9 Armine Hovsepyan 2013-11-08 13:58:47 UTC

Created attachment 821638 [details]
storageNewNodesAfterAddToMainFailure

Comment 10 Heiko W. Rupp 2014-04-23 12:29:44 UTC

Bulk closing of 4.10 issues.

If an issue is not solved for you, please open a new BZ (or clone the existing one) with a version designator of 4.10.
