Bug 1141764

Summary: RHQ is broken on Oracle - ORA-02049: timeout
Product: [Other] RHQ Project
Reporter: Filip Brychta <fbrychta>
Component: Core Server, Database
Assignee: Nobody <nobody>
Status: VERIFIED ---
QA Contact:
Severity: urgent
Docs Contact:
Priority: urgent
Version: 4.13
CC: fbrychta, hrupp, jshaughn, tsegismo
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1141969
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1141969    
Attachments:
server log (flags: none)

Description Filip Brychta 2014-09-15 12:12:29 UTC
Description of problem:
This issue was discovered by nightly automation, where bundle, drift, and some operation tests failed with ORA-02049: timeout. There are no problems when running the same automation against PostgreSQL.

The last successful automation run on Oracle was on RHQ build dcd5159, so this issue is probably caused by commit 2d48aa4.

I am not sure which exact step causes this problem, but here is what I did to reproduce it manually:
1 - deploy a few bundles to a group of platforms
2 - wait ~ 30 minutes

Result:
Step 1 completed without errors, but after ~30 minutes ORA-02049: timeout exceptions appeared in server.log even though no further action was taken.

Searching for better repro steps...

Version-Release number of selected component (if applicable):
Version: 4.13.0-SNAPSHOT
Build Number: 58023b7

How reproducible:
2/2

Comment 4 Filip Brychta 2014-09-16 08:23:02 UTC
This issue is not related to bundles directly. The failed bundle tests are the result of an already "broken" RHQ server. Here is a scenario that breaks RHQ:
1- have an RHQ server with 2 agents, where the remote agent is monitoring EAP6
2- import all resources
3- wait until all resources are UP
4- uninventory the EAP6 standalone resource
5- inventory it again
6- wait for a while

Result:
The first 'ORA-02049: timeout' messages appear in server.log.
From this point on RHQ is broken, and subsequent operations (scheduling an operation, deploying a bundle...) hit the ORA-02049: timeout exception as well, even though those operations worked earlier.


The first timeout exceptions are caused by:
03:54:55,221 ERROR [org.rhq.enterprise.server.resource.ResourceManagerBean] (RHQScheduler_Worker-5) Bulk named query delete error for 'ResourceOperationHistory.deleteByResources' for [10101]:

The important point here is that the resource with the given resourceId '10101' doesn't exist.

See the complete server.log (DEBUG level) and search for 'ORA-02049: timeout'.
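Triage of a log like this one amounts to finding the first 'ORA-02049: timeout' and the bulk-delete errors that precede it. The following is a minimal Python sketch of that search, not part of RHQ; the function names are hypothetical, and the regular expression assumes the log format of the single ResourceManagerBean line quoted above.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Assumed format, based only on the one ResourceManagerBean line quoted in this bug.
BULK_DELETE_RE = re.compile(
    r"Bulk named query delete error for '(?P<query>[^']+)' for \[(?P<ids>[\d,\s]*)\]"
)
TIMEOUT_MARKER = "ORA-02049: timeout"


def first_timeout(lines: Iterable[str]) -> Optional[Tuple[int, str]]:
    """Return (1-based line number, line) of the first ORA-02049 timeout, or None."""
    for number, line in enumerate(lines, start=1):
        if TIMEOUT_MARKER in line:
            return number, line.rstrip("\n")
    return None


def bulk_delete_failures(lines: Iterable[str]) -> List[Tuple[str, List[int]]]:
    """Collect (named query, resource ids) from 'Bulk named query delete error' lines."""
    failures = []
    for line in lines:
        match = BULK_DELETE_RE.search(line)
        if match:
            # int() tolerates the surrounding spaces left by splitting "1, 2" on commas.
            ids = [int(part) for part in match.group("ids").split(",") if part.strip()]
            failures.append((match.group("query"), ids))
    return failures
```

Both functions accept any iterable of lines, so an open file works directly, e.g. `with open("server.log") as f: print(first_timeout(f))`. Comparing the extracted ids against the current inventory would show whether a delete targets a resource that no longer exists, as with 10101 above.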

Comment 5 Filip Brychta 2014-09-16 08:23:46 UTC
Created attachment 937907 [details]
server log

Comment 6 Filip Brychta 2014-09-16 11:57:43 UTC
I reproduced this issue even with DISTRIBUTED_LOCK_TIMEOUT=300 (the default is 60 seconds).

Comment 7 Jay Shaughnessy 2014-09-16 20:04:47 UTC
Filip, this should be fixed with master commit ed8b9aa1192066945fdd843e33ca3d7a059a23cf.  If so please close, otherwise let me know, thanks, Jay

Comment 8 Filip Brychta 2014-09-17 07:05:54 UTC
The manual test is OK. Waiting for automation results...

Comment 9 Filip Brychta 2014-09-17 13:21:07 UTC
Verified on
Version: 4.13.0-SNAPSHOT
Build Number: afb91cb