Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1083497

Summary:

After 3.2.0.GA update-1 patch installed "Performing hourly database maintenance" job throws exception

Product:

[JBoss] JBoss Operations Network

Reporter:

Jeeva Kandasamy <jkandasa>

Component:

Core Server

Assignee:

Simeon Pinder <spinder>

Status:

CLOSED DUPLICATE

QA Contact:

Mike Foley <mfoley>

Severity:

urgent

Docs Contact:

Priority:

unspecified

Version:

JON 3.2.1

CC:

hrupp, jkandasa, jshaughn, lkrejci, loleary, myarboro, rhatlapa, spinder

Target Milestone:

---

Keywords:

Regression, Triaged

Target Release:

JON 3.2.2

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2014-06-19 14:42:15 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
server.log	none
server.log	none

Description Jeeva Kandasamy 2014-04-02 10:58:32 UTC

Created attachment 881728 [details]
server.log

Description of problem:
"Performing hourly database maintenance" job throws exception,
17:00:14,821 ERROR [org.jboss.as.ejb3.invocation] (EJB default - 3) JBAS014134: EJB Invocation failed on component SystemManagerBean for method public abstract org.rhq.core.domain.common.composite.SystemSettings org.rhq.enterprise.server.system.SystemManagerLocal.getObfuscatedSystemSettings(boolean): javax.ejb.EJBTransactionRolledbackException: JBAS014373: EJB 3.1 PFD2 4.8.5.5.1 concurrent access timeout on org.jboss.invocation.InterceptorContext$Invocation@72e14cb0 - could not obtain lock within 5000MILLISECONDS



Version-Release number of selected component (if applicable):

JBoss Operations Network
Version : 3.2.0.GA update-1
Build Number : 13f0eea:c372b58
GWT Version : 2.5.0
SmartGWT Version : 3.0p

How reproducible:
Always

Steps to Reproduce:
1. Install 3.2.1 GA
2. Import all the available resources(Inventory)
3. Stop all the services (./rhqctl stop)
4. Apply update-1 patch
5. Start all the services (./rhqctl start)
6. Monitor server.log and wait till next hour


Actual results:
"Performing hourly database maintenance" job throws exception

Expected results:
"Performing hourly database maintenance" job should not throw any exception. Should run perfectly.

Additional info: Exception stack trace is attached

Comment 1 Simeon Pinder 2014-04-02 19:13:57 UTC

I can't reproduce this problem.  I have two different patched instances(fedora and windows 7) that have been successfully doing hourly database maintenance jobs for the last 12 hours and from install time and I don't see this.  

Looking around I did find this issue that looks related and does affect EAP 6.1.1(https://bugzilla.redhat.com/show_bug.cgi?id=990102). 

Can QE provide more details on:
- how many times this has occurred on the same box? (retested more than once?)
- was the JON 3.2.0.GA instance(before patch) shut down normally?
- what other steps I can take to reproduce this issue?

Comment 2 Larry O'Leary 2014-04-02 20:02:24 UTC

In addition to comment 1, can we get the database vendor/version for this? Perhaps that is very relevant considering our other Oracle issues.

Comment 4 Jeeva Kandasamy 2014-04-03 12:16:43 UTC

Created attachment 882246 [details]
server.log

Database: postgres (PostgreSQL) 8.4.13
OS: Red Hat Enterprise Linux Server release 6.4 (Santiago), Linux rhel6-vm 2.6.32-358.el6.x86_64 #1 SMP Tue Jan 29 11:47:41 EST 2013 x86_64 x86_64 x86_64 GNU/Linux
RAM: 2GB, CPU: 2

I am able to reproduce this issue in the following condition, 

- JON server/agent/storage all are installed on the same box.
- "Performing hourly database maintenance" is running hourly once,
- Stop jon server (./rhqctl stop)
- Change the system time to xx:58 (HH:mm)
- Start all the services
- If the this issue not reproduced, increase the system time to xx:58:30 (HH:mm:ss)


In my observation if we get the following lock release before triggering hourly job, we never get this this issue.
-------------snap----------------
11:59:16,588 INFO  [org.rhq.enterprise.server.util.concurrent.AvailabilityReportSerializer] (http-/0.0.0.0:7080-3) tid=247; agent=rhel6-vm: releasing write lock after being locked for millis=5443
-------------snap----------------

After this exception, in a few moment there was a lock release INFO, From this point there is no exception on hourly job.

-------------snap----------------
17:01:05,029 INFO  [org.rhq.enterprise.server.util.concurrent.InventoryReportSerializer] (http-/0.0.0.0:7080-1) tid=225; agent=rhel6-vm: releasing write lock after being locked for millis=8946
17:01:45,654 INFO  [org.rhq.enterprise.server.util.concurrent.InventoryReportSerializer] (http-/0.0.0.0:7080-2) tid=251; agent=rhel6-vm: releasing write lock after being locked for millis=20650
-------------snap----------------

Detailed Log is attached.

Comment 5 Jay Shaughnessy 2014-04-03 19:27:37 UTC

I think I understand what is happening here. The StorageClientManagerBean executes a job every 90s to ensure that StorageSession information is normalized across HA nodes. As part of that work it calls SystemManager.getObfuscatedSystemSettings(). SystemManagerBean is declared with @Singleton. By default, if a request to a singleton bean can not be serviced within 5s this problem will occur, which is a ConcurrentAccessTimeoutException. Note that this is not a DB locking issue, it is a bean locking issue.

So, given that Singleton beans have an instance count of 1, it's critical that they perform invocations quickly or risk the aforementioned timeout. That is not always the case with SystemManagerBean, which performs some potentially very long invocations during (you guessed it) DataPurge. This problem can be seen when testing against a large and/or slow DB. In particular, DataPurge runs a DB analyze. In the attached log this analyze took 20s. It can also run things like vacuum and re-index, also potentially long-running.

If the StorageClient job executes while the analyze is active, this can occur. As well as during any other long-running method.

The good news is that the StorageClient job will re-execute 90s later, so things should stabilize. The bad news is that really any call into SystemManagerBean could suffer this problem during the hourly data purge (more likely at midnight where we do extra stuff).

What to do?

It wasn't long ago that SystemManagerBean was changed to a Singleton. It was in
release/jon3.2.x commit cb64f4333f2285a81bb9fc8229aa84b69c4aa16c. I don't know why because that aspect of the change is not discussed in the commit or Bug 743558.

Changing it back to Stateless would solve this problem but possible wreck that fairly complex work. So, I wouldn't recommend that approach.

Another approach would be to migrate these offending long-running methods to a different bean. They are all Local methods, so that makes it pretty easy to do, no issues with the public API.

Still another approach is to apply a generous timeout override and just let callers wait as long as it takes to access the Singleton bean. But this could cause other issues. For example, StorageClientManagerBean is itself a Singleton, so it would be tied up waiting and callers into *it* may timeout.

Doing nothing might be a short term solution as the StorageNode use case should repair itself, and other calls in should work when db maintenance ends. But overall I'd suggest we do something recommended above.

Comment 6 Simeon Pinder 2014-04-03 20:00:31 UTC

Comment 5 appears to hit the nail on the head.  If you look through the attached sever.log you'll see:

# hourly maint launched
- 17:00:10,921 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-1) Performing hourly database maintenance
# exception
17:00:41,645 ERROR [org.jboss.as.ejb3] (EJB default - 3) javax.ejb.EJBTransactionRolledbackException: JBAS014373: EJB 
# retry
17:00:41,709 INFO .... will be retried
# success
17:00:42,660 INFO  [org.rhq.enterprise.server.scheduler.jobs.DataPurgeJob] (RHQScheduler_Worker-1) Database maintenance completed in [31745]ms

It would appear that this is not related to the cumulative patch at all and is also a problem in GA as well. It also looks worse than it actually is since the recovery happens quickly after. What remains is a question of how bad for other calls into SystemManagerBean.  
@Heiko can you comment on other effects here? I'm a bit concerned about other performance issues that could show up under heavier load. Maybe they're recovered  from just as quickly? Is this something we should get into the cumulative patch 1 anyway?

@Lukas, can you comment on Jay's possible solutions? Is there a different fix not yet mentioned that makes more sense but has less side effects?

Comment 7 Lukas Krejci 2014-04-03 21:37:26 UTC

I vaguely remember making SystemManagerBean a @Singleton purely for the caching reasons - I.e. to prevent several possibly outdated copies of cached DB rows in the memory - 1 per bean instance.

Not sure what would be the ramifications of making the cache again a static field on the bean as it was before, but it certainly sounds incorrect from the EJB spec perspective.

I wasn't aware of the timing issues for @Singleton beans.

Comment 8 Heiko W. Rupp 2014-04-04 10:25:44 UTC

EJB 3.1. spec stays in 4.8.5 that the default for Singletons is Container Managed Concurrency and then refines in 4.8.5.1

With Container Managed Concurrency demarcation, the container is responsible for controlling concur- rent access to the bean instance based on method-level locking metadata. Each business method or tim- eout method is associated with either a Read(shared) lock or Write (exclusive) lock.

If the container invokes a method associated with a Read lock, any number of other concurrent invoca- tions on Read methods are allowed to access the bean instance simultaneously.

If the container invokes a method associated with a Write lock, no other concurrent invocations will be allowed to proceed until the initial Write method’s processing completes.

A concurrent access attempt that is not allowed to proceed due to locking is blocked until it can make forward progress.

Going forward to 4.8.5.4 states:

By default, the value of the lock associated with a method of a bean with container managed concurrency demarcation is Write(exclusive), and the concur- rency lock attribute does not need to be explicitly specified in this case.

So basically by not explicitly saying that we want shared access the container forces exclusive locking and this is what we see.

So in my understanding we should add @Lock(READ) on class level and then @Lock(WRITE) on some selected methods that modify data.
For this specific case, we should probably only put @Lock(READ) on the getObfus*() method.

Comment 9 Heiko W. Rupp 2014-04-04 10:33:59 UTC

Actually on top of the before we may need to re-arrange 
getUnmaskedSystemSettings() as this modifies the cachedSystemSettings field.
This may need the write lock, where we may see the same behavior then as before as the getObfus*() calls this and populating the cache may take long time.

So the other option is to add a @AccessTimeout as in 

    @Lock(LockType.READ)
    @AccessTimeout(value = 10, unit = TimeUnit.SECONDS)
    public SystemSettings getObfuscatedSystemSettings(boolean includePrivateSettings) {

Comment 10 Jay Shaughnessy 2014-04-04 13:30:46 UTC

Heiko, good information.  We can play with the settings you recommend, but it's not actually getObfuscatedSystemSettings() or getUnmaskedSystemSettings() that are the issue.  I think those could remain with the default Write lock, because they are fast.

The problem is that the long-running methods: analyze, reindex, vacuum are holding Write locks.  So when analyze() runs (as in this log) nothing else can be executed.  If these methods had Read locks applied I think we'd be OK.  As I read the snippet you posted, it sounds like several Read-locked methods can execute along with one Write-locked method, simultaneously.

This could be a pretty easy fix, even easier than moving the slow-movers to a different bean.

Comment 12 Jay Shaughnessy 2014-06-19 14:42:15 UTC

Marking this as a duplicate of Bug 1071330, although the information here is important to the work on that BZ.  I've noted that in a comment on the other BZ.

*** This bug has been marked as a duplicate of bug 1071330 ***