Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 874477

Summary: Reports fail to create a connection to database and right after fail to close JDBC lease for connection
Product: Red Hat Enterprise Virtualization Manager Reporter: Tomas Dosek <tdosek>
Component: ovirt-engine-reportsAssignee: Yaniv Lavi <ylavi>
Status: CLOSED DUPLICATE QA Contact: David Botzer <dbotzer>
Severity: medium Docs Contact:
Priority: medium    
Version: 3.0.7CC: bazulay, dyasny, iheim, juan.hernandez, pstehlik, Rhev-m-bugs, sgrinber, tvvcox, ykaul
Target Milestone: ---   
Target Release: 3.1.2   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: infra
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-12-05 09:16:14 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
rhevm reports log
none
rhevm.log none

Description Tomas Dosek 2012-11-08 09:50:55 UTC
Created attachment 640676 [details]
rhevm reports log

Description of problem:
This is a scenario that happened within large deployment of RHEV-M with RHEV-M reports. (environment containing 30 hosts and around 350 VMs).

During some database operation (not yet being reproduced in our test environment, partially probably because we're not able to simulate that large deployment) RHEV-M reports try to create a connection to database using JDBC connection pool in jboss.

This operation fails, but the connection lease from the pool fails to rollback.

As reports try to allocate this lease again and again (all failing to rollback)
jdbc connection pool, in pretty short time, contains no allocable connection threads to database which rapidly slows down operation of both RHEV-M and RHEV-M reports (for instance login procedure takes about 10 minutes).

This issue disappears while restarting rhevm and rhevm reports (as all database connections are forcibly closed).

Version-Release number of selected component (if applicable):
Sep 14 18:56:26 Updated: rhevm-3.0.7_0001-2.el6_3.x86_64
Apr 24 12:22:07 Installed: rhevm-reports-dwh-3.0.2-3.noarch
Apr 24 12:22:13 Installed: rhevm-reports-server-3.0.2-3.noarch
Apr 24 12:22:14 Installed: rhevm-reports-3.0.2-3.noarch

How reproducible:
Very hard to reproduce - only in large deployments

Steps to Reproduce:
Have a large rhev deployment - due to knowing bug some DB tables are growing rapidly and not being cleaned properly.
After some time this issue happens.

Actual results:
All connection leases from jdbc connection pool are opened even though that some of these should be automatically cleaned
Expected results:
JDBC connection rollback procedure should be finished with no problem and connection lease should be released back to jdbc connection pool.

Additional info:
Please see attached log files.

Comment 1 Tomas Dosek 2012-11-08 09:52:11 UTC
Created attachment 640677 [details]
rhevm.log

Comment 2 Tomas Dosek 2012-11-08 09:53:56 UTC
Here are some basic stats that customer collected for us during the issue. You can see that many postgres threads from rhevm and rhevm reports are left opened and idle for long time:

top - 12:25:04 up 25 days, 18:34,  2 users,  load average: 0.18, 0.10, 0.09
Tasks: 423 total,   9 running, 414 sleeping,   0 stopped,   0 zombie
Cpu0  : 15.9%us, 31.8%sy,  0.0%ni, 52.0%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 17.1%us, 32.9%sy,  0.0%ni, 49.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu2  : 23.8%us, 29.2%sy,  0.0%ni, 46.3%id,  0.3%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu3  : 18.2%us, 30.4%sy,  0.0%ni, 51.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu4  : 22.6%us, 30.6%sy,  0.0%ni, 46.5%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu5  : 17.0%us, 31.6%sy,  0.0%ni, 51.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  : 16.1%us, 32.4%sy,  0.0%ni, 50.5%id,  1.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  : 18.2%us, 31.0%sy,  0.0%ni, 50.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16330452k total, 15539644k used,   790808k free,   304528k buffers
Swap: 13766648k total,   294440k used, 13472208k free,  8218072k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                            
 2913 postgres  20   0  229m  51m  37m R 22.2  0.3  15:32.77 postmaster                                                                         
 2480 postgres  20   0  228m  50m  37m S 11.9  0.3  14:14.93 postmaster                                                                         
12419 postgres  20   0  234m  57m  37m S 11.9  0.4  66:52.33 postmaster                                                                         
13731 postgres  20   0  237m  59m  37m R 11.9  0.4 114:39.17 postmaster                                                                         
16749 postgres  20   0  238m  61m  37m S 11.9  0.4  96:26.08 postmaster                                                                         
 1397 postgres  20   0  241m  63m  37m S 11.3  0.4 115:46.08 postmaster                                                                         
17030 postgres  20   0  240m  61m  37m S 11.3  0.4  97:03.90 postmaster                                                                         
 2382 postgres  20   0  229m  52m  37m R 10.9  0.3  12:07.04 postmaster                                                                         
 4699 postgres  20   0  237m  57m  37m S 10.9  0.4  53:26.55 postmaster                                                                         
 7844 postgres  20   0  230m  53m  37m S 10.9  0.3  31:15.54 postmaster                                                                         
13750 postgres  20   0  241m  61m  37m S 10.9  0.4 120:46.13 postmaster                                                                         
17031 postgres  20   0  232m  57m  37m S 10.9  0.4 101:59.40 postmaster                                                                         
17033 postgres  20   0  238m  63m  37m R 10.9  0.4  80:01.09 postmaster                                                                         
 1408 postgres  20   0  240m  63m  37m S 10.6  0.4 112:22.77 postmaster                                                                         
 7910 postgres  20   0  233m  55m  37m S 10.6  0.4  48:19.63 postmaster                                                                         
 9221 postgres  20   0  237m  59m  37m S 10.6  0.4  83:29.28 postmaster                                                                         
13743 postgres  20   0  240m  63m  37m S 10.6  0.4 107:35.24 postmaster                                                                         
17029 postgres  20   0  240m  62m  37m S 10.6  0.4  92:18.53 postmaster                                                                         
17034 postgres  20   0  232m  57m  37m S 10.6  0.4  96:58.75 postmaster                                                                         
17035 postgres  20   0  237m  59m  37m S 10.6  0.4  97:23.49 postmaster                                                                         
28777 postgres  20   0  220m  45m  37m S 10.6  0.3   4:41.75 postmaster                                                                         
29109 postgres  20   0  233m  56m  37m S 10.6  0.4  87:44.67 postmaster                                                                         
31672 postgres  20   0  234m  58m  37m S 10.6  0.4  97:55.69 postmaster                                                                         
 3766 postgres  20   0  238m  60m  37m R 10.3  0.4  72:48.45 postmaster                                                                         
 4697 postgres  20   0  236m  58m  37m R 10.3  0.4  48:54.06 postmaster                                                                         
12211 postgres  20   0  233m  55m  37m S 10.3  0.3  19:58.43 postmaster                                                                         
17026 postgres  20   0  238m  61m  37m S 10.3  0.4  97:00.58 postmaster                                                                         
28298 postgres  20   0  237m  59m  37m S 10.3  0.4  70:48.95 postmaster                                                                         
 2570 postgres  20   0  231m  56m  37m R  9.9  0.4  24:10.65 postmaster                                                                         


[root@jerez1 log]# ps -fe | grep postgres
postgres  1397  7973  8 Nov05 ?        01:55:46 postgres: rhevm rhevm 127.0.0.1(51477) idle       
postgres  1408  7973  8 Nov05 ?        01:52:25 postgres: rhevm rhevm 127.0.0.1(51488) SELECT     
postgres  1918  7973  6 09:07 ?        00:12:15 postgres: rhevm rhevm 127.0.0.1(42892) idle       
postgres  1985  7973  6 09:07 ?        00:13:48 postgres: rhevm rhevm 127.0.0.1(42970) SELECT     
postgres  1986  7973  3 09:07 ?        00:07:11 postgres: rhevm rhevm 127.0.0.1(42971) SELECT     
postgres  2190  7973  0 11:56 ?        00:00:03 postgres: postgres rhevmreports 127.0.0.1(57686) idle
postgres  2191  7973  0 11:56 ?        00:00:00 postgres: postgres rhevmreports 127.0.0.1(57687) idle
postgres  2382  7973  6 09:09 ?        00:12:10 postgres: rhevm rhevm 127.0.0.1(43449) BIND       
postgres  2480  7973  7 09:09 ?        00:14:18 postgres: rhevm rhevm 127.0.0.1(43570) SELECT     
postgres  3766  7973  8 Nov05 ?        01:12:51 postgres: rhevm rhevm 127.0.0.1(40140) SELECT     
postgres  4696  7973  7 01:16 ?        00:50:35 postgres: rhevm rhevm 127.0.0.1(51754) idle       
postgres  4697  7973  7 01:16 ?        00:48:57 postgres: rhevm rhevm 127.0.0.1(51755) SELECT     
postgres  4699  7973  7 01:16 ?        00:53:29 postgres: rhevm rhevm 127.0.0.1(51757) SELECT     
postgres  5245  7973  0 12:11 ?        00:00:00 postgres: postgres rhevmreports 127.0.0.1(33444) idle
postgres  5246  7973  0 12:11 ?        00:00:00 postgres: postgres rhevmreports 127.0.0.1(33445) idle
postgres  7844  7973  6 04:12 ?        00:31:18 postgres: rhevm rhevm 127.0.0.1(47703) SELECT     
postgres  7845  7973  7 04:12 ?        00:38:35 postgres: rhevm rhevm 127.0.0.1(47705) idle       
postgres  7863  7973  7 04:12 ?        00:37:59 postgres: rhevm rhevm 127.0.0.1(47725) SELECT     
postgres  7910  7973  7 01:32 ?        00:48:22 postgres: rhevm rhevm 127.0.0.1(59346) SELECT     
postgres  7973     1  0 Oct11 ?        00:01:42 /usr/bin/postmaster -p 5432 -D /var/lib/pgsql/data
postgres  8070  7973  0 Oct11 ?        00:00:37 postgres: logger process                          
root      8073  9747  0 12:25 pts/0    00:00:00 grep postgres
postgres  8074  7973  0 Oct11 ?        02:14:40 postgres: writer process                          
postgres  8075  7973  0 Oct11 ?        00:08:54 postgres: wal writer process                      
postgres  8076  7973  0 Oct11 ?        00:02:14 postgres: autovacuum launcher process             
postgres  8077  7973  0 Oct11 ?        00:19:36 postgres: stats collector process                 
postgres  9221  7973  8 Nov05 ?        01:23:32 postgres: rhevm rhevm 127.0.0.1(34942) SELECT     
root      9301     1  1 Oct11 ?        07:49:27 java -Xms256M -Xmx1024M -cp /usr/share/rhevm-dwh/etl:/usr/share/rhevm-dwh/etl/historyETLProcedure-3.0.2.jar:/usr/share/rhevm-dwh/etl/lib/advancedPersistentLookupLib-1.0.jar:/usr/share/rhevm-dwh/etl/lib/commons-collections.jar:/usr/share/rhevm-dwh/etl/lib/jboss-serialization.jar:/usr/share/rhevm-dwh/etl/lib/log4j.jar:/usr/share/rhevm-dwh/etl/lib/postgresql-jdbc.jar:/usr/share/rhevm-dwh/etl/lib/trove.jar:/usr/share/rhevm-dwh/etl/lib/dom4j.jar:/usr/share/rhevm-dwh/etl/lib/talendRoutines-4.2.2.r63143.jar:/usr/share/rhevm-dwh/etl/lib/xml-apis-1.0.b2.jar rhevm_reports.historyetl_3_0.HistoryETL --context=Default
postgres  9348  7973  0 Oct11 ?        00:11:44 postgres: postgres rhevm 127.0.0.1(52662) idle    
postgres  9349  7973  0 Oct11 ?        03:06:37 postgres: postgres rhevm_history 127.0.0.1(52663) idle
postgres  9350  7973  0 Oct11 ?        00:32:58 postgres: postgres rhevm_history 127.0.0.1(52664) idle in transaction
postgres  9351  7973  0 Oct11 ?        00:00:35 postgres: postgres rhevm_history 127.0.0.1(52665) idle
postgres  9352  7973  0 Oct11 ?        00:36:49 postgres: postgres rhevm_history 127.0.0.1(52666) idle in transaction
postgres 12211  7973  6 07:14 ?        00:20:01 postgres: rhevm rhevm 127.0.0.1(41475) SELECT     
postgres 12419  7973  8 Nov05 ?        01:06:55 postgres: rhevm rhevm 127.0.0.1(40304) SELECT     
postgres 13731  7973  8 Nov05 ?        01:54:42 postgres: rhevm rhevm 127.0.0.1(47062) SELECT     
postgres 13734  7973  9 Nov05 ?        02:00:31 postgres: rhevm rhevm 127.0.0.1(47064) idle       
postgres 13741  7973  8 Nov05 ?        01:47:03 postgres: rhevm rhevm 127.0.0.1(47071) SELECT     
postgres 13743  7973  8 Nov05 ?        01:47:38 postgres: rhevm rhevm 127.0.0.1(47073) SELECT     
postgres 13746  7973  8 Nov05 ?        01:54:02 postgres: rhevm rhevm 127.0.0.1(47076) idle       
postgres 13750  7973  9 Nov05 ?        02:00:49 postgres: rhevm rhevm 127.0.0.1(47080) SELECT     
postgres 14785  7973  7 02:05 ?        00:44:30 postgres: rhevm rhevm 127.0.0.1(40370) SELECT     
postgres 16747  7973  8 Nov05 ?        01:28:45 postgres: rhevm rhevm 127.0.0.1(48960) idle       
postgres 16749  7973  8 Nov05 ?        01:36:28 postgres: rhevm rhevm 127.0.0.1(48963) idle       
postgres 16751  7973  8 Nov05 ?        01:32:29 postgres: rhevm rhevm 127.0.0.1(48965) idle       
postgres 17026  7973  8 Nov05 ?        01:37:03 postgres: rhevm rhevm 127.0.0.1(51112) SELECT     
postgres 17029  7973  8 Nov05 ?        01:32:21 postgres: rhevm rhevm 127.0.0.1(51115) SELECT     
postgres 17030  7973  8 Nov05 ?        01:37:06 postgres: rhevm rhevm 127.0.0.1(51116) SELECT     
postgres 17031  7973  9 Nov05 ?        01:42:02 postgres: rhevm rhevm 127.0.0.1(51117) SELECT     
postgres 17033  7973  7 Nov05 ?        01:20:04 postgres: rhevm rhevm 127.0.0.1(51119) SELECT     
postgres 17034  7973  8 Nov05 ?        01:37:01 postgres: rhevm rhevm 127.0.0.1(51120) SELECT     
postgres 17035  7973  8 Nov05 ?        01:37:26 postgres: rhevm rhevm 127.0.0.1(51122) SELECT     
postgres 17036  7973  8 Nov05 ?        01:35:07 postgres: rhevm rhevm 127.0.0.1(51123) idle       
postgres 17037  7973  9 Nov05 ?        01:44:05 postgres: rhevm rhevm 127.0.0.1(51125) SELECT     
postgres 18000  7973  8 05:02 ?        00:35:54 postgres: rhevm rhevm 127.0.0.1(33268) idle       
postgres 20435  7973  8 Nov05 ?        01:51:17 postgres: rhevm rhevm 127.0.0.1(58210) SELECT     
postgres 22037  7973  8 10:51 ?        00:07:52 postgres: rhevm rhevm 127.0.0.1(40941) SELECT     
postgres 22038  7973  4 10:51 ?        00:04:13 postgres: rhevm rhevm 127.0.0.1(40942) idle       
postgres 22040  7973  6 10:51 ?        00:06:05 postgres: rhevm rhevm 127.0.0.1(40946) idle       
postgres 28298  7973  8 Nov05 ?        01:10:51 postgres: rhevm rhevm 127.0.0.1(38574) idle       
postgres 28486  7973  0 11:26 ?        00:00:00 postgres: postgres rhevmreports 127.0.0.1(49853) idle
postgres 28777  7973  8 11:27 ?        00:04:44 postgres: rhevm rhevm 127.0.0.1(50128) SELECT     
postgres 28778  7973  4 11:27 ?        00:02:45 postgres: rhevm rhevm 127.0.0.1(50132) idle       
postgres 28780  7973  5 11:27 ?        00:03:26 postgres: rhevm rhevm 127.0.0.1(50134) idle       
postgres 29109  7973  8 Nov05 ?        01:27:47 postgres: rhevm rhevm 127.0.0.1(38069) SELECT     
postgres 31672  7973  8 Nov05 ?        01:37:58 postgres: rhevm rhevm 127.0.0.1(52862) SELECT     
[root@jerez1 log]#

Comment 3 Barak 2012-11-25 11:01:06 UTC
Yaniv,

Do we think it's zstream ?
Do we know what the problem is ? How easy/hard to fix ?

Comment 4 Yaniv Lavi 2012-11-28 12:46:23 UTC
(In reply to comment #3)
> Yaniv,
> 
> Do we think it's zstream ?
> Do we know what the problem is ? How easy/hard to fix ?

Juan, you explored the JRS connection usage. any clue on this?



Yaniv

Comment 5 Juan Hernández 2012-11-29 14:50:37 UTC
As far as I know the reports application uses its own connection pool, separated from the one used by the engine, at least this it how it works in 3.1. We found that while working in a different bug, see [1].

So if database connection pool is exhausted it should take at most 20 connections, not all the connections available for the engine.

Tomas, you say that there are lot of PostgreSQL threads idle for a long time, but what I see in the output of top in comment #2 is that the have a lot of accumulated CPU time, which means that they have been very busy. Can you elaborate?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=856273#c13

Comment 6 Juan Hernández 2012-11-29 15:00:03 UTC
In the rhevm.log file attached the database errors I see haven't been generated by reports, but by the VDS broker, so this looks more and more similar to bug 856273.

Why do we think that this is related to reports?

Comment 7 Tomas Dosek 2012-12-03 07:20:30 UTC
We've reproduced this on environment where database was overloaded with reports data and was not cleaned up properly. 
Looking at the bug you've posted here it seems you're right and this one can be closed as duplicate.

Comment 8 Yaniv Kaul 2012-12-03 13:37:59 UTC
Medium severity, proposing to 3.1.2, not critical for 3.1.1.

Comment 9 Yaniv Lavi 2012-12-05 09:16:14 UTC

*** This bug has been marked as a duplicate of bug 856273 ***