857684 – [engine] Engine fails to reconnect to postgres

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-engine
Sub Component:
Version:	3.1.0
Hardware:	Unspecified
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Yair Zaslavsky
QA Contact:	Pavel Stehlik
Docs Contact:
URL:
Whiteboard:	infra
Depends On:
Blocks:

Reported:	2012-09-16 06:14 UTC by Gadi Ickowicz
Modified:	2016-02-10 19:44 UTC
CC List:	11 users
Fixed In Version:	si21
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2012-12-04 20:04:58 UTC
oVirt Team:	Infra
Target Upstream Version:
Embargoed:

Attachments
engine logs (418.83 KB, application/x-gzip) 2012-09-16 06:14 UTC, Gadi Ickowicz

Description Gadi Ickowicz 2012-09-16 06:14:23 UTC

Created attachment 613377
engine logs

Description of problem:
After updating postgres with yum update, the engine could no longer access the database until the engine was restarted.

Version-Release number of selected component (if applicable):
rhevm-3.1.0-15.el6ev.noarch

How reproducible:
?

Steps to Reproduce:
1.
2.
3.
  
Actual results:
Engine fails to reconnect to database after postgres process is restarted

Expected results:
The engine should handle a PID change of postgres and/or restart the postgres process on its own if no connection to the database can be established

Additional info:

Comment 5 Yair Zaslavsky 2012-10-04 07:12:09 UTC

Juan, any reason why NOT to add the following to the data source definition?



<validation>
  <check-valid-connection-sql>select 1</check-valid-connection-sql>
</validation>


I just tried that on my jboss-eap-6.0 setup, and it looks like it's working.
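
For context, here is a minimal sketch of where that fragment would sit inside a JBoss AS 7 style datasource definition in standalone.xml. The JNDI name, pool name, connection URL, driver name and credentials below are placeholders assumed for illustration, not values taken from this bug; only the validation block comes from the comment above.

```xml
<!-- Hypothetical datasource entry; names, URL and credentials are placeholders -->
<datasource jndi-name="java:/ExampleDataSource" pool-name="ExampleDataSource" enabled="true">
  <connection-url>jdbc:postgresql://localhost:5432/exampledb</connection-url>
  <driver>postgresql</driver>
  <security>
    <user-name>example_user</user-name>
    <password>example_password</password>
  </security>
  <!-- SQL statement the pool uses to check that a connection is still valid -->
  <validation>
    <check-valid-connection-sql>select 1</check-valid-connection-sql>
  </validation>
</datasource>
```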

Comment 6 Yair Zaslavsky 2012-10-04 08:06:11 UTC

After consulting with mkublin -

http://gerrit.ovirt.org/#/c/8346/

Comment 7 Itamar Heim 2012-10-04 08:12:31 UTC

When did we lose it?
It's in:
./backend/manager/modules/bll/target/test-classes/deploy/postgres-ds.xml

Comment 8 Yair Zaslavsky 2012-10-04 08:21:28 UTC

I suspect it got lost in the transition to JBoss-AS-7 (moving from postgres-ds.xml to standalone.xml)
I just checked my z-stream env - it's there.

Comment 9 Juan Hernández 2012-10-04 08:33:14 UTC

Please note the following:

1. Adding this configuration means that we run an additional useless query before each useful query, which adds load on the engine and on the database. Probably not a big deal. It also adds a network round trip for each useful query, which can be relevant for remote database installations. In other words, it has a price; it is not free.

2. This will *reduce* the chances that the application gets a broken connection, but won't make sure it doesn't. The connection can still break after running "select 1" but before running the useful query, or while running the useful query. The application still needs to be prepared for this kind of failure.

3. How did you verify this? By doing "service postgresql restart"? With such a quick stop/start procedure the engine probably will not even notice the shutdown. Did you test this by shutting down the database for a noticeable period of time? In that case, even with the "select 1", the application will fail because it won't be able to get connections. Are we prepared for that?

Don't get me wrong, adding this "select 1" makes the application more tolerant of database connection errors, but we still need to make sure that it reacts correctly when they happen.
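
On the round-trip cost in point 1: as a hedged sketch only, the JBoss AS 7 datasource schema also allows validation to run periodically in the background instead of when a connection is taken from the pool. This is not what was proposed or merged for this bug, and the 60-second interval is an arbitrary assumption. It trades the per-checkout round trip for a window in which a broken connection can still be handed out, so point 2 applies even more strongly.

```xml
<!-- Assumed alternative: periodic background checks instead of checking on checkout -->
<validation>
  <check-valid-connection-sql>select 1</check-valid-connection-sql>
  <validate-on-match>false</validate-on-match>
  <background-validation>true</background-validation>
  <!-- Interval between background checks, in milliseconds (illustrative value) -->
  <background-validation-millis>60000</background-validation-millis>
</validation>
```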

Comment 10 Yair Zaslavsky 2012-10-04 13:33:29 UTC

I verified not only using "restart" but also using "stop" and then "start" after several seconds (enough time to get plenty of SQL errors in the log). I did manage, for example, to browse the webadmin tabs and to issue new commands after I performed start of the jboss service.
If you have more suggestions for how to verify this, I will be glad to hear them.

Indeed, the solution is not perfect.

Comment 11 Juan Hernández 2012-10-04 13:56:33 UTC

When you say "after I performed start of the jboss service" you make me think that you also restarted the ovirt-engine service. Is that correct? In my opinion we should test this without restarting it.

Comment 12 Yair Zaslavsky 2012-10-05 06:17:08 UTC

Hi Juan, a typo/confusion - It should have been "after I performed start of postgresql service".

I had jboss up and running (with postgresql up and running).
I performed service postgresql stop.
I tried to view tabs (VMs, Hosts, etc..) and failed.
After I performed service postgresql start I managed to view main tabs and also run flows - for example adding disk to VM.

Comment 13 Juan Hernández 2012-10-05 08:19:32 UTC

Thanks for the clarification Yair. That is a good verification in my opinion.

Comment 14 Itamar Heim 2012-10-05 14:36:23 UTC

I'm pretty sure we added this in 3.0 to avoid the same bug, so removing it is probably a regression.
It's not perfect, it adds a round trip, but I don't see a better choice.

Comment 15 Yair Zaslavsky 2012-10-09 06:43:15 UTC

In reply to comment #14 -
Yes, we did.
We had this in the postgres-ds.xml file.
During our switch to JBoss AS 7.x with a single standalone.xml configuration file, for some reason this slipped.

Comment 16 Juan Hernández 2012-10-09 08:32:08 UTC

The change to add the connection checker has been merged upstream:

http://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=commit;h=093af236551900dfb56aa021020bf8b30bb7b0eb

Comment 18 Pavel Stehlik 2012-10-19 13:42:02 UTC

ok - si21.1
