Bug 857684 - [engine] Engine fails to reconnect to postgres
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.1.0
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: high
Assigned To: Yair Zaslavsky
QA Contact: Pavel Stehlik
Whiteboard: infra
Depends On:
Blocks:
Reported: 2012-09-16 02:14 EDT by Gadi Ickowicz
Modified: 2016-02-10 14:44 EST
CC List: 11 users

See Also:
Fixed In Version: si21
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-12-04 15:04:58 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
engine logs (418.83 KB, application/x-gzip)
2012-09-16 02:14 EDT, Gadi Ickowicz

Description Gadi Ickowicz 2012-09-16 02:14:23 EDT
Created attachment 613377
engine logs

Description of problem:
After updating postgres using yum update, the engine could no longer access the database until the engine itself was restarted.

Version-Release number of selected component (if applicable):
rhevm-3.1.0-15.el6ev.noarch

How reproducible:
?

Steps to Reproduce:
1.
2.
3.
  
Actual results:
The engine fails to reconnect to the database after the postgres process is restarted.

Expected results:
The engine should handle a change of the postgres PID (i.e. the postgres process being restarted) and/or restart the postgres process on its own if no connection to the database can be established.

Additional info:
Comment 5 Yair Zaslavsky 2012-10-04 03:12:09 EDT
Juan, is there any reason NOT to add the following to the data source definition?

<validation>
  <check-valid-connection-sql>select 1</check-valid-connection-sql>
</validation>


I just tried that on my jboss-eap-6.0 setup, and it looks like it's working.
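
For reference, a minimal sketch of where such a <validation> block sits inside a JBoss AS 7 / EAP 6 datasource definition in standalone.xml. Only the "select 1" check comes from the comment above; the JNDI and pool names, the connection URL, the credentials, the datasources namespace version and the postgresql driver entry (which assumes a PostgreSQL JDBC module installed as org.postgresql) are placeholders assumed for illustration, not the engine's actual values. validate-on-match is set explicitly on the assumption that the check should run whenever a connection is taken from the pool, rather than relying on the server default.

<subsystem xmlns="urn:jboss:domain:datasources:1.1">
  <!-- Sketch only: names, URL, credentials and namespace version are placeholders -->
  <datasources>
    <datasource jndi-name="java:/ENGINEDataSource" pool-name="ENGINEDataSource" enabled="true">
      <connection-url>jdbc:postgresql://localhost:5432/engine</connection-url>
      <driver>postgresql</driver>
      <security>
        <user-name>engine</user-name>
        <password>CHANGE_ME</password>
      </security>
      <validation>
        <!-- Liveness query the pool runs against a connection -->
        <check-valid-connection-sql>select 1</check-valid-connection-sql>
        <!-- Run the check when a connection is handed out by the pool -->
        <validate-on-match>true</validate-on-match>
      </validation>
    </datasource>
    <drivers>
      <!-- Assumes the PostgreSQL JDBC driver is installed as module org.postgresql -->
      <driver name="postgresql" module="org.postgresql"/>
    </drivers>
  </datasources>
</subsystem>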
Comment 6 Yair Zaslavsky 2012-10-04 04:06:11 EDT
After consulting with mkublin -

http://gerrit.ovirt.org/#/c/8346/
Comment 7 Itamar Heim 2012-10-04 04:12:31 EDT
When did we lose it?
It's in:
./backend/manager/modules/bll/target/test-classes/deploy/postgres-ds.xml
Comment 8 Yair Zaslavsky 2012-10-04 04:21:28 EDT
I suspect it got lost in the transition to JBoss AS 7 (moving from postgres-ds.xml to standalone.xml).
I just checked my z-stream env - it's there.
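
The two configuration formats are structured differently, which is one way a setting can be dropped during a manual migration. In the older *-ds.xml descriptor format (as used by JBoss AS 5 / EAP 5), check-valid-connection-sql is a direct child of the datasource element, whereas in the AS 7 datasources subsystem it must be nested in a <validation> element. A minimal sketch of the old layout follows; apart from the "select 1" check, every value is a placeholder assumed for illustration, not copied from the engine's postgres-ds.xml.

<datasources>
  <!-- Sketch only: JNDI name, URL and credentials are placeholders -->
  <local-tx-datasource>
    <jndi-name>ENGINEDataSource</jndi-name>
    <connection-url>jdbc:postgresql://localhost:5432/engine</connection-url>
    <driver-class>org.postgresql.Driver</driver-class>
    <user-name>engine</user-name>
    <password>CHANGE_ME</password>
    <!-- In this format the check sits directly under the datasource element -->
    <check-valid-connection-sql>select 1</check-valid-connection-sql>
  </local-tx-datasource>
</datasources>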
Comment 9 Juan Hernández 2012-10-04 04:33:14 EDT
Please note the following:

1. Adding this configuration means that we run an additional, otherwise useless query before each useful query that we run, which means additional load on both the engine and the database. That is probably not a big deal. It also adds a network round trip for each useful query, which can matter for remote database installations. In short, it has a price; it is not free.

2. This will *reduce* the chances that the application gets a broken connection, but it won't guarantee that it doesn't. The connection can still break after "select 1" runs but before the useful query runs, or while the useful query is running. The application still needs to be prepared for this kind of failure.

3. How did you verify this? By doing "service postgresql restart"? With such a quick stop/start the engine probably won't even notice the shutdown. Did you test this by shutting down the database for a noticeable period of time? In that case, even with the "select 1", the application will fail because it won't be able to get connections. Are we prepared for that?

Don't get me wrong, adding this "select 1" makes the application more tolerant of database connection errors, but we still need to make sure that it reacts correctly when they happen.
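
A sketch of two pool-level settings in the AS 7 datasources subsystem that relate to points 1 and 2. This is not what was merged for this bug, only an illustration of the trade-offs. Background validation checks idle pooled connections periodically from a separate thread instead of on every checkout, removing the per-use round trip at the cost of a window in which a stale connection can still be handed out. An exception sorter lets the pool classify an error as fatal and destroy that connection instead of returning it to the pool; like "select 1", it complements rather than replaces error handling in the application. The 60-second interval is an arbitrary placeholder, and the PostgreSQLExceptionSorter class name is taken from IronJacamar's JDBC adapter extensions, so check that it is available in the server release in use.

<validation>
  <!-- Sketch only: an alternative validation block for the same datasource -->
  <check-valid-connection-sql>select 1</check-valid-connection-sql>
  <!-- Do not run the check on every checkout... -->
  <validate-on-match>false</validate-on-match>
  <!-- ...check idle connections periodically instead (interval in ms is a placeholder) -->
  <background-validation>true</background-validation>
  <background-validation-millis>60000</background-validation-millis>
  <!-- Destroy connections whose errors the sorter classifies as fatal -->
  <exception-sorter class-name="org.jboss.jca.adapters.jdbc.extensions.postgres.PostgreSQLExceptionSorter"/>
</validation>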
Comment 10 Yair Zaslavsky 2012-10-04 09:33:29 EDT
I verified this not only with "restart" but also with "stop" and then "start" several seconds later (enough time to get plenty of SQL errors in the log). For example, I did manage to browse the webadmin tabs and to issue new commands after I performed start of the jboss service.
If you have more suggestions for how to verify this, I will be glad to hear them.

Indeed, the solution is not perfect.
Comment 11 Juan Hernández 2012-10-04 09:56:33 EDT
When you say "after I performed start of the jboss service" you make me think that you also restarted the ovirt-engine service. Is that correct? In my opinion we should test this without restarting it.
Comment 12 Yair Zaslavsky 2012-10-05 02:17:08 EDT
Hi Juan, that was a typo/confusion: it should have been "after I performed start of postgresql service".

I had jboss up and running (with postgresql up and running).
I performed service postgresql stop.
I tried to view the tabs (VMs, Hosts, etc.) and failed.
After I performed service postgresql start I managed to view main tabs and also run flows - for example adding disk to VM.
Comment 13 Juan Hernández 2012-10-05 04:19:32 EDT
Thanks for the clarification Yair. That is a good verification in my opinion.
Comment 14 Itamar Heim 2012-10-05 10:36:23 EDT
I'm pretty sure we added this in 3.0 to avoid the same bug, so removing it is probably a regression.
It's not perfect and it adds a round trip, but I don't see a better choice.
Comment 15 Yair Zaslavsky 2012-10-09 02:43:15 EDT
In reply to comment #14:
Yes, we did.
We had this in the postgres-ds.xml file.
During our switch to JBoss AS 7.x, with a single standalone.xml configuration file, it slipped out for some reason.
Comment 16 Juan Hernández 2012-10-09 04:32:08 EDT
The change to add the connection checker has been merged upstream:

http://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=commit;h=093af236551900dfb56aa021020bf8b30bb7b0eb
Comment 18 Pavel Stehlik 2012-10-19 09:42:02 EDT
ok - si21.1
