Created attachment 613377: engine logs

Description of problem:
After updating postgres using yum update, engine could no longer access database until the engine was restarted.

Version-Release number of selected component (if applicable):
rhevm-3.1.0-15.el6ev.noarch

How reproducible:
?

Steps to Reproduce:
1.
2.
3.

Actual results:
Engine fails to reconnect to database after postgres process is restarted

Expected results:
engine should handle pid change of postgres and/or restart the postgres process on its own if no connection to database can be established

Additional info:
Juan, any reason why NOT to add the following to the data source definition?

    <validation>
        <check-valid-connection-sql>select 1</check-valid-connection-sql>
    </validation>

I just tried that on my jboss-eap-6.0 setup, and it looks like it's working.
After consulting with mkublin - http://gerrit.ovirt.org/#/c/8346/
When did we lose it? It's in: ./backend/manager/modules/bll/target/test-classes/deploy/postgres-ds.xml
I suspect it got lost in the transition to JBoss-AS-7 (moving from postgres-ds.xml to standalone.xml). I just checked my z-stream env - it's there.
Please note the following:

1. Adding this configuration means that we run an additional useless query before each useful query that we run, and this means additional load for the engine and for the database. Probably not a big deal. It also adds a network round trip for each useful query that we run, and that can be relevant for remote database installations. In other words, it has a price; it is not free.

2. This will *reduce* the chances that the application gets a broken connection, but it won't guarantee that it never does. The connection can still break after running "select 1" but before running the useful query, or while running the useful query. The application still needs to be prepared for this kind of failure.

3. How did you verify this? By doing "service postgresql restart"? If you do such a quick stop/start procedure, the engine will probably not even notice the shutdown. Did you test this by shutting down the database for a noticeable period of time? In that case, even with the "select 1", the application will fail because it won't be able to get connections. Are we prepared for that?

Don't get me wrong, adding this "select 1" makes the application more tolerant of database connection errors, but we still need to make sure that it reacts correctly when they happen.
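To make points 1 and 2 concrete, here is a minimal Python sketch of validate-on-checkout plus a retry around the useful query. It is only an illustration, not how JBoss or the engine implements this: FakeConnection, ValidatingPool and run_with_retry are hypothetical names invented for the example, and no real database is involved.

```python
class BrokenConnection(Exception):
    """Raised by the fake connection when the database side has gone away."""


class FakeConnection:
    """Hypothetical stand-in for a real database connection."""

    def __init__(self):
        self.alive = True
        self.queries = []  # every statement sent, to make the extra round trip visible

    def execute(self, sql):
        if not self.alive:
            raise BrokenConnection(sql)
        self.queries.append(sql)
        return "ok"


class ValidatingPool:
    """Toy pool that runs a validation query before reusing an idle connection."""

    def __init__(self, factory, validation_sql="select 1"):
        self.factory = factory
        self.validation_sql = validation_sql
        self.idle = []

    def checkout(self):
        while self.idle:
            conn = self.idle.pop()
            try:
                conn.execute(self.validation_sql)  # point 1: one extra round trip
                return conn
            except BrokenConnection:
                continue  # stale connection: discard it and try the next one
        # No usable idle connection left: open a new one.
        # With a real database this is where a long outage would surface (point 3).
        return self.factory()

    def checkin(self, conn):
        self.idle.append(conn)


def run_with_retry(pool, sql, attempts=2):
    """Run sql; if the connection breaks after validation (point 2), retry on another one."""
    last_error = None
    for _ in range(attempts):
        conn = pool.checkout()
        try:
            result = conn.execute(sql)
        except BrokenConnection as err:
            last_error = err  # broken connection is dropped, not returned to the pool
            continue
        pool.checkin(conn)
        return result
    raise last_error
```

The validation query only filters out connections that were already dead at checkout time. The retry in run_with_retry is what covers a connection that breaks between the "select 1" and the useful query, and even that gives up once the configured number of attempts is exhausted.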
I verified this not only using "restart" but also using "stop" followed by "start" several seconds later (enough time to get plenty of SQL errors in the log). For example, I managed to browse the webadmin tabs and to issue new commands after I performed start of the jboss service. If you have more suggestions for how to verify this, I will be glad to hear them. Indeed, the solution is not perfect.
When you say "after I performed start of the jboss service" you make me think that you also restarted the ovirt-engine service. Is that correct? In my opinion we should test this without restarting it.
Hi Juan, a typo/confusion - It should have been "after I performed start of postgresql service". I had jboss up and running (with postgresql up and running). I performed service postgresql stop. I tried to view tabs (VMs, Hosts, etc..) and failed. After I performed service postgresql start I managed to view main tabs and also run flows - for example adding disk to VM.
Thanks for the clarification Yair. That is a good verification in my opinion.
I'm pretty sure we added this in 3.0 to avoid the same bug, so removing it is probably a regression. It's not perfect, it adds a round trip, but I don't see a better choice.
In reply to comment #14 - Yes, we did. We had this in the postgres-ds.xml file. During our switch to JBoss AS 7.x with a single standalone.xml configuration file, for some reason it slipped through.
The change to add the connection checker has been merged upstream: http://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=commit;h=093af236551900dfb56aa021020bf8b30bb7b0eb
ok - si21.1