Red Hat Bugzilla – Bug 1286441
If the dwhd lost the original connection to the db it will not update the audit_log and dwh_history_timekeeping when dwh is stopped
Last modified: 2016-05-23 07:03:44 EDT
Description of problem:
If the connection to postgresql is dropped, dwhd does not successfully connect again.
Version-Release number of selected component (if applicable):
Probably old, happened to on rhevm-dwh-3.4.2-1.el6ev.noarch and on rhevm-dwh-3.6.0-0.3.el6ev.noarch .
Always I think
Steps to Reproduce:
1. service postgresql start
2. service ovirt-engine-dwhd start
3. service postgresql restart
4. service ovirt-engine-dwhd stop
dwhd log file shows a clean exit
engine db table 'dwh_history_timekeeping' has '0' for 'DwhCurrentlyRunning'.
dwhd log has:
2015-11-29 16:01:18|ETL Service Stopped
Exception in component tJDBCOutput_2
org.postgresql.util.PSQLException: An I/O error occured while sending to the backend.
Caused by: java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
... 13 more
2015-11-29 16:01:18|REslcY|REslcY|REslcY|OVIRT_ENGINE_DWH|HistoryETL|Default|6|Java Exception|tJDBCOutput_2|org.postgresql.util.PSQLException:An I/O error occured while sending to the backend.|1
engine=# select var_value from dwh_history_timekeeping where var_name='DwhCurrentlyRunning';
Noticed this because it causes engine-setup to fail with the error:
[ ERROR ] dwhd is currently running. Its hostname is None. Please stop it before running Setup.
[ ERROR ] Failed to execute stage 'Transaction setup': dwhd is currently running
service ovirt-engine-dwhd restart
Did you stop the ovirt-engine-dwhd service right after the postgresql restart?
Maybe the connection to the db was unable to reconnect before you stopped the ovirt-engine-dwhd service so it could not update the DwhCurrentlyRunning value ?
If this is the reason it will be solved only after an hour when the cron job restart the ovirt-engine-dwhd service.
Created attachment 1100373 [details]
(In reply to Shirly Radco from comment #2)
> Did you stop the ovirt-engine-dwhd service right after the postgresql
Yes, after a few seconds. But see below.
> Maybe the connection to the db was unable to reconnect before you stopped
> the ovirt-engine-dwhd service so it could not update the DwhCurrentlyRunning
> value ?
Indeed, but why? That's the bug here. IMO it should reconnect, update, then stop.
BTW, note that I didn't write in the summary "dwhd fails to cleanly stop if pg is restarted". I expect it to just continue its normal work. Cleanly stopping is just a part of this.
> If this is the reason it will be solved only after an hour when the cron job
> restart the ovirt-engine-dwhd service.
This is not a solution.
I think we must expect everything to work correctly even without this cron job.
Still, I now tried this again and it didn't work:
1. start dwhd
2. restart pg
dwhd log has some errors added once a minute
'service ovirt-engine-dwhd status' returns 0 and says it's running
manually ran the cron job, it didn't restart (obviously, because 'status' returned 0)
3. stopped dwhd
The log shows it stopped, plus an error
DwhCurrentlyRunning is still 1
We need to update the tJDBCConnection for the components that update the engine db when dwhd is stopped to tJDBCConnection_7 that is refreshed each time the connection is lost.
We should also consider not stopping the dwhd before updating the engine
or add a message to the ovirt-engine-dwhd.log that the engine could not be updated and this needs to be done manually or it will be fixed once the engine restarts.
(In reply to Shirly Radco from comment #4)
> We need to update the tJDBCConnection for the components that update the
> engine db when dwhd is stopped to tJDBCConnection_7 that is refreshed each
> time the connection is lost.
> We should also consider not stopping the dwhd before updating the engine
But then it will continue running against an updated database. Do we check if the database is compatible per sample? Or only at startup?
> or add a message to the ovirt-engine-dwhd.log that the engine could not be
> updated and this needs to be done manually or it will be fixed once the
> engine restarts.
Not sure I follow. Why?
The normal flow when running engine-setup is fine imo. We only need to make it work well also if pg was restarted before it.
This bug is not marked for z-stream, yet the milestone is for a z-stream version, therefore the milestone has been reset.
Please set the correct milestone or add the z-stream flag.
*** Bug 1284438 has been marked as a duplicate of this bug. ***
Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA.
Not working correctly on 3.6.2-3:
engine=# select * from dwh_history_timekeeping where var_name='DwhCurrentlyRunning';
var_name | var_value | var_datetime
DwhCurrentlyRunning | 1 |
This shows up in dwhd log:
Exception in component tJDBCRollback_3
org.postgresql.util.PSQLException: FATAL: terminating connection due to administrator command
2016-01-04 15:30:46|DpaZCu|DpaZCu|DpaZCu|OVIRT_ENGINE_DWH|HistoryETL|Default|6|Java Exception|tJDBCRollback_3|org.postgresql.util.PSQLException:FATAL: terminating connection due to administrator command|1
(In reply to Petr Matyáš from comment #9)
> Not working correctly on 3.6.2-3:
Please provide accurate detailed flow.
See also "Steps to Reproduce:" in the Description.
Target release should be placed once a package build is known to fix a issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for a oVirt release.
Exactly the same as in Description.
Engine is upgraded from previous version of RHEV 3.6. I checked dwhd and postgresql status (both were running). Restarted postgresql and then stopped dwhd.
Please test this after waiting a minute from the postgres restart.
If this works then please verify since this is by design.
Tested again with a few minutes wait before stopping dwhd. Now the DB is set correctly and no exceptions are shown after ETL service stopped.
*** Bug 1295412 has been marked as a duplicate of this bug. ***
*** Bug 1293844 has been marked as a duplicate of this bug. ***
Shirly, please see the discussion in bug 1293844. I agree it's a very reasonable flow, considering the change in bug 1077267. Is it possible e.g. to make dwhd reconnect in the signal handler, right before exiting?
The connection is restored after 4 seconds.
Losing connection to the db and then immediately closing the dwh is not that common in my opinion.
I believe the fix to this bug should solve the issue in bug 1293844.
Please let me know if this is still an issue.
(In reply to Shirly Radco from comment #21)
> The connection is restored after 4 seconds.
In bug 1293844 you wrote "a minute". 4 seconds is much less...
> Losing connection to the db and then immediately closing the dwh is not that
> common in my opinion.
I might agree in general, but suppose the following flow:
1. Run engine-setup
get an error saying max_connections is too low (enforced by bug 1077267 change)
2. Edit postgresql.conf and restart it
3. Run engine-setup again
4. engine-setup stops dwhd
> I believe the fix to this bug should solve the issue in bug 1293844.
> Please let me know if this is still an issue.
It is, if between (2.) and (4.) there are less than 4 seconds.
OK. Please open a bug and add as depends on bug 1077267.
opened bug 1297682. I do not think it depends on bug 1077267, that one just adds more flows in which it might affect. But let's discuss this there.
*** Bug 1316973 has been marked as a duplicate of this bug. ***