Bug 1286441 - If the dwhd lost the original connection to the db it will not update the audit_log and dwh_history_timekeeping when dwh is stopped
If the dwhd lost the original connection to the db it will not update the aud...
Status: CLOSED CURRENTRELEASE
Product: ovirt-engine-dwh
Classification: oVirt
Component: ETL (Show other bugs)
3.6.0
Unspecified Unspecified
medium Severity medium (vote)
: ovirt-3.6.2
: 3.6.2
Assigned To: Shirly Radco
Petr Matyáš
:
: 1284438 1293844 1295412 1316973 (view as bug list)
Depends On:
Blocks: RHEV_36_HTB 1283944 RHEV3.6Upgrade
  Show dependency treegraph
 
Reported: 2015-11-29 09:23 EST by Yedidyah Bar David
Modified: 2016-05-23 07:03 EDT (History)
9 users (show)

See Also:
Fixed In Version: rhevm-dwh-3.6.2-1.el6ev
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-18 06:17:34 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Metrics
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rule-engine: ovirt‑3.6.z+
ylavi: planning_ack+
rule-engine: devel_ack+
pstehlik: testing_ack+


Attachments (Terms of Use)
ovirt-engine-dwhd.log (8.47 KB, text/plain)
2015-11-30 02:49 EST, Yedidyah Bar David
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 49423 master MERGED history: use engine-events connection that is refreshed Never

  None (edit)
Description Yedidyah Bar David 2015-11-29 09:23:07 EST
Description of problem:

If the connection to postgresql is dropped, dwhd does not successfully connect again.

Version-Release number of selected component (if applicable):

Probably old, happened to on rhevm-dwh-3.4.2-1.el6ev.noarch and on rhevm-dwh-3.6.0-0.3.el6ev.noarch .

How reproducible:

Always I think

Steps to Reproduce:
1. service postgresql start
2. service ovirt-engine-dwhd start
3. service postgresql restart
4. service ovirt-engine-dwhd stop

Actual results:

dwhd log file shows a clean exit
engine db table 'dwh_history_timekeeping' has '0' for 'DwhCurrentlyRunning'.

Expected results:

dwhd log has:

2015-11-29 16:01:18|ETL Service Stopped
Exception in component tJDBCOutput_2
org.postgresql.util.PSQLException: An I/O error occured while sending to the backend.
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:281)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:508)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:384)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeUpdate(AbstractJdbc2Statement.java:330)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.tRowGenerator_5Process(HistoryETL.java:3351)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.tRowGenerator_1Process(HistoryETL.java:2715)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.tJDBCRollback_5Process(HistoryETL.java:2350)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.tJDBCRollback_4Process(HistoryETL.java:2227)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.tJDBCRollback_3Process(HistoryETL.java:2104)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.tJDBCRollback_2Process(HistoryETL.java:1981)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.tJDBCRollback_1Process(HistoryETL.java:1858)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.tPostjob_1Process(HistoryETL.java:1730)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.runJobInTOS(HistoryETL.java:11073)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.main(HistoryETL.java:10494)
Caused by: java.net.SocketException: Broken pipe
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
        at org.postgresql.core.PGStream.flush(PGStream.java:521)
        at org.postgresql.core.v3.QueryExecutorImpl.sendSync(QueryExecutorImpl.java:1134)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:254)
        ... 13 more
2015-11-29 16:01:18|REslcY|REslcY|REslcY|OVIRT_ENGINE_DWH|HistoryETL|Default|6|Java Exception|tJDBCOutput_2|org.postgresql.util.PSQLException:An I/O error occured while sending to the backend.|1



engine=# select var_value from dwh_history_timekeeping where var_name='DwhCurrentlyRunning';
 var_value 
-----------
 1
(1 row)

Additional info:

Noticed this because it causes engine-setup to fail with the error:

[ ERROR ] dwhd is currently running. Its hostname is None. Please stop it before running Setup.
[ ERROR ] Failed to execute stage 'Transaction setup': dwhd is currently running
Comment 1 Yedidyah Bar David 2015-11-29 09:24:53 EST
A workaround:

service ovirt-engine-dwhd restart
Comment 2 Shirly Radco 2015-11-30 02:05:21 EST
Did you stop the ovirt-engine-dwhd service right after the postgresql restart?
Maybe the connection to the db was unable to reconnect before you stopped the ovirt-engine-dwhd service so it could not update the DwhCurrentlyRunning value ?

If this is the reason it will be solved only after an hour when the cron job restart the ovirt-engine-dwhd service.
Comment 3 Yedidyah Bar David 2015-11-30 02:49 EST
Created attachment 1100373 [details]
ovirt-engine-dwhd.log

(In reply to Shirly Radco from comment #2)
> Did you stop the ovirt-engine-dwhd service right after the postgresql
> restart?

Yes, after a few seconds. But see below.

> Maybe the connection to the db was unable to reconnect before you stopped
> the ovirt-engine-dwhd service so it could not update the DwhCurrentlyRunning
> value ?

Indeed, but why? That's the bug here. IMO it should reconnect, update, then stop.

BTW, note that I didn't write in the summary "dwhd fails to cleanly stop if pg is restarted". I expect it to just continue its normal work. Cleanly stopping is just a part of this.

> 
> If this is the reason it will be solved only after an hour when the cron job
> restart the ovirt-engine-dwhd service.

This is not a solution.

I think we must expect everything to work correctly even without this cron job.

Still, I now tried this again and it didn't work:

1. start dwhd
2. restart pg

waited

dwhd log has some errors added once a minute

'service ovirt-engine-dwhd status' returns 0 and says it's running

manually ran the cron job, it didn't restart (obviously, because 'status' returned 0)

3. stopped dwhd

The log shows it stopped, plus an error

DwhCurrentlyRunning is still 1

Attaching log.
Comment 4 Shirly Radco 2015-11-30 04:28:21 EST
We need to update the tJDBCConnection for the components that update the engine db when dwhd is stopped to tJDBCConnection_7 that is refreshed each time the connection is lost. 

We should also consider not stopping the dwhd before updating the engine 
or add a message to the ovirt-engine-dwhd.log that the engine could not be updated and this needs to be done manually or it will be fixed once the engine restarts.
Comment 5 Yedidyah Bar David 2015-11-30 04:38:35 EST
(In reply to Shirly Radco from comment #4)
> We need to update the tJDBCConnection for the components that update the
> engine db when dwhd is stopped to tJDBCConnection_7 that is refreshed each
> time the connection is lost. 
> 
> We should also consider not stopping the dwhd before updating the engine 

But then it will continue running against an updated database. Do we check if the database is compatible per sample? Or only at startup?

> or add a message to the ovirt-engine-dwhd.log that the engine could not be
> updated and this needs to be done manually or it will be fixed once the
> engine restarts.

Not sure I follow. Why?

The normal flow when running engine-setup is fine imo. We only need to make it work well also if pg was restarted before it.
Comment 6 Red Hat Bugzilla Rules Engine 2015-11-30 06:22:24 EST
This bug is not marked for z-stream, yet the milestone is for a z-stream version, therefore the milestone has been reset.
Please set the correct milestone or add the z-stream flag.
Comment 7 Yedidyah Bar David 2015-12-01 01:33:48 EST
*** Bug 1284438 has been marked as a duplicate of this bug. ***
Comment 8 Red Hat Bugzilla Rules Engine 2015-12-22 03:25:56 EST
Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA.
Comment 9 Petr Matyáš 2016-01-04 09:34:16 EST
Not working correctly on 3.6.2-3:

engine=# select * from dwh_history_timekeeping where var_name='DwhCurrentlyRunning';
      var_name       | var_value | var_datetime 
---------------------+-----------+--------------
 DwhCurrentlyRunning | 1         | 

This shows up in dwhd log:
Exception in component tJDBCRollback_3
org.postgresql.util.PSQLException: FATAL: terminating connection due to administrator command
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2094)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1827)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255)
        at org.postgresql.jdbc2.AbstractJdbc2Connection.executeTransactionCommand(AbstractJdbc2Connection.java:669)
        at org.postgresql.jdbc2.AbstractJdbc2Connection.rollback(AbstractJdbc2Connection.java:706)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.tJDBCRollback_3Process(HistoryETL.java:2088)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.tJDBCRollback_2Process(HistoryETL.java:2003)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.tJDBCRollback_1Process(HistoryETL.java:1880)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.tPostjob_1Process(HistoryETL.java:1752)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.runJobInTOS(HistoryETL.java:11210)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.main(HistoryETL.java:10631)
2016-01-04 15:30:46|DpaZCu|DpaZCu|DpaZCu|OVIRT_ENGINE_DWH|HistoryETL|Default|6|Java Exception|tJDBCRollback_3|org.postgresql.util.PSQLException:FATAL: terminating connection due to administrator command|1
Comment 10 Yedidyah Bar David 2016-01-04 09:43:58 EST
(In reply to Petr Matyáš from comment #9)
> Not working correctly on 3.6.2-3:

Please provide accurate detailed flow.

See also "Steps to Reproduce:" in the Description.
Comment 11 Red Hat Bugzilla Rules Engine 2016-01-04 09:44:21 EST
Target release should be placed once a package build is known to fix a issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for a oVirt release.
Comment 12 Petr Matyáš 2016-01-04 09:50:45 EST
Exactly the same as in Description.
Engine is upgraded from previous version of RHEV 3.6. I checked dwhd and postgresql status (both were running). Restarted postgresql and then stopped dwhd.
Comment 13 Shirly Radco 2016-01-05 01:58:30 EST
Please test this after waiting a minute from the postgres restart.
If this works then please verify since this is by design.
Comment 14 Red Hat Bugzilla Rules Engine 2016-01-05 01:58:33 EST
Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA.
Comment 15 Shirly Radco 2016-01-05 01:59:26 EST
Please test this after waiting a minute from the postgres restart.
If this works then please verify since this is by design.
Comment 16 Red Hat Bugzilla Rules Engine 2016-01-05 02:09:35 EST
Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA.
Comment 17 Petr Matyáš 2016-01-05 04:10:23 EST
Tested again with a few minutes wait before stopping dwhd. Now the DB is set correctly and no exceptions are shown after ETL service stopped.
Comment 18 Yedidyah Bar David 2016-01-07 03:56:44 EST
*** Bug 1295412 has been marked as a duplicate of this bug. ***
Comment 19 Yedidyah Bar David 2016-01-12 01:24:26 EST
*** Bug 1293844 has been marked as a duplicate of this bug. ***
Comment 20 Yedidyah Bar David 2016-01-12 01:29:59 EST
Shirly, please see the discussion in bug 1293844. I agree it's a very reasonable flow, considering the change in bug 1077267. Is it possible e.g. to make dwhd reconnect in the signal handler, right before exiting?
Comment 21 Shirly Radco 2016-01-12 02:36:06 EST
The connection is restored after 4 seconds.
Losing connection to the db and then immediately closing the dwh is not that common in my opinion.

I believe the fix to this bug should solve the issue in bug 1293844.

Please let me know if this is still an issue.
Comment 22 Yedidyah Bar David 2016-01-12 02:56:11 EST
(In reply to Shirly Radco from comment #21)
> The connection is restored after 4 seconds.

In bug 1293844 you wrote "a minute". 4 seconds is much less...

> Losing connection to the db and then immediately closing the dwh is not that
> common in my opinion.

I might agree in general, but suppose the following flow:

1. Run engine-setup
get an error saying max_connections is too low (enforced by bug 1077267 change)
2. Edit postgresql.conf and restart it
3. Run engine-setup again
4. engine-setup stops dwhd

> 
> I believe the fix to this bug should solve the issue in bug 1293844.
> 
> Please let me know if this is still an issue.

It is, if between (2.) and (4.) there are less than 4 seconds.
Comment 23 Shirly Radco 2016-01-12 03:16:43 EST
OK. Please open a bug and add as depends on bug 1077267.
Comment 24 Yedidyah Bar David 2016-01-12 03:52:04 EST
opened bug 1297682. I do not think it depends on bug 1077267, that one just adds more flows in which it might affect. But let's discuss this there.
Comment 25 Yedidyah Bar David 2016-04-04 07:54:14 EDT
*** Bug 1316973 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.