1286441 – If the dwhd lost the original connection to the db it will not update the audit_log and dwh_history_timekeeping when dwh is stopped

Bug 1286441 - If the dwhd lost the original connection to the db it will not update the audit_log and dwh_history_timekeeping when dwh is stopped

Summary: If the dwhd lost the original connection to the db it will not update the aud...

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	ovirt-engine-dwh
Classification:	oVirt
Component:	ETL
Sub Component:
Version:	3.6.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	ovirt-3.6.2
Target Release:	3.6.2
Assignee:	Shirly Radco
QA Contact:	Petr Matyáš
Docs Contact:
URL:
Whiteboard:
Duplicates (4):	1284438 1293844 1295412 1316973 (view as bug list)
Depends On:
Blocks:	RHEV_36_HTB 1283944 RHEV3.6Upgrade
TreeView+	depends on / blocked

Reported:	2015-11-29 14:23 UTC by Yedidyah Bar David
Modified:	2019-10-10 10:38 UTC (History)
CC List:	9 users (show)
Fixed In Version:	rhevm-dwh-3.6.2-1.el6ev
Clone Of:
Environment:
Last Closed:	2016-02-18 11:17:34 UTC
oVirt Team:	Metrics
Embargoed:
Flags:	rule-engine: ovirt-3.6.z+ ylavi: planning_ack+ rule-engine: devel_ack+ pstehlik: testing_ack+

Attachments	(Terms of Use)
ovirt-engine-dwhd.log (8.47 KB, text/plain) 2015-11-30 07:49 UTC, Yedidyah Bar David	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1284438	0	unspecified	CLOSED	engine-setup fails complaining that DWH is running but it's not running	2021-02-22 00:41:40 UTC
oVirt gerrit	49423	0	master	MERGED	history: use engine-events connection that is refreshed	2020-04-13 05:35:27 UTC

Internal Links: 1284438

Description Yedidyah Bar David 2015-11-29 14:23:07 UTC

Description of problem:

If the connection to postgresql is dropped, dwhd does not successfully connect again.

Version-Release number of selected component (if applicable):

Probably old, happened to on rhevm-dwh-3.4.2-1.el6ev.noarch and on rhevm-dwh-3.6.0-0.3.el6ev.noarch .

How reproducible:

Always I think

Steps to Reproduce:
1. service postgresql start
2. service ovirt-engine-dwhd start
3. service postgresql restart
4. service ovirt-engine-dwhd stop

Actual results:

dwhd log file shows a clean exit
engine db table 'dwh_history_timekeeping' has '0' for 'DwhCurrentlyRunning'.

Expected results:

dwhd log has:

2015-11-29 16:01:18|ETL Service Stopped
Exception in component tJDBCOutput_2
org.postgresql.util.PSQLException: An I/O error occured while sending to the backend.
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:281)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:508)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:384)
        at org.postgresql.jdbc2.AbstractJdbc2Statement.executeUpdate(AbstractJdbc2Statement.java:330)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.tRowGenerator_5Process(HistoryETL.java:3351)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.tRowGenerator_1Process(HistoryETL.java:2715)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.tJDBCRollback_5Process(HistoryETL.java:2350)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.tJDBCRollback_4Process(HistoryETL.java:2227)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.tJDBCRollback_3Process(HistoryETL.java:2104)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.tJDBCRollback_2Process(HistoryETL.java:1981)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.tJDBCRollback_1Process(HistoryETL.java:1858)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.tPostjob_1Process(HistoryETL.java:1730)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.runJobInTOS(HistoryETL.java:11073)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.main(HistoryETL.java:10494)
Caused by: java.net.SocketException: Broken pipe
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
        at org.postgresql.core.PGStream.flush(PGStream.java:521)
        at org.postgresql.core.v3.QueryExecutorImpl.sendSync(QueryExecutorImpl.java:1134)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:254)
        ... 13 more
2015-11-29 16:01:18|REslcY|REslcY|REslcY|OVIRT_ENGINE_DWH|HistoryETL|Default|6|Java Exception|tJDBCOutput_2|org.postgresql.util.PSQLException:An I/O error occured while sending to the backend.|1



engine=# select var_value from dwh_history_timekeeping where var_name='DwhCurrentlyRunning';
 var_value 
-----------
 1
(1 row)

Additional info:

Noticed this because it causes engine-setup to fail with the error:

[ ERROR ] dwhd is currently running. Its hostname is None. Please stop it before running Setup.
[ ERROR ] Failed to execute stage 'Transaction setup': dwhd is currently running

Comment 1 Yedidyah Bar David 2015-11-29 14:24:53 UTC

A workaround:

service ovirt-engine-dwhd restart

Comment 2 Shirly Radco 2015-11-30 07:05:21 UTC

Did you stop the ovirt-engine-dwhd service right after the postgresql restart?
Maybe the connection to the db was unable to reconnect before you stopped the ovirt-engine-dwhd service so it could not update the DwhCurrentlyRunning value ?

If this is the reason it will be solved only after an hour when the cron job restart the ovirt-engine-dwhd service.

Comment 3 Yedidyah Bar David 2015-11-30 07:49:54 UTC

Created attachment 1100373 [details]
ovirt-engine-dwhd.log

(In reply to Shirly Radco from comment #2)
> Did you stop the ovirt-engine-dwhd service right after the postgresql
> restart?

Yes, after a few seconds. But see below.

> Maybe the connection to the db was unable to reconnect before you stopped
> the ovirt-engine-dwhd service so it could not update the DwhCurrentlyRunning
> value ?

Indeed, but why? That's the bug here. IMO it should reconnect, update, then stop.

BTW, note that I didn't write in the summary "dwhd fails to cleanly stop if pg is restarted". I expect it to just continue its normal work. Cleanly stopping is just a part of this.

> 
> If this is the reason it will be solved only after an hour when the cron job
> restart the ovirt-engine-dwhd service.

This is not a solution.

I think we must expect everything to work correctly even without this cron job.

Still, I now tried this again and it didn't work:

1. start dwhd
2. restart pg

waited

dwhd log has some errors added once a minute

'service ovirt-engine-dwhd status' returns 0 and says it's running

manually ran the cron job, it didn't restart (obviously, because 'status' returned 0)

3. stopped dwhd

The log shows it stopped, plus an error

DwhCurrentlyRunning is still 1

Attaching log.

Comment 4 Shirly Radco 2015-11-30 09:28:21 UTC

We need to update the tJDBCConnection for the components that update the engine db when dwhd is stopped to tJDBCConnection_7 that is refreshed each time the connection is lost. 

We should also consider not stopping the dwhd before updating the engine 
or add a message to the ovirt-engine-dwhd.log that the engine could not be updated and this needs to be done manually or it will be fixed once the engine restarts.

Comment 5 Yedidyah Bar David 2015-11-30 09:38:35 UTC

(In reply to Shirly Radco from comment #4)
> We need to update the tJDBCConnection for the components that update the
> engine db when dwhd is stopped to tJDBCConnection_7 that is refreshed each
> time the connection is lost. 
> 
> We should also consider not stopping the dwhd before updating the engine 

But then it will continue running against an updated database. Do we check if the database is compatible per sample? Or only at startup?

> or add a message to the ovirt-engine-dwhd.log that the engine could not be
> updated and this needs to be done manually or it will be fixed once the
> engine restarts.

Not sure I follow. Why?

The normal flow when running engine-setup is fine imo. We only need to make it work well also if pg was restarted before it.

Comment 6 Red Hat Bugzilla Rules Engine 2015-11-30 11:22:24 UTC

This bug is not marked for z-stream, yet the milestone is for a z-stream version, therefore the milestone has been reset.
Please set the correct milestone or add the z-stream flag.

Comment 7 Yedidyah Bar David 2015-12-01 06:33:48 UTC

*** Bug 1284438 has been marked as a duplicate of this bug. ***

Comment 8 Red Hat Bugzilla Rules Engine 2015-12-22 08:25:56 UTC

Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA.

Comment 9 Petr Matyáš 2016-01-04 14:34:16 UTC

Not working correctly on 3.6.2-3:

engine=# select * from dwh_history_timekeeping where var_name='DwhCurrentlyRunning';
      var_name       | var_value | var_datetime 
---------------------+-----------+--------------
 DwhCurrentlyRunning | 1         | 

This shows up in dwhd log:
Exception in component tJDBCRollback_3
org.postgresql.util.PSQLException: FATAL: terminating connection due to administrator command
        at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2094)
        at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1827)
        at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255)
        at org.postgresql.jdbc2.AbstractJdbc2Connection.executeTransactionCommand(AbstractJdbc2Connection.java:669)
        at org.postgresql.jdbc2.AbstractJdbc2Connection.rollback(AbstractJdbc2Connection.java:706)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.tJDBCRollback_3Process(HistoryETL.java:2088)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.tJDBCRollback_2Process(HistoryETL.java:2003)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.tJDBCRollback_1Process(HistoryETL.java:1880)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.tPostjob_1Process(HistoryETL.java:1752)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.runJobInTOS(HistoryETL.java:11210)
        at ovirt_engine_dwh.historyetl_3_6.HistoryETL.main(HistoryETL.java:10631)
2016-01-04 15:30:46|DpaZCu|DpaZCu|DpaZCu|OVIRT_ENGINE_DWH|HistoryETL|Default|6|Java Exception|tJDBCRollback_3|org.postgresql.util.PSQLException:FATAL: terminating connection due to administrator command|1

Comment 10 Yedidyah Bar David 2016-01-04 14:43:58 UTC

(In reply to Petr Matyáš from comment #9)
> Not working correctly on 3.6.2-3:

Please provide accurate detailed flow.

See also "Steps to Reproduce:" in the Description.

Comment 11 Red Hat Bugzilla Rules Engine 2016-01-04 14:44:21 UTC

Target release should be placed once a package build is known to fix a issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for a oVirt release.

Comment 12 Petr Matyáš 2016-01-04 14:50:45 UTC

Exactly the same as in Description.
Engine is upgraded from previous version of RHEV 3.6. I checked dwhd and postgresql status (both were running). Restarted postgresql and then stopped dwhd.

Comment 13 Shirly Radco 2016-01-05 06:58:30 UTC

Please test this after waiting a minute from the postgres restart.
If this works then please verify since this is by design.

Comment 14 Red Hat Bugzilla Rules Engine 2016-01-05 06:58:33 UTC

Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA.

Comment 15 Shirly Radco 2016-01-05 06:59:26 UTC

Please test this after waiting a minute from the postgres restart.
If this works then please verify since this is by design.

Comment 16 Red Hat Bugzilla Rules Engine 2016-01-05 07:09:35 UTC

Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA.

Comment 17 Petr Matyáš 2016-01-05 09:10:23 UTC

Tested again with a few minutes wait before stopping dwhd. Now the DB is set correctly and no exceptions are shown after ETL service stopped.

Comment 18 Yedidyah Bar David 2016-01-07 08:56:44 UTC

*** Bug 1295412 has been marked as a duplicate of this bug. ***

Comment 19 Yedidyah Bar David 2016-01-12 06:24:26 UTC

*** Bug 1293844 has been marked as a duplicate of this bug. ***

Comment 20 Yedidyah Bar David 2016-01-12 06:29:59 UTC

Shirly, please see the discussion in bug 1293844. I agree it's a very reasonable flow, considering the change in bug 1077267. Is it possible e.g. to make dwhd reconnect in the signal handler, right before exiting?

Comment 21 Shirly Radco 2016-01-12 07:36:06 UTC

The connection is restored after 4 seconds.
Losing connection to the db and then immediately closing the dwh is not that common in my opinion.

I believe the fix to this bug should solve the issue in bug 1293844.

Please let me know if this is still an issue.

Comment 22 Yedidyah Bar David 2016-01-12 07:56:11 UTC

(In reply to Shirly Radco from comment #21)
> The connection is restored after 4 seconds.

In bug 1293844 you wrote "a minute". 4 seconds is much less...

> Losing connection to the db and then immediately closing the dwh is not that
> common in my opinion.

I might agree in general, but suppose the following flow:

1. Run engine-setup
get an error saying max_connections is too low (enforced by bug 1077267 change)
2. Edit postgresql.conf and restart it
3. Run engine-setup again
4. engine-setup stops dwhd

> 
> I believe the fix to this bug should solve the issue in bug 1293844.
> 
> Please let me know if this is still an issue.

It is, if between (2.) and (4.) there are less than 4 seconds.

Comment 23 Shirly Radco 2016-01-12 08:16:43 UTC

OK. Please open a bug and add as depends on bug 1077267.

Comment 24 Yedidyah Bar David 2016-01-12 08:52:04 UTC

opened bug 1297682. I do not think it depends on bug 1077267, that one just adds more flows in which it might affect. But let's discuss this there.

Comment 25 Yedidyah Bar David 2016-04-04 11:54:14 UTC

*** Bug 1316973 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.