Bug 1286441
Summary: | If the dwhd lost the original connection to the db it will not update the audit_log and dwh_history_timekeeping when dwh is stopped | ||||||
---|---|---|---|---|---|---|---|
Product: | [oVirt] ovirt-engine-dwh | Reporter: | Yedidyah Bar David <didi> | ||||
Component: | ETL | Assignee: | Shirly Radco <sradco> | ||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Petr Matyáš <pmatyas> | ||||
Severity: | medium | Docs Contact: | |||||
Priority: | medium | ||||||
Version: | 3.6.0 | CC: | bugs, didi, dyasny, gklein, jbelka, jentrena, pmatyas, sradco, ylavi | ||||
Target Milestone: | ovirt-3.6.2 | Flags: | rule-engine:
ovirt-3.6.z+
ylavi: planning_ack+ rule-engine: devel_ack+ pstehlik: testing_ack+ |
||||
Target Release: | 3.6.2 | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Whiteboard: | |||||||
Fixed In Version: | rhevm-dwh-3.6.2-1.el6ev | Doc Type: | Bug Fix | ||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2016-02-18 11:17:34 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | Metrics | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 1283458, 1283944, 1285700 | ||||||
Attachments: |
|
Description
Yedidyah Bar David
2015-11-29 14:23:07 UTC
A workaround: service ovirt-engine-dwhd restart Did you stop the ovirt-engine-dwhd service right after the postgresql restart? Maybe the connection to the db was unable to reconnect before you stopped the ovirt-engine-dwhd service so it could not update the DwhCurrentlyRunning value ? If this is the reason it will be solved only after an hour when the cron job restart the ovirt-engine-dwhd service. Created attachment 1100373 [details] ovirt-engine-dwhd.log (In reply to Shirly Radco from comment #2) > Did you stop the ovirt-engine-dwhd service right after the postgresql > restart? Yes, after a few seconds. But see below. > Maybe the connection to the db was unable to reconnect before you stopped > the ovirt-engine-dwhd service so it could not update the DwhCurrentlyRunning > value ? Indeed, but why? That's the bug here. IMO it should reconnect, update, then stop. BTW, note that I didn't write in the summary "dwhd fails to cleanly stop if pg is restarted". I expect it to just continue its normal work. Cleanly stopping is just a part of this. > > If this is the reason it will be solved only after an hour when the cron job > restart the ovirt-engine-dwhd service. This is not a solution. I think we must expect everything to work correctly even without this cron job. Still, I now tried this again and it didn't work: 1. start dwhd 2. restart pg waited dwhd log has some errors added once a minute 'service ovirt-engine-dwhd status' returns 0 and says it's running manually ran the cron job, it didn't restart (obviously, because 'status' returned 0) 3. stopped dwhd The log shows it stopped, plus an error DwhCurrentlyRunning is still 1 Attaching log. We need to update the tJDBCConnection for the components that update the engine db when dwhd is stopped to tJDBCConnection_7 that is refreshed each time the connection is lost. We should also consider not stopping the dwhd before updating the engine or add a message to the ovirt-engine-dwhd.log that the engine could not be updated and this needs to be done manually or it will be fixed once the engine restarts. (In reply to Shirly Radco from comment #4) > We need to update the tJDBCConnection for the components that update the > engine db when dwhd is stopped to tJDBCConnection_7 that is refreshed each > time the connection is lost. > > We should also consider not stopping the dwhd before updating the engine But then it will continue running against an updated database. Do we check if the database is compatible per sample? Or only at startup? > or add a message to the ovirt-engine-dwhd.log that the engine could not be > updated and this needs to be done manually or it will be fixed once the > engine restarts. Not sure I follow. Why? The normal flow when running engine-setup is fine imo. We only need to make it work well also if pg was restarted before it. This bug is not marked for z-stream, yet the milestone is for a z-stream version, therefore the milestone has been reset. Please set the correct milestone or add the z-stream flag. *** Bug 1284438 has been marked as a duplicate of this bug. *** Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA. Not working correctly on 3.6.2-3: engine=# select * from dwh_history_timekeeping where var_name='DwhCurrentlyRunning'; var_name | var_value | var_datetime ---------------------+-----------+-------------- DwhCurrentlyRunning | 1 | This shows up in dwhd log: Exception in component tJDBCRollback_3 org.postgresql.util.PSQLException: FATAL: terminating connection due to administrator command at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2094) at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1827) at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:255) at org.postgresql.jdbc2.AbstractJdbc2Connection.executeTransactionCommand(AbstractJdbc2Connection.java:669) at org.postgresql.jdbc2.AbstractJdbc2Connection.rollback(AbstractJdbc2Connection.java:706) at ovirt_engine_dwh.historyetl_3_6.HistoryETL.tJDBCRollback_3Process(HistoryETL.java:2088) at ovirt_engine_dwh.historyetl_3_6.HistoryETL.tJDBCRollback_2Process(HistoryETL.java:2003) at ovirt_engine_dwh.historyetl_3_6.HistoryETL.tJDBCRollback_1Process(HistoryETL.java:1880) at ovirt_engine_dwh.historyetl_3_6.HistoryETL.tPostjob_1Process(HistoryETL.java:1752) at ovirt_engine_dwh.historyetl_3_6.HistoryETL.runJobInTOS(HistoryETL.java:11210) at ovirt_engine_dwh.historyetl_3_6.HistoryETL.main(HistoryETL.java:10631) 2016-01-04 15:30:46|DpaZCu|DpaZCu|DpaZCu|OVIRT_ENGINE_DWH|HistoryETL|Default|6|Java Exception|tJDBCRollback_3|org.postgresql.util.PSQLException:FATAL: terminating connection due to administrator command|1 (In reply to Petr Matyáš from comment #9) > Not working correctly on 3.6.2-3: Please provide accurate detailed flow. See also "Steps to Reproduce:" in the Description. Target release should be placed once a package build is known to fix a issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for a oVirt release. Exactly the same as in Description. Engine is upgraded from previous version of RHEV 3.6. I checked dwhd and postgresql status (both were running). Restarted postgresql and then stopped dwhd. Please test this after waiting a minute from the postgres restart. If this works then please verify since this is by design. Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA. Please test this after waiting a minute from the postgres restart. If this works then please verify since this is by design. Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA. Tested again with a few minutes wait before stopping dwhd. Now the DB is set correctly and no exceptions are shown after ETL service stopped. *** Bug 1295412 has been marked as a duplicate of this bug. *** *** Bug 1293844 has been marked as a duplicate of this bug. *** Shirly, please see the discussion in bug 1293844. I agree it's a very reasonable flow, considering the change in bug 1077267. Is it possible e.g. to make dwhd reconnect in the signal handler, right before exiting? The connection is restored after 4 seconds. Losing connection to the db and then immediately closing the dwh is not that common in my opinion. I believe the fix to this bug should solve the issue in bug 1293844. Please let me know if this is still an issue. (In reply to Shirly Radco from comment #21) > The connection is restored after 4 seconds. In bug 1293844 you wrote "a minute". 4 seconds is much less... > Losing connection to the db and then immediately closing the dwh is not that > common in my opinion. I might agree in general, but suppose the following flow: 1. Run engine-setup get an error saying max_connections is too low (enforced by bug 1077267 change) 2. Edit postgresql.conf and restart it 3. Run engine-setup again 4. engine-setup stops dwhd > > I believe the fix to this bug should solve the issue in bug 1293844. > > Please let me know if this is still an issue. It is, if between (2.) and (4.) there are less than 4 seconds. OK. Please open a bug and add as depends on bug 1077267. opened bug 1297682. I do not think it depends on bug 1077267, that one just adds more flows in which it might affect. But let's discuss this there. *** Bug 1316973 has been marked as a duplicate of this bug. *** |