Description of problem: A Hypervisor time has drifted away and once the time issue was fixed , the engine never realized the update (time was cached). Version-Release number of selected component (if applicable): ovirt-engine-4.3.10.4-1.el7.noarch How reproducible: Happened once, but most probably will reoccur if the Host drifts back/ahead Steps to Reproduce: 1.Host drifts in time 2.HE was migrated away and the host was rebooted 3.HE was returned back to the host , but the engine's internal timekeeping remained ahead in time Actual results: When the HE is migrated , the time of the OS get's in sync - but the engine itself never realizes that (cause it's cached) Expected results: Ovirt Engine should get it's time from the HE OS and not cache it. Additional info: For details check https://www.mail-archive.com/users@ovirt.org/msg61439.html
can you add more thoughts? the meail thread gives some hints but it looks like you do know a bit more about the problem
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.
(In reply to Michal Skrivanek from comment #1) > can you add more thoughts? the meail thread gives some hints but it looks > like you do know a bit more about the problem Source of problem is VdsBrokerObjectsBuilder::checkTimeDrift , it calls assignDatetimeValue to store the host date value, but when you look at assignDatetimeValue you see that it first tries to get the date from the cache and not using the fresh date passed for it in the "input" strcut, this was used probably for performance reasons but IMO we can always override this value in the cache in order to follow changes made to the host time
After trying to reproduce on 4.4 I saw that setting a date with time-drift > 300 seconds on the host using timedatectl to bring ntp down and set the date-time was recognized by the engine and it starts to log those messages while bringing the ntp up again is recognized by the engine and stops the messages for that host without engine restart. So, after a closer look at the code and testing the scenario the current host date-time value is used. However it took some time ( < 1 min) until the engine got the correct date Do you still have the engine log and vdsm log for the relevant host in order to explore the scenario deeper in your environment ? If not , can you please retest the following : 1) on the HE timedatectl set-ntp no 2) timedatectl set-time <any time with time-drift > 300 from current time> 3) Follow engine.log , you should see those tinme-drift messages 4) on the HE timedatectl set-ntp yes 5) engine.log will not write any of that messages after a while ( < 1 min)
(In reply to Eli Mesika from comment #4) Sorry, please test as follows : > 1) on the HE On the host with the time-drift , not the HE > > timedatectl set-ntp no > > 2) > > timedatectl set-time <any time with time-drift > 300 from current time> > > 3) Follow engine.log , you should see those tinme-drift messages > > 4) on the HE On the host with the time-drift , not the HE > > timedatectl set-ntp yes > > 5) > > engine.log will not write any of that messages after a while ( < 1 min)
I do have the logs, but due to logrotate issue one of the nodes have a 20G compressed log, so I need to extract that one.
It seems that the engine logs are missing. Test: 1. on ovirt2 set timedatectl set-ntp no 2. on ovirt2 set time -40 min 3. 5-6 min later synced the node back. Results: All 3 VMs on ovirt2 has a "not responding" mark (questionmark for status) during the timedrift situation, but systems are responding and accessible over ssh. 1 event for the drift was reported. ``` 2020-07-02 23:41:45,325+03 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-42) [] EVENT_ID: VDS_TIME_DRIFT_ALERT(604), Host ovirt2.localdomain has time-drift of 2494 seconds while maximum configured value is 300 seconds. ```
So, it seems the engine got the time sync and stop reporting the drift ....
I will try to reproduce but this time with live migration to the other host.
I also tried to: 1. Set 'timedatectl set-ntp no' on ovirt1 2. Set global maintenance 3. Power off the engine (simulate patch + reboot) 4. Power on the HE on ovirt1 Results: No time-drift reported. I guess it is a very hard to reproduce issue. You can close it as fixing something that cannot be reproduced will be extremely difficult.
Closing as WORKSFORME as suggested in comment 10 , please feel free to reopen when you have a clear reproduce scenario