Bug 1848353 - HostedEngine caches time from Host
Summary: HostedEngine caches time from Host
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.HostedEngine
Version: 4.3.10.4
Hardware: x86_64
OS: Linux
unspecified
medium
Target Milestone: ovirt-4.4.1
: ---
Assignee: Eli Mesika
QA Contact: meital avital
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-06-18 08:31 UTC by Strahil Nikolov
Modified: 2020-07-13 11:25 UTC (History)
4 users (show)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2020-07-12 10:18:17 UTC
oVirt Team: Infra
Embargoed:
pm-rhel: ovirt-4.4+


Attachments (Terms of Use)

Description Strahil Nikolov 2020-06-18 08:31:45 UTC
Description of problem:
A Hypervisor time has drifted away and once the time issue was fixed , the engine never realized the update (time was cached). 

Version-Release number of selected component (if applicable):
ovirt-engine-4.3.10.4-1.el7.noarch

How reproducible:
Happened once, but most probably will reoccur if the Host drifts back/ahead

Steps to Reproduce:
1.Host drifts in time
2.HE was migrated away and the host was rebooted
3.HE was returned back to the host , but the engine's internal timekeeping remained ahead in time

Actual results:
When the HE is migrated , the time of the OS get's in sync - but the engine itself never realizes that (cause it's cached)

Expected results:
Ovirt Engine should get it's time from the HE OS and not cache it.

Additional info:
For details check https://www.mail-archive.com/users@ovirt.org/msg61439.html

Comment 1 Michal Skrivanek 2020-06-26 14:35:20 UTC
can you add more thoughts? the meail thread gives some hints but it looks like you do know a bit more about the problem

Comment 2 RHEL Program Management 2020-06-26 14:35:28 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 3 Eli Mesika 2020-06-28 10:27:24 UTC
(In reply to Michal Skrivanek from comment #1)
> can you add more thoughts? the meail thread gives some hints but it looks
> like you do know a bit more about the problem

Source of problem is VdsBrokerObjectsBuilder::checkTimeDrift , it calls assignDatetimeValue to store the host date value, but when you look at assignDatetimeValue you see that it first tries to get the date from the cache and not using the fresh date passed for it in the "input" strcut, this was used probably for performance reasons but IMO we can always override this value in the cache in order to follow changes made to the host time

Comment 4 Eli Mesika 2020-06-30 09:18:29 UTC
After trying to reproduce on 4.4 I saw that setting a date with time-drift > 300 seconds on the host using timedatectl to bring ntp down and set the date-time was recognized by the engine and it starts to log those messages while bringing the ntp up again is recognized by the engine and stops the messages for that host without engine restart. So, after a closer look at the code and testing the scenario the current host date-time value is used. However it took some time ( < 1 min) until the engine got the correct date 

Do you still have the engine log and vdsm log for the relevant host in order to explore the scenario deeper in your environment ?

If not , can you please retest the following :

1) on the HE 

timedatectl set-ntp no

2)

timedatectl set-time <any time with time-drift > 300 from current time>

3) Follow engine.log , you should see those tinme-drift messages 

4) on the HE

timedatectl set-ntp yes

5)

engine.log will not write any of that messages after a while ( < 1 min)

Comment 5 Eli Mesika 2020-07-01 22:14:42 UTC
(In reply to Eli Mesika from comment #4)


Sorry, please test as follows : 

> 1) on the HE 

On the host with the time-drift , not the HE

> 
> timedatectl set-ntp no
> 
> 2)
> 
> timedatectl set-time <any time with time-drift > 300 from current time>
> 
> 3) Follow engine.log , you should see those tinme-drift messages 
> 
> 4) on the HE

On the host with the time-drift  , not the HE


> 
> timedatectl set-ntp yes
> 
> 5)
> 
> engine.log will not write any of that messages after a while ( < 1 min)

Comment 6 Strahil Nikolov 2020-07-02 03:26:07 UTC
I do have the logs, but due to logrotate issue one of the nodes have a 20G compressed log, so I need to extract that one.

Comment 7 Strahil Nikolov 2020-07-02 21:01:52 UTC
It seems that the engine logs are missing.

Test:
1. on ovirt2 set timedatectl set-ntp no
2. on ovirt2 set time -40 min
3. 5-6 min later synced the node back.

Results:
All 3 VMs on ovirt2 has a "not responding" mark (questionmark for status) during the timedrift situation, but systems are responding and accessible over ssh.
1 event for the drift was reported.
```
2020-07-02 23:41:45,325+03 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-42) [] EVENT_ID: VDS_TIME_DRIFT_ALERT(604), Host ovirt2.localdomain has time-drift of 2494 seconds while maximum configured value is 300 seconds.
```

Comment 8 Eli Mesika 2020-07-05 10:56:22 UTC
So, it seems the engine got the time sync and stop reporting the drift ....

Comment 9 Strahil Nikolov 2020-07-05 13:09:45 UTC
I will try to reproduce but this time with live migration to the other host.

Comment 10 Strahil Nikolov 2020-07-07 19:40:36 UTC
I also tried to:
1. Set 'timedatectl set-ntp no' on ovirt1
2. Set global maintenance
3. Power off the engine (simulate patch + reboot)
4. Power on the HE on ovirt1

Results: No time-drift reported.


I guess it is a very hard to reproduce issue. You can close it as fixing something that cannot be reproduced will be extremely difficult.

Comment 11 Eli Mesika 2020-07-12 10:18:17 UTC
Closing as WORKSFORME as suggested in comment 10 , please feel free to reopen when you have a clear reproduce scenario


Note You need to log in before you can comment on or make changes to this bug.