Bug 1330768

Summary: [VDSM] sometimes one host loses its IP address
Product: [oVirt] vdsm
Reporter: Kobi Hakimi <khakimi>
Component: General
Assignee: Dan Kenigsberg <danken>
Status: CLOSED NOTABUG
QA Contact: Meni Yakove <myakove>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 4.17.26
CC: bugs, gklein, khakimi, ylavi
Target Milestone: ---
Keywords: AutomationBlocker
Target Release: ---
Flags: gklein: ovirt-3.6.z?
       gklein: ovirt-4.0.0?
       gklein: blocker?
       rule-engine: planning_ack?
       rule-engine: devel_ack?
       rule-engine: testing_ack?
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-05-05 08:30:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: ---
Category: ---
oVirt Team: Network
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments:
 - folder of vdsm log (flags: none)
 - messages log files (flags: none)

Description Kobi Hakimi 2016-04-26 22:29:54 UTC
Created attachment 1151136 [details]
folder of vdsm log

Description of problem:
[VDSM] sometimes one host loses its IP address

Version-Release number of selected component (if applicable):
 Red Hat Enterprise Virtualization Manager Version: 3.6.5.3-0.1.el6 

How reproducible:
sometimes

Steps to Reproduce:
There is no scenario to reproduce it, but this is the second time I have hit this issue in roughly a week.
1. with oVirt 4.0:
 - I looked at my GE4 engine the day after installing it
 - host_mixed_1 was inactive
 - when I tried to connect to it, I realized it had lost its IP address

2. with RHEV-M 3.6.5:
 - I looked at the GE3 engine after running some tests[1]
 - host_mixed_2 was inactive
 - when I tried to connect to it, I realized it had lost its IP address


Actual results:
In the events log, at about the time the host presumably lost its IP, I saw:
Apr 26, 2016 5:10:08 PM - VDSM host_mixed_2 command failed: Heartbeat exceeded


Expected results:
The host keeps working without losing its IP address.

Additional info:
danken started to investigate but did not find the cause.
see attached vdsm and messages logs

Comment 1 Kobi Hakimi 2016-04-26 22:30:56 UTC
Created attachment 1151137 [details]
messages log files

Comment 2 Dan Kenigsberg 2016-04-27 12:58:38 UTC
Critical /var/log/messages logs (from April 24 to 26) are missing. My guess is that at some point during this interval (Apr 25, ~4 PM), dhclient died. 24 hours later, the host lost its lease and dropped off the network.

Please add the logs of that time when the problem reproduces.

Comment 3 Kobi Hakimi 2016-05-01 07:02:23 UTC
You are right, this time interval is missing, which is weird, since I copied the whole messages folder.
Next time I will try to add this log.

Comment 4 Yaniv Lavi 2016-05-04 08:17:58 UTC
If this happens again and you find the log, please reopen.

Comment 5 Kobi Hakimi 2016-05-04 20:43:16 UTC
When I tried to clean the GE5:

Engine: jenkins-vm-13.scl.lab.tlv.redhat.com
Hosts: 	
 - RHEL72:lynx23,24
 - RHEVH72:lynx21,22

which were installed with:
 Red Hat Enterprise Virtualization Manager Version: 3.6.6-0.1.el6 

I got a connection timeout; see:
https://rhev-jenkins.rhev-ci-vms.eng.rdu2.redhat.com:8443/job/GE-cleaner/1530/consoleFull

When I looked into this error, I saw that two hosts were not up:
 - lynx24 was in maintenance mode
 - lynx21 was in Non Responsive status
I tried to activate the first and reinstall the second, but both operations failed,
so I tried to ping these machines. There was no connection at all, so I connected to one of them with ipmitool:
ipmitool -I lanplus -H lynx24-mgmt.qa.lab.tlv.redhat.com -U root -P **** sol activate
and saw:
 1. There is no IP, as expected.
 2. The command "pgrep dhclient" returned nothing.

I left the machine stuck in this state so it can be investigated, which is why I could not collect the logs.

Could you please investigate it?
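For the record, the two console checks above (is dhclient alive, and what is the lease state) can be run together from the SOL session. This is only a sketch: the lease file path below is the stock dhclient location on RHEL 7 and may differ per NIC naming and configuration.

```shell
#!/bin/sh
# Sketch: from the IPMI SOL console, check whether a dhclient process
# is still alive and report the most recently recorded lease expiry.
# LEASES is the default dhclient lease path on RHEL 7 (an assumption;
# adjust for your interface/configuration).
LEASES=/var/lib/dhclient/dhclient.leases

if pgrep -x dhclient >/dev/null 2>&1; then
    echo "dhclient is running"
else
    echo "dhclient is NOT running"
fi

# Show the expiry of the last lease dhclient wrote, if the file exists
if [ -r "$LEASES" ]; then
    grep 'expire' "$LEASES" | tail -n 1
else
    echo "no lease file at $LEASES"
fi
```

If dhclient is dead and the last lease has expired, the host would have no IP, matching what was seen on lynx24.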

Comment 6 Dan Kenigsberg 2016-05-05 06:34:09 UTC
Sorry Kobi, I do not understand. Can you please collect the historic /var/log/messages from the non-responsive host? If not, why?

Comment 7 Kobi Hakimi 2016-05-05 08:30:37 UTC
In this case we found the cause:
one power management test case stopped the network and then failed to bring it back up.

Sorry for interrupting you all!