Bug 1330768 - [VDSM] sometimes one host losing IP address
Summary: [VDSM] sometimes one host losing IP address
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: 4.17.26
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Dan Kenigsberg
QA Contact: Meni Yakove
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-04-26 22:29 UTC by Kobi Hakimi
Modified: 2016-05-05 08:30 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-05 08:30:37 UTC
oVirt Team: Network
gklein: ovirt-3.6.z?
gklein: ovirt-4.0.0?
gklein: blocker?
rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?


Attachments
folder of vdsm log (605.09 KB, application/x-gzip)
2016-04-26 22:29 UTC, Kobi Hakimi
messages log files (626.91 KB, application/x-gzip)
2016-04-26 22:30 UTC, Kobi Hakimi

Description Kobi Hakimi 2016-04-26 22:29:54 UTC
Created attachment 1151136
folder of vdsm log

Description of problem:
[VDSM] sometimes one host losing IP address

Version-Release number of selected component (if applicable):
 Red Hat Enterprise Virtualization Manager Version: 3.6.5.3-0.1.el6 

How reproducible:
sometimes

Steps to Reproduce:
There is no known scenario that reproduces it, but I have hit this issue twice in about a week.
1. With oVirt 4.0:
 - I looked at my GE4 engine a day after installing it
 - host_mixed_1 was inactive
 - When I tried to connect to it, I realized that it had lost its IP address

2. With rhevm 3.6.5:
 - I looked at the GE3 engine after running some tests[1]
 - host_mixed_2 was inactive
 - When I tried to connect to it, I realized that it had lost its IP address


Actual results:
In the events log I saw the following entry, at about the time the host probably lost its IP address:
Apr 26, 2016 5:10:08 PM - VDSM host_mixed_2 command failed: Heartbeat exeeded


Expected results:
The host keeps working without losing its IP address.

Additional info:
danken started to investigate it but did not find the reason.
See the attached vdsm and messages logs.

Comment 1 Kobi Hakimi 2016-04-26 22:30:56 UTC
Created attachment 1151137
messages log files

Comment 2 Dan Kenigsberg 2016-04-27 12:58:38 UTC
The critical /var/log/messages logs (from April 24 to 26) are missing. My guess is that at some point during this interval (Apr 25, ~4PM), dhclient died. 24 hours later, the host lost its lease and died.

Please add the logs of that time when the problem reproduces.
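
When the logs from that interval are available, the reasoning above (a hole in the log coverage, and a last dhclient entry roughly a day before the address disappears) could be checked with something like the following. This is only a minimal sketch: it assumes traditional syslog lines starting with a "Mon DD HH:MM:SS" stamp, and the helper names parse_syslog_time, find_gaps and last_dhclient_entry are made up for illustration, they are not part of vdsm.

```python
from datetime import datetime, timedelta


def parse_syslog_time(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a traditional syslog line.

    Traditional syslog stamps carry no year, so the caller has to supply it.
    """
    stamp = line[:15]
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def find_gaps(lines, year, threshold=timedelta(hours=1)):
    """Return (before, after) timestamp pairs that are further apart than threshold.

    A large gap hints at a missing or rotated-away log file.
    """
    times = [parse_syslog_time(line, year) for line in lines if line.strip()]
    return [(a, b) for a, b in zip(times, times[1:]) if b - a > threshold]


def last_dhclient_entry(lines, year):
    """Return the timestamp of the last line that mentions dhclient, or None."""
    last = None
    for line in lines:
        if "dhclient" in line:
            last = parse_syslog_time(line, year)
    return last
```

Feeding it the lines of the collected messages files (oldest first) with year=2016 would show whether the April 24 to 26 window is really absent and when dhclient last logged anything.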

Comment 3 Kobi Hakimi 2016-05-01 07:02:23 UTC
You are right, this time interval is missing, but that is weird, since I copied the whole msg folder.
Next time I will try to add this log.

Comment 4 Yaniv Lavi 2016-05-04 08:17:58 UTC
If this happens again and you find the log, please reopen.

Comment 5 Kobi Hakimi 2016-05-04 20:43:16 UTC
When I tried to clean the GE5:

Engine: jenkins-vm-13.scl.lab.tlv.redhat.com
Hosts: 	
 - RHEL72:lynx23,24
 - RHEVH72:lynx21,22

Which are installed with:
 Red Hat Enterprise Virtualization Manager Version: 3.6.6-0.1.el6 

I got a connection timeout; see:
https://rhev-jenkins.rhev-ci-vms.eng.rdu2.redhat.com:8443/job/GE-cleaner/1530/consoleFull

When I looked into this error I saw that 2 hosts were not up:
 - lynx24 was in maintenance mode
 - lynx21 was in Non Responsive status
I tried to activate the first and reinstall the second, but both operations failed.
So I tried to ping these machines, but there was no connection at all, so I connected to one of them with ipmitool:
ipmitool -I lanplus -H lynx24-mgmt.qa.lab.tlv.redhat.com -U root -P **** sol activate
and saw:
 1. There is no IP address, as expected.
 2. The command "pgrep dhcliet" returned nothing.
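
The two checks above (does the host have an IPv4 address, and is dhclient running) could be scripted roughly as follows, so they can be run from the serial console in one go. This is a sketch under assumptions: it is Linux only, it reads /proc/<pid>/comm for process names, it parses the one-line output of `ip -o -4 addr show`, and the function names are made up for illustration.

```python
import os
import subprocess


def process_names():
    """Yield the command names of running processes from /proc/<pid>/comm (Linux only)."""
    for entry in os.listdir("/proc"):
        if entry.isdigit():
            try:
                with open(f"/proc/{entry}/comm") as f:
                    yield f.read().strip()
            except OSError:
                continue  # the process exited while we were scanning


def dhclient_running():
    """Return True if some process is named exactly 'dhclient'."""
    return "dhclient" in process_names()


def ipv4_addresses(ip_output):
    """Parse `ip -o -4 addr show` output into {interface: [address/prefix, ...]}."""
    result = {}
    for line in ip_output.splitlines():
        fields = line.split()
        # Expected shape: "<index>: <ifname> inet <addr>/<prefix> ..."
        if len(fields) >= 4 and fields[2] == "inet":
            result.setdefault(fields[1], []).append(fields[3])
    return result


def diagnose(ip_output, dhclient_up, ignore=("lo",)):
    """Summarize the host state, ignoring the loopback interface by default."""
    addrs = {k: v for k, v in ipv4_addresses(ip_output).items() if k not in ignore}
    if addrs:
        return "host has an IPv4 address"
    if not dhclient_up:
        return "no IPv4 address and dhclient is not running"
    return "no IPv4 address although dhclient is running"


def check_local_host():
    """Run both checks on the local machine (requires the iproute2 `ip` tool)."""
    out = subprocess.run(
        ["ip", "-o", "-4", "addr", "show"],
        capture_output=True, text=True, check=True,
    ).stdout
    return diagnose(out, dhclient_running())
```

Calling check_local_host() on the affected host would return one of the three summary strings; matching on the exact process name avoids a false negative caused by a mistyped pattern.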

I left the machine stuck in this state so that it can be investigated, which is why I could not collect the logs.

Could you please investigate it?

Comment 6 Dan Kenigsberg 2016-05-05 06:34:09 UTC
Sorry Kobi, I do not understand. Can you please collect the historic /var/log/messages from the non-responsive host? If not, why?

Comment 7 Kobi Hakimi 2016-05-05 08:30:37 UTC
In this case we found the reason:
one power management test case stopped the network and failed to bring it back up.

Sorry for bothering you all!!
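
One way a test could guard against leaving a host in this state is to restore the network in a cleanup step that runs even when the test body fails. The following is only a sketch of that idea, not the actual test code: stop, start and is_up are hypothetical callables that the test framework would have to provide.

```python
from contextlib import contextmanager


@contextmanager
def network_stopped(stop, start, is_up):
    """Stop the network for the duration of the block and always try to restore it.

    stop, start and is_up are hypothetical callables supplied by the caller.
    """
    stop()
    try:
        yield
    finally:
        # Runs whether the test body passed or raised.
        start()
        if not is_up():
            raise RuntimeError("network did not come back after the test")
```

A test body wrapped in `with network_stopped(stop, start, is_up):` would then either leave the network up or fail loudly with a RuntimeError, instead of silently leaving the host without an address.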

