Bug 1142082

Summary: RHEL7 hosts losing connectivity with engine every day and staying in non-responsive state
Product: Red Hat Enterprise Virtualization Manager Reporter: Michael Burman <mburman>
Component: vdsm Assignee: Antoni Segura Puimedon <asegurap>
Status: CLOSED DUPLICATE QA Contact: Michael Burman <mburman>
Severity: urgent Docs Contact:
Priority: high    
Version: 3.5.0 CC: bazulay, ecohen, gklein, iheim, lpeer, mburman, mpavlik, nyechiel, ogofen, s.kieske, yeylon
Target Milestone: --- Keywords: Triaged, Unconfirmed
Target Release: 3.5.0   
Hardware: x86_64   
OS: Linux   
Whiteboard: network
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2014-10-01 07:57:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Network RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1147536, 1164308, 1164311    
Attachments:
Description | Flags
Relevant logs - host losing connectivity | none
/var/log/messages from my 2 rhel7 hosts | none
Connectivity.logs from my 2 rhel7 hosts | none

Description Michael Burman 2014-09-16 06:48:16 UTC
Created attachment 937862 [details]
Relevant logs - host losing connectivity

Description of problem:
RHEL7 hosts lose connectivity with the engine and stay in a non-responsive state until the network service is restarted; only then does the host come back up.
It happens with both DHCP and static IP configured on rhevm.
BOOTPROTO=dhcp/static
ONBOOT=yes


Version-Release number of selected component (if applicable):
3.5.0-0.12.beta.el6ev

How reproducible:
Every day

Steps to Reproduce:
1. Working setup with rhel7 host

Actual results:
The host loses connectivity during the evening/night and stays in a non-responsive state.

Expected results:
The host shouldn't lose connectivity with the engine. If it does lose connectivity, I expect it to come back on its own.

Additional info:

Comment 1 Michael Burman 2014-09-16 06:56:15 UTC
Created attachment 937865 [details]
/var/log/messages from my 2 rhel7 hosts

Comment 2 Michael Burman 2014-09-16 06:59:55 UTC
Created attachment 937867 [details]
Connectivity.logs from my 2 rhel7 hosts

Comment 3 Michael Burman 2014-09-16 07:02:14 UTC
I attached relevant logs from my two rhel7 servers.
vdsm.logs
supervdsm.logs
/var/log/messages

connectivity.logs - in these logs you can see when the host lost connectivity with the engine.

Comment 4 Antoni Segura Puimedon 2014-09-22 10:41:40 UTC
It's not happening with static IPs; there is some issue with dhcp. Could you check on your machines whether the dhclient process is still alive after a few hours of running with dhcp?
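
One way to run that check unattended (for example from a periodic job overnight) is sketched below. This is only an illustration, not something taken from vdsm: the helper name `find_pids_by_name` and its `proc_root` parameter are hypothetical, and it assumes a Linux host where `/proc/<pid>/comm` holds the process name.

```python
import os


def find_pids_by_name(name, proc_root="/proc"):
    """Return the sorted PIDs whose <proc_root>/<pid>/comm equals `name`.

    Hypothetical helper for illustration; `proc_root` is a parameter only
    so the function can be exercised against a fake directory tree.
    """
    pids = []
    for entry in os.listdir(proc_root):
        # Only numeric directory names under /proc are processes.
        if not entry.isdigit():
            continue
        comm_path = os.path.join(proc_root, entry, "comm")
        try:
            with open(comm_path) as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            # The process exited between listdir() and open(); skip it.
            continue
    return sorted(pids)
```

Calling `find_pids_by_name("dhclient")` on the host returns an empty list once the process has quit, so logging the result with a timestamp every few minutes would show roughly when dhclient went away.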

Comment 5 Michael Burman 2014-09-22 11:07:13 UTC
I will check that

Comment 6 Michael Burman 2014-09-23 11:14:03 UTC
I changed my rhel7 host from static ip to dhcp, and during the night the host lost connectivity with the engine.
I'm not sure whether the dhclient process was alive at that point, but it was alive the last time I checked before going home.

Comment 7 Ori Gofen 2014-09-23 13:16:38 UTC
I have changed the priority to urgent. This bug frequently aborts many of my test scripts, and all the storage guys hit this issue on a nearly daily basis.

Comment 8 Antoni Segura Puimedon 2014-09-24 09:54:32 UTC
It was alive when you checked before going home, was it alive when you returned the next day? (Even after losing connectivity)

Comment 9 Michael Burman 2014-09-28 06:01:10 UTC
Hi Toni,

The dhclient process wasn't alive when I returned the next day.

Comment 10 Antoni Segura Puimedon 2014-09-29 09:00:08 UTC
Thanks Michael. I managed to reproduce it as well on f20. We need to find out why the dhclient process quits.

Comment 11 Michael Burman 2014-09-29 09:09:52 UTC
Ok Toni, Thank you.

We are all waiting for a solution.

For now, I have configured all my rhel7 hosts with static ip so they won't lose connectivity.

Comment 12 Antoni Segura Puimedon 2014-10-01 07:57:32 UTC
Ok, I went to Michael's machine and, after talking with Jiři Popelka, applied the patch for https://bugzilla.redhat.com/show_bug.cgi?id=1116004 there. The patch in question checks whether the arping answer belongs to a MAC address on the machine itself.
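
For readers unfamiliar with that kind of check, here is a rough sketch of the idea in Python. It is not the actual patch from bug 1116004; the function names are hypothetical, and it assumes a Linux host where each interface exposes its MAC in `/sys/class/net/<iface>/address`.

```python
import os


def local_mac_addresses(sys_net="/sys/class/net"):
    """Collect the lowercase MAC addresses of all local interfaces.

    Hypothetical helper; `sys_net` is a parameter only so the function
    can be pointed at a fake directory tree.
    """
    macs = set()
    for iface in os.listdir(sys_net):
        try:
            with open(os.path.join(sys_net, iface, "address")) as f:
                mac = f.read().strip().lower()
        except OSError:
            # Not an interface directory, or it vanished; skip it.
            continue
        if mac:
            macs.add(mac)
    return macs


def is_foreign_reply(reply_mac, local_macs):
    """Return True if an ARP reply came from a MAC that is not one of ours."""
    return reply_mac.strip().lower() not in local_macs
```

With a check like this, a reply for the address being probed only counts as an address conflict when `is_foreign_reply()` is true, so an answer from one of the host's own interfaces is ignored.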

The issue didn't happen again, and I can confirm that the case that was making it fail was the same. Thus, I am marking this as a duplicate of bz#1116004.

*** This bug has been marked as a duplicate of bug 1116004 ***