Bug 1142082 - RHEL7 Hosts losing connectivity with engine every day and stay in non-responsive state
Summary: RHEL7 Hosts losing connectivity with engine every day and stay in non-respon...
Keywords:
Status: CLOSED DUPLICATE of bug 1116004
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.5.0
Hardware: x86_64
OS: Linux
Severity: high
Priority: urgent
Target Milestone: ---
Target Release: 3.5.0
Assignee: Antoni Segura Puimedon
QA Contact: Michael Burman
URL:
Whiteboard: network
Depends On:
Blocks: rhev35betablocker rhev35rcblocker rhev35gablocker
Reported: 2014-09-16 06:48 UTC by Michael Burman
Modified: 2016-02-10 19:52 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-10-01 07:57:32 UTC
oVirt Team: Network
Target Upstream Version:
Embargoed:


Attachments
Relevant logs - host losing connectivity (1.31 MB, application/octet-stream)
2014-09-16 06:48 UTC, Michael Burman
/var/log/messages from my 2 rhel7 hosts (54.83 KB, application/octet-stream)
2014-09-16 06:56 UTC, Michael Burman
Connectivity.logs from my 2 rhel7 hosts (4.69 KB, application/octet-stream)
2014-09-16 06:59 UTC, Michael Burman

Description Michael Burman 2014-09-16 06:48:16 UTC
Created attachment 937862
Relevant logs - host losing connectivity

Description of problem:
RHEL7 hosts lose connectivity with the engine and stay in a non-responsive state until the network service is restarted; only then does the host come back up.
It happens with both DHCP and static IP configured on rhevm.
BOOTPROTO=dhcp/static
ONBOOT=yes


Version-Release number of selected component (if applicable):
3.5.0-0.12.beta.el6ev

How reproducible:
Every day

Steps to Reproduce:
1. Working setup with rhel7 host

Actual results:
Host loses connectivity during the evening/night and stays in a non-responsive state.

Expected results:
Host shouldn't lose connectivity with the engine. If it does lose connectivity, I expect it to recover and reconnect on its own.

Additional info:

Comment 1 Michael Burman 2014-09-16 06:56:15 UTC
Created attachment 937865
/var/log/messages from my 2 rhel7 hosts

Comment 2 Michael Burman 2014-09-16 06:59:55 UTC
Created attachment 937867
Connectivity.logs from my 2 rhel7 hosts

Comment 3 Michael Burman 2014-09-16 07:02:14 UTC
I attached relevant logs from my two rhel7 servers.
vdsm.logs
supervdsm.logs
/var/log/messages

connectivity.logs: in these logs you can see when the host lost connectivity with the engine.

Comment 4 Antoni Segura Puimedon 2014-09-22 10:41:40 UTC
It's not happening with static IPs; there is some issue with DHCP. Could you check on your machines whether the dhclient process is still alive after a few hours of running with DHCP?
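
One way to run the check requested above is sketched here. It is only an illustration, not part of vdsm: the function name `find_pids_by_name` is hypothetical, and it assumes a Linux host where each process exposes its command name in `/proc/<pid>/comm`.

```python
import os


def find_pids_by_name(name, proc_root="/proc"):
    """Return the sorted PIDs whose /proc/<pid>/comm equals `name`.

    `proc_root` is a parameter only so the function can be exercised
    against a fake directory tree; on a real host leave the default.
    """
    pids = []
    for entry in os.listdir(proc_root):
        # Only numeric directory names in /proc correspond to processes.
        if not entry.isdigit():
            continue
        comm_path = os.path.join(proc_root, entry, "comm")
        try:
            with open(comm_path) as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            # The process exited between listdir() and open(),
            # or its comm file is missing or unreadable.
            continue
    return sorted(pids)
```

Calling `find_pids_by_name("dhclient")` on the host right after it goes non-responsive returns an empty list if the DHCP client has exited. Running it periodically (for example from cron) and logging the result would narrow down when the process disappears.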

Comment 5 Michael Burman 2014-09-22 11:07:13 UTC
I will check that

Comment 6 Michael Burman 2014-09-23 11:14:03 UTC
I changed my rhel7 host from static IP to DHCP, and during the night the host lost connectivity with the engine.
I'm not sure if the dhclient process was alive at that point, but it was alive the last time I checked before going home.

Comment 7 Ori Gofen 2014-09-23 13:16:38 UTC
I have changed the priority to urgent. This bug frequently aborts many of my test scripts, and all the storage guys hit this issue on a nearly daily basis.

Comment 8 Antoni Segura Puimedon 2014-09-24 09:54:32 UTC
It was alive when you checked before going home, was it alive when you returned the next day? (Even after losing connectivity)

Comment 9 Michael Burman 2014-09-28 06:01:10 UTC
Hi Toni,

The dhclient process wasn't alive when I returned the next day.

Comment 10 Antoni Segura Puimedon 2014-09-29 09:00:08 UTC
Thanks Michael. I managed to reproduce it as well on f20. We need to find out why the dhclient process quits.

Comment 11 Michael Burman 2014-09-29 09:09:52 UTC
OK Toni, thank you.

We are all waiting for a solution.

For now, I have configured all my rhel7 hosts with static IPs, so they won't lose connectivity.

Comment 12 Antoni Segura Puimedon 2014-10-01 07:57:32 UTC
OK, I went to Michael's machine and, after talking with Jiři Popelka, applied the patch for https://bugzilla.redhat.com/show_bug.cgi?id=1116004 there. The patch in question checks whether the arping answer comes from a MAC address that belongs to the machine itself.
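
The kind of check described above can be illustrated with a short sketch. This is not the actual patch from bug 1116004; the helper names (`normalize_mac`, `local_macs`, `is_address_conflict`) are hypothetical. The only assumption about the system is the standard Linux sysfs layout, where each interface's hardware address is readable from `/sys/class/net/<iface>/address`.

```python
import glob


def normalize_mac(mac):
    """Lower-case a MAC address and use ':' as the separator."""
    return mac.strip().lower().replace("-", ":")


def local_macs(sys_net="/sys/class/net"):
    """Collect the MAC addresses of all interfaces on this machine."""
    macs = set()
    for path in glob.glob(sys_net + "/*/address"):
        try:
            with open(path) as f:
                mac = normalize_mac(f.read())
        except OSError:
            # Interface vanished or the file is unreadable; skip it.
            continue
        if mac:
            macs.add(mac)
    return macs


def is_address_conflict(reply_mac, own_macs):
    """Treat an ARP reply as a real duplicate-address conflict only
    when the replying MAC does not belong to this machine."""
    own = {normalize_mac(m) for m in own_macs}
    return normalize_mac(reply_mac) not in own
```

For example, `is_address_conflict(reply_mac, local_macs())` returns False when the reply came from one of the host's own interfaces (such as a bridge that shares the NIC's MAC), so a reply of that kind would not be treated as another machine holding the address.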

The issue didn't happen again, and I can confirm that the case that was making it fail was the same. Thus, I am marking this as a duplicate of bz#1116004.

*** This bug has been marked as a duplicate of bug 1116004 ***

