Bug 801747

Summary: [RHEV-H] iSCSI and NIC Problems
Product: Red Hat Enterprise Linux 6 Reporter: Michael <bestell>
Component: kernel    Assignee: Ademar Reis <areis>
kernel sub component: Virtualization QA Contact: Virtualization Bugs <virt-bugs>
Status: CLOSED INSUFFICIENT_DATA Docs Contact:
Severity: urgent    
Priority: high CC: bazulay, iheim, kroberts, riehecky
Version: 6.2   
Target Milestone: rc   
Target Release: 6.2   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Dell PowerEdge R415, 2x AMD Opteron 4184, 64GB RAM, rack chassis, up to 4x 3.5" hot-plug HDDs, LCD diagnostics; AMD Opteron 4184 processor (2.8GHz, 6C, 6x512KB L2/6MB L3 cache, 75W ACP), DDR3-1333MHz; 1U rack bezel; 64GB memory for 2 CPUs (8x8GB dual-rank LV RDIMMs) 1066MHz, using 1333MHz DIMMs; additional AMD Opteron 4184 processor (2.8GHz, 6C, 6x512KB L2/6MB L3 cache, 75W ACP); 2x 450GB SAS 6Gbit/s 15k 3.5" hot-plug hard drives; SAS 6/iR controller for hot-plug drives; chassis with no optical drive; redundant power supply (2 PSUs), 500W; 2x 2M rack power cords C13/C14 12A; Intel Gigabit ET Quad Port Server Adapter, Cu, PCIe x4; iDRAC6 Express; sliding ready rack rails; C12 hot-swap - R1 for SAS 6iR, exactly 2 SAS/SATA hot-plug drives
Last Closed: 2015-02-18 17:32:58 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 846704, 961026, 1164899    

Description Michael 2012-03-09 11:12:13 UTC
Description of problem:

After a fresh installation, all pings were OK.
rhevh02 was the SPM.

I then ran the following test:
- installed a virtual machine (testvm01, Win2008r2)
- live migrated this machine many times between rhevh01 and rhevh02 (all OK)
- put host rhevh01 into maintenance; the virtual machine migrated to rhevh02.
- then restarted rhevh01 from the admin console (directly on the server).
- after the restart I confirmed (with the checkbox) that the host had been restarted.
- then activated host rhevh01
(display to the VM OK, and the virtual machine on rhevh01 OK too)
- then did the same with rhevh02
(display to the VM OK, and the virtual machine on rhevh02 OK too)
- then again with rhevh01, ...
but this time, when I put this host into maintenance, I got the following errors on the login screen:

end_request: I/O error, dev dm-10, sector 0
end_request: I/O error, dev dm-10, sector 0
end_request: I/O error, dev dm-10, sector 8
end_request: I/O error, dev dm-10, sector 128
end_request: I/O error, dev dm-10, sector 0
end_request: I/O error, dev dm-10, sector 0
end_request: I/O error, dev dm-10, sector 4096
end_request: I/O error, dev dm-10, sector 0
end_request: I/O error, dev dm-10, sector 0
....
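(For reference, a rough way to see which LUN dm-10 maps to from the RHEV-H console, assuming device-mapper-multipath and iscsi-initiator-utils are installed:)

# List device-mapper devices with their major:minor numbers (dm-10 = minor 10)
dmsetup ls
# Show the multipath maps, their dm-N names and the backing iSCSI paths
multipath -ll
# Show the active iSCSI sessions to the storage
iscsiadm -m session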

When I then restarted rhevh01, the same thing continued again on rhevh02.

- Then I put rhevh02 into maintenance and got the following errors:

"Migration failed due to Error: Could not connect to peer host (VM: testvm01, Source Host: rhevh02.*.*.de). Trying to migrate to another Host."

"VM migration failed due to Error: Could not connect to peer host while Host is in 'preparing for maintenance' state.
  Consider manual intervention: stopping/migrating Vms as Host's state will not
  turn to maintenance while VMs are still running on it.(VM: testvm01, Source Host: rhevh02.*.*.de)."

"Invalid status on Data Center Default. Setting status to Non-Responsive."

"Storage Pool Manager runs on Host rhevh01.*.*.de (Address: 192.*.*.*)."

Then the same continued on rhevh01, with the following error:

"Invalid status on Data Center Default. Setting Data Center status to Non-Responsive (On host rhevh01.*.*.de, Error: done)."

The server rhevh02 stayed in the status "Preparing for Maintenance" and the virtual machine did not migrate. But I could connect to the virtual machine and it worked.

Pings from rhevh02 also worked, but it did not change its status.

- Then I rebooted rhevh02 manually from the admin console.

Then the status changed to

"Host rhevh02.*.*.de is non-responsive."

Then RHEV automatically rebooted the host rhevh02 (via iDRAC). After that the machine was down with the following error:

"Power Management test failed for Host rhevh02.*.*.de.Unable to connect/login to fencing device"

After that the host rhevh02 started.

But the problem is that the server did not go to an Up status; it remained Non Responsive.

Then I went to Power Management and clicked Test (which was successful), then OK, but the server still did not come up.
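(For reference, a rough manual check of the fencing device from another host's shell, assuming the iDRAC answers IPMI over LAN; IP address and credentials below are placeholders:)

# Query the power state through the fence agent that RHEV-M fencing uses
fence_ipmilan -a 192.168.0.50 -l root -p calvin -o status
# Or directly via ipmitool
ipmitool -I lanplus -H 192.168.0.50 -U root -P calvin chassis power status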

The ping to the RHEV-M server worked, but the other pings did not.

I don't know whether the problems come from the iDRAC or from the SPM change.

A day later the Non Responsive status of rhevh02 was still there, so I changed the status to Maintenance and shut the server down via the admin console.

Then I removed the power cables, waited 3 seconds, started the server again, and after it started I confirmed (with the checkbox) that the host had been manually fenced.

Then I activated it. All pings were OK.
Then I live migrated the machine to rhevh02 and all pings were OK too, but I got the display failures (display.jpg).

Version-Release number of selected component (if applicable):
rhevh-6.2-20120209.0.iso
Red Hat Enterprise Virtualization Manager:
Version 3.0.2_0001-2.el6 

How reproducible:
The same as described above.

Steps to Reproduce:
1. SAN -> install Open-E with vds 1.0 (Ctrl+W on the console): http://documents.open-e.com/Open-E_DSS_V6_Synchronous_Volume_Replication_with_Failover_over_a_LAN_with_unicast.pdf

2. Map the SCST or an IET iSCSI target to RHEV (see the sketch after this list).

3. Do the same as in the description on a Dell R415.
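(For reference only: the iSCSI mapping can be verified from the hypervisor shell roughly as follows; the portal IP and target IQN are placeholders, and in practice the storage domain is attached through the RHEV-M web admin UI.)

# Discover the targets exported by the Open-E/SCST/IET box
iscsiadm -m discovery -t sendtargets -p 192.168.0.220
# Log in to the discovered target
iscsiadm -m node -T iqn.2012-03.com.example:rhev.data0 -p 192.168.0.220 --login
# Confirm the session is established
iscsiadm -m session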
  
Actual results:
The iSCSI storage causes problems when the SPM changes.

Additional info:
I have opened a case with Red Hat: Case 00590201, named "RHEV 3 Network".
It contains all the logs ("sosreport-LogCollector-*-20120302083716-36aa.tar.xz") and the image "display.jpg".


I don't know which components can cause these problems.

Comment 2 Michael 2012-03-21 07:58:00 UTC
I have found the problem. It is a bug in the igb module (3.0.6-k) in RHEL 6.2.0.3.

I compiled and installed the igb module from the Intel source code v3.3.6 (roughly as sketched below) and I no longer have these problems. The iDRAC is OK too; only after an iDRAC reboot do I get the error "Power Management failed for Host ....", but the iDRAC is functional, and when I run a manual fencing test the error is cleared.

The igb version 3.0.6-k has the bug with my network card "Intel Gigabit ET Quad Port Server Adapter, Cu, PCIe x4".
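(For reference, roughly the sequence for replacing the in-kernel driver with Intel's out-of-tree one, assuming a machine with matching kernel-devel headers; the tarball name and eth0 are placeholders, and persisting a rebuilt module on the mostly read-only RHEV-H image needs extra steps:)

# Check which igb version is currently loaded
ethtool -i eth0
# Build and install the out-of-tree driver from Intel's source tarball
tar xzf igb-3.3.6.tar.gz
cd igb-3.3.6/src
make install
# Reload the driver so the new version is used (run from the local console,
# not over the NIC being reloaded), or simply reboot
rmmod igb && modprobe igb
modinfo igb | grep ^version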

Comment 3 Michael 2012-03-21 08:01:29 UTC
I have changed the component of this bug from vdsm to kernel, because I think the problem is in the kernel igb module.

I am opening a new case for the iDRAC problem.

Comment 4 RHEL Program Management 2012-05-03 05:10:34 UTC
Since RHEL 6.3 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 5 Michael 2012-05-10 15:51:03 UTC
I have seen that the RHEL 6.3 beta has a newer igb driver, so I think the problem may be fixed by it. I tried to test this with rhev-hypervisor6-6.3-20120419.0, but that version has many other big bugs, so I cancelled the test.

Comment 7 RHEL Program Management 2012-12-14 07:48:50 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 8 RHEL Program Management 2013-10-14 05:05:19 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 9 Ademar Reis 2014-11-25 17:45:04 UTC
Michael: the customer case attached to this problem has been closed. Are you still experiencing this issue, or can we close this bug as well? Looks like the recent upgrades in the igb driver should have fixed the problem described.

Thanks.

Comment 10 Ademar Reis 2015-02-18 17:32:58 UTC
(In reply to Ademar Reis from comment #9)
> Michael: the customer case attached to this problem has been closed. Are you
> still experiencing this issue, or can we close this bug as well? Looks like
> the recent upgrades in the igb driver should have fixed the problem
> described.

Michael, I'm closing this bug for now. If you have any extra information or if you can reproduce with the latest packages, please reopen the BZ. Thanks!

Comment 11 Michael 2017-07-11 10:55:51 UTC
This old bug was in the Intel driver and was fixed by a newer kernel update.