Description of problem:
After switching a host to an iSCSI cluster with a configured data storage domain, it is not possible to switch the host back to an NFS cluster safely. During the switch back to the NFS cluster, all iSCSI LUNs are disconnected without unmounting the storage filesystem first. This produces I/O hangs on the affected mount points. The host remains in status "Unassigned". A soft reboot is no longer possible; soft reboots hang during the init 6 process and a hard reboot is necessary.

Version-Release number of selected component (if applicable):
RHEV 3.0
vdsm-4.9-112.4.el6_2.x86_64

How reproducible:

Steps to Reproduce:
1. Move the host to an iSCSI based cluster
2. Activate the host
3. Move the same host back to an NFS based cluster
4. Activate the host

Actual results:
The host remains in status "Unassigned". Soft reboots hang during the init 6 process. The iSCSI data storage is not unmounted; the LUNs are disconnected forcefully.

Expected results:
Clean unmount of the storage before the iSCSI LUNs are disconnected.

Additional info:
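For illustration, a minimal sketch of the teardown order that would amount to a "clean unmount" before the host leaves the iSCSI cluster, versus what appears to happen here. This uses standard RHEL 6 tooling; the mount point, VG and target names are placeholders, not taken from this setup:

  # expected order: release the filesystem and LVs before dropping the session
  umount /rhev/data-center/<pool-id>/<domain-id>          # unmount the domain filesystem first
  vgchange -an <storage-domain-vg>                        # deactivate the LVs on the iSCSI LUN
  iscsiadm -m node -T <target-iqn> -p <portal> -u         # only then log out of the session

  # observed order in this bug: the iscsiadm logout happens first, so the
  # mounted filesystem and active LVs point at a dead device and further
  # I/O (including any later unmount attempt) hangs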
Created attachment 559996 [details]
Console output after switch from iSCSI to NFS cluster

This is the last console output after the switch, when a soft reboot of the hypervisor is triggered. Only a hard reset helps at this point.
Please attach /var/log/vdsm/vdsm.log for the relevant time frame, and the RHEV-M logs as well. I believe that RHEV-M should have asked to detach the host from the master storage domain before disconnecting the iSCSI session, but I would like to verify whether this actually happened. How many hosts did you have in your iSCSI cluster? Was the problematic host the "SPM"?
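In case it helps, a quick way to grab those files; the engine-side log path is an assumption for RHEV 3.0, adjust it if your installation differs:

  # on the hypervisor host
  tar czf vdsm-logs.tar.gz /var/log/vdsm/vdsm.log* /var/log/messages

  # on the RHEV-M server (log path assumed for RHEV 3.0)
  tar czf rhevm-logs.tar.gz /var/log/rhevm/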
Created attachment 560565 [details]
VDSM log during switchover from iSCSI to NFS

This is a vdsm log file created while reproducing the switchover failure. The switch from the NFS cluster to the iSCSI cluster and back is logged here. Logging stops at 13:03, after the host is reactivated in the NFS cluster; the host then remains in "Maintenance" until the server is hard reset (see console output).
Created attachment 560566 [details]
VDSM log during soft reboot attempt and hard reset

This vdsm log shows the output written after the switchover, beginning at 13:03, up to the point where the system becomes unresponsive and is hard reset by powering off the hardware.
Created attachment 560570 [details]
RHEV-M log file

This is the RHEV-M log file. I noticed a time difference of 43 seconds between the host and the RHEV-M system, so please add 43 seconds to the timestamps in this log to match the times in the vdsm logs.
Looking at the log files, it seems that the iSCSI LUN is disconnected first, and only afterwards vdsmd tries to deactivate the logical volumes that use the LUN. Is this interpretation correct?
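If it reproduces, one way to check whether active LVs were indeed left behind after the session was torn down (a sketch; the commands may themselves hang on a dead device, hence the timeout):

  iscsiadm -m session                         # remaining iSCSI sessions, if any
  timeout 30 lvs -o vg_name,lv_name,lv_attr   # LVs still marked active ("a" in lv_attr)
  timeout 30 dmsetup info -c                  # device-mapper targets left over from the LUN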
Rafael, I still need to know whether this host was the SPM in the iSCSI cluster before you moved it to the NFS cluster. I believe you attached it to the existing NFS cluster as a regular (HSM) host, right?
Hi Igor,

yes, the host is promoted to SPM within the iSCSI cluster, and in the existing NFS cluster it is a regular host without SPM status.

Regards
Rafael
Hi Rafael,

I tried to reproduce this issue several times, with no success, on both RHEV (vdsm-4.9-112.6.el6_2.x86_64) and oVirt (latest vdsm built from git), executing the same steps as you mentioned above:
1. Move the host to an iSCSI based cluster
2. Activate the host
3. Move the same host back to an NFS based cluster
4. Activate the host

Could you provide some more data about the setup that would allow me to get to a reproducer? Were there any networking issues? Was the storage loaded?
Actually there is no more data I can provide, and there are no other issues in the landscape currently.

The RHEL host is connected to a NetApp filer via FC and 10 Gb/s networking. The LUNs of the iSCSI storage are exported to an iSCSI access group used by iscsid only. The hosts run on UCS blades with the local installation on FC based disk LUNs. The problem can be reproduced here regularly, and the logs always show the same output.
(In reply to comment #11)
> Actually there is no more data I can provide, and there are no other issues
> in the landscape currently.
>
> The RHEL host is connected to a NetApp filer via FC and 10 Gb/s networking.
> The LUNs of the iSCSI storage are exported to an iSCSI access group used by
> iscsid only. The hosts run on UCS blades with the local installation on FC
> based disk LUNs. The problem can be reproduced here regularly, and the logs
> always show the same output.

Well, I think I have a theory on this one. The crucial evidence in your report is that the reboot did not succeed ("soft reboots hang during the init 6 process"), which points to multiple known issues, each with an open bug:

- Bug 760214 - vgs hangs in D-state after the iSCSI session is disconnected
- Bug 785811 - Fedora 16 with iscsid running, host hangs on reboot (this issue exists on RHEL as well)

I assume one of your lvm processes hung on the host. Once that happens, other lvm operations are blocked as well, and even a reboot won't succeed, so eventually RHEV-M won't be able to connect the host to the pool and storage domains.

1) Is it reproducible?
2) If it does reproduce, can you please provide us the following:
   - iscsiadm -m session
   - /var/log/messages
   - ps -elf | grep lvm
3) If there is an lvm process hanging, could you try to attach to it using gdb:
   - gdb -p `pgrep lvm`
   - thread apply all bt full
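To make that collection easier if it reproduces, something along these lines could gather everything requested above into one archive (just a sketch, not part of vdsm; paths and file names are illustrative):

  #!/bin/bash
  # collect the diagnostics requested above into a single archive
  out=/tmp/iscsi-switch-diag-$(date +%s)
  mkdir -p "$out"

  iscsiadm -m session          > "$out/iscsi-sessions.txt" 2>&1
  cp /var/log/messages           "$out/"
  ps -elf | grep "[l]vm"       > "$out/lvm-procs.txt"

  # if an lvm process is hung, capture full backtraces of all its threads
  pid=$(pgrep lvm | head -n1)
  if [ -n "$pid" ]; then
      gdb -p "$pid" -batch -ex "thread apply all bt full" > "$out/lvm-gdb.txt" 2>&1
  fi

  tar czf "$out.tar.gz" -C /tmp "$(basename "$out")"
  echo "diagnostics written to $out.tar.gz"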
Rafael,

Since we failed to reproduce this behavior, could you please try to reproduce it on the latest vdsm version:
https://brewweb.devel.redhat.com/buildinfo?buildID=208324

In addition, if it happens again, please look at the running processes on the host and check whether one of the lvm processes is stuck (see comment #12).
No response for 2 weeks, closing as INSUFFICIENT_DATA.