Description of problem:
After switching a host to an iSCSI cluster with a configured data storage domain, it is not possible to switch the host back to an NFS cluster safely. During the switch back to the NFS cluster, all iSCSI LUNs are disconnected without unmounting the storage filesystem first. This produces I/O hangs on the affected mount points. The host remains in status "Unassigned". A soft reboot is no longer possible; soft reboots hang during the init 6 process and a hard reboot is necessary.

Version-Release number of selected component (if applicable):
RHEV 3.0
vdsm-4.9-112.4.el6_2.x86_64

How reproducible:

Steps to Reproduce:
1. Move the host to an iSCSI based cluster
2. Activate the host
3. Move the same host back to an NFS based cluster
4. Activate the host

Actual results:
The host remains in status "Unassigned". Soft reboots hang during the init 6 process. The iSCSI data storage is not unmounted; the LUNs are disconnected forcefully.

Expected results:
Clean unmount of the storage before the iSCSI LUNs are disconnected.

Additional info:
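For illustration, a minimal sketch of the teardown order that would amount to a "clean unmount" before the host leaves the iSCSI cluster, versus what appears to happen here. This uses standard RHEL 6 tooling; the mount point, VG and target names are placeholders, not taken from this setup:

  # expected order: release the filesystem and LVs before dropping the session
  umount /rhev/data-center/<pool-id>/<domain-id>          # unmount the domain filesystem first
  vgchange -an <storage-domain-vg>                        # deactivate the LVs on the iSCSI LUN
  iscsiadm -m node -T <target-iqn> -p <portal> -u         # only then log out of the session

  # observed order in this bug: the iscsiadm logout happens first, so the
  # mounted filesystem and active LVs point at a dead device and further
  # I/O (including any later unmount attempt) hangs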
Created attachment 559996 [details]
Console output after switch from iSCSI to NFS cluster

This is the last console output after the switch, when a soft reboot of the hypervisor is triggered. Only a hard reset helps at this point.
Please attach /var/log/vdsm/vdsm.log for the relevant time frame, and the RHEV-M logs as well. I believe that RHEV-M should have asked to detach the host from the master storage domain before disconnecting the iSCSI session, but I would like to verify whether this actually happened. How many hosts did you have in your iSCSI cluster? Was the problematic host the "SPM"?
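In case it helps, a quick way to grab those files; the engine-side log path is an assumption for RHEV 3.0, adjust it if your installation differs:

  # on the hypervisor host
  tar czf vdsm-logs.tar.gz /var/log/vdsm/vdsm.log* /var/log/messages

  # on the RHEV-M server (log path assumed for RHEV 3.0)
  tar czf rhevm-logs.tar.gz /var/log/rhevm/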
Created attachment 560565 [details]
VDSM log during switchover from iSCSI to NFS

This is a vdsm log file created while reproducing the switchover failure. The switch from the NFS cluster to the iSCSI cluster and back is logged here. Logging stops at 13:03, after the host is reactivated in the NFS cluster; the host then remains in "Maintenance" until the server is hard reset (see console output).
Created attachment 560566 [details]
VDSM log during soft reboot attempt and hard reset

This vdsm log shows the output written after the switchover, beginning at 13:03, up to the point where the system becomes unresponsive and is hard reset by powering off the hardware.
Created attachment 560570 [details]
RHEV-M log file

This is the RHEV-M log file. I noticed a time difference of 43 seconds between the host and the RHEV-M system, so please add 43 seconds to the timestamps in this log to match the times in the vdsm logs.
Looking at the log files, it seems that the iSCSI LUN is disconnected first, and only afterwards vdsmd tries to deactivate the logical volumes that use the LUN. Is this interpretation correct?
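If it reproduces, one way to check whether active LVs were indeed left behind after the session was torn down (a sketch; the commands may themselves hang on a dead device, hence the timeout):

  iscsiadm -m session                         # remaining iSCSI sessions, if any
  timeout 30 lvs -o vg_name,lv_name,lv_attr   # LVs still marked active ("a" in lv_attr)
  timeout 30 dmsetup info -c                  # device-mapper targets left over from the LUN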
Rafael, I still need to know whether this host was the SPM in the iSCSI cluster before you moved it to the NFS cluster. I believe you attached it to the existing NFS cluster as a regular (HSM) host, right?
Hi Igor,

yes, the host is promoted to SPM within the iSCSI cluster, and in the existing NFS cluster it is a regular host without SPM status.

Regards
Rafael
Hi Rafael,

I tried to reproduce this issue several times, with no success, on both RHEV (vdsm-4.9-112.6.el6_2.x86_64) and oVirt (latest vdsm built from git), executing the same steps as you mentioned above:
1. Move the host to an iSCSI based cluster
2. Activate the host
3. Move the same host back to an NFS based cluster
4. Activate the host

Could you provide some more data about the setup that would allow me to get to a reproducer? Were there any networking issues? Was the storage loaded?
Actually there is no more data I can provide, and there are no other issues in the landscape currently.

The RHEL host is connected to a NetApp filer via FC and 10 Gb/s networking. The LUNs of the iSCSI storage are exported to an iSCSI access group used by iscsid only. The hosts run on UCS blades with the local installation on FC based disk LUNs. The problem can be reproduced here regularly, and the logs always show the same output.
(In reply to comment #11)
> Actually there is no more data I can provide, and there are no other issues
> in the landscape currently.
>
> The RHEL host is connected to a NetApp filer via FC and 10 Gb/s networking.
> The LUNs of the iSCSI storage are exported to an iSCSI access group used by
> iscsid only. The hosts run on UCS blades with the local installation on FC
> based disk LUNs. The problem can be reproduced here regularly, and the logs
> always show the same output.

Well, I think I have a theory on this one. The crucial evidence in your report is that the reboot did not succeed ("soft reboots hang during the init 6 process"), which points to multiple known issues, each with an open bug:

- Bug 760214 - vgs hangs in D-state after the iSCSI session is disconnected
- Bug 785811 - Fedora 16 with iscsid running, host hangs on reboot (this issue exists on RHEL as well)

I assume one of your lvm processes hung on the host. Once that happens, other lvm operations are blocked as well, and even a reboot won't succeed, so eventually RHEV-M won't be able to connect the host to the pool and storage domains.

1) Is it reproducible?
2) If it does reproduce, can you please provide us the following:
   - iscsiadm -m session
   - /var/log/messages
   - ps -elf | grep lvm
3) If there is an lvm process hanging, could you try to attach to it using gdb:
   - gdb -p `pgrep lvm`
   - thread apply all bt full
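To make that collection easier if it reproduces, something along these lines could gather everything requested above into one archive (just a sketch, not part of vdsm; paths and file names are illustrative):

  #!/bin/bash
  # collect the diagnostics requested above into a single archive
  out=/tmp/iscsi-switch-diag-$(date +%s)
  mkdir -p "$out"

  iscsiadm -m session          > "$out/iscsi-sessions.txt" 2>&1
  cp /var/log/messages           "$out/"
  ps -elf | grep "[l]vm"       > "$out/lvm-procs.txt"

  # if an lvm process is hung, capture full backtraces of all its threads
  pid=$(pgrep lvm | head -n1)
  if [ -n "$pid" ]; then
      gdb -p "$pid" -batch -ex "thread apply all bt full" > "$out/lvm-gdb.txt" 2>&1
  fi

  tar czf "$out.tar.gz" -C /tmp "$(basename "$out")"
  echo "diagnostics written to $out.tar.gz"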
Rafael,

Since we failed to reproduce this behavior, could you please try to reproduce it on the latest vdsm version:
https://brewweb.devel.redhat.com/buildinfo?buildID=208324

In addition, if it happens again, please look at the running processes on the host and check whether one of the lvm processes is stuck (see comment #12).
No response for 2 weeks, closing as INSUFFICIENT_DATA.