Created attachment 599393 [details]
update failure logs (rhevm, vdsm, syslog)

Description of problem:
When a hypervisor is moved to maintenance mode for an upgrade, all iSCSI sessions are disconnected. The storage domains then lose all of their underlying devices and the LVM layer hangs. When the upgrade is started at this point, it fails with the following messages:

2012-07-18 16:20:28,959 INFO [org.ovirt.engine.core.bll.InstallerMessages] (pool-11-thread-45) VDS message: The required action is taking longer than allowed by configuration. Verify host networking and storage settings.
2012-07-18 16:20:28,959 ERROR [org.ovirt.engine.core.utils.hostinstall.MinaInstallWrapper] (pool-11-thread-45) Error running command /usr/share/vdsm-reg/vdsm-upgrade

"Hung_task timeout" messages for lvs commands are seen in syslog.

Version-Release number of selected component (if applicable):
RHEV Manager - 3.0.3_0001-3.el6
RHEV Hypervisor - 6.2 - 20120320.0.el6_2
VDSM Version - 3.0.112.6

How reproducible:
Always

Steps to Reproduce:
1. Create an iSCSI data center with a two-node cluster.
2. Create a master storage domain (iSCSI).
3. Move Hypervisor1 (node1) to maintenance mode for the upgrade.
4. Start updating the hypervisor from RHEV Manager.
5. The update fails.

Actual results:
The update fails.

Expected results:
The update should succeed.

Additional info:
Any updates here?
Barak, this looks like an infra issue (and a rhev-h upgrade issue). Can you get someone to take a look?
Hello Inbaraj,

We went over the logs and one important log is missing. Can you please send the log content of the upgrade process? The files are located at /var/log/vdsm-reg

Thanks!
Created attachment 607191 [details] update failure logs(rhevm, vdsm, vdsm-reg, syslog)
Alon, I don't have the vdsm-reg logs from the previous run, so I have reproduced the same issue and attached a new set of logs. The vdsm-reg logs are included this time.
Thank you so much, Inbaraj, for reproducing this.

I am trying to figure out how far we get before the process stops responding, so that we can move forward. Do you also have the following files on the node side?
/var/log/ovirt.log
/tmp/ovirt.log
Created attachment 607230 [details] ovirt logs from (/var/log and /tmp)
Alon, I have attached the ovirt.log files from the /var/log and /tmp locations.
Hello Mike,

It is very difficult to tell where this hangs, as there are almost no informative messages in the code; most are errors only.

Can you please suggest how to debug this? Or have you perhaps encountered this in the past?

If I had to guess, we hang at:
ovirt-config-boot::iscsiadm -p $OVIRT_ISCSI_TARGET_IP:$OVIRT_ISCSI_TARGET_PORT -m discovery -t sendtargets

I can confirm this using mount -o bind with a debug file... but it is not local configuration.

Please also note the error in /tmp/ovirt.log.

Thanks!
Alon.
Hello Inbaraj,

Can you please just take the host into maintenance mode and see if the disk of the root file system is still attached?

Test this without trying to upgrade.

Or maybe you can confirm that the following deletes the root?
iscsiadm -m node -o delete -T iqn.1992-08.com.netapp:sn.118043588 -p 192.168.100.155:3260

Thank you!
Created attachment 607452 [details] multipath, VG, and iscsi status output
(In reply to comment #15)
> Hello Inbaraj,
>
> Can you please just take the host into maintenance mode and see if the disk
> of the root file system still attached?
>
> Test this without trying to upgrade.
>
> Or maybe you can confirm that the following deletes the root?
> iscsiadm -m node -o delete -T iqn.1992-08.com.netapp:sn.118043588 -p
> 192.168.100.155:3260
>
> Thank you!

Alon,

The OS is installed on the local hard disk of the server, so the root disk is not affected. But as mentioned in the problem description, once the host is moved to maintenance mode the iSCSI sessions are shut down. As a result, the master storage domain, which is created on the multipath device (NetApp LUN), loses its paths and the LVM layer hangs.

Here are snippets of the logs, which clearly show the iSCSI sessions being shut down and the multipath devices starting to lose paths:

Aug 28 08:45:23 ibmx3550-210-126 ntpd[8689]: kernel time sync status change 2001
Aug 28 08:59:02 ibmx3550-210-126 multipathd: sdb: remove path (uevent)
Aug 28 08:59:02 ibmx3550-210-126 multipathd: 360a98000486e536369346d3363302f37 Last path deleted, disabling queueing
Aug 28 08:59:02 ibmx3550-210-126 multipathd: 360a98000486e536369346d3363302f37: map in use
Aug 28 08:59:02 ibmx3550-210-126 multipathd: 360a98000486e536369346d3363302f37: can't flush
Aug 28 08:59:02 ibmx3550-210-126 multipathd: flush_on_last_del in progress
Aug 28 08:59:02 ibmx3550-210-126 multipathd: 360a98000486e536369346d3363302f37: load table [0 41943040 multipath 3 queue_if_no_path pg_init_retries 50 0 0 0]
Aug 28 08:59:02 ibmx3550-210-126 multipathd: sdb: path removed from map 360a98000486e536369346d3363302f37
Aug 28 08:59:02 ibmx3550-210-126 kernel: ata1.00: hard resetting link
Aug 28 08:59:03 ibmx3550-210-126 iscsid: Connection1:0 to [target: iqn.1992-08.com.netapp:sn.118043588, portal: 192.168.100.155,3260] through [iface: default] is shutdown.
I have also captured the multipath and iSCSI session status output before and after entering maintenance mode and attached it here.
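For anyone going through the full syslog attachments, the three events of interest in the excerpt above (a path being removed, the last path of a map being deleted, and an iSCSI connection shutting down) can be picked out mechanically. The following is only a minimal sketch: the regular expressions are derived from the log lines quoted in this report and may not match other multipathd or iscsid versions, and the function name `summarize` and the event labels are made up for illustration.

```python
import re

# Patterns derived from the syslog lines quoted in this report (assumption:
# other multipathd/iscsid versions may word these messages differently).
PATH_REMOVED = re.compile(r"multipathd: (\w+): remove path \(uevent\)")
LAST_PATH = re.compile(r"multipathd: (\w+) Last path deleted")
ISCSI_SHUTDOWN = re.compile(
    r"iscsid: Connection\S+ to \[target: ([^,]+), portal: ([^\]]+)\].* is shutdown"
)


def summarize(lines):
    """Return (event, detail) tuples, in log order, for the lines that match."""
    events = []
    for line in lines:
        m = PATH_REMOVED.search(line)
        if m:
            events.append(("path_removed", m.group(1)))
            continue
        m = LAST_PATH.search(line)
        if m:
            events.append(("last_path_deleted", m.group(1)))
            continue
        m = ISCSI_SHUTDOWN.search(line)
        if m:
            events.append(("iscsi_shutdown", "%s via %s" % (m.group(1), m.group(2))))
    return events


# Three lines copied from the excerpt in this report.
SAMPLE = [
    "Aug 28 08:59:02 ibmx3550-210-126 multipathd: sdb: remove path (uevent)",
    "Aug 28 08:59:02 ibmx3550-210-126 multipathd: "
    "360a98000486e536369346d3363302f37 Last path deleted, disabling queueing",
    "Aug 28 08:59:03 ibmx3550-210-126 iscsid: Connection1:0 to "
    "[target: iqn.1992-08.com.netapp:sn.118043588, portal: 192.168.100.155,3260] "
    "through [iface: default] is shutdown.",
]

for event, detail in summarize(SAMPLE):
    print(event, detail)
```

To run it against a real log, pass the open file object (or `f.readlines()`) to `summarize` instead of `SAMPLE`.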
Thank you Inbaraj, I was trying to find which process actually uses the disk after disconnect. I will try to reproduce this with all the above information. Thank you, Alon.
Can you attach lvm.conf?

Specifically, I want to see if it has:
ignore_suspended_devices=1

If not, I would like to test with it (but with rhev-h this could be more complicated than simply changing the file and persisting it).
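If it helps, the check can be scripted once the file is available. This is a rough sketch under simplifying assumptions: it ignores lvm.conf section nesting, treats everything after a `#` as a comment, and takes the last uncommented assignment as the effective one. The helper name `get_lvm_setting` and the sample text are hypothetical and are not taken from the host in this report.

```python
import re


def get_lvm_setting(conf_text, name):
    """Return the raw value assigned to `name` in lvm.conf-style text, or None.

    Simplified: sections are not tracked and '#' always starts a comment.
    """
    value = None
    pattern = re.compile(r"%s\s*=\s*(\S+)$" % re.escape(name))
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        m = pattern.match(line)
        if m:
            value = m.group(1)  # keep the last uncommented assignment
    return value


# Hypothetical sample, not the actual lvm.conf from the affected host.
SAMPLE = """
devices {
    # ignore_suspended_devices = 1
    ignore_suspended_devices = 0
}
"""

setting = get_lvm_setting(SAMPLE, "ignore_suspended_devices")
print("ignore_suspended_devices:", setting)
print("enabled" if setting == "1" else "not enabled")
```

For a real check, read the contents of the attached lvm.conf into a string and pass that in place of `SAMPLE`.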