Bug 841883 - RHEV 6.2 Hypervisor failed to update from RHEV3.0 manager.
Summary: RHEV 6.2 Hypervisor failed to update from RHEV3.0 manager.
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.0.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.1.0
Assignee: Ayal Baron
QA Contact: Pavel Stehlik
URL:
Whiteboard: storage
Depends On:
Blocks:
 
Reported: 2012-07-20 13:24 UTC by Inbaraj
Modified: 2016-02-10 16:42 UTC
CC List: 18 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-10-02 16:16:41 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
update failure logs(rhevm, vdsm, syslog) (313.99 KB, application/x-compressed)
2012-07-20 13:24 UTC, Inbaraj
update failure logs(rhevm, vdsm, vdsm-reg, syslog) (319.15 KB, application/x-compressed)
2012-08-27 12:25 UTC, Inbaraj
ovirt logs from (/var/log and /tmp) (1.97 KB, application/x-compressed)
2012-08-27 15:09 UTC, Inbaraj
multipath, VG, and iscsi status output (7.68 KB, text/plain)
2012-08-28 09:18 UTC, Inbaraj

Description Inbaraj 2012-07-20 13:24:19 UTC
Created attachment 599393 [details]
update failure logs(rhevm, vdsm, syslog)

Description of problem:
When the hypervisor is moved to maintenance mode for an upgrade, all iSCSI sessions get disconnected, which eventually causes the storage domains to lose all of their underlying devices, and the LVM layer hangs.

At this point, when the upgrade is attempted, it fails with the following message:

2012-07-18 16:20:28,959 INFO  [org.ovirt.engine.core.bll.InstallerMessages] (pool-11-thread-45) VDS message: The required action is taking longer than allowed by configuration. Verify host networking and storage settings.
2012-07-18 16:20:28,959 ERROR [org.ovirt.engine.core.utils.hostinstall.MinaInstallWrapper] (pool-11-thread-45) Error running command /usr/share/vdsm-reg/vdsm-upgrade


"Hung_task timeout" for lvs commands are noticed in syslog.

 
Version-Release number of selected component (if applicable):
RHEV manager - 3.0.3_0001-3.el6
RHEV Hypervisor - 6.2 - 20120320.0.el6_2	
VDSM Version - 3.0.112.6

How reproducible:
always

Steps to Reproduce:
1. Create an iSCSI data center with a two-node cluster.
2. Create a master storage domain (iSCSI).
3. Move Hypervisor1 (node1) to maintenance mode for the upgrade.
4. Start updating the hypervisor from the RHEV Manager.
5. The update fails.

  
Actual results:
The update fails.

Expected results:
Update should succeed

Additional info:

Comment 1 Inbaraj 2012-07-27 10:41:28 UTC
Any updates here?

Comment 2 Ayal Baron 2012-07-29 12:19:00 UTC
Barak, this looks like an infra issue (and an rhev-h upgrade issue); can you get someone to take a look?

Comment 8 Alon Bar-Lev 2012-08-16 09:36:51 UTC
Hello Inbaraj,

We went over the logs, and one important log is missing.

Can you please send the log content of the upgrade process?

Files are located at /var/log/vdsm-reg
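For example, the whole directory can be collected in one archive, as a sketch (the archive name is arbitrary):
 tar czf /tmp/vdsm-reg-logs.tar.gz /var/log/vdsm-reg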

Thanks!

Comment 9 Inbaraj 2012-08-27 12:25:37 UTC
Created attachment 607191 [details]
update failure logs(rhevm, vdsm, vdsm-reg, syslog)

Comment 10 Inbaraj 2012-08-27 12:31:22 UTC
Alon, 
I don't have the vdsm-reg logs for the previous run, so I have recreated the same issue and attached a new set of logs, this time including the vdsm-reg logs.

Comment 11 Alon Bar-Lev 2012-08-27 13:34:58 UTC
Thank you so much, Inbaraj, for reproducing this.

I am trying to figure out how far we get before the process stops responding, so that I can then move forward.

Do you have the following files as well on the node side?
 /var/log/ovirt.log
 /tmp/ovirt.log

Comment 12 Inbaraj 2012-08-27 15:09:27 UTC
Created attachment 607230 [details]
ovirt logs from (/var/log and /tmp)

Comment 13 Inbaraj 2012-08-27 15:12:17 UTC
Alon, I have attached the ovirt.log files from the /var/log and /tmp locations.

Comment 14 Alon Bar-Lev 2012-08-27 18:49:42 UTC
Hello Mike,

It is very difficult to know where this hangs, as there are almost no informative messages in the code; most messages are errors only.

Can you please suggest how to debug this? Or maybe you have encountered this in the past?

If I had to guess, we are hung at:
 ovirt-config-boot::iscsiadm -p $OVIRT_ISCSI_TARGET_IP:$OVIRT_ISCSI_TARGET_PORT -m discovery -t sendtargets

I could confirm this using mount -o bind with a debug file... but it is not my local configuration.
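A rough sketch of that bind-mount approach (the script path is an assumption and may differ on RHEV-H):
 # copy the script, enable shell tracing in the copy, then bind-mount it over the original
 cp /usr/libexec/ovirt-config-boot /tmp/ovirt-config-boot.debug
 sed -i '2i set -x' /tmp/ovirt-config-boot.debug
 mount -o bind /tmp/ovirt-config-boot.debug /usr/libexec/ovirt-config-boot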

Please also note the error at /tmp/ovirt.log.

Thanks!
Alon.

Comment 15 Alon Bar-Lev 2012-08-28 06:34:12 UTC
Hello Inbaraj,

Can you please just take the host into maintenance mode and see if the disk of the root file system is still attached?

Test this without trying to upgrade.

Or maybe you can confirm that the following deletes the root?
 iscsiadm -m node -o delete -T iqn.1992-08.com.netapp:sn.118043588 -p 192.168.100.155:3260
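One way to check what the root file system sits on, as a sketch (plain shell checks, not RHEV-specific tooling):
 df /                                    # device backing the root file system
 ls -l /dev/disk/by-path/ | grep iscsi   # iSCSI-backed block devices, if any
 iscsiadm -m session                     # sessions that remain after maintenance mode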

Thank you!

Comment 16 Inbaraj 2012-08-28 09:18:02 UTC
Created attachment 607452 [details]
multipath, VG, and iscsi status output

Comment 17 Inbaraj 2012-08-28 09:24:24 UTC
(In reply to comment #15)
> Hello Inbaraj,
> 
> Can you please just take the host into maintenance mode and see if the disk
> of the root file system is still attached?
> 
> Test this without trying to upgrade.
> 
> Or maybe you can confirm that the following deletes the root?
>  iscsiadm -m node -o delete -T iqn.1992-08.com.netapp:sn.118043588 -p
> 192.168.100.155:3260
> 
> Thank you!

Alon, 
The OS is installed on the server's local hard disk, so it doesn't affect the root disk.

But as mentioned in the problem description, once the host is moved to maintenance mode, the iSCSI sessions are shut down; because of this, the master storage domain, which is created on the multipath device (a NetApp LUN), loses its paths and the LVM layer hangs.
Here is a snippet of the logs which clearly shows the iSCSI sessions being shut down and the multipath devices starting to lose paths:

Aug 28 08:45:23 ibmx3550-210-126 ntpd[8689]: kernel time sync status change 2001
Aug 28 08:59:02 ibmx3550-210-126 multipathd: sdb: remove path (uevent)
Aug 28 08:59:02 ibmx3550-210-126 multipathd: 360a98000486e536369346d3363302f37 Last path deleted, disabling queueing
Aug 28 08:59:02 ibmx3550-210-126 multipathd: 360a98000486e536369346d3363302f37: map in use
Aug 28 08:59:02 ibmx3550-210-126 multipathd: 360a98000486e536369346d3363302f37: can't flush
Aug 28 08:59:02 ibmx3550-210-126 multipathd: flush_on_last_del in progress
Aug 28 08:59:02 ibmx3550-210-126 multipathd: 360a98000486e536369346d3363302f37: load table [0 41943040 multipath 3 queue_if_no_path pg_init_retries 50 0 0 0]
Aug 28 08:59:02 ibmx3550-210-126 multipathd: sdb: path removed from map 360a98000486e536369346d3363302f37
Aug 28 08:59:02 ibmx3550-210-126 kernel: ata1.00: hard resetting link
Aug 28 08:59:03 ibmx3550-210-126 iscsid: Connection1:0 to [target: iqn.1992-08.com.netapp:sn.118043588, portal: 192.168.100.155,3260] through [iface: default] is shutdown.


I have also captured the multipath and iSCSI session status output before and after maintenance mode and attached it here.
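The kind of commands that produce such a capture, as a sketch (the exact invocations used for the attachment may differ):
 multipath -ll              # path state of the storage domain LUN
 iscsiadm -m session -P 1   # open iSCSI sessions and their targets
 vgs                        # volume group status (may block once all paths are gone)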

Comment 18 Alon Bar-Lev 2012-08-28 11:34:27 UTC
Thank you Inbaraj,

I was trying to find out which process actually uses the disk after the disconnect.

I will try to reproduce this with all the above information.

Thank you,
Alon.

Comment 21 Ayal Baron 2012-09-23 06:22:50 UTC
Can you attach lvm.conf?
Specifically, I want to see if it has:
ignore_suspended_devices=1

If not, I would like to test with it (but with rhev-h this could be more complicated than simply changing the file and persisting it).
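A sketch of how this could be checked and, on RHEV-H, persisted (the persist step assumes the standard RHEV-H/ovirt-node tooling):
 grep ignore_suspended_devices /etc/lvm/lvm.conf   # check whether the option is already set
 # if it is missing, add ignore_suspended_devices=1 under the devices { } section, then:
 persist /etc/lvm/lvm.conf                         # keep the change across RHEV-H reboots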

