Bug 690520 - VDSM: host reboots when blocking connection to iscsi storage when Export and data SD are located in the same storage
Keywords:
Status: CLOSED DUPLICATE of bug 678853
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: vdsm
Version: 6.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Sanjay Mehrotra
QA Contact: yeylon@redhat.com
URL:
Whiteboard: storage
Depends On:
Blocks:
 
Reported: 2011-03-24 14:56 UTC by Dafna Ron
Modified: 2016-04-18 06:39 UTC (History)
CC: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-05-03 07:46:42 UTC
Target Upstream Version:


Attachments (Terms of Use)
logs (1.67 MB, application/x-gzip)
2011-03-24 14:56 UTC, Dafna Ron

Description Dafna Ron 2011-03-24 14:56:47 UTC
Created attachment 487364 [details]
logs

Description of problem:

setup:

2 hosts in the same cluster.
Attached one Data SD and one Export SD, both located on the same storage.

When blocking the iSCSI port, the host reboots. This happens only when an Export domain is attached, and only with RHEV-M 2.3.

Tested the following scenarios to verify this is the only failing configuration:

1) old vdsm version - host rebooted
2) 2 different export domains (to make sure there is no corruption) - host rebooted
3) attached iso instead of Export - host did not reboot
4) attached 2 Data Domains located in the same storage - host did not reboot
5) checked with NFS storage - host did not reboot


Version-Release number of selected component (if applicable):
vdsm-cli-4.9-55.el6.x86_64
vdsm-4.9-55.el6.x86_64
qemu-img-0.12.1.2-2.146.el6.x86_64
gpxe-roms-qemu-0.9.7-6.4.el6.noarch
qemu-kvm-0.12.1.2-2.146.el6.x86_64


How reproducible:
100%

Steps to Reproduce:
1. In a two-host cluster, attach a Data SD and an Export SD located on the same storage to the SPM host
2. Block communication to the storage using iptables
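A hypothetical helper for step 2; the exact rule used in this test is not recorded in the bug, so the direction (OUTPUT) and the default iSCSI port 3260 are assumptions:

```shell
# Block outgoing iSCSI traffic from the host to the storage.
# TCP port 3260 (the iSCSI default) and the OUTPUT chain are assumptions;
# the actual rule used during the test is not in the bug report.
iptables -A OUTPUT -p tcp --dport 3260 -j DROP

# After the test, delete the same rule to restore connectivity:
iptables -D OUTPUT -p tcp --dport 3260 -j DROP
```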
  
Actual results:

host will reboot

Expected results:

host should not reboot

Additional info: vdsm + rhevm logs

Comment 2 Dafna Ron 2011-03-24 15:35:45 UTC
reproduced with updated vdsm and qemu:

[root@south-02 ~]# rpm -qa |grep vdsm
vdsm-4.9-56.el6.x86_64
vdsm-debuginfo-4.9-56.el6.x86_64
vdsm-cli-4.9-56.el6.x86_64
[root@south-02 ~]# rpm -qa |grep qemu
qemu-img-0.12.1.2-2.152.el6.x86_64
qemu-kvm-debuginfo-0.12.1.2-2.152.el6.x86_64
gpxe-roms-qemu-0.9.7-6.4.el6.noarch
qemu-kvm-0.12.1.2-2.152.el6.x86_64

Comment 3 RHEL Program Management 2011-04-04 02:04:16 UTC
Since RHEL 6.1 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 4 Sanjay Mehrotra 2011-04-21 14:10:16 UTC
Currently investigating the issue through the provided log files and the source code (safelease.c, spmprotect.sh). Please feel free to post any other files that should be examined for this case.

Comment 5 Sanjay Mehrotra 2011-04-26 22:40:27 UTC
The current reboot-the-host functionality is performed by the Storage Pool Manager (SPM), which monitors the leases volume of the block-based storage domain. In a block-based storage domain, the leases area is a separate LV.

How does the SPM work with the leases volume? The SPM acquires the SPM lock (tied to the vdsm pid) on the LV (the leases volume). Once it has successfully acquired the lock, the renewal process starts, which basically writes a timestamp to the logical volume at a fixed interval. When communication to the storage is blocked (as in this test), the SPM waits through the timeouts, which leads to a "fencing" failure. spmprotect kills the vdsm daemon and restarts it; it also starts netconsole and iscsiadm. If the logical volumes are still unavailable after vdsm/SPM restarts, the host is force-rebooted with sudo reboot -f. RHEV Manager then shows the VM status as Unknown. In this test case, communication to the storage is still blocked, which causes the reboot.


[ References 
a) Code spmprotect.sh, safelease.c 
b) knowledge sessions slides presented by Marina & Vladik ] 

So any storage domain with a leases volume can result in a reboot of the host.

Questions (for NEEDINFO state): Was communication between host and storage blocked for both hosts in the cluster at the same time? The spm-lock file shows that both hosts rebooted, but with different timestamps.

Further to investigate:
1.  Why didn't the second host take over the LV to continue the lease renewal?
2.  The storage recovery process: is rebooting the host the best strategy?
3.  We do not try to discover the LUNs (from the iSCSI perspective); we just restart iscsiadm.
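The renew-then-escalate flow described above can be sketched as follows. This is a minimal illustration only: the interval and timeout values come from the domain metadata quoted in comment 7, but the function names and callbacks are assumptions, and the real logic lives in spmprotect.sh/safelease.c and differs in detail:

```python
import time

# LOCKRENEWALINTERVALSEC and LEASETIMESEC are the values from the domain
# metadata in comment 7; everything else is an assumed illustration.
LOCKRENEWALINTERVALSEC = 5
LEASETIMESEC = 60

def renew_lease(write_timestamp, now=time.monotonic, sleep=time.sleep):
    """Write a timestamp to the leases LV every LOCKRENEWALINTERVALSEC
    seconds; give up once LEASETIMESEC passes without a successful write."""
    last_success = now()
    while True:
        try:
            write_timestamp()
            last_success = now()
        except OSError:
            if now() - last_success > LEASETIMESEC:
                return False  # "fencing" failure: the caller must escalate
        sleep(LOCKRENEWALINTERVALSEC)

def escalate(stop_spm, restart_vdsm, reboot_host):
    """Escalation after lease renewal fails: stop SPM cleanly, else
    restart vdsm, else force-reboot the host (sudo reboot -f)."""
    for step in (stop_spm, restart_vdsm):
        if step():
            return "recovered"
    reboot_host()  # last resort: nothing else we can do
    return "rebooted"
```

The open question from comment 5 then becomes: in this test the storage stays blocked, so both recovery steps fail and the reboot branch is always taken.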

Comment 6 Ayal Baron 2011-04-27 21:10:04 UTC
[SNIP]

> as unknown.  In this test case, the communication is still blocked to the
> storage which causes the reboot to happen. 

[AB] This should actually not happen.  Upon VDSM startup, if all SPM state has been released, vdsm will shut spmprotect down and avoid the reboot.  This is only possible, though, if the umount of the master mount succeeded.

> 
> 
> [ References 
> a) Code spmprotect.sh, safelease.c 
> b) knowledge sessions slides presented by Marina & Vladik ] 
> 
> So, each Storage Domain, with leases volume will result in reboot of the host. 
> 
> Questions ( For Needinfo State )  -  Was the communication between the host and
> storage blocked for both the hosts in the cluster at the same time. Looking at
> the spm-lock file, it shows both the hosts were rebooting but with different
> times stamps. 

[AB] Yaniv, please provide above info.

> 
> Further to Investigate -
> 1.  Why didn't the second host took over the LVM to continue the lease renewal
> ?  
> 2.  Storage recovery recovery process ?  Is rebooting the host the best
> strategy ?  

[AB] It's a last resort.  We try to:
1. stop SPM (and properly free shared resources, namely umount masterfs)
2. if above fails - restart vdsm (and again try to free spm state incl. disconnecting iSCSI session etc. to make this work)
3. if above fails - reboot (nothing else we can do)

> 3.  We do not try to discover the LUNs ( from iSCSI perspective ), just restart
> the iscsiadm.

[AB] Sanjay, Where do we restart iscsiadm? and when (and why) should we rediscover the luns?

Comment 7 Sanjay Mehrotra 2011-05-03 04:06:29 UTC
Response to Ayal's queries.
[AB] This should actually not happen.  Upon VDSM startup, if all SPM state has
been released, vdsm will shut spmprotect down and will avoid reboot.  This is
only possible though if umount of the master mount succeeded.

In the vdsm log file for this case, the umount of the master succeeded.  Following is the log output for the pool with UUID xxxx92e0.
spmStatus here:
Thread-52945::DEBUG::2011-03-24 09:11:09,874::blockSD::156::Storage.Metadata::(_get) metadata=['CLASS=Data', 'DESCRIPTION=RHEL6', 'IOOPTIMEOUTSEC=10', 'LEASERETRIES=3', 'LEASETIMESEC=60', 'LOCKPOLICY=None', 'LOCKRENEWALINTERVALSEC=5', 'MASTER_VERSION=1', 'POOL_DESCRIPTION=RHEL6', 'POOL_DOMAINS=48693808-451b-4407-8aa1-d7df96d69124:Active,2e46da5a-2ad6-4672-a297-02b6d3713274:Active,ff008647-727c-42be-9dd0-e49d9c9df079:Active', 'POOL_SPM_ID=1', 'POOL_SPM_LVER=2', 'POOL_UUID=42a09e2e-8665-4bc6-98d1-583b9c0d92e0', 'PV0=pv:36090a068a074fb8fe0d7848d5a25cd12,uuid:kh2Ttn-etkH-Gm80-03SP-cGqt-fWmf-Ar81ld,pestart:0,pecount:798,mapoffset:0', 'ROLE=Master', 'SDUUID=ff008647-727c-42be-9dd0-e49d9c9df079', 'TYPE=ISCSI', 'VERSION=0', 'VGUUID=xbOe3g-6E7v-6Kif-5tBR-jGtO-roQT-nRC3Ee', '_SHA_CKSUM=51880329178e0549b5d933643a6587072e425259']

spmStop issued - here for the same UUID
MainThread::INFO::2011-03-24 09:12:30,993::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: spmStop, args: ( spUUID=42a09e2e-8665-4bc6-98d1-583b9c0d92e0)
MainThread::DEBUG::2011-03-24 09:12:30,994::task::491::TaskManager.Task::(_debug) Task 5121a9f1-d6d3-467e-bc27-ee0768a333b2: moving from state init -> state preparing

PrepareShutdown issued 
MainThread::INFO::2011-03-24 09:12:38,365::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: prepareForShutdown, args: ()

MainThread::DEBUG::2011-03-24 09:12:38,368::fileUtils::109::Storage.Misc.excCmd::(umount) '/usr/bin/sudo -n /bin/umount -f /rhev/data-center/mnt/blockSD/ff008647-727c-42be-9dd0-e49d9c9df079/master' (cwd None) 

[ The log does not mention whether the umount succeeded or not ]; vdsm did restart immediately:

Thread-16154::INFO::2011-03-24 09:12:39,605::blockSD::613::Storage.StorageDomain::(validate) sdUUID=48693808-451b-4407-8aa1-d7df96d69124
MainThread::INFO::2011-03-24 09:12:42,846::vdsm::71::vds::(run) I am the actual vdsm 4.9-55

[ Volumes are not available due to the blocked storage connection, hence the I/O error ]
MainThread::DEBUG::2011-03-24 09:13:41,240::lvm::352::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '  /dev/mapper/36090a068a074fb8fe0d7848d5a25cd12: read failed after 0 of 4096 at 53697511424: Input/output error\n  
[ The error above repeats for all /dev/mapper devices in the storage domain, including backups ]

MainThread::DEBUG::2011-03-24 09:13:41,953::spm::279::Storage.Misc.excCmd::(__releaseLocks) '/usr/bin/killall -g -USR1 spmprotect.sh' (cwd None)
MainThread::DEBUG::2011-03-24 09:13:42,034::spm::279::Storage.Misc.excCmd::(__releaseLocks) SUCCESS: <err> = ''; <rc> = 0
MainThread::WARNING::2011-03-24 09:13:42,037::spm::284::Storage.SPM::(__releaseLocks) SPM: found lease locks, releasing
MainThread::DEBUG::2011-03-24 09:13:43,039::spm::288::Storage.Misc.excCmd::(__releaseLocks) '/usr/bin/killall -0 spmprotect.sh' (cwd None)
MainThread::DEBUG::2011-03-24 09:13:43,060::spm::288::Storage.Misc.excCmd::(__releaseLocks) FAILED: <err> = 'spmprotect.sh: no process killed\n'; <rc> = 1

[ LVM recovery fails, multipathing fails, the iscsiadm session fails; again no connection to the storage ]
[ Cleanup of the repository fails - invalid args ]
MainThread::WARNING::2011-03-24 09:15:56,975::hsm::209::Storage.HSM::(__init__) Failed to clean Storage Repository.
OSError: [Errno 22] Invalid argument: '/rhev/data-center/42a09e2e-8665-4bc6-98d1-583b9c0d92e0'

[ Reboot takes place ]

Comment 8 Sanjay Mehrotra 2011-05-03 04:34:28 UTC
[AB] Sanjay, Where do we restart iscsiadm? and when (and why) should we
rediscover the luns?
The iscsid daemon is started as part of the vdsm daemon restart (see the vdsm/vdsm/vdsmd service source code).
[ Snippet from the service code ]
NEEDED_SERVICES="iscsid multipathd netconsole"

I was wrong on this: the script first checks each service's status and starts it only if it is not running.

if ! /sbin/service $srv status > /dev/null 2>&1; then it runs /sbin/service $srv start.  In the case of iscsid, it tries a force-start.

After the vdsm restart from spmprotect.sh, the vdsm code tries to rediscover the volumes up the software stack (multipath, device-mapper and LVM), so we are okay with the discovery process.
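A self-contained sketch of that check-then-start loop. The service function below is a stub standing in for /sbin/service, and the "only multipathd is running" state is an assumption, so the sketch runs anywhere; the real vdsmd init script calls the system service command directly:

```shell
# Sketch of the vdsmd init-script logic: start each needed service only if
# its status check fails. `service` here is a stub, not /sbin/service.
NEEDED_SERVICES="iscsid multipathd netconsole"

service() {
    case "$2" in
        status) [ "$1" = "multipathd" ] ;;   # pretend only multipathd is up
        start)  echo "starting $1" ;;
    esac
}

for srv in $NEEDED_SERVICES; do
    if ! service "$srv" status > /dev/null 2>&1; then
        service "$srv" start
    fi
done
```

With the assumed state, this prints "starting iscsid" and "starting netconsole" and leaves multipathd alone.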

Comment 9 Sanjay Mehrotra 2011-05-03 04:40:01 UTC
Is this a duplicate of bug 678853?

Comment 10 Sanjay Mehrotra 2011-05-03 04:44:43 UTC
Since more recent packages have been released, I request that this defect be reproduced with the latest packages, especially with the changes made to fix bug 678853.

On 5/2/2011 22:18, Sanjay Mehrotra wrote:
> Noticed in the vdsm log file for BZ 690520, which has the following versions.  Trying to find out if there is any correlation between this message and the defect.
>
>
> [root@south-02 ~]# rpm -qa |grep vdsm
> vdsm-4.9-56.el6.x86_64
> vdsm-debuginfo-4.9-56.el6.x86_64
> vdsm-cli-4.9-56.el6.x86_64
> [root@south-02 ~]# rpm -qa |grep qemu
> qemu-img-0.12.1.2-2.152.el6.x86_64
> qemu-kvm-debuginfo-0.12.1.2-2.152.el6.x86_64
> gpxe-roms-qemu-0.9.7-6.4.el6.noarch
> qemu-kvm-0.12.1.2-2.152.el6.x86_64

Those are outdated packages. All of them.
I warmly suggest upgrading first.
Y.

>
> ----- Original Message -----
> From: "Yaniv Kaul"<ykaul 
> To: "Sanjay Mehrotra"<smehrotr>
> Cc: rhev-devel
> Sent: Monday, May 2, 2011 3:01:04 PM
> Subject: Re: [rhev-devel] KEY PV0 is not registered - What does it mean.
>

Comment 12 Saggi Mizrahi 2011-05-03 07:46:42 UTC

*** This bug has been marked as a duplicate of bug 678853 ***

