Bug 690520 - VDSM: host reboots when blocking connection to iscsi storage when Export and data SD are located in the same storage
Keywords:
Status: CLOSED DUPLICATE of bug 678853
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: vdsm
Version: 6.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Sanjay Mehrotra
QA Contact: yeylon@redhat.com
URL:
Whiteboard: storage
Depends On:
Blocks:
 
Reported: 2011-03-24 14:56 UTC by Dafna Ron
Modified: 2016-04-18 06:39 UTC (History)
CC: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-05-03 07:46:42 UTC
Target Upstream Version:


Attachments (Terms of Use)
logs (1.67 MB, application/x-gzip)
2011-03-24 14:56 UTC, Dafna Ron

Description Dafna Ron 2011-03-24 14:56:47 UTC
Created attachment 487364 [details]
logs

Description of problem:

setup:

2 hosts in the same cluster.
Attached one Data SD and one Export SD, both located on the same storage.

When blocking the iSCSI port, the host reboots. This happens only when an Export domain is attached, and only with RHEV-M 2.3.

Tested the following scenarios to verify this is the only failing configuration:

1) old vdsm version - host rebooted
2) 2 different export domains (to make sure there is no corruption) - host rebooted
3) attached iso instead of Export - host did not reboot
4) attached 2 Data Domains located in the same storage - host did not reboot
5) checked with NFS storage - host did not reboot


Version-Release number of selected component (if applicable):
vdsm-cli-4.9-55.el6.x86_64
vdsm-4.9-55.el6.x86_64
qemu-img-0.12.1.2-2.146.el6.x86_64
gpxe-roms-qemu-0.9.7-6.4.el6.noarch
qemu-kvm-0.12.1.2-2.146.el6.x86_64


How reproducible:
100%

Steps to Reproduce:
1. In a two-host cluster, attach a Data SD and an Export SD located on the same storage to the SPM host
2. Block communication to the storage using iptables
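A hypothetical helper for step 2; the exact rule used in this test is not recorded in the bug, so the direction (OUTPUT) and the default iSCSI port 3260 are assumptions:

```shell
# Block outgoing iSCSI traffic from the host to the storage.
# TCP port 3260 (the iSCSI default) and the OUTPUT chain are assumptions;
# the actual rule used during the test is not in the bug report.
iptables -A OUTPUT -p tcp --dport 3260 -j DROP

# After the test, delete the same rule to restore connectivity:
iptables -D OUTPUT -p tcp --dport 3260 -j DROP
```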
  
Actual results:

host will reboot

Expected results:

host should not reboot

Additional info: vdsm + rhevm logs

Comment 2 Dafna Ron 2011-03-24 15:35:45 UTC
reproduced with updated vdsm and qemu:

[root@south-02 ~]# rpm -qa |grep vdsm
vdsm-4.9-56.el6.x86_64
vdsm-debuginfo-4.9-56.el6.x86_64
vdsm-cli-4.9-56.el6.x86_64
[root@south-02 ~]# rpm -qa |grep qemu
qemu-img-0.12.1.2-2.152.el6.x86_64
qemu-kvm-debuginfo-0.12.1.2-2.152.el6.x86_64
gpxe-roms-qemu-0.9.7-6.4.el6.noarch
qemu-kvm-0.12.1.2-2.152.el6.x86_64

Comment 3 RHEL Program Management 2011-04-04 02:04:16 UTC
Since RHEL 6.1 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 4 Sanjay Mehrotra 2011-04-21 14:10:16 UTC
Currently investigating the issue through the provided log files and the source code (safelease.c, spmprotect.sh). Please feel free to post any other files that should be examined for this case.

Comment 5 Sanjay Mehrotra 2011-04-26 22:40:27 UTC
The current reboot-the-host functionality is performed by the Storage Pool Manager (SPM), which monitors the leases volume of the block-based storage domain. In a block-based storage domain, the leases area is a separate LV.

How does the SPM work with the leases volume? The SPM acquires the SPM lock (tied to the vdsm pid) on the LV (the leases volume). Once it has successfully acquired the lock, the renewal process starts, which basically writes a timestamp to the logical volume at a fixed interval. When communication to the storage is blocked (as in this test), the SPM waits through the timeouts, which leads to a "fencing" failure. spmprotect kills the vdsm daemon and restarts it; it also starts netconsole and iscsiadm. If the logical volumes are still unavailable after vdsm/SPM restarts, the host is force-rebooted with sudo reboot -f. RHEV Manager then shows the VM status as Unknown. In this test case, communication to the storage is still blocked, which causes the reboot.


[ References 
a) Code spmprotect.sh, safelease.c 
b) knowledge sessions slides presented by Marina & Vladik ] 

So any storage domain with a leases volume can result in a reboot of the host.

Questions (for NEEDINFO state): Was communication between host and storage blocked for both hosts in the cluster at the same time? The spm-lock file shows that both hosts rebooted, but with different timestamps.

Further to investigate:
1.  Why didn't the second host take over the LV to continue the lease renewal?
2.  The storage recovery process: is rebooting the host the best strategy?
3.  We do not try to discover the LUNs (from the iSCSI perspective); we just restart iscsiadm.
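The renew-then-escalate flow described above can be sketched as follows. This is a minimal illustration only: the interval and timeout values come from the domain metadata quoted in comment 7, but the function names and callbacks are assumptions, and the real logic lives in spmprotect.sh/safelease.c and differs in detail:

```python
import time

# LOCKRENEWALINTERVALSEC and LEASETIMESEC are the values from the domain
# metadata in comment 7; everything else is an assumed illustration.
LOCKRENEWALINTERVALSEC = 5
LEASETIMESEC = 60

def renew_lease(write_timestamp, now=time.monotonic, sleep=time.sleep):
    """Write a timestamp to the leases LV every LOCKRENEWALINTERVALSEC
    seconds; give up once LEASETIMESEC passes without a successful write."""
    last_success = now()
    while True:
        try:
            write_timestamp()
            last_success = now()
        except OSError:
            if now() - last_success > LEASETIMESEC:
                return False  # "fencing" failure: the caller must escalate
        sleep(LOCKRENEWALINTERVALSEC)

def escalate(stop_spm, restart_vdsm, reboot_host):
    """Escalation after lease renewal fails: stop SPM cleanly, else
    restart vdsm, else force-reboot the host (sudo reboot -f)."""
    for step in (stop_spm, restart_vdsm):
        if step():
            return "recovered"
    reboot_host()  # last resort: nothing else we can do
    return "rebooted"
```

The open question from comment 5 then becomes: in this test the storage stays blocked, so both recovery steps fail and the reboot branch is always taken.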

Comment 6 Ayal Baron 2011-04-27 21:10:04 UTC
[SNIP]

> as unknown.  In this test case, the communication is still blocked to the
> storage which causes the reboot to happen. 

[AB] This should actually not happen.  Upon VDSM startup, if all SPM state has been released, vdsm will shut spmprotect down and avoid the reboot.  This is only possible, though, if the umount of the master mount succeeded.

> 
> 
> [ References 
> a) Code spmprotect.sh, safelease.c 
> b) knowledge sessions slides presented by Marina & Vladik ] 
> 
> So, each Storage Domain, with leases volume will result in reboot of the host. 
> 
> Questions ( For Needinfo State )  -  Was the communication between the host and
> storage blocked for both the hosts in the cluster at the same time. Looking at
> the spm-lock file, it shows both the hosts were rebooting but with different
> times stamps. 

[AB] Yaniv, please provide above info.

> 
> Further to Investigate -
> 1.  Why didn't the second host took over the LVM to continue the lease renewal
> ?  
> 2.  Storage recovery recovery process ?  Is rebooting the host the best
> strategy ?  

[AB] It's a last resort.  We try to:
1. stop SPM (and properly free shared resources, namely umount masterfs)
2. if above fails - restart vdsm (and again try to free spm state incl. disconnecting iSCSI session etc. to make this work)
3. if above fails - reboot (nothing else we can do)

> 3.  We do not try to discover the LUNs ( from iSCSI perspective ), just restart
> the iscsiadm.

[AB] Sanjay, Where do we restart iscsiadm? and when (and why) should we rediscover the luns?

Comment 7 Sanjay Mehrotra 2011-05-03 04:06:29 UTC
Response to Ayal's queries.
[AB] This should actually not happen.  Upon VDSM startup, if all SPM state has
been released, vdsm will shut spmprotect down and will avoid reboot.  This is
only possible though if umount of the master mount succeeded.

In the vdsm log file for this case, the umount of the master succeeded.  Following is the log output for the pool with UUID xxxx92e0.
spmStatus here:
Thread-52945::DEBUG::2011-03-24 09:11:09,874::blockSD::156::Storage.Metadata::(_get) metadata=['CLASS=Data', 'DESCRIPTION=RHEL6', 'IOOPTIMEOUTSEC=10', 'LEASERETRIES=3', 'LEASETIMESEC=60', 'LOCKPOLICY=None', 'LOCKRENEWALINTERVALSEC=5', 'MASTER_VERSION=1', 'POOL_DESCRIPTION=RHEL6', 'POOL_DOMAINS=48693808-451b-4407-8aa1-d7df96d69124:Active,2e46da5a-2ad6-4672-a297-02b6d3713274:Active,ff008647-727c-42be-9dd0-e49d9c9df079:Active', 'POOL_SPM_ID=1', 'POOL_SPM_LVER=2', 'POOL_UUID=42a09e2e-8665-4bc6-98d1-583b9c0d92e0', 'PV0=pv:36090a068a074fb8fe0d7848d5a25cd12,uuid:kh2Ttn-etkH-Gm80-03SP-cGqt-fWmf-Ar81ld,pestart:0,pecount:798,mapoffset:0', 'ROLE=Master', 'SDUUID=ff008647-727c-42be-9dd0-e49d9c9df079', 'TYPE=ISCSI', 'VERSION=0', 'VGUUID=xbOe3g-6E7v-6Kif-5tBR-jGtO-roQT-nRC3Ee', '_SHA_CKSUM=51880329178e0549b5d933643a6587072e425259']

spmStop issued - here for the same UUID
MainThread::INFO::2011-03-24 09:12:30,993::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: spmStop, args: ( spUUID=42a09e2e-8665-4bc6-98d1-583b9c0d92e0)
MainThread::DEBUG::2011-03-24 09:12:30,994::task::491::TaskManager.Task::(_debug) Task 5121a9f1-d6d3-467e-bc27-ee0768a333b2: moving from state init -> state preparing

PrepareShutdown issued 
MainThread::INFO::2011-03-24 09:12:38,365::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: prepareForShutdown, args: ()

MainThread::DEBUG::2011-03-24 09:12:38,368::fileUtils::109::Storage.Misc.excCmd::(umount) '/usr/bin/sudo -n /bin/umount -f /rhev/data-center/mnt/blockSD/ff008647-727c-42be-9dd0-e49d9c9df079/master' (cwd None) 

[ The log does not mention whether the umount succeeded or not ]; vdsm did restart immediately:

Thread-16154::INFO::2011-03-24 09:12:39,605::blockSD::613::Storage.StorageDomain::(validate) sdUUID=48693808-451b-4407-8aa1-d7df96d69124
MainThread::INFO::2011-03-24 09:12:42,846::vdsm::71::vds::(run) I am the actual vdsm 4.9-55

[ Volumes are not available due to the blocked storage connection, hence the I/O error ]
MainThread::DEBUG::2011-03-24 09:13:41,240::lvm::352::Storage.Misc.excCmd::(cmd) SUCCESS: <err> = '  /dev/mapper/36090a068a074fb8fe0d7848d5a25cd12: read failed after 0 of 4096 at 53697511424: Input/output error\n  
[ The error above repeats for all /dev/mapper devices in the storage domain, including backups ]

MainThread::DEBUG::2011-03-24 09:13:41,953::spm::279::Storage.Misc.excCmd::(__releaseLocks) '/usr/bin/killall -g -USR1 spmprotect.sh' (cwd None)
MainThread::DEBUG::2011-03-24 09:13:42,034::spm::279::Storage.Misc.excCmd::(__releaseLocks) SUCCESS: <err> = ''; <rc> = 0
MainThread::WARNING::2011-03-24 09:13:42,037::spm::284::Storage.SPM::(__releaseLocks) SPM: found lease locks, releasing
MainThread::DEBUG::2011-03-24 09:13:43,039::spm::288::Storage.Misc.excCmd::(__releaseLocks) '/usr/bin/killall -0 spmprotect.sh' (cwd None)
MainThread::DEBUG::2011-03-24 09:13:43,060::spm::288::Storage.Misc.excCmd::(__releaseLocks) FAILED: <err> = 'spmprotect.sh: no process killed\n'; <rc> = 1

[ LVM recovery fails, multipathing fails, the iscsiadm session fails; again no connection to the storage ]
[ Cleanup of the repository fails - invalid args ]
MainThread::WARNING::2011-03-24 09:15:56,975::hsm::209::Storage.HSM::(__init__) Failed to clean Storage Repository.
OSError: [Errno 22] Invalid argument: '/rhev/data-center/42a09e2e-8665-4bc6-98d1-583b9c0d92e0'

[ Reboot takes place ]

Comment 8 Sanjay Mehrotra 2011-05-03 04:34:28 UTC
[AB] Sanjay, Where do we restart iscsiadm? and when (and why) should we
rediscover the luns?
The iscsid daemon is started as part of the vdsm daemon restart (see the vdsm/vdsm/vdsmd service source code).
[ Snippet from the service code ]
NEEDED_SERVICES="iscsid multipathd netconsole"

I was wrong on this: the script first checks each service's status and starts it only if it is not running.

if ! /sbin/service $srv status > /dev/null 2>&1; then it runs /sbin/service $srv start.  In the case of iscsid, it tries a force-start.

After the vdsm restart from spmprotect.sh, the vdsm code tries to rediscover the volumes up the software stack (multipath, device-mapper and LVM), so we are okay with the discovery process.
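A self-contained sketch of that check-then-start loop. The service function below is a stub standing in for /sbin/service, and the "only multipathd is running" state is an assumption, so the sketch runs anywhere; the real vdsmd init script calls the system service command directly:

```shell
# Sketch of the vdsmd init-script logic: start each needed service only if
# its status check fails. `service` here is a stub, not /sbin/service.
NEEDED_SERVICES="iscsid multipathd netconsole"

service() {
    case "$2" in
        status) [ "$1" = "multipathd" ] ;;   # pretend only multipathd is up
        start)  echo "starting $1" ;;
    esac
}

for srv in $NEEDED_SERVICES; do
    if ! service "$srv" status > /dev/null 2>&1; then
        service "$srv" start
    fi
done
```

With the assumed state, this prints "starting iscsid" and "starting netconsole" and leaves multipathd alone.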

Comment 9 Sanjay Mehrotra 2011-05-03 04:40:01 UTC
Is this a duplicate of bug 678853?

Comment 10 Sanjay Mehrotra 2011-05-03 04:44:43 UTC
Since more recent packages have been released, I request that this defect be reproduced with the latest packages, especially with the changes made to fix bug 678853.

On 5/2/2011 22:18, Sanjay Mehrotra wrote:
> Noticed in the vdsm log file for BZ 690520, which has the following versions.  Trying to find out if there is any correlation between this message and the defect.
>
>
> [root@south-02 ~]# rpm -qa |grep vdsm
> vdsm-4.9-56.el6.x86_64
> vdsm-debuginfo-4.9-56.el6.x86_64
> vdsm-cli-4.9-56.el6.x86_64
> [root@south-02 ~]# rpm -qa |grep qemu
> qemu-img-0.12.1.2-2.152.el6.x86_64
> qemu-kvm-debuginfo-0.12.1.2-2.152.el6.x86_64
> gpxe-roms-qemu-0.9.7-6.4.el6.noarch
> qemu-kvm-0.12.1.2-2.152.el6.x86_64

Those are outdated packages. All of them.
I warmly suggest upgrading first.
Y.

>
> ----- Original Message -----
> From: "Yaniv Kaul"<ykaul 
> To: "Sanjay Mehrotra"<smehrotr>
> Cc: rhev-devel
> Sent: Monday, May 2, 2011 3:01:04 PM
> Subject: Re: [rhev-devel] KEY PV0 is not registered - What does it mean.
>

Comment 12 Saggi Mizrahi 2011-05-03 07:46:42 UTC

*** This bug has been marked as a duplicate of bug 678853 ***

