Bug 1150008

| Summary: | [ppc64] Cannot mix ppc and x86 hosts on same storage domain due to sanlock | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Jiri Belka <jbelka> |
| Component: | vdsm | Assignee: | Nir Soffer <nsoffer> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Jiri Belka <jbelka> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 3.4.0 | CC: | acanan, amureini, bazulay, dfediuck, ebenahar, ecohen, eedri, gklein, hannsj_uhl, iheim, jbelka, lpeer, lsurette, lsvaty, michal.skrivanek, ogofen, scohen, sherold, teigland, tnisan, yeylon |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | 3.5.0 | | |
| Hardware: | ppc64 | | |
| OS: | Linux | | |
| Whiteboard: | storage | | |
| Fixed In Version: | | Doc Type: | Known Issue |
| Doc Text: | Sanlock versions before 3.2.2 for PPC were not compatible with sanlock for x86. Data centers created with such a version could not be used by x86 hosts, and data centers created with an x86 host could not be used by PPC hosts. Sanlock 3.2.2 fixed this issue, and that version is now required. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Clones: | 1159821 1176396 (view as bug list) | | |
| Last Closed: | 2015-02-16 13:37:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1159821 | | |
| Bug Blocks: | 1122979, 1148013, 1176396 | | |
| Attachments: | | | |
Please attach these logs, showing the time frame where you got this error:

- /var/log/messages
- /var/log/sanlock.log
- /var/log/audit/audit.log

What do you mean by "How reproducible: ??"? How many times did you attempt to add a host, and how many times did it fail?

Looks like a potential blocker, but I'm not sure how reproducible it is. Jiri, can you try to reproduce this?

Jiri, update? It would mean we can't use mixed hosts in a DC.

Lukas, can you please try to reproduce as soon as the hosts are re-installed with the latest image?

I was unable to add iSCSI to a 3.4.3 engine; the attempts failed with various errors, mostly: "Operation Canceled. Error while executing action New SAN Storage Domain: Network error during communication with the Host." At the moment this is blocking reproduction.

Was there some blocker with adding iSCSI to the engine? Or with adding it as the master storage domain?

Once this is fixed or a workaround is proposed, the setup is ready for testing. I left my non-ppc host on the testing engine, so this will be easier to reproduce.

Gil, this issue is caused either by another rhev/vdsm iSCSI connection bug, or by a wrong or unsupported iSCSI configuration (an example of an unsupported configuration is a one-LUN-to-one-IQN-to-one-host correspondence). I don't think this will be a PPC issue, but never say never.

Jiri, I asked for additional logs (see comment 2) on 2014-10-14. Can you attach these files?

lsvaty, OK then. Dell servers have a "special" design that basically exposes one LUN per iSCSI target (see the targets chapter, http://www.thomas-krenn.com/en/wiki/ISCSI_Basics). In addition, there is a flag (enabled by default) that prevents multiple hosts from initiating iscsiadm sessions to the same target while another host is already connected to it (for security reasons; this can be disabled via Dell SSux).

So what probably happened is that the removed host didn't "give up" its iscsiadm sessions and remained logged in (to at least one session). The second host wants to "see" the domain but can't, since multiple login sessions are not allowed on the target, and in the Dell storage design every LUN is unique to its target, so everything crashes.

Reproduced with:
- bos pserver8
- x86 Dell brq server
- engine 3.4.3 (av12.3) in brq
- iSCSI in brq

Created attachment 949729 [details]
engine logs
Created attachment 949732 [details]
p8 logs
The x86 host on which the iSCSI domain was added no longer has iSCSI storage attached:

```
[root@dell-r210ii-04 ~]# /etc/init.d/iscsi status
No active sessions
```

More precise steps:
- new DC
- new CL x86
- add x86 host
- add iSCSI storage
- put host into maintenance
- new CL ppc64
- add ppc64 host

Didn't we say one cannot mix x86 and ppc in the same DC, due to sanlock still not being BE-friendly in the current image, hence we cannot mix BE and LE sanlock in the same DC?

David, can you confirm that sanlock can work with mixed hosts (big endian, little endian) using the same lockspace/leases?

(In reply to Itamar Heim from comment #18)
> didn't we say one cannot mix x86 and ppc in same DC due sanlock not being
> BE-friendly still in the current image hence we cannot mix BE and LE sanlock
> in same DC?

sanlock-3.2.1 is supposed to solve this AFAIK, but let's wait for David to confirm (see needinfo in comment 19).

Big endian support was first included in sanlock-3.2.0, so 3.2.1 is good. (That includes hosts with different endianness sharing the same lockspace.)

We can see in the vdsm log that acquiring the host id failed:
```
6c8cb449-0834-4aba-b6ca-5e33cc20c085::INFO::2014-10-23 08:09:31,310::clusterlock::184::SANLock::(acquireHostId) Acquiring host id for domain 63dec6fe-eddf-45ed-bb0c-03a85af83ca4 (id: 1)
6c8cb449-0834-4aba-b6ca-5e33cc20c085::ERROR::2014-10-23 08:09:32,311::task::866::TaskManager.Task::(_setError) Task=`6c8cb449-0834-4aba-b6ca-5e33cc20c085`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 334, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/sp.py", line 269, in startSpm
    self.masterDomain.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 468, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 199, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('63dec6fe-eddf-45ed-bb0c-03a85af83ca4', SanlockException(-229, 'Sanlock lockspace add failure', 'Sanlock exception'))
6c8cb449-0834-4aba-b6ca-5e33cc20c085::DEBUG::2014-10-23 08:09:32,324::task::885::TaskManager.Task::(_run) Task=`6c8cb449-0834-4aba-b6ca-5e33cc20c085`::Task._run: 6c8cb449-0834-4aba-b6ca-5e33cc20c085 () {} failed - stopping task
```
Looking in the sanlock log, we see a few unexpected errors:

```
2014-10-22 11:35:32+0000 44 [35230]: sanlock daemon started 3.2.1 host d4bbe87a-1480-45e9-8ae9-884d04aa02ff.localhost
2014-10-22 11:35:32+0000 44 [35230]: set scheduler RR|RESET_ON_FORK priority 99 failed: Operation not permitted
2014-10-23 08:09:07+0000 59345 [35256]: s1 lockspace 63dec6fe-eddf-45ed-bb0c-03a85af83ca4:1:/dev/63dec6fe-eddf-45ed-bb0c-03a85af83ca4/ids:0
2014-10-23 08:09:07+0000 59345 [8138]: verify_leader 1 wrong checksum 7c524fd6 2e00fb25 /dev/63dec6fe-eddf-45ed-bb0c-03a85af83ca4/ids
2014-10-23 08:09:07+0000 59345 [8138]: leader1 delta_acquire_begin error -229 lockspace 63dec6fe-eddf-45ed-bb0c-03a85af83ca4 host_id 1
2014-10-23 08:09:07+0000 59345 [8138]: leader2 path /dev/63dec6fe-eddf-45ed-bb0c-03a85af83ca4/ids offset 0
2014-10-23 08:09:07+0000 59345 [8138]: leader3 m 12212010 v 30002 ss 512 nh 0 mh 1 oi 1 og 1 lv 0
2014-10-23 08:09:07+0000 59345 [8138]: leader4 sn 63dec6fe-eddf-45ed-bb0c-03a85af83ca4 rn 91e42962-4467-442e-8b2b-82086579a14d.dell-r210i ts 0 cs 7c524fd6
2014-10-23 08:09:08+0000 59346 [35256]: s1 add_lockspace fail result -229
```
David, can you check the sanlock log and explain the nature of these errors (see attachment 949732 [details])? How would you recommend debugging this issue?
Created attachment 951840 [details]
sanlock lease file from little endian machine
I don't know what the problem is, but I suspect something related to endianness. Are these problems occurring in a mixed-endian cluster with both big and little endian machines? I'm assuming that's the case. Please try the following:

1. Download the binary file "sanlock-init-little" that I've attached and copy it to a big endian machine. Then send the output of running this command:

```
sanlock direct read_leader -s test:1:sanlock-init-little:0
```

2. Run these commands on a big endian machine and send me both the output of the commands and the resulting file "sanlock-init-big":

```
touch sanlock-init-big
sanlock direct init -s test:0:sanlock-init-big:0
sanlock direct read_leader -s test:1:sanlock-init-big:0
```

3. Copy the binary file "sanlock-init-big" to one of your little endian machines and send me the output of this command:

```
sanlock direct read_leader -s test:1:sanlock-init-big:0
```
Created attachment 951844 [details]
test program
sanlock uses a crc function to compute the checksums, and I suspect that the crc computation produces different results on le/be machines. Please download the attached testcrc.c file, copy it to a big endian machine, compile (gcc testcrc.c), run ./a.out, and send me the output.

On my little endian machine I get the output:

```
# ./a.out
hash ea12f9b4 (teststring)
```

If the result is different on a big endian machine, then I will need to either fix the crc I'm using or disable the checksumming.
I got access to a BE machine and was able to run the tests in comment 24. The BE machine could not install gcc, so I was not able to run the test in comment 25.

I may have a solution to this problem, but I will need access to a big endian machine with gcc so I can compile and run a small C program.

Gil, is there a way to get David access to a BE host with gcc? I'm quite certain it won't be possible to get it using the IBM image, due to no yum, the locked-down image, etc. Perhaps there is a way to compile from a BE VM that has more flexibility for getting gcc installed?

While we are running the IBM image on all of the PPC servers, I think the only option would be to copy the gcc binaries into it.

(In reply to Gil Klein from comment #30)
> While we are running the IBM image on all of the PPC server, I think the
> only option would be to copy the gcc binaries into it.

You only need a guest. Just create a VM and install Fedora or anything with gcc…

We now have a ppc64 VM with gcc on the QE setup.

I've identified the bug (the crc checksum was computed on the data before the data was endian-swapped), and I am working on the fix.

sanlock-3.2.2-2test1 is a scratch build with the fix: https://brewweb.devel.redhat.com/taskinfo?taskID=8180588

It looks like this bz will need to be changed to a RHEL 7.1 sanlock bz. Note that you need to recreate any lockspaces that were created with the previous version.

Jiri, can you verify that the new sanlock build (see comment 34) solves this issue? Note that you cannot test it with a storage domain created by the current ppc version, since it contains an invalid checksum (see comment 35). You must create a new storage domain. I think the verification should be:

1. Activate the ppc host as SPM.
2. Create a storage domain: tests creation of a storage domain from the ppc host, and acquiring a host id.
3. Activate the x86 host: tests acquiring a host id from the x86 host on a lockspace created by the ppc host.
4. Switch SPM to the x86 host: tests acquiring the SPM lease when the leases were initialized by the ppc host.

Expected result: both hosts should be up, and SPM runs on the x86 host. Repeat this again switching x86 and ppc (start with SPM on x86).

David, do you think these tests cover everything?

Yes, that should cover it.

David, there are no files now at https://brewweb.devel.redhat.com/taskinfo?taskID=8180588. Can you do a new build, or point us to a place where the rpms for both x86 and ppc are?

David started a new build here: https://brewweb.devel.redhat.com/taskinfo?taskID=8206100
Jiri, please take these packages before they expire.

Tested based on comment 36:
- adding ppc64
- adding iscsi storage
- kill vm
- add x86_64 host
- migrate spm

Plus, I added a test of real functionality (new vm, start vm) between the steps, and reversed the order of hosts for the other flow.

sanlock-3.2.2-2.fc20 includes the checksum endianness fix: http://koji.fedoraproject.org/koji/buildinfo?buildID=590689

This bug is proposed to be cloned to 3.4.z, but missed the 3.4.4 builds. Moving to 3.4.5; please clone once ready.

(In reply to David Teigland from comment #41)
> sanlock-3.2.2-2.fc20 includes the checksum endianness fix
> http://koji.fedoraproject.org/koji/buildinfo?buildID=590689

Elad, can we test with this build and verify that it fixes the issue?

I gave it a try on an existing DC created with the GA version of RHEV For Power. As expected, the lockspace is no longer compatible and the DC needs to be re-created once sanlock-3.2.2-2.1.pkvm2_1_1.1.ppc64 is used:
```
c5493ac7-2207-4c0b-9385-e6ed6a4d14d1::INFO::2014-11-21 08:29:25,835::clusterlock::184::SANLock::(acquireHostId) Acquiring host id for domain 1cbffa17-988c-4f23-a706-b2573f53acf4 (id: 2)
c5493ac7-2207-4c0b-9385-e6ed6a4d14d1::ERROR::2014-11-21 08:29:26,836::task::866::TaskManager.Task::(_setError) Task=`c5493ac7-2207-4c0b-9385-e6ed6a4d14d1`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 334, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/sp.py", line 269, in startSpm
    self.masterDomain.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 468, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 199, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('1cbffa17-988c-4f23-a706-b2573f53acf4', SanlockException(-229, 'Sanlock lockspace add failure', 'Sanlock exception'))
```

sanlock failure:

```
2014-11-21 08:30:10+0000 668233 [96454]: verify_leader 2 wrong checksum 7d85b535 52fc8a69
```
...and I managed to verify it. Once the DC is created with sanlock 3.2.2, it works OK in a mixed environment; I tested SPM on ppc and x86 within the same DC. A host with sanlock 3.2.1 in that DC can't become SPM.

The only remaining issue is the painful procedure of creating a new DC and moving all the data there (the only way is to export & import the VMs; make sure all the storage domains are removed/unattached while there is still the "old" DC and some host there).

Removed outdated doctext. This bz is not doing any good, because the actual fix is in sanlock, and the sanlock bz for this is 1159821, which does not have the necessary flags.

sanlock-3.2.2-2.el7 is now built with the endian fix: https://brewweb.devel.redhat.com/taskinfo?taskID=8344598

(In reply to David Teigland from comment #49)
> sanlock-3.2.2-2.el7 is now built with the endian fix,
> https://brewweb.devel.redhat.com/taskinfo?taskID=8344598

Elad, https://brewweb.devel.redhat.com/buildinfo?buildID=402377 now contains both PPC and x86 builds; can you retest, please?

We do not have the HW to verify/reproduce. Jiri, can you please help with this one?

IIUC, the storage has to be re-created with the newer sanlock, otherwise it cannot be used (see the discussion above). QA can help to verify; just move to ON_QA. Comment 40 states that the newer sanlock solved the issue.

This should work:
- DC created on PPC using sanlock 3.2.2+
- DC created on x86 using any sanlock version

OK, tested based on comment 53:

- sanlock-3.2.2-2.1.pkvm2_1_1.1.ppc64
- vdsm-4.14.18-0.pkvm2_1_1.1.ppc64
- sanlock-3.1.0-2.el7.x86_64
- vdsm-4.14.18-5.el7ev.x86_64

The bug has the 3.4.z flag proposed; please clone to 3.4.5 if needed.

(In reply to Jiri Belka from comment #54)
> ok, tested based on #53
>
> sanlock-3.2.2-2.1.pkvm2_1_1.1.ppc64
> vdsm-4.14.18-0.pkvm2_1_1.1.ppc64

vdsm still requires sanlock >= 2.8. On PPC, sanlock 3.2.2 is available, but on regular RHEL versions, we need to require the new version in vdsm.
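The dependency bump described in the last comment would land in vdsm's RPM packaging. A minimal sketch of what such a change could look like (the file name `vdsm.spec` and the exact surrounding lines are assumptions, not taken from this bug; only the existing `sanlock >= 2.8` floor is quoted from the comment above):

```spec
# Hypothetical excerpt from vdsm.spec: raise the minimum sanlock
# version from 2.8 so hosts cannot run a sanlock build that
# predates the checksum endianness fix.
Requires: sanlock >= 3.2.2
```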
Created attachment 944469 [details]
engine.log, vdsm.log

Description of problem:
I cannot make the iSCSI DC come up while using a ppc64 host.

```
b65c01b8-71ff-4f6a-a9ed-b627d52b283a::ERROR::2014-10-07 04:31:08,109::task::866::TaskManager.Task::(_setError) Task=`b65c01b8-71ff-4f6a-a9ed-b627d52b283a`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 334, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/sp.py", line 269, in startSpm
    self.masterDomain.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 468, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 199, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('0bd899ea-85a6-4384-a98c-624d4bd6d584', SanlockException(-229, 'Sanlock lockspace add failure', 'Sanlock exception'))
Thread-2190::ERROR::2014-10-07 04:31:19,265::dispatcher::68::Storage.Dispatcher.Protect::(run) Secured object is not in safe state
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 60, in run
    result = ctask.prepare(self.func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 103, in wrapper
    return m(self, *a, **kw)
  File "/usr/share/vdsm/storage/task.py", line 1176, in prepare
    raise self.error
SecureError: Secured object is not in safe state
```

Version-Release number of selected component (if applicable):

- iscsi-initiator-utils-6.2.0.873-21.pkvm2_1.1.ppc64
- vdsm-4.14.17-1.mrkev.ppc64
- libiscsi-1.7.0-4.pkvm2_1.2.ppc64
- libvirt-lock-sanlock-1.1.3-1.pkvm2_1.17.6.ppc64
- sanlock-python-3.2.1-1.pkvm2_1.1.ppc64
- iscsi-initiator-utils-iscsiuio-6.2.0.873-21.pkvm2_1.1.ppc64
- sanlock-lib-3.2.1-1.pkvm2_1.1.ppc64
- sanlock-3.2.1-1.pkvm2_1.1.ppc64

How reproducible:
??

Steps to Reproduce:
1. make dc, cl and add a non-ppc64 host, add iscsi storage
2. remove non-ppc64 host
3. add ppc64 host into this dc/cl

Actual results:
Invalid status on Data Center iscsi. Setting status to Non Responsive.

Expected results:
Should work.

Additional info:
Well, I'm not 100% sure about the reproduction steps, because I had iscsi storage there without a host. But as I used non-ppc64 hosts before, I conclude the flow is as written above.