Bug 1150008

| Summary: | [ppc64] Cannot mix ppc and x86 hosts on same storage domain due to sanlock | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Jiri Belka <jbelka> |
| Component: | vdsm | Assignee: | Nir Soffer <nsoffer> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Jiri Belka <jbelka> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 3.4.0 | CC: | acanan, amureini, bazulay, dfediuck, ebenahar, ecohen, eedri, gklein, hannsj_uhl, iheim, jbelka, lpeer, lsurette, lsvaty, michal.skrivanek, ogofen, scohen, sherold, teigland, tnisan, yeylon |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | 3.5.0 | | |
| Hardware: | ppc64 | | |
| OS: | Linux | | |
| Whiteboard: | storage | | |
| Fixed In Version: | | Doc Type: | Known Issue |
| Doc Text: | Sanlock versions before 3.2.2 for PPC were not compatible with sanlock for x86. Data centers created with such a version could not be used by x86 hosts, and data centers created with an x86 host could not be used by PPC hosts. Sanlock 3.2.2 fixed this issue, and that version is now required. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Clones: | 1159821 1176396 (view as bug list) | | |
| Last Closed: | 2015-02-16 13:37:01 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1159821 | | |
| Bug Blocks: | 1122979, 1148013, 1176396 | | |
| Attachments: | | | |
Please attach these logs, showing the time frame where you got this error:

- /var/log/messages
- /var/log/sanlock.log
- /var/log/audit/audit.log

What do you mean by "How reproducible: ??"? How many times did you attempt to add a host, and how many times did it fail?

Looks like a potential blocker, but I'm not sure how reproducible it is. Jiri, can you try to reproduce this?

Jiri, update? It would mean we can't use mixed hosts in a DC.

Lukas, can you please try to reproduce as soon as the hosts are re-installed with the latest image?

I was unable to add iSCSI to a 3.4.3 engine; the attempts failed with various errors, mostly: "Operation Canceled. Error while executing action New SAN Storage Domain: Network error during communication with the Host." At the moment this is blocking reproduction.

Was there some blocker with adding iSCSI to the engine? Or with adding it as the master storage domain?

Once this is fixed or a workaround is proposed, the setup is ready for testing. I left my non-ppc host on the testing engine, so this will be easier to reproduce.

Gil, this issue is caused either by another rhev/vdsm iSCSI connection bug, or by a wrong or unsupported iSCSI configuration (an example of an unsupported configuration is a one-LUN-to-one-IQN-to-one-host correspondence). I don't think this will be a PPC issue, but never say never.

Jiri, I asked for additional logs (see comment 2) on 2014-10-14. Can you attach these files?

lsvaty, OK then. Dell servers have a "special" design that basically exposes one LUN per iSCSI target (see the targets chapter, http://www.thomas-krenn.com/en/wiki/ISCSI_Basics). In addition, there is a flag (enabled by default) that prevents multiple hosts from initiating iscsiadm sessions to the same target while another host is already connected to it (for security reasons; this can be disabled via Dell SSux).

So what probably happened is that the removed host didn't "give up" its iscsiadm sessions and remained logged in (to at least one session). The second host wants to "see" the domain but can't, since multiple login sessions are not allowed on the target, and in the Dell storage design every LUN is unique to its target, so everything crashes.

Reproduced with:
- bos pserver8
- x86 Dell brq server
- engine 3.4.3 (av12.3) in brq
- iSCSI in brq

Created attachment 949729 [details]
engine logs
Created attachment 949732 [details]
p8 logs
The x86 host on which the iSCSI domain was added no longer has iSCSI storage attached:

```
[root@dell-r210ii-04 ~]# /etc/init.d/iscsi status
No active sessions
```

More precise steps:
- new DC
- new CL x86
- add x86 host
- add iSCSI storage
- put host into maintenance
- new CL ppc64
- add ppc64 host

Didn't we say one cannot mix x86 and ppc in the same DC, due to sanlock still not being BE-friendly in the current image, hence we cannot mix BE and LE sanlock in the same DC?

David, can you confirm that sanlock can work with mixed hosts (big endian, little endian) using the same lockspace/leases?

(In reply to Itamar Heim from comment #18)
> didn't we say one cannot mix x86 and ppc in same DC due sanlock not being
> BE-friendly still in the current image hence we cannot mix BE and LE sanlock
> in same DC?

sanlock-3.2.1 is supposed to solve this AFAIK, but let's wait for David to confirm (see needinfo in comment 19).

Big endian support was first included in sanlock-3.2.0, so 3.2.1 is good. (That includes hosts with different endianness sharing the same lockspace.)

We can see in the vdsm log that acquiring the host id failed:
```
6c8cb449-0834-4aba-b6ca-5e33cc20c085::INFO::2014-10-23 08:09:31,310::clusterlock::184::SANLock::(acquireHostId) Acquiring host id for domain 63dec6fe-eddf-45ed-bb0c-03a85af83ca4 (id: 1)
6c8cb449-0834-4aba-b6ca-5e33cc20c085::ERROR::2014-10-23 08:09:32,311::task::866::TaskManager.Task::(_setError) Task=`6c8cb449-0834-4aba-b6ca-5e33cc20c085`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 334, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/sp.py", line 269, in startSpm
    self.masterDomain.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 468, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 199, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('63dec6fe-eddf-45ed-bb0c-03a85af83ca4', SanlockException(-229, 'Sanlock lockspace add failure', 'Sanlock exception'))
6c8cb449-0834-4aba-b6ca-5e33cc20c085::DEBUG::2014-10-23 08:09:32,324::task::885::TaskManager.Task::(_run) Task=`6c8cb449-0834-4aba-b6ca-5e33cc20c085`::Task._run: 6c8cb449-0834-4aba-b6ca-5e33cc20c085 () {} failed - stopping task
```
Looking in the sanlock log, we see a few unexpected errors:

```
2014-10-22 11:35:32+0000 44 [35230]: sanlock daemon started 3.2.1 host d4bbe87a-1480-45e9-8ae9-884d04aa02ff.localhost
2014-10-22 11:35:32+0000 44 [35230]: set scheduler RR|RESET_ON_FORK priority 99 failed: Operation not permitted
2014-10-23 08:09:07+0000 59345 [35256]: s1 lockspace 63dec6fe-eddf-45ed-bb0c-03a85af83ca4:1:/dev/63dec6fe-eddf-45ed-bb0c-03a85af83ca4/ids:0
2014-10-23 08:09:07+0000 59345 [8138]: verify_leader 1 wrong checksum 7c524fd6 2e00fb25 /dev/63dec6fe-eddf-45ed-bb0c-03a85af83ca4/ids
2014-10-23 08:09:07+0000 59345 [8138]: leader1 delta_acquire_begin error -229 lockspace 63dec6fe-eddf-45ed-bb0c-03a85af83ca4 host_id 1
2014-10-23 08:09:07+0000 59345 [8138]: leader2 path /dev/63dec6fe-eddf-45ed-bb0c-03a85af83ca4/ids offset 0
2014-10-23 08:09:07+0000 59345 [8138]: leader3 m 12212010 v 30002 ss 512 nh 0 mh 1 oi 1 og 1 lv 0
2014-10-23 08:09:07+0000 59345 [8138]: leader4 sn 63dec6fe-eddf-45ed-bb0c-03a85af83ca4 rn 91e42962-4467-442e-8b2b-82086579a14d.dell-r210i ts 0 cs 7c524fd6
2014-10-23 08:09:08+0000 59346 [35256]: s1 add_lockspace fail result -229
```
David, can you check the sanlock log and explain the nature of these errors (see attachment 949732 [details])? How would you recommend debugging this issue?
Created attachment 951840 [details]
sanlock lease file from little endian machine
I don't know what the problem is, but I suspect something related to endianness. Are these problems occurring in a mixed-endian cluster with both big and little endian machines? I'm assuming that's the case. Please try the following:

1. Download the binary file "sanlock-init-little" that I've attached and copy it to a big endian machine. Then send the output of running this command:

```
sanlock direct read_leader -s test:1:sanlock-init-little:0
```

2. Run these commands on a big endian machine and send me both the output of the commands and the resulting file "sanlock-init-big":

```
touch sanlock-init-big
sanlock direct init -s test:0:sanlock-init-big:0
sanlock direct read_leader -s test:1:sanlock-init-big:0
```

3. Copy the binary file "sanlock-init-big" to one of your little endian machines and send me the output of this command:

```
sanlock direct read_leader -s test:1:sanlock-init-big:0
```
Created attachment 951844 [details]
test program
sanlock uses a crc function to compute the checksums, and I suspect that the crc computation produces different results on le/be machines. Please download the attached testcrc.c file, copy it to a big endian machine, compile (gcc testcrc.c), run ./a.out, and send me the output.

On my little endian machine I get the output:

```
# ./a.out
hash ea12f9b4 (teststring)
```

If the result is different on a big endian machine, then I will need to either fix the crc I'm using or disable the checksumming.
I got access to a BE machine and was able to run the tests in comment 24. The BE machine could not install gcc, so I was not able to run the test in comment 25.

I may have a solution to this problem, but I will need access to a big endian machine with gcc so I can compile and run a small C program.

Gil, is there a way to get David access to a BE host with gcc? I'm quite certain it won't be possible to get it using the IBM image, due to no yum, the locked-down image, etc. Perhaps there is a way to compile from a BE VM that has more flexibility for getting gcc installed?

While we are running the IBM image on all of the PPC servers, I think the only option would be to copy the gcc binaries into it.

(In reply to Gil Klein from comment #30)
> While we are running the IBM image on all of the PPC server, I think the
> only option would be to copy the gcc binaries into it.

You only need a guest. Just create a VM and install Fedora or anything with gcc…

We now have a ppc64 VM with gcc on the QE setup.

I've identified the bug (the crc checksum was computed on the data before the data was endian-swapped), and I am working on the fix.

sanlock-3.2.2-2test1 is a scratch build with the fix: https://brewweb.devel.redhat.com/taskinfo?taskID=8180588

It looks like this bz will need to be changed to a RHEL 7.1 sanlock bz. Note that you need to recreate any lockspaces that were created with the previous version.

Jiri, can you verify that the new sanlock build (see comment 34) solves this issue? Note that you cannot test it with a storage domain created by the current ppc version, since it contains an invalid checksum (see comment 35). You must create a new storage domain. I think the verification should be:

1. Activate the ppc host as SPM.
2. Create a storage domain: tests creation of a storage domain from the ppc host, and acquiring a host id.
3. Activate the x86 host: tests acquiring a host id from the x86 host on a lockspace created by the ppc host.
4. Switch SPM to the x86 host: tests acquiring the SPM lease when the leases were initialized by the ppc host.

Expected result: both hosts should be up, and SPM runs on the x86 host. Repeat this again switching x86 and ppc (start with SPM on x86).

David, do you think these tests cover everything?

Yes, that should cover it.

David, there are no files now at https://brewweb.devel.redhat.com/taskinfo?taskID=8180588. Can you do a new build, or point us to a place where the rpms for both x86 and ppc are?

David started a new build here: https://brewweb.devel.redhat.com/taskinfo?taskID=8206100
Jiri, please take these packages before they expire.

Tested based on comment 36:
- adding ppc64
- adding iscsi storage
- kill vm
- add x86_64 host
- migrate spm

Plus, I added a test of real functionality (new vm, start vm) between the steps, and reversed the order of hosts for the other flow.

sanlock-3.2.2-2.fc20 includes the checksum endianness fix: http://koji.fedoraproject.org/koji/buildinfo?buildID=590689

This bug is proposed to be cloned to 3.4.z, but missed the 3.4.4 builds. Moving to 3.4.5; please clone once ready.

(In reply to David Teigland from comment #41)
> sanlock-3.2.2-2.fc20 includes the checksum endianness fix
> http://koji.fedoraproject.org/koji/buildinfo?buildID=590689

Elad, can we test with this build and verify that it fixes the issue?

I gave it a try on an existing DC created with the GA version of RHEV For Power. As expected, the lockspace is no longer compatible and the DC needs to be re-created once sanlock-3.2.2-2.1.pkvm2_1_1.1.ppc64 is used:
```
c5493ac7-2207-4c0b-9385-e6ed6a4d14d1::INFO::2014-11-21 08:29:25,835::clusterlock::184::SANLock::(acquireHostId) Acquiring host id for domain 1cbffa17-988c-4f23-a706-b2573f53acf4 (id: 2)
c5493ac7-2207-4c0b-9385-e6ed6a4d14d1::ERROR::2014-11-21 08:29:26,836::task::866::TaskManager.Task::(_setError) Task=`c5493ac7-2207-4c0b-9385-e6ed6a4d14d1`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 334, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/sp.py", line 269, in startSpm
    self.masterDomain.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 468, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 199, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('1cbffa17-988c-4f23-a706-b2573f53acf4', SanlockException(-229, 'Sanlock lockspace add failure', 'Sanlock exception'))
```

sanlock failure:

```
2014-11-21 08:30:10+0000 668233 [96454]: verify_leader 2 wrong checksum 7d85b535 52fc8a69
```
...and I managed to verify it. Once the DC is created with sanlock 3.2.2, it works OK in a mixed environment; I tested SPM on ppc and x86 within the same DC. A host with sanlock 3.2.1 in that DC can't become SPM.

The only remaining issue is the painful procedure of creating a new DC and moving all the data there (the only way is to export & import the VMs; make sure all the storage domains are removed/unattached while there is still the "old" DC and some host there).

Removed outdated doctext. This bz is not doing any good, because the actual fix is in sanlock, and the sanlock bz for this is 1159821, which does not have the necessary flags.

sanlock-3.2.2-2.el7 is now built with the endian fix: https://brewweb.devel.redhat.com/taskinfo?taskID=8344598

(In reply to David Teigland from comment #49)
> sanlock-3.2.2-2.el7 is now built with the endian fix,
> https://brewweb.devel.redhat.com/taskinfo?taskID=8344598

Elad, https://brewweb.devel.redhat.com/buildinfo?buildID=402377 now contains both PPC and x86 builds; can you retest, please?

We do not have the HW to verify/reproduce. Jiri, can you please help with this one?

IIUC, the storage has to be re-created with the newer sanlock, otherwise it cannot be used (see the discussion above). QA can help to verify; just move to ON_QA. Comment 40 states that the newer sanlock solved the issue.

This should work:
- DC created on PPC using sanlock 3.2.2+
- DC created on x86 using any sanlock version

OK, tested based on comment 53:

- sanlock-3.2.2-2.1.pkvm2_1_1.1.ppc64
- vdsm-4.14.18-0.pkvm2_1_1.1.ppc64
- sanlock-3.1.0-2.el7.x86_64
- vdsm-4.14.18-5.el7ev.x86_64

The bug has the 3.4.z flag proposed; please clone to 3.4.5 if needed.

(In reply to Jiri Belka from comment #54)
> ok, tested based on #53
>
> sanlock-3.2.2-2.1.pkvm2_1_1.1.ppc64
> vdsm-4.14.18-0.pkvm2_1_1.1.ppc64

vdsm still requires sanlock >= 2.8. On PPC, sanlock 3.2.2 is available, but on regular RHEL versions, we need to require the new version in vdsm.
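The dependency bump described in the last comment would land in vdsm's RPM packaging. A minimal sketch of what such a change could look like (the file name `vdsm.spec` and the exact surrounding lines are assumptions, not taken from this bug; only the existing `sanlock >= 2.8` floor is quoted from the comment above):

```spec
# Hypothetical excerpt from vdsm.spec: raise the minimum sanlock
# version from 2.8 so hosts cannot run a sanlock build that
# predates the checksum endianness fix.
Requires: sanlock >= 3.2.2
```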
Created attachment 944469 [details]
engine.log, vdsm.log

Description of problem:
I cannot make the iSCSI DC come up while using a ppc64 host.

```
b65c01b8-71ff-4f6a-a9ed-b627d52b283a::ERROR::2014-10-07 04:31:08,109::task::866::TaskManager.Task::(_setError) Task=`b65c01b8-71ff-4f6a-a9ed-b627d52b283a`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 334, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/sp.py", line 269, in startSpm
    self.masterDomain.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 468, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 199, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('0bd899ea-85a6-4384-a98c-624d4bd6d584', SanlockException(-229, 'Sanlock lockspace add failure', 'Sanlock exception'))
Thread-2190::ERROR::2014-10-07 04:31:19,265::dispatcher::68::Storage.Dispatcher.Protect::(run) Secured object is not in safe state
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/dispatcher.py", line 60, in run
    result = ctask.prepare(self.func, *args, **kwargs)
  File "/usr/share/vdsm/storage/task.py", line 103, in wrapper
    return m(self, *a, **kw)
  File "/usr/share/vdsm/storage/task.py", line 1176, in prepare
    raise self.error
SecureError: Secured object is not in safe state
```

Version-Release number of selected component (if applicable):

- iscsi-initiator-utils-6.2.0.873-21.pkvm2_1.1.ppc64
- vdsm-4.14.17-1.mrkev.ppc64
- libiscsi-1.7.0-4.pkvm2_1.2.ppc64
- libvirt-lock-sanlock-1.1.3-1.pkvm2_1.17.6.ppc64
- sanlock-python-3.2.1-1.pkvm2_1.1.ppc64
- iscsi-initiator-utils-iscsiuio-6.2.0.873-21.pkvm2_1.1.ppc64
- sanlock-lib-3.2.1-1.pkvm2_1.1.ppc64
- sanlock-3.2.1-1.pkvm2_1.1.ppc64

How reproducible:
??

Steps to Reproduce:
1. make dc, cl and add a non-ppc64 host, add iscsi storage
2. remove non-ppc64 host
3. add ppc64 host into this dc/cl

Actual results:
Invalid status on Data Center iscsi. Setting status to Non Responsive.

Expected results:
Should work.

Additional info:
Well, I'm not 100% sure about the reproduction steps, because I had iscsi storage there without a host. But as I used non-ppc64 hosts before, I conclude the flow is as written above.