Bug 1140437
| Summary: | spmProtects fences SPM if fails to renew ISO domain leases file during Attach ISO Domain flow | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Marina Kalinin <mkalinin> | ||||
| Component: | vdsm | Assignee: | Federico Simoncelli <fsimonce> | ||||
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Aharon Canan <acanan> | ||||
| Severity: | urgent | Docs Contact: | |||||
| Priority: | urgent | ||||||
| Version: | 3.3.0 | CC: | bazulay, danken, ecohen, fsimonce, gklein, iheim, laravot, lpeer, scohen, tnisan, yeylon | ||||
| Target Milestone: | --- | Keywords: | Triaged | ||||
| Target Release: | 3.5.0 | ||||||
| Hardware: | All | ||||||
| OS: | Linux | ||||||
| Whiteboard: | storage | ||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2014-10-24 17:18:55 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
Description
Marina Kalinin
2014-09-11 01:23:35 UTC
My suggestion is to remove the fence command from the spmprotect logic, since there is no reason to fence vdsm or a host if the ISO domain encounters a failure. The ISO domain is not critical and should not endanger the whole environment.
From /usr/libexec/vdsm/spmprotect.sh
~~~
function start_renewal_loop() {
local renewed curr i tl TPID
while true ; do
curr=`date +%s`
debug "last renewal = $LAST_RENEWAL, curr = $curr"
tl=$((LEASE_TIME_MS/1000-(curr*1000000-LAST_RENEWAL)/1000000))
if [ "$tl" -gt "0" ] ; then
(sleep $tl && fence) 2>/dev/null &
disown
TPID=$!
else
fence
fi
~~~
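To make the `tl` arithmetic in the excerpt easier to follow, here is a minimal sketch that isolates it. The helper name `time_left` and its positional arguments are hypothetical and do not exist in spmprotect.sh; the lease time (5000 ms) and the renewal timestamp (1409147713666790, in microseconds) are taken from the spm-lock.log quoted in this bug, while the two "current time" values are chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical helper, not part of spmprotect.sh: it repeats the excerpt's
# integer arithmetic so the result can be checked by hand.
#   $1 = lease time in milliseconds          (LEASE_TIME_MS in the excerpt)
#   $2 = last renewal, microseconds of epoch (LAST_RENEWAL in the excerpt)
#   $3 = current time, whole seconds of epoch (curr in the excerpt)
time_left() {
    echo $(( $1 / 1000 - ($3 * 1000000 - $2) / 1000000 ))
}

# Two seconds after the logged TS: elapsed time truncates to 1 s, so 4 s remain
# and the excerpt would schedule "sleep 4 && fence" in the background.
time_left 5000 1409147713666790 1409147715   # prints 4

# Six seconds after the logged TS: elapsed time truncates to 5 s, so tl is 0,
# the "-gt 0" test fails and the excerpt would call fence immediately.
time_left 5000 1409147713666790 1409147719   # prints 0
```

Because the division is integer division, sub-second remainders are dropped, so with a 5000 ms lease the window between a successful acquire and an immediate fence is only a few seconds wide.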
Copying here Federico's comment from another bug, which I believe was meant for this bug:
~~~
This happens only when "acquire" succeeds and "renew" fails.
The positive flow is quite fast usually (few seconds) but if the iso domains become unreachable between acquire and renew then the host is fenced.
So if you try to attach an iso that is not reachable you usually don't get fenced (unless as I said the first write is successful).
~~~

Reading the comments above, what we see in the log and what the code is supposed to do do not seem to match. Let me copy the spm-lock.log here again:
~~~
[2014-08-27 13:55:13] Protecting spm lock for vdsm pid 71254
[2014-08-27 13:55:13] Trying to acquire lease - sdUUID=50885a30-5bec-4b34-98ed-ed3c3306bad0 lease_file=/rhev/data-center/mnt/10.10.10.147:_var_lib_exports_iso/50885a30-5bec-4b34-98ed-ed3c3306bad0/dom_md/leases id=3 lease_time_ms=5000 io_op_to_ms=1000
[2014-08-27 13:55:15] Lease acquired sdUUID=50885a30-5bec-4b34-98ed-ed3c3306bad0 id=3 lease_path=/rhev/data-center/mnt/10.10.10.147:_var_lib_exports_iso/50885a30-5bec-4b34-98ed-ed3c3306bad0/dom_md/leases, TS=1409147713666790
[2014-08-27 13:55:15] Protecting spm lock for vdsm pid 71254
[2014-08-27 13:55:15] Started renewal process (pid=21899) for sdUUID=50885a30-5bec-4b34-98ed-ed3c3306bad0 id=3 lease_path=/rhev/data-center/mnt/10.10.10.147:_var_lib_exports_iso/50885a30-5bec-4b34-98ed-ed3c3306bad0/dom_md/leases
---> trying to fence 6 seconds later
[2014-08-27 13:55:21] Fencing sdUUID=50885a30-5bec-4b34-98ed-ed3c3306bad0 id=3 lease_path=/rhev/data-center/mnt/10.10.10.147:_var_lib_exports_iso/50885a30-5bec-4b34-98ed-ed3c3306bad0/dom_md/leases
/bin/kill -USR1 71254
---> 2 seconds later - another attempt to fence
[2014-08-27 13:55:23] Fencing sdUUID=50885a30-5bec-4b34-98ed-ed3c3306bad0 id=3 lease_path=/rhev/data-center/mnt/10.10.10.147:_var_lib_exports_iso/50885a30-5bec-4b34-98ed-ed3c3306bad0/dom_md/leases
/bin/kill -USR1 71254
---> 13 seconds later from initial attempt to renew - not 20 seconds.
[2014-08-27 13:55:28] Trying to stop vdsm for sdUUID=50885a30-5bec-4b34-98ed-ed3c3306bad0 id=3 lease_path=/rhev/data-center/mnt/10.10.10.147:_var_lib_exports_iso/50885a30-5bec-4b34-98ed-ed3c3306bad0/dom_md/leases
/bin/kill 71254
/bin/kill -9 71254
[2014-08-27 13:55:30] Trying to stop vdsm for sdUUID=50885a30-5bec-4b34-98ed-ed3c3306bad0 id=3 lease_path=/rhev/data-center/mnt/10.10.10.147:_var_lib_exports_iso/50885a30-5bec-4b34-98ed-ed3c3306bad0/dom_md/leases
/bin/kill 71254
kill 71254: No such process
/bin/kill -9 71254
kill 71254: No such process
sudo /sbin/reboot -f
sudo /sbin/reboot -f
~~~

And yes, I understand that removing the fence from the logic completely is wrong for the case when we have Data storage domains of type V1, i.e. an NFS storage DC. However, the logic of spmProtect should be changed to distinguish between Data domains and non-Data domains.

As discussed with Fede:
1. This is a very rare condition, happening when we succeed in acquiring the lease but fail to renew it right away. Cannot reproduce locally.
2. Ask the customer to reproduce, since it is not reproducible locally. Test first whether we can remove the leases file and create a new one from scratch. If that works, suggest that the customer do the same on their site.
3. Should we change the code to differentiate between an ISO domain and a Data domain, and prevent vdsm fencing in the first case? Not really. If vdsm acquired the lease, it needs to release it so that another host is able to acquire it. But if something went wrong with vdsm and it cannot reach the leases file, it would not be able to release it either. However, we should consider not rebooting the host if we succeeded in restarting vdsmd.

Created attachment 940592
leases file in the correct format

Update: cannot remove the leases file and use an empty one. The file should be in a specific format.

Clearing my needinfo, as Federico has explained the matter.

Federico, given the input from Marina, do we have any news?

Just to let you know, the customer hasn't tried attaching the ISO domain again yet, since he didn't want to endanger his production environment. The next step we agreed on would be trying to attach the same ISO domain to a different Data Center.

(In reply to Tal Nisan from comment #15)
> Federico, given the input from Marina, do we have any news?

Status update is in comment 16, clearing needinfo.