Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1108176

Summary: Sanlock issues after upgrading to 3.4
Product: [Retired] oVirt
Component: vdsm
Version: 3.4
Status: CLOSED WORKSFORME
Severity: high
Priority: unspecified
Target Release: 3.4.3
Hardware: x86_64
OS: Linux
Whiteboard: storage
oVirt Team: Storage
Type: Bug
Doc Type: Bug Fix
Reporter: Jairo Rizzo <jairo.rizzo>
Assignee: Federico Simoncelli <fsimonce>
QA Contact: Gil Klein <gklein>
CC: acathrow, amureini, bazulay, bugs, gklein, iheim, mgoldboi, yeylon
Last Closed: 2014-06-18 11:39:11 UTC
Attachments:
Logs from HV1 (main HV that runs ovirt engine): messages, engine, sanlock, server and vdsm (attachment 907681)

Description Jairo Rizzo 2014-06-11 14:02:48 UTC
Created attachment 907681
Includes the following logs from HV1 (main HV that runs ovirt engine): messages, engine, sanlock, server and vdsm

Description of problem:

I have a small two-node cluster running GlusterFS in replication mode. I had been running ovirt-engine 3.3 for months without issue; two days ago I upgraded to the latest 3.3.x release and could not join the nodes to the cluster due to a version mismatch, essentially this:
https://www.mail-archive.com/users@ovirt.org/msg17241.html . While trying to correct that problem I ended up upgrading to 3.4, which created a new and challenging problem for me. Every couple of hours I get error messages like this:

Jun  7 13:40:01 hv1 sanlock[2341]: 2014-06-07 13:40:01-0400 19647 [2341]: s3 check_our_lease warning 70 last_success 19577
Jun  7 13:40:02 hv1 sanlock[2341]: 2014-06-07 13:40:02-0400 19648 [2341]: s3 check_our_lease warning 71 last_success 19577
Jun  7 13:40:03 hv1 sanlock[2341]: 2014-06-07 13:40:03-0400 19649 [2341]: s3 check_our_lease warning 72 last_success 19577
Jun  7 13:40:04 hv1 sanlock[2341]: 2014-06-07 13:40:04-0400 19650 [2341]: s3 check_our_lease warning 73 last_success 19577
Jun  7 13:40:05 hv1 sanlock[2341]: 2014-06-07 13:40:05-0400 19651 [2341]: s3 check_our_lease warning 74 last_success 19577
Jun  7 13:40:06 hv1 sanlock[2341]: 2014-06-07 13:40:06-0400 19652 [2341]: s3 check_our_lease warning 75 last_success 19577
Jun  7 13:40:07 hv1 sanlock[2341]: 2014-06-07 13:40:07-0400 19653 [2341]: s3 check_our_lease warning 76 last_success 19577
Jun  7 13:40:08 hv1 sanlock[2341]: 2014-06-07 13:40:08-0400 19654 [2341]: s3 check_our_lease warning 77 last_success 19577
Jun  7 13:40:09 hv1 wdmd[2330]: test warning now 19654 ping 19644 close 0 renewal 19577 expire 19657 client 2341 sanlock_1e8615b0-7876-4a03-bdb0-352087fad0f3:1
Jun  7 13:40:09 hv1 wdmd[2330]: /dev/watchdog closed unclean
Jun  7 13:40:09 hv1 kernel: SoftDog: Unexpected close, not stopping watchdog!
Jun  7 13:40:09 hv1 sanlock[2341]: 2014-06-07 13:40:09-0400 19655 [2341]: s3 check_our_lease warning 78 last_success 19577
Jun  7 13:40:10 hv1 wdmd[2330]: test warning now 19655 ping 19644 close 19654 renewal 19577 expire 19657 client 2341 sanlock_1e8615b0-7876-4a03-bdb0-352087fad0f3:1
Jun  7 13:40:10 hv1 sanlock[2341]: 2014-06-07 13:40:10-0400 19656 [2341]: s3 check_our_lease warning 79 last_success 19577
Jun  7 13:40:11 hv1 wdmd[2330]: test warning now 19656 ping 19644 close 19654 renewal 19577 expire 19657 client 2341 sanlock_1e8615b0-7876-4a03-bdb0-352087fad0f3:1
Jun  7 13:40:11 hv1 sanlock[2341]: 2014-06-07 13:40:11-0400 19657 [2341]: s3 check_our_lease failed 80
Jun  7 13:40:11 hv1 sanlock[2341]: 2014-06-07 13:40:11-0400 19657 [2341]: s3 all pids clear

Jun  7 13:40:11 hv1 wdmd[2330]: /dev/watchdog reopen
Jun  7 13:41:32 hv1 sanlock[2341]: 2014-06-07 13:41:32-0400 19738 [5050]: s3 delta_renew write error -202
Jun  7 13:41:32 hv1 sanlock[2341]: 2014-06-07 13:41:32-0400 19738 [5050]: s3 renewal error -202 delta_length 140 last_success 19577
Jun  7 13:41:42 hv1 sanlock[2341]: 2014-06-07 13:41:42-0400 19748 [5050]: 1e8615b0 close_task_aio 0 0x7fd3040008c0 busy
Jun  7 13:41:52 hv1 sanlock[2341]: 2014-06-07 13:41:52-0400 19758 [5050]: 1e8615b0 close_task_aio 0 0x7fd3040008c0 busy

When this happens, one of the nodes can no longer see the storage and all of its VMs go into pause mode or stop. I would appreciate any advice. Thank you.
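For reference, the "warning N" counter in these sanlock messages is the number of seconds since the last successful lease renewal (the "last_success" timestamp): 19647 - 19577 = 70, and the lease is declared lost when the counter reaches 80 ("check_our_lease failed 80"). A minimal shell sketch for pulling that counter out of the syslog, assuming the message format shown above and the default CentOS log location:

# Print the timestamp and seconds-since-last-renewal for each sanlock
# lease warning; sanlock gives up the lease once this value reaches 80.
grep 'check_our_lease warning' /var/log/messages |
  awk '{ for (i = 1; i <= NF; i++) if ($i == "warning") print $1, $2, $3, $(i+1) }'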

Version-Release number of selected component (if applicable):
CentOS v6.5 
kernel-2.6.32-431.17.1.el6.x86_64
vdsm-4.14.6-0.el6.x86_64
glusterfs-3.5.0-2.el6.x86_64
ovirt-engine-3.4.0-1.el6.noarch  (on 1 node)

How reproducible:

The last issue occurred on Jun  9 08:20:11:
hv1 sanlock[2376]: 2014-06-09 08:20:11-0400 4286 [2376]: dead 3078 ci 3 count 7

However, this last time it only put one VM on HV1 into an 'unknown' state.

Steps to Reproduce:
There are no specific steps; just start the VMs and assign them to run on either of the hypervisors.


Additional info:
I'm attaching all logs as requested. Thank you.

Comment 1 Jairo Rizzo 2014-06-12 11:47:22 UTC
A more recent issue happened about two hours ago, on HV2:

Jun  8 04:59:51 hv2 sanlock[2163]: 2014-06-08 04:59:51-0400 40884 [2163]: s1 check_our_lease warning 71 last_success 40813
Jun  8 04:59:52 hv2 sanlock[2163]: 2014-06-08 04:59:52-0400 40885 [2163]: s1 check_our_lease warning 72 last_success 40813
Jun  8 04:59:53 hv2 sanlock[2163]: 2014-06-08 04:59:53-0400 40886 [2163]: s1 check_our_lease warning 73 last_success 40813
Jun  8 04:59:58 hv2 wdmd[2152]: test warning now 40891 ping 40881 close 0 renewal 40813 expire 40893 client 2163 sanlock_84975a07-3abb-4c9a-bf37-f64920298b0c:2
Jun  8 04:59:58 hv2 wdmd[2152]: /dev/watchdog closed unclean
Jun  8 04:59:58 hv2 kernel: SoftDog: Unexpected close, not stopping watchdog!
Jun  8 04:59:59 hv2 sanlock[2163]: 2014-06-08 04:59:59-0400 40892 [2163]: s1 check_our_lease warning 79 last_success 40813
Jun  8 04:59:59 hv2 wdmd[2152]: test warning now 40892 ping 40881 close 40891 renewal 40813 expire 40893 client 2163 sanlock_84975a07-3abb-4c9a-bf37-f64920298b0c:2
Jun  8 05:00:00 hv2 sanlock[2163]: 2014-06-08 05:00:00-0400 40893 [2163]: s1 check_our_lease failed 80
Jun  8 05:00:00 hv2 wdmd[2152]: test failed rem 58 now 40893 ping 40881 close 40891 renewal 40813 expire 40893 client 2163 sanlock_84975a07-3abb-4c9a-bf37-f64920298b0c:2
Jun  8 05:00:00 hv2 sanlock[2163]: 2014-06-08 05:00:00-0400 40893 [2163]: s1 all pids clear
Jun  8 05:00:01 hv2 wdmd[2152]: /dev/watchdog reopen
Jun  8 05:01:20 hv2 sanlock[2163]: 2014-06-08 05:01:20-0400 40973 [3038]: s1 delta_renew write error -202
Jun  8 05:01:20 hv2 sanlock[2163]: 2014-06-08 05:01:20-0400 40973 [3038]: s1 renewal error -202 delta_length 140 last_success 40813
Jun  8 05:01:30 hv2 sanlock[2163]: 2014-06-08 05:01:30-0400 40983 [3038]: 84975a07 close_task_aio 0 0x7ff5a80008c0 busy
Jun  8 05:02:20 hv2 sanlock[2163]: 2014-06-08 05:02:20-0400 41033 [3038]: 84975a07 close_task_aio 0 0x7ff5a80008c0 busy
Jun  8 05:02:27 hv2 kernel: INFO: task python:3013 blocked for more than 120 seconds.
Jun  8 05:02:27 hv2 kernel:      Not tainted 2.6.32-431.3.1.el6.x86_64 #1
Jun  8 05:02:27 hv2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  8 05:02:27 hv2 kernel: python        D 0000000000000002     0  3013   2852 0x00000084
Jun  8 05:02:27 hv2 kernel: ffff8807f62a7a78 0000000000000082 ffff8807f62a7a18 ffffffff81054839
Jun  8 05:02:27 hv2 kernel: ffff8807f62a7a08 0000000300000000 ffff8807f62a7a18 ffff88080f65dc40
Jun  8 05:02:27 hv2 kernel: ffff880812a805f8 ffff8807f62a7fd8 000000000000fbc8 ffff880812a805f8
Jun  8 05:02:27 hv2 kernel: Call Trace:
Jun  8 05:02:27 hv2 kernel: [<ffffffff81054839>] ? __wake_up_common+0x59/0x90

Comment 2 Jairo Rizzo 2014-06-16 16:51:40 UTC
I reset the GlusterFS volume options (for the two replicated volumes I have) back to the recommended values below:

Option Key                  Option Value
network.remote-dio          enable
performance.io-cache        off
cluster.quorum-type         auto
performance.stat-prefetch   off
cluster.server-quorum-type  server
performance.quick-read      off
cluster.eager-lock          enable
performance.read-ahead      off
storage.owner-uid           36
storage.owner-gid           36

The above settings work perfectly, and my cluster has been stable for the past 48 hours. I believe the issue arises when some of the performance options above are enabled, but I don't have a test lab to verify. Thanks.

--Rizzo
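
A minimal sketch of re-applying the options from comment 2, assuming a shell on one of the gluster nodes; the volume name "myvolume" is a placeholder for the actual replicated volume:

# Apply the recommended options to a volume; repeat for each
# replicated volume in the cluster.
VOLNAME=myvolume   # placeholder volume name
gluster volume set $VOLNAME network.remote-dio enable
gluster volume set $VOLNAME performance.io-cache off
gluster volume set $VOLNAME cluster.quorum-type auto
gluster volume set $VOLNAME performance.stat-prefetch off
gluster volume set $VOLNAME cluster.server-quorum-type server
gluster volume set $VOLNAME performance.quick-read off
gluster volume set $VOLNAME cluster.eager-lock enable
gluster volume set $VOLNAME performance.read-ahead off
gluster volume set $VOLNAME storage.owner-uid 36
gluster volume set $VOLNAME storage.owner-gid 36
# Verify: the applied options appear under "Options Reconfigured"
gluster volume info $VOLNAME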

Comment 3 Federico Simoncelli 2014-06-18 11:39:11 UTC
Hi Jairo, please also consider reading:

 http://www.ovirt.org/Gluster_Storage_Domain_Reference

I am closing this now; my best guess is that the previous volume configuration was overloading your kernel, as per the message:

Jun  8 05:02:27 hv2 kernel: INFO: task python:3013 blocked for more than 120 seconds.
Jun  8 05:02:27 hv2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Feel free to reopen if you hit this problem again. Thanks.