Bug 1542934

Summary: Seeing timer errors in the rebalance logs
Product: [Community] GlusterFS
Component: rpc
Version: 3.12
Status: CLOSED CURRENTRELEASE
Severity: low
Priority: low
Reporter: Mohit Agrawal <moagrawa>
Assignee: Mohit Agrawal <moagrawa>
CC: amukherj, bugs, mchangir, moagrawa, rcyriac, rhinduja, rhs-bugs, sheggodu, storage-qa-internal, tdesala
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: ---
Fixed In Version: glusterfs-3.12.6
Doc Type: If docs needed, set a value
Type: Bug
Regression: ---
Mount Type: ---
Clone Of: 1467536
Bug Blocks: 1467536, 1538427
Last Closed: 2018-03-05 07:14:45 UTC

Description Mohit Agrawal 2018-02-07 11:59:42 UTC
+++ This bug was initially created as a clone of Bug #1467536 +++

Description of problem:
=======================
After remove-brick completion, the timer errors below appear in the rebalance logs. There was no functional impact; the remove-brick rebalance completed without any failures.

[2017-07-04 05:34:58.215846] W [glusterfsd.c:1290:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7e25) [0x7f07e0176e25] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x561a54853005] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x561a54852e2b] ) 0-: received signum (15), shutting down
[2017-07-04 05:34:58.220109] I [timer.c:212:gf_timer_registry_init] (-->/lib64/libglusterfs.so.0(+0x3782d) [0x7f07e132482d] -->/lib64/libgfrpc.so.0(+0xfac3) [0x7f07e10dfac3] -->/lib64/libglusterfs.so.0(gf_timer_call_after+0x215) [0x7f07e1324b65] ) 0-timer: ctx cleanup started
[2017-07-04 05:34:58.220307] E [timer.c:44:gf_timer_call_after] (-->/lib64/libglusterfs.so.0(+0x3782d) [0x7f07e132482d] -->/lib64/libgfrpc.so.0(+0xfac3) [0x7f07e10dfac3] -->/lib64/libglusterfs.so.0(gf_timer_call_after+0x25b) [0x7f07e1324bab] ) 0-timer: !reg
[2017-07-04 05:34:58.220352] W [rpc-clnt.c:168:call_bail] 0-glusterfs: Cannot create bailout timer for 127.0.0.1:24007

Version-Release number of selected component (if applicable):
3.8.4-32.el7rhgs.x86_64

How reproducible:
Reporting the first occurrence.

Steps to Reproduce:
===================
1) Create a distributed-replicate volume and start it.
2) CIFS-mount the volume on multiple clients.
3) From client-1, start creating files and, from a different terminal, start continuous lookups.
   After some time, from client-2, start creating hardlinks.
4) Remove a few bricks and wait until the remove-brick operation completes.

After the remove-brick completes, check the rebalance logs; the timer error messages shown above appear there (a command-line sketch of these steps follows below).
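
For reference, a minimal command-line sketch of these steps, assuming placeholder host names, brick paths, Samba share name, and credentials (these are not the exact values used in this report):

# 1) Create and start a distributed-replicate (replica 2) volume; hosts/bricks are placeholders
gluster volume create distrep replica 2 \
    host1:/bricks/brick0/b0 host2:/bricks/brick0/b0 \
    host3:/bricks/brick0/b0 host4:/bricks/brick0/b0
gluster volume start distrep

# 2) CIFS-mount the volume on each client (the share name depends on the Samba configuration)
mount -t cifs -o username=<user>,password=<pass> //<samba-server>/<share> /mnt/distrep

# 3) client-1: create files and run continuous lookups; client-2: create hardlinks
#    e.g. for i in $(seq 1 10000); do touch /mnt/distrep/f$i; done

# 4) Remove one replica pair and wait for the operation to complete
gluster volume remove-brick distrep host3:/bricks/brick0/b0 host4:/bricks/brick0/b0 start
gluster volume remove-brick distrep host3:/bricks/brick0/b0 host4:/bricks/brick0/b0 status

# Then check the rebalance log (typically /var/log/glusterfs/<volname>-rebalance.log) for the timer messages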

Actual results:
===============
Timer errors appear in the rebalance logs.

Expected results:
=================
No timer error messages should appear in the rebalance logs.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-07-04 03:21:58 EDT ---

This bug is automatically being proposed for the current release of Red Hat Gluster Storage 3 under active development, by setting the release flag 'rhgs-3.3.0' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Prasad Desala on 2017-07-04 04:09:02 EDT ---

sosreports and Samba logs: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/Prasad/1467536/

Volname: distrep
cifs mounted
Enabled nl-cache settings
Clients: 10.70.41.254 --> Creating files (f$i), lookups 
10.70.42.64 -->  Creating hardlinks (fl$i)
10.70.42.21 --> Lookups

[root@dhcp43-49 tmp]# gluster v status distrep
Status of volume: distrep
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.43.49:/bricks/brick0/b0         49152     0          Y       13595
Brick 10.70.43.41:/bricks/brick0/b0         49154     0          Y       24115
Brick 10.70.43.35:/bricks/brick0/b0         49152     0          Y       27192
Brick 10.70.43.37:/bricks/brick0/b0         49152     0          Y       25812
Brick 10.70.43.31:/bricks/brick0/b0         49152     0          Y       7173 
Brick 10.70.43.49:/bricks/brick1/b1         49154     0          Y       13614
Brick 10.70.43.41:/bricks/brick1/b1         49155     0          Y       31325
Brick 10.70.43.35:/bricks/brick1/b1         49154     0          Y       2248 
Brick 10.70.43.37:/bricks/brick1/b1         49154     0          Y       793  
Brick 10.70.43.31:/bricks/brick1/b1         49153     0          Y       14261
Brick 10.70.43.49:/bricks/brick2/b2         49155     0          Y       26731
Brick 10.70.43.41:/bricks/brick2/b2         49156     0          Y       31345
Brick 10.70.43.35:/bricks/brick2/b2         49155     0          Y       2268 
Brick 10.70.43.37:/bricks/brick2/b2         49155     0          Y       813  
Brick 10.70.43.31:/bricks/brick2/b2         49155     0          Y       14281
Brick 10.70.43.49:/bricks/brick3/b3         49156     0          Y       26751
Self-heal Daemon on localhost               N/A       N/A        Y       26772
Self-heal Daemon on 10.70.43.35             N/A       N/A        Y       2510 
Self-heal Daemon on 10.70.43.31             N/A       N/A        Y       14302
Self-heal Daemon on 10.70.43.41             N/A       N/A        Y       31583
Self-heal Daemon on 10.70.43.37             N/A       N/A        Y       1153 
 
Task Status of Volume distrep
------------------------------------------------------------------------------
Task                 : Remove brick        
ID                   : 8d70a762-a0d3-41e3-a070-4e9204c41a4c
Removed bricks:     
10.70.43.31:/bricks/brick2/b2
10.70.43.49:/bricks/brick3/b3
Status               : completed           

Volume Name: distrep
Type: Distributed-Replicate
Volume ID: 540ea72f-6e16-49e1-a59d-f878e4d41ad5
Status: Started
Snapshot Count: 0
Number of Bricks: 8 x 2 = 16
Transport-type: tcp
Bricks:
Brick1: 10.70.43.49:/bricks/brick0/b0
Brick2: 10.70.43.41:/bricks/brick0/b0
Brick3: 10.70.43.35:/bricks/brick0/b0
Brick4: 10.70.43.37:/bricks/brick0/b0
Brick5: 10.70.43.31:/bricks/brick0/b0
Brick6: 10.70.43.49:/bricks/brick1/b1
Brick7: 10.70.43.41:/bricks/brick1/b1
Brick8: 10.70.43.35:/bricks/brick1/b1
Brick9: 10.70.43.37:/bricks/brick1/b1
Brick10: 10.70.43.31:/bricks/brick1/b1
Brick11: 10.70.43.49:/bricks/brick2/b2
Brick12: 10.70.43.41:/bricks/brick2/b2
Brick13: 10.70.43.35:/bricks/brick2/b2
Brick14: 10.70.43.37:/bricks/brick2/b2
Brick15: 10.70.43.31:/bricks/brick2/b2
Brick16: 10.70.43.49:/bricks/brick3/b3
Options Reconfigured:
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.nl-cache: on
storage.batch-fsync-delay-usec: 0
server.allow-insecure: on
transport.address-family: inet
nfs.disable: on

--- Additional comment from Prasad Desala on 2017-07-06 02:28:48 EDT ---

I didn't see any functional impact here, so I am not proposing this bug as a blocker for 3.3. Both DEV and QE agreed to this.

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-07-06 08:47:18 EDT ---

This BZ was considered but not approved to be fixed in the RHGS 3.3.0 release, so it is being proposed for the next minor release of RHGS.

--- Additional comment from Nithya Balachandran on 2017-08-18 05:41:19 EDT ---

I think this is related to the rpc_clnt_ping_timer. I started a rebalance and set a breakpoint in gf_timer_call_after.

The only traces I saw were:
Breakpoint 3, gf_timer_call_after (ctx=0xa62010, delta=..., callbk=0x7f4ee97996a5 <call_bail>, data=0x7f4ed8026030) at timer.c:28
28	        gf_timer_registry_t *reg = NULL;


Breakpoint 3, gf_timer_call_after (ctx=0xa62010, delta=..., callbk=0x7f4ee97a17fc <rpc_clnt_start_ping>, data=0x7f4ed8021ce0) at timer.c:28
28	        gf_timer_registry_t *reg = NULL;


Breakpoint 3, gf_timer_call_after (ctx=0xa62010, delta=..., callbk=0x7f4ee97a108c <rpc_clnt_ping_timer_expired>, data=0x7f4ed8021ce0) at timer.c:28
28	        gf_timer_registry_t *reg = NULL;



I do not think this is a problem, but I am changing the component to RPC so that the RPC developers can confirm.
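
For context on what these breakpoint hits correspond to: callers such as the RPC client arm their ping/bail timers through gf_timer_call_after() and, if no timer can be created, log a warning like the "Cannot create bailout timer" line in the description. Below is a simplified, self-contained C model of that caller-side pattern; apart from the names gf_timer_call_after and call_bail appearing in the traces above, every type and identifier here is a stand-in for illustration, not the real GlusterFS code.

/* Simplified model of how an rpc-clnt-style caller arms a bailout timer.
 * The real GlusterFS code lives in rpc-clnt.c / timer.c; the types and
 * names below are placeholders. */
#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>

typedef void (*timer_cbk_t)(void *data);

/* Stand-in for glusterfs_ctx_t: only the flag relevant to this bug. */
typedef struct {
    bool cleanup_started;
} ctx_t;

typedef struct {
    timer_cbk_t callbk;
    void *data;
} timer_model_t;

/* Stand-in for gf_timer_call_after(): returns NULL once cleanup has started,
 * because the timer registry refuses to hand out new timers at that point
 * (this is where the "ctx cleanup started" / "!reg" messages come from). */
static timer_model_t *
timer_call_after(ctx_t *ctx, timer_cbk_t callbk, void *data)
{
    if (ctx->cleanup_started)
        return NULL;

    timer_model_t *t = calloc(1, sizeof(*t));
    if (t) {
        t->callbk = callbk;
        t->data = data;
    }
    return t;
}

static void
call_bail(void *data)
{
    /* In the real code this bails out RPC calls that exceeded their timeout. */
    (void)data;
}

int
main(void)
{
    ctx_t ctx = { .cleanup_started = true }; /* process is shutting down */

    /* Caller side, analogous to rpc-clnt re-arming its bailout timer: */
    if (timer_call_after(&ctx, call_bail, NULL) == NULL)
        fprintf(stderr, "W: Cannot create bailout timer\n"); /* cf. the rpc-clnt.c warning */

    return 0;
}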

--- Additional comment from Milind Changire on 2017-08-19 00:32:23 EDT ---

The log with " E !reg" is not serious since it comes after "cleanup started".

The gf_timer_registry_init() function finds the glusterfs_ctx_t flag "cleanup_started" set to true and prints a message flagged as "E". Since gf_timer_registry_init() is called from various contexts, including the cleanup path, it prints the message as an error.

Maybe we should change the log level to INFO when the message is printed from the cleanup path, implying that the situation is expected.
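
A minimal sketch of that suggestion, assuming a simplified context structure: it only illustrates the idea of demoting the log level during cleanup, it is not the patch that was eventually merged, and all names here are placeholders except the cleanup_started flag mentioned above.

/* Illustration only: pick the log level for the "timer registry unavailable"
 * message based on whether process cleanup has already started. */
#include <stdio.h>
#include <stdbool.h>

enum msg_level { MSG_INFO, MSG_ERROR };

struct ctx {
    bool cleanup_started;   /* corresponds to glusterfs_ctx_t.cleanup_started */
};

static void
log_no_timer_registry(struct ctx *ctx)
{
    /* During cleanup the missing registry is expected, so log it as INFO;
     * in any other context it is a real error. */
    enum msg_level lvl = ctx->cleanup_started ? MSG_INFO : MSG_ERROR;

    fprintf(stderr, "%s: timer registry not available (!reg)\n",
            lvl == MSG_INFO ? "I" : "E");
}

int
main(void)
{
    struct ctx shutting_down = { .cleanup_started = true };
    struct ctx running       = { .cleanup_started = false };

    log_no_timer_registry(&shutting_down);  /* prints "I: ..." */
    log_no_timer_registry(&running);        /* prints "E: ..." */
    return 0;
}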

--- Additional comment from Atin Mukherjee on 2017-10-17 10:27:02 EDT ---

Is this on track for RHGS 3.4.0?

--- Additional comment from Mohit Agrawal on 2017-10-18 05:55:40 EDT ---

Yes, we are targeting it for 3.4.0.


Regards
Mohit Agrawal

--- Additional comment from Red Hat Bugzilla Rules Engine on 2017-11-13 10:35:36 EST ---

Since this bug has been approved for the RHGS 3.4.0 release of Red Hat Gluster Storage 3, through release flag 'rhgs-3.4.0+' and through the Internal Whiteboard entry of '3.4.0', the Target Release is being automatically set to 'RHGS 3.4.0'.

--- Additional comment from Mohit Agrawal on 2018-01-24 23:44:12 EST ---

Posted an upstream patch for review

https://review.gluster.org/#/c/19320/1

Comment 1 Worker Ant 2018-02-07 12:09:42 UTC
REVIEW: https://review.gluster.org/19520 (rpc: Showing some unusual timer error logs during brick stop) posted (#1) for review on release-3.12 by MOHIT AGRAWAL

Comment 2 Worker Ant 2018-02-12 10:11:51 UTC
COMMIT: https://review.gluster.org/19520 committed in release-3.12 by "jiffin tony Thottan" <jthottan> with a commit message- rpc: Showing some unusual timer error logs during brick stop

Solution: Update msg condition in gf_timer_call_after function
          to avoid the message

> BUG: 1538427
> Change-Id: I849e8e052a8259cf977fd5e7ff3aeba52f9b5f27
> Signed-off-by: Mohit Agrawal <moagrawa>
> (cherry picked from commit c142d26e44436d805e476f2d13ac8726052a59c4)

BUG: 1542934
Change-Id: I849e8e052a8259cf977fd5e7ff3aeba52f9b5f27
Signed-off-by: Mohit Agrawal <moagrawa>

Comment 3 Jiffin 2018-03-05 07:14:45 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.12.6, please open a new bug report.

glusterfs-3.12.6 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2018-February/033552.html
[2] https://www.gluster.org/pipermail/gluster-users/