Bug 1588408

Summary: Fops are sent to glusterd and uninitialized brick stack when client reconnects to brick
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: protocol
Version: rhgs-3.4
Target Release: RHGS 3.4.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Reporter: Raghavendra G <rgowdapp>
Assignee: Raghavendra G <rgowdapp>
QA Contact: Rajesh Madaka <rmadaka>
CC: amukherj, rgowdapp, rhs-bugs, rkavunga, rmadaka, sankarshan, storage-qa-internal, vdas
Fixed In Version: glusterfs-3.12.2-13
Last Closed: 2018-09-04 06:49:14 UTC
Type: Bug
Bug Blocks: 1503137

Description Raghavendra G 2018-06-07 09:07:56 UTC
Description of problem:
We've earlier seen messages in glusterd like:

https://bugzilla.redhat.com/show_bug.cgi?id=1584581#c12

> Glusterd logs:
> [2018-05-27 08:08:02.530619] W [rpcsvc.c:265:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330) for 10.75.149.13:1020
> [2018-05-27 08:08:02.547670] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
> [2018-05-27 08:08:02.548126] W [rpcsvc.c:265:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330) for 10.75.149.13:1020
> [2018-05-27 08:08:02.548140] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
> [2018-05-27 08:08:02.548209] W [rpcsvc.c:265:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330) for 10.75.149.13:1020
> 
> Note the program number - 1298437 - corresponds to Glusterfs Fop program.
> Question is why are fops sent to Glusterd? They should only go to bricks.

which is a duplicate of bz 1583937.
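
For context: glusterd does not register the GlusterFS FOP program, so when a request carrying program/version 1298437/330 (the pair seen in the log above) reaches it, the server-side program lookup finds no handler and the request is rejected with the warning shown. Below is a simplified, illustrative sketch of such a lookup; it is not the actual rpcsvc.c code, and the registered program numbers other than 1298437/330 are placeholders.

#include <stddef.h>
#include <stdio.h>

/* Simplified sketch of a server-side RPC program table lookup.
 * This is NOT the actual rpcsvc.c implementation; it only
 * illustrates why a fop request (program 1298437, version 330,
 * taken from the log above) is rejected by glusterd. */
typedef struct {
    const char *name;
    int prognum;
    int progver;
} rpc_program_t;

/* Hypothetical registration table for a glusterd-like daemon;
 * the numbers below are placeholders, not real program IDs. */
static const rpc_program_t registered[] = {
    { "Gluster Handshake",   1000001, 1 },
    { "GlusterD Management", 1000002, 1 },
};

static const rpc_program_t *
find_program(int prognum, int progver)
{
    for (size_t i = 0; i < sizeof(registered) / sizeof(registered[0]); i++) {
        if (registered[i].prognum == prognum &&
            registered[i].progver == progver)
            return &registered[i];
    }
    return NULL; /* caller logs "RPC program not available" and replies with an error */
}

int main(void)
{
    /* A fop request mistakenly sent to glusterd. */
    if (find_program(1298437, 330) == NULL)
        printf("W: RPC program not available (req 1298437 330)\n");
    return 0;
}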

Bz 1583937 is caused by the client setting its "connected" flag to true even before the handshake with the brick is complete. This means:

1. Fops can be sent to glusterd, because a client's connection to a brick is a two-step process: it first connects to glusterd to fetch the brick's port, and only then connects to the brick itself. This is the scenario seen in https://bugzilla.redhat.com/show_bug.cgi?id=1584581#c12

2. Fops can be sent to a brick whose stack is not yet initialized, causing crashes like bz 1503137 (see the sketch after this list).
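
A minimal sketch of the guard that addresses both cases; the field and function names are hypothetical, not the actual protocol/client code, and the real change is the upstream patch referenced below. The idea is that fops are dispatched only once the brick handshake has completed, not merely once the transport connects.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified illustration of the idea behind the fix:
 * treat the connection as usable for fops only after the brick
 * handshake (SETVOLUME) reply has been processed, not merely after
 * the TCP connect. Names and fields are illustrative, not the
 * actual protocol/client structures. */
typedef struct {
    bool tcp_connected;   /* transport-level connect finished       */
    bool handshake_done;  /* handshake reply from the brick handled */
} conn_state_t;

static bool
can_submit_fop(const conn_state_t *s)
{
    /* Until both hold, fops must be queued or failed with ENOTCONN;
     * otherwise they can land on glusterd (during the port-lookup
     * step) or on a brick whose stack is not yet initialized. */
    return s->tcp_connected && s->handshake_done;
}

int main(void)
{
    conn_state_t after_connect   = { .tcp_connected = true, .handshake_done = false };
    conn_state_t after_handshake = { .tcp_connected = true, .handshake_done = true  };

    printf("submit right after connect?  %s\n",
           can_submit_fop(&after_connect) ? "yes" : "no");
    printf("submit after the handshake?  %s\n",
           can_submit_fop(&after_handshake) ? "yes" : "no");
    return 0;
}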

A fix has been merged upstream:
https://review.gluster.org/20101

We need to take this to the downstream rhgs-3.4.0 branch as it is a long-standing issue.

Comment 6 Raghavendra G 2018-06-15 06:16:54 UTC
(In reply to Raghavendra G from comment #0)
> 
> 2. Fops can be sent to a brick whose stack is not yet initialized, causing
> crashes like bz 1503137.

bz 1520374 and bz 1583937

Comment 8 Rajesh Madaka 2018-08-21 07:38:54 UTC
As suggested by dev, I have followed the steps from bz 1583937.


After upgrading from RHGS-3.3.1 (RHEL-7.4) to RHGS-3.4 (RHEL-7.5), the bricks on the upgraded node went offline for most of the volumes.

The sosreport has been copied to the location below:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/rajesh/1588408/

Comment 9 Raghavendra G 2018-08-21 07:49:30 UTC
(In reply to Rajesh Madaka from comment #8)
> As suggested by dev, I have followed the steps from bz 1583937.

Did the bricks crash? Are cores included in the sosreport?

> 
> After upgrading from RHGS-3.3.1 (RHEL-7.4) to RHGS-3.4 (RHEL-7.5), the
> bricks on the upgraded node went offline for most of the volumes.
> 
> The sosreport has been copied to the location below:
> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/rajesh/1588408/

Comment 10 Rajesh Madaka 2018-08-21 09:04:28 UTC
No cores were generated. I don't think it is a brick crash; the bricks just didn't come online.

Comment 11 Raghavendra G 2018-08-23 06:35:57 UTC
(In reply to Rajesh Madaka from comment #10)
> No cores were generated. I don't think it is a brick crash; the bricks just
> didn't come online.

Can you explain what you mean by the bricks not coming online? How were you observing the bricks: through gluster volume status, through a client connecting to the brick, by not seeing the brick process, etc.?

Comment 12 Rajesh Madaka 2018-08-23 07:09:44 UTC
I am observing brick status through gluster volume status. Most of the bricks on the upgraded node show their status as N/A.

Comment 13 Atin Mukherjee 2018-08-23 11:20:45 UTC
Based on the discussion with QE, moving this BZ to ON_QA again.

Comment 14 Atin Mukherjee 2018-08-23 13:27:53 UTC
Just to clarify why this BZ has been moved to ON_QA: the bricks not coming up has no relation to the fix that this bug brings in.

Comment 15 Rajesh Madaka 2018-08-23 14:16:34 UTC
Can you please provide steps to verify this bug?

Comment 16 Raghavendra G 2018-08-24 04:31:28 UTC
(In reply to Rajesh Madaka from comment #15)
> Can you please provide steps to verify this bug?

I think that if you don't see a brick crash, the bug can be marked as verified. As we discussed on chat, clients are able to connect to bricks and the mount is successful. Bricks not being shown as online in gluster v status might be a different bug.

Comment 17 Rajesh Madaka 2018-08-24 11:51:48 UTC
I have verified this bug with the two scenarios below.

First scenario:

I followed the steps mentioned in bz 1583937.

I did not find any brick crashes or mount point disconnections, but the bricks went offline; I will be raising a different bug for that.

Gluster-build version: glusterfs-fuse-3.12.2-16

Second scenario:

-> Created a 3-node cluster
-> Created a volume
-> Mounted the volume on a client
-> Rebooted one of the gluster nodes

I did not find any brick crashes or mount disconnections.

Moving this bug to verified state.

Gluster-build version: glusterfs-fuse-3.12.2-17

Comment 18 errata-xmlrpc 2018-09-04 06:49:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607