Bug 1787294

Summary: Improve logging in EC, client and lock xlator
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Ashish Pandey <aspandey>
Component: disperseAssignee: Ashish Pandey <aspandey>
Status: CLOSED ERRATA QA Contact: Pranav Prakash <prprakas>
Severity: high Docs Contact:
Priority: high    
Version: unspecifiedCC: bugs, nchilaka, pprakash, puebele, rhs-bugs, rkothiya, sheggodu, storage-qa-internal
Target Milestone: ---Keywords: Improvement
Target Release: RHGS 3.5.z Batch Update 3   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-6.0-38 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 1779760 Environment:
Last Closed: 2020-12-17 04:51:13 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1779760, 1906420    
Bug Blocks:    

Comment 9 Pranav Prakash 2020-11-05 11:09:51 UTC
The following scenario was performed to verify the logging improvement.

1. Create a dispersed volume
2. Mount to 2 or more clients
3. Perfom IO on the clients
4. Kill few bricks
5. Bring the bricks up
6. Verify the logs


[root@dhcp43-237 ~]# gluster v status
Status of volume: testvol_dispersed
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.43.188:/gluster/bricks/brick1/t
estvol_dispersed_brick0                     49152     0          Y       107025
Brick 10.70.43.5:/gluster/bricks/brick1/tes
tvol_dispersed_brick1                       49152     0          Y       67553
Brick 10.70.43.245:/gluster/bricks/brick1/t
estvol_dispersed_brick2                     49152     0          Y       67689
Brick 10.70.41.159:/gluster/bricks/brick1/t
estvol_dispersed_brick3                     49152     0          Y       68209
Brick 10.70.43.237:/gluster/bricks/brick1/t
estvol_dispersed_brick4                     49152     0          Y       67796
Brick 10.70.43.224:/gluster/bricks/brick1/t
estvol_dispersed_brick5                     49152     0          Y       67553
Self-heal Daemon on localhost               N/A       N/A        Y       67813
Self-heal Daemon on 10.70.43.245            N/A       N/A        Y       67706
Self-heal Daemon on 10.70.43.224            N/A       N/A        Y       67571
Self-heal Daemon on 10.70.41.159            N/A       N/A        Y       68226
Self-heal Daemon on 10.70.43.5              N/A       N/A        Y       67570
Self-heal Daemon on dhcp43-188.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       107042
 
Task Status of Volume testvol_dispersed
------------------------------------------------------------------------------
There are no active volume tasks
 

Performed IO on clients, brought bricks down on couple of nodes and then up. 

----------
OBSERVATIONS:

The logs, for e.g 'remote option failed' log in the previous versions, are now providing additional information, thus improving the logging. 
Below is one eg. 

```
[2020-11-03 05:35:32.171859] W [MSGID: 114031] [client-rpc-fops_v2.c:2635:client4_0_lookup_cbk] 0-testvol_dispersed-client-2: remote operation failed. Path: /user1/testfile1.txt (00000000-0000-0000-0000-000000000000) [Transport endpoint is not connected]

[2020-11-03 05:37:34.346233] W [MSGID: 114031] [client-rpc-fops_v2.c:2116:client4_0_create_cbk] 3-testvol_dispersed-client-2: remote operation failed. Path: /file114 [Input/output error]

```

Now additional information about which operation and why it is failed can be identified from the logs.

Similarly,
```
[2020-11-03 05:37:34.737417] W [MSGID: 122053] [ec-common.c:329:ec_check_status] 3-testvol_dispersed-disperse-0: Operation failed on 1 of 6 subvolumes.(up=111111, mask=111111, remaining=000000, good=111011, bad=000100, FOP : 'CREATE' failed on '/file116' with gfid 00000000-0000-0000-0000-000000000000)
[2020-11-03 05:37:34.741320] W [MSGID: 122053] [ec-common.c:329:ec_check_status] 3-testvol_dispersed-disperse-0: Operation failed on 1 of 6 subvolumes.(up=111111, mask=111011, remaining=000000, good=111011, bad=000100, FOP : 'FLUSH' failed on gfid 09c1992b-3190-45b0-9ab8-b39e27e12e32)
[2020-11-03 05:37:34.761148] W [MSGID: 122053] [ec-common.c:329:ec_check_status] 3-testvol_dispersed-disperse-0: Operation failed on 1 of 6 subvolumes.(up=111111, mask=111011, remaining=000000, good=111011, bad=000100, FOP : 'WRITE' failed on gfid 09c1992b-3190-45b0-9ab8-b39e27e12e32)

[2020-11-03 05:38:00.132336] W [MSGID: 122053] [ec-common.c:329:ec_check_status] 3-testvol_dispersed-disperse-0: Operation failed on 1 of 6 subvolumes.(up=101111, mask=101111, remaining=000000, good=101011, bad=000100, FOP : 'LOOKUP' failed on '/user38' with gfid 585b50e9-303c-47f4-a67c-75db7fa73878)

2020-11-03 05:38:02.168378] W [MSGID: 122033] [ec-common.c:1914:ec_locked] 3-testvol_dispersed-disperse-0: Failed to complete preop lock [Stale file handle]


```


The above observations are verified in the following glusterfs version : 
```
glusterfs-6.0-46.el8rhgs.x86_64

```
------------------------------------
Marking it as Verifed.

Comment 11 errata-xmlrpc 2020-12-17 04:51:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (glusterfs bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5603