Description of problem: ======================= on a 16 node setup, using 3 clients to run IO (dd, linux untar and mkdir, one from each client) and within a day one of the client is hitting throttling issue. rpc outstanding requests by default is 16. volume info : [root@rhs-client17 ~]# gluster v info ec_tier Volume Name: ec_tier Type: Tier Volume ID: 84855431-e6cf-41e9-9cfc-7a735f2685ed Status: Started Number of Bricks: 44 Transport-type: tcp Hot Tier : Hot Tier Type : Distributed-Replicate Number of Bricks: 4 x 2 = 8 Brick1: 10.70.35.77:/rhs/brick4/ec-ht8 Brick2: 10.70.35.191:/rhs/brick4/ec-ht7 Brick3: 10.70.35.202:/rhs/brick4/ec-ht6 Brick4: 10.70.35.49:/rhs/brick4/ec-ht5 Brick5: 10.70.36.41:/rhs/brick4/ec-ht4 Brick6: 10.70.35.196:/rhs/brick4/ec-ht3 Brick7: 10.70.35.38:/rhs/brick4/ec-ht2 Brick8: dhcp35-153.lab.eng.blr.redhat.com:/rhs/brick4/ec-ht1 Cold Tier: Cold Tier Type : Distributed-Disperse Number of Bricks: 3 x (8 + 4) = 36 Brick9: dhcp35-153.lab.eng.blr.redhat.com:/rhs/brick1/ec Brick10: 10.70.35.38:/rhs/brick1/ec Brick11: 10.70.35.196:/rhs/brick1/ec Brick12: 10.70.36.41:/rhs/brick1/ec Brick13: 10.70.35.49:/rhs/brick1/ec Brick14: 10.70.35.202:/rhs/brick1/ec Brick15: 10.70.35.191:/rhs/brick1/ec Brick16: 10.70.35.77:/rhs/brick1/ec Brick17: 10.70.35.98:/rhs/brick1/ec Brick18: 10.70.35.132:/rhs/brick1/ec Brick19: 10.70.35.35:/rhs/brick1/ec Brick20: 10.70.35.51:/rhs/brick1/ec Brick21: 10.70.35.138:/rhs/brick1/ec Brick22: 10.70.35.122:/rhs/brick1/ec Brick23: 10.70.36.43:/rhs/brick1/ec Brick24: 10.70.36.42:/rhs/brick1/ec Brick25: dhcp35-153.lab.eng.blr.redhat.com:/rhs/brick2/ec Brick26: 10.70.35.38:/rhs/brick2/ec Brick27: 10.70.35.196:/rhs/brick2/ec Brick28: 10.70.36.41:/rhs/brick2/ec Brick29: 10.70.35.49:/rhs/brick2/ec Brick30: 10.70.35.202:/rhs/brick2/ec Brick31: 10.70.35.191:/rhs/brick2/ec Brick32: 10.70.35.77:/rhs/brick2/ec Brick33: 10.70.35.98:/rhs/brick2/ec Brick34: 10.70.35.132:/rhs/brick2/ec Brick35: 10.70.35.35:/rhs/brick2/ec Brick36: 10.70.35.51:/rhs/brick2/ec Brick37: 10.70.35.138:/rhs/brick2/ec Brick38: 10.70.35.122:/rhs/brick2/ec Brick39: 10.70.36.43:/rhs/brick2/ec Brick40: 10.70.36.42:/rhs/brick2/ec Brick41: dhcp35-153.lab.eng.blr.redhat.com:/rhs/brick3/ec Brick42: 10.70.35.38:/rhs/brick3/ec Brick43: 10.70.35.196:/rhs/brick3/ec Brick44: 10.70.36.41:/rhs/brick3/ec Options Reconfigured: performance.readdir-ahead: on features.uss: on features.quota: on features.inode-quota: on features.quota-deem-statfs: on features.ctr-enabled: on cluster.tier-mode: cache features.barrier: disable features.bitrot: off features.scrub: Inactive nfs.outstanding-rpc-limit: 16 features.scrub-freq: hourly [root@rhs-client17 ~]# Version-Release number of selected component (if applicable): ============================================================ 3.7.5-18 How reproducible: ================= 100% Steps to Reproduce: 1. Create an EC volume (8+4) and attach a dist-rep tier volume 2. NFS mount on the client and run the IO.(dd and linux untar) Actual results: ============== client mount hang Expected results: ================ IO should not hang Additional info:
This issue doesn't seem to exactly relate to throttling but most likely arises only when it is enabled. Have requested QE to confirm that by reproducing this issue with throttling limit set to 'zero'.
I tried 2 scenarios setting throttling to 0. 1. Serial IO from 2 clients (dd and linux untar) The memory of nfs process has shot up to 7GB (in a 16GB node) within 30 mins and stays there though there's not much requests from clients. I am seeing NFS server not responding, still trying and OK messages. 2. Parallel IO from 1 client (dd and linux untar) The memory consumption has gone up to 2GB and is constant at it. There's no crash seen though in both cases but likely to be hit if the IO is smooth from clients.
it confirms that it is not related to throttling issue. Will take pkt trace and statedump of the processes and update further.
The work-around for this issue is to re-start gluster-NFS server.