Bug 1409568
Summary: | seeing socket disconnects and transport endpoint not connected frequently on systemic setup | | |
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Nag Pavan Chilakam <nchilaka> |
Component: | rpc | Assignee: | Milind Changire <mchangir> |
Status: | CLOSED ERRATA | QA Contact: | Nag Pavan Chilakam <nchilaka> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | rhgs-3.2 | CC: | amukherj, bkunal, mchangir, moagrawa, nbalacha, rcyriac, rgowdapp, rhs-bugs, sanandpa, sheggodu, smali, storage-qa-internal, vdas |
Target Milestone: | --- | Keywords: | ZStream |
Target Release: | RHGS 3.4.z Batch Update 4 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | rpc-ping-timeout, rebase | ||
Fixed In Version: | glusterfs-3.12.2-41 | Doc Type: | Bug Fix |
Doc Text: |
Cause: Under heavy load, ping requests are not read quickly enough to be responded to within the timeout.
Consequence: When a ping request is not answered within the timeout, the transport is disconnected.
Fix:
Two changes were made:
1. Event threads no longer execute any code of the GlusterFS program. Instead, requests are queued to be consumed by GlusterFS program threads. Since event threads execute no GlusterFS program code, they are no longer slowed down.
2. Even with queuing, under sufficiently high load the default number of event threads is not enough to read all incoming requests, so increasing the number of event threads is recommended. While verifying this bug, client.event-threads and server.event-threads were set to 8; however, this value can be experimented with. We suggest trying a value of 4 first and using 8 threads only if 4 does not resolve the issue.
Result:
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2019-03-27 03:43:36 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1657798 |
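
The queuing change described in the Doc Text above can be illustrated with a small sketch. The code below is a hypothetical model, not GlusterFS source: event threads only read and enqueue requests (answering pings immediately), while separate program threads consume the queue and do the potentially slow work, so a slow request handler can no longer stall ping handling.

```python
import queue
import threading

# Illustrative model (not GlusterFS code) of the fix: event threads only
# enqueue incoming requests; worker ("program") threads execute them.
# Because event threads never run request-handling code, they stay
# responsive and can answer ping requests within the timeout.

request_queue = queue.Queue()
results = []

def event_thread(requests):
    """Reads requests off the wire and enqueues them -- no heavy work here."""
    for req in requests:
        if req == "PING":
            results.append("PONG")   # answered immediately, never queued
        else:
            request_queue.put(req)   # deferred to a program thread

def program_thread():
    """Consumes queued requests and performs the (potentially slow) work."""
    while True:
        req = request_queue.get()
        if req is None:              # shutdown sentinel
            break
        results.append("done:" + req)
        request_queue.task_done()

workers = [threading.Thread(target=program_thread) for _ in range(2)]
for w in workers:
    w.start()

event_thread(["WRITE", "PING", "READ"])
request_queue.join()                 # wait until all queued work is finished
for _ in workers:
    request_queue.put(None)          # stop the workers
for w in workers:
    w.join()
```

The second part of the fix is applied on a real volume with tuning options, e.g. `gluster volume set sysvol client.event-threads 4` and `gluster volume set sysvol server.event-threads 4` (raising to 8 only if 4 does not resolve the disconnects); `sysvol` is the volume from this report.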
Description
Nag Pavan Chilakam
2017-01-02 13:39:12 UTC
Note that 2 bricks are down, for which separate BZs have been raised:
1409472 - brick crashed on systemic setup
1397907 - seeing frequent kernel hangs when doing operations both on fuse client and gluster nodes on replica volumes

However, I am seeing disconnects even for bricks which are not down.

```
[root@dhcp35-20 ~]# gluster v status
Status of volume: sysvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.20:/rhs/brick1/sysvol        49154     0          Y       2404
Brick 10.70.37.86:/rhs/brick1/sysvol        49154     0          Y       26789
Brick 10.70.35.156:/rhs/brick1/sysvol       N/A       N/A        N       N/A
Brick 10.70.37.154:/rhs/brick1/sysvol       49154     0          Y       2192
Brick 10.70.35.20:/rhs/brick2/sysvol        49155     0          Y       2424
Brick 10.70.37.86:/rhs/brick2/sysvol        N/A       N/A        N       N/A
Brick 10.70.35.156:/rhs/brick2/sysvol       49155     0          Y       26793
Brick 10.70.37.154:/rhs/brick2/sysvol       49155     0          Y       2212
Snapshot Daemon on localhost                49156     0          Y       2449
Self-heal Daemon on localhost               N/A       N/A        Y       3131
Quota Daemon on localhost                   N/A       N/A        Y       3139
Snapshot Daemon on 10.70.37.86              49156     0          Y       26832
Self-heal Daemon on 10.70.37.86             N/A       N/A        Y       2187
Quota Daemon on 10.70.37.86                 N/A       N/A        Y       2195
Snapshot Daemon on 10.70.35.156             49156     0          Y       26816
Self-heal Daemon on 10.70.35.156            N/A       N/A        Y       1376
Quota Daemon on 10.70.35.156                N/A       N/A        Y       1384
Snapshot Daemon on 10.70.37.154             49156     0          Y       2235
Self-heal Daemon on 10.70.37.154            N/A       N/A        Y       9321
Quota Daemon on 10.70.37.154                N/A       N/A        Y       9329

Task Status of Volume sysvol
------------------------------------------------------------------------------
There are no active volume tasks
```

```
[root@dhcp35-20 ~]# gluster v info
Volume Name: sysvol
Type: Distributed-Replicate
Volume ID: 4efd4f77-85c7-4eb9-b958-6769d31d84c8
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: 10.70.35.20:/rhs/brick1/sysvol
Brick2: 10.70.37.86:/rhs/brick1/sysvol
Brick3: 10.70.35.156:/rhs/brick1/sysvol
Brick4: 10.70.37.154:/rhs/brick1/sysvol
Brick5: 10.70.35.20:/rhs/brick2/sysvol
Brick6: 10.70.37.86:/rhs/brick2/sysvol
Brick7: 10.70.35.156:/rhs/brick2/sysvol
Brick8: 10.70.37.154:/rhs/brick2/sysvol
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
features.uss: enable
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
```

There were also hung-task traces in dmesg:

```
[85321.290723] INFO: task tee:26646 blocked for more than 120 seconds.
[85321.290762] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[85321.290796] tee             D ffff8804726fa980     0 26646   6196 0x00000080
[85321.290800]  ffff88043274fe18 0000000000000082 ffff88020f3a2280 ffff88043274ffd8
[85321.290805]  ffff88043274ffd8 ffff88043274ffd8 ffff88020f3a2280 ffff880472993200
[85321.290808]  ffff88027360b800 ffff88043274fe48 ffff8804729932e0 ffff8804726fa980
[85321.290811] Call Trace:
[85321.290822]  [<ffffffff8163a909>] schedule+0x29/0x70
[85321.290828]  [<ffffffffa086c53d>] __fuse_request_send+0x13d/0x2c0 [fuse]
[85321.290832]  [<ffffffffa086fa80>] ? fuse_get_req_nofail_nopages+0xc0/0x1e0 [fuse]
[85321.290838]  [<ffffffff81197088>] ? handle_mm_fault+0x5b8/0xf50
[85321.290842]  [<ffffffff810a6ae0>] ? wake_up_atomic_t+0x30/0x30
[85321.290845]  [<ffffffffa086c6d2>] fuse_request_send+0x12/0x20 [fuse]
[85321.290849]  [<ffffffffa087577f>] fuse_flush+0xff/0x150 [fuse]
[85321.290855]  [<ffffffff811dc254>] filp_close+0x34/0x80
[85321.290859]  [<ffffffff811fcb98>] __close_fd+0x78/0xa0
[85321.290861]  [<ffffffff811dd963>] SyS_close+0x23/0x50
[85321.290866]  [<ffffffff81645909>] system_call_fastpath+0x16/0x1b
```
(The same fuse_flush hung-task call trace repeats for tasks tee:26536, tee:29783, tee:8444, tee:17200, tee:29470, and tee:15511 between timestamps 104761 and 156841.)

```
[271499.233421] SELinux: initialized (dev binfmt_misc, type binfmt_misc), uses genfs_contexts
[271503.126098] nr_pdflush_threads exported in /proc is scheduled for removal
[271506.509137] warning: `turbostat' uses 32-bit capabilities (legacy support in use)
```

Client sosreports are available at:
scp -r /var/tmp/$HOSTNAME qe@rhsqe-repo:/var/www/html/sosreports/nchilaka/3.2_logs/systemic_testing_logs/regression_cycle/same_dir_create_clients/

Hi Nag,

I have checked the brick logs on the systemic servers and the mount logs on the clients. Below is my analysis of the disconnect messages seen on the servers.

There are 4 servers in total. Two of them show disconnect messages (for both clients and internal daemons), while the other two show no disconnect messages for clients (only for internal daemons such as quotad and glustershd).

Servers not showing any disconnect messages for clients:

```
[root@dhcp35-20 bricks]# grep -i disconnect * | grep rhs-client | grep -v "2017-04-13"
[root@dhcp35-156 log]# grep -i disconnect * | grep rhs-client | grep -v "2017-04-13"
```

Servers showing disconnect messages for clients:

```
[root@dhcp37-86 bricks]#
[root@dhcp37-154 log]#
```

On the servers showing disconnects, I observed that, compared to the servers without them, there are many watchdog messages thrown by the kernel; the minimum stuck time is 22 s and the maximum is more than 120 s.
In my opinion, if a client tries to communicate with the server while a task is stuck, it can get a ping-timer-expired message. I checked the mount logs as well: at almost the same time, the client also shows timer-expired messages. So I believe that with the latest package (glusterfs-3.8.4-22.el7rhgs.x86_64) we are not able to reproduce this issue.

Below are some of the messages thrown by the kernel:

```
Apr 16 19:22:29 dhcp37-154 kernel: NMI watchdog: BUG: soft lockup - CPU#7 stuck for 44s! [glusterfsd:17965]
Apr 18 06:02:21 dhcp37-86 kernel: NMI watchdog: BUG: soft lockup - CPU#2 stuck for 160s! [glusterfsd:8060]
Apr 18 06:02:21 dhcp37-86 kernel: NMI watchdog: BUG: soft lockup - CPU#3 stuck for 149s! [glusterfsd:21010]
```

As per your comment, you had also seen call traces for tasks stuck for more than 2 minutes on the client side, but this time there is no call trace in dmesg for any task hung for more than 2 minutes.

Thanks,
Mohit Agrawal

Upstream patch: https://review.gluster.org/#/c/17105/

Can we have a probable release candidate target assigned for this bug?

Based on the above comments (see comment #23), QE is moving this bug to VERIFIED. Test version: 3.12.2-43.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0658