Description of problem:
=========================
When the number of self-heal daemon (shd) threads is set to, say, 4 and a heal is in progress, the self-heal daemon consistently consumes about 150% CPU on average for as long as the heal goes on. This can put a real strain on resources. I suspect this is why the machines in my systemic testing, where heals are pending and mtsh is set to 4, are so busy with kernel hung messages that I can't even log in; I can only ping them. I know mtsh comes with a resource trade-off, but there must be a cap on it.

Version-Release number of selected component (if applicable):
===========
3.8.4-5
For example, I have attached some CPU log files. In them, 1102 is the self-heal daemon process of the source brick with the shd threads option at its default of 1; its average CPU consumption is below 100%. With the threads set to 4, however, the average CPU consumption is ~150% (refer to 3354).

leg_newlog.log ==> top output of the legacy gluster processes
newlog.log ==> top output with mtsh set to 4
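For anyone who wants to reproduce the per-process averages from logs like these, here is a minimal sketch. It assumes the logs are `top -b` batch output with the default column layout, where %CPU is the ninth field. The function name `average_cpu` and the sample rows are made up for illustration and are not taken from the attached files.

```python
from statistics import mean


def average_cpu(top_lines, pid, cpu_col=8):
    """Average the %CPU samples for one PID in `top -b` style output.

    Assumes the default top layout, where %CPU is the 9th field
    (index 8); pass a different cpu_col if the layout differs.
    """
    samples = []
    for line in top_lines:
        fields = line.split()
        # Process rows start with the numeric PID; header and
        # summary rows do not, so they are skipped here.
        if len(fields) > cpu_col and fields[0] == str(pid):
            try:
                samples.append(float(fields[cpu_col]))
            except ValueError:
                continue  # not a numeric %CPU value, ignore the row
    return mean(samples) if samples else None


# Made-up sample rows, NOT copied from the attached logs.
sample = """\
top - 10:00:01 up 1 day,  2:03,  1 user,  load average: 1.20, 1.10, 1.00
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3354 root      20   0 1234567  45678   4321 S 148.0  0.1   12:34.56 glusterfs
 3354 root      20   0 1234567  45678   4321 S 152.0  0.1   12:40.01 glusterfs
 1102 root      20   0 1234567  45678   4321 S  90.0  0.1   10:00.00 glusterfs
""".splitlines()

print(average_cpu(sample, 3354))
```

Against a real log, the same function could be called as `average_cpu(open("newlog.log"), 3354)`, provided the file really is in the assumed format.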
Created attachment 1223175: log files
Dev comment: We can't fix this bug at the moment; as per the design, raising the number of threads will definitely consume more CPU. We need to loop in the Perf team to assess the hardware recommendation for MT self-heal usage, which needs to be documented. Pranith will follow up with the Perf team. In today's triage meeting between Dev, QE and PM, a decision was reached to take this bug out of 3.2.0. More details at https://docs.google.com/spreadsheets/d/1ew4cafcvIVEWuJ4tLDuZ4ao7ZTYpsRz5NwCtQ4JVZaQ/edit#gid=0
Seeing similar behavior on physical perf machines. CPU usage shoots up to 150% (and stays there), though I could still log in to my machines, since the ones I am using are pretty high-end, with 24 cores and 48G RAM.
Mohit's patch upstream: https://review.gluster.org/#/c/18404/
This bug has been verified as part of bug 1478395. Changing status to Verified.
Have updated the doc text. Kindly review and confirm.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days