Description of problem:
Gluster-related commands are issued too frequently, resulting in failed locking.

Version-Release number of selected component (if applicable):
RHV manager (v4.0)
vdsm-4.19.15-1.el7ev.x86_64

How reproducible:
Customer environment

Steps to Reproduce:
1. Enable management of the gluster nodes through RHV manager
2. "Locking failed" issue arises on the nodes
3. Restarting glusterd on the node resolves the issue

Actual results:
Vdsm triggers many calls per status check

Expected results:
Vdsm should trigger only one call per status check

Additional info:
Shubhendu, can you take a look at the logs to see what's causing the frequent calls to gluster volume status?
Sahina, yes, I will analyze this and update here.
Abhishek, can you check the value of `GlusterRefreshRateHeavy` in the `vdc_options` table? Ideally the value should be 300 seconds.
Also, please check whether a value is set for `vds_retries` in `vdc_options`.
Abhishek, kindly check whether task monitoring is critical in this scenario. If not, you may try setting `GlusterRefreshRateTasks` to 300 or even 600 and see if the issue still persists.
Abhishek, as discussed over IRC, you can use the commands below to check the existing value and change it:

To check the existing value:
sudo -i -u postgres psql engine -c "select * from vdc_options WHERE option_name = 'GlusterRefreshRateTasks';"

To change the value:
sudo -i -u postgres psql engine -c "update vdc_options set option_value = '600' where option_name = 'GlusterRefreshRateTasks';"
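For convenience, here is a minimal sketch that builds a single query covering all the options mentioned in this bug, so they can be checked in one pass. The helper name `build_option_query` is purely illustrative and not part of RHV; the column names `option_name` and `option_value` are taken from the commands above. The actual psql call is left commented out so nothing touches the engine database unless you choose to run it.

```shell
#!/bin/sh
# Hypothetical helper: build a SELECT for one or more vdc_options names.
build_option_query() {
    list=""
    for name in "$@"; do
        # Append each name as a quoted SQL literal, comma separated.
        list="${list:+$list, }'$name'"
    done
    printf "select option_name, option_value from vdc_options where option_name in (%s);" "$list"
}

query=$(build_option_query GlusterRefreshRateHeavy GlusterRefreshRateTasks vds_retries)
echo "$query"

# To run it on the RHV manager host, same invocation as the commands above:
# sudo -i -u postgres psql engine -c "$query"
```

The helper does no escaping, so it should only be given fixed option names like the ones listed here.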
Abhishek, also, following discussion with the gluster team, we need cmd_history from all the storage nodes and the number of volumes. Kindly share the details.
Atin, the requested cmd_history is available as an attachment. Please check and comment.
To fix the locking issue, RHV monitoring of gluster will need to change to use get-state and aggregate the information collected from each node, as suggested by Atin. This will need to be raised as an RFE in RHV. The other option to minimise the chance of this happening is to increase the polling interval (that is, poll less often) for gluster status commands in RHV. The trade-off is that the status reported in RHV may be stale.
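A rough sketch of what the aggregation could look like, not the actual RHV or vdsm design. It assumes each node dumps its local state with `gluster get-state`, that the dump contains lines of the form `Volume1.Brick1.status: Started`, and that each dump reports status only for bricks local to that node, so per-node counts can simply be summed. The function name and file paths are hypothetical; verify the dump format against your gluster version before relying on it.

```shell
#!/bin/sh
# Summarise brick status across get-state dumps collected from each node.
# Assumed line format: "Volume<N>.Brick<M>.status: <state>".
summarise_brick_status() {
    awk -F': ' '/^Volume[0-9]+\.Brick[0-9]+\.status: / { count[$2]++ }
        END { for (s in count) print s, count[s] }' "$@" | sort
}

# Example usage with hypothetical paths. Each dump would be produced on its
# own node, e.g. with:
#   gluster get-state odir /tmp file state_node1
# and then copied to the monitoring host:
# summarise_brick_status /tmp/states/state_node1 /tmp/states/state_node2
```

The output is one line per status value followed by the number of bricks in that state, which a monitor could compare against the expected brick count.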
Bipin, since we have the RHV bug tracking the request for this customer case, can we close this bug? There are no changes that can be made in vdsm to address this.