Description of problem (please be as detailed as possible and provide log snippets):

Steps are missing from this doc for replacing failed drives: https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/3.11/html/operations_guide/chap-Documentation-Red_Hat_Gluster_Storage_Container_Native_with_OpenShift_Platform-Managing_Clusters#Replacing_Device

1) Step #5: if the migration is complete, correct the device entry in heketi by passing the '--force-forget' option to the 'heketi-cli device delete' command. Note that this can be dangerous if the failed device's data (bricks) has not yet been migrated to the other devices on the node, so it must be used with care.

2) performance.read-ahead must be disabled so the heal can complete:
   gluster volume set VOLUME performance.read-ahead off

3) Extra shd's (self-heal daemons) must be started if more than 100,000 volumes require healing. This KBase article should be included or linked in the docs: https://access.redhat.com/solutions/3794011

(See the command sketch for steps 1 and 2 at the end of this report.)

Version of all relevant components (if applicable):
3.4.11

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes.
Without #1 the heal never starts.
Without #2 the heal starts but halts.
Without #3 the heal progresses at a very slow rate and can take days to complete.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
5
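Command sketch for steps 1 and 2 (for reference only): the device ID and volume name below are placeholders, and the 'heketi-cli topology info' and 'gluster volume heal ... info' lines are added here only to illustrate how to look up the device and monitor the heal; they are not part of the proposed doc text. The '--force-forget' option must only be used after confirming the failed device's bricks have been migrated.

   # Step 1: remove the failed device entry from heketi once migration is complete.
   heketi-cli topology info                            # find <DEVICE_ID> under the affected node
   heketi-cli device delete <DEVICE_ID> --force-forget

   # Step 2: disable read-ahead on the affected volume so the heal can run to completion,
   # then monitor the remaining heal entries.
   gluster volume set <VOLUME_NAME> performance.read-ahead off
   gluster volume heal <VOLUME_NAME> info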
Hi, I have created a draft of the content to be added. Could you review it and let me know if any changes are needed? Link to the doc: https://docs.google.com/document/d/1hbG-B-7WDpv4_yNil9qHyHn-jyPMlia_TA-T0UDkoP8/edit Thank you! -Disha Walvekar
Disha - let's see if Yaniv and Anton can add more clarification to the "warning" section. Thanks, Dan