Description of problem: basic recovery scenario on a healthy cluster running I/O. Two nodes are shot (morph-02 and morph-03), and the attempt to bring them back into the cluster leaves clvmd stuck in the "join" or "update" state.

morph-01:
[root@morph-01 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[5 6 2 1 3 4]

DLM Lock Space:  "clvmd"                             2   3 update    U-4,1,3
[5 1 2 6 3]

DLM Lock Space:  "corey0"                            3   4 run       -
[5 6 2 1]

DLM Lock Space:  "corey1"                            5   6 run       -
[5 6 2 1]

DLM Lock Space:  "corey2"                            7   8 run       -
[5 6 2 1]

DLM Lock Space:  "corey3"                            9  10 run       -
[5 6 2 1]

GFS Mount Group: "corey0"                            4   5 run       -
[5 6 2 1]

GFS Mount Group: "corey1"                            6   7 run       -
[5 6 2 1]

GFS Mount Group: "corey2"                            8   9 run       -
[5 6 2 1]

GFS Mount Group: "corey3"                           10  11 run       -
[5 6 2 1]

morph-02:
[root@morph-02 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[2 1 5 6 3 4]

DLM Lock Space:  "clvmd"                             2   3 join      S-6,20,5
[2 1 5 6 3]

morph-03:
[root@morph-03 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 3 5 6 4]

DLM Lock Space:  "clvmd"                             0   3 join      S-1,80,6
[]

morph-04:
[root@morph-04 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[5 6 2 1 3 4]

DLM Lock Space:  "clvmd"                             2   3 update    U-4,1,3
[5 1 2 6 3]

DLM Lock Space:  "corey0"                            3   4 run       -
[5 6 2 1]

DLM Lock Space:  "corey1"                            5   6 run       -
[5 6 2 1]

DLM Lock Space:  "corey2"                            7   8 run       -
[5 6 2 1]

DLM Lock Space:  "corey3"                            9  10 run       -
[5 6 2 1]

GFS Mount Group: "corey0"                            4   5 run       -
[5 6 2 1]

GFS Mount Group: "corey1"                            6   7 run       -
[5 6 2 1]

GFS Mount Group: "corey2"                            8   9 run       -
[5 6 2 1]

GFS Mount Group: "corey3"                           10  11 run       -
[5 6 2 1]

morph-05:
[root@morph-05 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[2 5 6 1 3 4]

DLM Lock Space:  "clvmd"                             2   3 update    U-4,1,3
[1 2 5 6 3]

DLM Lock Space:  "corey0"                            3   4 run       -
[2 5 6 1]

DLM Lock Space:  "corey1"                            5   6 run       -
[2 5 6 1]

DLM Lock Space:  "corey2"                            7   8 run       -
[2 5 6 1]

DLM Lock Space:  "corey3"                            9  10 run       -
[2 5 6 1]

GFS Mount Group: "corey0"                            4   5 run       -
[2 5 6 1]

GFS Mount Group: "corey1"                            6   7 run       -
[2 5 6 1]

GFS Mount Group: "corey2"                            8   9 run       -
[2 5 6 1]

GFS Mount Group: "corey3"                           10  11 run       -
[2 5 6 1]

morph-06:
[root@morph-06 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2 5 6 3 4]

DLM Lock Space:  "clvmd"                             2   3 update    U-4,1,3
[1 5 2 6 3]

DLM Lock Space:  "corey0"                            3   4 run       -
[1 2 5 6]

DLM Lock Space:  "corey1"                            5   6 run       -
[1 2 5 6]

DLM Lock Space:  "corey2"                            7   8 run       -
[1 2 5 6]

DLM Lock Space:  "corey3"                            9  10 run       -
[1 2 5 6]

GFS Mount Group: "corey0"                            4   5 run       -
[1 2 5 6]

GFS Mount Group: "corey1"                            6   7 run       -
[1 2 5 6]

GFS Mount Group: "corey2"                            8   9 run       -
[1 2 5 6]

GFS Mount Group: "corey3"                           10  11 run       -
[1 2 5 6]

How reproducible:
Sometimes
Reproduced again today.
I have also seen the GFS mount group services get stuck in the update/join state.

[root@morph-05 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[5 4 2 1 6 3]

DLM Lock Space:  "clvmd"                             2   3 run       -
[5 4 2 1 6 3]

DLM Lock Space:  "corey0"                            3   4 run       -
[5 4 2 1 6 3]

GFS Mount Group: "corey0"                            4   5 join      S-6,20,6
[5 4 2 1 6 3]

All others in cluster:

[root@morph-04 tmp]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[5 1 4 6 2 3]

DLM Lock Space:  "clvmd"                             2   3 run       -
[1 5 4 6 2 3]

DLM Lock Space:  "corey0"                            3   4 run       -
[5 1 4 6 2 3]

DLM Lock Space:  "corey1"                            5   6 run       -
[5 1 4 6 2]

GFS Mount Group: "corey0"                            4   5 update    U-4,1,3
[5 1 4 6 2 3]

GFS Mount Group: "corey1"                            6   7 run       -
[5 1 4 6 2]
Bumping priority because all my recovery testing eventually hits this bug and doesn't get very far.
Assigning this to Dave and Patrick, since it is probably an infrastructure problem rather than one in the client services.
I think this should be fixed by the checkin I did to cman late last week. The symptom was that if two nodes died at the same time, only one of them was reported to the DLM, so the TCP connection to the other was never severed.
Hit this exact same scenario with the latest code:

[root@morph-01 tmp]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 3 4 5 2 6]

DLM Lock Space:  "clvmd"                             2   3 update    U-4,1,6
[1 3 4 5 2 6]

DLM Lock Space:  "corey0"                            3   4 run       -
[1 3 4 5 2]

DLM Lock Space:  "corey1"                            5   6 run       -
[1 3 4 5 2]

DLM Lock Space:  "corey2"                            7   8 run       -
[1 3 4 5 2]

GFS Mount Group: "corey0"                            4   5 run       -
[1 3 4 5 2]

GFS Mount Group: "corey1"                            6   7 run       -
[1 3 4 5 2]

GFS Mount Group: "corey2"                            8   9 run       -
[1 3 4 5 2]

[root@morph-06 root]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[3 2 1 4 5 6]

DLM Lock Space:  "clvmd"                             2   3 join      S-6,20,6
[3 2 1 4 5 6]
Updating version to the right level in the defects. Sorry for the storm.
These cases where lockspace recovery wasn't happening are quite possibly an effect of the dlm_recoverd thread problems we were having. The wakeup to the recovery thread was being missed, so recovery never ran (or, later, with dynamically created threads, no new recovery thread was created to do the recovery). If that was the problem, it should now be fixed. If this is seen again, we will need more information on the state of the dlm kernel threads. I'm dismissing the comment about mount-group recovery, since that would be a different and unrelated problem.
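For anyone who hits this again, a rough sketch of how to capture that thread state on the stuck node (the grep patterns and log-dump step are illustrative, not an official procedure; the sysrq task dump needs root):

```shell
# 1. Are the dlm kernel threads (dlm_recoverd etc.) alive, and in what
#    scheduler state?  D = uninterruptible sleep, S = sleeping, R = running.
ps -eo pid,stat,wchan,comm | grep dlm || echo "no dlm threads found"

# 2. Dump kernel stacks of all tasks to the kernel log, so we can see
#    exactly where the recovery thread is blocked (requires root/sysrq).
if [ -w /proc/sysrq-trigger ]; then
    echo t > /proc/sysrq-trigger
    dmesg | grep -A 8 dlm_recoverd || true
else
    echo "need root for a sysrq task dump" >&2
fi
```

Attaching that output plus /proc/cluster/services from each node should be enough to tell whether dlm_recoverd is missing, sleeping, or wedged.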
FYI, the comment about mount-group services getting stuck is tracked as bz143269.
Hmmm, it must automatically move from NEEDINFO to ASSIGNED regardless of the info given; moving back to NEEDINFO.
This bug seemed to be specifically about recovery problems with the DLM/clvmd service, and I have not seen those in a long time. Recovery problems with any other DLM/GFS services should be documented in bug 145683.