Red Hat Bugzilla – Bug 203916
groupd daemon segfault and mount hang
Last modified: 2009-04-16 18:45:10 EDT
Description of problem:
If I mount a certain number of gfs mount points on any given node
in a cluster, the mount will hang and the groupd daemon will segfault.
Steps to Reproduce:
From a fresh boot of a 5-node (smoke) cluster:
1. service cman start on all 5 nodes
2. service clvmd start on all 5 nodes
3. Mount a gfs file system on 4 out of 5 nodes.
4. Mount a gfs2 file system on the same 4 nodes.
5. On the fifth node, do five mounts of different file systems.
mount -tgfs /dev/Smoke_Cluster/Smoke_Cluster2 /mnt/SmokeCluster2/
mount -tgfs /dev/Smoke_Cluster/Smoke_Cluster3 /mnt/SmokeCluster3/
mount -tgfs /dev/Smoke_Cluster/Smoke_Cluster4 /mnt/SmokeCluster4/
mount -tgfs /dev/Smoke_Cluster/Smoke_Cluster5 /mnt/SmokeCluster5/
mount -tgfs /dev/Smoke_Cluster/Smoke_Cluster6 /mnt/SmokeCluster6/
The first four mounts work correctly. The fifth mount hangs
(but you can interrupt it) and the groupd daemon segfaults,
causing the other daemons to stop as well.
You should be able to mount more than 4 gfs mount points.
The groupd daemon was allocating two chunks of memory for node
information relating to the mounts. Each array was allocated with 16
entries. When the 17th entry was needed, only one of the arrays
was increased in size. When the second array was used, the segfault
occurred.
Created attachment 134812 [details]
Proposed patch to fix the problem
Created attachment 134920 [details]
Better patch for mount hangs
With the previous patch, mounts still hung because the newly
allocated pollfd array did not initialize its 'revents' fields. That
caused the daemon to try to process events that didn't exist, and somehow
that caused the hang. One side-effect was socket write errors in
Also, in the process of debugging this, I learned that the
gfs_controld daemon was not dynamically growing its pollfd array either.
That did have ramifications, but I don't know the full extent of them.
I do know that group_tool -v would not show you the proper list of
groups when gfs_controld did not dynamically grow its list. This
occurred when a node tried to allocate 5 or more gfs mount points.
When gfs_controld is allowed to dynamically grow its pollfd array,
the proper group list is displayed by group_tool.
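The two pollfd issues discussed in this comment (an array that does not grow, and new entries whose 'revents' is left uninitialized) can be sketched as follows. This is an illustration only, not the attached patch: the names pollfds, pollfd_size, POLL_GROW and pollfd_add are hypothetical, and the growth increment is an assumption. It relies on the documented poll() behavior that entries with a negative fd are ignored.

```c
#include <poll.h>
#include <stdlib.h>

#define POLL_GROW 16  /* hypothetical growth increment */

static struct pollfd *pollfds;  /* hypothetical: array passed to poll() */
static int pollfd_size;         /* hypothetical: number of allocated slots */

/* Register fd for POLLIN, growing the array when no slot is free.
 * Returns the slot index, or -1 if memory could not be allocated. */
static int pollfd_add(int fd)
{
    int i;

    /* Look for an unused slot (marked with fd == -1). */
    for (i = 0; i < pollfd_size; i++) {
        if (pollfds[i].fd == -1)
            break;
    }

    if (i == pollfd_size) {
        int new_size = pollfd_size + POLL_GROW;
        struct pollfd *p = realloc(pollfds, new_size * sizeof(*p));
        if (!p)
            return -1;

        /* realloc() does not clear new memory, so initialize every
         * field of every new slot, including revents. */
        for (int j = pollfd_size; j < new_size; j++) {
            p[j].fd = -1;      /* poll() ignores negative fds */
            p[j].events = 0;
            p[j].revents = 0;
        }
        pollfds = p;
        pollfd_size = new_size;
    }

    pollfds[i].fd = fd;
    pollfds[i].events = POLLIN;
    pollfds[i].revents = 0;
    return i;
}
```

With unused slots set to fd = -1 and revents = 0, a loop that inspects revents after poll() returns will not see stale values in slots that were never registered.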
This improved patch fixes both problems and I've tested it by
mounting ten GFS file systems without a problem.
I highly suspect these problems might have been causing some or
most of the problems encountered by the QE team with the tank
Our original mount hangs seem to be fixed. We'll file new bugs when we run
into new hangs.
Passing through verified to get metric correct.
Moving all RHCS ver 5 bugs to RHEL 5 so we can remove RHCS v5 which never existed.