Bug 203916

Summary: groupd daemon segfault and mount hang
Product: Red Hat Enterprise Linux 5
Reporter: Robert Peterson <rpeterso>
Component: cman
Assignee: Robert Peterson <rpeterso>
Status: CLOSED CURRENTRELEASE
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Priority: medium
Version: 5.0
CC: cluster-maint, teigland
Target Milestone: ---
Target Release: ---
Keywords: Reopened
Hardware: All
OS: Linux
Fixed In Version: 5.0.0
Doc Type: Bug Fix
Last Closed: 2006-10-11 14:04:24 UTC
Attachments:
- Proposed patch to fix the problem
- Better patch for mount hangs

Description Robert Peterson 2006-08-24 14:03:15 UTC
Description of problem:
If I mount more than four gfs file systems on any given node
in a cluster, the extra mount hangs and the groupd daemon segfaults.

How reproducible:
Always

Steps to Reproduce:
From a fresh boot of a 5-node (smoke) cluster:
1. service cman start on all 5 nodes
2. service clvmd start on all 5 nodes
3. Mount a gfs file system on 4 out of 5 nodes.
4. Mount a gfs2 file system on the same 4 nodes.
5. On the fifth node, do five mounts of different file systems.

mount -tgfs /dev/Smoke_Cluster/Smoke_Cluster2 /mnt/SmokeCluster2/
mount -tgfs /dev/Smoke_Cluster/Smoke_Cluster3 /mnt/SmokeCluster3/
mount -tgfs /dev/Smoke_Cluster/Smoke_Cluster4 /mnt/SmokeCluster4/
mount -tgfs /dev/Smoke_Cluster/Smoke_Cluster5 /mnt/SmokeCluster5/
mount -tgfs /dev/Smoke_Cluster/Smoke_Cluster6 /mnt/SmokeCluster6/
  
Actual results:
The first four mounts work correctly.  The fifth mount hangs 
(but you can interrupt it) and the groupd daemon segfaults, 
causing the other daemons to stop as well.

Expected results:
You should be able to mount more than four gfs file systems per node.

Additional info:
The groupd daemon allocates two parallel arrays holding node
information for the mounts, each initially sized at 16 entries.
When the 17th entry was needed, only one of the arrays was grown.
The first access past the end of the second array caused the
segfault.
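The failure mode above (two parallel arrays grown out of step) can be sketched as follows. This is an illustrative reconstruction, not groupd's actual code; the names add_node, node_ids, and node_info are hypothetical:

```c
#include <stdlib.h>

#define INITIAL_ENTRIES 16

/* Two parallel arrays indexed by the same count, as described above. */
static int         *node_ids  = NULL;
static const char **node_info = NULL;
static int          capacity  = 0;
static int          count     = 0;

/* Correct growth: when the 17th entry is needed, BOTH arrays must be
 * reallocated.  The original bug grew only one of them, so indexing
 * the other past entry 16 touched unallocated memory and segfaulted. */
static int add_node(int id, const char *info)
{
    if (count == capacity) {
        int newcap = capacity ? capacity * 2 : INITIAL_ENTRIES;

        int *ids = realloc(node_ids, newcap * sizeof(*ids));
        if (!ids)
            return -1;
        node_ids = ids;

        const char **infos = realloc(node_info, newcap * sizeof(*infos));
        if (!infos)
            return -1;          /* capacity unchanged, state stays consistent */
        node_info = infos;

        capacity = newcap;
    }
    node_ids[count]  = id;
    node_info[count] = info;
    return count++;
}
```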

Comment 1 Robert Peterson 2006-08-24 14:03:15 UTC
Created attachment 134812 [details]
Proposed patch to fix the problem

Comment 2 Robert Peterson 2006-08-25 14:38:21 UTC
Created attachment 134920 [details]
Better patch for mount hangs

The previous patch still hung because the pollfd array it
allocated never initialized the 'revents' fields of the new
entries.  The daemon then acted on garbage revents values, and
somehow that caused the hang.  One side effect was socket write
errors in the daemon.

Also, in the process of debugging this, I learned that the
gfs_controld daemon was not dynamically growing its pollfd array
either.  That had ramifications whose full extent I don't know.
I do know that group_tool -v would not show the proper list of
groups when gfs_controld failed to grow its list, which happened
when a node had five or more gfs mount points.  When gfs_controld
is allowed to dynamically grow its pollfd array, group_tool
displays the proper group list.
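The fix described above amounts to growing the pollfd array on demand and zeroing the newly added slots (realloc() leaves them uninitialized, including revents). A minimal sketch, with a hypothetical client_add helper rather than the daemon's actual code:

```c
#include <poll.h>
#include <stdlib.h>
#include <string.h>

static struct pollfd *pollfds     = NULL;
static int            pollfd_cap  = 0;
static int            pollfd_count = 0;

/* Grow the pollfd array on demand.  realloc() does not zero the new
 * entries, so the added slots (in particular 'revents') must be
 * cleared explicitly; leaving revents as garbage is what caused the
 * hang and the socket write errors described above. */
static int client_add(int fd)
{
    if (pollfd_count == pollfd_cap) {
        int newcap = pollfd_cap ? pollfd_cap * 2 : 16;
        struct pollfd *p = realloc(pollfds, newcap * sizeof(*p));
        if (!p)
            return -1;
        /* zero only the newly allocated slots */
        memset(p + pollfd_cap, 0, (newcap - pollfd_cap) * sizeof(*p));
        pollfds = p;
        pollfd_cap = newcap;
    }
    pollfds[pollfd_count].fd      = fd;
    pollfds[pollfd_count].events  = POLLIN;
    pollfds[pollfd_count].revents = 0;
    return pollfd_count++;
}
```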

This improved patch fixes both problems and I've tested it by
mounting ten GFS file systems without a problem.

I suspect these problems were causing some or most of the
problems the QE team encountered with the tank and such.

Comment 3 Nate Straz 2006-10-11 13:23:15 UTC
Our original mount hangs seem to be fixed.  We'll file new bugs
when we run into new hangs.

Comment 4 Nate Straz 2006-10-11 14:03:22 UTC
Passing through VERIFIED to get the metrics correct.

Comment 5 Nate Straz 2007-12-13 17:22:10 UTC
Moving all RHCS v5 bugs to RHEL 5 so we can remove the RHCS v5 product, which never existed.