Bug 203916

Summary: groupd daemon segfault and mount hang
Product: Red Hat Enterprise Linux 5
Reporter: Robert Peterson <rpeterso>
Component: cman
Assignee: Robert Peterson <rpeterso>
Status: CLOSED CURRENTRELEASE
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Priority: medium
Version: 5.0
CC: cluster-maint, teigland
Target Milestone: ---
Target Release: ---
Keywords: Reopened
Hardware: All
OS: Linux
Fixed In Version: 5.0.0
Doc Type: Bug Fix
Last Closed: 2006-10-11 14:04:24 UTC
Attachments:
- Proposed patch to fix the problem
- Better patch for mount hangs

Description Robert Peterson 2006-08-24 14:03:15 UTC
Description of problem:
If I mount more than four gfs file systems on any given node
in a cluster, the extra mount hangs and the groupd daemon segfaults.

How reproducible:
Always

Steps to Reproduce:
From a fresh boot of a 5-node (smoke) cluster:
1. service cman start on all 5 nodes
2. service clvmd start on all 5 nodes
3. Mount a gfs file system on 4 out of 5 nodes.
4. Mount a gfs2 file system on the same 4 nodes.
5. On the fifth node, do five mounts of different file systems.

mount -tgfs /dev/Smoke_Cluster/Smoke_Cluster2 /mnt/SmokeCluster2/
mount -tgfs /dev/Smoke_Cluster/Smoke_Cluster3 /mnt/SmokeCluster3/
mount -tgfs /dev/Smoke_Cluster/Smoke_Cluster4 /mnt/SmokeCluster4/
mount -tgfs /dev/Smoke_Cluster/Smoke_Cluster5 /mnt/SmokeCluster5/
mount -tgfs /dev/Smoke_Cluster/Smoke_Cluster6 /mnt/SmokeCluster6/
  
Actual results:
The first four mounts work correctly.  The fifth mount hangs 
(but you can interrupt it) and the groupd daemon segfaults, 
causing the other daemons to stop as well.

Expected results:
You should be able to mount more than four gfs file systems per node.

Additional info:
The groupd daemon allocates two parallel arrays holding node
information for the mounts, each initially sized at 16 entries.
When the 17th entry was needed, only one of the arrays was grown.
The first access past the end of the second array caused the
segfault.
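The failure mode above (two parallel arrays grown out of step) can be sketched as follows. This is an illustrative reconstruction, not groupd's actual code; the names add_node, node_ids, and node_info are hypothetical:

```c
#include <stdlib.h>

#define INITIAL_ENTRIES 16

/* Two parallel arrays indexed by the same count, as described above. */
static int         *node_ids  = NULL;
static const char **node_info = NULL;
static int          capacity  = 0;
static int          count     = 0;

/* Correct growth: when the 17th entry is needed, BOTH arrays must be
 * reallocated.  The original bug grew only one of them, so indexing
 * the other past entry 16 touched unallocated memory and segfaulted. */
static int add_node(int id, const char *info)
{
    if (count == capacity) {
        int newcap = capacity ? capacity * 2 : INITIAL_ENTRIES;

        int *ids = realloc(node_ids, newcap * sizeof(*ids));
        if (!ids)
            return -1;
        node_ids = ids;

        const char **infos = realloc(node_info, newcap * sizeof(*infos));
        if (!infos)
            return -1;          /* capacity unchanged, state stays consistent */
        node_info = infos;

        capacity = newcap;
    }
    node_ids[count]  = id;
    node_info[count] = info;
    return count++;
}
```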

Comment 1 Robert Peterson 2006-08-24 14:03:15 UTC
Created attachment 134812 [details]
Proposed patch to fix the problem

Comment 2 Robert Peterson 2006-08-25 14:38:21 UTC
Created attachment 134920 [details]
Better patch for mount hangs

The previous patch still hung because the pollfd array it
allocated never initialized the 'revents' fields of the new
entries.  The daemon then acted on garbage revents values, and
somehow that caused the hang.  One side effect was socket write
errors in the daemon.

Also, in the process of debugging this, I learned that the
gfs_controld daemon was not dynamically growing its pollfd array
either.  That had ramifications whose full extent I don't know.
I do know that group_tool -v would not show the proper list of
groups when gfs_controld failed to grow its list, which happened
when a node had five or more gfs mount points.  When gfs_controld
is allowed to dynamically grow its pollfd array, group_tool
displays the proper group list.
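The fix described above amounts to growing the pollfd array on demand and zeroing the newly added slots (realloc() leaves them uninitialized, including revents). A minimal sketch, with a hypothetical client_add helper rather than the daemon's actual code:

```c
#include <poll.h>
#include <stdlib.h>
#include <string.h>

static struct pollfd *pollfds     = NULL;
static int            pollfd_cap  = 0;
static int            pollfd_count = 0;

/* Grow the pollfd array on demand.  realloc() does not zero the new
 * entries, so the added slots (in particular 'revents') must be
 * cleared explicitly; leaving revents as garbage is what caused the
 * hang and the socket write errors described above. */
static int client_add(int fd)
{
    if (pollfd_count == pollfd_cap) {
        int newcap = pollfd_cap ? pollfd_cap * 2 : 16;
        struct pollfd *p = realloc(pollfds, newcap * sizeof(*p));
        if (!p)
            return -1;
        /* zero only the newly allocated slots */
        memset(p + pollfd_cap, 0, (newcap - pollfd_cap) * sizeof(*p));
        pollfds = p;
        pollfd_cap = newcap;
    }
    pollfds[pollfd_count].fd      = fd;
    pollfds[pollfd_count].events  = POLLIN;
    pollfds[pollfd_count].revents = 0;
    return pollfd_count++;
}
```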

This improved patch fixes both problems and I've tested it by
mounting ten GFS file systems without a problem.

I suspect these problems were causing some or most of the
problems the QE team encountered with the tank and such.

Comment 3 Nate Straz 2006-10-11 13:23:15 UTC
Our original mount hangs seem to be fixed.  We'll file new bugs
when we run into new hangs.

Comment 4 Nate Straz 2006-10-11 14:03:22 UTC
Passing through VERIFIED to get the metrics correct.

Comment 5 Nate Straz 2007-12-13 17:22:10 UTC
Moving all RHCS v5 bugs to RHEL 5 so we can remove the RHCS v5 product, which never existed.