Bug 98531

Summary:	ypserv master died with surprising error message after errata
Product:	[Retired] Red Hat Linux	Reporter:	Seth Vidal <skvidal>
Component:	ypserv	Assignee:	Chris Feist <cfeist>
Status:	CLOSED ERRATA	QA Contact:
Severity:	high	Docs Contact:
Priority:	medium
Version:	7.3	CC:	aoliva, bellman, brian-redhat-bugzilla, cmc, damorep, felicity, jim, joey, k.georgiou, laroche, menscher, mgalgoci, mkc14, ngaywood, paul.flanders, Per.t.Sjoholm, rmalouf, vanhoof, ziselman
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	RHBA-2006-0203	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2006-03-21 14:20:19 UTC	Type:	---
Bug Depends On:
Bug Blocks:	143573

Description Seth Vidal 2003-07-03 12:08:14 UTC

Description of problem:

After the ypserv-2.8-0.73E errata my primary ypserv master died with:
svc_run: - poll failed: No child processes
svc_run returned

ypserv died after not being able to contact its child processes.

I've never seen this error before and I'm not sure what is causing it. I emailed
the nis maintainer to see about any explanation. I wanted to file a bug here in
case others have this problem.

Comment 1 Alexandre Oliva 2003-07-04 17:55:51 UTC

I just got the same error with ypserv-2.8-0.9E on Red Hat Linux 9.

Comment 2 Seth Vidal 2003-07-05 13:39:59 UTC

Happened on another server for a different nis domain.


It appears to be happening only on the masters, not on the secondary NIS servers.

Comment 3 Seth Vidal 2003-07-08 11:14:40 UTC

Five days after the first restart ypserv crashed again on the same machine.

Comment 5 Bill Nottingham 2003-08-01 02:14:16 UTC

*** Bug 101428 has been marked as a duplicate of this bug. ***

Comment 6 Florian La Roche 2003-08-08 11:11:33 UTC

Bug #101428 occurred on a slave server within RH.

greetings,

Florian La Roche

Comment 7 Matthew Galgoci 2003-08-21 18:26:57 UTC

We've seen ypserv die on slaves in production:

Aug 10 23:59:36 rainier ypserv[1830]: svc_run: - poll failed: No child processes
Aug 10 23:59:36 rainier ypserv[1830]: svc_run returned

So it isn't limited to masters only.

Comment 8 Seth Vidal 2003-08-22 00:56:12 UTC

Failed on slaves here too - I just forgot to add it.

Resorted to watching rpcinfo via cron to make sure it's up and happy.

Not a happy solution but better than losing an nis server.

Comment 9 Alexandre Oliva 2003-08-22 14:30:54 UTC

FWIW, here's the work around I adopted at the uni:

# crontab -u root -l  | grep ypserv
* * * * * { ypwhich ; ypwhich || /sbin/service ypserv restart | mail -s "Restarted ypserv" root ; } > /tmp/ypcron.log 2>&1

Comment 10 Phil D'Amore 2003-08-25 13:23:41 UTC

One thing I've noticed is that we are seeing this on our 7.2 and 7.3 boxes, but
not our AS2.1 boxes.  My YP master is AS2.1 and it sees much more traffic than
some of the other YP servers that seem to die regularly.  All 3 distros seem to
be using ypserv 2.8.  I've not checked each to see if the local patches we apply
are different for each distro, but I can't see how they could be so different.
Also, at least for 7.2 and AS2.1, they seem to be running roughly the same
glibc.  The most glaring difference is the kernel.  We are running 2.4.9-e.23 on
the AS2.1 box, and our other boxes are either some flavor of 2.4.18 or the
latest 2.4.20 errata.  That, combined with the fact that ypserv is reporting
poll() as returning what I feel is an impossible error code (No child processes?
That seems semantically impossible, or at least really brain-dead), makes me
suspect the kernel.  Perhaps syslog(3) is merely misinterpreting errno when it
processes the %m *shrug*, but I doubt it.  Could this be kernel-related?

Comment 11 Seth Vidal 2003-08-25 13:26:08 UTC

seems unlikely in my case. We upgraded ypserv w/o changing kernels and ypserv
started dying.

If the kernel I'm running is the problem then you'd think it might have
triggered it in the older ypserv too.

Comment 12 Matthew Galgoci 2003-08-25 16:55:23 UTC

Remember though that the new ypserv has new semantics as well. It forks child
processes to handle map transfer requests and such, so that a single transfer
request cannot monopolize the parent ypserv process. The previous ypserv would
not fork children to handle requests. Phil's speculation about poll() still
holds, I think.

Comment 13 Matthew Galgoci 2003-08-25 16:59:58 UTC

Additionally, I would not rule out glibc either.

Comment 14 Konstantin Olchanski 2003-08-25 22:26:28 UTC

I confirm this problem on NIS slave server on RHL 9 with ypserv-2.8-2 from rawhide.

Since ypserv is a critical service, maybe Alexandre's "restarter" cron job
should be made part of the ypserv rpm?

K.O.

Comment 15 Alexandre Oliva 2003-08-26 05:01:31 UTC

If it were to be added to the default package, at the very least, it should be
changed to condrestart, such that if you stop the service, it doesn't
mysteriously come back :-)
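
A minimal sketch of that idea, assuming a Red Hat style init script where
condrestart only restarts a service that was previously started through the
init script. The function name check_ypserv and the PROBE_CMD and RESTART_CMD
variables are hypothetical, introduced here only so the probe and the restart
command can be swapped out; the defaults reuse ypwhich and /sbin/service from
the cron job in comment 9.

```shell
#!/bin/sh
# Hypothetical watchdog helper; not part of the ypserv package.
# PROBE_CMD and RESTART_CMD are illustrative overrides, read at call time.

check_ypserv() {
    probe=${PROBE_CMD:-ypwhich}
    restart=${RESTART_CMD:-/sbin/service ypserv condrestart}

    # Try the probe twice so that a single transient failure does not
    # trigger a restart.
    if $probe >/dev/null 2>&1 || $probe >/dev/null 2>&1; then
        return 0
    fi

    # condrestart (rather than restart) is meant to leave a service
    # alone if it was stopped on purpose.
    $restart
}
```

From cron, a wrapper script could source this file and call check_ypserv once
a minute, piping the output to mail if a notification is wanted.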

Comment 16 Per Sjoholm 2003-09-15 20:40:48 UTC

Have same problem on RH 9

Comment 17 Lord of All Creation 2003-09-19 19:14:40 UTC

metoo, hoping for a fix sometime soon...  RH9+errata is my config.

Comment 18 Brian Keifer 2003-09-23 14:18:42 UTC

We're having the same problems on an ES 2.1 machine here.  The box has been 
fully patched with all of the RHN errata.

Comment 19 Steve Dickson 2003-10-02 18:57:27 UTC

*** Bug 105661 has been marked as a duplicate of this bug. ***

Comment 20 Steve Dickson 2003-10-02 18:59:48 UTC

Please try the ypserv in http://people.redhat.com/steved/ypserv/2.8-4/

Comment 21 Brian Keifer 2003-10-15 18:14:27 UTC

Any ETA on an updated package for RHEL ES 2.1?

Comment 22 Need Real Name 2003-10-30 01:49:52 UTC

After up2date to the newest ypserv, the process would stop after a few days of
operation and my NIS clients would not be able to log in.  I'd have to use
/sbin/service ypserv restart to run the process again and allow NIS clients to
work.  This never happened before and it's hampering our cluster.  Please advise
on a solution if it exists.  Thank you.

Our ypserv version is: ypserv-2.8-0.73E and we're running redhat 7.3.

Comment 23 Steve Dickson 2003-10-30 20:44:38 UTC

Are there any error messages in /var/log/messages?

Comment 24 Alexandre Oliva 2003-11-28 12:19:17 UTC

I get the same messages that Matt Galgoci mentioned above.  Just
happened to me on Fedora Core 1 too.  Would you please release the
2.8-4 build as a testing update or something?  It has been working
flawlessly for me on RHL9, and it's much better than random crashes of
ypserv anyway.

Comment 25 Theo Van Dinter 2004-01-02 22:37:04 UTC

Just to chime in, I've seen this now multiple times on RH9 (yp slave) and RHEL AS 2.1 
(yp master).

RPM versions and syslog snippets:
RH9, ypserv-2.8-0.9E

Dec 23 12:05:39 mrweed ypserv[29466]: svc_run: - poll failed: No child processes
Dec 23 12:05:39 mrweed ypserv[29466]: svc_run returned


RHEL AS 2.1, ypserv-2.8-0.AS21E

Jan  1 10:39:59 rupert ypserv[6068]: svc_run: - poll failed: No child processes
Jan  1 10:39:59 rupert ypserv[6068]: svc_run returned
Dec 12 13:32:20 rupert ypserv[833]: svc_run: - poll failed: No child processes
Dec 12 13:32:20 rupert ypserv[833]: svc_run returned

Comment 29 Michael 2004-01-25 16:37:10 UTC

I am running RedHat 7.3 and ypserv (fully updated via up2date) on a 
clustered system (on the head node), and am having the same problem 
as well.

One strange thing I noticed -- the times:

Jan 14 08:00:01 ... ypserv[1094]: svc_run: - poll failed: No child processes
Jan 14 08:00:01 ... ypserv[1094]: svc_run returned
Jan 23 18:25:00 ... ypserv[12585]: svc_run: - poll failed: No child processes
Jan 23 18:25:00 ... ypserv[12585]: svc_run returned

They seem to fall exactly on the 5 minute mark.  Perhaps this is 
coincidence?  Could some cronjob be affecting it?

Comment 30 Florian La Roche 2004-01-25 17:18:53 UTC

The newest ypserv rpm in the development tree has some important
fixes in it. Would be great if someone could test this and report
if that fixes the seen problems.

greetings,

Florian La Roche

Comment 31 Damian Menscher 2004-01-26 00:19:51 UTC

I find it interesting that Michael noticed the 5-minute-mark aspect 
of the times.  Other reports don't seem to have this property, but I 
think there's something to it.  My evidence comes from an internal 
email I sent on Dec 20:

~~~~~~~~~~~~~
Ok, here's an interesting statistic... look at when it's died:

Nov 30 01:40:00 zeus ypserv[745]: svc_run: - poll failed: No child processes
Dec 13 02:50:00 zeus ypserv[738]: svc_run: - poll failed: No child processes
Dec 14 03:55:00 zeus ypserv[30240]: svc_run: - poll failed: No child processes

Notice that it always dies at night (probably no big deal) and it's 
always at some even 5-minute time.  I'm suspicious that a cron job is 
triggering the failures.  But the tricky thing is that it could be a 
cron job on *any* of the machines.  So tracking it down is not 
exactly trivial.  Still, if this pattern continues, we could maybe 
find out what's killing it.  That'd probably provide enough info that 
RedHat could reproduce it and come up with a patch.
~~~~~~~~~~

After that we tried running a sniffer for a while, to see if we could 
get a packet capture at the time of death, but it never died again, 
and we lost interest.

Hope this information is useful.
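
One way to check the pattern described above against a larger log is a small
sketch that pulls the "poll failed" lines out of syslog-format input and marks
whether each one falls on a minute that is a multiple of five. The function
name poll_failure_times and the on-5min/off-5min labels are made up for this
illustration, and it assumes the traditional syslog timestamp layout (month,
day, HH:MM:SS) seen in the excerpts in this report.

```shell
#!/bin/sh
# Hypothetical helper: reads syslog lines on stdin and prints the
# timestamp of every ypserv "poll failed" event, plus a marker that
# says whether the minute is a multiple of five.

poll_failure_times() {
    grep 'svc_run: - poll failed' | awk '{
        # $3 is HH:MM:SS in the traditional syslog format.
        split($3, t, ":")
        mark = (t[2] % 5 == 0) ? "on-5min" : "off-5min"
        print $1, $2, $3, mark
    }'
}
```

It could be run as poll_failure_times < /var/log/messages on each affected
server; a run of on-5min results would support the cron job theory, while a
mix would argue against it.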

Comment 32 Florian La Roche 2004-02-21 21:38:00 UTC

There is some speculation that the errno code is impossible, and the
newest release fixes a bug so that errno is properly saved in the signal
handler. That looks like a good fix to include, but I haven't seen
any testing that shows whether it is the fix that matters.

greetings,

Florian La Roche

Comment 33 Geronimo A. Ordanza II 2004-03-10 19:03:17 UTC

Hi, 
 
Additional Input:  A customer is encountering the same problem described 
in this ticket.  CRM # 311242.  The customer is using a 2.1 system that 
has been updated. 
 
Gene Ordanza II 
RH Technical Support

Comment 34 Luc Lalonde 2004-03-10 21:07:50 UTC

Hello,

I'm getting the same results here with RedHat 9.0 (ypserv-2.8-0.9E):

Mar 10 06:00:01 primnis ypserv[1765]: svc_run: - poll failed: No child processes
Mar 10 06:00:01 primnis ypserv[1765]: svc_run returned

Is this problem resolved with the version in:

http://people.redhat.com/steved/ypserv/2.8-4/rh9

I've had two YPSERV crashes since I upgraded the package...

Comment 35 Paul Flanders 2004-03-18 09:40:05 UTC

I also have this problem

Comment 36 Joe Pruett 2004-03-25 04:00:56 UTC

been having this problem ever since upgrading to rh9.  seeing it on 
multiple boxes.  i use a little different cron script:

rpcinfo -t ypserv localhost || /etc/rc.d/init.d/ypserv restart

Comment 37 C.M. Connelly 2004-04-02 23:29:02 UTC

Looks like you actually have to do

rpcinfo -t localhost ypserv || /etc/rc.d/init.d/ypserv restart

(at least on RHL 9).

Comment 38 Jim Robinson 2004-04-20 12:34:05 UTC

Just thought I would add my errors to the mix.
Fedora Core 1 - fully patched as of today (04/20/04)

Anyone have any idea on a possible fix for this issue?

Thanks,

Jim
jim_at_linux-sp.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>
Apr 18 17:43:41 janus ypserv[888]: svc_run: - poll failed: No child processes
Apr 18 17:43:41 janus ypserv[888]: svc_run returned
Apr 18 18:10:00 janus ypxfr[3835]: ypxfr: Can't get master address
Apr 18 18:10:00 janus ypserv[3755]: refused connect from 127.0.0.1:625 to procedure ypproc_master (nis.roman.array,publicke$
Apr 18 18:10:00 janus ypxfr[3841]: ypxfr: Can't get master address
Apr 18 18:55:00 janus ypxfr[7897]: YPXFR: RPC: Program not registered
Apr 18 18:55:00 janus ypxfr[7897]: ypxfr: RPC failure talking to server
Apr 18 18:55:00 janus ypxfr[7904]: ypxfr: Can't get master address
Apr 18 18:55:00 janus ypxfr[7905]: ypxfr: Can't get master address
Apr 18 18:55:00 janus ypxfr[7906]: ypxfr: Can't get master address
Apr 18 18:55:00 janus ypxfr[7907]: ypxfr: Can't get master address
Apr 18 19:10:01 janus ypxfr[9552]: ypxfr: Can't get master address
Apr 18 19:10:01 janus ypserv[9463]: refused connect from 127.0.0.1:825 to procedure ypproc_master (nis.roman.array,publicke$
Apr 18 19:10:01 janus ypxfr[9553]: ypxfr: Can't get master address
Apr 18 20:10:26 janus ypxfr[15372]: masterOrderNum: RPC: Timed out

Comment 39 Jeff Sheltren 2004-10-29 14:11:03 UTC

We're occasionally experiencing this on an FC1 box using ypserv-2.8-3.

Is anyone experiencing this problem on FC2, or is it fixed in the
newer version of ypserv (2.12.1-2)?

Comment 40 Konstantin Olchanski 2005-02-07 21:59:12 UTC

FYI, bug 105661
(https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=105661) explains
the race condition leading to this problem and even suggests a source
code patch.
K.O.

Comment 43 Chris Feist 2005-03-15 20:21:44 UTC

This should be fixed in ypserv 2.8-9.22 (RHEL21) & ypserv 2.8-13
(RHEL3).  I'm unable to test for the error because it is so difficult
to replicate, but I believe I've found the problem detailed in bug
105661 and backported the fix from upstream.

Comment 44 Chris Feist 2005-03-17 19:21:09 UTC

I've posted ypserv 2.8-13 and ypserv-2.8-9.22 on my people page.  You can get
the rpms there until they become available in RHEL 3 & RHEL 2.1:

http://people.redhat.com/cfeist/ypserv/

Comment 45 Joe Pruett 2005-03-17 19:56:31 UTC

Does this mean that RHEL4 isn't affected by the bug?

Comment 46 Chris Feist 2005-03-17 20:27:13 UTC

Yes, RHEL4 should not be affected by this bug. RHEL4 uses ypserv-2.13 and this
bug was fixed in ypserv-2.10 (and was backported into ypserv-2.8 for RHEL 3 &
RHEL 2.1).

Comment 47 Tim Powers 2005-05-19 23:34:48 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2005-352.html

Comment 49 Red Hat Bugzilla 2006-03-21 14:20:20 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0203.html