Bug 98531
Summary: | ypserv master died with surprising error message after errata | | |
---|---|---|---|
Product: | [Retired] Red Hat Linux | Reporter: | Seth Vidal <skvidal> |
Component: | ypserv | Assignee: | Chris Feist <cfeist> |
Status: | CLOSED ERRATA | Severity: | high |
Priority: | medium | Version: | 7.3 |
Hardware: | All | OS: | Linux |
Fixed In Version: | RHBA-2006-0203 | Doc Type: | Bug Fix |
Last Closed: | 2006-03-21 14:20:19 UTC | Bug Blocks: | 143573 |
CC: | aoliva, bellman, brian-redhat-bugzilla, cmc, damorep, felicity, jim, joey, k.georgiou, laroche, menscher, mgalgoci, mkc14, ngaywood, paul.flanders, Per.t.Sjoholm, rmalouf, vanhoof, ziselman | | |
Description
Seth Vidal
2003-07-03 12:08:14 UTC
I just got the same error with ypserv-2.8-0.9E on Red Hat Linux 9.

It happened on another server for a different NIS domain. It appears to be happening only on the masters, not the secondary NIS servers.

Five days after the first restart, ypserv crashed again on the same machine.

*** Bug 101428 has been marked as a duplicate of this bug. ***

Bug #101428 was seen on a slave server within RH.

greetings,
Florian La Roche

We've seen ypserv die on slaves in production:

Aug 10 23:59:36 rainier ypserv[1830]: svc_run: - poll failed: No child processes
Aug 10 23:59:36 rainier ypserv[1830]: svc_run returned

So it isn't limited to masters only.

Failed on slaves here too - I just forgot to add it. Resorted to watching rpcinfo via cron to make sure it's up and happy. Not a happy solution but better than losing an NIS server.

FWIW, here's the workaround I adopted at the uni:

# crontab -u root -l | grep ypserv
* * * * * { ypwhich ; ypwhich || /sbin/service ypserv restart | mail -s "Restarted ypserv" root ; } > /tmp/ypcron.log 2>&1

One thing I've noticed is that we are seeing this on our 7.2 and 7.3 boxes, but not our AS2.1 boxes. My YP master is AS2.1 and it sees much more traffic than some of the other YP servers that seem to die regularly. All 3 distros seem to be using ypserv 2.8. I've not checked each to see if the local patches we apply are different for each distro, but I can't see how they could be so different. Also, at least for 7.2 and AS2.1, they seem to be running roughly the same glibc. The most glaring difference is the kernel. We are running 2.4.9-e.23 on the AS2.1 box, and our other boxes are either some flavor of 2.4.18 or the latest 2.4.20 errata. That is combined with the fact that ypserv is reporting poll() as returning what I feel is an impossible error code (No child processes? That seems semantically impossible, or at least really brain-dead). Perhaps syslog(3) is merely misinterpreting errno when it processes the %m *shrug*, but I doubt it.
Could this be kernel-related?

Seems unlikely in my case. We upgraded ypserv w/o changing kernels and ypserv started dying. If the kernel I'm running is the problem then you'd think it might have triggered it in the older ypserv too.

Remember though that the new ypserv has new semantics as well. It forks child processes to handle map transfer requests and such, so that a single transfer request cannot monopolize the parent ypserv process. The previous ypserv would not fork children to handle requests. Phil's speculation about poll() still holds, I think. Additionally, I would not rule out glibc either.

I confirm this problem on an NIS slave server on RHL 9 with ypserv-2.8-2 from rawhide. Since ypserv is a critical service, maybe Alexandre's "restarter" cron job should be made part of the ypserv rpm? K.O.

If it were to be added to the default package, at the very least it should be changed to condrestart, so that if you stop the service, it doesn't mysteriously come back :-)

Have the same problem on RH 9, me too, hoping for a fix sometime soon... RH9+errata is my config.

We're having the same problems on an ES 2.1 machine here. The box has been fully patched with all of the RHN errata.

*** Bug 105661 has been marked as a duplicate of this bug. ***

Please try the ypserv in http://people.redhat.com/steved/ypserv/2.8-4/

Any ETA on an updated package for RHEL ES 2.1?

After up2date to the newest ypserv, the process would stop after a few days of operation and my NIS clients would not be able to log in. I'd have to use /sbin/service ypserv restart to run the process again and allow NIS clients to work. This never happened before and it's hampering our cluster. Please advise on a solution if it exists. Thank you. Our ypserv version is ypserv-2.8-0.73E and we're running Red Hat 7.3.

Are there any error messages in /var/log/messages?

I get the same messages that Matt Galgoci mentioned above.

Just happened to me on Fedora Core 1 too.
Would you please release the 2.8-4 build as a testing update or something? It has been working flawlessly for me on RHL9, and it's much better than random crashes of ypserv anyway.

Just to chime in, I've seen this now multiple times on RH9 (yp slave) and RHEL AS 2.1 (yp master). RPM versions and syslog snippets:

RH9, ypserv-2.8-0.9E
Dec 23 12:05:39 mrweed ypserv[29466]: svc_run: - poll failed: No child processes
Dec 23 12:05:39 mrweed ypserv[29466]: svc_run returned

RHEL AS 2.1, ypserv-2.8-0.AS21E
Jan 1 10:39:59 rupert ypserv[6068]: svc_run: - poll failed: No child processes
Jan 1 10:39:59 rupert ypserv[6068]: svc_run returned
Dec 12 13:32:20 rupert ypserv[833]: svc_run: - poll failed: No child processes
Dec 12 13:32:20 rupert ypserv[833]: svc_run returned

I am running RedHat 7.3 and ypserv (fully updated via up2date) on a clustered system (on the head node), and am having the same problem as well. One strange thing I noticed -- the times:

Jan 14 08:00:01 ... ypserv[1094]: svc_run: - poll failed: No child processes
Jan 14 08:00:01 ... ypserv[1094]: svc_run returned
Jan 23 18:25:00 ... ypserv[12585]: svc_run: - poll failed: No child processes
Jan 23 18:25:00 ... ypserv[12585]: svc_run returned

They seem to fall exactly on the 5 minute mark. Perhaps this is coincidence? Could some cronjob be affecting it?

The newest ypserv rpm in the development tree has some important fixes in it. It would be great if someone could test this and report whether it fixes the problems seen.

greetings,
Florian La Roche

I find it interesting that Michael noticed the 5-minute-mark aspect of the times. Other reports don't seem to have this property, but I think there's something to it. My evidence comes from an internal email I sent on Dec 20:

~~~~~~~~~~~~~
Ok, here's an interesting statistic...
look at when it's died:

Nov 30 01:40:00 zeus ypserv[745]: svc_run: - poll failed: No child processes
Dec 13 02:50:00 zeus ypserv[738]: svc_run: - poll failed: No child processes
Dec 14 03:55:00 zeus ypserv[30240]: svc_run: - poll failed: No child processes

Notice that it always dies at night (probably no big deal) and it's always at some even 5-minute time. I'm suspicious that a cron job is triggering the failures. But the tricky thing is that it could be a cron job on *any* of the machines. So tracking it down is not exactly trivial. Still, if this pattern continues, we could maybe find out what's killing it. That'd probably provide enough info that RedHat could reproduce it and come up with a patch.
~~~~~~~~~~

After that we tried running a sniffer for a while, to see if we could get a packet capture at the time of death, but it never died again, and we lost interest. Hope this information is useful.

There is some speculation that the errno code is impossible, and the newest release has a bug fix to properly save errno in the signal handler. That looks like a good fix to include, but I haven't seen any testing of whether that is the important fix.

greetings,
Florian La Roche

Hi,

Additional input: Customer is encountering the same problem described on this ticket. CRM # 311242. Customer is using a 2.1 system that has been updated.

Gene Ordanza II
RH Technical Support

Hello,

I'm getting the same results here with RedHat 9.0 (ypserv-2.8-0.9E):

Mar 10 06:00:01 primnis ypserv[1765]: svc_run: - poll failed: No child processes
Mar 10 06:00:01 primnis ypserv[1765]: svc_run returned

Is this problem resolved with the version in:
http://people.redhat.com/steved/ypserv/2.8-4/rh9

I've had two YPSERV crashes since I upgraded the package...

I have also been having this problem ever since upgrading to rh9. Seeing it on multiple boxes.
I use a slightly different cron script:

rpcinfo -t ypserv localhost || /etc/rc.d/init.d/ypserv restart

Looks like you actually have to do

rpcinfo -t localhost ypserv || /etc/rc.d/init.d/ypserv restart

(at least on RHL 9).

Just thought I would add my errors to the mix.
Fedora Core 1 - fully patched as of today (04/20/04)
Anyone have any idea on a possible fix for this issue?
Thanks,
Jim
jim_at_linux-sp.com
>>>>>>>>>>>>>>>>>>>>>>>>>>>
Apr 18 17:43:41 janus ypserv[888]: svc_run: - poll failed: No child processes
Apr 18 17:43:41 janus ypserv[888]: svc_run returned
Apr 18 18:10:00 janus ypxfr[3835]: ypxfr: Can't get master address
Apr 18 18:10:00 janus ypserv[3755]: refused connect from 127.0.0.1:625 to procedure ypproc_master (nis.roman.array,publicke$
Apr 18 18:10:00 janus ypxfr[3841]: ypxfr: Can't get master address
Apr 18 18:55:00 janus ypxfr[7897]: YPXFR: RPC: Program not registered
Apr 18 18:55:00 janus ypxfr[7897]: ypxfr: RPC failure talking to server
Apr 18 18:55:00 janus ypxfr[7904]: ypxfr: Can't get master address
Apr 18 18:55:00 janus ypxfr[7905]: ypxfr: Can't get master address
Apr 18 18:55:00 janus ypxfr[7906]: ypxfr: Can't get master address
Apr 18 18:55:00 janus ypxfr[7907]: ypxfr: Can't get master address
Apr 18 19:10:01 janus ypxfr[9552]: ypxfr: Can't get master address
Apr 18 19:10:01 janus ypserv[9463]: refused connect from 127.0.0.1:825 to procedure ypproc_master (nis.roman.array,publicke$
Apr 18 19:10:01 janus ypxfr[9553]: ypxfr: Can't get master address
Apr 18 20:10:26 janus ypxfr[15372]: masterOrderNum: RPC: Timed out
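
For anyone stuck on an affected build, the cron workarounds quoted earlier in this thread can be folded into one small script. This is only a sketch, not something shipped in the ypserv rpm. The function names are made up for illustration, the probe uses the `rpcinfo -t localhost ypserv` form reported above to work on RHL 9, and the lock file path `/var/lock/subsys/ypserv` is an assumption about how the Red Hat init script marks a service as running. That lock file check is what makes it behave like the condrestart suggestion: a ypserv that was stopped on purpose stays stopped.

```shell
#!/bin/sh
# Hypothetical watchdog sketch; names and paths here are illustrative.

# Assumption: the init script creates this file while ypserv is meant to run.
LOCKFILE="${LOCKFILE:-/var/lock/subsys/ypserv}"

# Ask the local portmapper whether ypserv answers over TCP.
probe_ypserv() {
    rpcinfo -t localhost ypserv >/dev/null 2>&1
}

# Restart through the init system, as in the workarounds quoted above.
restart_ypserv() {
    /sbin/service ypserv restart
}

# Restart only when the service should be up but does not answer the probe.
ypserv_watchdog() {
    [ -f "$LOCKFILE" ] || return 0   # stopped on purpose: leave it alone
    probe_ypserv && return 0         # healthy: nothing to do
    echo "ypserv not answering; restarting"
    restart_ypserv
}
```

To use it, add a final line that calls `ypserv_watchdog`, save the script somewhere such as /usr/local/sbin/ypserv-watchdog (a hypothetical path), and run it from root's crontab every minute. Anything it prints can be piped to mail, as in the original workaround.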
We're occasionally experiencing this on an FC1 box using ypserv-2.8-3. Is anyone experiencing this problem on FC2, or is it fixed in the newer version of ypserv (2.12.1-2)?

FYI, bug 105661 (https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=105661) explains the race condition leading to this problem and even suggests a source code patch. K.O.

This should be fixed in ypserv 2.8-9.22 (RHEL21) & ypserv 2.8-13 (RHEL3). I'm unable to test for the error because it is so difficult to replicate, but I believe I've found the problem detailed in bug 105661 and backported the fix from upstream.

I've posted ypserv 2.8-13 and ypserv-2.8-9.22 on my people page. You can get the rpms there until they become available in RHEL 3 & RHEL 2.1:
http://people.redhat.com/cfeist/ypserv/

Does this mean that RHEL4 isn't affected by the bug?

Yes, RHEL4 should not be affected by this bug. RHEL4 uses ypserv-2.13, and this bug was fixed in ypserv-2.10 (and was backported into ypserv-2.8 for RHEL 3 & RHEL 2.1).

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHBA-2005-352.html

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHBA-2006-0203.html
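
To check whether crashes in your own logs line up with cron boundaries, as several reporters in this thread suspected, the timestamps can be classified mechanically. A rough sketch, assuming classic syslog lines of the form `Mon DD HH:MM:SS host ypserv[pid]: ...`. The function name `classify_crashes` and the `SLOP` variable (seconds of tolerance after a 5-minute mark, default 1) are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: tag each "poll failed" line read from stdin as
# on-mark or off-mark relative to the nearest earlier 5-minute boundary.
classify_crashes() {
    awk -v slop="${SLOP:-1}" '
        /svc_run: - poll failed/ {
            split($3, t, ":")              # field 3 is HH:MM:SS
            off = (t[2] % 5) * 60 + t[3]   # seconds past the last 5-minute mark
            tag = (off <= slop) ? "on-mark" : "off-mark"
            print tag, $1, $2, $3, $4      # tag, month, day, time, host
        }'
}
```

Typical use would be `grep ypserv /var/log/messages | classify_crashes`. Fed the lines quoted in this thread, the zeus crashes (01:40:00, 02:50:00, 03:55:00) come out as on-mark while the janus crash at 17:43:41 does not, so the pattern is suggestive rather than universal.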