1442904 – Met "error reloading router: wait: no child processes" in router pod logs

Summary: Met "error reloading router: wait: no child processes" in router pod logs

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Networking
Sub Component:
Version:	3.6.0
Hardware:	All
OS:	All
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Phil Cameron
QA Contact:	zhaozhanqi
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-04-18 01:05 UTC by zhaozhanqi
Modified:	2022-08-04 22:20 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-08-16 19:14:52 UTC
Target Upstream Version:
Embargoed:

Description zhaozhanqi 2017-04-18 01:05:49 UTC

Description of problem:
The error "E0417 08:40:19.657429       1 ratelimiter.go:52] error reloading router: wait: no child processes" appeared once in the router pod logs while creating and deleting a route.

Version-Release number of selected component (if applicable):
openshift version
openshift v3.6.27
kubernetes v1.5.2+43a9be4
etcd 3.1.0


How reproducible:
Seen once.

Steps to Reproduce:
1. Run a script that repeatedly creates a route and then deletes it, for example:
  $ cat check.sh
  # Create a route, check that it is exposed on a router, then delete it.
  while true
  do
    oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/routing/unsecure/route_unsecure.json
    oc describe route route | grep expose
    # $? is grep's exit status: non-zero means "expose" was not found.
    if [ $? != 0 ]; then
      echo "the route is not loading to router" >> fail.route
    fi
    oc delete route route
    sleep 10
  done

2. Run this script in the background:
   nohup ./check.sh &
3. After about 24 hours of running, 'the route is not loading to router' was recorded once.

4. Check the router logs

Actual results:
 oc logs router-xxx
 ..
  - HAProxy port 1936 health check ok : 0 retry attempt(s).
I0417 08:40:09.418421       1 router.go:508] Router reloaded:
 - Checking HAProxy /healthz on port 1936 ...
 - HAProxy port 1936 health check ok : 0 retry attempt(s).
E0417 08:40:19.657429       1 ratelimiter.go:52] error reloading router: wait: no child processes
 - Checking HAProxy /healthz on port 1936 ...
 - HAProxy port 1936 health check ok : 0 retry attempt(s).
I0417 08:40:19.978148       1 router.go:508] Router reloaded:
 - Checking HAProxy /healthz on port 1936 ...
..
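
When the router log is long, the reload failures can be picked out with a small filter. This is only a sketch: the function names count_reload_errors and reload_error_times are hypothetical, and it assumes the log line layout shown in the excerpt above (timestamp in the second field).

```shell
#!/bin/sh
# Hypothetical helpers; both read router log text from stdin.

# Print the number of lines reporting a failed router reload.
count_reload_errors() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c 'error reloading router' || true
}

# Print the time-of-day field of each failed reload, one per line.
reload_error_times() {
    awk '/error reloading router/ { print $2 }'
}
```

Possible usage, with router-xxx standing in for the real pod name as above: `oc logs router-xxx | count_reload_errors`. The printed times can then be compared against when entries were added to fail.route.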

Expected results:

No such error occurs, and the route is loaded into the router.

Additional info:

Comment 1 Ben Bennett 2017-04-19 19:12:16 UTC

This is just something haproxy outputs when there are no kids to kill yet... it is harmless and should be ignored.  We should clean up the error to make it less scary (unless it repeats)

Comment 2 zhaozhanqi 2017-04-20 01:32:11 UTC

@Ben

It's not only the error in the log. At the same time, the route also fails to load into the router.

The following part of the script checks whether the route has been loaded into the router:
  *****
  oc describe route route | grep expose;
  if [ $? != 0 ]; then 
   echo "the route is not loading to router" >> fail.route ;
  fi

I can see 'the route is not loading to router' in fail.route at the time the router logged 'error reloading router: wait: no child processes'.

Comment 3 Marek Schmidt 2017-04-24 12:45:39 UTC

We seem to be hitting this on our test OCP 3.5 cluster (with openshift3/ose-haproxy-router:v3.5.5.5).

It seems that when this happens, the router stops being updated (we are getting 503 for all new routes).

(Also noting that there are no additional errors in the haproxy logs like the ones we used to get with https://bugzilla.redhat.com/show_bug.cgi?id=1429823 )

Comment 4 Ben Bennett 2017-04-24 15:09:32 UTC

maschmid: You are probably hitting the router deadlock bug https://bugzilla.redhat.com/show_bug.cgi?id=1440977

Comment 5 Phil Cameron 2017-06-02 17:09:59 UTC

The fix for the Pop() panic also has a change to reduce the number of deleted routes in the database. This may help as well since it took 24 hours to happen. Could we retest this with the Pop() panic (bz1437441) fix?

Comment 6 Phil Cameron 2017-06-19 17:34:55 UTC

Is this still an issue?

Comment 7 Ben Bennett 2017-06-22 16:02:06 UTC


*** This bug has been marked as a duplicate of bug 1437441 ***

Comment 8 Josh Foots 2017-08-16 18:18:32 UTC

I don't think this bug is a dupe of the one above.

The problem seems to be in the underlying golang itself; see the comment from tgross on the following issue:

wait: no child processes · Issue #178 · joyent/containerpilot
https://github.com/joyent/containerpilot/issues/178

"I've run into this intermittently. The code section in question is in utils/run.go ExecuteAndWait. If you check out the golang source code for cmd.Run you'll see a race condition. The process is started and then we wait for it. But if the process completes and exits before the wait happens (because, say, the go runtime decides to do a GC pause right then or the goroutine yields for the syscall), then we'll get an error there."

It's probably a non-issue in later releases, as we changed the version of golang we ship.
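
As a rough illustration of the OS-level condition behind the message (not a reproduction of the Go race described in the quote): "wait: no child processes" is how Go reports an ECHILD error from the wait system call, which is returned when the caller has no matching child left to wait for. The sketch below shows the shell analogue of that condition. The helper wait_rc and the variable WAIT_RC are hypothetical names introduced here.

```shell
#!/bin/sh
# wait_rc (hypothetical helper): store the exit status of `wait PID` in WAIT_RC.
wait_rc() {
    WAIT_RC=0
    wait "$1" 2>/dev/null || WAIT_RC=$?
}

# Case 1: a direct child of this shell. wait collects its exit status.
sh -c 'exit 3' &
wait_rc "$!"
echo "own child:     wait status $WAIT_RC"

# Case 2: a PID started by a different parent that has already exited.
# This shell has no such child, so there is nothing for wait to collect.
orphan=$(sh -c 'sleep 0 & echo $!')
wait_rc "$orphan"
echo "not our child: wait status $WAIT_RC"
```

In the first case the status is the child's own exit code. In the second, a POSIX shell reports status 127 for a PID that is not one of its known children, which plays the same role as the ECHILD error surfaced in the router log.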

Comment 9 Eric Paris 2017-08-16 19:14:52 UTC

While the log message is not user friendly, we do not, at this time, believe that the message indicates a problem or will cause any harm to the system. We do not plan to specifically address this log output in any release. I apologize; however, I am going to close this BZ as WONTFIX.
