# Description of problem:

The Jenkins server needs to be restarted frequently and is reporting WARNING messages from "WatchConnectionManager" [1] and SEVERE messages from "KubernetesLauncher" [2]. Both seem to be connected.

**My theory** is that Jenkins is being impacted by internal and/or external situations:

- Internal: KubernetesLauncher is scheduling too many pods (kube-apiserver API calls), consuming the buckets allowed by P&F and getting throttled; the connections then raise RejectedExecutionException and Jenkins seems to crash after some time, once many exceptions have accumulated (a way to check the throttling side is sketched right after these lists).
- External: normal P&F operation could throttle the sync-plugin, and the same behavior could happen.
- External: when etcd is not fast enough, more errors will consequently be returned by the kube-apiserver.

Issues we would like to understand and isolate:

- Are both messages connected, i.e. is KubernetesLauncher impacting WatchConnectionManager?
- Could the P&F gate impact the sync-plugin's operation/self-recovery when it is being throttled?
- Are these the errors that drive Jenkins to "freeze" and need to be restarted manually?
- Is there any health check that can detect this kind of situation in Jenkins?
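A minimal sketch for checking whether P&F is actually rejecting requests, assuming cluster-admin access and that API Priority and Fairness is enabled on the cluster (the resource names and metric are the upstream defaults, not taken from this cluster):

~~~
# List the Priority and Fairness configuration currently in effect.
oc get flowschemas.flowcontrol.apiserver.k8s.io
oc get prioritylevelconfigurations.flowcontrol.apiserver.k8s.io

# Check whether the kube-apiserver is rejecting requests under P&F;
# a non-zero, growing counter here would support the throttling theory.
oc get --raw /metrics | grep '^apiserver_flowcontrol_rejected_requests_total'
~~~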
---

[1] WatchConnectionManager

This error repeats frequently and, based on some user cases, it seems to crash Jenkins, requiring a manual restart, after Kubernetes APF (OpenShift P&F) was introduced in 4.5.15. The stack trace [1A] repeats many times [1B] in a short period of time. If these are connection retries when retrieving objects from the Kubernetes API, we need to understand whether we are having issues with traffic control by P&F.

[1A] stack trace "WARNING i.f.k.c.d.i.WatchConnectionManager$1#onFailure"
~~~
2021-06-04T20:53:27.424569522Z 2021-06-04 20:53:27.424+0000 [id=64143] WARNING i.f.k.c.d.i.WatchConnectionManager$1#onFailure: Exec Failure
2021-06-04T20:53:27.424569522Z java.util.concurrent.RejectedExecutionException: Task okhttp3.RealCall$AsyncCall@7aa11d55 rejected from java.util.concurrent.ThreadPoolExecutor@2b5433f8[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 12799]
2021-06-04T20:53:27.424569522Z     at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
2021-06-04T20:53:27.424569522Z     at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
2021-06-04T20:53:27.424569522Z     at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
2021-06-04T20:53:27.424569522Z     at okhttp3.RealCall$AsyncCall.executeOn(RealCall.java:182)
2021-06-04T20:53:27.424569522Z Caused: java.io.InterruptedIOException: executor rejected
2021-06-04T20:53:27.424569522Z     at okhttp3.RealCall$AsyncCall.executeOn(RealCall.java:185)
2021-06-04T20:53:27.424569522Z     at okhttp3.Dispatcher.promoteAndExecute(Dispatcher.java:186)
2021-06-04T20:53:27.424569522Z     at okhttp3.Dispatcher.enqueue(Dispatcher.java:137)
2021-06-04T20:53:27.424569522Z     at okhttp3.RealCall.enqueue(RealCall.java:126)
2021-06-04T20:53:27.424569522Z     at okhttp3.internal.ws.RealWebSocket.connect(RealWebSocket.java:193)
2021-06-04T20:53:27.424569522Z     at okhttp3.OkHttpClient.newWebSocket(OkHttpClient.java:435)
2021-06-04T20:53:27.424569522Z     at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.runWatch(WatchConnectionManager.java:153)
2021-06-04T20:53:27.424569522Z     at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$1200(WatchConnectionManager.java:48)
2021-06-04T20:53:27.424569522Z     at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2$1.execute(WatchConnectionManager.java:316)
2021-06-04T20:53:27.424569522Z     at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$NamedRunnable.run(WatchConnectionManager.java:405)
2021-06-04T20:53:27.424569522Z     at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
2021-06-04T20:53:27.424569522Z     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
2021-06-04T20:53:27.424569522Z     at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
2021-06-04T20:53:27.424569522Z     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
2021-06-04T20:53:27.424569522Z     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
2021-06-04T20:53:27.424569522Z     at java.base/java.lang.Thread.run(Thread.java:834)
~~~

[1B] the error repeating many times:
~~~
$ omg logs jenkins-42-tf2rl |grep "i.f.k.c.d.i.WatchConnectionManager" |wc -l
5752

$ omg logs jenkins-42-tf2rl |grep "i.f.k.c.d.i.WatchConnectionManager" |awk -F':' '{print$1":"$2}' |sort |uniq -c
    266 2021-06-04T20:51
   1061 2021-06-04T20:52
    897 2021-06-04T20:53
   1094 2021-06-04T20:54
   1106 2021-06-04T20:55
   1106 2021-06-04T20:56
    222 2021-06-04T20:57

$ omg logs jenkins-42-tf2rl |grep "i.f.k.c.d.i.WatchConnectionManager" |head -n1
2021-06-04T20:51:51.422655310Z 2021-06-04 20:51:51.422+0000 [id=64143] WARNING i.f.k.c.d.i.WatchConnectionManager$1#onFailure: Exec Failure

$ omg logs jenkins-42-tf2rl |grep "i.f.k.c.d.i.WatchConnectionManager" |tail -n1
2021-06-04T20:57:16.022918463Z 2021-06-04 20:57:16.022+0000 [id=77825] WARNING i.f.k.c.d.i.WatchConnectionManager$1#onFailure: Exec Failure
~~~
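If the reconnect storm in [1B] is driven by P&F throttling, the same log should also contain explicit throttling responses. A hypothetical follow-up pipeline on the same must-gather, in the same style as above (the exact message text logged by the fabric8 client may differ):

~~~
# Count per-minute occurrences of explicit throttling responses in the
# same Jenkins log; the grep pattern is an assumption about the wording.
omg logs jenkins-42-tf2rl | grep -i 'too many requests' \
  | awk -F':' '{print$1":"$2}' | sort | uniq -c
~~~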
[2] The Jenkins server is crashing when provisioning many jenkins-base pods in a short period of time:
~~~
./inspect.local.1906246428634946458/namespaces/<custom_ns_cicd>/pods/jenkins-42-tf2rl/jenkins/jenkins/logs/current.log:132354:2021-06-04T20:56:59.980121155Z 2021-06-04 20:56:59.979+0000 [id=94871] SEVERE o.c.j.p.k.KubernetesLauncher#logLastLines: Error in provisioning; agent=KubernetesSlave name: jenkins-base-7n78j, template=PodTemplate{id='2cb93a09-c978-4d98-8ac6-993f27aca0fb'inheritFrom='', name='jenkins-base', namespace='', hostNetwork=false, instanceCap=25, idleMinutes=5, label='jenkins-base', nodeSelector='', nodeUsageMode=EXCLUSIVE, workspaceVolume=EmptyDirWorkspaceVolume [memory=false], volumes=[org.csanchez.jenkins.plugins.kubernetes.volumes.ConfigMapVolume@9aa5f96], containers=[ContainerTemplate{name='jnlp', image='image-registry.openshift-image-registry.svc:5000/openshift/ose-jenkins-agent-base', alwaysPullImage=true, workingDir='/tmp', command='', args='${computer.jnlpmac} ${computer.name}', ttyEnabled=true, resourceRequestCpu='', resourceRequestMemory='', resourceLimitCpu='', resourceLimitMemory='', livenessProbe=ContainerLivenessProbe{execArgs='', timeoutSeconds=0, initialDelaySeconds=0, failureThreshold=0, periodSeconds=0, successThreshold=0}}]}. Container jnlp. Logs: INFO: Connecting to 100.123.158.68:50000

$ omg logs jenkins-42-tf2rl |grep SEVERE |grep 'name: ' |awk '{print$1" "$12}' |wc -l
67

$ omg logs jenkins-42-tf2rl |grep SEVERE |grep 'name: ' |awk '{print$1" "$12}' |awk -F':' '{print$1":"$2}' |sort |uniq -c
      4 2021-06-04T20:51
     16 2021-06-04T20:52
      9 2021-06-04T20:53
     12 2021-06-04T20:54
     12 2021-06-04T20:55
     14 2021-06-04T20:56

$ omg logs jenkins-42-tf2rl |grep SEVERE |grep 'name: ' |awk '{print$1" "$12}'
2021-06-04T20:51:39.065580287Z jenkins-base-5zdfn,
2021-06-04T20:51:39.096242418Z jenkins-base-118gj,
2021-06-04T20:51:39.298225591Z jenkins-base-5lpnp,
2021-06-04T20:51:40.074201372Z jenkins-base-l3scz,
2021-06-04T20:52:01.436538575Z jenkins-base-tzc6b,
2021-06-04T20:52:01.448963948Z jenkins-base-mw59x,
2021-06-04T20:52:01.450036443Z jenkins-base-2gbql,
2021-06-04T20:52:02.434993030Z jenkins-base-7zmw2,
[...]
2021-06-04T20:56:58.917758049Z jenkins-base-v323t,
2021-06-04T20:56:59.898721127Z jenkins-base-htgcn,
2021-06-04T20:56:59.927598141Z jenkins-base-bshzh,
2021-06-04T20:56:59.980121155Z jenkins-base-7n78j,
~~~

# Version-Release number of selected component (if applicable):

# How reproducible:
TBD (maybe by creating too many jobs and restricting the sync-plugin with a P&F rule)

# Steps to Reproduce:
TBD

# Actual results:
The Jenkins server randomly crashes.

# Expected results:
The Jenkins server becomes more reliable, especially when communicating with the kube-apiserver.

# Additional info:
- I already tried to create an internal P&F rule, but it did not work (see the sketch below for the kind of rule meant here).
- Customers say that the only solution is to restart Jenkins.
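For reference, one way to express such an internal P&F rule is a FlowSchema that pins the Jenkins service account's traffic to its own priority level. The sketch below is illustrative only: the FlowSchema name, the `jenkins` service account, the `workload-low` priority level, and the apiVersion are assumptions and need to be adapted to the cluster.

~~~
# Illustrative FlowSchema for the Jenkins service account; all names are
# assumptions, and older clusters may need apiVersion v1alpha1.
oc apply -f - <<'EOF'
apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
kind: FlowSchema
metadata:
  name: jenkins-sync
spec:
  matchingPrecedence: 1000
  priorityLevelConfiguration:
    name: workload-low
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: jenkins
        namespace: <custom_ns_cicd>
    resourceRules:
    - verbs: ["*"]
      apiGroups: ["*"]
      resources: ["*"]
      clusterScope: true
      namespaces: ["*"]
EOF
~~~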
Consolidating this issue into Bug #1925524, since this is ultimately due to the sync plugin opening too many watches on the kube/OpenShift apiservers. The next version of the openshift-sync-plugin (1.0.47) will include performance enhancements that reduce the plugin's load on the apiservers.

*** This bug has been marked as a duplicate of bug 1925524 ***
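For anyone hitting this before the updated plugin is available, a rough sketch for gauging how many watch requests the plugin issues against the apiserver, assuming API audit logging is available on the masters and that Jenkins runs as the `jenkins` service account (both assumptions):

~~~
# Count audit events for watch requests from the Jenkins service account;
# <master_node> and the service account name are placeholders, and each
# watch may produce more than one audit event.
oc adm node-logs <master_node> --path=kube-apiserver/audit.log \
  | grep '"verb":"watch"' \
  | grep 'system:serviceaccount:<custom_ns_cicd>:jenkins' \
  | wc -l
~~~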
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days