Bug 1968623 - Jenkins server is crashing frequently with errors in 'WatchConnectionManager' and 'KubernetesLauncher'
Summary: Jenkins server is crashing frequently with errors in 'WatchConnectionManager'...
Keywords:
Status: CLOSED DUPLICATE of bug 1925524
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Jenkins
Version: 4.6.z
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Akram Ben Aissi
QA Contact: Jitendar Singh
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-07 16:58 UTC by Marco Braga
Modified: 2024-12-20 20:12 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-30 20:26:30 UTC
Target Upstream Version:
Embargoed:



Description Marco Braga 2021-06-07 16:58:54 UTC
# Description of problem:

The Jenkins server needs to be restarted frequently and is reporting WARNING messages from "WatchConnectionManager" [1] and SEVERE messages from "KubernetesLauncher" [2]. Both appear to be connected.

**My theory** is that Jenkins is being impacted by internal and/or external factors:
- Internal: KubernetesLauncher is scheduling too many pods (kube-apiserver API calls), consuming the request budget allowed by P&F; the resulting throttling makes connections raise RejectedExecutionException, and after many such exceptions Jenkins eventually seems to crash. (One way to check for P&F throttling is sketched below.)
- External: Normal P&F operation could throttle the sync-plugin, and the same behavior could happen.
- External: When etcd is not fast enough, more errors will consequently be found/returned by the kube-apiserver.
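
If the internal theory is right, P&F throttling should be visible in the kube-apiserver metrics. A minimal check, assuming cluster-admin access; the metric names come from the upstream Kubernetes APF instrumentation:
~~~
# Hypothetical check: look for APF rejections and queued requests on the
# kube-apiserver; a growing rejected_requests_total points to throttling.
$ oc get --raw /metrics | grep -E 'apiserver_flowcontrol_(rejected_requests_total|current_inqueue_requests)' | head
~~~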

Questions we need to understand and isolate:
- Are both messages connected, i.e. is KubernetesLauncher impacting WatchConnectionManager?
- Could the P&F gate impact the sync-plugin's operation/self-recovery when it is being throttled?
- Are these the errors that drive Jenkins to "freeze" and require a manual restart?
- Is there any health check that can detect this kind of situation on Jenkins? (A possible log-based probe is sketched below.)
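
As a stopgap, a log-based probe could flag the failure mode before Jenkins freezes entirely. A minimal sketch, assuming the pod name and the warning signature seen below; the threshold of 100 per minute is an arbitrary guess:
~~~
# Hypothetical probe: count recent WatchConnectionManager failures and exit
# non-zero above a threshold, so an external monitor can restart the pod.
$ oc logs jenkins-42-tf2rl --since=1m \
    | grep -cF 'WatchConnectionManager$1#onFailure' \
    | awk '$1 > 100 { exit 1 }'
~~~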

---

[1] WatchConnectionManager

This error repeats frequently and, based on some user cases, it seems to crash Jenkins, requiring a manual restart; it started after Kubernetes APF (OpenShift P&F) was introduced in 4.5.15. The stack trace [1A] repeats many times [1B] in a short period of time. If these are connection retries when retrieving objects from the Kubernetes API, we need to understand whether we are having issues with traffic control by P&F.

[1A] stack trace  "WARNING i.f.k.c.d.i.WatchConnectionManager$1#onFailure"
~~~

2021-06-04T20:53:27.424569522Z 2021-06-04 20:53:27.424+0000 [id=64143]  WARNING i.f.k.c.d.i.WatchConnectionManager$1#onFailure: Exec Failure
2021-06-04T20:53:27.424569522Z java.util.concurrent.RejectedExecutionException: Task okhttp3.RealCall$AsyncCall@7aa11d55 rejected from java.util.concurrent.ThreadPoolExecutor@2b5433f8[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 12799]
2021-06-04T20:53:27.424569522Z  at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
2021-06-04T20:53:27.424569522Z  at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
2021-06-04T20:53:27.424569522Z  at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
2021-06-04T20:53:27.424569522Z  at okhttp3.RealCall$AsyncCall.executeOn(RealCall.java:182)
2021-06-04T20:53:27.424569522Z Caused: java.io.InterruptedIOException: executor rejected
2021-06-04T20:53:27.424569522Z  at okhttp3.RealCall$AsyncCall.executeOn(RealCall.java:185)
2021-06-04T20:53:27.424569522Z  at okhttp3.Dispatcher.promoteAndExecute(Dispatcher.java:186)
2021-06-04T20:53:27.424569522Z  at okhttp3.Dispatcher.enqueue(Dispatcher.java:137)
2021-06-04T20:53:27.424569522Z  at okhttp3.RealCall.enqueue(RealCall.java:126)
2021-06-04T20:53:27.424569522Z  at okhttp3.internal.ws.RealWebSocket.connect(RealWebSocket.java:193)
2021-06-04T20:53:27.424569522Z  at okhttp3.OkHttpClient.newWebSocket(OkHttpClient.java:435)
2021-06-04T20:53:27.424569522Z  at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.runWatch(WatchConnectionManager.java:153)
2021-06-04T20:53:27.424569522Z  at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$1200(WatchConnectionManager.java:48)
2021-06-04T20:53:27.424569522Z  at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2$1.execute(WatchConnectionManager.java:316)
2021-06-04T20:53:27.424569522Z  at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$NamedRunnable.run(WatchConnectionManager.java:405)
2021-06-04T20:53:27.424569522Z  at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
2021-06-04T20:53:27.424569522Z  at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
2021-06-04T20:53:27.424569522Z  at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
2021-06-04T20:53:27.424569522Z  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
2021-06-04T20:53:27.424569522Z  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
2021-06-04T20:53:27.424569522Z  at java.base/java.lang.Thread.run(Thread.java:834)
~~~

[1B] The error repeats many times:
~~~
$ omg logs jenkins-42-tf2rl |grep "i.f.k.c.d.i.WatchConnectionManager" |wc -l 
5752

$ omg logs jenkins-42-tf2rl |grep "i.f.k.c.d.i.WatchConnectionManager" |awk -F':' '{print$1":"$2}' |sort |uniq -c
    266 2021-06-04T20:51
   1061 2021-06-04T20:52
    897 2021-06-04T20:53
   1094 2021-06-04T20:54
   1106 2021-06-04T20:55
   1106 2021-06-04T20:56
    222 2021-06-04T20:57

$ omg logs jenkins-42-tf2rl |grep "i.f.k.c.d.i.WatchConnectionManager" |head -n1
2021-06-04T20:51:51.422655310Z 2021-06-04 20:51:51.422+0000 [id=64143]	WARNING	i.f.k.c.d.i.WatchConnectionManager$1#onFailure: Exec Failure

$ omg logs jenkins-42-tf2rl |grep "i.f.k.c.d.i.WatchConnectionManager" |tail -n1
2021-06-04T20:57:16.022918463Z 2021-06-04 20:57:16.022+0000 [id=77825]	WARNING	i.f.k.c.d.i.WatchConnectionManager$1#onFailure: Exec Failure
~~~


[2] The Jenkins server is crashing when provisioning many jenkins-base pods in a short period of time:
~~~
./inspect.local.1906246428634946458/namespaces/<custom_ns_cicd>/pods/jenkins-42-tf2rl/jenkins/jenkins/logs/current.log:132354:2021-06-04T20:56:59.980121155Z 2021-06-04 20:56:59.979+0000 [id=94871]	SEVERE	o.c.j.p.k.KubernetesLauncher#logLastLines: Error in provisioning; agent=KubernetesSlave name: jenkins-base-7n78j, template=PodTemplate{id='2cb93a09-c978-4d98-8ac6-993f27aca0fb'inheritFrom='', name='jenkins-base', namespace='', hostNetwork=false, instanceCap=25, idleMinutes=5, label='jenkins-base', nodeSelector='', nodeUsageMode=EXCLUSIVE, workspaceVolume=EmptyDirWorkspaceVolume [memory=false], volumes=[org.csanchez.jenkins.plugins.kubernetes.volumes.ConfigMapVolume@9aa5f96], containers=[ContainerTemplate{name='jnlp', image='image-registry.openshift-image-registry.svc:5000/openshift/ose-jenkins-agent-base', alwaysPullImage=true, workingDir='/tmp', command='', args='${computer.jnlpmac} ${computer.name}', ttyEnabled=true, resourceRequestCpu='', resourceRequestMemory='', resourceLimitCpu='', resourceLimitMemory='', livenessProbe=ContainerLivenessProbe{execArgs='', timeoutSeconds=0, initialDelaySeconds=0, failureThreshold=0, periodSeconds=0, successThreshold=0}}]}. Container jnlp. Logs: INFO: Connecting to 100.123.158.68:50000

$ omg logs jenkins-42-tf2rl |grep SEVERE |grep 'name: ' |awk '{print$1" "$12}'|wc -l 
67

$ omg logs jenkins-42-tf2rl |grep SEVERE |grep 'name: ' |awk '{print$1" "$12}' |awk -F':' '{print$1":"$2}' |sort |uniq -c
      4 2021-06-04T20:51
     16 2021-06-04T20:52
      9 2021-06-04T20:53
     12 2021-06-04T20:54
     12 2021-06-04T20:55
     14 2021-06-04T20:56

$ omg logs jenkins-42-tf2rl |grep SEVERE |grep 'name: ' |awk '{print$1" "$12}'
2021-06-04T20:51:39.065580287Z jenkins-base-5zdfn,
2021-06-04T20:51:39.096242418Z jenkins-base-118gj,
2021-06-04T20:51:39.298225591Z jenkins-base-5lpnp,
2021-06-04T20:51:40.074201372Z jenkins-base-l3scz,
2021-06-04T20:52:01.436538575Z jenkins-base-tzc6b,
2021-06-04T20:52:01.448963948Z jenkins-base-mw59x,
2021-06-04T20:52:01.450036443Z jenkins-base-2gbql,
2021-06-04T20:52:02.434993030Z jenkins-base-7zmw2,
[...]
2021-06-04T20:56:58.917758049Z jenkins-base-v323t,
2021-06-04T20:56:59.898721127Z jenkins-base-htgcn,
2021-06-04T20:56:59.927598141Z jenkins-base-bshzh,
2021-06-04T20:56:59.980121155Z jenkins-base-7n78j,
~~~
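
To further test whether the two error streams are connected, the per-minute counts can be lined up side by side. A small sketch using the same must-gather log as above:
~~~
# Print per-minute counts for both error classes so spikes can be compared.
$ for pat in 'i.f.k.c.d.i.WatchConnectionManager' 'SEVERE.*KubernetesLauncher'; do
    echo "== $pat =="
    omg logs jenkins-42-tf2rl | grep -E "$pat" | awk -F':' '{print $1":"$2}' | sort | uniq -c
  done
~~~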


# Version-Release number of selected component (if applicable):

# How reproducible: TBD (perhaps by creating too many jobs and restricting the sync-plugin with a P&F rule)

# Steps to Reproduce: TBD; a possible starting point is sketched below.
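
A hypothetical starting point, assuming a pipeline BuildConfig exists in the namespace; the name `sample-pipeline` and the count of 50 are placeholders:
~~~
# Trigger many pipeline builds at once so the Kubernetes plugin has to
# provision a burst of agent pods against the kube-apiserver.
$ for i in $(seq 1 50); do
    oc start-build sample-pipeline -n <custom_ns_cicd>
  done
~~~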


# Actual results:

The Jenkins server crashes randomly.

# Expected results:

The Jenkins server should become more reliable, particularly when communicating with the kube-apiserver.

# Additional info:

- I already tried to create an internal P&F rule, but it did not work (the kind of rule I tried is sketched below)
- Customers said that the only solution is to restart Jenkins
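
For reference, a minimal sketch of the kind of FlowSchema I attempted; the service account name, the priority level, and the API group version (v1alpha1 on 4.6, v1beta1 on newer releases) are assumptions that may need adjusting:
~~~
$ oc apply -f - <<'EOF'
apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
kind: FlowSchema
metadata:
  name: jenkins-sync
spec:
  # Assumed priority level; workload-high ships with APF by default.
  priorityLevelConfiguration:
    name: workload-high
  matchingPrecedence: 1000
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: jenkins                # assumed SA used by the sync-plugin
        namespace: <custom_ns_cicd>
    resourceRules:
    - verbs: ["*"]
      apiGroups: ["*"]
      resources: ["*"]
      namespaces: ["*"]
EOF
~~~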

Comment 9 Adam Kaplan 2021-06-30 20:26:30 UTC
Consolidating this issue into Bug #1925524, since this is ultimately due to the sync plugin opening too many watches on the kube/OpenShift apiservers.
The next version of the openshift-sync-plugin (1.0.47) will include performance enhancements that reduce the plugin's load on the apiservers.

*** This bug has been marked as a duplicate of bug 1925524 ***

Comment 10 Red Hat Bugzilla 2023-09-15 01:09:09 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

