Bug 1968623

Summary: Jenkins server is crashing frequently with errors in 'WatchConnectionManager' and 'KubernetesLauncher'
Product: OpenShift Container Platform
Component: Jenkins
Version: 4.6.z
Status: CLOSED DUPLICATE
Severity: high
Priority: medium
Reporter: Marco Braga <mrbraga>
Assignee: Akram Ben Aissi <abenaiss>
QA Contact: Jitendar Singh <jitsingh>
CC: abenaiss, adam.kaplan, aos-bugs, pbhattac, vbobade
Target Milestone: ---
Target Release: 4.9.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2021-06-30 20:26:30 UTC

Description Marco Braga 2021-06-07 16:58:54 UTC
# Description of problem:

The Jenkins server needs to be restarted frequently and is reporting WARNING messages from "WatchConnectionManager" [1] and SEVERE messages from "KubernetesLauncher" [2]. The two seem to be related.

**My theory** is that Jenkins is being impacted by internal and/or external factors:
- Internal: KubernetesLauncher is scheduling too many pods (kube-apiserver API calls), consuming the buckets allowed by P&F and getting throttled; connections then raise RejectedExecutionException, and after many such exceptions Jenkins eventually seems to crash (a minimal sketch of the rejection mechanism follows this list).
- External: normal P&F operation could throttle the sync-plugin, and the same behavior could occur.
- External: when etcd is not fast enough, more errors will consequently be returned by kube-apiserver.
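
For reference, a minimal, self-contained sketch of the rejection mechanism (illustration only, not the Jenkins or okhttp code): submitting work to a ThreadPoolExecutor that has already been shut down throws RejectedExecutionException, which is what okhttp's dispatcher hits in [1A] when it tries to schedule the watch reconnect on a terminated executor.

~~~
// Illustration only: once a ThreadPoolExecutor is shut down, further submissions are rejected.
// This is the same mechanism behind "Task ... rejected from
// java.util.concurrent.ThreadPoolExecutor@...[Terminated, ...]" in [1A].
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class RejectedExecutionSketch {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        pool.submit(() -> System.out.println("first task runs"));
        pool.shutdown(); // after shutdown the pool no longer accepts new tasks

        try {
            pool.submit(() -> System.out.println("never runs"));
        } catch (RejectedExecutionException e) {
            // Corresponds to the "rejected from ...[Terminated, ...]" message in the logs.
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
~~~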

Issues that we would like to understand and isolate:
- Are both messages connected, i.e. is KubernetesLauncher impacting WatchConnectionManager?
- Could the P&F gate impact the sync-plugin's operation/self-recovery when it is throttled?
- Could these errors drive Jenkins to "freeze" and require a manual restart?
- Is there any health check that can detect this kind of situation on Jenkins?

---

[1] WatchConnectionManager

This error repeats frequently and, based on some user cases, it seems to crash Jenkins (requiring a manual restart) since Kubernetes APF (OpenShift P&F) was introduced in 4.5.15. The stack trace [1A] repeats many times [1B] in a short period of time. If these are connection retries while retrieving objects from the Kubernetes API, we need to understand whether we are having issues with traffic control by P&F.
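
To show where WatchConnectionManager fits, here is a minimal sketch of opening a watch with the fabric8 kubernetes-client (assumption: a 4.x-style API, similar to what the Jenkins kubernetes plugin bundled at the time; the namespace name is a placeholder, and this is not the plugin's actual code). Every watch() call opens a websocket managed internally by WatchConnectionManager, and every reconnect attempt is another apiserver request that P&F can throttle.

~~~
// Minimal sketch, NOT the plugin's code (assumption: fabric8 kubernetes-client 4.x-style API;
// "custom-ns-cicd" is a placeholder namespace).
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientException;
import io.fabric8.kubernetes.client.Watch;
import io.fabric8.kubernetes.client.Watcher;

public class PodWatchSketch {
    public static void main(String[] args) throws InterruptedException {
        try (KubernetesClient client = new DefaultKubernetesClient()) {
            // watch() creates a WatchConnectionManager internally; its onFailure handler is what
            // logs the "Exec Failure" WARNING when a reconnect is handed to an executor that has
            // already terminated (see [1A]).
            Watch watch = client.pods().inNamespace("custom-ns-cicd").watch(new Watcher<Pod>() {
                @Override
                public void eventReceived(Watcher.Action action, Pod pod) {
                    System.out.println(action + " " + pod.getMetadata().getName());
                }

                @Override
                public void onClose(KubernetesClientException cause) {
                    // A non-null cause means an abnormal close, e.g. the apiserver dropped or
                    // throttled the connection; the client then attempts to reconnect.
                    System.out.println("watch closed: " + cause);
                }
            });

            Thread.sleep(60_000); // keep the watch open for a while, then close it
            watch.close();
        }
    }
}
~~~

The more watches a plugin holds open, the more reconnect traffic a single throttling episode generates, which would be consistent with the warning rates counted in [1B].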

[1A] stack trace  "WARNING i.f.k.c.d.i.WatchConnectionManager$1#onFailure"
~~~

2021-06-04T20:53:27.424569522Z 2021-06-04 20:53:27.424+0000 [id=64143]  WARNING i.f.k.c.d.i.WatchConnectionManager$1#onFailure: Exec Failure
2021-06-04T20:53:27.424569522Z java.util.concurrent.RejectedExecutionException: Task okhttp3.RealCall$AsyncCall@7aa11d55 rejected from java.util.concurrent.ThreadPoolExecutor@2b5433f8[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 12799]
2021-06-04T20:53:27.424569522Z  at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
2021-06-04T20:53:27.424569522Z  at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
2021-06-04T20:53:27.424569522Z  at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
2021-06-04T20:53:27.424569522Z  at okhttp3.RealCall$AsyncCall.executeOn(RealCall.java:182)
2021-06-04T20:53:27.424569522Z Caused: java.io.InterruptedIOException: executor rejected
2021-06-04T20:53:27.424569522Z  at okhttp3.RealCall$AsyncCall.executeOn(RealCall.java:185)
2021-06-04T20:53:27.424569522Z  at okhttp3.Dispatcher.promoteAndExecute(Dispatcher.java:186)
2021-06-04T20:53:27.424569522Z  at okhttp3.Dispatcher.enqueue(Dispatcher.java:137)
2021-06-04T20:53:27.424569522Z  at okhttp3.RealCall.enqueue(RealCall.java:126)
2021-06-04T20:53:27.424569522Z  at okhttp3.internal.ws.RealWebSocket.connect(RealWebSocket.java:193)
2021-06-04T20:53:27.424569522Z  at okhttp3.OkHttpClient.newWebSocket(OkHttpClient.java:435)
2021-06-04T20:53:27.424569522Z  at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.runWatch(WatchConnectionManager.java:153)
2021-06-04T20:53:27.424569522Z  at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$1200(WatchConnectionManager.java:48)
2021-06-04T20:53:27.424569522Z  at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2$1.execute(WatchConnectionManager.java:316)
2021-06-04T20:53:27.424569522Z  at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$NamedRunnable.run(WatchConnectionManager.java:405)
2021-06-04T20:53:27.424569522Z  at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
2021-06-04T20:53:27.424569522Z  at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
2021-06-04T20:53:27.424569522Z  at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
2021-06-04T20:53:27.424569522Z  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
2021-06-04T20:53:27.424569522Z  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
2021-06-04T20:53:27.424569522Z  at java.base/java.lang.Thread.run(Thread.java:834)
~~~

[1B] The error repeating many times:
~~~
$ omg logs jenkins-42-tf2rl |grep "i.f.k.c.d.i.WatchConnectionManager" |wc -l 
5752

$ omg logs jenkins-42-tf2rl |grep "i.f.k.c.d.i.WatchConnectionManager" |awk -F':' '{print$1":"$2}' |sort |uniq -c
    266 2021-06-04T20:51
   1061 2021-06-04T20:52
    897 2021-06-04T20:53
   1094 2021-06-04T20:54
   1106 2021-06-04T20:55
   1106 2021-06-04T20:56
    222 2021-06-04T20:57

$ omg logs jenkins-42-tf2rl |grep "i.f.k.c.d.i.WatchConnectionManager" |head -n1
2021-06-04T20:51:51.422655310Z 2021-06-04 20:51:51.422+0000 [id=64143]	WARNING	i.f.k.c.d.i.WatchConnectionManager$1#onFailure: Exec Failure

$ omg logs jenkins-42-tf2rl |grep "i.f.k.c.d.i.WatchConnectionManager" |tail -n1
2021-06-04T20:57:16.022918463Z 2021-06-04 20:57:16.022+0000 [id=77825]	WARNING	i.f.k.c.d.i.WatchConnectionManager$1#onFailure: Exec Failure
~~~


[2] Jenkins server is crashing when provisioning many jenkins-base pods in a short period of time:
~~~
./inspect.local.1906246428634946458/namespaces/<custom_ns_cicd>/pods/jenkins-42-tf2rl/jenkins/jenkins/logs/current.log:132354:2021-06-04T20:56:59.980121155Z 2021-06-04 20:56:59.979+0000 [id=94871]	SEVERE	o.c.j.p.k.KubernetesLauncher#logLastLines: Error in provisioning; agent=KubernetesSlave name: jenkins-base-7n78j, template=PodTemplate{id='2cb93a09-c978-4d98-8ac6-993f27aca0fb'inheritFrom='', name='jenkins-base', namespace='', hostNetwork=false, instanceCap=25, idleMinutes=5, label='jenkins-base', nodeSelector='', nodeUsageMode=EXCLUSIVE, workspaceVolume=EmptyDirWorkspaceVolume [memory=false], volumes=[org.csanchez.jenkins.plugins.kubernetes.volumes.ConfigMapVolume@9aa5f96], containers=[ContainerTemplate{name='jnlp', image='image-registry.openshift-image-registry.svc:5000/openshift/ose-jenkins-agent-base', alwaysPullImage=true, workingDir='/tmp', command='', args='${computer.jnlpmac} ${computer.name}', ttyEnabled=true, resourceRequestCpu='', resourceRequestMemory='', resourceLimitCpu='', resourceLimitMemory='', livenessProbe=ContainerLivenessProbe{execArgs='', timeoutSeconds=0, initialDelaySeconds=0, failureThreshold=0, periodSeconds=0, successThreshold=0}}]}. Container jnlp. Logs: INFO: Connecting to 100.123.158.68:50000

$ omg logs jenkins-42-tf2rl |grep SEVERE |grep 'name: ' |awk '{print$1" "$12}'|wc -l 
67

$ omg logs jenkins-42-tf2rl |grep SEVERE |grep 'name: ' |awk '{print$1" "$12}' |awk -F':' '{print$1":"$2}' |sort |uniq -c
      4 2021-06-04T20:51
     16 2021-06-04T20:52
      9 2021-06-04T20:53
     12 2021-06-04T20:54
     12 2021-06-04T20:55
     14 2021-06-04T20:56

$ omg logs jenkins-42-tf2rl |grep SEVERE |grep 'name: ' |awk '{print$1" "$12}'
2021-06-04T20:51:39.065580287Z jenkins-base-5zdfn,
2021-06-04T20:51:39.096242418Z jenkins-base-118gj,
2021-06-04T20:51:39.298225591Z jenkins-base-5lpnp,
2021-06-04T20:51:40.074201372Z jenkins-base-l3scz,
2021-06-04T20:52:01.436538575Z jenkins-base-tzc6b,
2021-06-04T20:52:01.448963948Z jenkins-base-mw59x,
2021-06-04T20:52:01.450036443Z jenkins-base-2gbql,
2021-06-04T20:52:02.434993030Z jenkins-base-7zmw2,
[...]
2021-06-04T20:56:58.917758049Z jenkins-base-v323t,
2021-06-04T20:56:59.898721127Z jenkins-base-htgcn,
2021-06-04T20:56:59.927598141Z jenkins-base-bshzh,
2021-06-04T20:56:59.980121155Z jenkins-base-7n78j,
~~~


# Version-Release number of selected component (if applicable):

# How reproducible: TBD (maybe by creating too many jobs and restricting the sync-plugin with a P&F rule)

# Steps to Reproduce: TBD


# Actual results:

The Jenkins server crashes randomly.

# Expected results:

The Jenkins server should become more reliable, particularly when communicating with kube-apiserver.

# Additional info:

- I already tried to create an internal P&F rule, but it did not work
- Customers said that the only solution is to restart Jenkins

Comment 9 Adam Kaplan 2021-06-30 20:26:30 UTC
Consolidating this issue into Bug #1925524, since this is ultimately due to the sync plugin opening too many watches on the kube/OpenShift apiservers.
The next version of the openshift-sync-plugin (1.0.47) will include performance enhancements that reduce the plugin's load on the apiservers.

*** This bug has been marked as a duplicate of bug 1925524 ***

Comment 10 Red Hat Bugzilla 2023-09-15 01:09:09 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days