Bug 1619838
Summary: | Jenkins connection issues failing tests | |
---|---|---|---
Product: | [Community] GlusterFS | Reporter: | Yaniv Kaul <ykaul>
Component: | project-infrastructure | Assignee: | Nigel Babu <nigelb>
Status: | CLOSED CURRENTRELEASE | QA Contact: |
Severity: | urgent | Docs Contact: |
Priority: | unspecified | |
Version: | mainline | CC: | bugs, gluster-infra, mscherer, nigelb
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2018-10-03 04:13:01 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Yaniv Kaul
2018-08-21 21:17:28 UTC
The bit that's failing here is the git clone, so we need to check the following bits:

* Did the clone attempt get logged by git daemon.
* Did the clone attempt get interrupted by a network event between the node and the Gerrit server.
* Did the clone attempt get interrupted by something on the Gerrit server itself.

I'm taking this bug to look at the Gerrit side of things, but if it's not Gerrit, I'm going to bounce this one over to you, Michael.

(In reply to Nigel Babu from comment #1)
> The bit that's failing here is the git clone, so we need to check the
> following bits:
>
> * Did the clone attempt get logged by git daemon.

Can we take the opportunity and look at shallow clone?

BTW, it doesn't happen on a single host. For example:
https://build.gluster.org/job/smoke/43219/console

23:39:01 Suppressed: hudson.remoting.Channel$CallSiteStackTrace: Remote call to builder23.int.rht.gluster.org
23:39:01 at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1741)
23:39:01 at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:357)

So not wanting to say network is perfect, but this started since the upgrade to new gerrit, no ? Could it be some issue where gerrit drop if there is too much client or something like this ?

[root@gerrit-new logs]# grep 'reset by peer' error_log | wc -l
18

Seems it happens quite often :/ And it has been happening for a long time too, so it's unlikely to be the upgrade, if the error in the log is the same as the one reported.

(In reply to M. Scherer from comment #4)
> So not wanting to say network is perfect, but this started since the upgrade
> to new gerrit, no ? Could it be some issue where gerrit drop if there is too
> much client or something like this ?

I wonder if it happens when I 'flood' it with multiple patches (all in the same topic).

Yeah, I wanted to explore that road too by doing *cough* load testing of the git server on the staging env, but there is no git port (or I am not awake enough).

(In reply to Yaniv Kaul from comment #7)
> (In reply to M. Scherer from comment #4)
> > So not wanting to say network is perfect, but this started since the upgrade
> > to new gerrit, no ? Could it be some issue where gerrit drop if there is too
> > much client or something like this ?
>
> I wonder if it happens when I 'flood' it with multiple patches (all in the
> same topic).

Can we look at Gerrit logs?

I did, and didn't find anything that seemed relevant. I may have missed something, however, and Nigel is also looking. It's a transient issue, so not easy to diagnose.

So, Nigel pointed out that git is served by xinetd, and the log shows nothing except some IPv6 errors. While that might be related, I think it is not, especially since that's only for the Rackspace builder, not the internal one.

Mhhh:

août 21 20:39:36 gerrit-new.rht.gluster.org xinetd[16437]: FAIL: git per_source_limit from=::ffff:8.43.85.181

I suspect that might be the cause. 8.43.85.181 is the firewall IP. Two possible solutions:

- have a way to use the internal IP
- add an exception for that IP

The 2nd is easier, the 1st is cleaner. I will start with the 2nd.

So I pushed a fix; it should deploy (unless the wifi breaks in my train). The good news is that we can claim we made so much productivity progress that we hit the limit, so that's positive :p

The deploy seems to have fixed it. Closing this bug.
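
For context on the per_source_limit failure above: xinetd logs "FAIL: <service> per_source_limit from=<addr>" when a single source address already holds more concurrent connections to a service than its per_source setting allows, which is exactly what happens when many Jenkins builders reach Gerrit through one NATed firewall IP. Below is a minimal, hypothetical sketch of the kind of xinetd stanza involved and of relaxing that cap; the file path, server arguments, and numbers are illustrative assumptions, not the actual change that was deployed.

```
# /etc/xinetd.d/git -- hypothetical sketch, not the actual gluster-infra change.
# xinetd rejects a connection and logs "FAIL: git per_source_limit from=<ip>"
# once one source IP already holds per_source concurrent connections; all
# builders behind the firewall appear as 8.43.85.181, so they share one bucket.
service git
{
        disable         = no
        socket_type     = stream
        wait            = no
        user            = nobody
        server          = /usr/libexec/git-core/git-daemon
        server_args     = --base-path=/var/lib/git --export-all --inetd --verbose
        log_on_failure  += USERID
        # Raise the per-source cap (or set it to UNLIMITED) so NATed builders
        # are not turned away; the value 50 is an assumption for illustration.
        per_source      = 50
        instances       = 100
}
```

The other option raised in the thread, letting internal builders reach the git port directly instead of going through the firewall, would spread connections across real source IPs and leave the limit untouched, at the cost of routing changes on the builders.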