Bug 1498390 - Centos regressions fail on slave20
Summary: Centos regressions fail on slave20
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: project-infrastructure
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-10-04 08:12 UTC by Nithya Balachandran
Modified: 2017-12-06 08:46 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-06 08:46:24 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
Centos regression failure. (2.67 KB, text/plain)
2017-10-04 08:12 UTC, Nithya Balachandran

Description Nithya Balachandran 2017-10-04 08:12:13 UTC
Description of problem:

A failed run:
https://build.gluster.org/job/centos6-regression/6678/


Error reported:

Triggered by Gerrit: https://review.gluster.org/17851
ERROR: Issue with creating launcher for agent slave20.cloud.gluster.org. The agent is being disconnected


Complete exception log in the attachment.


Comment 1 Nithya Balachandran 2017-10-04 08:12:42 UTC
Created attachment 1334089
Centos regression failure.

Comment 2 M. Scherer 2017-10-04 08:27:38 UTC
slave27 is also affected. I am rebooting it, after taking slave20 out of rotation and inspecting the others.

Comment 3 M. Scherer 2017-10-04 13:14:35 UTC
Might be related to https://review.gluster.org/#/c/17789/, since this patch ran on slave20 and slave27, which are broken, and on slave25, which I am investigating right now.

Comment 4 M. Scherer 2017-10-04 13:30:19 UTC
So I re-enabled slave27 and rebooted slave25. I will dig a bit more on slave20 to find out what the error is, but I will likely reboot it once I run out of ideas for debugging what is going on with it.

Comment 5 M. Scherer 2017-10-04 14:01:23 UTC
The process on slave20 seems to have been started by https://review.gluster.org/18271 

# (for i in $(ps fax |grep gluster |awk '{ print $1}' ); do cat /proc/$i/environ |sed 's/=/\n/g' |grep -a -A 1 GERRIT_CHANGE_URL |sed 's/SUDO_COMMAND//' ; done;) |grep -a http | sort -u

https://review.gluster.org/18271
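
For reference, the same lookup can be written a bit more defensively. This is only a sketch, not the command that was run on slave20: the function name gerrit_urls and its optional directory argument are hypothetical, and it matches on /proc/<pid>/cmdline rather than on ps output. It relies on cmdline and environ being NUL-separated, so both are converted to one entry per line before matching.

```shell
#!/bin/sh
# Print the unique GERRIT_CHANGE_URL values found in the environment of
# every process whose command line mentions "gluster".
# The optional first argument (hypothetical) replaces /proc, which makes
# the function usable against a test directory.
gerrit_urls() {
    proc_root="${1:-/proc}"
    for dir in "$proc_root"/[0-9]*; do
        # Skip processes that vanished or that we are not allowed to read.
        [ -r "$dir/cmdline" ] && [ -r "$dir/environ" ] || continue
        # cmdline is NUL-separated: turn it into one argument per line.
        tr '\0' '\n' < "$dir/cmdline" | grep -q gluster || continue
        # environ is NUL-separated too: print only the value of the variable.
        tr '\0' '\n' < "$dir/environ" | sed -n 's/^GERRIT_CHANGE_URL=//p'
    done | sort -u
}
```

Calling gerrit_urls with no argument as root on a builder should list the Gerrit changes whose leftover processes are still running. Processes whose environ cannot be read are skipped silently.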


I am going to reboot and keep an eye on this one. If any other builder fails, please ping me on IRC.

Comment 6 M. Scherer 2017-10-05 10:57:07 UTC
So the slave27 issue was different. I had to wipe /home/jenkins/root to make it work again, for some reason. I looked at the RPM content and found no issue; there was no issue with SELinux and the disk was not full. I am a bit puzzled.

Comment 7 M. Scherer 2017-10-06 07:51:38 UTC
So today it is slave22, with a ton of broken processes after running regressions for https://review.gluster.org/#/c/18271/

Comment 8 M. Scherer 2017-10-09 14:11:57 UTC
For the record, I did reboot slave22, but forgot to write it down, sorry about that.

Comment 9 M. Scherer 2017-10-09 14:17:46 UTC
Slave22 wasn't fully recovered; I suspect we have a second issue on our hands. I terminated all Java processes on that server and restarted the agent.

The logs say nothing useful, from what I can see.

Comment 10 Nigel Babu 2017-12-06 08:46:24 UTC
This issue needed a full node restart and a disconnect/reconnect. We don't see this problem anymore.

