Description of problem:
A failed run: https://build.gluster.org/job/centos6-regression/6678/
Error reported:
Triggered by Gerrit: https://review.gluster.org/17851
ERROR: Issue with creating launcher for agent slave20.cloud.gluster.org. The agent is being disconnected
Complete exception log in the attachment.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Created attachment 1334089 [details] Centos regression failure.
slave27 is also affected. I am rebooting it after taking slave20 out of rotation, and I am inspecting the others.
Might be related to https://review.gluster.org/#/c/17789/ since this patch ran on slave20 and slave27, which are broken, and on slave25, which I am trying to investigate right now.
So I re-enabled slave27 and rebooted slave25. I will try to dig a bit more on slave20 to find out what the error is, but I will likely reboot it once I conclude I have no way to debug what is going on with it.
The processes on slave20 seem to have been started by https://review.gluster.org/18271 :

# (for i in $(ps fax |grep gluster |awk '{ print $1}' ); do cat /proc/$i/environ |sed 's/=/\n/g' |grep -a -A 1 GERRIT_CHANGE_URL |sed 's/SUDO_COMMAND//' ; done;) |grep -a http | sort -u
https://review.gluster.org/18271

I am going to reboot it and keep an eye on this one. If any other builder fails, please ping me on IRC.
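As a side note, the one-liner above splits /proc/PID/environ with sed, but environ entries are NUL-separated, so a tr-based split is more reliable. A minimal sketch of the same idea as a reusable function (the name gerrit_urls is mine, not from the report):

```shell
# Hypothetical helper, not the exact command from the report: print the
# GERRIT_CHANGE_URL recorded in the environment of each given PID.
# /proc/PID/environ is NUL-separated, so tr is safer than a sed-based split.
gerrit_urls() {
  for pid in "$@"; do
    tr '\0' '\n' < "/proc/$pid/environ" 2>/dev/null \
      | sed -n 's/^GERRIT_CHANGE_URL=//p'
  done | sort -u
}

# Example usage: inspect all gluster-related processes.
# gerrit_urls $(pgrep -f gluster)
```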
So the slave27 issue was different. For some reason I had to wipe /home/jenkins/root to make it work again. I checked the rpm content, SELinux, and disk usage: no issue with any of them. I am a bit puzzled.
So today it is slave22, with a ton of broken processes after running regressions for https://review.gluster.org/#/c/18271/
For the record, I did reboot slave22, but forgot to write it down, sorry about that.
Slave22 wasn't fully recovered; I suspect we have a second issue on our hands. I terminated all Java processes on that server and restarted the agent. The logs say nothing useful from where I am looking.
This issue needed a full node restart and an agent disconnect/reconnect. We don't see this problem anymore.