Every centos6-regression run since 3:15pm Jenkins time yesterday has been failing in the same way, across multiple slaves. Latest example: https://build.gluster.org/job/centos6-regression/3656/console

04:47:38 Triggered by Gerrit: https://review.gluster.org/16903
04:47:38 Construction à distance sur slave1.cloud.gluster.org (rackspace_regression_2gb) in workspace /home/jenkins/root/workspace/centos6-regression
04:47:39  > git rev-parse --is-inside-work-tree # timeout=10
04:47:39 Fetching changes from the remote Git repository
04:47:39  > git config remote.origin.url git://review.gluster.org/glusterfs.git # timeout=10
04:47:39 ERROR: Error fetching remote repo 'origin'
04:47:39 hudson.plugins.git.GitException: Failed to fetch from git://review.gluster.org/glusterfs.git
04:47:39 	at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:806)
04:47:39 	at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1066)
04:47:39 	at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1097)
04:47:39 	at hudson.scm.SCM.checkout(SCM.java:485)
04:47:39 	at hudson.model.AbstractProject.checkout(AbstractProject.java:1269)
04:47:39 	at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:607)
04:47:39 	at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
04:47:39 	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:529)
04:47:39 	at hudson.model.Run.execute(Run.java:1738)
04:47:39 	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
04:47:39 	at hudson.model.ResourceController.execute(ResourceController.java:98)
04:47:39 	at hudson.model.Executor.run(Executor.java:410)
04:47:39 Caused by: hudson.plugins.git.GitException: Command "git config remote.origin.url git://review.gluster.org/glusterfs.git" returned status code 4:
04:47:39 stdout:
04:47:39 stderr: error: failed to write new configuration file .git/config.lock
04:47:39
04:47:39 	at org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1784)

It's amusing that some of the messages (especially for aborted tasks) are in French, but I don't think that's the real issue here. There seems to be some kind of permission problem affecting all machines.
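For reference, a quick way to tell whether this is really a permissions problem or something like a full disk on an affected slave (paths taken from the console log above; run as the jenkins user):

    # Is the filesystem holding the workspace full?
    df -h /home/jenkins
    # Is the .git directory still writable by the jenkins user?
    ls -ld /home/jenkins/root/workspace/centos6-regression/.git
    # git prints "failed to write new configuration file .git/config.lock"
    # both when the directory is unwritable and when the disk is full,
    # so try creating and removing the lock file by hand:
    cd /home/jenkins/root/workspace/centos6-regression
    touch .git/config.lock && rm .git/config.lock

Just a sketch, not a definitive diagnosis, but the error coming from writing .git/config.lock suggests the problem is local to the slave rather than on the Gerrit side.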
Oops, I'm not sure what went wrong here. There was a long-running prove command; I killed that and rebooted the machine. Hopefully that'll get it back up and running. I've now retriggered the job, and it's been assigned to other machines.
I was having problems with jobs for 16905 hanging, but it seems like a bit of an unlikely coincidence that those alone would account for the failures seen on multiple machines. It's also still not clear how that cause would lead to this effect. Until we figure out how they're connected, should we perhaps be more liberal about rebooting machines automatically after failed/aborted runs?
The two machines I had to step in on had both run out of disk space; the glusterd log file had grown to 15 GB. Is there a new test that's creating a lot of volumes or generating a lot of log entries?
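In case it helps whoever checks the remaining slaves, a rough sketch for spotting this before a machine fills up (the log path is an assumption based on the default glusterd log location; adjust to wherever the regression runs write their logs):

    # Overall disk usage on the slave
    df -h /
    # Largest gluster log files, biggest first
    du -sh /var/log/glusterfs/* 2>/dev/null | sort -rh | head
    # Rough growth check: line count of the glusterd log before and after a run
    wc -l /var/log/glusterfs/*glusterd*.log

Comparing the line count before and after a single regression run should also make it fairly obvious which test is responsible for the flood of log entries.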
Turns out it was due to Jeff's patch. This should be fixed now.