I reinstalled our koji buildvm builders recently with Fedora 21.
For most builds they seem to work just fine, but after a while random builds get stuck on them and do not complete.
├─kojid,447 /usr/sbin/kojid --force-lock --verbose
│ ├─kojid,3555 /usr/sbin/kojid --force-lock --verbose
│ │ └─mock,3811 -tt /usr/sbin/mock -r koji/f22-build-2775703-442858 --no-clean ...
│ │ └─yum-builddep,3861 -tt /usr/bin/yum-builddep --installroot...
│ │ ├─urlgrabber-ext-,3864 /usr/libexec/urlgrabber-ext-down
│ │ └─urlgrabber-ext-,3867 /usr/libexec/urlgrabber-ext-down
root 3864 0.0 0.1 319144 15324 ? S 17:09 0:01 /usr/bin/python /usr/libexec/urlgrabber-ext-down
root 3867 99.8 0.1 319148 15276 ? R 17:09 150:32 /usr/bin/python /usr/libexec/urlgrabber-ext-down
# ls -l /proc/3867/fd
lr-x------. 1 root root 64 Jan 5 19:20 0 -> pipe:
l-wx------. 1 root root 64 Jan 5 19:20 1 -> pipe:
l-wx------. 1 root root 64 Jan 5 19:20 2 -> pipe:
lrwx------. 1 root root 64 Jan 5 19:20 3 -> socket:
l-wx------. 1 root root 64 Jan 5 19:20 4 -> /var/lib/mock/f22-build-2775703-442858/root/var/log/yum.log
l-wx------. 1 root root 64 Jan 5 19:20 5 -> /var/lib/mock/f22-build-2775703-442858/root/var/cache/yum/build/packages/opus-1.1.1-0.1.beta.fc22.i686.rpm
lrwx------. 1 root root 64 Jan 5 19:20 6 -> socket:
lrwx------. 1 root root 64 Jan 5 19:20 7 -> anon_inode:[eventpoll]
lrwx------. 1 root root 64 Jan 5 19:20 8 -> socket:
#0 0x00007fff963fec29 in clock_gettime ()
#1 0x00007fe9ebd919fd in clock_gettime () from /lib64/libc.so.6
#2 0x00007fe9e0ba1f83 in curlx_tvnow () from /lib64/libcurl.so.4
#3 0x00007fe9e0ba34a2 in Curl_pgrsUpdate () from /lib64/libcurl.so.4
#4 0x00007fe9e0bcafce in multi_runsingle () from /lib64/libcurl.so.4
#5 0x00007fe9e0bcbd71 in curl_multi_perform () from /lib64/libcurl.so.4
#6 0x00007fe9e0bc2d0c in curl_easy_perform () from /lib64/libcurl.so.4
#7 0x00007fe9e0e0fef3 in do_curl_perform () from /usr/lib64/python2.7/site-packages/pycurl.so
#8 0x00007fe9eca4d81c in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#9 0x00007fe9eca4cbe6 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#10 0x00007fe9eca4cbe6 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#11 0x00007fe9eca4cbe6 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#12 0x00007fe9eca4e400 in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#13 0x00007fe9ec9da5dc in function_call () from /lib64/libpython2.7.so.1.0
#14 0x00007fe9ec9b5903 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#15 0x00007fe9ec9c4815 in instancemethod_call () from /lib64/libpython2.7.so.1.0
#16 0x00007fe9ec9b5903 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#17 0x00007fe9eca0c837 in slot_tp_init () from /lib64/libpython2.7.so.1.0
#18 0x00007fe9eca0b56f in type_call () from /lib64/libpython2.7.so.1.0
#19 0x00007fe9ec9b5903 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#20 0x00007fe9eca4b585 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#21 0x00007fe9eca4cbe6 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#22 0x00007fe9eca4e400 in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#23 0x00007fe9eca4e4f9 in PyEval_EvalCode () from /lib64/libpython2.7.so.1.0
#24 0x00007fe9eca6790f in run_mod () from /lib64/libpython2.7.so.1.0
#25 0x00007fe9eca68b32 in PyRun_FileExFlags () from /lib64/libpython2.7.so.1.0
#26 0x00007fe9eca69d47 in PyRun_SimpleFileExFlags () from /lib64/libpython2.7.so.1.0
#27 0x00007fe9eca7b2ea in Py_Main () from /lib64/libpython2.7.so.1.0
#28 0x00007fe9ebca2fe0 in __libc_start_main () from /lib64/libc.so.6
#29 0x000000000040071e in _start ()
Attaching strace/ltrace just hangs and produces no output.
Happy to try and gather more info...
Thanks for the report!
Did it never happen before the update?
What was the version of libcurl before the update?
I failed to reproduce it locally.
Which mirror is actually used to download the packages from?
Before the update they were running Fedora 20, so:
Packages are downloaded from a local squid proxy, kojipkgs.fedoraproject.org
I do see a bunch of stuff in recv queue:
tcp ESTAB 2364240 0 10.5.125.105:41955 10.5.125.36:80
# fuser -n tcp 41955
# ps -axuww | grep 16175
root 16175 99.9 0.1 319148 15284 ? R Jan05 1130:15 /usr/bin/python /usr/libexec/urlgrabber-ext-down
It seems to work fine here with a squid proxy running on localhost.
Could you please check it with libcurl-7.32.0-17.fc20?
That version also includes a patch for bug #1166239 that already caused some issues elsewhere (bug #1172572).
Does it consume CPU or hangs passively when it happens?
(In reply to Kamil Dudka from comment #4)
> It seems to work fine here with an squid proxy running on localhost.
> Could you please check it with libcurl-7.32.0-17.fc20?
You want me to try that f20 package on an f21 host?
Note that the problem hosts here are f21. They worked fine _before_ as f20.
> That version also includes a patch for bug #1166239 that already caused some
> issues elsewhere (bug #1172572).
> Does it consume CPU or hangs passively when it happens?
It consumes 100% cpu.
(In reply to Kevin Fenzi from comment #5)
> You want me to try that f20 package on an f21 host?
Trying it on a f20 host should also be fine. You choose.
> Note that the problem hosts here are f21. They worked fine _before_ as f20.
You referred to libcurl-7.32.0-15.fc20 as the working version. However, it is actually libcurl-7.32.0-17.fc20 that is now tagged f20-updates, and chances are it suffers from the same bug. That is why I am asking about f20.
Per our conversation on irc I tried curl-7.37.0-9.fc21.x86_64 on the f21 hosts.
The problem still persists. ;(
root 708 99.8 0.1 319412 15116 ? R 17:32 13:28 /usr/bin/python /usr/libexec/urlgrabber-ext-down
# ls -l /proc/708/fd/
lr-x------. 1 root root 64 Jan 6 17:43 0 -> pipe:
l-wx------. 1 root root 64 Jan 6 17:43 1 -> pipe:
l-wx------. 1 root root 64 Jan 6 17:43 2 -> pipe:
lrwx------. 1 root root 64 Jan 6 17:43 3 -> socket:
l-wx------. 1 root root 64 Jan 6 17:43 4 -> /var/lib/mock/f22-build-2780382-443173/root/var/log/yum.log
lrwx------. 1 root root 64 Jan 6 17:43 5 -> anon_inode:[eventpoll]
lrwx------. 1 root root 64 Jan 6 17:43 6 -> socket:
l-wx------. 1 root root 64 Jan 6 17:43 7 -> /var/lib/mock/f22-build-2780382-443173/root/var/cache/yum/build/931b19ae55c88d785cb04e67f1c9ca417d1cc61271c078df8c6257824b64eadb-primary.sqlite.bz2
lrwx------. 1 root root 64 Jan 6 17:43 8 -> socket:
# ss -an | grep 80
tcp ESTAB 242708 0 10.5.125.150:55049 10.5.125.36:80
-rw-rw-r--. 1 root mock 10133504 Jan 6 17:32 /var/lib/mock/f22-build-2780382-443173/root/var/cache/yum/build/931b19ae55c88d785cb04e67f1c9ca417d1cc61271c078df8c6257824b64eadb-primary.sqlite.bz2
Thanks for the additional info!
Could you please install libcurl-7.37.0-11.fc21 on a f20 builder and check whether it triggers the bug?
We need to confirm that this bug is really caused by a change in curl...
ok. I upgraded our buildhw boxes (f20) to the f21 curl... and so far, no bug. ;(
It could just take longer on hardware... I could try reinstalling a few VMs with the f20 and f21 curl and see if it happens there.
Not sure where else to look for the cause here. python-urlgrabber, python-curl, and yum are all not very different between the releases. glibc and python I guess could be... mock is, but the spin/hang doesn't seem to be anywhere in mock.
Thank you for checking it!
It is possible that both the versions of libcurl suffer from the same bug and something else just triggers it on f21 but not on f20. Could you please install curl-7.40.0-1.fc22 on a problematic f21 host and check if the problem persists?
ok. I have updated curl/libcurl on buildvm-10 to buildvm-27. ;)
Will let you know in a bit if the bug appears...
And it popped up nice and quick. ;(
# ps -axuww | grep url
root 719 99.8 0.1 319168 15192 ? R 17:50 11:09 /usr/bin/python /usr/libexec/urlgrabber-ext-down
root 720 0.1 0.1 319164 15420 ? S 17:50 0:01 /usr/bin/python /usr/libexec/urlgrabber-ext-down
I have finally managed to reproduce the bug locally:
1. create a kvm-based x86_64 vm
2. install fresh f21 server on it
3. install and start a local squid proxy
4. run 'mock --installdeps ...' against a non-local yum mirror (over the proxy)
Then I attached gdb to the hanging process and discovered that it actually loops _inside_ clock_gettime(). This can hardly be a bug in libcurl.
102│ #ifndef HANDLED_REALTIME
103│ case CLOCK_REALTIME:
105│ struct timeval tv;
106│ retval = gettimeofday (&tv, NULL);
107│ if (retval == 0)
108│ TIMEVAL_TO_TIMESPEC (&tv, tp);
114│ #ifdef SYSDEP_GETTIME_CPU
115├> SYSDEP_GETTIME_CPU (clock_id, tp);
117│ #if HP_TIMING_AVAIL
118│ if ((clock_id & ((1 << CLOCK_IDFIELD_SIZE) - 1))
119│ == CLOCK_THREAD_CPUTIME_ID)
120│ retval = hp_timing_gettime (clock_id, tp);
123│ __set_errno (EINVAL);
126│ #if HP_TIMING_AVAIL && !defined HANDLED_CPUTIME
127│ case CLOCK_PROCESS_CPUTIME_ID:
128│ retval = hp_timing_gettime (clock_id, tp);
I suspect this could be a duplicate of bug #1174664.
A similar bug was also reported against Ubuntu:
Created attachment 978181 [details]
endless loop in the VDSO implementation of clock_gettime()
It loops in userspace. The actual loop over the glibc instructions is attached. I guess it could be caused by an integer overflow? Or just by invalid data from the virtualization layer?
$ rpm -qa glibc\* | sort -V
$ rpm -q kernel
The hosting hardware is running the same kernel NVR as the VM.
(In reply to Kamil Dudka from comment #14)
> Created attachment 978181 [details]
> tight loop in glibc code
Actually, it does not loop in libc.so.6, but in the VDSO area, which starts at 0x7fff67ffe000 in the hanging process. I am switching the component to kernel.
Here the host is a rhel7 machine: 3.10.0-123.13.2.el7.x86_64
I am trying a 3.18.1 kernel (from rawhide) now on the f21 builders to see if it changes anything.
Seemed to take a bit longer, but the bug persists on 3.18.1-2.fc22.x86_64.
Another thing to note: the f20 builders, which don't show this issue, were running almost the same kernel:
f20 builders, no bug: 3.17.7-200.fc20.x86_64
f21 builders, yes bug: 3.17.7-300.fc21.x86_64
So I wonder if glibc might be more likely? Or I suppose it could be a host issue shared by the rhel7 and f21 kernels?
Adding Andy on CC since he's now the upstream VDSO maintainer. He's nice, and maybe he'll have an idea.
Interesting. I'm currently working on rewriting this code entirely, and I have a hunch. How many CPUs does the guest have? Can you post the output of:
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
in the guest?
Can you try applying this patch:
and re-testing? If that patch doesn't apply, grep for __getcpu in arch/x86/include/asm and add the missing volatile. I hope that this patch will fix the bug for you, but, if it doesn't, then it will at least make what's going on much clearer.
I took a look at the disassembly you provided (thanks!), and I'm reasonably confident that my guess was right. If you manage to complete an iteration of the vread_pvclock loop with cpu != cpu1, then you'll never escape the loop, because the compiler will keep using cached cpu values on each iteration. I know this because "lsl" doesn't appear anywhere in the disassembly, so the __getcpu isn't happening, despite appearances in the C code.
Josh, I'm confident enough that this is a real problem and is probably the cause of this particular hang that it might be worth kicking off a scratch kernel build or something to make this easier to test.
(The reason I never looked too closely at the implications of the missing volatile was that it seems to be highly dependent on compiler details, phase of the moon, seemingly irrelevant nearby code, etc.)
(In reply to Andy Lutomirski from comment #22)
> I took a look at the disassembly you provided (thanks!), and I'm reasonably
> confident that my guess was right. If you manage to complete an iteration
> of the vread_pvclock loop with cpu != cpu1, then you'll never escape the
> loop, because the compiler will keep using cached cpu values on each
> iteration. I know this because "lsl" doesn't appear anywhere in the
> disassembly, so the __getcpu isn't happening, despite appearances in the C code.
> Josh, I'm confident enough that this is a real problem and is probably the
> cause of this particular hang that it might be worth kicking off a scratch
> kernel build or something to make this easier to test.
The function is in vsyscall.h in 3.17, so the patch needs only a small, trivial backport.
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
Guests have 4 cpus.
Will try that kernel once it's done building. Thanks!
ok. About an hour so far on that kernel and 0 cases of the bug... looking pretty good.
(In reply to Kevin Fenzi from comment #25)
> ok. About an hour so far on that kernel and 0 cases of the bug... looking
> pretty good.
Excellent. I'll try to get the patch included in Fedora git later this evening.
Fixed in git.
Test program posted at https://firstname.lastname@example.org/msg273843.html works well now for more than 1 hour.
Seems to be fixed for me. Can you please release a new official kernel version to the updates repo ASAP? Thanks.
For the record, the test program (lightly reconstructed here so that it compiles; it just hammers gettimeofday(), which goes through the vDSO):

#include <sys/time.h>

int main(int argc, char **argv)
{
    struct timeval tv;
    /* the counter reset at 10M is kept from the original fragment;
       it presumably gated a progress printout */
    for (int i = 0; ; i = (i >= 10000000) ? 0 : i + 1)
        gettimeofday(&tv, NULL);
}
BTW: What was the main source/trigger of the bug? gcc differences between F20 and F21?
I'm asking because, as already discussed, the same kernel version worked on F20.
I guess the answer is: the bug is in the kernel, but it was only triggered once the kernel was built with the newer gcc in F21.
OK : F20: 3.17.6-200.fc20.x86_64 on guest/host, qemu-kvm-1.6.2-10.fc20.x86_64 on host
NOK: F21: 3.17.7-300.fc21.x86_64 on guest/host, qemu-kvm-2.1.2-7.fc21.x86_64 on host
NOK: F21: 3.17.8-300.fc21.x86_64 on guest/host, qemu-kvm-2.2.0-1.fc21.x86_64 on host
kernel-3.18.2-200.fc21 has been submitted as an update for Fedora 21.
* should fix your issue,
* was pushed to the Fedora 21 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.18.2-200.fc21'
as soon as you are able to, then reboot.
Please go to the following url:
then log in and leave karma (feedback).
kernel-3.18.3-201.fc21 has been submitted as an update for Fedora 21.
kernel-3.18.3-201.fc21 has been pushed to the Fedora 21 stable repository. If problems still persist, please make note of it in this bug report.