Bug 1178975 - endless loop in clock_gettime() on a kvm-based VM
Summary: endless loop in clock_gettime() on a kvm-based VM
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 21
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Keywords:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2015-01-05 19:42 UTC by Kevin Fenzi
Modified: 2015-01-26 02:30 UTC (History)
11 users (show)

(edit)
Clone Of:
(edit)
Last Closed: 2015-01-26 02:30:55 UTC


Attachments (Terms of Use)
endless loop in the VDSO implementation of clock_gettime() (3.05 KB, text/plain)
2015-01-09 14:32 UTC, Kamil Dudka
no flags Details

Description Kevin Fenzi 2015-01-05 19:42:58 UTC
I reinstalled our koji buildvm builders recently with Fedora 21. 

For most builds they seem to work just fine, but after a while random builds get stuck on them and do not complete. 

  ├─kojid,447 /usr/sbin/kojid --force-lock --verbose
  │   ├─kojid,3555 /usr/sbin/kojid --force-lock --verbose
  │   │   └─mock,3811 -tt /usr/sbin/mock -r koji/f22-build-2775703-442858 --no-clean ...
  │   │       └─yum-builddep,3861 -tt /usr/bin/yum-builddep --installroot...
  │   │           ├─urlgrabber-ext-,3864 /usr/libexec/urlgrabber-ext-down
  │   │           └─urlgrabber-ext-,3867 /usr/libexec/urlgrabber-ext-down

root      3864  0.0  0.1 319144 15324 ?        S    17:09   0:01 /usr/bin/python /usr/libexec/urlgrabber-ext-down
root      3867 99.8  0.1 319148 15276 ?        R    17:09 150:32 /usr/bin/python /usr/libexec/urlgrabber-ext-down

# ls -l /proc/3867/fd
total 0
lr-x------. 1 root root 64 Jan  5 19:20 0 -> pipe:[27684]
l-wx------. 1 root root 64 Jan  5 19:20 1 -> pipe:[27685]
l-wx------. 1 root root 64 Jan  5 19:20 2 -> pipe:[27670]
lrwx------. 1 root root 64 Jan  5 19:20 3 -> socket:[26080]
l-wx------. 1 root root 64 Jan  5 19:20 4 -> /var/lib/mock/f22-build-2775703-442858/root/var/log/yum.log
l-wx------. 1 root root 64 Jan  5 19:20 5 -> /var/lib/mock/f22-build-2775703-442858/root/var/cache/yum/build/packages/opus-1.1.1-0.1.beta.fc22.i686.rpm
lrwx------. 1 root root 64 Jan  5 19:20 6 -> socket:[26082]
lrwx------. 1 root root 64 Jan  5 19:20 7 -> anon_inode:[eventpoll]
lrwx------. 1 root root 64 Jan  5 19:20 8 -> socket:[28719]

(gdb) where
#0  0x00007fff963fec29 in clock_gettime ()
#1  0x00007fe9ebd919fd in clock_gettime () from /lib64/libc.so.6
#2  0x00007fe9e0ba1f83 in curlx_tvnow () from /lib64/libcurl.so.4
#3  0x00007fe9e0ba34a2 in Curl_pgrsUpdate () from /lib64/libcurl.so.4
#4  0x00007fe9e0bcafce in multi_runsingle () from /lib64/libcurl.so.4
#5  0x00007fe9e0bcbd71 in curl_multi_perform () from /lib64/libcurl.so.4
#6  0x00007fe9e0bc2d0c in curl_easy_perform () from /lib64/libcurl.so.4
#7  0x00007fe9e0e0fef3 in do_curl_perform () from /usr/lib64/python2.7/site-packages/pycurl.so
#8  0x00007fe9eca4d81c in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#9  0x00007fe9eca4cbe6 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#10 0x00007fe9eca4cbe6 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#11 0x00007fe9eca4cbe6 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#12 0x00007fe9eca4e400 in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#13 0x00007fe9ec9da5dc in function_call () from /lib64/libpython2.7.so.1.0
#14 0x00007fe9ec9b5903 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#15 0x00007fe9ec9c4815 in instancemethod_call () from /lib64/libpython2.7.so.1.0
#16 0x00007fe9ec9b5903 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#17 0x00007fe9eca0c837 in slot_tp_init () from /lib64/libpython2.7.so.1.0
#18 0x00007fe9eca0b56f in type_call () from /lib64/libpython2.7.so.1.0
#19 0x00007fe9ec9b5903 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#20 0x00007fe9eca4b585 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#21 0x00007fe9eca4cbe6 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#22 0x00007fe9eca4e400 in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#23 0x00007fe9eca4e4f9 in PyEval_EvalCode () from /lib64/libpython2.7.so.1.0
#24 0x00007fe9eca6790f in run_mod () from /lib64/libpython2.7.so.1.0
#25 0x00007fe9eca68b32 in PyRun_FileExFlags () from /lib64/libpython2.7.so.1.0
#26 0x00007fe9eca69d47 in PyRun_SimpleFileExFlags () from /lib64/libpython2.7.so.1.0
#27 0x00007fe9eca7b2ea in Py_Main () from /lib64/libpython2.7.so.1.0
#28 0x00007fe9ebca2fe0 in __libc_start_main () from /lib64/libc.so.6
#29 0x000000000040071e in _start ()

strace/ltrace just hangs/doesn't output anything.

mock-1.2.3-1.fc21.noarch
curl-7.37.0-11.fc21.x86_64
python-urlgrabber-3.10.1-2.fc21.noarch
yum-3.4.3-153.fc21.noarch

Happy to try and gather more info...

Comment 1 Kamil Dudka 2015-01-06 09:59:58 UTC
Thanks for the report!

Did it never happen before the update?

What was the version of libcurl before the update?

Comment 2 Kamil Dudka 2015-01-06 11:53:12 UTC
I failed to reproduce it locally.

Which mirror is actually used to download the packages from?

Comment 3 Kevin Fenzi 2015-01-06 15:43:27 UTC
Before they were running Fedora 20, so: 

curl-7.32.0-15.fc20.x86_64
libcurl-7.32.0-15.fc20.x86_64

Packages are downloaded from a local squid proxy, kojipkgs.fedoraproject.org

I do see a bunch of stuff in recv queue: 

tcp    ESTAB      2364240 0                                   10.5.125.105:41955                               10.5.125.36:80    

# fuser -n tcp 41955
41955/tcp:           16175
# ps -axuww | grep 16175
root     16175 99.9  0.1 319148 15284 ?        R    Jan05 1130:15 /usr/bin/python /usr/libexec/urlgrabber-ext-down

Comment 4 Kamil Dudka 2015-01-06 16:16:36 UTC
It seems to work fine here with an squid proxy running on localhost.

Could you please check it with libcurl-7.32.0-17.fc20?

That version also includes a patch for bug #1166239 that already caused some issues elsewhere (bug #1172572).

Does it consume CPU or hangs passively when it happens?

Comment 5 Kevin Fenzi 2015-01-06 16:51:40 UTC
(In reply to Kamil Dudka from comment #4)
> It seems to work fine here with an squid proxy running on localhost.
> 
> Could you please check it with libcurl-7.32.0-17.fc20?

You want me to try that f20 package on an f21 host? 

Note that the problem hosts here are f21. They worked fine _before_ as f20. 
> 
> That version also includes a patch for bug #1166239 that already caused some
> issues elsewhere (bug #1172572).
> 
> Does it consume CPU or hangs passively when it happens?

It consumes 100% cpu.

Comment 6 Kamil Dudka 2015-01-06 17:07:21 UTC
(In reply to Kevin Fenzi from comment #5)
> You want me to try that f20 package on an f21 host?

Trying it on a f20 host should also be fine.  You choose.

> Note that the problem hosts here are f21. They worked fine _before_ as f20.

You referred to libcurl-7.32.0-15.fc20 as the working version.  However, it is actually libcurl-7.32.0-17.fc20 what is now tagged f20-updates and chances are it suffers from the same bug.  That is why I am asking about f20.

Comment 7 Kevin Fenzi 2015-01-06 17:48:05 UTC
Per our conversation on irc I tried curl-7.37.0-9.fc21.x86_64 on the f21 hosts. 

The problem still persists. ;( 

root       708 99.8  0.1 319412 15116 ?        R    17:32  13:28 /usr/bin/python /usr/libexec/urlgrabber-ext-down

# ls -l /proc/708/fd/
total 0
lr-x------. 1 root root 64 Jan  6 17:43 0 -> pipe:[16023]
l-wx------. 1 root root 64 Jan  6 17:43 1 -> pipe:[16024]
l-wx------. 1 root root 64 Jan  6 17:43 2 -> pipe:[19263]
lrwx------. 1 root root 64 Jan  6 17:43 3 -> socket:[19265]
l-wx------. 1 root root 64 Jan  6 17:43 4 -> /var/lib/mock/f22-build-2780382-443173/root/var/log/yum.log
lrwx------. 1 root root 64 Jan  6 17:43 5 -> anon_inode:[eventpoll]
lrwx------. 1 root root 64 Jan  6 17:43 6 -> socket:[19267]
l-wx------. 1 root root 64 Jan  6 17:43 7 -> /var/lib/mock/f22-build-2780382-443173/root/var/cache/yum/build/931b19ae55c88d785cb04e67f1c9ca417d1cc61271c078df8c6257824b64eadb-primary.sqlite.bz2
lrwx------. 1 root root 64 Jan  6 17:43 8 -> socket:[19269]

# ss -an | grep 80
tcp    ESTAB      242708 0           10.5.125.150:55049       10.5.125.36:80    

-rw-rw-r--. 1 root mock 10133504 Jan  6 17:32 /var/lib/mock/f22-build-2780382-443173/root/var/cache/yum/build/931b19ae55c88d785cb04e67f1c9ca417d1cc61271c078df8c6257824b64eadb-primary.sqlite.bz2

Comment 8 Kamil Dudka 2015-01-07 16:03:56 UTC
Thanks for the additional info!

Could you please install libcurl-7.37.0-11.fc21 on a f20 builder and check whether it triggers the bug?

We need to confirm that this bug is really caused by a change in curl...

Comment 9 Kevin Fenzi 2015-01-07 20:47:55 UTC
ok. I upgraded our buildhw boxes (f20) to the f21 curl... and so far, no bug. ;( 

It could just take longer on hw... I could try reinstalling a few vm's with f20 and f21 curl and see if it happens there. 

Not sure where else to look for the cause here. python-urlgrabber, python-curl, yum are all not very different. glibc and python I guess could be... mock is, but the spin/hang/bug doesn't seem to be in mock anywhere. 

Suggestions welcome.

Comment 10 Kamil Dudka 2015-01-08 11:14:12 UTC
Thank you for checking it!

It is possible that both the versions of libcurl suffer from the same bug and something else just triggers it on f21 but not on f20.  Could you please install curl-7.40.0-1.fc22 on a problematic f21 host and check if the problem persists?

Comment 11 Kevin Fenzi 2015-01-08 17:45:44 UTC
ok. I have updated curl/libcurl on buildvm-10 to buildvm-27. ;) 

Will let you know in a bit if the bug appears...

Comment 12 Kevin Fenzi 2015-01-08 18:05:28 UTC
And it popped up nice and quick. ;( 

curl-7.40.0-1.fc22.x86_64
libcurl-7.40.0-1.fc22.x86_64

# ps -axuww | grep url
root       719 99.8  0.1 319168 15192 ?        R    17:50  11:09 /usr/bin/python /usr/libexec/urlgrabber-ext-down
root       720  0.1  0.1 319164 15420 ?        S    17:50   0:01 /usr/bin/python /usr/libexec/urlgrabber-ext-down

Comment 13 Kamil Dudka 2015-01-09 13:58:13 UTC
I have finally managed to reproduce the bug locally:

1. create a kvm-based x86_64 vm
2. install fresh f21 server on it
3. install and start a local squid proxy
4. run 'mock --installdeps ...' against a non-local yum mirror (over the proxy)

Then I attached gdb to the hanging process and discovered that it actually loops _inside_ clock_gettime().  This could hardly be a bug of libcurl.

/usr/src/debug/glibc-2.20/sysdeps/unix/clock_gettime.c
101│
102│ #ifndef HANDLED_REALTIME
103│     case CLOCK_REALTIME:
104│       {
105│         struct timeval tv;
106│         retval = gettimeofday (&tv, NULL);
107│         if (retval == 0)
108│           TIMEVAL_TO_TIMESPEC (&tv, tp);
109│       }
110│       break;
111│ #endif
112│
113│     default:
114│ #ifdef SYSDEP_GETTIME_CPU
115├>      SYSDEP_GETTIME_CPU (clock_id, tp);
116│ #endif
117│ #if HP_TIMING_AVAIL
118│       if ((clock_id & ((1 << CLOCK_IDFIELD_SIZE) - 1))
119│           == CLOCK_THREAD_CPUTIME_ID)
120│         retval = hp_timing_gettime (clock_id, tp);
121│       else
122│ #endif
123│         __set_errno (EINVAL);
124│       break;
125│
126│ #if HP_TIMING_AVAIL && !defined HANDLED_CPUTIME
127│     case CLOCK_PROCESS_CPUTIME_ID:
128│       retval = hp_timing_gettime (clock_id, tp);
129│       break;


I suspect this could be a duplicate of bug #1174664.

A similar bug was also reported against Ubuntu:

https://bugs.launchpad.net/ubuntu/+source/ubiquity/+bug/1020992

Comment 14 Kamil Dudka 2015-01-09 14:32:41 UTC
Created attachment 978181 [details]
endless loop in the VDSO implementation of clock_gettime()

It loops in userspace.  The actual loop over the glibc instructions is attached.  I guess it could be caused by integral overflow?  Or just by invalid data from the virtualization layer?

Comment 15 Kamil Dudka 2015-01-09 14:35:55 UTC
$ rpm -qa glibc\* | sort -V
glibc-2.20-5.fc21.x86_64
glibc-common-2.20-5.fc21.x86_64
glibc-debuginfo-2.20-5.fc21.x86_64
glibc-debuginfo-common-2.20-5.fc21.x86_64
glibc-devel-2.20-5.fc21.x86_64
glibc-headers-2.20-5.fc21.x86_64

$ rpm -q kernel
kernel-3.17.7-300.fc21.x86_64

The hosting HW is running the same NVR of kernel as the VM.

Comment 16 Kamil Dudka 2015-01-09 15:40:20 UTC
(In reply to Kamil Dudka from comment #14)
> Created attachment 978181 [details]
> tight loop in glibc code

Actually, it does not loop in libc.so.6, but in the VDSO area, which starts at 0x7fff67ffe000 in the hanging process.  I am switching the component to kernel.

Comment 17 Kevin Fenzi 2015-01-09 16:28:04 UTC
Adding datapoints: 

Here the host is a rhel7 machine: 3.10.0-123.13.2.el7.x86_64

I am trying a 3.18.1 kernel (from rawhide) now on the f21 builders to see if it changes anything.

Comment 18 Kevin Fenzi 2015-01-09 16:48:09 UTC
Seemed to take a bit longer, but the bug persists on 3.18.1-2.fc22.x86_64.

Comment 19 Kevin Fenzi 2015-01-09 17:40:11 UTC
Another thing to note: The f20 builders which don't show this issue were using the almost same kernel: 
f20 builders, no bug: 3.17.7-200.fc20.x86_64
f21 builders, yes bug: 3.17.7-300.fc21.x86_64

So I wonder if glibc might be more likely? Or I suppose it could be a host issue shared by the rhel7 and f21 kernels?

Comment 20 Josh Boyer 2015-01-09 18:24:45 UTC
Adding Andy on CC since he's now the upstream VDSO maintainer.  He's nice, and maybe he'll have an idea.

Comment 21 Andy Lutomirski 2015-01-09 19:08:06 UTC
Interesting.  I'm currently working on rewriting this code entirely, and I have a hunch.  How many CPUs does the guest have?  Can you post the output of:

$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource

in the guest?

Can you try applying this patch:

https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=x86/urgent&id=1ddf0b1b11aa8a90cef6706e935fc31c75c406ba

and re-testing?  If that patch doesn't apply, grep for __getcpu in arch/x86/include/asm and add the missing volatile.  I hope that this patch will fix the bug for you, but, if it doesn't, then it will at least make what's going on much clearer.

Comment 22 Andy Lutomirski 2015-01-09 19:17:15 UTC
I took a look at the disassembly you provided (thanks!), and I'm reasonably confident that my guess was right.  If you manage to complete an iteration of the vread_pvclock loop with cpu != cpu1, then you'll never escape the loop, because the compiler will keep using cached cpu values on each iteration.  I know this because "lsl" doesn't appear anywhere in the disassembly, so the __getcpu isn't happening, despite appearances in the C code.

Josh, I'm confident enough that this is a real problem and is probably the cause of this particular hang that it might be worth kicking off a scratch kernel build or something to make this easier to test.

(The reason I never looked too closely at the implications of the missing volatile was that it seems to be highly dependent on compiler details, phase of the moon, seemingly irrelevant nearby code, etc.)

Comment 23 Josh Boyer 2015-01-09 20:14:22 UTC
(In reply to Andy Lutomirski from comment #22)
> I took a look at the disassembly you provided (thanks!), and I'm reasonably
> confident that my guess was right.  If you manage to complete an iteration
> of the vread_pvclock loop with cpu != cpu1, then you'll never escape the
> loop, because the compiler will keep using cached cpu values on each
> iteration.  I know this because "lsl" doesn't appear anywhere in the
> disassembly, so the __getcpu isn't happening, despite appearances in the C
> code.
> 
> Josh, I'm confident enough that this is a real problem and is probably the
> cause of this particular hang that it might be worth kicking off a scratch
> kernel build or something to make this easier to test.

Started:

http://koji.fedoraproject.org/koji/taskinfo?taskID=8575247

The function is in vsyscall.h in 3.17, so a small trivial backport of the patch.

Thanks Andy!

Comment 24 Kevin Fenzi 2015-01-09 21:14:57 UTC
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
kvm-clock

Guests have 4 cpus. 

Will try that kernel once it's done building. Thanks!

Comment 25 Kevin Fenzi 2015-01-09 23:19:12 UTC
ok. About an hour so far on that kernel and 0 cases of the bug... looking pretty good.

Comment 26 Josh Boyer 2015-01-10 15:55:10 UTC
(In reply to Kevin Fenzi from comment #25)
> ok. About an hour so far on that kernel and 0 cases of the bug... looking
> pretty good.

Excellent.  I'll try to get the patch included in Fedora git later this evening.

Comment 27 Josh Boyer 2015-01-12 13:20:42 UTC
Fixed in git.

Comment 28 Gerhard Wiesinger 2015-01-14 09:10:30 UTC
Test program posted at https://www.mail-archive.com/qemu-devel@nongnu.org/msg273843.html works well now for more than 1 hour.

Seems to be fixed for me. Can you please release a new official kernel version to the update repo ASAP. Thnx.

For the records the test program:
#include <sys/time.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
  struct timeval tv;
  int i = 0;
  for (;;)
  {
    gettimeofday(&tv, 0);
    ++i;
    if (i >= 10000000)
    {
      i = 0;
      printf("%i\n", (int)tv.tv_sec);
    }
  }
  return 0;
}

BTW: What was the main source/trigger of the bug? gcc differences between F20 and F21?
I'm asking as already discussed because same kernel version worked on F20.

I guess the answer is: Bug is in kernel but was not triggered with old F20 gcc and was triggered by F21 newer gcc version.

OK : F20: 3.17.6-200.fc20.x86_64 on guest/host, qemu-kvm-1.6.2-10.fc20.x86_64 on host
NOK: F21: 3.17.7-300.fc21.x86_64 on guest/host, qemu-kvm-2.1.2-7.fc21.x86_64 on host
NOK: F21: 3.17.8-300.fc21.x86_64 on guest/host, qemu-kvm-2.2.0-1.fc21.x86_64 on host

Comment 29 Fedora Update System 2015-01-14 13:14:28 UTC
kernel-3.18.2-200.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/kernel-3.18.2-200.fc21

Comment 30 Fedora Update System 2015-01-14 23:59:47 UTC
Package kernel-3.18.2-200.fc21:
* should fix your issue,
* was pushed to the Fedora 21 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.18.2-200.fc21'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2015-0724/kernel-3.18.2-200.fc21
then log in and leave karma (feedback).

Comment 31 Fedora Update System 2015-01-19 20:49:09 UTC
kernel-3.18.3-201.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/kernel-3.18.3-201.fc21

Comment 32 Fedora Update System 2015-01-26 02:30:55 UTC
kernel-3.18.3-201.fc21 has been pushed to the Fedora 21 stable repository.  If problems still persist, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.