1169524 – kernel-3.17.4-300.fc21.x86_64 /proc issues

Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Version:	21
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance

Reported:	2014-12-01 22:13 UTC by Jakub Jelinek
Modified:	2015-02-24 16:14 UTC
CC List:	6 users
Last Closed:	2015-02-24 16:14:37 UTC
Type:	Bug
Flags:	jforbes: needinfo?

Description Jakub Jelinek 2014-12-01 22:13:23 UTC

After upgrading to kernel-3.17.4-300.fc21.x86_64, ps and top in F21 show nonsense values for consumed CPU time.
uname -a; uptime; ps ax | grep systemd
Linux tucnak 3.17.4-300.fc21.x86_64 #1 SMP Fri Nov 21 21:11:57 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
 23:02:35 up 3 min,  1 user,  load average: 0.02, 0.05, 0.03
    1 ?        Ss   23262:52 /usr/lib/systemd/systemd --switched-root --system --deserialize 20
top shows a similar value, which makes no sense for 3 minutes of uptime.
Similarly excessive values are shown for
   44 ?        S    23201:34 [migration/1]
   50 ?        S    23201:34 [migration/2]
On the other hand, in one of the boots with this kernel (which had weird times for pid 2 (kthreadd) rather than pid 1), no normal processes, not even clearly very busy cc1/cc1plus processes, were showing any %CPU time consumed, neither in ps nor in top.
kernel-3.17.2-300.fc21.x86_64 never showed this behavior; kernel-3.17.3-300.fc21.x86_64 showed it occasionally.
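
For reference, ps and top derive the TIME column from the utime and stime fields of /proc/<pid>/stat, which are counted in clock ticks. A minimal sketch of a sanity check follows, assuming a Linux /proc layout. The helper names (parse_cpu_ticks, cpu_seconds, is_plausible, report) are hypothetical and not part of procps; the only claim tested is that a process cannot have consumed more CPU time than uptime multiplied by the number of CPUs.

```python
import os


def parse_cpu_ticks(stat_line):
    """Return (utime, stime) in clock ticks from one /proc/<pid>/stat line."""
    # The comm field is wrapped in parentheses and may itself contain
    # spaces or ')', so split only after the last ')'.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]), int(rest[12])


def cpu_seconds(stat_line, ticks_per_second):
    """Total user + system CPU time in seconds."""
    utime, stime = parse_cpu_ticks(stat_line)
    return (utime + stime) / ticks_per_second


def is_plausible(cpu_secs, uptime_secs, ncpus):
    """CPU time can never exceed uptime multiplied by the CPU count."""
    return cpu_secs <= uptime_secs * ncpus


def report(pid=1):
    """Print the check for one live process (hypothetical helper, Linux only)."""
    ticks = os.sysconf("SC_CLK_TCK")
    ncpus = os.cpu_count() or 1
    with open("/proc/uptime") as f:
        uptime = float(f.read().split()[0])
    with open(f"/proc/{pid}/stat") as f:
        secs = cpu_seconds(f.read(), ticks)
    ok = is_plausible(secs, uptime, ncpus)
    print(f"pid {pid}: {secs:.2f}s CPU, uptime {uptime:.0f}s, plausible={ok}")
```

Calling report(1) on an affected boot would show whether the raw counters in /proc are already wrong or whether only the tools misreport them. If the 23262:52 above is read as minutes:seconds, it amounts to 1395772 seconds of CPU time, far beyond what 3 minutes of uptime allows on any realistic CPU count.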

Comment 1 Jakub Jelinek 2014-12-01 22:26:06 UTC

Also, kernels 3.17.3-300.fc21.x86_64 and 3.17.3-400.fc21.x86_64 are the only ones on which I've reproduced https://gcc.gnu.org/ml/gcc-patches/2014-11/msg03092.html, although the same setup has been running daily make -j48 bootstraps on older kernels for at least a year without problems. Has something substantial changed in the scheduler?  Sure, the missing dependency is a gcc Makefile bug, but I'm also experiencing much longer make -j16 -k check times since a yum update a week ago.
Also, since when are segfaults being logged on x86_64?  I don't ever remember seeing that in the past on x86-64 (I remember it from other arches).
[ 1509.727801] null-4.exe[10757]: segfault at 0 ip 000000000804864e sp 00000000ff89e3e0 error 4 in null-4.exe[8048000+1000]
[ 1509.760455] pr59667.exe[10904]: segfault at 0 ip 0000000008048798 sp 00000000fff28170 error 6 in pr59667.exe[8048000+1000]
[ 1509.892075] pr59667.exe[11333]: segfault at 0 ip 0000000008048638 sp 00000000ffa99e40 error 6 in pr59667.exe[8048000+1000]
[ 1510.018093] pr59667.exe[11395]: segfault at 0 ip 0000000008048500 sp 00000000ff97b180 error 6 in pr59667.exe[8048000+1000]
[ 1510.317717] null-4.exe[11589]: segfault at 0 ip 0000000008048638 sp 00000000ffcc9d80 error 4 in null-4.exe[8048000+1000]
[ 1510.637411] pr59667.exe[11655]: segfault at 0 ip 0000000008048500 sp 00000000ff96bcc0 error 6 in pr59667.exe[8048000+1000]
[ 1510.794110] null-4.exe[11887]: segfault at 8 ip 00000000080484fd sp 00000000ff972f40 error 4 in null-4.exe[8048000+1000]
[ 1510.840073] pr59667.exe[11899]: segfault at 0 ip 0000000008048500 sp 00000000fff9b330 error 6 in pr59667.exe[8048000+1000]
[ 1510.981501] null-4.exe[12001]: segfault at 8 ip 00000000080484fd sp 00000000ff8e4c50 error 4 in null-4.exe[8048000+1000]
[ 1511.754681] pr59667.exe[12339]: segfault at 0 ip 0000000008048500 sp 00000000fff935a0 error 6 in pr59667.exe[8048000+1000]
Logging all this is highly undesirable for gcc testing; some segfaults are completely intentional there.
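
To make the log lines above easier to read, here is a minimal sketch that parses them and decodes the "error" value, assuming it is the x86 page-fault error code printed in hexadecimal (bit 0: protection violation rather than page not present, bit 1: write, bit 2: user mode, bit 4: instruction fetch). The names SEGV_RE, decode_error and parse_segfault are hypothetical helpers introduced for illustration.

```python
import re

# Matches the part of the kernel message up to the error code.
SEGV_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_error(code):
    """Describe an x86 page-fault error code in words."""
    cause = "protection violation" if code & 0x1 else "page not present"
    if code & 0x10:
        access = "instruction fetch"
    elif code & 0x2:
        access = "write"
    else:
        access = "read"
    mode = "user" if code & 0x4 else "kernel"
    return f"{mode}-mode {access}, {cause}"


def parse_segfault(line):
    """Return a dict for one segfault log line, or None if it does not match."""
    m = SEGV_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "addr": int(m.group("addr"), 16),
        "ip": int(m.group("ip"), 16),
        "error": err,
        "meaning": decode_error(err),
    }
```

Under that reading, the "error 4" entries are user-mode reads and the "error 6" entries are user-mode writes, in both cases of an unmapped page at or near address 0, which is consistent with intentional null-pointer tests.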

Comment 2 Josh Boyer 2014-12-02 19:08:55 UTC

(In reply to Jakub Jelinek from comment #1)
> Also, kernels 3.17.3-300.fc21.x86_64 and 3.17.3-400.fc21.x86_64 are the only
> ones on which I've reproduced
> https://gcc.gnu.org/ml/gcc-patches/2014-11/msg03092.html which has been
> running with older kernels for at least a year of daily make -j48 bootstraps
> without problems, has something substantial changed in the scheduler?  Sure,
> the missing dependency is a gcc Makefile bug, but I'm also experiencing much
> longer make -j16 -k check times since yum update a week ago.

I'm not seeing this on any of my machines.  There are a handful of rcu commits between .2, .3, and .4 but none of them immediately strike me as causing something like this.

> Also, since when are segfaults being logged on x86_64?  I don't ever

Since around 2007 with commit abd4f7505bafd.  It's guarded by a CONFIG option, but we aren't explicitly setting that.  It might have come on with some kind of reorganization in the various Kconfig files, but it has been this way for quite some time now.

> remember seeing that in the past on x86-64 (I remember it from other arches).
> [ 1509.727801] null-4.exe[10757]: segfault at 0 ip 000000000804864e sp
> 00000000ff89e3e0 error 4 in null-4.exe[8048000+1000]
> [ 1509.760455] pr59667.exe[10904]: segfault at 0 ip 0000000008048798 sp
> 00000000fff28170 error 6 in pr59667.exe[8048000+1000]
> [ 1509.892075] pr59667.exe[11333]: segfault at 0 ip 0000000008048638 sp
> 00000000ffa99e40 error 6 in pr59667.exe[8048000+1000]
> [ 1510.018093] pr59667.exe[11395]: segfault at 0 ip 0000000008048500 sp
> 00000000ff97b180 error 6 in pr59667.exe[8048000+1000]
> [ 1510.317717] null-4.exe[11589]: segfault at 0 ip 0000000008048638 sp
> 00000000ffcc9d80 error 4 in null-4.exe[8048000+1000]
> [ 1510.637411] pr59667.exe[11655]: segfault at 0 ip 0000000008048500 sp
> 00000000ff96bcc0 error 6 in pr59667.exe[8048000+1000]
> [ 1510.794110] null-4.exe[11887]: segfault at 8 ip 00000000080484fd sp
> 00000000ff972f40 error 4 in null-4.exe[8048000+1000]
> [ 1510.840073] pr59667.exe[11899]: segfault at 0 ip 0000000008048500 sp
> 00000000fff9b330 error 6 in pr59667.exe[8048000+1000]
> [ 1510.981501] null-4.exe[12001]: segfault at 8 ip 00000000080484fd sp
> 00000000ff8e4c50 error 4 in null-4.exe[8048000+1000]
> [ 1511.754681] pr59667.exe[12339]: segfault at 0 ip 0000000008048500 sp
> 00000000fff935a0 error 6 in pr59667.exe[8048000+1000]
> Logging all this is highly undesirable for gcc testing, some segfaults are
> completely intentional there.

It's a sysctl so you can turn it off.  I believe it's called 'debug.exception-trace'.
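
A minimal sketch of toggling that setting from a test harness, assuming the sysctl is exposed as /proc/sys/debug/exception-trace (sysctl names usually map to paths under /proc/sys with dots replaced by slashes). The helpers sysctl_path, read_sysctl and write_sysctl are hypothetical, and writing requires root.

```python
from pathlib import Path


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path(root) / name.replace(".", "/")


def read_sysctl(name, root="/proc/sys"):
    """Return the current value as a stripped string."""
    return sysctl_path(name, root).read_text().strip()


def write_sysctl(name, value, root="/proc/sys"):
    """Set a new value; needs root when used on the real /proc/sys."""
    sysctl_path(name, root).write_text(f"{value}\n")
```

For example, write_sysctl("debug.exception-trace", 0) before a testsuite run and restoring the previously read value afterwards would keep intentional segfaults out of the kernel log for the duration of the run.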

Comment 3 Jakub Jelinek 2014-12-03 22:48:42 UTC

Sorry for mixing in the segfault logging; apparently even older kernels did it and I just hadn't been paying attention.  But with the 3.17.4-300.fc21.x86_64 kernel, I'm also seeing weird behavior when accessing /proc.  Sometimes, when many but not all threads are busy, I can run many commands normally and the system is responsive, but when I run ps ax, the ps command hangs for at least a couple of minutes, until I Ctrl-C the CPU-intensive jobs.  E.g. I managed to reproduce it with the steal_check Cilk+ test, and gdb hung too until I interrupted it with Ctrl-C.
* 16   Thread 0x7fa74407f700 (LWP 24823) "steal_check.exe" 0x00007fa7443d5c0b in check_for_work (w=0xa99680)
    at ../../../libcilkrts/runtime/scheduler.c:1781
  15   Thread 0x7fa74387e700 (LWP 24824) "steal_check.exe" 0x00000032798e49a7 in sched_yield () from /lib64/libc.so.6
  14   Thread 0x7fa74307d700 (LWP 24825) "steal_check.exe" 0x00007fa7443ceeb2 in cilk_fiber_sysdep::run (this=0x7fa724000a00)
    at ../../../libcilkrts/runtime/cilk_fiber-unix.cpp:231
  13   Thread 0x7fa74287c700 (LWP 24826) "steal_check.exe" 0x00000032798e49a7 in sched_yield () from /lib64/libc.so.6
  12   Thread 0x7fa74207b700 (LWP 24827) "steal_check.exe" 0x00000032798e49a7 in sched_yield () from /lib64/libc.so.6
  11   Thread 0x7fa73987a700 (LWP 24828) "steal_check.exe" 0x00000032798fb00a in mmap64 () from /lib64/libc.so.6
  10   Thread 0x7fa74187a700 (LWP 24829) "steal_check.exe" 0x00000032798e49a7 in sched_yield () from /lib64/libc.so.6
  9    Thread 0x7fa741079700 (LWP 24830) "steal_check.exe" 0x00000032798e49a7 in sched_yield () from /lib64/libc.so.6
  8    Thread 0x7fa740878700 (LWP 24831) "steal_check.exe" 0x00000032798fb067 in mprotect () from /lib64/libc.so.6
  7    Thread 0x7fa73bfff700 (LWP 24832) "steal_check.exe" 0x00000032798fb00a in mmap64 () from /lib64/libc.so.6
  6    Thread 0x7fa73b7fe700 (LWP 24833) "steal_check.exe" 0x000000327987dac2 in new_heap () from /lib64/libc.so.6
  5    Thread 0x7fa73affd700 (LWP 24834) "steal_check.exe" 0x00000032798fef5a in get_nprocs () from /lib64/libc.so.6
  4    Thread 0x7fa73a7fc700 (LWP 24835) "steal_check.exe" 0x00000032798fef5a in get_nprocs () from /lib64/libc.so.6
  3    Thread 0x7fa739079700 (LWP 24836) "steal_check.exe" 0x00000032798fef5a in get_nprocs () from /lib64/libc.so.6
  2    Thread 0x7fa738878700 (LWP 24837) "steal_check.exe" 0x00000032798fef5a in get_nprocs () from /lib64/libc.so.6
  1    Thread 0x7fa744081740 (LWP 24818) "steal_check.exe" 0x0000000000400b03 in foo (some_other_var=0x7fff6171ab6c)
This is what I saw in gdb when it finally managed to attach, so it clearly was also waiting on /proc for some threads.
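
A minimal sketch for narrowing down which /proc entry stalls, assuming the hang is in reading per-process files such as stat, status or cmdline (an assumption; the report does not identify the file). Each read runs in a helper thread so the caller regains control after a timeout. The names timed_read and probe_pid are hypothetical.

```python
import threading
import time


def timed_read(path, timeout=5.0):
    """Read path in a helper thread; return (elapsed_seconds, outcome).

    outcome is "ok", "error" (open/read failed) or "timeout".
    """
    result = {}

    def worker():
        try:
            with open(path, "rb") as f:
                result["nbytes"] = len(f.read())
        except OSError as exc:
            result["error"] = exc

    start = time.monotonic()
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    elapsed = time.monotonic() - start
    if thread.is_alive():
        return elapsed, "timeout"
    return elapsed, "error" if "error" in result else "ok"


def probe_pid(pid, timeout=5.0, names=("stat", "status", "cmdline")):
    """Time reads of a few per-process /proc files for one pid."""
    return {name: timed_read(f"/proc/{pid}/{name}", timeout) for name in names}
```

Running probe_pid on the pid of the busy test while ps ax is stuck would show which file, if any, does not return in time. A helper thread blocked inside the kernel may linger until the read completes, so this is a diagnostic aid rather than a robust tool.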

Comment 4 Justin M. Forbes 2015-01-27 14:59:44 UTC

*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 21 kernel bugs.

Fedora 21 has now been rebased to 3.18.3-201.fc21.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

Comment 5 Fedora Kernel Team 2015-02-24 16:14:37 UTC

*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in over 3 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.
