637242 – Sleeping function called from invalid context at arch/x86/mm/fault.c

Summary: Sleeping function called from invalid context at arch/x86/mm/fault.c

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	13
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance

Reported:	2010-09-24 16:17 UTC by rhbug.30.miller_2555
Modified:	2011-06-28 12:23 UTC
CC List:	11 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2011-06-28 12:23:31 UTC
Type:	---
Embargoed:
Dependent Products:

Description rhbug.30.miller_2555 2010-09-24 16:17:41 UTC

Description of problem:
    Invoking a user daemon in a particular manner crashes the system. The fault is believed to lie in the distribution kernel (see details below).

Version-Release number of selected component (if applicable):
    Fedora release 13 (Goddard)

How reproducible:
    Invoke the compiled fwknop daemon, then restart it using specific options. I have only attempted the following within a customized initrd (I did not recompile the kernel to create the customized initrd; I just copied additional configuration scripts into the tarred structure and tweaked the init script to invoke them).

Steps to Reproduce:
1. Download, install, and configure the fwknop-2.0.0rc1-1 source RPM (http://www.cipherdyne.org/fwknop/download/)
2. Issue following invocation:
    fwknopd -a /etc/fwknop/access.conf -c /etc/fwknop/fwknopd.conf -i eth0 --gpg-home-dir=/root/.gnupg -f > /dev/null 2> /dev/kmsg &
3. Wait for fwknop to handle signals:
    sleep 5;
4. Restart daemon in the following manner:
    fwknopd -a /etc/fwknop/access.conf -c /etc/fwknop/fwknopd.conf -i eth0 --gpg-home-dir=/root/.gnupg -f --restart;
  
Actual results:
   The following is the error message (abridged to exclude register and stack trace information -- had to manually type this):

BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1074
in_atomic(): 0, irqs_disabled(): 1, pid: 224, name: fwknopd
BUG: unable to handle kernel NULL pointer dereference at 0000000000000258
IP: [<ffffffff8105c6f2>] complete_signal+0x103/0x151
PGD 22fe16067 PUD 22fc93067 PMD 0
Oops: 0002 [#1] SMP
last sysfs file: /sys/module/ip6_tables/initstate
CPU 3
Modules linked in: xt_multiport ip6table_filter ip6_tables ipv6 e1000e nouveau ttm drm_kms_helper drm i2c_algo_bit video output i2c_core

Pid: 224, comm: fwknopd Not tainted 2.6.34.7-56.fc13.x86_64 #1 H57M01/DX4831
< ... register dump information ... >
< ... stack trace ... >
Call Trace:
 [<ffffffff8105c9a4>] __send_signal+0x264/0x288
 [<ffffffff8105ca3e>] send_signal+0x76/0x81
 [<ffffffff8105ca94>] do_send_sig_info+0x4b/0x75
 [<ffffffff811cdf7f>] ? security_task_kill+0x16/0x18
 [<ffffffff8105cd1f>] group_send_sig_info+0x39/0x42
 [<ffffffff8105ce79>] __kill_pgrp_info+0x44/0x67
 [<ffffffff8105cfcf>] sys_kill+0xea/0x16e
 [<ffffffff8100e0d7>] ? vfs_write+0xd3/0x10b
 [<ffffffff8100e1e6>] ? sys_write+0x61/0x6e
 [<ffffffff81009c72>] system_call_fastpath+0x16/0x1b

Expected results:
    <no crash>

Additional info:
   Available upon request. I have also contacted the fwknop developer regarding the issue, but the error message above indicates an ungraceful handling of a userspace error (which the kernel should be able to handle in some manner).
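
For reference, the second line of the trace (in_atomic(): 0, irqs_disabled(): 1) records the context the kernel's might_sleep() check saw when it fired. The following is a minimal sketch, assuming POSIX sh and sed, that reads kernel log text on standard input and states which flag was set. The function name explain_sleep_context is made up for this illustration and is not part of any kernel tooling.

```shell
# Read kernel log text on stdin. For every might_sleep() context line,
# print which of the two reported flags was non-zero.
explain_sleep_context() {
    sed -n 's/.*in_atomic(): \([0-9]*\), irqs_disabled(): \([0-9]*\), pid: \([0-9]*\), name: \(.*\)$/\1 \2 \3 \4/p' |
    while read -r atomic irqs pid name; do
        reason=""
        if [ "$atomic" != 0 ]; then
            reason="preempt count is non-zero"
        fi
        if [ "$irqs" != 0 ]; then
            # Join with "and" only when the first flag was also set.
            reason="${reason:+$reason and }interrupts are disabled"
        fi
        printf 'pid %s (%s): %s\n' "$pid" "$name" "${reason:-no flag set}"
    done
}
```

Feeding it the context line from this report (for example via printf or dmesg) reports that interrupts were disabled while the preempt count was zero, which matches irqs_disabled(): 1 and in_atomic(): 0 above.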

Comment 1 rhbug.30.miller_2555 2010-09-24 16:24:40 UTC

Almost forgot the kernel detail: 
Linux localhost.localnet 2.6.34.7-56.fc13.x86_64 #1 SMP Wed Sep 15 03:36:55 UTC 2010 x86_64 GNU/Linux

Comment 2 Oleg Nesterov 2010-10-01 14:14:43 UTC

_perhaps_ fa2755e20ab0c7215d99c2dc7c262e98a09b01df
"INIT_TASK() should initialize ->thread_group list"
can help.

IFF /sbin/init doesn't change its pgid and fwknopd
runs in init's pgrp.
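
The call trace in the description goes from sys_kill() into __kill_pgrp_info(), which suggests the restart signals a whole process group rather than a single PID, so the condition above can be checked from userspace. A minimal sketch, assuming a Linux /proc filesystem plus POSIX sh, sed and awk; the helper names pgrp_of and same_pgrp are hypothetical and exist only for this illustration.

```shell
# Print the process group ID of a PID, read from /proc/PID/stat.
# The command name (field 2) may contain spaces or parentheses, so strip
# everything up to the last ") " first; the remaining fields then start
# with: state, ppid, pgrp.
pgrp_of() {
    sed 's/.*) //' "/proc/$1/stat" | awk '{ print $3 }'
}

# Succeed if the two PIDs are in the same process group.
same_pgrp() {
    [ "$(pgrp_of "$1")" = "$(pgrp_of "$2")" ]
}
```

For example, `same_pgrp "$(pgrep -o -x fwknopd)" 1 && echo "fwknopd shares init's process group"` would confirm the second half of the condition while the daemon from step 2 is still running.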

> only attempted to execute the following within a
> customized initrd

This is not clear to me. Is it possible to
reproduce the problem by just doing steps 1-4 on an F13
machine, or not?

Comment 3 Stanislaw Gruszka 2010-10-01 14:18:01 UTC

(In reply to comment #2)
> _perhaps_ fa2755e20ab0c7215d99c2dc7c262e98a09b01df
> "INIT_TASK() should initialize ->thread_group list"
> can help.

I'll prepare kernel for test. Thanks Oleg for looking at this!

Comment 4 Stanislaw Gruszka 2010-10-01 15:02:11 UTC

Here is a kernel build with Oleg's patch (currently still compiling):
http://koji.fedoraproject.org/koji/taskinfo?taskID=2506347

Comment 5 Stanislaw Gruszka 2010-10-04 08:18:10 UTC

Please test whether the above kernel build fixes the problem on your system. Note that these scratch builds are removed automatically after about one week.

Comment 6 Eric Buehl 2010-10-04 23:10:11 UTC

With 2.6.34.7-56.fc13.x86_64 I get a similar, reproducible crash while mounting certain NFS exports.  With the referenced test kernel I no longer get a crash on the NFS mount, but the system still died some time later.  Here is the trace from /var/log/messages:

BUG: unable to handle kernel paging request at 0000006e75725f64
IP: [<0000006e75725f64>] 0x6e75725f64
PGD 337562067 PUD 0 
Oops: 0010 [#1] SMP 
last sysfs file: /sys/module/lockd/initstate
CPU 4 
Modules linked in: nfsd exportfs nfs lockd fscache nfs_acl auth_rpcgss autofs4 fuse sunrpc 8021q garp 
i2c_core ioatdma serio_raw joydev dca iTCO_wdt iTCO_vendor_support raid1 [last unloaded: scsi_wait_sca


Pid: 1298, comm: mdmon Not tainted 2.6.34.7-59.bz637242.fc13.x86_64 #1 X8DTT-H/X8DTT-H
RIP: 0010:[<0000006e75725f64>]  [<0000006e75725f64>] 0x6e75725f64
RSP: 0018:ffff880337e49e40  EFLAGS: 00010246
RAX: ffffffff8165b73c RBX: ffff8801b7b69280 RCX: ffff880337e49f58
RDX: ffff880337e49e80 RSI: ffff880337e49e80 RDI: ffff8801b7b69280
RBP: ffff880337e49eb8 R08: ffffffff811267d9 R09: 0000000000000007
R10: 0000000000000064 R11: 0000000000000246 R12: 0000000000000400
R13: ffff880337e49f58 R14: ffff8801b7dbec00 R15: 0000000000000000
FS:  00007ffa86352700(0000) GS:ffff8801c5800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000006e75725f64 CR3: 0000000337563000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process mdmon (pid: 1298, threadinfo ffff880337e48000, task ffff880337582ee0)
Stack:
ffffffff81126943 0000000000102073 ffff880337e49e80 ffff8801b7b692b8
<0> 0000000000000000 00007ffa8635f000 0000000000000000 ffff880336e33800
<0> 0000000000000000 ffff880337e49eb8 ffff8801b8ff2600 ffff8801b7dbec00
Call Trace:
[<ffffffff81126943>] ? seq_read+0x16a/0x36b
[<ffffffff81155ce6>] proc_reg_read+0x75/0x8e
[<ffffffff8110e29e>] vfs_read+0xab/0x108
[<ffffffff8110e3bb>] sys_read+0x4a/0x6e
[<ffffffff81009c72>] system_call_fastpath+0x16/0x1b
Code:  Bad RIP value.
RIP  [<0000006e75725f64>] 0x6e75725f64
RSP <ffff880337e49e40>
CR2: 0000006e75725f64
---[ end trace b127d97b815b890b ]---

Comment 7 Oleg Nesterov 2010-10-05 11:25:17 UTC

Confused ;)

(In reply to comment #6)
>
> With 2.6.34.7-56.fc13.x86_64 I get a similar, reproducible crash while
> mounting certain NFS exports.

No, it is not similar at all. This bug has nothing to do with
the original bug report.

> With the referenced test kernel I no longer get a crash
> on the NFS mount

What about signals? Does this patch fix the original problem
with sys_kill() or not?

And we still do not know how the original problem can be reproduced.
Let me repeat the question from #2.

> only attempted to execute the following within a
> customized initrd

Is it possible to reproduce the problem by just doing 1-4 on f13
machine? Or do you need a special environment (initrd/etc) to
reproduce?

> BUG: unable to handle kernel paging request at 0000006e75725f64
> IP: [<0000006e75725f64>] 0x6e75725f64
> ...
> [<ffffffff81126943>] ? seq_read+0x16a/0x36b
> [<ffffffff81155ce6>] proc_reg_read+0x75/0x8e

So, this is another problem. Again, how to reproduce?
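
A side observation on the faulting address in comment 6, offered as a hint rather than a diagnosis: read as little-endian bytes, 0x6e75725f64 consists of printable ASCII characters, which is what a function pointer overwritten by string data would look like. A small sketch, assuming POSIX sh with 64-bit shell arithmetic; addr_to_ascii is a hypothetical helper written for this note.

```shell
# Print the 8 bytes of a 64-bit value as ASCII, lowest byte first
# (x86_64 is little-endian, so this is the order the bytes sit in memory).
# Bytes outside the printable range 32..126 are shown as ".".
addr_to_ascii() {
    val=$(( $1 ))
    out=""
    i=0
    while [ "$i" -lt 8 ]; do
        byte=$(( (val >> (8 * i)) & 0xff ))
        if [ "$byte" -ge 32 ] && [ "$byte" -le 126 ]; then
            # %b interprets the \0NNN octal escape built from the byte value.
            out="$out$(printf '%b' "\\0$(printf '%o' "$byte")")"
        else
            out="$out."
        fi
        i=$(( i + 1 ))
    done
    printf '%s\n' "$out"
}
```

Running `addr_to_ascii 0x6e75725f64` prints `d_run...`, that is, the characters "d_run" followed by three zero bytes. Where that text came from cannot be determined from the trace alone.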

Comment 8 rhbug.30.miller_2555 2010-11-19 00:15:40 UTC

Oleg & Stanislaw - Sincerest apologies - I was on a backpacking trip for the last several weeks (hence the late reply). If the patched build is still available (it looks like it is), I'll try it as soon as I get a chance and will report back my findings (shooting for this weekend or over Thanksgiving). I'll also try the latest pre-built kernel to see whether the problem has been resolved there.

Thanks a ton & will let you know - 
Will

Comment 9 Bug Zapper 2011-05-31 12:40:24 UTC

This message is a reminder that Fedora 13 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 13.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '13'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 13's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 13 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 10 PaulB 2011-06-27 13:57:16 UTC

All,
The following issue was seen during testing:
Checking dmesg for specific failures!
BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1097
End of log.

See testing here:
 https://beaker.engineering.redhat.com/recipes/208451

 http://tinyurl.com/3ejwysx
 <-SNIP->
  Checking dmesg for specific failures!
  BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1097
  End of log.
 <-SNIP->

http://tinyurl.com/3qkop3s
 <-SNIP->
  BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1097 
  in_atomic(): 0, irqs_disabled(): 1, pid: 25664, name: rhts-db-submit- 
  INFO: lockdep is turned off. 
  irq event stamp: 0 
  hardirqs last  enabled at (0): [<(null)>] (null) 
  hardirqs last disabled at (0): [<ffffffff81069ddd>] copy_process+0x5ed/0x14d0 
  softirqs last  enabled at (0): [<ffffffff81069ddd>] copy_process+0x5ed/0x14d0 
  softirqs last disabled at (0): [<(null)>] (null) 
  Pid: 25664, comm: rhts-db-submit- Tainted: G           ---------------- T 2.6.32-162.el6.x86_64.debug #1 
  Call Trace: 
  [<ffffffff810a8300>] ? print_irqtrace_events+0xd0/0xe0 
  [<ffffffff81055a27>] ? __might_sleep+0xf7/0x130 
  [<ffffffff810428a4>] ? __do_page_fault+0x114/0x4e0 
  [<ffffffff8128455e>] ? cfq_set_request+0x8e/0x520 
  [<ffffffff8128455e>] ? cfq_set_request+0x8e/0x520 
  [<ffffffff810ac6cd>] ? trace_hardirqs_on+0xd/0x10 
  [<ffffffff8151441e>] ? do_page_fault+0x3e/0xa0 
  [<ffffffff81511525>] ? page_fault+0x25/0x30 
  [<ffffffff812987ec>] ? debug_object_activate+0x5c/0x160 
  [<ffffffff81298812>] ? debug_object_activate+0x82/0x160 
  [<ffffffff812987ec>] ? debug_object_activate+0x5c/0x160 
  [<ffffffff8107ffdf>] ? mod_timer+0xcf/0x240 
  [<ffffffff8126a6f2>] ? blk_plug_device+0x72/0x100 
  [<ffffffff8126d9a4>] ? __make_request+0x194/0x5e0 
  [<ffffffffa0003b2b>] ? dm_request+0x3b/0x1e0 [dm_mod] 
  [<ffffffff8126bbb9>] ? generic_make_request+0x329/0x640 
  [<ffffffff811c6586>] ? bio_add_page+0x36/0x40 
  [<ffffffff811cb5d0>] ? do_mpage_readpage+0x310/0x5f0 
  [<ffffffff8126bf5d>] ? submit_bio+0x8d/0x120 
  [<ffffffff811cb137>] ? mpage_bio_submit+0x27/0x30 
  [<ffffffff811cba35>] ? mpage_readpages+0x115/0x130 
  [<ffffffffa0244e30>] ? ext4_get_block+0x0/0x120 [ext4] 
  [<ffffffffa0244e30>] ? ext4_get_block+0x0/0x120 [ext4] 
  [<ffffffff8116c30a>] ? alloc_pages_current+0xaa/0x110 
  [<ffffffffa02409ad>] ? ext4_readpages+0x1d/0x20 [ext4] 
  [<ffffffff81139a90>] ? __do_page_cache_readahead+0x1d0/0x260 
  [<ffffffff8113996e>] ? __do_page_cache_readahead+0xae/0x260 
  [<ffffffff81123be0>] ? find_get_page+0x0/0x120 
  [<ffffffff81139b41>] ? ra_submit+0x21/0x30 
  [<ffffffff811245f8>] ? filemap_fault+0x4e8/0x530 
  [<ffffffff8114e844>] ? __do_fault+0x54/0x4f0 
  [<ffffffff812317a4>] ? task_has_capability+0xb4/0x110 
  [<ffffffff8114ed70>] ? handle_pte_fault+0x90/0xa90 
  [<ffffffff81152d98>] ? vma_link+0x58/0xf0 
  [<ffffffff815109eb>] ? _spin_unlock+0x2b/0x40 
  [<ffffffff8114f954>] ? handle_mm_fault+0x1e4/0x2b0 
  [<ffffffff810428f3>] ? __do_page_fault+0x163/0x4e0 
  [<ffffffff81155a2a>] ? do_mmap_pgoff+0x33a/0x380 
  [<ffffffff8151441e>] ? do_page_fault+0x3e/0xa0 
  [<ffffffff81511525>] ? page_fault+0x25/0x30 
 <-SNIP->


=========================================================================
Note: 
I was not able to reproduce this issue testing with same host.
See here:
[] J:101962 2.6.32-162.el6 Cthon X5 
   https://beaker.engineering.redhat.com/jobs/101962
=========================================================================

Best,
-pbunyan



=========================================================================

Comment 11 Oleg Nesterov 2011-06-27 20:09:53 UTC

(In reply to comment #10)
>
>   BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1097 
>   in_atomic(): 0, irqs_disabled(): 1, pid: 25664, name: rhts-db-submit- 

note this in_atomic() == 0

>   [<ffffffff812987ec>] ? debug_object_activate+0x5c/0x160 
>   [<ffffffff81298812>] ? debug_object_activate+0x82/0x160 
>   [<ffffffff812987ec>] ? debug_object_activate+0x5c/0x160 
>   [<ffffffff8107ffdf>] ? mod_timer+0xcf/0x240 
>   [<ffffffff8126a6f2>] ? blk_plug_device+0x72/0x100

So according to this trace, debug_object_activate() faults for
some unknown reason, which is strange. And since in_atomic() == F,
do_page_fault() doesn't do bad_area() but takes mmap_sem and
proceeds. It also seems that fault_in_kernel_space() == F, which is
strange too.

And why is in_atomic() == F? We are holding tvec_base->lock
at least.

Confused.

Comment 12 Oleg Nesterov 2011-06-27 20:13:51 UTC

(In reply to comment #11)
>
> And why is in_atomic() == F? We are holding tvec_base->lock
> at least.

Ah, probably !CONFIG_PREEMPT.

> Confused.

Yes.
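
The !CONFIG_PREEMPT guess can be checked against the kernel's build configuration. On kernels of this era, taking a spinlock only raises the preempt count when full preemption is compiled in, so in_atomic() cannot see a held spinlock otherwise. A minimal sketch, assuming the config file is available as plain text (Fedora and RHEL install it as /boot/config-<version>); config_has_preempt is a hypothetical helper name.

```shell
# Succeed if the given kernel config file enables full preemption.
# Only CONFIG_PREEMPT=y counts here; CONFIG_PREEMPT_NONE=y and
# CONFIG_PREEMPT_VOLUNTARY=y are different options and do not match.
config_has_preempt() {
    grep -q '^CONFIG_PREEMPT=y$' "$1"
}
```

On the affected host this could be run as `config_has_preempt "/boot/config-$(uname -r)" || echo "no CONFIG_PREEMPT: in_atomic() will not notice held spinlocks"`.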

Comment 13 Bug Zapper 2011-06-28 12:23:31 UTC

Fedora 13 changed to end-of-life (EOL) status on 2011-06-25. Fedora 13 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.
