Bug 1464923 - git-remote-http causes kernel crash [NEEDINFO]
Status: NEW
Product: Fedora
Classification: Fedora
Component: kernel
Version: 25
Hardware: x86_64 Linux
Priority: unspecified
Severity: unspecified
Assigned To: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Duplicates: 1464922
Reported: 2017-06-26 05:00 EDT by Roman Pavlyuk
Modified: 2017-11-16 13:49 EST
Type: Bug
nobodyless: needinfo? (extras-qa)


Attachments
Kernel crash log (3.52 KB, text/plain)
2017-06-26 05:01 EDT, Roman Pavlyuk
Jenkins job sample (1.83 KB, application/xml)
2017-06-26 05:02 EDT, Roman Pavlyuk

Description Roman Pavlyuk 2017-06-26 05:00:08 EDT
Description of problem:
I have a Jenkins CI setup with a set of CI jobs that use SCM polling. Every minute Jenkins checks git (actually, GitHub) for changes and, if there are any, starts the build job.

However, after some time (approx. 2 days) the system becomes slow, then unresponsive, and eventually crashes completely. The dmesg log is filled with the following messages:

Jun 21 06:32:01 liberty kernel: BUG: unable to handle kernel paging request at 0000000002f27b6e
Jun 21 06:32:01 liberty kernel: IP: __d_lookup_rcu+0x67/0x180
Jun 21 06:32:01 liberty kernel: PGD 0
Jun 21 06:32:01 liberty kernel:
Jun 21 06:32:01 liberty kernel: Oops: 0000 [#8] SMP
Jun 21 06:32:01 liberty kernel: Modules linked in: 8021q garp mrp veth xt_nat ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrt
Jun 21 06:32:01 liberty kernel:  tpm binfmt_misc i915 i2c_algo_bit drm_kms_helper serio_raw drm ata_generic pata_acpi sata_sil24 video
Jun 21 06:32:01 liberty kernel: CPU: 0 PID: 5053 Comm: git-remote-http Tainted: G      D W       4.11.4-200.fc25.x86_64 #1
Jun 21 06:32:01 liberty kernel: Hardware name: System manufacturer System Product Name/P8H61-MX R2.0, BIOS 1109 06/20/2014
Jun 21 06:32:01 liberty kernel: task: ffffa0ced20fa480 task.stack: ffffb1a60ac48000
Jun 21 06:32:01 liberty kernel: RIP: 0010:__d_lookup_rcu+0x67/0x180
Jun 21 06:32:01 liberty kernel: RSP: 0018:ffffb1a60ac4bc48 EFLAGS: 00010206
Jun 21 06:32:01 liberty kernel: RAX: 000000000000001b RBX: 0000000002f27b72 RCX: ffffb1a60001b000
Jun 21 06:32:01 liberty kernel: RDX: ffffb1a60ac4bcc4 RSI: ffffb1a60ac4bdb0 RDI: ffffa0ced2caec00
Jun 21 06:32:01 liberty kernel: RBP: ffffb1a60ac4bca0 R08: ffffa0cba6fdbcc0 R09: ffffb1a60ac4bcc4
Jun 21 06:32:01 liberty kernel: R10: 00000000dfba2be3 R11: 0000001b00000000 R12: 0000000000000000
Jun 21 06:32:01 liberty kernel: R13: ffffa0ced2caec00 R14: 0000001bdfba2be3 R15: ffffa0cd0f87102b
Jun 21 06:32:01 liberty kernel: FS:  00007fbaa4a631c0(0000) GS:ffffa0cf9fa00000(0000) knlGS:0000000000000000
Jun 21 06:32:01 liberty kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 21 06:32:01 liberty kernel: CR2: 0000000002f27b6e CR3: 000000017090e000 CR4: 00000000001406f0
Jun 21 06:32:01 liberty kernel: Call Trace:
Jun 21 06:32:01 liberty kernel:  lookup_fast+0x57/0x3a0
Jun 21 06:32:01 liberty kernel:  walk_component+0x49/0x350
Jun 21 06:32:01 liberty kernel:  ? path_init+0x1c3/0x320
Jun 21 06:32:01 liberty kernel:  path_lookupat+0x4d/0x100
Jun 21 06:32:01 liberty kernel:  filename_lookup+0xb8/0x1a0
Jun 21 06:32:01 liberty kernel:  ? __check_object_size+0x100/0x19d
Jun 21 06:32:01 liberty kernel:  ? strncpy_from_user+0x4d/0x170
Jun 21 06:32:01 liberty kernel:  user_path_at_empty+0x36/0x40
Jun 21 06:32:01 liberty kernel:  ? user_path_at_empty+0x36/0x40
Jun 21 06:32:01 liberty kernel:  SyS_access+0xb4/0x220
Jun 21 06:32:01 liberty kernel:  entry_SYSCALL_64_fastpath+0x1a/0xa9
Jun 21 06:32:01 liberty kernel: RIP: 0033:0x7fbaa3849ba7
Jun 21 06:32:01 liberty kernel: RSP: 002b:00007ffccea59478 EFLAGS: 00000246 ORIG_RAX: 0000000000000015
Jun 21 06:32:01 liberty kernel: RAX: ffffffffffffffda RBX: 00007fbaa3b12ae0 RCX: 00007fbaa3849ba7
Jun 21 06:32:01 liberty kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00005597ab136790
Jun 21 06:32:01 liberty kernel: RBP: 00005597ab134570 R08: 0000000000000002 R09: 0000000000000001
Jun 21 06:32:01 liberty kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000002000
Jun 21 06:32:01 liberty kernel: R13: 000000000000caa0 R14: 00005597ab134560 R15: 00005597ab11b200
Jun 21 06:32:01 liberty kernel: Code: 83 e3 fe 0f 84 95 00 00 00 4c 89 f0 45 89 f2 49 89 d1 48 c1 e8 20 48 89 75 c0 49 89 fd 48 89 45 c8 eb 08 48 8b 1b 48 85 db 74 73 <44> 8b 63 fc
Jun 21 06:32:01 liberty kernel: RIP: __d_lookup_rcu+0x67/0x180 RSP: ffffb1a60ac4bc48
Jun 21 06:32:01 liberty kernel: CR2: 0000000002f27b6e
Jun 21 06:32:01 liberty kernel: ---[ end trace acd72dc7d5a5f346 ]---
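
For triage, the key fields of the oops above can be pulled out with a few regular expressions. This is only a minimal sketch: the helper name `summarize_oops` is hypothetical, and the input is a handful of lines copied from the log above with the journal prefix stripped. On x86_64, ORIG_RAX 0x15 is syscall number 21 (access), which agrees with the SyS_access frame in the call trace.

```python
import re

# A few lines copied from the oops above (journal prefix stripped).
OOPS = """\
BUG: unable to handle kernel paging request at 0000000002f27b6e
IP: __d_lookup_rcu+0x67/0x180
CPU: 0 PID: 5053 Comm: git-remote-http Tainted: G      D W       4.11.4-200.fc25.x86_64 #1
RSP: 002b:00007ffccea59478 EFLAGS: 00000246 ORIG_RAX: 0000000000000015
"""


def summarize_oops(text):
    """Extract fault address, faulting symbol, process name and syscall number."""
    fault = re.search(r"paging request at ([0-9a-f]+)", text)
    symbol = re.search(r"^IP: (\S+?)\+0x", text, re.MULTILINE)
    comm = re.search(r"Comm: (\S+)", text)
    orig_rax = re.search(r"ORIG_RAX: ([0-9a-f]+)", text)
    return {
        "fault_address": int(fault.group(1), 16),
        "symbol": symbol.group(1),
        "comm": comm.group(1),
        # ORIG_RAX holds the syscall number the process was executing.
        "syscall_nr": int(orig_rax.group(1), 16),
    }


summary = summarize_oops(OOPS)
print(summary)
```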


Version-Release number of selected component (if applicable):
[root@liberty ~]# rpm -qa jenkins*
jenkins-2.65-1.1.noarch
[root@liberty ~]# java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-b12)
OpenJDK 64-Bit Server VM (build 25.131-b12, mixed mode)
[root@liberty ~]#
[root@liberty ~]# rpm -qa git* | sort
git-2.9.4-1.fc25.x86_64
git-core-2.9.4-1.fc25.x86_64
git-core-doc-2.9.4-1.fc25.x86_64
[root@liberty ~]#
[root@liberty ~]# uname -a
Linux liberty 4.11.4-200.fc25.x86_64 #1 SMP Wed Jun 7 18:28:00 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux


How reproducible:
Always reproducible. It takes approx. 36-48 hrs for the first crash error to appear, and after 12-24 hrs of crashing the system becomes unavailable.

I've turned off Jenkins and the system became stable with no issues.

The same exact configuration is running on CentOS 7 x86_64 and no kernel issues are detected.


Steps to Reproduce:
1. Install Jenkins CI (official Jenkins repo), Git and other dependencies
2. Set Jenkins to start on boot
3. Configure Jenkins to run a CI job with Git SCM polling (see the attached job as an example)
4. Start and run Jenkins for 2+ days non-stop
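
The polling load from steps 3 and 4 could perhaps be approximated without Jenkins. The sketch below is an assumption-heavy stand-in, not what Jenkins actually runs: it assumes that polling is roughly equivalent to `git ls-remote` against an HTTP(S) remote (which spawns the git-remote-http helper), `REPO_URL` is a hypothetical placeholder, and the job count of 6 is the approximate figure given in comment 5.

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

REPO_URL = "https://github.com/example/example.git"  # hypothetical placeholder
JOBS = 6               # approximate number of polling jobs (see comment 5)
POLL_INTERVAL_S = 60   # Jenkins polls every minute


def poll_once(command):
    """Run one polling command and return its exit code."""
    result = subprocess.run(
        command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode


def poll_round(command, jobs=JOBS):
    """Start `jobs` polls at the same time, as concurrent Jenkins jobs would."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(poll_once, [command] * jobs))


def run(command, rounds, interval_s=POLL_INTERVAL_S):
    """Repeat poll rounds with a fixed pause between them."""
    for _ in range(rounds):
        poll_round(command)
        time.sleep(interval_s)
```

Calling `run(["git", "ls-remote", "--heads", REPO_URL], rounds=2 * 24 * 60)` would then poll for about two days. Whether this alone triggers the crash is untested.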

Actual results:
The system becomes unresponsive after 36-48 hrs

Expected results:
System is stable and no kernel issues are present.

Additional info:
This issue was posted on FedoraForums a few days ago. Link: http://forums.fedoraforum.org/showthread.php?t=314588
Comment 1 Roman Pavlyuk 2017-06-26 05:01 EDT
Created attachment 1291907
Kernel crash log
Comment 2 Roman Pavlyuk 2017-06-26 05:02 EDT
Created attachment 1291908
Jenkins job sample

This is the config.xml that is stored in the /var/lib/jenkins/jobs/<job_name> folder.
Comment 3 niemand 2017-06-26 09:35:02 EDT
My take on this, since Roman posted this on the official Fedora.org forum:

http://www.forums.fedoraforum.org/showpost.php?p=1789243&postcount=2

I need the following from you (Fedora developers):
[1] A precise explanation of the root cause of this problem;
[2] The fix: what exactly is the patch (as the final fix) to be applied?!

This is A MUST (you all, Fedora developers, as far as I know, are not too good with professionalism/professional handling, although you ARE paid for your efforts/fixes, so this is why I FIRMLY demand such an explanation).

Thank you,
_nobody_
Comment 4 Laura Abbott 2017-06-26 09:37:46 EDT
*** Bug 1464922 has been marked as a duplicate of this bug. ***
Comment 5 Roman Pavlyuk 2017-07-11 08:31:12 EDT
Hello niemand,

1. The exact root cause of the problem is unknown. It is assumed that there is a memory leak or memory usage bug in the 'git-remote-http' command. The command is triggered by every Jenkins job (I have approx. 6 of them) every minute, which means that the Jenkins (actually, Java) process calls 'git-remote-http' at least 6 times per minute. Maybe memory corruption happens when 2-3 or more 'git-remote-http' processes start at the same time? After approx. 36 to 48 hours of constant operation (that is, calling the command 5-6 times per minute) the first kernel exceptions start to appear (see the bug description). The crash point is always the same (__d_lookup_rcu+0x67/0x180).

I'm going to set up another box with the same configuration (F25+Jenkins) and see if the issue is widely reproducible. If it is, then future RHEL/CentOS releases might be at risk.

2. The kernel crashes stopped as soon as the Jenkins service was stopped and disabled. With Jenkins stopped, the server is very stable and no other issues have been found. I will have more details on what exact fix to apply once I spin up an experimental box.
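
As a rough back-of-the-envelope figure for point 1, the polling rate and the time to the first oops can be multiplied out. This is only arithmetic on the numbers quoted above ("at least 6 times per minute", "36 to 48 hours"); the function name is hypothetical.

```python
CALLS_PER_MINUTE = 6            # "at least 6 times per minute"
HOURS_TO_FIRST_OOPS = (36, 48)  # "approx 36 to 48 hours"


def invocations(hours, calls_per_minute=CALLS_PER_MINUTE):
    """Number of git-remote-http runs over the given number of hours."""
    return hours * 60 * calls_per_minute


low, high = (invocations(h) for h in HOURS_TO_FIRST_OOPS)
print(f"{low} to {high} git-remote-http runs before the first oops")
```

So the first oops shows up only after something on the order of ten thousand or more helper invocations, which is worth keeping in mind when sizing a reproduction run.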

Thanks,
Roman
Comment 6 niemand 2017-07-11 09:58:38 EDT
> I'm going to set up another box with the same configuration (F25+Jenkins)
> and see if the issue is widely reproducible. If it is, then future
> RHEL/CentOS releases might be at risk.

Please do so. Two identical setups producing the same results are MANY! ;-)

I would advise a next step if you pass the one above (and prove the bug): please take a slightly different configuration (F26+Jenkins) and see if the issue is reproducible there as well.

F26 should be officially released within a few minutes (10:00 AM EST)! So please update FC25 to FC26 and repeat the test. ;-)

_nobody_
Comment 7 Roman Pavlyuk 2017-10-10 05:12:28 EDT
UPDATE: Running the same Jenkins configuration inside a Docker container still causes the kernel on the host system to crash.

The container is built from a CentOS 7 image and thus has all its C libraries from that OS. (The Dockerfile is here: https://github.com/rpavlyuk/docker-svarog-ci/tree/master/docker-svarog-ci)

Crash log:
[585706.238814] BUG: unable to handle kernel paging request at 00000000059fe23d
[585706.239718] IP: __d_lookup_rcu+0x67/0x180
[585706.240592] PGD 1e9578067
[585706.240593] P4D 1e9578067
[585706.241461] PUD 401330067
[585706.242315] PMD 0

[585706.244782] Oops: 0000 [#1] SMP
[585706.245572] Modules linked in: veth xt_nat ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc dm_thin_pool dm_persistent_data dm_bio_prison loop sunrpc vfat fat intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp xfs libcrc32c kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel intel_cstate snd_hda_codec_via eeepc_wmi asus_wmi sparse_keymap rfkill iTCO_wdt intel_uncore snd_hda_codec_generic iTCO_vendor_support intel_rapl_perf snd_hda_intel snd_hda_codec raid1 snd_hda_core r8169 i2c_i801 mii snd_hwdep snd_seq snd_seq_device wmi lpc_ich shpchp snd_pcm snd_timer snd soundcore ie31200_edac mei_me mei tpm_tis tpm_tis_core tpm binfmt_misc i915 i2c_algo_bit
[585706.251665]  drm_kms_helper drm sata_sil24 serio_raw ata_generic pata_acpi video
[585706.252562] CPU: 1 PID: 30264 Comm: git-remote-http Not tainted 4.12.13-200.fc25.x86_64 #1
[585706.253461] Hardware name: System manufacturer System Product Name/P8H61-MX R2.0, BIOS 1109 06/20/2014
[585706.254375] task: ffff9b6c8bf68000 task.stack: ffffbf5e49cd8000
[585706.255292] RIP: 0010:__d_lookup_rcu+0x67/0x180
[585706.256200] RSP: 0018:ffffbf5e49cdbbd8 EFLAGS: 00010206
[585706.257117] RAX: 000000000000001b RBX: 00000000059fe241 RCX: ffffbf5e4001b000
[585706.258025] RDX: ffffbf5e49cdbc5c RSI: ffffbf5e49cdbd80 RDI: ffff9b6fc5f38480
[585706.258918] RBP: ffffbf5e49cdbc30 R08: ffff9b6cba94e3c0 R09: ffffbf5e49cdbc5c
[585706.259807] R10: 000000001a45b123 R11: 0000001b00000000 R12: 0000000000000000
[585706.260702] R13: ffff9b6fc5f38480 R14: 0000001b1a45b123 R15: ffff9b6e2207302b
[585706.261598] FS:  00007fbe39308840(0000) GS:ffff9b709fb00000(0000) knlGS:0000000000000000
[585706.262505] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[585706.263412] CR2: 00000000059fe23d CR3: 00000002ff7d1000 CR4: 00000000001406e0
[585706.264331] Call Trace:
[585706.265228]  lookup_fast+0x53/0x2f0
[585706.266100]  walk_component+0x49/0x350
[585706.266964]  ? dput+0x34/0x1e0
[585706.267823]  path_lookupat+0x73/0x220
[585706.268674]  filename_lookup+0xb8/0x1a0
[585706.269521]  ? __seccomp_filter+0x37/0x250
[585706.270362]  ? set_next_entity+0xd9/0x210
[585706.271204]  ? __check_object_size+0xb3/0x190
[585706.272024]  user_path_at_empty+0x36/0x40
[585706.272822]  ? user_path_at_empty+0x36/0x40
[585706.273614]  SyS_access+0xb4/0x220
[585706.274400]  do_syscall_64+0x67/0x150
[585706.275181]  entry_SYSCALL64_slow_path+0x25/0x25
[585706.275963] RIP: 0033:0x7fbe382fd897
[585706.276736] RSP: 002b:00007ffc83a9bf78 EFLAGS: 00000246 ORIG_RAX: 0000000000000015
[585706.277532] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fbe382fd897
[585706.278316] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001164070
[585706.279077] RBP: 0000000001164070 R08: 00007fbe33827a48 R09: 0000000000000002
[585706.279835] R10: 000000000000002e R11: 0000000000000246 R12: 0000000000000021
[585706.280588] R13: 0000000000001a44 R14: 000000000116407f R15: 0000000001161db8
[585706.281350] Code: 83 e3 fe 0f 84 95 00 00 00 4c 89 f0 45 89 f2 49 89 d1 48 c1 e8 20 48 89 75 c0 49 89 fd 48 89 45 c8 eb 08 48 8b 1b 48 85 db 74 73 <44> 8b 63 fc 4c 3b 6b 10 75 ee 48 83 7b 08 00 74 e7 41 83 e4 fe
[585706.283007] RIP: __d_lookup_rcu+0x67/0x180 RSP: ffffbf5e49cdbbd8
[585706.283852] CR2: 00000000059fe23d
[585706.284749] ---[ end trace eab31d53f53312f2 ]---
[585706.342862] BUG: unable to handle kernel paging request at 00000000059fe23d
[585706.343733] IP: __d_lookup_rcu+0x67/0x180
[585706.344562] PGD 1e5701067
[585706.344562] P4D 1e5701067
[585706.345364] PUD 3e4094067
[585706.346139] PMD 0
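
Both oopses in this report (4.11.4 and 4.12.13) stop at the same instruction. The bytes marked `<44> 8b 63 fc` in the Code: line decode to `mov -0x4(%rbx),%r12d`, so the faulting address in CR2 should be RBX - 4. The sketch below checks that relation with register values copied from the two traces; the helper names are hypothetical. It also shows that RBX is not in the upper-half address range where kernel pointers live, which would be consistent with a corrupted pointer being followed during the lookup, although that is an inference and not a confirmed root cause.

```python
# Register values copied from the two oopses in this report.
OOPSES = {
    "4.11.4-200.fc25": {"cr2": 0x0000000002F27B6E, "rbx": 0x0000000002F27B72},
    "4.12.13-200.fc25": {"cr2": 0x00000000059FE23D, "rbx": 0x00000000059FE241},
}

# The "<44> 8b 63 fc" part of the Code: line, i.e. mov -0x4(%rbx),%r12d.
FAULTING_BYTES = bytes.fromhex("448b63fc")


def disp8(byte):
    """Interpret one byte as a signed 8-bit displacement."""
    return byte - 256 if byte >= 128 else byte


def is_kernel_address(addr):
    """True for x86_64 upper-half addresses, where kernel pointers live."""
    return addr >= 0xFFFF800000000000


offset = disp8(FAULTING_BYTES[-1])  # last byte 0xfc is the displacement
for kernel, regs in OOPSES.items():
    matches = regs["rbx"] + offset == regs["cr2"]
    print(kernel, "CR2 == RBX%+d:" % offset, matches,
          "| RBX looks like a kernel pointer:", is_kernel_address(regs["rbx"]))
```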
Comment 8 Fedora End Of Life 2017-11-16 13:49:16 EST
This message is a reminder that Fedora 25 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 25. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '25'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 25 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.
