Bug 1464923 - git-remote-http causes kernel crash [NEEDINFO]
Status: NEW
Product: Fedora
Classification: Fedora
Component: kernel
Version: 25
Hardware: x86_64 Linux
Priority: unspecified
Severity: unspecified
Assigned To: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Duplicates: 1464922

Reported: 2017-06-26 05:00 EDT by Roman Pavlyuk
Modified: 2017-07-11 09:58 EDT (History)
Type: Bug
nobodyless: needinfo? (extras-qa)


Attachments
Kernel crash log (3.52 KB, text/plain)
2017-06-26 05:01 EDT, Roman Pavlyuk
Jenkins job sample (1.83 KB, application/xml)
2017-06-26 05:02 EDT, Roman Pavlyuk

Description Roman Pavlyuk 2017-06-26 05:00:08 EDT
Description of problem:
I have a Jenkins CI setup with a set of CI jobs that use SCM polling. Every minute Jenkins checks Git (actually, GitHub) for changes and, if there are any, starts the build job.

However, after some time (approx. 2 days) the system becomes slow, then unresponsive, and finally crashes altogether. The dmesg log is filled with the following messages:

Jun 21 06:32:01 liberty kernel: BUG: unable to handle kernel paging request at 0000000002f27b6e
Jun 21 06:32:01 liberty kernel: IP: __d_lookup_rcu+0x67/0x180
Jun 21 06:32:01 liberty kernel: PGD 0
Jun 21 06:32:01 liberty kernel:
Jun 21 06:32:01 liberty kernel: Oops: 0000 [#8] SMP
Jun 21 06:32:01 liberty kernel: Modules linked in: 8021q garp mrp veth xt_nat ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrt
Jun 21 06:32:01 liberty kernel:  tpm binfmt_misc i915 i2c_algo_bit drm_kms_helper serio_raw drm ata_generic pata_acpi sata_sil24 video
Jun 21 06:32:01 liberty kernel: CPU: 0 PID: 5053 Comm: git-remote-http Tainted: G      D W       4.11.4-200.fc25.x86_64 #1
Jun 21 06:32:01 liberty kernel: Hardware name: System manufacturer System Product Name/P8H61-MX R2.0, BIOS 1109 06/20/2014
Jun 21 06:32:01 liberty kernel: task: ffffa0ced20fa480 task.stack: ffffb1a60ac48000
Jun 21 06:32:01 liberty kernel: RIP: 0010:__d_lookup_rcu+0x67/0x180
Jun 21 06:32:01 liberty kernel: RSP: 0018:ffffb1a60ac4bc48 EFLAGS: 00010206
Jun 21 06:32:01 liberty kernel: RAX: 000000000000001b RBX: 0000000002f27b72 RCX: ffffb1a60001b000
Jun 21 06:32:01 liberty kernel: RDX: ffffb1a60ac4bcc4 RSI: ffffb1a60ac4bdb0 RDI: ffffa0ced2caec00
Jun 21 06:32:01 liberty kernel: RBP: ffffb1a60ac4bca0 R08: ffffa0cba6fdbcc0 R09: ffffb1a60ac4bcc4
Jun 21 06:32:01 liberty kernel: R10: 00000000dfba2be3 R11: 0000001b00000000 R12: 0000000000000000
Jun 21 06:32:01 liberty kernel: R13: ffffa0ced2caec00 R14: 0000001bdfba2be3 R15: ffffa0cd0f87102b
Jun 21 06:32:01 liberty kernel: FS:  00007fbaa4a631c0(0000) GS:ffffa0cf9fa00000(0000) knlGS:0000000000000000
Jun 21 06:32:01 liberty kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 21 06:32:01 liberty kernel: CR2: 0000000002f27b6e CR3: 000000017090e000 CR4: 00000000001406f0
Jun 21 06:32:01 liberty kernel: Call Trace:
Jun 21 06:32:01 liberty kernel:  lookup_fast+0x57/0x3a0
Jun 21 06:32:01 liberty kernel:  walk_component+0x49/0x350
Jun 21 06:32:01 liberty kernel:  ? path_init+0x1c3/0x320
Jun 21 06:32:01 liberty kernel:  path_lookupat+0x4d/0x100
Jun 21 06:32:01 liberty kernel:  filename_lookup+0xb8/0x1a0
Jun 21 06:32:01 liberty kernel:  ? __check_object_size+0x100/0x19d
Jun 21 06:32:01 liberty kernel:  ? strncpy_from_user+0x4d/0x170
Jun 21 06:32:01 liberty kernel:  user_path_at_empty+0x36/0x40
Jun 21 06:32:01 liberty kernel:  ? user_path_at_empty+0x36/0x40
Jun 21 06:32:01 liberty kernel:  SyS_access+0xb4/0x220
Jun 21 06:32:01 liberty kernel:  entry_SYSCALL_64_fastpath+0x1a/0xa9
Jun 21 06:32:01 liberty kernel: RIP: 0033:0x7fbaa3849ba7
Jun 21 06:32:01 liberty kernel: RSP: 002b:00007ffccea59478 EFLAGS: 00000246 ORIG_RAX: 0000000000000015
Jun 21 06:32:01 liberty kernel: RAX: ffffffffffffffda RBX: 00007fbaa3b12ae0 RCX: 00007fbaa3849ba7
Jun 21 06:32:01 liberty kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00005597ab136790
Jun 21 06:32:01 liberty kernel: RBP: 00005597ab134570 R08: 0000000000000002 R09: 0000000000000001
Jun 21 06:32:01 liberty kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000002000
Jun 21 06:32:01 liberty kernel: R13: 000000000000caa0 R14: 00005597ab134560 R15: 00005597ab11b200
Jun 21 06:32:01 liberty kernel: Code: 83 e3 fe 0f 84 95 00 00 00 4c 89 f0 45 89 f2 49 89 d1 48 c1 e8 20 48 89 75 c0 49 89 fd 48 89 45 c8 eb 08 48 8b 1b 48 85 db 74 73 <44> 8b 63 fc
Jun 21 06:32:01 liberty kernel: RIP: __d_lookup_rcu+0x67/0x180 RSP: ffffb1a60ac4bc48
Jun 21 06:32:01 liberty kernel: CR2: 0000000002f27b6e
Jun 21 06:32:01 liberty kernel: ---[ end trace acd72dc7d5a5f346 ]---
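
For triage, the key fields of the oops above can be pulled out mechanically. The following is a minimal Python sketch and not part of the original report; the function name `parse_oopses` and the embedded sample lines are illustrative assumptions. It records the oops counter (`[#8]` means this is the eighth oops since boot, which is consistent with the `D` taint flag, so earlier oopses precede this one), the process name, and the faulting symbol. It also checks that CR2 equals RBX - 4, which is the address that the marked instruction bytes `<44> 8b 63 fc` (a 32-bit load from RBX - 4) would read.

```python
import re

# First-occurrence patterns for the fields of interest within one oops.
# "RIP: 0010:" restricts the match to the kernel-mode instruction pointer.
FIELD_PATTERNS = {
    "pid": re.compile(r"\bPID: (\d+)"),
    "comm": re.compile(r"\bComm: (\S+)"),
    "rip": re.compile(r"\bRIP: 0010:(\S+)"),
    "rbx": re.compile(r"\bRBX: ([0-9a-f]{16})"),
    "cr2": re.compile(r"\bCR2: ([0-9a-f]{16})"),
}
OOPS_RE = re.compile(r"Oops: ([0-9a-f]{4}) \[#(\d+)\]")


def parse_oopses(lines):
    """Return one dict per 'Oops:' line found in an iterable of log lines."""
    records = []
    current = None
    for line in lines:
        start = OOPS_RE.search(line)
        if start:
            # A new oops begins; later fields are attached to this record.
            current = {"error_code": start[1], "count": int(start[2])}
            records.append(current)
            continue
        if current is None:
            continue
        for key, pattern in FIELD_PATTERNS.items():
            if key not in current:  # keep only the first value per oops
                found = pattern.search(line)
                if found:
                    current[key] = found[1]
    return records


# A few lines copied from the log above, used here only as sample input.
SAMPLE = """\
Jun 21 06:32:01 liberty kernel: Oops: 0000 [#8] SMP
Jun 21 06:32:01 liberty kernel: CPU: 0 PID: 5053 Comm: git-remote-http Tainted: G      D W       4.11.4-200.fc25.x86_64 #1
Jun 21 06:32:01 liberty kernel: RIP: 0010:__d_lookup_rcu+0x67/0x180
Jun 21 06:32:01 liberty kernel: RAX: 000000000000001b RBX: 0000000002f27b72 RCX: ffffb1a60001b000
Jun 21 06:32:01 liberty kernel: CR2: 0000000002f27b6e CR3: 000000017090e000 CR4: 00000000001406f0
"""

if __name__ == "__main__":
    for rec in parse_oopses(SAMPLE.splitlines()):
        offset = int(rec["rbx"], 16) - int(rec["cr2"], 16)
        print(rec["count"], rec["comm"], rec["rip"], "RBX - CR2 =", offset)
```

Feeding it a complete journal dump (for example the attached kernel crash log) would show whether every oops faults at the same symbol and in the same process.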


Version-Release number of selected component (if applicable):
[root@liberty ~]# rpm -qa jenkins*
jenkins-2.65-1.1.noarch
[root@liberty ~]# java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-b12)
OpenJDK 64-Bit Server VM (build 25.131-b12, mixed mode)
[root@liberty ~]#
[root@liberty ~]# rpm -qa git* | sort
git-2.9.4-1.fc25.x86_64
git-core-2.9.4-1.fc25.x86_64
git-core-doc-2.9.4-1.fc25.x86_64
[root@liberty ~]#
[root@liberty ~]# uname -a
Linux liberty 4.11.4-200.fc25.x86_64 #1 SMP Wed Jun 7 18:28:00 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux


How reproducible:
Always reproducible. It takes approx. 36-48 hrs for the first crash to appear, and after a further 12-24 hrs of crashes the system becomes unavailable.

I've turned off Jenkins and the system became stable with no issues.

The same exact configuration is running on CentOS 7 x86_64 and no kernel issues are detected.


Steps to Reproduce:
1. Install Jenkins CI (official Jenkins repo), Git and other dependencies
2. Set Jenkins to start on boot
3. Configure Jenkins to run a CI job with Git SCM polling (see the attached job as an example)
4. Start and run Jenkins for 2+ days non-stop
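
To narrow the problem down, the polling load might be approximated without Jenkins. The following Python sketch is not part of the original report. It assumes that each poll boils down to a `git ls-remote` against an HTTP(S) remote, which makes git start its `git-remote-http`/`git-remote-https` helper; whether the Jenkins jobs in question poll exactly this way is an assumption. The function names, the default of 6 parallel jobs, and the 60-second interval are illustrative values taken from the description, and the repository URL is a placeholder.

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor


def poll_once(cmd):
    """Run one polling command, discard its output, return the exit code."""
    result = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode


def poll_round(cmd, jobs):
    """Start `jobs` copies of `cmd` concurrently and collect exit codes."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(poll_once, [cmd] * jobs))


def run(cmd, jobs=6, interval=60.0, rounds=None):
    """Repeat poll_round every `interval` seconds; rounds=None runs forever."""
    done = 0
    while rounds is None or done < rounds:
        started = time.monotonic()
        codes = poll_round(cmd, jobs)
        done += 1
        print(f"round {done}: exit codes {codes}", flush=True)
        if rounds is not None and done >= rounds:
            break
        # Sleep for the remainder of the interval, if any is left.
        time.sleep(max(0.0, interval - (time.monotonic() - started)))
    return done


# Example call (placeholder URL, replace with a real HTTP(S) remote):
# run(["git", "ls-remote", "--heads", "https://github.com/<owner>/<repo>.git"])
```

If the oops still appears with only this loop running, Jenkins and Java can be ruled out as contributors; if it does not, the polling alone is probably not sufficient to trigger it.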

Actual results:
The system becomes unresponsive after 36-48 hrs

Expected results:
System is stable and no kernel issues are present.

Additional info:
This issue was posted on FedoraForum a few days ago. Link: http://forums.fedoraforum.org/showthread.php?t=314588
Comment 1 Roman Pavlyuk 2017-06-26 05:01 EDT
Created attachment 1291907
Kernel crash log
Comment 2 Roman Pavlyuk 2017-06-26 05:02 EDT
Created attachment 1291908
Jenkins job sample

This is the config.xml that is stored in the /var/lib/jenkins/jobs/<job_name> folder
Comment 3 niemand 2017-06-26 09:35:02 EDT
My take on this, since Roman posted this on the official Fedora.org forum:

http://www.forums.fedoraforum.org/showpost.php?p=1789243&postcount=2

I need the following from you (Fedora developers):
[1] A precise explanation of the root cause of this problem;
[2] The fix: what exactly is the patch (as the final fix) to be applied?

This is A MUST (you all, Fedora developers, as far as I know, are not too good with professionalism/professional handling, although you ARE paid for your efforts/fixes, so this is why I FIRMLY demand such an explanation).

Thank you,
_nobody_
Comment 4 Laura Abbott 2017-06-26 09:37:46 EDT
*** Bug 1464922 has been marked as a duplicate of this bug. ***
Comment 5 Roman Pavlyuk 2017-07-11 08:31:12 EDT
Hello niemand,

1. The exact root cause of the problem is unknown. It is assumed that there is a memory leak or memory usage bug in the 'git-remote-http' command. The command is triggered by every Jenkins job (I have approx. 6 of them) every minute, which means that the Jenkins (actually, Java) process calls 'git-remote-http' at least 6 times per minute. Maybe memory corruption happens when 2-3 or more 'git-remote-http' processes start at the same time? After approx. 36 to 48 hours of constant operation (that is, calling the command 5-6 times per minute) the first kernel exceptions start to appear (see the bug description). The stop point is always the same (__d_lookup_rcu+0x67/0x180).

I'm going to set up another box with the same configuration (F25+Jenkins) and will see if the issue is widely reproducible. Because if it is, then a future RHEL/CentOS release might be at risk.

2. The kernel crashes stopped as soon as the Jenkins service was stopped and disabled. Since then, the server has been very stable and no other issues have been found. I will have more details on what exact fix to apply once I spin up an experimental box.

Thanks,
Roman
Comment 6 niemand 2017-07-11 09:58:38 EDT
> I'm going to set up another box with the same configuration (F25+Jenkins)
> and will see if the issue is widely reproducible. Because if it is, then
> a future RHEL/CentOS release might be at risk.

Please do so. Two identical setups producing the same results are MANY! ;-)

I would advise a next step for you, if you pass the one above (and prove the bug): please take a slightly different configuration (F26+Jenkins) and see if the issue is also reproducible there.

F26 should be officially released within a few minutes (10:00 AM EST)! So, please, update F25 to F26 and repeat the test. ;-)

_nobody_
