Bug 1490467

Summary:

systemd[1]: rpc-gssd.service: main process exited, code=killed, status=6/ABRT

Product:

Red Hat Enterprise Linux 7

Reporter:

Orion Poplawski <orion>

Component:

nfs-utils

Assignee:

Steve Dickson <steved>

Status:

CLOSED WONTFIX

QA Contact:

Yongcheng Yang <yoyang>

Severity:

high

Priority:

unspecified

Version:

7.5

CC:

baumanmo, fsorenso, orion, rbergant, rharwood, ssorce, steved, xzhou, yoyang

Keywords:

Reopened

Target Release:

---

Hardware:

x86_64

OS:

Linux

Last Closed:

2020-11-11 21:55:34 UTC

Type:

Bug

Attachments:

Description	Flags
/var/log/messages from around the time of the hang	none
core_backtrace	none

Description Orion Poplawski 2017-09-11 16:34:23 UTC

Created attachment 1324518
/var/log/messages from around the time of the hang

Description of problem:

I'm seeing trouble with NFS mounts as described in bug #1466944.

In an effort to debug I'm running a version of gssproxy from Fedora with the timeout patch applied and compiled on EL7:

gssproxy-0.7.0-14.el7.nwra.1.x86_64

It appears though that after a timeout occurs, rpc.gssd dies.

Sep 11 10:25:24 barry gssproxy: [2017/09/11 16:25:24]: [status] Handling query reply: 0x7fcf3000ac90 (176)
Sep 11 10:25:24 barry gssproxy: [CID 12][2017/09/11 16:25:24]: [status] Returned buffer 6 (GSSX_ACQUIRE_CRED) from [0x1080c40 (116)]: [0x7fcf28000bf0 (176)]
Sep 11 10:25:24 barry gssproxy: [CID 12][2017/09/11 16:25:24]: [status] Handling query output: 0x7fcf28000bf0 (176)
Sep 11 10:25:24 barry gssproxy: [2017/09/11 16:25:24]: [status] Handling query reply: 0x7fcf28000bf0 (176)
Sep 11 10:25:24 barry gssproxy: [2017/09/11 16:25:24]: [status] Sending data: 0x7fcf28000bf0 (176)
Sep 11 10:25:24 barry gssproxy: [2017/09/11 16:25:24]: [status] Sending data [0x7fcf28000bf0 (176)]: successful write of 176
Sep 11 10:25:24 barry rpc.gssd[823]: creating tcp client for server earth.cora.nwra.com
Sep 11 10:25:24 barry rpc.gssd[823]: creating context with server nfs.nwra.com
Sep 11 10:25:25 barry rpc.gssd[823]: doing downcall: lifetime_rec=35438 acceptor=nfs.nwra.com
Sep 11 10:25:54 barry gssproxy: [2017/09/11 16:25:54]: Client connected (fd = 12)[2017/09/11 16:25:54]:  (pid = 823) (uid = 30657) (gid = 30657)[2017/09/11 16:25:54]:  (context = system_u:system_r:gssd_t:s0)[2017/09/11 16:25:54]:
Sep 11 10:25:54 barry systemd: rpc-gssd.service: main process exited, code=killed, status=6/ABRT
Sep 11 10:25:54 barry systemd: Unit rpc-gssd.service entered failed state.
Sep 11 10:25:54 barry systemd: rpc-gssd.service failed.

Version-Release number of selected component (if applicable):
nfs-utils-1.3.0-0.48.el7.x86_64

How reproducible:
Seen a few times now.

Comment 2 Steve Dickson 2017-09-11 18:25:57 UTC

Just curious... If you take gssproxy out of the picture
by setting GSS_USE_PROXY="no" in /etc/sysconfig/nfs, does
the abort still happen?

Comment 3 Orion Poplawski 2017-09-15 20:15:02 UTC

It may be too early to tell, but early testing seems to indicate that setting GSS_USE_PROXY=no prevents the crash.  Unfortunately, I also cannot reproduce the crash with gdb attached to rpc.gssd.

Comment 4 Simo Sorce 2017-09-18 15:05:52 UTC

Did abrt catch the rpc.gssd stacktrace?
I would like to take a look at it to see where it blows up.

Comment 5 Orion Poplawski 2017-09-18 15:19:37 UTC

No, it didn't.  I don't know why.

Comment 6 Orion Poplawski 2018-07-25 16:43:37 UTC

abrt-hook-ccpp[7829]: Process 1283 (rpc.gssd) of user 0 killed by SIGABRT - dumping core
abrt-hook-ccpp[7829]: Failed to create core_backtrace: waitpid failed: No child processes

Not sure why it isn't catching the coredump.

Comment 7 Orion Poplawski 2018-12-26 19:03:57 UTC

Still present with nfs-utils-1.3.0-0.61.el7.x86_64, but still not producing a coredump.

Comment 8 Orion Poplawski 2019-01-25 16:01:21 UTC

Created attachment 1523541
core_backtrace

I can't get a good backtrace with gdb on the coredump, but this is what abrtd collected.

              , {   "address": 139826512251128
                ,   "build_id": "95cdabda24bcd671d2876c8d7c5d6411902a8566"
                ,   "build_id_offset": 227576
                ,   "function_name": "abort"
                ,   "file_name": "/lib64/libc.so.6"
                }
              , {   "address": 139826512518343
                ,   "build_id": "95cdabda24bcd671d2876c8d7c5d6411902a8566"
                ,   "build_id_offset": 494791
                ,   "function_name": "__libc_message"
                ,   "file_name": "/lib64/libc.so.6"
                }
              , {   "address": 139826512553001
                ,   "build_id": "95cdabda24bcd671d2876c8d7c5d6411902a8566"
                ,   "build_id_offset": 529449
                ,   "function_name": "_int_free"
                ,   "file_name": "/lib64/libc.so.6"
                }
              , {   "address": 94794261335291
                ,   "build_id": "5b24daf020ad3925c1805d79c7152bbdaa7b2715"
                ,   "build_id_offset": 40187
                ,   "function_name": "gssd_get_single_krb5_cred.constprop.4"
                ,   "file_name": "/usr/sbin/rpc.gssd"
                }
              , {   "address": 94794261336012
                ,   "build_id": "5b24daf020ad3925c1805d79c7152bbdaa7b2715"
                ,   "build_id_offset": 40908
                ,   "function_name": "gssd_refresh_krb5_machine_credential"
                ,   "file_name": "/usr/sbin/rpc.gssd"
                }
              , {   "address": 94794261324896
                ,   "build_id": "5b24daf020ad3925c1805d79c7152bbdaa7b2715"
                ,   "build_id_offset": 29792
                ,   "function_name": "krb5_use_machine_creds"
                ,   "file_name": "/usr/sbin/rpc.gssd"
                }

Comment 9 Orion Poplawski 2019-02-05 18:49:26 UTC

Finally seem to have a viable coredump - looks like we have memory corruption:

(gdb) bt
#0  0x00007f9e03947207 in __GI_raise (sig=sig@entry=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007f9e039488f8 in __GI_abort () at abort.c:90
#2  0x00007f9e03989d27 in __libc_message (do_abort=do_abort@entry=2,
    fmt=fmt@entry=0x7f9e03a9b678 "*** Error in `%s': %s: 0x%s ***\n")
    at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
#3  0x00007f9e03992489 in malloc_printerr (ar_ptr=0x7f9dfc000020, ptr=<optimized out>,
    str=0x7f9e03a9b738 "double free or corruption (fasttop)", action=3) at malloc.c:5004
#4  _int_free (av=0x7f9dfc000020, p=<optimized out>, have_lock=0) at malloc.c:3843
#5  0x0000557c2be4acfb in gssd_get_single_krb5_cred (context=0x7f9dfc0045e0, kt=<optimized out>,
    ple=ple@entry=0x7f9dfc005fa0, nocache=0) at krb5_util.c:427
#6  0x0000557c2be4afcc in gssd_refresh_krb5_machine_credential (
    hostname=0x557c2c87da00 "csdisk4ib.cora.nwra.com", ple=0x7f9dfc005fa0, ple@entry=0x0,
    service=service@entry=0x557c2c892410 "*") at krb5_util.c:1302
#7  0x0000557c2be48460 in krb5_use_machine_creds (clp=clp@entry=0x557c2c87de40, uid=uid@entry=0,
    tgtname=tgtname@entry=0x0, service=service@entry=0x557c2c892410 "*",
    rpc_clnt=rpc_clnt@entry=0x7f9e00f4acf0) at gssd_proc.c:546
#8  0x0000557c2be4868d in process_krb5_upcall (clp=clp@entry=0x557c2c87de40, uid=uid@entry=0,
    fd=10, tgtname=tgtname@entry=0x0, service=service@entry=0x557c2c892410 "*") at gssd_proc.c:655
#9  0x0000557c2be48ed9 in handle_gssd_upcall (info=0x557c2c8923f0) at gssd_proc.c:814
#10 0x00007f9e03ce5dd5 in start_thread (arg=0x7f9e00f4b700) at pthread_create.c:307
#11 0x00007f9e03a0eead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) up 5
#5  0x0000557c2be4acfb in gssd_get_single_krb5_cred (context=0x7f9dfc0045e0, kt=<optimized out>,
    ple=ple@entry=0x7f9dfc005fa0, nocache=0) at krb5_util.c:427
427                     free(ple->ccname);
(gdb) list
422                     cache_type,
423                     ccachesearch[0], GSSD_DEFAULT_CRED_PREFIX,
424                     GSSD_DEFAULT_MACHINE_CRED_SUFFIX, ple->realm);
425             ple->endtime = my_creds.times.endtime;
426             if (ple->ccname != NULL)
427                     free(ple->ccname);
428             ple->ccname = strdup(cc_name);
429             if (ple->ccname == NULL) {
430                     printerr(0, "ERROR: no storage to duplicate credentials "
431                                 "cache name '%s'\n", cc_name);
(gdb) print *ple
$1 = {next = 0x0, princ = 0x7f9dfc006460,
  ccname = 0x7f9df4006060 "FILE:/tmp/krb5ccmachine_NWRA.COM", realm = 0x7f9dfc0061a0 "NWRA.COM",
  endtime = 1549433693}
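
A note on one possible reading of this dump, which nothing in this report confirms: the upcall runs in its own thread (frame #10 is start_thread), and in the printed *ple the ccname pointer (0x7f9df4006060) lies in a different region than the arena being freed into (av=0x7f9dfc000020). That would be consistent with a second thread having replaced ple->ccname while this thread was between the check on line 426 and the free on line 427. The sketch below is not the nfs-utils code and not a proposed patch. It only illustrates, with hypothetical names (ple_sketch, ple_set_ccname, run_ccname_sketch) and an assumed per-entry lock, how a swap under a mutex ensures each old string is freed exactly once.

```c
#define _POSIX_C_SOURCE 200809L  /* for strdup() under strict -std= modes */
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, reduced stand-in for the list entry printed above.
 * The lock member is an ASSUMPTION for this sketch, not part of nfs-utils. */
struct ple_sketch {
    char *ccname;
    pthread_mutex_t lock;
};

/* Replace ple->ccname with a copy of cc_name.
 * The pointer swap happens under the lock, so each old pointer is handed
 * to exactly one caller, which then frees it outside the lock.
 * Returns 0 on success, -1 if the copy could not be allocated. */
int ple_set_ccname(struct ple_sketch *ple, const char *cc_name)
{
    char *copy = strdup(cc_name);
    char *old;

    if (copy == NULL)
        return -1;

    pthread_mutex_lock(&ple->lock);
    old = ple->ccname;
    ple->ccname = copy;
    pthread_mutex_unlock(&ple->lock);

    free(old);  /* free(NULL) is a no-op */
    return 0;
}

#define SKETCH_THREADS 4
#define SKETCH_ROUNDS  10000
#define SKETCH_NAME    "FILE:/tmp/krb5ccmachine_NWRA.COM"

/* Thread body: repeatedly refresh the same entry, as concurrent upcalls might. */
static void *updater(void *arg)
{
    struct ple_sketch *ple = arg;

    for (int i = 0; i < SKETCH_ROUNDS; i++)
        if (ple_set_ccname(ple, SKETCH_NAME) != 0)
            return (void *)1;
    return NULL;
}

/* Run several updaters against one entry (build with -pthread).
 * Returns 0 if every update succeeded and the final name is intact. */
int run_ccname_sketch(void)
{
    struct ple_sketch ple;
    pthread_t tid[SKETCH_THREADS];
    int started = 0;
    int failed = 0;

    ple.ccname = NULL;
    if (pthread_mutex_init(&ple.lock, NULL) != 0)
        return -1;

    for (int i = 0; i < SKETCH_THREADS; i++) {
        if (pthread_create(&tid[i], NULL, updater, &ple) != 0) {
            failed = 1;
            break;
        }
        started++;
    }
    for (int i = 0; i < started; i++) {
        void *res = NULL;

        if (pthread_join(tid[i], &res) != 0 || res != NULL)
            failed = 1;
    }
    if (ple.ccname == NULL || strcmp(ple.ccname, SKETCH_NAME) != 0)
        failed = 1;

    free(ple.ccname);
    pthread_mutex_destroy(&ple.lock);
    return failed ? -1 : 0;
}
```

In a real program, any code that reads ccname would need to take the same lock (or copy the string under it), otherwise a reader could still see a pointer that another thread is about to free.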

Comment 10 Simo Sorce 2019-02-07 14:01:05 UTC

Robbie,
I seem to recall some recent fixes with ccaches and double frees; can you take a look at this one and see if it is related?

Comment 11 Robbie Harwood 2019-02-08 19:57:22 UTC

Unless you're using a MEMORY ccache, it wouldn't be related to all that.  (And that stuff only matters for the case of manipulating multiple handles to the same one anyway.)  But if you wanted to be sure, you can try krb5-1.15.1-37 (7.6.z).

Unfortunately corruption issues are going to be nigh-impossible to debug without a trace from under valgrind (with debug symbols installed).

Comment 12 Simo Sorce 2019-02-11 16:55:53 UTC

Uhm, looking more closely at the backtrace, this is not a libkrb5 call; this is still pure gssd code.

Steve, sounds like this is in your court.

Comment 19 Chris Williams 2020-11-11 21:55:34 UTC

Red Hat Enterprise Linux 7 shipped its final minor release on September 29th, 2020. 7.9 was the last minor release scheduled for RHEL 7.
From initial triage, it does not appear that the remaining Bugzillas meet the inclusion criteria for Maintenance Phase 2, so they will now be closed.

From the RHEL life cycle page:
https://access.redhat.com/support/policy/updates/errata#Maintenance_Support_2_Phase
"During Maintenance Support 2 Phase for Red Hat Enterprise Linux version 7, Red Hat defined Critical and Important impact Security Advisories (RHSAs) and selected (at Red Hat discretion) Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available."

If this BZ was closed in error and meets the above criteria, please re-open it, flag it for 7.9.z, provide suitable business and technical justifications, and follow the process for Accelerated Fixes:
https://source.redhat.com/groups/public/pnt-cxno/pnt_customer_experience_and_operations_wiki/support_delivery_accelerated_fix_release_handbook  

Feature Requests can be re-opened and moved to RHEL 8 if the desired functionality is not already present in the product.

Please reach out to the applicable Product Experience Engineer[0] if you have any questions or concerns.  

[0] https://bugzilla.redhat.com/page.cgi?id=agile_component_mapping.html&product=Red+Hat+Enterprise+Linux+7