RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets there. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against the components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. Failing that, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of the form "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that the bug has been migrated.
Bug 1490467 - systemd[1]: rpc-gssd.service: main process exited, code=killed, status=6/ABRT
Summary: systemd[1]: rpc-gssd.service: main process exited, code=killed, status=6/ABRT
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: nfs-utils
Version: 7.5
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Steve Dickson
QA Contact: Yongcheng Yang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-09-11 16:34 UTC by Orion Poplawski
Modified: 2023-12-15 15:58 UTC (History)
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-11 21:55:34 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
/var/log/messages from around the time of the hang (54.64 KB, text/plain)
2017-09-11 16:34 UTC, Orion Poplawski
core_backtrace (9.23 KB, text/plain)
2019-01-25 16:01 UTC, Orion Poplawski

Description Orion Poplawski 2017-09-11 16:34:23 UTC
Created attachment 1324518 [details]
/var/log/messages from around the time of the hang

Description of problem:

I'm seeing trouble with nfs mounts as described in bug #1466944

In an effort to debug I'm running a version of gssproxy from Fedora with the timeout patch applied and compiled on EL7:

gssproxy-0.7.0-14.el7.nwra.1.x86_64

It appears, though, that after a timeout occurs, rpc.gssd dies.

Sep 11 10:25:24 barry gssproxy: [2017/09/11 16:25:24]: [status] Handling query reply: 0x7fcf3000ac90 (176)
Sep 11 10:25:24 barry gssproxy: [CID 12][2017/09/11 16:25:24]: [status] Returned buffer 6 (GSSX_ACQUIRE_CRED) from [0x1080c40 (116)]: [0x7fcf28000bf0 (176)]
Sep 11 10:25:24 barry gssproxy: [CID 12][2017/09/11 16:25:24]: [status] Handling query output: 0x7fcf28000bf0 (176)
Sep 11 10:25:24 barry gssproxy: [2017/09/11 16:25:24]: [status] Handling query reply: 0x7fcf28000bf0 (176)
Sep 11 10:25:24 barry gssproxy: [2017/09/11 16:25:24]: [status] Sending data: 0x7fcf28000bf0 (176)
Sep 11 10:25:24 barry gssproxy: [2017/09/11 16:25:24]: [status] Sending data [0x7fcf28000bf0 (176)]: successful write of 176
Sep 11 10:25:24 barry rpc.gssd[823]: creating tcp client for server earth.cora.nwra.com
Sep 11 10:25:24 barry rpc.gssd[823]: creating context with server nfs.nwra.com
Sep 11 10:25:25 barry rpc.gssd[823]: doing downcall: lifetime_rec=35438 acceptor=nfs.nwra.com
Sep 11 10:25:54 barry gssproxy: [2017/09/11 16:25:54]: Client connected (fd = 12)[2017/09/11 16:25:54]:  (pid = 823) (uid = 30657) (gid = 30657)[2017/09/11 16:25:54]:  (context = system_u:system_r:gssd_t:s0)[2017/09/11 16:25:54]:
Sep 11 10:25:54 barry systemd: rpc-gssd.service: main process exited, code=killed, status=6/ABRT
Sep 11 10:25:54 barry systemd: Unit rpc-gssd.service entered failed state.
Sep 11 10:25:54 barry systemd: rpc-gssd.service failed.

Version-Release number of selected component (if applicable):
nfs-utils-1.3.0-0.48.el7.x86_64

How reproducible:
Seen a few times now.

Comment 2 Steve Dickson 2017-09-11 18:25:57 UTC
Just curious... if you take gssproxy out of the picture
by setting GSS_USE_PROXY="no" in /etc/sysconfig/nfs, does
the abrt happen?
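For reference, comment 2's workaround could be applied roughly as follows on RHEL 7 (a sketch only; the unit names are assumed, with nfs-config.service being the unit that re-reads /etc/sysconfig/nfs):

```shell
# Sketch: bypass gssproxy for rpc.gssd, per comment 2.
# Edit /etc/sysconfig/nfs so it contains:
#   GSS_USE_PROXY="no"
# then re-read the configuration and restart the daemon:
systemctl restart nfs-config.service
systemctl restart rpc-gssd.service
```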

Comment 3 Orion Poplawski 2017-09-15 20:15:02 UTC
It may be too early to tell, but early testing seems to indicate that setting GSS_USE_PROXY=no prevents the crash.  Unfortunately, I also cannot reproduce the crash with gdb attached to rpc.gssd.

Comment 4 Simo Sorce 2017-09-18 15:05:52 UTC
Did abrt catch the rpc.gssd stack trace?
I would like to take a look at it to see where it blows up.

Comment 5 Orion Poplawski 2017-09-18 15:19:37 UTC
No, it didn't.  I don't know why.

Comment 6 Orion Poplawski 2018-07-25 16:43:37 UTC
abrt-hook-ccpp[7829]: Process 1283 (rpc.gssd) of user 0 killed by SIGABRT - dumping core
abrt-hook-ccpp[7829]: Failed to create core_backtrace: waitpid failed: No child processes

Not sure why it isn't catching the coredump.

Comment 7 Orion Poplawski 2018-12-26 19:03:57 UTC
Still present with nfs-utils-1.3.0-0.61.el7.x86_64, but still not producing a coredump.

Comment 8 Orion Poplawski 2019-01-25 16:01:21 UTC
Created attachment 1523541 [details]
core_backtrace

I can't get a good backtrace with gdb on the coredump, but this is what abrtd collected.

              , {   "address": 139826512251128
                ,   "build_id": "95cdabda24bcd671d2876c8d7c5d6411902a8566"
                ,   "build_id_offset": 227576
                ,   "function_name": "abort"
                ,   "file_name": "/lib64/libc.so.6"
                }
              , {   "address": 139826512518343
                ,   "build_id": "95cdabda24bcd671d2876c8d7c5d6411902a8566"
                ,   "build_id_offset": 494791
                ,   "function_name": "__libc_message"
                ,   "file_name": "/lib64/libc.so.6"
                }
              , {   "address": 139826512553001
                ,   "build_id": "95cdabda24bcd671d2876c8d7c5d6411902a8566"
                ,   "build_id_offset": 529449
                ,   "function_name": "_int_free"
                ,   "file_name": "/lib64/libc.so.6"
                }
              , {   "address": 94794261335291
                ,   "build_id": "5b24daf020ad3925c1805d79c7152bbdaa7b2715"
                ,   "build_id_offset": 40187
                ,   "function_name": "gssd_get_single_krb5_cred.constprop.4"
                ,   "file_name": "/usr/sbin/rpc.gssd"
                }
              , {   "address": 94794261336012
                ,   "build_id": "5b24daf020ad3925c1805d79c7152bbdaa7b2715"
                ,   "build_id_offset": 40908
                ,   "function_name": "gssd_refresh_krb5_machine_credential"
                ,   "file_name": "/usr/sbin/rpc.gssd"
                }
              , {   "address": 94794261324896
                ,   "build_id": "5b24daf020ad3925c1805d79c7152bbdaa7b2715"
                ,   "build_id_offset": 29792
                ,   "function_name": "krb5_use_machine_creds"
                ,   "file_name": "/usr/sbin/rpc.gssd"
                }

Comment 9 Orion Poplawski 2019-02-05 18:49:26 UTC
Finally seem to have a viable coredump - looks like we have memory corruption:

(gdb) bt
#0  0x00007f9e03947207 in __GI_raise (sig=sig@entry=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007f9e039488f8 in __GI_abort () at abort.c:90
#2  0x00007f9e03989d27 in __libc_message (do_abort=do_abort@entry=2,
    fmt=fmt@entry=0x7f9e03a9b678 "*** Error in `%s': %s: 0x%s ***\n")
    at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
#3  0x00007f9e03992489 in malloc_printerr (ar_ptr=0x7f9dfc000020, ptr=<optimized out>,
    str=0x7f9e03a9b738 "double free or corruption (fasttop)", action=3) at malloc.c:5004
#4  _int_free (av=0x7f9dfc000020, p=<optimized out>, have_lock=0) at malloc.c:3843
#5  0x0000557c2be4acfb in gssd_get_single_krb5_cred (context=0x7f9dfc0045e0, kt=<optimized out>,
    ple=ple@entry=0x7f9dfc005fa0, nocache=0) at krb5_util.c:427
#6  0x0000557c2be4afcc in gssd_refresh_krb5_machine_credential (
    hostname=0x557c2c87da00 "csdisk4ib.cora.nwra.com", ple=0x7f9dfc005fa0, ple@entry=0x0,
    service=service@entry=0x557c2c892410 "*") at krb5_util.c:1302
#7  0x0000557c2be48460 in krb5_use_machine_creds (clp=clp@entry=0x557c2c87de40, uid=uid@entry=0,
    tgtname=tgtname@entry=0x0, service=service@entry=0x557c2c892410 "*",
    rpc_clnt=rpc_clnt@entry=0x7f9e00f4acf0) at gssd_proc.c:546
#8  0x0000557c2be4868d in process_krb5_upcall (clp=clp@entry=0x557c2c87de40, uid=uid@entry=0,
    fd=10, tgtname=tgtname@entry=0x0, service=service@entry=0x557c2c892410 "*") at gssd_proc.c:655
#9  0x0000557c2be48ed9 in handle_gssd_upcall (info=0x557c2c8923f0) at gssd_proc.c:814
#10 0x00007f9e03ce5dd5 in start_thread (arg=0x7f9e00f4b700) at pthread_create.c:307
#11 0x00007f9e03a0eead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) up 5
#5  0x0000557c2be4acfb in gssd_get_single_krb5_cred (context=0x7f9dfc0045e0, kt=<optimized out>,
    ple=ple@entry=0x7f9dfc005fa0, nocache=0) at krb5_util.c:427
427                     free(ple->ccname);
(gdb) list
422                     cache_type,
423                     ccachesearch[0], GSSD_DEFAULT_CRED_PREFIX,
424                     GSSD_DEFAULT_MACHINE_CRED_SUFFIX, ple->realm);
425             ple->endtime = my_creds.times.endtime;
426             if (ple->ccname != NULL)
427                     free(ple->ccname);
428             ple->ccname = strdup(cc_name);
429             if (ple->ccname == NULL) {
430                     printerr(0, "ERROR: no storage to duplicate credentials "
431                                 "cache name '%s'\n", cc_name);
(gdb) print *ple
$1 = {next = 0x0, princ = 0x7f9dfc006460,
  ccname = 0x7f9df4006060 "FILE:/tmp/krb5ccmachine_NWRA.COM", realm = 0x7f9dfc0061a0 "NWRA.COM",
  endtime = 1549433693}

Comment 10 Simo Sorce 2019-02-07 14:01:05 UTC
Robby,
I seem to recall some recent fixes with ccaches and double frees, can you take a look at this one and see if this is related ?

Comment 11 Robbie Harwood 2019-02-08 19:57:22 UTC
Unless you're using a MEMORY ccache, it wouldn't be related to all that.  (And that stuff only matters for the case of manipulating multiple handles to the same one anyway.)  But if you wanted to be sure, you can try krb5-1.15.1-37 (7.6.z).

Unfortunately corruption issues are going to be nigh-impossible to debug without a trace from under valgrind (with debug symbols installed).
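A hypothetical way to collect the valgrind trace comment 11 asks for (command sketch only; rpc.gssd's -f flag keeps it in the foreground and -vvv raises verbosity, assumed available on this build):

```shell
# Sketch: run rpc.gssd under valgrind with debug symbols, per comment 11.
debuginfo-install nfs-utils glibc     # install debug symbols (RHEL 7)
systemctl stop rpc-gssd.service
valgrind --tool=memcheck --track-origins=yes --num-callers=30 \
    /usr/sbin/rpc.gssd -f -vvv
```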

Comment 12 Simo Sorce 2019-02-11 16:55:53 UTC
Uhm, looking more closely at the backtrace, this is not a libkrb5 call; this is still pure gssd code.

Steve, sounds like this is in your court.

Comment 19 Chris Williams 2020-11-11 21:55:34 UTC
Red Hat Enterprise Linux 7 shipped its final minor release on September 29th, 2020. 7.9 was the last minor release scheduled for RHEL 7.
From initial triage it does not appear the remaining Bugzillas meet the inclusion criteria for Maintenance Phase 2, so they will now be closed.

From the RHEL life cycle page:
https://access.redhat.com/support/policy/updates/errata#Maintenance_Support_2_Phase
"During Maintenance Support 2 Phase for Red Hat Enterprise Linux version 7, Red Hat defined Critical and Important impact Security Advisories (RHSAs) and selected (at Red Hat discretion) Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available."

If this BZ was closed in error and meets the above criteria, please re-open it, flag it for 7.9.z, provide suitable business and technical justifications, and follow the process for Accelerated Fixes:
https://source.redhat.com/groups/public/pnt-cxno/pnt_customer_experience_and_operations_wiki/support_delivery_accelerated_fix_release_handbook  

Feature Requests can be re-opened and moved to RHEL 8 if the desired functionality is not already present in the product.

Please reach out to the applicable Product Experience Engineer[0] if you have any questions or concerns.  

[0] https://bugzilla.redhat.com/page.cgi?id=agile_component_mapping.html&product=Red+Hat+Enterprise+Linux+7

