Bug 1235902
Summary: | Segmentation fault on ARM with psql
---|---
Product: | [Fedora] Fedora
Component: | gssproxy
Version: | 22
Hardware: | armv7l
OS: | Linux
Status: | CLOSED ERRATA
Severity: | high
Priority: | unspecified
Type: | Bug
Doc Type: | Bug Fix
Reporter: | ToBeReplaced
Assignee: | Robbie Harwood <rharwood>
QA Contact: | Fedora Extras Quality Assurance <extras-qa>
CC: | dpal, gdeschner, rharwood, ssorce, ToBeReplaced
Fixed In Version: | gssproxy-0.4.1-3.fc23 gssproxy-0.4.1-2.fc22 gssproxy-0.4.1-2.fc21
Last Closed: | 2015-11-01 02:30:44 UTC
Bug Blocks: | 1344518
Description
ToBeReplaced
2015-06-26 04:34:11 UTC
This was unable to be reproduced on x86_64. The same configuration resulted in the expected (desired) behavior. The test on x86_64 was run on a machine that included a lot of other software (gnome-desktop, for example). The armv7l machine was based off of Fedora 22 Minimal, and only contained upgrades, installation and registration as a freeipa-client, and psql.

(In reply to ToBeReplaced from comment #1)
> This was unable to be reproduced on x86_64. The same configuration resulted
> in the expected (desired) behavior. The test on x86_64 was run on a machine
> that included a lot of other software (gnome-desktop, for example). The
> armv7l machine was based off of Fedora 22 Minimal, and only contained
> upgrades, installation and registration as a freeipa-client, and psql.

Raw guessing: 32bit issue or compiler issue... ;-/

Reporter:
Could you please run *both* gssproxy AND psql under valgrind control to see if it produces any warnings/errors prior to the crash...

Reporter:
I need to reproduce the problem... is there any recommended emulator (with ethernet network access) you can recommend (preferably with instructions how to install Fedora on it) ?

(In reply to Roland Mainz from comment #3)
> Reporter:
> I need to reproduce the problem... is there any recommended emulator (with
> ethernet network access) you can recommend (preferably with instructions how
> to install Fedora on it) ?

Fedora supports Versatile Express Emulation with QEMU, so you might try that. I have never used it, so I can't provide any additional commentary. Instructions here:
https://fedoraproject.org/wiki/Architectures/ARM/F22/Installation#For_Versatile_Express_Emulation_with_QEMU

The problem presented itself on a BeagleBone Black. I imagine it would present on any of the armv7l devices that Fedora supports. (ex. PandaBoard, CubieTruck). Installation instructions are found on the same page as above.

As for valgrind; I'll do that, but it likely won't be until Monday.
This package has changed ownership in the Fedora Package Database. Reassigning to the new owner of this component.

Did you get the chance to valgrind this?

No, I had to move forward. I still intend to get to this, but unfortunately I have to keep putting it off.

In case it's easier to reproduce, sudo 1.8.14p3-1.fc22 also segfaults on ARM devices that are IPA clients (and thus use gssproxy), which I suspect is related.

I had an old Fedora 21 machine I was able to brick, so I upgraded to rawhide packages (same versions).

gssproxy 0.4.1

The psql side (with GSS_USE_PROXY="YES"):

==1017== Memcheck, a memory error detector
==1017== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==1017== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==1017== Command: psql -h database.example.domain -U psql/host.example.domain -d test
==1017==
disInstr(arm): unhandled instruction: 0xEC510F1E
cond=14(0xE) 27:20=197(0xC5) 4:4=1 3:0=14(0xE)
==1017== valgrind: Unrecognised instruction at address 0x4ae0be8.
==1017==    at 0x4AE0BE8: _armv7_tick (armv4cpuid.S:94)
==1017== Your program just tried to execute an instruction that Valgrind
==1017== did not recognise. There are two possible reasons for this.
==1017== 1. Your program has a bug and erroneously jumped to a non-code
==1017== location. If you are running Memcheck and you just saw a
==1017== warning about a bad jump, it's probably your program's fault.
==1017== 2. The instruction is legitimate but Valgrind doesn't handle it,
==1017== i.e. it's Valgrind's fault. If you think this is the case or
==1017== you are not sure, please let us know and we'll try to fix it.
==1017== Either way, Valgrind will now raise a SIGILL signal which will
==1017== probably kill your program.
==1017== Invalid free() / delete / delete[] / realloc()
==1017==    at 0x4836A08: free (in /usr/lib/valgrind/vgpreload_memcheck-arm-linux.so)
==1017== Address 0x8 is not stack'd, malloc'd or (recently) free'd
==1017==
psql: GSSAPI continuation error: Unspecified GSS failure. Minor code may provide more information
GSSAPI continuation error: No Kerberos credentials available
==1017==
==1017== HEAP SUMMARY:
==1017== in use at exit: 88,579 bytes in 3,096 blocks
==1017== total heap usage: 6,172 allocs, 3,077 frees, 548,384 bytes allocated
==1017==
==1017== LEAK SUMMARY:
==1017== definitely lost: 64 bytes in 4 blocks
==1017== indirectly lost: 114 bytes in 6 blocks
==1017== possibly lost: 0 bytes in 0 blocks
==1017== still reachable: 88,401 bytes in 3,086 blocks
==1017== suppressed: 0 bytes in 0 blocks
==1017== Rerun with --leak-check=full to see details of leaked memory
==1017==
==1017== For counts of detected and suppressed errors, rerun with: -v
==1017== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 12 from 8)

The gssproxy side showed no errors:

==1055== Memcheck, a memory error detector
==1055== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==1055== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==1055== Command: gssproxy -i
==1055==
gssproxy[1055]: (OID: { 1 2 840 113554 1 2 2 }) Unspecified GSS failure.
Minor code may provide more information, No credentials cache found
^C==1055==
==1055== HEAP SUMMARY:
==1055== in use at exit: 11,412 bytes in 95 blocks
==1055== total heap usage: 1,917 allocs, 1,822 frees, 291,081 bytes allocated
==1055==
==1055== LEAK SUMMARY:
==1055== definitely lost: 66 bytes in 3 blocks
==1055== indirectly lost: 118 bytes in 6 blocks
==1055== possibly lost: 5,172 bytes in 35 blocks
==1055== still reachable: 6,056 bytes in 51 blocks
==1055== suppressed: 0 bytes in 0 blocks
==1055== Rerun with --leak-check=full to see details of leaked memory
==1055==
==1055== For counts of detected and suppressed errors, rerun with: -v
==1055== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 14 from 2)

Looks like the `armv7_tick` might be the root of both issues.

On sudo 1.8.14p3, with sssd 1.13.0 running (no gssproxy required):

(gdb) run
Starting program: /usr/bin/sudo true
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
sss_sudo_free_result (result=0x19c) at src/sss_client/sudo/sss_sudo.c:204
204 sss_sudo_free_rules(result->num_rules, result->rules);
(gdb) bt
#0 sss_sudo_free_result (result=0x19c) at src/sss_client/sudo/sss_sudo.c:204
#1 0xb6a541d8 in sudo_sss_setdefs (nss=<optimized out>) at ./sssd.c:456
#2 0xb6a4cf30 in sudoers_policy_init (info=info@entry=0xbeb2e450, envp=envp@entry=0xb6f94eb0 <__stack_chk_guard>) at ./sudoers.c:195
#3 0xb6a47e38 in sudoers_policy_open (version=<optimized out>, conversation=<optimized out>, plugin_printf=<optimized out>, settings=0xb8506398, user_info=0xb85008e8, envp=0xbeb2e7a0, args=0x0) at ./policy.c:621
#4 0xb6f99c94 in policy_open (plugin=0xb6fc5fac <policy_plugin>, user_env=0xd696914, user_info=0xb85008e8, settings=<optimized out>) at ./sudo.c:1189
#5 main (argc=<optimized out>, argv=<optimized out>, envp=0xd696914) at ./sudo.c:206
(gdb) c
Continuing.
On starting sssd:

(gdb) run
Starting program: /usr/sbin/sssd -i
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/libthread_db.so.1".

Program received signal SIGILL, Illegal instruction.
_armv7_tick () at armv4cpuid.S:94
94 mrrc p15,1,r0,r1,c14 @ CNTVCT
(gdb) bt
#0 _armv7_tick () at armv4cpuid.S:94
#1 0xb6a7af9c in OPENSSL_cpuid_setup () at armcap.c:157
#2 0xb6fe86b8 in call_init (l=<optimized out>, argc=argc@entry=2, argv=argv@entry=0xbefff794, env=env@entry=0xbefff7a0) at dl-init.c:76
#3 0xb6fe8814 in call_init (env=<optimized out>, argv=<optimized out>, argc=<optimized out>, l=<optimized out>) at dl-init.c:34
#4 _dl_init (main_map=0xb6fff908, argc=2, argv=0xbefff794, env=0xbefff7a0) at dl-init.c:124
#5 0xb6fd8b44 in _dl_start_user () from /lib/ld-linux-armhf.so.3
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) c
Continuing.
Detaching after fork from child process 1235.
Detaching after fork from child process 1236.
Detaching after fork from child process 1237.
Detaching after fork from child process 1238.
Detaching after fork from child process 1239.
Detaching after fork from child process 1240.
(gdb) c
Continuing.

Also of note, I was able to install the missing debuginfos if I removed postgresql-debuginfo (space limited). This yielded:

Missing separate debuginfos, use: debuginfo-install postgresql-9.3.9-1.fc21.armv7hl
(gdb) run
Starting program: /usr/bin/psql -h database.example.domain -U psql/host.example.domain -d test
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/libthread_db.so.1".

Program received signal SIGILL, Illegal instruction.
_armv7_tick () at armv4cpuid.S:94
94 mrrc p15,1,r0,r1,c14 @ CNTVCT
(gdb) bt
#0 _armv7_tick () at armv4cpuid.S:94
#1 0xb6c39f9c in OPENSSL_cpuid_setup () at armcap.c:157
#2 0xb6fe86b8 in call_init (l=<optimized out>, argc=argc@entry=7, argv=argv@entry=0xbefff654, env=env@entry=0xbefff674) at dl-init.c:76
#3 0xb6fe8814 in call_init (env=<optimized out>, argv=<optimized out>, argc=<optimized out>, l=<optimized out>) at dl-init.c:34
#4 _dl_init (main_map=0xb6fff908, argc=7, argv=0xbefff654, env=0xbefff674) at dl-init.c:124
#5 0xb6fd8b44 in _dl_start_user () from /lib/ld-linux-armhf.so.3
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
__GI___libc_free (mem=0x8) at malloc.c:2934
2934 if (chunk_is_mmapped (p)) /* release mmapped memory. */
(gdb) bt
#0 __GI___libc_free (mem=0x8) at malloc.c:2934
#1 0xb6407840 in gssrpc_xdr_bytes () from /lib/libgssrpc.so.4
#2 0xb642515c in xdr_gp_rpc_opaque_auth () from /usr/lib/gssproxy/proxymech.so
#3 0xb6425270 in xdr_gp_rpc_accepted_reply () from /usr/lib/gssproxy/proxymech.so
#4 0xb6425484 in xdr_gp_rpc_reply_header () from /usr/lib/gssproxy/proxymech.so
#5 0xb64254d0 in xdr_gp_rpc_msg_union () from /usr/lib/gssproxy/proxymech.so
#6 0xb6425520 in xdr_gp_rpc_msg () from /usr/lib/gssproxy/proxymech.so
#7 0xb6406fc8 in gssrpc_xdr_free () from /lib/libgssrpc.so.4
#8 0xb642ae20 in gpm_make_call () from /usr/lib/gssproxy/proxymech.so
#9 0xb6429e28 in gpm_init_sec_context () from /usr/lib/gssproxy/proxymech.so
#10 0xb642d82c in gssi_init_sec_context () from /usr/lib/gssproxy/proxymech.so
#11 0xb6af80f0 in gss_init_sec_context () from /lib/libgssapi_krb5.so.2
#12 0xb6fa8d48 in pg_GSS_continue () from /lib/libpq.so.5
#13 0xb6fa9108 in pg_fe_sendauth () from /lib/libpq.so.5
#14 0xb6fad3c4 in PQconnectPoll () from /lib/libpq.so.5
#15 0xb6fadfc8 in connectDBComplete () from /lib/libpq.so.5
#16 0xb6fae72c in PQconnectdbParams () from /lib/libpq.so.5
#17 0x0000c290 in main ()
(gdb) c
Continuing.

Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.

I have reproduced this on a test VM (read: *extremely* slow, but I can install all the debuginfos I want). I am intimately familiar with the postgres code, so I'm walking through that. Some notes:

- actually configuring gssproxy seems to be unnecessary; the bug happens before that
- I'm running as an IPA user with non-1000 uid
- we really should release a new gssproxy so that we can use the symbolic UIDs everywhere
- bug does not trigger if GSS_USE_PROXY="no"
- a valid postgres server is needed because it doesn't access creds until two send-and-reply cycles have happened
- ... but since the handshake doesn't complete anyway, you don't actually need to be a user

It would be really useful to have the corresponding gssproxy logs when this happens. Running the client under valgrind may also help, though if it is already slow, valgrind won't help for sure.

Adding a memset(&msg, 0, sizeof(gp_rpc_msg)); just before we decode the header may be a good idea. I suspect the issue is some failure in the XDR layer that then gets clobbered because the msg structure is dirty.
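The memset suggestion above can be illustrated with a small sketch. This is not gssproxy code: `demo_msg`, `demo_decode`, `demo_free` and `demo_call` are hypothetical stand-ins for `gp_rpc_msg` and the XDR decode/free calls, assumed here only for illustration. It shows why zeroing the structure first matters: if decoding fails before a pointer field is written, the cleanup path would otherwise free whatever happened to be on the stack, which would be consistent with the free(mem=0x8) in the backtrace, though that link is a guess.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a decoded RPC message; NOT the real gp_rpc_msg. */
struct demo_msg {
    unsigned int xid;
    char *auth_data;   /* heap buffer owned by the message once decoded */
    size_t auth_len;
};

/* Hypothetical decoder. On short input it fails before touching auth_data,
 * mimicking a decode error that leaves part of the struct unwritten. */
static int demo_decode(struct demo_msg *msg, const unsigned char *buf, size_t len)
{
    assert(msg != NULL && buf != NULL);
    if (len < 4)
        return -1;                      /* early failure: auth_data untouched */
    msg->xid = ((unsigned int)buf[0] << 24) | ((unsigned int)buf[1] << 16) |
               ((unsigned int)buf[2] << 8) | (unsigned int)buf[3];
    msg->auth_len = len - 4;
    msg->auth_data = malloc(msg->auth_len ? msg->auth_len : 1);
    if (msg->auth_data == NULL)
        return -1;
    memcpy(msg->auth_data, buf + 4, msg->auth_len);
    return 0;
}

/* Cleanup runs on both success and failure paths. free(NULL) is a no-op,
 * but free() of an uninitialized pointer is undefined behavior. */
static void demo_free(struct demo_msg *msg)
{
    free(msg->auth_data);
    memset(msg, 0, sizeof(*msg));
}

/* Returns 0 on success, -1 on decode failure. Safe in both cases because
 * the struct is zeroed before decoding. */
int demo_call(const unsigned char *buf, size_t len, unsigned int *xid_out)
{
    struct demo_msg msg;
    int ret;

    memset(&msg, 0, sizeof(msg));       /* the step suggested in the comment */
    ret = demo_decode(&msg, buf, len);
    if (ret == 0 && xid_out != NULL)
        *xid_out = msg.xid;
    demo_free(&msg);                    /* never sees a garbage pointer */
    return ret;
}
```

Without the memset in `demo_call`, the failure path would hand an indeterminate `auth_data` to `free()`. With it, the same path frees NULL and returns cleanly.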
Logs from gssproxy:

Debug Enabled
Client connected (fd = 11) (pid = 27905) (uid = 0) (gid = 0) (context = system_u:system_r:kernel_t:s0)
Client connected (fd = 12) (pid = 27914) (uid = 1523400003) (gid = 1523400003) (context = unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023)
gp_rpc_execute: executing 8 (GSSX_INIT_SEC_CONTEXT) for service "psql", euid: 1523400003, socket: (null)

From valgrind (this matches what ToBeReplaced was seeing):

==27983== 2 errors in context 1 of 1:
==27983== Invalid free() / delete / delete[] / realloc()
==27983==    at 0x4846A08: free (in /usr/lib/valgrind/vgpreload_memcheck-arm-linux.so)
==27983== Address 0x8 is not stack'd, malloc'd or (recently) free'd

I don't know what was going on before, but I don't seem to be able to reproduce this with a custom-built gssproxy anymore. However, adding the memset in causes the segfault to occur.

gssproxy-0.4.1-3.fc23 has been submitted as an update to Fedora 23. https://bodhi.fedoraproject.org/updates/FEDORA-2015-20d1e0d890

gssproxy-0.4.1-2.fc22 has been submitted as an update to Fedora 22. https://bodhi.fedoraproject.org/updates/FEDORA-2015-91663ccfea

Whoops, that should read "causes the segfault to *not* occur". Otherwise this chain of events is kind of confusing...

gssproxy-0.4.1-3.fc23 has been pushed to the Fedora 23 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
$ su -c 'dnf --enablerepo=updates-testing update gssproxy'
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2015-20d1e0d890

gssproxy-0.4.1-2.fc21 has been pushed to the Fedora 21 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
$ su -c 'dnf --enablerepo=updates-testing update gssproxy'
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2015-25ab0dae49

gssproxy-0.4.1-2.fc22 has been pushed to the Fedora 22 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with
$ su -c 'dnf --enablerepo=updates-testing update gssproxy'
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2015-91663ccfea

gssproxy-0.4.1-3.fc23 has been pushed to the Fedora 23 stable repository. If problems still persist, please make note of it in this bug report.

gssproxy-0.4.1-2.fc22 has been pushed to the Fedora 22 stable repository. If problems still persist, please make note of it in this bug report.

gssproxy-0.4.1-2.fc21 has been pushed to the Fedora 21 stable repository. If problems still persist, please make note of it in this bug report.

(In reply to ToBeReplaced from comment #9)
> Also of note, I was able to install the missing debuginfos if I removed
> postgresql-debuginfo (space limited). This yielded:
>
> Missing separate debuginfos, use: debuginfo-install
> postgresql-9.3.9-1.fc21.armv7hl
> (gdb) run
> Starting program: /usr/bin/psql -h database.example.domain -U
> psql/host.example.domain -d test
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/libthread_db.so.1".
>
> Program received signal SIGILL, Illegal instruction.
> _armv7_tick () at armv4cpuid.S:94
> 94 mrrc p15,1,r0,r1,c14 @ CNTVCT
> (gdb) bt

FWIW, the crash at mrrc instruction is due to performance counters being inaccessible from userland on ARMv7 by default, see for example http://neocontra.blogspot.co.uk/2013/05/user-mode-performance-counters-for.html .
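The gdb sessions above show the SIGILL raised inside OPENSSL_cpuid_setup and the program carrying on after "Continuing.", which fits a probe-and-catch pattern: try the instruction, and treat SIGILL as "feature not available". A minimal sketch of that pattern follows. It is not OpenSSL's code. `probe_instruction`, `fake_supported` and `fake_unsupported` are hypothetical names, and `raise(SIGILL)` stands in for an instruction such as the mrrc above so the sketch runs on any architecture.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf probe_jmp;

/* SIGILL handler: jump back to the probe site instead of dying. */
static void on_sigill(int sig)
{
    (void)sig;
    siglongjmp(probe_jmp, 1);
}

/* Runs probe(); returns 1 if it completed, 0 if it raised SIGILL
 * (or if the handler could not be installed). */
int probe_instruction(void (*probe)(void))
{
    struct sigaction sa, old;
    volatile int supported = 0;

    assert(probe != NULL);
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigill;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGILL, &sa, &old) != 0)
        return 0;

    /* savemask=1 so the signal mask is restored after the jump. */
    if (sigsetjmp(probe_jmp, 1) == 0) {
        probe();            /* may trap with SIGILL */
        supported = 1;      /* only reached if the probe did not trap */
    }

    sigaction(SIGILL, &old, NULL);   /* restore the previous handler */
    return supported;
}

/* Stand-in for an instruction the CPU or kernel does not allow. */
void fake_unsupported(void) { raise(SIGILL); }

/* Stand-in for an instruction that executes normally. */
void fake_supported(void) { }
```

Under a pattern like this, a debugger stops on the SIGILL before the handler runs, so the signal appears in the session even though the program would recover on its own. That would make the SIGILL here separate from the later SIGSEGV in free().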