Bug 1507817
Summary: | [rhel-7.5] krb5 mounting starts to fail with "access denied by server" | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Yongcheng Yang <yoyang> |
Component: | gssproxy | Assignee: | Robbie Harwood <rharwood> |
Status: | CLOSED ERRATA | QA Contact: | ipa-qe <ipa-qe> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 7.5 | CC: | abokovoy, ajmitchell, amessina, fs-qe, ikent, ipa-qe, jiyin, jstodola, ksiddiqu, lkocman, lmiksik, ovasik, pasik, ppiotrowski, rharwood, rob.verduijn, salmy, steved, swhiteho, xdong, xzhou, yoyang |
Target Milestone: | beta | Keywords: | Regression, TestBlocker |
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | gssproxy-0.7.0-17.el7 | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2018-04-10 11:09:51 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1514241 | ||
Bug Blocks: |
Description
Yongcheng Yang
2017-10-31 08:55:09 UTC
Please try with gssproxy-0.7.0-11 and the SELinux DONTAUDIT rules disabled. Thanks!

Sure, I'll take a stab.

gssproxy-0.7.0-7.el7 updates our unit file (gssproxy.service) to match that of the Fedora unit file, which I believe was changed back in May, with steved taking care of the nfs-utils side. Nothing else changes between -6 and -7.

In particular, this removes a line `Requires=proc-fs-nfsd.mount` (which gssproxy cannot have, since it can't depend on nfs-utils). To counter this, it adds the ability to reload the gssproxy daemon (`systemctl try-reload-or-restart` in Fedora; `systemctl reload-or-try-restart` in RHEL). These commands cause gssproxy to start (if it's not running) and bring up the kernel interface.

I believe the current failure yoyang is reporting is that everything works fine after upgrading packages *until* the machine is restarted. This is consistent with a unit dependency problem.

Broken in Fedora, gssproxy-0.7.0-24.fc27.x86_64. Works when I manually start gssproxy.

I'm wondering if this problem is due to https://bugzilla.redhat.com/show_bug.cgi?id=1497267

Does setting GSS_USE_PROXY="no" in /etc/sysconfig/nfs and then restarting rpc.gssd help?

(In reply to Steve Dickson from comment #18)
> I'm wondering if this problem is due to
> https://bugzilla.redhat.com/show_bug.cgi?id=1497267
>
> Does setting GSS_USE_PROXY="no" in /etc/sysconfig/nfs
> and then restarting rpc.gssd help?

I modified /etc/sysconfig/nfs as described and rebooted. The mount works with gssproxy-0.7.0-24.fc27.x86_64.

(In reply to Steve Dickson from comment #18)
> I'm wondering if this problem is due to
> https://bugzilla.redhat.com/show_bug.cgi?id=1497267
>
> Does setting GSS_USE_PROXY="no" in /etc/sysconfig/nfs
> and then restarting rpc.gssd help?

I'm afraid this issue is a bit different.
Setting GSS_USE_PROXY="no" does not resolve my problem:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@krb5-nfs-server ~]# grep GSS_USE_PROXY /etc/sysconfig/nfs
GSS_USE_PROXY="no"
[root@krb5-nfs-server ~]# rpm -q gssproxy
gssproxy-0.7.0-7.el7.x86_64
[root@krb5-nfs-server ~]# mount -t nfs4 -o sec=krb5 krb5-nfs-server:/exportDir-krb5-common-local /mnt/
mount.nfs4: access denied by server while mounting krb5-nfs-server:/exportDir-krb5-common-local
[root@krb5-nfs-server ~]# klist -a
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: root
Valid starting       Expires              Service principal
20/11/17 10:27:22    21/11/17 10:27:22    krbtgt/RHQE.COM
        renew until 27/11/17 10:27:22
        Addresses: (none)
[root@krb5-nfs-server ~]# yum downgrade -y gssproxy-0.7.0-6.el7.x86_64.rpm &>/dev/null
[root@krb5-nfs-server ~]# rpm -q gssproxy
gssproxy-0.7.0-6.el7.x86_64
[root@krb5-nfs-server ~]# reboot
Connection to krb5-nfs-server closed by remote host.
Connection to krb5-nfs-server closed.
...

After downgrading to gssproxy-0.7.0-6.el7, it works:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[root@krb5-nfs-server ~]# grep GSS_USE_PROXY /etc/sysconfig/nfs
GSS_USE_PROXY="yes"
[root@krb5-nfs-server ~]# rpm -q gssproxy
gssproxy-0.7.0-6.el7.x86_64
[root@krb5-nfs-server ~]# mount -t nfs4 -o sec=krb5 krb5-nfs-server:/exportDir-krb5-common-local /mnt/
[root@krb5-nfs-server ~]# echo $?
0
[root@krb5-nfs-server ~]# klist -a
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: host/krb5-nfs-server
Valid starting       Expires              Service principal
31/12/69 19:00:00    31/12/69 19:00:00    Encrypted/Credentials/v1@X-GSSPROXY:
        Addresses: (none)
[root@krb5-nfs-server ~]# cat /proc/mounts | grep mnt
krb5-nfs-server:/exportDir-krb5-common-local /mnt nfs4 rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=krb5,clientaddr=10.73.4.155,local_lock=none,addr=10.73.4.155 0 0
[root@krb5-nfs-server ~]# umount /mnt/

(In reply to Yongcheng Yang from comment #20)
> (In reply to Steve Dickson from comment #18)
> > I'm wondering if this problem is due to
> > https://bugzilla.redhat.com/show_bug.cgi?id=1497267
> >
> > Does setting GSS_USE_PROXY="no" in /etc/sysconfig/nfs
> > and then restarting rpc.gssd help?
>
> I'm afraid this issue is a bit different.
>
> Setting GSS_USE_PROXY="no" does not resolve my problem:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> After downgrading to gssproxy-0.7.0-6.el7, it works:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

So how is this an nfs-utils problem when downgrading gssproxy fixes the problem? Nothing has changed in nfs-utils, but there has obviously been some change in the gssproxy code.

(In reply to Steve Dickson from comment #21)
> (In reply to Yongcheng Yang from comment #20)
> > After downgrading to gssproxy-0.7.0-6.el7, it works:
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> So how is this an nfs-utils problem when downgrading gssproxy fixes
> the problem? Nothing has changed in nfs-utils, but there has obviously
> been some change in the gssproxy code.

Maybe Robbie has provided some investigation in comment #5, comment #10 and comment #15. Sorry, I cannot fully understand it.
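The unit-file difference between -6 and -7 that Robbie describes in comment #15 can be sketched roughly as follows. This is a reconstruction from the comments above, not the exact packaged units; only the one directive under discussion is shown:

```ini
# gssproxy.service in gssproxy-0.7.0-6.el7 (before the change):
[Unit]
# Hard dependency: gssproxy will not start unless the nfsd proc
# filesystem is mounted, tying gssproxy's startup to nfs-utils.
Requires=proc-fs-nfsd.mount

# In -7 this line is removed. Instead, the NFS units are expected to run
#   systemctl reload-or-try-restart gssproxy.service   (RHEL spelling;
#   Fedora uses try-reload-or-restart)
# which, per comment #15, starts gssproxy if needed and brings up the
# kernel interface.
```

As the rest of the thread shows, the failure mode appears when gssproxy comes up before nfsd and the reload path does not fully reset its state.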
(In reply to James from comment #19)
> (In reply to Steve Dickson from comment #18)
> > I'm wondering if this problem is due to
> > https://bugzilla.redhat.com/show_bug.cgi?id=1497267
> >
> > Does setting GSS_USE_PROXY="no" in /etc/sysconfig/nfs
> > and then restarting rpc.gssd help?
>
> I modified /etc/sysconfig/nfs as described and rebooted. The mount works
> with gssproxy-0.7.0-24.fc27.x86_64.

Seems like I spoke too soon. This only works on one of my machines. I don't know what's different about the others.

I'm not seeing this problem...

rhel7# mount -o sec=krb5 rhel7srv:/home/tmp /mnt/tmp
rhel7# umount /mnt/tmp

Using the following:

# rpm -q nfs-utils gssproxy
nfs-utils-1.3.0-0.52.el7.x86_64
gssproxy-0.7.0-13.el7.x86_64

But I am seeing this being logged:

Nov 29 09:59:41 rhel7 python: SELinux is preventing /usr/sbin/rpc.gssd from connectto access on the unix_stream_socket /var/lib/gssproxy/default.sock.

    ***** Plugin catchall (100. confidence) suggests **************************

    If you believe that rpc.gssd should be allowed connectto access on the
    default.sock unix_stream_socket by default, then you should report this
    as a bug. You can generate a local policy module to allow this access.
    Allow this access for now by executing:
    # ausearch -c 'rpc.gssd' --raw | audit2allow -M my-rpcgssd
    # semodule -i my-rpcgssd.pp

Could this possibly be an SELinux issue?

But... now after a complete RHEL 7.5 nightly update I am seeing a hang on a -o krb5 mount:

rhel7# ps ax | grep mount.nfs
 1921 pts/0    D+     0:00 /sbin/mount.nfs rhel7srv:/home/tmp /mnt/tmp -o rw,sec=krb5
 1968 pts/2    S+     0:00 grep mount.nfs
rhel7# cat /proc/1921/stack
[<ffffffffc04cfaf4>] rpc_wait_bit_killable+0x24/0xb0 [sunrpc]
[<ffffffffc04cfacd>] __rpc_wait_for_completion_task+0x2d/0x30 [sunrpc]
[<ffffffffc0837cf4>] _nfs4_proc_exchange_id+0x3a4/0x490 [nfsv4]
[<ffffffffc083e8b9>] nfs4_proc_exchange_id+0x39/0x60 [nfsv4]
[<ffffffffc084fb61>] nfs41_discover_server_trunking+0x21/0xb0 [nfsv4]
[<ffffffffc08501ad>] nfs4_discover_server_trunking+0x7d/0x2e0 [nfsv4]
[<ffffffffc0858dc6>] nfs4_init_client+0x126/0x300 [nfsv4]
[<ffffffffc07ee6d6>] nfs_get_client+0x2c6/0x3e0 [nfs]
[<ffffffffc08582a8>] nfs4_set_client+0x98/0x130 [nfsv4]
[<ffffffffc0859a9e>] nfs4_create_server+0x13e/0x3b0 [nfsv4]
[<ffffffffc0850bee>] nfs4_remote_mount+0x2e/0x60 [nfsv4]
[<ffffffffb8213989>] mount_fs+0x39/0x1b0
[<ffffffffb82307f7>] vfs_kern_mount+0x67/0x110
[<ffffffffc0850b16>] nfs_do_root_mount+0x86/0xc0 [nfsv4]
[<ffffffffc0850f14>] nfs4_try_mount+0x44/0xc0 [nfsv4]
[<ffffffffc07fbaf5>] nfs_fs_mount+0x4c5/0xd90 [nfs]
[<ffffffffb8213989>] mount_fs+0x39/0x1b0
[<ffffffffb82307f7>] vfs_kern_mount+0x67/0x110
[<ffffffffb8232d03>] do_mount+0x233/0xaf0
[<ffffffffb8233946>] SyS_mount+0x96/0xf0
[<ffffffffb86ea909>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

Now the odd thing is, if I start rpc.gssd from the shell I do not see the hang:

rhel7# systemctl stop rpc-gssd
rhel7# rpc.gssd -vv
rhel7# mount -o sec=krb5 rhel7srv:/home/tmp /mnt/tmp
rhel7# umount /mnt/tmp

Any idea what this means?
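The hang above was diagnosed by reading the kernel stack of the stuck mount.nfs process. A small sketch of that diagnostic; `dump_stack` is our own illustrative helper (any equivalent pgrep/cat pipeline works), and a frame like rpc_wait_bit_killable at the top indicates the mount is blocked waiting on an RPC that never completes:

```shell
# Print the kernel stack of the first process whose command line matches
# the given pattern; reading /proc/<pid>/stack typically requires root.
dump_stack() {
    pid=$(pgrep -f "$1" | head -n 1)
    if [ -n "$pid" ]; then
        cat "/proc/$pid/stack"
    else
        echo "no process matching '$1'"
    fi
}

dump_stack mount.nfs
```

A process stuck in D state (as in the ps output above) cannot be killed, so this is usually the only way to see where in the kernel it is waiting.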
Here is the current rpc-gssd.service file:

[Unit]
Description=RPC security service for NFS client and server
DefaultDependencies=no
Conflicts=umount.target
Requires=rpc_pipefs.target
After=rpc_pipefs.target gssproxy.service
ConditionPathExists=/etc/krb5.keytab
PartOf=nfs-utils.service
Wants=nfs-config.service
After=nfs-config.service

[Service]
EnvironmentFile=-/run/sysconfig/nfs-utils
Type=forking
ExecStart=/usr/sbin/rpc.gssd $GSSDARGS

Is there something missing that is needed by systemd?

Question... has the credential cache moved? That could explain the failure that is being seen...

I was finally able to reproduce the access-denied problem... Not sure why all the systemd noise went away, although I did disable SELinux, which might have done the trick... It turns out stopping gssproxy makes the problem go away:

rhel7# systemctl start gssproxy
rhel7# mount -o sec=krb5 rhel7srv:/home/tmp /mnt/tmp
mount.nfs: access denied by server while mounting rhel7srv:/home/tmp
rhel7# systemctl stop gssproxy
rhel7# mount -o sec=krb5 rhel7srv:/home/tmp /mnt/tmp
rhel7# umount /mnt/tmp

Yongcheng, can you please verify this is the case in your testing. If verified, the workaround to move the alpha forward is to disable gssproxy.

(In reply to Steve Dickson from comment #30)
> It turns out stopping gssproxy makes the problem go away.

Hi Steve, our testcase sets up the krb server and client on the same machine, so stopping gssproxy may cause the whole Kerberos function to stop working.

(In reply to Yongcheng Yang from comment #31)
> (In reply to Steve Dickson from comment #30)
> > It turns out stopping gssproxy makes the problem go away.
>
> Hi Steve, our testcase sets up the krb server and client on the same
> machine, so stopping gssproxy may cause the whole Kerberos function to
> stop working. So IMO this workaround cannot solve our testcase's failure
> (and I have checked it).
>
> However, if the client is not co-located with the server, it may help. I'll check that.

My setup is between two different VMs, one being the client and the other being the server... I think that is a more realistic setup than loopback... Very few customers (if any) use loopback mounts in production, especially secure mounts.

Do we know this is not the problem in https://bugzilla.redhat.com/show_bug.cgi?id=1514241?

(In reply to Steve Dickson from comment #26)
> But... now after a complete RHEL 7.5 nightly update I am seeing a hang on
> a -o krb5 mount:
>
> rhel7# ps ax | grep mount.nfs
>  1921 pts/0    D+     0:00 /sbin/mount.nfs rhel7srv:/home/tmp /mnt/tmp -o rw,sec=krb5
>  1968 pts/2    S+     0:00 grep mount.nfs
> rhel7# cat /proc/1921/stack
> [<ffffffffc04cfaf4>] rpc_wait_bit_killable+0x24/0xb0 [sunrpc]
> [... stack trace as in comment #26 ...]
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> Now the odd thing is, if I start rpc.gssd from the shell I do not see the hang:
> rhel7# systemctl stop rpc-gssd
> rhel7# rpc.gssd -vv
> rhel7# mount -o sec=krb5 rhel7srv:/home/tmp /mnt/tmp
> rhel7# umount /mnt/tmp
>
> Any idea what this means?

This hang comes from bug 1519511.

I'm beginning to think this might be a configuration issue. QE is using krb5kdc as their KDC and I'm using an IPA server as my KDC; that is one difference. Mounts are working in my world and not in theirs... that is the other difference :-) A third difference is QE is running their KDC on their NFS server, which may not be a big deal other than making it harder to debug because there is not a network between them; it's just another difference.

Now, I'm not saying we should switch over to an IPA server; we definitely should get the krb5kdc environment working, but I don't think we should hold the alpha for it.

Here is why I think it's a configuration issue... in the IPA world, when the client does the mount I see the following network traffic to the KDC:

No.   Time         Source       Destination  Protocol Length Info
2857  8.926648824  172.31.1.58  172.31.1.3   KRB5     282    AS-REQ
2859  8.932487365  172.31.1.3   172.31.1.58  KRB5     404    KRB Error: KRB5KDC_ERR_PREAUTH_REQUIRED
2886  8.983945212  172.31.1.58  172.31.1.3   KRB5     1125   TGS-REQ
2888  8.992263978  172.31.1.3   172.31.1.58  KRB5     1045   TGS-REP

I don't see any of this type of traffic between their client and KDC. Note, we should also be testing against a Windows KDC, since that is the majority of the KDCs out there...

The exchange AS-REQ / PRE-AUTH response / TGS-REQ / TGS-REP is a correct one in IPA (and admittedly in any other contemporary Kerberos setup). If your KDC is not requiring pre-auth on the principals, including krbtgt/REALM@REALM, you wouldn't see that first response, but overall running without pre-auth is insecure.
See, for example, https://social.technet.microsoft.com/wiki/contents/articles/23559.kerberos-pre-authentication-why-it-should-not-be-disabled.aspx.

Can we get more details of how krb5kdc is set up in QE? I tried to find build_krb5_server.sh and failed.

I have built a new version of gssproxy - gssproxy-0.7.0-14.el7. If people could re-test with that, that would be very helpful. I'll try to reproduce otherwise. Also please note that any time you upgrade gssproxy for testing, you need to restart rpc-gssd, else it will still be using the old gssproxy library.

Alexander, all, preauth is not required for random keys by default by the MIT KDC, as it is not necessary. We did enable preauth on all keys in IPA just in case, but it is definitely not a requirement. Preauth is required for keys that are generated from a password, because dictionary-based brute-force attacks are possible with human-chosen passwords. That is generally not the case with service keys, as they are fully random, so a brute force is equivalent to an exhaustive search, which is unfeasible in the case of AES encryption.

TL;DR: ignore the preauth thing; it's a red herring and completely unrelated to the bug at hand. HTH.

(In reply to Robbie Harwood from comment #15)
> Sure, I'll take a stab.
>
> gssproxy-0.7.0-7.el7 updates our unit file (gssproxy.service) to match that
> of the Fedora unit file, which I believe was changed back in May, with
> steved taking care of the nfs-utils side. Nothing else changes between -6
> and -7.
>
> In particular, this removes a line `Requires=proc-fs-nfsd.mount` (which
> gssproxy cannot have since it can't depend on nfs-utils). To counter this,

Why remove it? You don't need to remove a "Requires" that's too strong for your needs; changing it to "Wants" should be enough.

Yongcheng, what happens if you do this? Change the "Requires" to "Wants" instead of removing the line in the gssproxy.service file?

(In reply to Ian Kent from comment #51)
> Yongcheng, what happens if you do this?
> Change the "Requires" to "Wants" instead of removing the line
> in the gssproxy.service file?

Hello Ian, just a quick update here: it works in my environment!

After gssproxy removed "Requires=proc-fs-nfsd.mount", we do find that the module "nfsd" is not loaded on a newly booted machine (/proc/fs/nfsd is not mounted either) until NFS is started a while later. However, I didn't know before that this had anything to do with the krb5 configuration.

After adding the line "Wants=proc-fs-nfsd.mount", "nfsd" loads automatically, even though gssproxy.service is disabled. And then my test can pass.

~~~~~~~~~~~~~~
[root@~]# mount | grep nfsd
nfsd on /proc/fs/nfsd type nfsd (rw,relatime)
[root@~]# lsmod | grep nfsd
nfsd                  342943  1
auth_rpcgss            59415  1 nfsd
nfs_acl                12837  1 nfsd
lockd                  93827  1 nfsd
grace                  13515  2 nfsd,lockd
sunrpc                349214  7 nfsd,auth_rpcgss,lockd,nfs_acl
[root@~]# systemctl is-enabled gssproxy
disabled
...

Maybe this is the reason why this issue cannot be reproduced in some environments. And I also find it always fails on a newly installed machine but can pass sometimes.

(In reply to Ian Kent from comment #54)
> (In reply to Yongcheng Yang from comment #52)
> > (In reply to Ian Kent from comment #51)
> > > Yongcheng, what happens if you do this?
> > > Change the "Requires" to "Wants" instead of removing the line
> > > in the gssproxy.service file?
> >
> > Hello Ian, just a quick update here: it works in my environment!
> >
> > After gssproxy removed "Requires=proc-fs-nfsd.mount", we do find that
> > the module "nfsd" is not loaded on a newly booted machine (/proc/fs/nfsd
> > is not mounted either) until NFS is started a while later. However, I
> > didn't know before that this had anything to do with the krb5 configuration.
>
> I'm only going by what I've read in the comments.
>
> Namely, it works with the "Requires" line present, and gssproxy cannot
> depend on nfs-utils (the reason the Requires was removed).
>
> So "Wants" appears to be appropriate in that it won't fail the
> gssproxy service if proc-fs-nfsd.mount fails or is not present.
>
> The presence of the "Wants" will affect dependency order and
> that might be why using Wants does what's needed.
>
> Perhaps there is some other sensible reason gssproxy cannot
> use this?
>
> Ian

This change came at request in https://bugzilla.redhat.com/show_bug.cgi?id=1326440 and several other bugs.

But basically: gssproxy is an application that is *consumed by* NFS, not the other way around. NFS should be encoding its dependency on gssproxy; it is conceptually incorrect for gssproxy to cart around a list of applications that depend on it.

Again, can QE please try gssproxy-0.7.0-14 and tell me if the issue is resolved?

(In reply to Robbie Harwood from comment #56)
> (In reply to Ian Kent from comment #54)
> > [...]
> >
> > So "Wants" appears to be appropriate in that it won't fail the
> > gssproxy service if proc-fs-nfsd.mount fails or is not present.
> >
> > The presence of the "Wants" will affect dependency order and
> > that might be why using Wants does what's needed.
> >
> > Perhaps there is some other sensible reason gssproxy cannot
> > use this?
>
> This change came at request in
> https://bugzilla.redhat.com/show_bug.cgi?id=1326440 and several other bugs.

Indeed, and the problem statement contains a contradiction: "But in no sense does gssproxy depend on NFS", and just before that, an admission that gssproxy does depend on NFS: "Yes, if NFS is in use, proc-fs-nfsd.mount should be started before gssproxy.service". That second statement is the problem description and essentially says that it is a gssproxy problem. And there is no discussion about what this person actually needs and why the bug has been logged.

> But basically: gssproxy is an application that is *consumed by* NFS, not the
> other way around. NFS should be encoding its dependency on gssproxy; it is
> conceptually incorrect for gssproxy to cart around a list of applications
> that depend on it.

NFS shouldn't need to concern itself with whether gssproxy is too clueless to provide services to NFS if the proc filesystem entry isn't available when it's started. Documenting this fact without adding a hard dependency seems like the right thing to do if gssproxy can't be updated to handle this case. Is it the hard dependency that bug 1326440 is about, or something else?

What really concerns me about what I see here is that it may be this type of approach that has caused some of the more questionable practices within the NFS service units, such as calling systemctl to start a service from within another service unit. Sure, you can do that, and most of the time it isn't a problem, until it is! In fact, I have an essentially unresolvable problem with autofs because of exactly this.

It is essential that the dependency declaration features of systemd be used in a sensible way to document these types of dependencies, rather than trying to do "smart" things in service units that could and do have side effects.

Ian

(In reply to Robbie Harwood from comment #56)
> Again, can QE please try gssproxy-0.7.0-14 and tell me if the issue is
> resolved?

Seems gssproxy-0.7.0-14.el7 still can't resolve this problem. Please find comment #53 for the test logs.

(In reply to Yongcheng Yang from comment #58)
> Seems gssproxy-0.7.0-14.el7 still can't resolve this problem.
>
> Please find comment #53 for the test logs.

Where are the test logs? I see logs showing that tests were run and failed, but no output from them. (What I'd really like is a reproducer that doesn't require that I stand up beaker.)

(In reply to Robbie Harwood from comment #59)
> Where are the test logs? I see logs showing that tests were run and failed,
> but no output from them. (What I'd really like is a reproducer that doesn't
> require that I stand up beaker.)

I don't know what that last sentence means. It would be good if QE could get those logs; hopefully they will.

The other thing you can do is have a look at the test that was run. I had a quick look at the nfs-utils test here and it isn't too hard to follow, but QE use a couple of libraries of scripts which are hard to locate. You would need to ask QE questions if it wasn't clear what these scripts or functions did, or perhaps have QE direct you to where they are located so you can see for yourself. Other than that, most of the scripted test infrastructure is fairly straightforward to follow, so that problems can be manually reproduced based on what the test does, either on Beaker-reserved hosts or on local machines.
Assuming you have a REDHAT.COM tgt ticket, the nfs-utils tests repository can be checked out using:

rhpkg clone tests/nfs-utils

and the test is at:

cd nfs-utils/krb5-test/krb5-common-local

look at it with:

vi runtest.sh

Ian

(In reply to Robbie Harwood from comment #59)
> Where are the test logs? I see logs showing that tests were run and failed,
> but no output from them. (What I'd really like is a reproducer that doesn't
> require that I stand up beaker.)

Hi Robbie, please find "TESTOUT.log" in the beaker page for detailed logs. What's more, the following are the reproduce steps I used:

# > /etc/exports
# > /etc/sysconfig/nfs
# build_krb5_server.sh (comment #37)
# config_krb5_client.sh $HOSTNAME (comment #37)
# systemctl restart gssproxy.service
# systemctl restart rpc-gssd.service
# systemctl restart nfs.service
# mkdir /export_test
# exportfs -v -i $HOSTNAME:/export_test -o rw,sec=sys:krb5:krb5i:krb5p
# mount -t nfs4 -o sec=krb5 $HOSTNAME:/export_test /mnt
mount.nfs4: access denied by server while mounting ...

I can always hit this problem on a newly installed 7.5 machine. Sometimes, if nfs.service is started before "build_krb5_server.sh" (maybe some other issue), the problem is solved, and then I cannot reproduce it any more on that specific machine. Sorry, I still cannot figure out a simple reproducer. Anyway, it can be triggered consistently on a newly installed machine.

I too can now reliably reproduce this in a manner similar to comment #61; the newer gssproxy -14 makes no difference. One observation is that no amount of restarting services seems to fix this, only a reboot, and I have checked that the exact same set of services is running before and after that reboot.

The more I look at this, the more I am reminded of the conclusions in bug 1452881, in that although gssproxy does not require nfsd to be running, if it is not then it puts itself into a state which is not updated if nfsd is later started.

The change between release 6 and 7 removed the requirement that nfsd be started before gssproxy, allowing this to happen. I suppose what is required is a way for nfs-server.service to be able to request that, if and only if gssproxy is already running, it be restarted.

nfs-server.service does try to reload gssproxy by sending it a HUP, but this is not enough; it does not reload everything and doesn't change the /proc/net/rpc/use-gss-proxy file, amongst other things.

I would propose the following workaround to nfs-server.service. Whilst not being ideal (as it calls systemctl), I see no existing mechanism that would do the same without adding back the dependencies we were trying to remove. This will ensure that if gssproxy was already running when nfs is started, as appears to be happening in this bug, it will be restarted, but does not create a dependency in either direction.

--- nfs-server.service 2017-12-11 16:29:18.010517239 +0000
+++ nfs-server.service.orig 2017-12-11 16:27:00.988323934 +0000
@@ -25,8 +25,8 @@
 Type=oneshot
 RemainAfterExit=yes
 ExecStartPre=-/usr/sbin/exportfs -r
+ExecStartPre=-/bin/sh -c '/bin/kill -HUP `cat /run/gssproxy.pid`'
 ExecStart=/usr/sbin/rpc.nfsd $RPCNFSDARGS
-ExecStartPost=-/bin/sh -c 'if systemctl -q is-active gssproxy; then systemctl restart gssproxy; fi'
 ExecStop=/usr/sbin/rpc.nfsd 0
 ExecStopPost=/usr/sbin/exportfs -au
 ExecStopPost=/usr/sbin/exportfs -f

(In reply to Justin Mitchell from comment #65)

I just did a simple test by hand against the workaround in comment #65 and it does resolve the problem. Thanks, Justin, for the investigating!

Can we get this done today in that case? We need the patch in today if it is to make beta. SteveD should be able to assist.

(In reply to Justin Mitchell from comment #65)
> nfs-server.service does try to reload gssproxy by sending it a HUP, but this
> is not enough, it does not reload everything and doesn't change the
> /proc/net/rpc/use-gss-proxy file amongst other things.
>
> I would propose the following workaround to nfs-server.service, whilst not
> being ideal (as it calls systemctl), I see no existing mechanism that would
> do the same without adding back the dependencies we were trying to remove.
>
> This will ensure that if gssproxy was already running when nfs is
> started, as appears to be happening in this bug, it will be restarted,
> but does not create a dependency in either direction.
>
> --- nfs-server.service 2017-12-11 16:29:18.010517239 +0000
> +++ nfs-server.service.orig 2017-12-11 16:27:00.988323934 +0000
> @@ -25,8 +25,8 @@
>  Type=oneshot
>  RemainAfterExit=yes
>  ExecStartPre=-/usr/sbin/exportfs -r
> +ExecStartPre=-/bin/sh -c '/bin/kill -HUP `cat /run/gssproxy.pid`'
>  ExecStart=/usr/sbin/rpc.nfsd $RPCNFSDARGS
> -ExecStartPost=-/bin/sh -c 'if systemctl -q is-active gssproxy; then systemctl restart gssproxy; fi'
>  ExecStop=/usr/sbin/rpc.nfsd 0
>  ExecStopPost=/usr/sbin/exportfs -au
>  ExecStopPost=/usr/sbin/exportfs -f

Where does this version of nfs-server.service come from? Looking at the latest version:

[Service]
EnvironmentFile=-/run/sysconfig/nfs-utils
Type=oneshot
RemainAfterExit=yes
ExecStartPre=-/usr/sbin/exportfs -r
ExecStartPre=-/bin/sh -c '/bin/kill -HUP `cat /run/gssproxy.pid`'
ExecStart=/usr/sbin/rpc.nfsd $RPCNFSDARGS
ExecStop=/usr/sbin/rpc.nfsd 0
ExecStopPost=/usr/sbin/exportfs -au
ExecStopPost=/usr/sbin/exportfs -f

That line exists... it was actually added by the nfs-utils-1.3.0-systemd-gssproxy.patch in 7.4...

(In reply to Justin Mitchell from comment #65)
> nfs-server.service does try to reload gssproxy by sending it a HUP, but this
> is not enough, it does not reload everything and doesn't change the
> /proc/net/rpc/use-gss-proxy file amongst other things.
>
> I would propose the following workaround to nfs-server.service [...]
>
> This will ensure that if gssproxy was already running when nfs is
> started, as appears to be happening in this bug, it will be restarted,
> but does not create a dependency in either direction.
>
> [workaround patch as above]

So I guess the patch is telling me to add the systemctl and remove the -HUP... The -HUP did work in 7.4... what changed in gssproxy to break it?

Here is the proposed patch:

--- nfs-server.service.orig 2017-11-03 10:14:43.000000000 -0400
+++ nfs-server.service 2017-12-12 10:50:55.083472070 -0500
@@ -25,8 +25,8 @@ EnvironmentFile=-/run/sysconfig/nfs-util
 Type=oneshot
 RemainAfterExit=yes
 ExecStartPre=-/usr/sbin/exportfs -r
-ExecStartPre=-/bin/sh -c '/bin/kill -HUP `cat /run/gssproxy.pid`'
 ExecStart=/usr/sbin/rpc.nfsd $RPCNFSDARGS
+ExecStartPost=-/bin/sh -c 'if systemctl -q is-active gssproxy; then systemctl restart gssproxy ; fi'
 ExecStop=/usr/sbin/rpc.nfsd 0
 ExecStopPost=/usr/sbin/exportfs -au
 ExecStopPost=/usr/sbin/exportfs -f

The change is based on nfs-utils-1.3.0-0.52.el7 that was in the 7.5-alpha that I tested on.

gssproxy-0.7.0-6.el7 (comment #15) used to depend on nfsd, so that gssproxy always started up in a specific mode. That dependency causes the issues listed elsewhere, so it was removed. So now gssproxy can start up in a state where nfsd was not already running, maybe not even enabled yet, which is fine in itself.
If you later configure and run nfsd, as the QE tests do, the service file sends a HUP to gssproxy to reload its config. However, that reload is not a complete one, and krb5 remains in a broken state until the next reboot, when the normal service ordering has nfsd start before gssproxy.

Short version: systemd boot ordering works, but manual starting fails if nfsd comes after gssproxy, because -HUP does not sufficiently reset gssproxy's state.

With the patch in Comment 70 and restarting the nfs server, I'm seeing the following gssproxy error:

Stopping GSSAPI Proxy Daemon...
[2017/12/12 16:09:02]: Exiting after receiving a signal
Starting GSSAPI Proxy Daemon...
[2017/12/12 16:09:02]: Debug Enabled (level: 9)
[2017/12/12 16:09:02]: Problem with kernel communication! NFS server will not work
[2017/12/12 16:09:02]: Failed to get peer's SELinux context (92:Protocol not available)
[2017/12/12 16:09:02]: Client connected (fd = 11)
[2017/12/12 16:09:02]: (pid = 2045) (uid = 0) (gid = 0)
[2017/12/12 16:09:02]: Started GSSAPI Proxy Daemon.

Per discussion on IRC, I have a gssproxy version that is slow but workable.

No changes from nfs-utils should be required at this time. If I have missed any other questions in the above, please re-ask them.
I'm thinking if we make the following change that slowness will go away:

diff -up ./src/client/gpm_common.c.orig ./src/client/gpm_common.c
--- ./src/client/gpm_common.c.orig 2017-12-12 16:23:25.856802016 -0500
+++ ./src/client/gpm_common.c 2017-12-12 17:39:23.740147375 -0500
@@ -48,7 +48,7 @@ static void gpm_init_once(void)
     pthread_mutex_init(&gpm_global_ctx.lock, &attr);

     gpm_global_ctx.fd = -1;
-    gpm_global_ctx.epollfd = -1;
+    gpm_global_ctx.epollfd = 10000;
     gpm_global_ctx.timerfd = -1;

     seedp = time(NULL) + getpid() + pthread_self();

(In reply to Robbie Harwood from comment #73)
> Per discussion on IRC, I have a gssproxy version that is slow but workable.
>
> No changes from nfs-utils should be required at this time. If I have missed
> any other questions in the above, please re-ask them.

Just went through the IRC logs; it seems gssproxy-0.7.0-16.el7 is meant to fix the hang issue, i.e. bug 1519511 (mentioned in comment #26, comment #34 and comment #46), but not this one. And the reproducer of comment #61 still fails, as Justin says.

For this "gssproxy started before nfsd" issue, thanks Justin for the debugging. I just found that when we get into the mis-configured situation, the file "/proc/net/rpc/use-gss-proxy" doesn't contain "1", which does not match https://www.kernel.org/doc/Documentation/filesystems/nfs/rpc-server-gss.txt
However, I don't know whose duty it is to update that file :-)

(In reply to Yongcheng Yang from comment #79)
> For this "gssproxy started before nfsd" issue, thanks Justin for the
> debugging. I just found that when we get into the mis-configured situation,
> the file "/proc/net/rpc/use-gss-proxy" doesn't contain "1", which does not
> match https://www.kernel.org/doc/Documentation/filesystems/nfs/rpc-server-gss.txt
> However, I don't know whose duty it is to update that file :-)

I do not know if it is the only program that touches it, but gssproxy has code to set the use-gss-proxy value, and it does so only once at startup; it will not change it later from a HUP or other input, i.e.
init_proc_nfsd() in gp_init.c is called only once from main().

I am not sure yet if this is the cause or just a symptom, but I do know that restarting gssproxy after you first start nfsd does solve the stated problem, which is why I called it a workaround rather than a fix.

We just never used to see this because previous versions always started nfs before starting gssproxy; likewise you won't see it if both services are enabled and you reboot, as the ordering dependencies will start nfs first.

(In reply to Steve Dickson from comment #78)
> (In reply to Robbie Harwood from comment #73)
> > Per discussion on IRC, I have a gssproxy version that is slow but workable.
> >
> > No changes from nfs-utils should be required at this time. If I have missed
> > any other questions in the above, please re-ask them.
>
> I'm thinking if we make the following change that slowness will go away:
>
> diff -up ./src/client/gpm_common.c.orig ./src/client/gpm_common.c
> --- ./src/client/gpm_common.c.orig 2017-12-12 16:23:25.856802016 -0500
> +++ ./src/client/gpm_common.c 2017-12-12 17:39:23.740147375 -0500
> @@ -48,7 +48,7 @@ static void gpm_init_once(void)
>      pthread_mutex_init(&gpm_global_ctx.lock, &attr);
>
>      gpm_global_ctx.fd = -1;
> -    gpm_global_ctx.epollfd = -1;
> +    gpm_global_ctx.epollfd = 10000;
>      gpm_global_ctx.timerfd = -1;
>
>      seedp = time(NULL) + getpid() + pthread_self();

Without the above fix, here are the mount speeds:

rhel7# time mount -o sec=krb5 rhel7srv:/home/tmp /mnt/tmp
real    0m45.322s
user    0m0.003s
sys     0m0.011s
rhel7# umount /mnt/tmp
rhel7# time mount -o sec=krb5 rhel7srv:/home/tmp /mnt/tmp
real    1m0.122s
user    0m0.001s
sys     0m0.005s
rhel7# umount /mnt/tmp

With the above fix, here are the mount speeds:

rhel7# time mount -o sec=krb5 rhel7srv:/home/tmp /mnt/tmp
real    0m0.205s
user    0m0.002s
sys     0m0.013s
rhel7# umount /mnt/tmp
rhel7# time mount -o sec=krb5 rhel7srv:/home/tmp /mnt/tmp
real    0m0.106s
user    0m0.001s
sys     0m0.014s

Steve, epollfd is a *file descriptor*, you cannot assign
random values to it. Maybe you want to change the timeouts to be shorter.

(In reply to Simo Sorce from comment #82)
> Steve,
> epollfd is a *file descriptor*, you cannot assign random values to it. Maybe
> you want to change the timeouts to be shorter.

Maybe I got it wrong... I thought that was what it was suggested I do... My original idea was this, which worked quite well too:

diff -up ./src/client/gpm_common.c.orig ./src/client/gpm_common.c
--- ./src/client/gpm_common.c.orig 2017-12-12 16:23:25.856802016 -0500
+++ ./src/client/gpm_common.c 2017-12-13 08:37:37.085290157 -0500
@@ -269,7 +269,7 @@ static int gpm_epoll_wait(struct gpm_ctx
     }

     do {
-        epoll_ret = epoll_wait(gpmctx->epollfd, events, 2, -1);
+        epoll_ret = epoll_wait(gpmctx->epollfd, events, 2, 10000);
     } while (epoll_ret < 0 && errno == EINTR);

     if (epoll_ret < 0) {

Now that QE has acked... let's see what they see... If they don't see this 1m delay then fine... it's something in my world... But if they see the 1m delay for each mount, we make one of the above changes....

Not that most people put betas in production, but a 1m delay on every single NFS mount will cause pain, especially with hundreds of mounts happening at once. We should try to avoid this pain... IMHO...

This gssproxy fix should have been made for this bz: https://bugzilla.redhat.com/show_bug.cgi?id=1519511
So Comment 77 through Comment 84 should have been made in bug 1519511, since they fix the NFS client mounts hanging in the gssproxy code. The patch proposed in Comment 74 of this bz mostly fixes the access denied by server issue QE was originally seeing.

Please, please, my kingdom for a separate bug for the slowness. Also, on IRC you told me you were mucking with selinux state as well, not just doing things to the epoll_wait call.

Per IRC discussion: I would rather fix this in gssproxy if you're seeing the HUP not be sufficient.

However, I would again like to ask that NFS consider using `systemctl reload` instead of digging up the gssproxy pidfile and shelling out to kill.
Therefore, I am taking this bug, and requesting that you not do what was proposed in Comment #70.

For reference: my understanding of the bug is that if NFS is started after gssproxy, then we see access denied messages, but that if we start gssproxy after NFS is up, everything works. And there is additional evidence to suggest that gssproxy is failing to poke the proc file after restart.

(In reply to Robbie Harwood from comment #90)
> Per IRC discussion: I would rather fix this in gssproxy if you're seeing the
> HUP not be sufficient.
>
> However, I would again like to ask that NFS consider using `systemctl
> reload` instead of digging up the gssproxy pidfile and shelling out to kill.

Ok.... Please file a bz on this.... I'll make that happen later on down the road..

> Therefore, I am taking this bug, and requesting that you not do what was
> proposed in Comment #70.

It's all yours... I can't do any commits now... No bz :-)

> For reference: my understanding of the bug is that if NFS is started after
> gssproxy, then we see access denied messages. But that if we start gssproxy
> after NFS is up, everything works. And there is additional evidence to
> suggest that gssproxy is failing to poke the proc file after restart.

I think you nailed it.

I would further refine this: instead of having nfs-utils restart gssproxy, if you modify gssproxy to poke proc after a HUP, then the existing nfs-utils script is enough, and this also appears to fix the issue. This works in my tests, applied to gssproxy-0.7.0-16:

diff -ur gssproxy-0.7.0.orig/src/gssproxy.c gssproxy-0.7.0.wip/src/gssproxy.c
--- gssproxy-0.7.0.orig/src/gssproxy.c 2017-12-13 15:37:45.496066981 +0000
+++ gssproxy-0.7.0.wip/src/gssproxy.c 2017-12-13 15:40:56.557956682 +0000
@@ -143,6 +143,7 @@
     free_config(&old_config);

     GPDEBUG("New config loaded successfully.\n");
+    init_proc_nfsd(gpctx->config);

     return;
 }

Need a qe_ack before I can do a proper build.
In the meantime, here's a scratch-build that should fix the issue: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=14751067

(The logic should match pretty closely to Justin's patch, but do please complain if it doesn't work.)

(In reply to Robbie Harwood from comment #94)
> In the meantime, here's a scratch-build that should fix the issue:
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=14751067
>
> (The logic should match pretty closely to Justin's patch, but do please
> complain if it doesn't work.)

Thanks for it! Have verified that the above version does fix the problem. P.S. as there is a Beaker incident, I just tested it by hand for now.

What's more, just a simple test for the proc file:
~~~~~~~~~~~~~~~~~~~~~~~~~
[root@~]# rpm -q gssproxy
gssproxy-0.7.0-13.el7.x86_64
[root@~]# cat /proc/net/rpc/use-gss-proxy
cat: /proc/net/rpc/use-gss-proxy: No such file or directory
[root@~]# lsmod | grep nfsd
[root@~]# systemctl restart gssproxy
[root@~]# systemctl restart nfs
[root@~]# cat /proc/net/rpc/use-gss-proxy
-1
^^^
[root@~]#
[root@~]# rpm -Uvh gssproxy-0.7.0-17.el7.x86_64.rpm >/dev/null    <<< updating
[root@~]# sh krb5-common-local/rm_nfsd.sh    <<< remove the module nfsd
[root@~]# rpm -q gssproxy
gssproxy-0.7.0-17.el7.x86_64
[root@~]# cat /proc/net/rpc/use-gss-proxy
cat: /proc/net/rpc/use-gss-proxy: No such file or directory
[root@~]# lsmod | grep nfsd
[root@~]# systemctl restart gssproxy
[root@~]# systemctl restart nfs
[root@~]# cat /proc/net/rpc/use-gss-proxy
1
^^^

Checked it on my reproducer, works for me as well, thanks.

(In reply to Yongcheng Yang from comment #96)
> (In reply to Robbie Harwood from comment #94)
> > In the meantime, here's a scratch-build that should fix the issue:
> > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=14751067
> >
> > (The logic should match pretty closely to Justin's patch, but do please
> > complain if it doesn't work.)
>
> Thanks for it! Have verified that the above version does fix the problem.
> P.S.
> as there is a Beaker incident, I just tested it by hand for now.
>
> What's more, just a simple test for the proc file:
> ~~~~~~~~~~~~~~~~~~~~~~~~~
> [root@~]# rpm -q gssproxy
> gssproxy-0.7.0-13.el7.x86_64
> [root@~]# cat /proc/net/rpc/use-gss-proxy
> cat: /proc/net/rpc/use-gss-proxy: No such file or directory
> [root@~]# lsmod | grep nfsd
> [root@~]# systemctl restart gssproxy
> [root@~]# systemctl restart nfs
> [root@~]# cat /proc/net/rpc/use-gss-proxy
> -1
> ^^^
> [root@~]#
> [root@~]# rpm -Uvh gssproxy-0.7.0-17.el7.x86_64.rpm >/dev/null    <<< updating
> [root@~]# sh krb5-common-local/rm_nfsd.sh    <<< remove the module nfsd
> [root@~]# rpm -q gssproxy
> gssproxy-0.7.0-17.el7.x86_64
> [root@~]# cat /proc/net/rpc/use-gss-proxy
> cat: /proc/net/rpc/use-gss-proxy: No such file or directory
> [root@~]# lsmod | grep nfsd
> [root@~]# systemctl restart gssproxy
> [root@~]# systemctl restart nfs
> [root@~]# cat /proc/net/rpc/use-gss-proxy
> 1
> ^^^

Yongcheng Yang,

Can we mark this verified on the basis of the above outputs?

(In reply to Kaleem from comment #98)
> Yongcheng Yang,
>
> Can we mark this verified on the basis of the above outputs?

Sure! Have verified that gssproxy-0.7.0-17.el7 has solved this problem.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0709