Bug 1153007
Summary: | /var partition at 100% caused by rpc.statd errors: "my_svc_run() - select: Bad file descriptor" filling up /var/log/messages | |
---|---|---|---
Product: | Red Hat Enterprise Linux 6 | Reporter: | Eyal Edri <eedri>
Component: | nfs-utils | Assignee: | Steve Dickson <steved>
Status: | CLOSED NOTABUG | QA Contact: | Filesystem QE <fs-qe>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 6.7 | CC: | amureini, bfields, bkorren, bmcclain, danken, eedri, extras-qa, fdeutsch, greearb, harshula, howey.vernon, johnh, lbednar, mojbordel, pdangur, sbonazzo, sherold, slapshotct, steved, swhiteho
Target Milestone: | rc | Keywords: | Regression
Target Release: | 6.7 | |
Hardware: | x86_64 | OS: | Linux
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | 995861 | Environment: |
Last Closed: | 2015-01-15 14:24:37 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 995861 | |
Bug Blocks: | 1002699, 1147536, 1164308, 1164311, 1164899, 1173188 | |
Attachments: | iptables-save of a typical rhev host (attachment 974835) | |
Description (Eyal Edri, 2014-10-15 11:24:13 UTC):
Any progress on this bug?

---

(In reply to Fabian Deutsch from comment #2)
> Any progress on this bug?

Not really; since it does not happen very often, it's hard to debug... I don't think we just want to comment the error out. Instead we either want to exit or try to recover. Let me take this upstream to see what they have to say...

---

Thanks Steve. Eyal, do you know a reliable way to trigger this problem?

---

This happens to both the CI and QE RHEV teams constantly on their testing environments when using the latest nfs-utils on the latest RHEL 6.6. We are currently enforcing nfs-utils-1.2.3-53.el6.x86_64 via puppet to work around this. If you take one machine, disable puppet, and run yum update to get nfs-utils-1.2.3-54.el6.x86_64, you'll start getting those error messages.

I can provide a test machine if needed; QE might have one already - adding Lukas from QE.

---

(In reply to Eyal Edri from comment #5)
> I can provide a test machine if needed; QE might have one already -
> adding Lukas from QE.

Please do... I would like to debug this live...

---

I've managed to reproduce the issue on loki01.ci.lab.tlv.redhat.com; you can log in and look around. I've reserved this host and disabled puppet on it for now.

---

(In reply to Paz Dangur from comment #7)
> I've managed to reproduce the issue on loki01.ci.lab.tlv.redhat.com; you can
> log in and look around.

Any idea how you reproduced it?

---

OK, I see what the problem is, but I have no idea why or how it happens. There are 43 rpc.statd processes running when there should be one. They are all trying to read from the same pipe, and the pipe breaks, causing the log abuse...

So the question is how and why those rpc.statd processes are getting started. Is there some type of NFS mounting going on? Also, is there a serial console to this machine?

---

Adding Lukas, who is in charge of the testing framework that QE uses; it is while mounting NFS that we see this problem. Lukas - can you assist in explaining what the test is doing while mounting NFS and why it needs so many rpc.statd processes?

---

I don't think it is related to the framework, because that process runs on the jenkins slave (VDC) while the issue happens on the host (VDS).

Here is the flow:
1) Create an export (from the VDC) on one of our storages; it can be EMC, NetApp, XtremIO, or an ordinary Linux server.
2) Set the relevant info (server:/path/to/export) in the config file and run the test.
3) Build the RHEV-M infra via the REST API: create the DC and cluster, add hosts.
4) Add storage, just passing server:/path/to/export into the creation request with other parameters.
5) Run test cases via the REST API, python-sdk, java-sdk, and rhevm-cli.
6) Destroy the infra using the REST API.
7) Remove the export.

The mounting of these mount points is done by vdsm. I don't see a connection between the framework and this issue.
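A quick way to confirm the symptoms described above on an affected host (a sketch only; the log path and error text are taken from the bug summary, and the exact counts will differ per machine):

    # How full is /var, and how large has the flooded log grown?
    df -h /var
    ls -lh /var/log/messages

    # How many rpc.statd processes are running? There should be exactly one.
    pgrep -c rpc.statd

    # How many of the flooding errors have already been logged?
    grep -c 'my_svc_run() - select: Bad file descriptor' /var/log/messages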
---

(In reply to Lukas Bednar from comment #12)
> I don't think it is related to the framework, because that process runs on
> the jenkins slave (VDC) while the issue happens on the host (VDS).

How reproducible is this in this environment? When the problem happens, is there more than one rpc.statd running?

> Here is the flow:
> 1) Create an export (from the VDC) on one of our storages; it can be EMC,
> NetApp, XtremIO, or an ordinary Linux server.

So is the problem happening on that "ordinary Linux server", or does it happen on the clients (sorry, I don't know what a VDS is :-) )?

> 2) Set the relevant info (server:/path/to/export) in the config file and
> run the test.

So do the mounts happen after the config file is set and the tests are run? Or do the mounts happen during the running of the tests?

> 3) Build the RHEV-M infra via the REST API: create the DC and cluster, add hosts.
> 4) Add storage, just passing server:/path/to/export into the creation request
> with other parameters.
> 5) Run test cases via the REST API, python-sdk, java-sdk, and rhevm-cli.
> 6) Destroy the infra using the REST API.
> 7) Remove the export.

Does any of the above steps cause services to be restarted?

> The mounting of these mount points is done by vdsm.
> I don't see a connection between the framework and this issue.

I believe the problem is only happening in this type of environment... but it does happen across the board (RHEL 6/7, Fedora). I'm thinking it has something to do with the NFS mounts, since mount.nfs does try to start rpc.statd if it is not running... But that's just a theory... :-(

Finally, when this does happen, can you ping me on IRC so I can jump on the machine and take a look... My nick is steved.

---

Our step of downgrading the RHEV-CI environment to nfs-utils-1.2.3-53 has not been as effective as we had hoped; we also see this (i.e. multiple rpc.statd processes and flooded logs) happening on that version.

One thing that might give us a hint as to what is happening here: we've seen some tests fail when RHEV is trying to mount an NFS storage domain on a host. Digging in the VDSM logs, we can extract the mount command that was issued:

/usr/bin/sudo -n /bin/mount -t nfs -o soft,nosharecache,timeo=600,retrans=6,nfsvers=3 10.35.148.15:/vol/vfilers/nfs01/jenkins-automation-rpm-vm10_nfs_2014_12_28_14_49_37_884835 /rhev/data-center/mnt/10.35.148.15:_vol_vfilers_nfs01_jenkins-automation-rpm-vm10__nfs__2014__12__28__14__49__37__884835

The error message received from that command is:

mount.nfs: rpc.statd is not running but is required for remote locking.
mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
mount.nfs: an incorrect mount option was specified

(And this is despite having many rpc.statd processes to "pick" from...)

---

I have found the cause of the spawned rpc.statd processes.

How to reproduce this:
1. Do 'service rpcbind restart'.
2. Mount some NFS.

Congratulations, you have yourself a new rpc.statd process.

---

(In reply to Barak Korren from comment #15)
> I have found the cause of the spawned rpc.statd processes.
>
> How to reproduce this:
> 1. Do 'service rpcbind restart'.
> 2. Mount some NFS.
>
> Congratulations, you have yourself a new rpc.statd process.

This does not do it for me...

# service rpcbind restart
# ps ax | grep statd
  888 ?  Ss  0:00 /usr/sbin/rpc.statd --no-notify
# mount server:/home /mnt/home
# ps ax | grep statd
  888 ?  Ss  0:00 /usr/sbin/rpc.statd --no-notify
#

What am I doing wrong?

---

Hmm, interesting... on my laptop and test servers it happens...

So, more info: we're using nfs-utils-1.2.3-53.el6.x86_64. Maybe it's because we have

-A INPUT -j REJECT --reject-with icmp-host-prohibited

set up in iptables (with other rules to open ports, etc. before it). In particular, it blocks port 111 and whatever port rpc.statd uses. Do note that the OUTPUT chain is set up with an ACCEPT policy, so nothing is blocked that way. We also have

-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT

which is why we can mount at all. I've attached an 'iptables-save' dump of a typical server.

Created attachment 974835 [details]
iptables-save of a typical rhev host
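Spelled out as commands, the reproduction recipe from comment 15 looks roughly like this (a sketch; server:/export and /mnt are placeholder names, and as the exchange above shows it does not reproduce on every host):

    # Restart rpcbind; on RHEL 6 a plain restart drops its table of
    # registered services, including rpc.statd's entry (see the -w
    # discussion further down the thread)
    service rpcbind restart

    # An NFSv3 mount now asks rpcbind whether statd is registered, is told
    # it is not, and mount.nfs starts another rpc.statd
    mount -t nfs -o nfsvers=3 server:/export /mnt

    # Expect one more rpc.statd than before the mount
    pgrep -l rpc.statd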
---

I've confirmed this does not happen when mounting from a RHEL server. Also, the iptables stuff seems unrelated. So it probably has to do with the NetApp filers doing something strange.

[root@cinteg03 ~]# pgrep -l rpc.statd
16601 rpc.statd
16661 rpc.statd
16709 rpc.statd
[root@cinteg03 ~]# service rpcbind restart
Stopping rpcbind:                                          [  OK  ]
Starting rpcbind:                                          [  OK  ]
[root@cinteg03 ~]# pgrep -l rpc.statd
16601 rpc.statd
16661 rpc.statd
16709 rpc.statd
[root@cinteg03 ~]# mount cinteg04.ci.lab.tlv.redhat.com:/tmp/export /mnt/
[root@cinteg03 ~]# pgrep -l rpc.statd
16601 rpc.statd
16661 rpc.statd
16709 rpc.statd
[root@cinteg03 ~]# umount /mnt/
[root@cinteg03 ~]# mount 10.35.148.15:/vol/vfilers/nfs01 /mnt/
[root@cinteg03 ~]# pgrep -l rpc.statd
16601 rpc.statd
16661 rpc.statd
16709 rpc.statd
17489 rpc.statd

---

I have seen this problem when using something that was definitely not a NetApp filer, so it's not just them, at least.

---

(In reply to Barak Korren from comment #20)
> I've confirmed this does not happen when mounting from a RHEL server.
> Also, the iptables stuff seems unrelated. So it probably has to do with
> the NetApp filers doing something strange.
> [...]

I logged on to the system and took a look... I was able to determine that the problem was rpcbind telling mount.nfs that rpc.statd was not running when it was. Unfortunately, restarting rpcbind cleared the problem up....

I've created an rpcbind debug rpm, which logs all of rpcbind's communications with services, and installed it on cinteg03.ci.lab.tlv.redhat.com. Hopefully, when the problem happens again, I'll have a log of what's going on... The debug rpm is under:

http://people.redhat.com/steved/.bz1153007/

If anybody else can reproduce this, please install the above rpm and then ping me on IRC (steved) when the problem happens, so I can take a look..

---

(In reply to Barak Korren from comment #15)
> I have found the cause of the spawned rpc.statd processes.
>
> How to reproduce this:
> 1. Do 'service rpcbind restart'.
> 2. Mount some NFS.
>
> Congratulations, you have yourself a new rpc.statd process.

I also looked into this. On RHEL 6, rpcbind is not started with the -w flag, so when it is restarted it loses all of its previous port bindings. As a result, rpcbind does not know rpc.statd is running when mount.nfs asks whether rpc.statd should be started. If rpcbind needs to be restarted, either use the -w flag or restart all of the other services so they can re-register their ports.

---

Hmm... Question... is rpcbind being restarted at some point? If rpcbind is being restarted before every NFS mount, that would explain why there are so many rpc.statd processes running...

---

Looks like Steve D's question is still outstanding, so moving back into needinfo. bkorren, did you manage to reproduce the error with the debug rpms on the system?
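The lost registrations Steve describes can be observed directly with rpcinfo, and the nfslock restart is the workaround the CI team ended up using (a sketch; try it on a test host first):

    # rpc.statd registers with rpcbind as program 100024 ("status")
    rpcinfo -p localhost | grep status

    # A plain restart starts rpcbind without -w, so the registration is lost
    # even though rpc.statd itself keeps running
    service rpcbind restart
    rpcinfo -p localhost | grep status    # now prints nothing

    # Workaround: restart nfslock so rpc.statd re-registers its ports
    service nfslock restart
    rpcinfo -p localhost | grep status    # registered again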
---

Steve,

Yes, rpcbind was indeed being restarted before almost every mount. I assumed I had made this clear already; apparently I hadn't.

We have fixed our testing tools to restart the 'nfslock' service after restarting 'rpcbind', so we don't expect to see this triggered automatically anymore.

I briefly tried, and failed, to reproduce the issue manually once on the test machine with the debug rpms, but I haven't had time to try again and look deeper since. The testing machine is still available, and we can keep it available for a little while.

---

(In reply to Barak Korren from comment #26)
> Steve,
>
> Yes, rpcbind was indeed being restarted before almost every mount. I assumed
> I had made this clear already; apparently I hadn't.

It was not....

> We have fixed our testing tools to restart the 'nfslock' service after
> restarting 'rpcbind', so we don't expect to see this triggered automatically
> anymore.

Right... restarting rpcbind without restarting nfslock is the bug.

> I briefly tried, and failed, to reproduce the issue manually once on the test
> machine with the debug rpms, but I haven't had time to try again and look
> deeper since.
>
> The testing machine is still available, and we can keep it available for a
> little while.

So I'm going to close this as NOTABUG....
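For a host that is already in the bad state this bug describes (dozens of rpc.statd processes and /var at 100%), a possible recovery sequence — this part is an assumption rather than something stated in the thread, so adapt it to your environment before running it:

    # Stop every stray rpc.statd, then bring a single registered one back
    pkill rpc.statd
    service nfslock restart

    # Reclaim /var: truncate the flooded log in place so syslog keeps
    # writing to the same file while the space is freed immediately
    : > /var/log/messages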