Bug 1095015
| Summary: | "seting the network namespace failed: Invalid argument" from ip netns commands | |||
|---|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Ian Wienand <iwienand> | |
| Component: | iproute | Assignee: | Phil Sutter <psutter> | |
| Status: | CLOSED WORKSFORME | QA Contact: | Ofer Blaut <oblaut> | |
| Severity: | medium | Docs Contact: | ||
| Priority: | medium | |||
| Version: | 4.0 | CC: | aortega, iwienand, kzhang, lhh, lwang, majopela, rhos-maint, rkhan, sclewis, wfoster, yeylon | |
| Target Milestone: | --- | Keywords: | ZStream | |
| Target Release: | 5.0 (RHEL 7) | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | Bug Fix | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1320578 | Environment: | ||
| Last Closed: | 2016-03-31 10:14:34 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1320578 | |||
Another data point from the neutron side ... all of the TAP devices on the neutron server had disappeared (not showing in ip link). However, all of the dnsmasq processes for the various networks were still running and listening on these non-existent tap devices.

Ok, I managed to recreate this problem by deleting an attached ns
e.g., in one window do
---
[root@rhel ~]# ip netns add testing
[root@rhel ~]# ls -l /var/run/netns/testing
-r--r--r--. 1 root root 0 May 7 14:44 /var/run/netns/testing
[root@rhel ~]# ip netns exec testing bash -c "while [ 1 ]; do echo "hi"; sleep 5; done"
---
in another window, remove it
---
[root@rhel ~]# ip netns del testing
Cannot remove /var/run/netns/testing: Device or resource busy
[root@rhel ~]# ls -l /var/run/netns/testing
----------. 1 root root 0 May 7 14:44 /var/run/netns/testing
---
note it has disappeared from /proc/mounts
---
[root@rhel ~]# cat /proc/mounts
rootfs / rootfs rw 0 0
proc /proc proc rw,relatime 0 0
sysfs /sys sysfs rw,seclabel,relatime 0 0
devtmpfs /dev devtmpfs rw,seclabel,relatime,size=951024k,nr_inodes=237756,mode=755 0 0
devpts /dev/pts devpts rw,seclabel,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /dev/shm tmpfs rw,seclabel,relatime 0 0
/dev/mapper/vg_rhel-lv_root / ext4 rw,seclabel,relatime,barrier=1,data=ordered 0 0
none /selinux selinuxfs rw,relatime 0 0
devtmpfs /dev devtmpfs rw,seclabel,relatime,size=951024k,nr_inodes=237756,mode=755 0 0
/proc/bus/usb /proc/bus/usb usbfs rw,relatime 0 0
/dev/sda1 /boot ext4 rw,seclabel,relatime,barrier=1,data=ordered 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
cgroup /cgroup/cpuset cgroup rw,relatime,cpuset 0 0
cgroup /cgroup/cpu cgroup rw,relatime,cpu 0 0
cgroup /cgroup/cpuacct cgroup rw,relatime,cpuacct 0 0
cgroup /cgroup/memory cgroup rw,relatime,memory 0 0
cgroup /cgroup/devices cgroup rw,relatime,devices 0 0
cgroup /cgroup/freezer cgroup rw,relatime,freezer 0 0
cgroup /cgroup/net_cls cgroup rw,relatime,net_cls 0 0
cgroup /cgroup/blkio cgroup rw,relatime,blkio 0 0
/etc/auto.misc /misc autofs rw,relatime,fd=7,pgrp=1314,timeout=300,minproto=5,maxproto=5,indirect 0 0
-hosts /net autofs rw,relatime,fd=13,pgrp=1314,timeout=300,minproto=5,maxproto=5,indirect 0 0
---
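The manual check above (looking for the namespace path in /proc/mounts) can be scripted. The following is only a sketch: the function names are made up for illustration, and it assumes the mount point is reported under /var/run/netns as on this host (on systems where /var/run is a symlink to /run, the path in /proc/mounts would differ).

```python
def netns_mount_present(name, mounts_text, run_dir="/var/run/netns"):
    """Return True if run_dir/name shows up as a mount point in mounts_text.

    mounts_text is the content of /proc/mounts; each line has the fields
    device, mount point, fs type, options, dump, pass.
    """
    target = f"{run_dir}/{name}"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == target:
            return True
    return False


def read_proc_mounts(path="/proc/mounts"):
    """Read the mount table of the calling process's mount namespace."""
    with open(path) as f:
        return f.read()


# Example use on a live system:
#   netns_mount_present("testing", read_proc_mounts())
```

Note that /proc/mounts only reflects the mount namespace of the process reading it, so a False result here does not prove the bind mount is gone everywhere.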
looking at the patch for iproute netns support
---
+static int netns_delete(int argc, char **argv)
+{
+ const char *name;
+ char netns_path[MAXPATHLEN];
+
+ if (argc < 1) {
+ fprintf(stderr, "No netns name specified\n");
+ return -1;
+ }
+
+ name = argv[0];
+ snprintf(netns_path, sizeof(netns_path), "%s/%s", NETNS_RUN_DIR, name);
+ umount2(netns_path, MNT_DETACH);
+ if (unlink(netns_path) < 0) {
+ fprintf(stderr, "Cannot remove %s: %s\n",
+ netns_path, strerror(errno));
+ return -1;
+ }
+ return 0;
+}
---
I really think that umount2 call should check its return value before it goes on to do the unlink.
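To illustrate the ordering suggested here, the sketch below checks the result of the unmount before unlinking. It is written in Python rather than as a patch to ip/ipnetns.c, the function names are hypothetical, and treating EINVAL as "not a mount point, safe to unlink" is a design choice of this sketch, not something iproute does. In the reproduction above the unmount appears to have succeeded (the entry left /proc/mounts) and it was the unlink that failed, so this check alone may not have changed the outcome.

```python
import ctypes
import errno
import os

MNT_DETACH = 2  # value of MNT_DETACH in <sys/mount.h> on Linux


def libc_umount2(path, flags):
    """Call umount2(2) through libc and raise OSError on failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.umount2(os.fsencode(path), flags) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), path)


def delete_netns_checked(path, umount=libc_umount2, unlink=os.unlink):
    """Unmount path, then unlink it only if the unmount did not fail.

    Returns True if the file was unlinked, False if it was left alone.
    umount and unlink are injectable so the logic can be exercised
    without root privileges.
    """
    try:
        umount(path, MNT_DETACH)
    except OSError as exc:
        # EINVAL from umount2 means the target is not a mount point, so
        # there was nothing to unmount; any other error leaves the file.
        if exc.errno != errno.EINVAL:
            return False
    unlink(path)
    return True
```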
I see upstream is the same so maybe this highlights an invalid assumption that rhel kernel invalidates [1]? I couldn't find a lot of discussion, [2] was the only relevant thread

[1] http://git.kernel.org/cgit/linux/kernel/git/shemminger/iproute2.git/tree/ip/ipnetns.c#n377
[2] http://marc.info/?l=linux-netdev&m=137962865905031&w=2

Hmm, what build are you using? There was a recent fix related to netns for RHOS4 in bug #1062685. Could that possibly resolve your issue too?

Sorry, forgot to paste the route version
---
[root@host03 ~]# rpm -qa | grep iproute
iproute-2.6.32-130.el6ost.netns.3.x86_64
---
so we should have the fix for bug#1062685.

I guess the question is if the behaviour in comment#2 is a bug or a feature...

(In reply to Ian Wienand from comment #5)
> Sorry, forgot to paste the route version
>
> ---
> [root@host03 ~]# rpm -qa | grep iproute
> iproute-2.6.32-130.el6ost.netns.3.x86_64
> ---
>
> so we should have the fix for bug#1062685.

So what's the current status of the bug report from your point of view?

> I guess the question is if the behaviour in comment#2 is a bug or a
> feature...

I don't have details on desired behavior of busy network namespaces but I don't think iproute can affect that. If you're pursuing a fix/change in kernel behavior, please switch the bug to kernel or start a new bug report.

Hi,

To me the behaviour seems intentional. Upon 'ip netns del <name>', the mounted netns at /var/run/netns/<name> is first umounted (using MNT_DETACH flag), then unlinked. Both operations will succeed even if another process still runs inside that namespace. In fact, MNT_DETACH explicitly requests to delay the umount in case the mount point is busy until the last user is gone.

After evaluating a few ways to improve the situation, the only way I see for OpenStack is to behave nicely and kill all PIDs returned by 'ip netns pids <name>' before deleting the namespace. Sadly, one can't get the list of PIDs after deleting the namespace, so there's still a slight chance for a race condition (if a new process is spawned inside the NS in between killing old ones and deleting it). Though I also couldn't find a way to keep the PIDs list available after NS removal since on RHEL7 at least even a regular umount (without MNT_DETACH) succeeds if there are still processes running inside.

Is this a possible workaround to the observed behaviour?

Thanks, Phil

Honestly, in the (almost) 2 years since I filed this, Neutron has changed so much I have no idea.

I just tried this on a F23 box and i guess it works as you would expect now ... if you create a ns in one window and run something, then delete it in another, there's nothing left in /var/run/netns but the other process seems to keep running.

Hi Ian,

(In reply to Ian Wienand from comment #8)
> Honestly, in the (almost) 2 years since I filed this, Neutron has changed so
> much I have no idea.
>
> I just tried this on a F23 box and i guess it works as you would expect now
> ... if you create a ns in one window and run something, then delete it in
> another, there's nothing left in /var/run/netns but the other process seems
> to keep running.

Thanks for your reply. Assuming the problem is not relevant anymore, I'm closing this ticket now. Feel free to reopen in case you stumble upon it again.

Thanks, Phil
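For reference, the workaround proposed in the comments above (kill every PID reported by 'ip netns pids <name>', then delete the namespace) could be sketched roughly as follows. This is an illustration only, not Neutron code: the function names are hypothetical, the choice of SIGKILL is an assumption, and the race mentioned above (a process spawned between the kill and the delete) is not addressed.

```python
import os
import signal
import subprocess


def run_ip(args):
    """Run `ip <args>` and return its stdout; raises on a non-zero exit."""
    result = subprocess.run(
        ["ip"] + list(args), check=True, capture_output=True, text=True
    )
    return result.stdout


def delete_netns_after_killing(name, run=run_ip, kill=os.kill,
                               sig=signal.SIGKILL):
    """Kill all processes inside namespace `name`, then delete it.

    Returns the list of PIDs that were signalled. run and kill are
    injectable so the sequence can be tested without root or a real
    namespace.
    """
    pids = [int(tok) for tok in run(["netns", "pids", name]).split()]
    for pid in pids:
        try:
            kill(pid, sig)
        except ProcessLookupError:
            pass  # the process exited on its own in the meantime
    run(["netns", "delete", name])
    return pids
```

The PID list is fetched before the delete because, as noted above, it is no longer available once the namespace name is gone.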
Description of problem:

neutron started to misbehave quite badly today in oslab. One of the symptoms was a seeming inability to work with network namespaces; for example the l3-agent.log was filling up with things like
---
2014-05-06 21:17:44.387 18955 ERROR neutron.agent.l3_agent [-] Failed synchronizing routers
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent Traceback (most recent call last):
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent File "/usr/lib/python2.6/site-packages/neutron/agent/l3_agent.py", line 765, in _sync_routers_task
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent self._process_routers(routers, all_routers=True)
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent File "/usr/lib/python2.6/site-packages/neutron/agent/l3_agent.py", line 708, in _process_routers
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent self._router_added(r['id'], r)
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent File "/usr/lib/python2.6/site-packages/neutron/agent/l3_agent.py", line 337, in _router_added
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent self._create_router_namespace(ri)
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent File "/usr/lib/python2.6/site-packages/neutron/agent/l3_agent.py", line 307, in _create_router_namespace
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent ip_wrapper.netns.execute(['sysctl', '-w', 'net.ipv4.ip_forward=1'])
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent File "/usr/lib/python2.6/site-packages/neutron/agent/linux/ip_lib.py", line 467, in execute
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent check_exit_code=check_exit_code)
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent File "/usr/lib/python2.6/site-packages/neutron/agent/linux/utils.py", line 75, in execute
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent raise RuntimeError(m)
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent RuntimeError:
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent Command: ['sudo', 'ip', 'netns', 'exec', 'qrouter-857330af-e8c1-4cc5-b900-086596210244', 'sysctl', '-w', 'net.ipv4.ip_forward=1']
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent Exit code: 255
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent Stdout: ''
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent Stderr: 'seting the network namespace failed: Invalid argument\n'
---
looking a little closer at one network, it exists but can't be used
---
[root@host03 log]# ip netns list | grep dhcp-20803109-5bea-4638-8b02-bcdbace3f0b2
qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2
[root@host03 log]# /sbin/ip netns exec qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2 ip -o link show
seting the network namespace failed: Invalid argument
---
strace of this shows
---
[root@host03 log]# strace /sbin/ip netns exec qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2 ip -o link show
execve("/sbin/ip", ["/sbin/ip", "netns", "exec", "qdhcp-20803109-5bea-4638-8b02-bc"..., "ip", "-o", "link", "show"], [/* 25 vars */]) = 0
open("/var/run/netns/qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2", O_RDONLY) = 4
syscall_308(0x4, 0x40000000, 0xffffffffffffffff, 0, 0x622d323062382d38, 0x7fff2bbc1771, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de) = -1 (errno 22)
write(2, "seting the network namespace fai"..., 54seting the network namespace failed: Invalid argument
) = 54
exit_group(-1) = ?
---
unfortunately strace in rhel isn't built against the openstack kernel so it doesn't know about the setns call, but we can see it's calling setns(4, CLONE_NEWNET), where fd 4 is /var/run/netns/qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2, which seems about right.

EINVAL looks like it could come from two places; proc_ns_fget() or the type check in setns()
---
struct file *proc_ns_fget(int fd)
{
    struct file *file;

    file = fget(fd);
    if (!file)
        return ERR_PTR(-EBADF);

    if (file->f_op != &ns_file_operations)
        goto out_invalid;

    return file;

out_invalid:
    fput(file);
    return ERR_PTR(-EINVAL);
}

...

SYSCALL_DEFINE2(setns, int, fd, int, nstype)
{
    const struct proc_ns_operations *ops;
    struct task_struct *tsk = current;
    struct nsproxy *new_nsproxy;
    struct proc_inode *ei;
    struct file *file;
    int err;

    if (!capable(CAP_SYS_ADMIN))
        return -EPERM;

    file = proc_ns_fget(fd);
    if (IS_ERR(file))
        return PTR_ERR(file);

    err = -EINVAL;
    ei = PROC_I(file->f_dentry->d_inode);
    ops = ei->ns_ops;
    if (nstype && (ops->type != nstype))
        goto out;
    ...
}
---
The permissions on the file at the time of the problem were blank
---
[root@host03 netns]# ls -l /var/run/netns/qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2
----------. 1 root root 0 Mar 27 03:03 /var/run/netns/qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2
---
after the system was rebooted, it came back to 444
---
[root@host03 netns]# ls -l /var/run/netns/qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2
-r--r--r--. 1 root root 0 May 6 21:25 /var/run/netns/qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2
---
Does this suggest the namespace somehow became detached? wfoster and I went through the logs pretty carefully; there is nothing to suggest what would have changed the permissions, or how. The only network related thing in dmesg (which unfortunately isn't timestamped on this system) was
---
lo: Disabled Privacy Extensions
__ratelimit: 376 callbacks suppressed
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
tapa3f71897-7b: no IPv6 routers present
qr-7dbfe943-f4: no IPv6 routers present
tap6b31d415-c7: no IPv6 routers present
device qg-56760887-0c entered promiscuous mode
__ratelimit: 275 callbacks suppressed
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
tap312416e7-51: no IPv6 routers present
qg-56760887-0c: no IPv6 routers present
---
versions:
---
[root@host03 netns]# rpm -qa | grep neutron
openstack-neutron-2013.2.3-7.el6ost.noarch
python-neutron-2013.2.3-7.el6ost.noarch
openstack-neutron-openvswitch-2013.2.3-7.el6ost.noarch
python-neutronclient-2.3.4-1.el6ost.noarch
[root@host03 netns]# rpm -qa | grep kernel
kernel-headers-2.6.32-431.11.2.el6.x86_64
kernel-2.6.32-431.11.2.el6.x86_64
libreport-plugin-kerneloops-2.0.9-19.el6.x86_64
dracut-kernel-004-336.el6_5.2.noarch
kernel-2.6.32-431.3.1.el6.x86_64
kernel-2.6.32-431.5.1.el6.x86_64
abrt-addon-kerneloops-2.0.8-21.el6.x86_64
kernel-firmware-2.6.32-431.11.2.el6.noarch
kernel-devel-2.6.32-431.11.2.el6.x86_64
---
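One possible reading of the blank permissions, offered as an assumption rather than a confirmed diagnosis: 'ip netns add' creates an empty mode-000 file under /var/run/netns and bind-mounts the namespace over it, so a mode-000 listing would mean the bind mount is no longer visible and only the placeholder file remains. Opening that plain file and passing it to setns() would then fail the f_op check in proc_ns_fget() with EINVAL. A small sketch of a check for that state (the function name is hypothetical):

```python
import os
import stat


def looks_like_unmounted_placeholder(path):
    """Return True if path is an empty regular file with mode 000.

    Assumption of this sketch: that is what the netns mount point looks
    like once the namespace bind mount on top of it has gone away. A
    usable namespace was listed as -r--r--r-- (mode 444) in this report.
    """
    st = os.stat(path)
    return (
        stat.S_ISREG(st.st_mode)
        and stat.S_IMODE(st.st_mode) == 0
        and st.st_size == 0
    )


# Example use on a live system:
#   looks_like_unmounted_placeholder("/var/run/netns/testing")
```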