+++ This bug was initially created as a clone of Bug #1095015 +++

Description of problem:

neutron started to misbehave quite badly today in oslab. One of the symptoms was a seeming inability to work with network namespaces; for example, the l3-agent.log was filling up with things like

---
2014-05-06 21:17:44.387 18955 ERROR neutron.agent.l3_agent [-] Failed synchronizing routers
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent Traceback (most recent call last):
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent   File "/usr/lib/python2.6/site-packages/neutron/agent/l3_agent.py", line 765, in _sync_routers_task
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent     self._process_routers(routers, all_routers=True)
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent   File "/usr/lib/python2.6/site-packages/neutron/agent/l3_agent.py", line 708, in _process_routers
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent     self._router_added(r['id'], r)
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent   File "/usr/lib/python2.6/site-packages/neutron/agent/l3_agent.py", line 337, in _router_added
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent     self._create_router_namespace(ri)
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent   File "/usr/lib/python2.6/site-packages/neutron/agent/l3_agent.py", line 307, in _create_router_namespace
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent     ip_wrapper.netns.execute(['sysctl', '-w', 'net.ipv4.ip_forward=1'])
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent   File "/usr/lib/python2.6/site-packages/neutron/agent/linux/ip_lib.py", line 467, in execute
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent     check_exit_code=check_exit_code)
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent   File "/usr/lib/python2.6/site-packages/neutron/agent/linux/utils.py", line 75, in execute
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent     raise RuntimeError(m)
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent RuntimeError:
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent Command: ['sudo', 'ip', 'netns', 'exec', 'qrouter-857330af-e8c1-4cc5-b900-086596210244', 'sysctl', '-w', 'net.ipv4.ip_forward=1']
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent Exit code: 255
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent Stdout: ''
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent Stderr: 'seting the network namespace failed: Invalid argument\n'
---

looking a little closer at one network, it exists but can't be used

---
[root@host03 log]# ip netns list | grep dhcp-20803109-5bea-4638-8b02-bcdbace3f0b2
qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2
[root@host03 log]# /sbin/ip netns exec qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2 ip -o link show
seting the network namespace failed: Invalid argument
---

strace of this shows

---
[root@host03 log]# strace /sbin/ip netns exec qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2 ip -o link show
execve("/sbin/ip", ["/sbin/ip", "netns", "exec", "qdhcp-20803109-5bea-4638-8b02-bc"..., "ip", "-o", "link", "show"], [/* 25 vars */]) = 0
open("/var/run/netns/qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2", O_RDONLY) = 4
syscall_308(0x4, 0x40000000, 0xffffffffffffffff, 0, 0x622d323062382d38, 0x7fff2bbc1771, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de) = -1 (errno 22)
write(2, "seting the network namespace fai"..., 54seting the network namespace failed: Invalid argument
) = 54
exit_group(-1) = ?
---

unfortunately strace in rhel isn't built against the openstack kernel so it doesn't know about the setns call, but we can see it's calling setns(4, CLONE_NEWNET), where fd 4 is the just-opened "/var/run/netns/qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2", which seems about right.
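
Because strace prints the call only as a raw syscall_308, the arguments have to be decoded by hand. A minimal Python sketch of that decoding, assuming the trace comes from x86_64 (where syscall number 308 is setns) and using the CLONE_NEW* values from <linux/sched.h>; the helper name decode_setns is made up for illustration:

```python
# Namespace-type flags as defined in <linux/sched.h>.
CLONE_FLAGS = {
    0x00020000: "CLONE_NEWNS",
    0x04000000: "CLONE_NEWUTS",
    0x08000000: "CLONE_NEWIPC",
    0x10000000: "CLONE_NEWUSER",
    0x20000000: "CLONE_NEWPID",
    0x40000000: "CLONE_NEWNET",
}

# Assumption: the trace was taken on x86_64, where setns is syscall 308.
SYS_SETNS_X86_64 = 308


def decode_setns(syscall_nr, fd, nstype):
    """Render the first two arguments of a raw syscall_308 line as setns()."""
    if syscall_nr != SYS_SETNS_X86_64:
        raise ValueError("not setns on x86_64: %d" % syscall_nr)
    # Fall back to the hex value for an nstype that is not a known flag.
    name = CLONE_FLAGS.get(nstype, hex(nstype))
    return "setns(%d, %s)" % (fd, name)


print(decode_setns(308, 0x4, 0x40000000))
```

Only the first two values in the strace line are real arguments to setns; the rest is whatever strace dumps for a syscall it cannot name.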

EINVAL looks like it could come from two places; proc_ns_fget() or the type check in setns()

---
struct file *proc_ns_fget(int fd)
{
        struct file *file;

        file = fget(fd);
        if (!file)
                return ERR_PTR(-EBADF);

        if (file->f_op != &ns_file_operations)
                goto out_invalid;

        return file;

out_invalid:
        fput(file);
        return ERR_PTR(-EINVAL);
}

...

SYSCALL_DEFINE2(setns, int, fd, int, nstype)
{
        const struct proc_ns_operations *ops;
        struct task_struct *tsk = current;
        struct nsproxy *new_nsproxy;
        struct proc_inode *ei;
        struct file *file;
        int err;

        if (!capable(CAP_SYS_ADMIN))
                return -EPERM;

        file = proc_ns_fget(fd);
        if (IS_ERR(file))
                return PTR_ERR(file);

        err = -EINVAL;
        ei = PROC_I(file->f_dentry->d_inode);
        ops = ei->ns_ops;
        if (nstype && (ops->type != nstype))
                goto out;
        ...
}
---

The permissions on the file at the time of the problem were blank

---
[root@host03 netns]# ls -l /var/run/netns/qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2
----------. 1 root root 0 Mar 27 03:03 /var/run/netns/qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2
---

after the system was rebooted, it came back to 444

---
[root@host03 netns]# ls -l /var/run/netns/qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2
-r--r--r--. 1 root root 0 May 6 21:25 /var/run/netns/qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2
---

Does this suggest the namespace somehow became detached?

wfoster and I went through the logs pretty carefully; there is nothing to suggest what would have changed the permissions, or how. The only network-related thing in dmesg (which unfortunately isn't timestamped on this system) was

---
lo: Disabled Privacy Extensions
__ratelimit: 376 callbacks suppressed
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
tapa3f71897-7b: no IPv6 routers present
qr-7dbfe943-f4: no IPv6 routers present
tap6b31d415-c7: no IPv6 routers present
device qg-56760887-0c entered promiscuous mode
__ratelimit: 275 callbacks suppressed
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
tap312416e7-51: no IPv6 routers present
qg-56760887-0c: no IPv6 routers present
---

versions:

---
[root@host03 netns]# rpm -qa | grep neutron
openstack-neutron-2013.2.3-7.el6ost.noarch
python-neutron-2013.2.3-7.el6ost.noarch
openstack-neutron-openvswitch-2013.2.3-7.el6ost.noarch
python-neutronclient-2.3.4-1.el6ost.noarch
[root@host03 netns]# rpm -qa | grep kernel
kernel-headers-2.6.32-431.11.2.el6.x86_64
kernel-2.6.32-431.11.2.el6.x86_64
libreport-plugin-kerneloops-2.0.9-19.el6.x86_64
dracut-kernel-004-336.el6_5.2.noarch
kernel-2.6.32-431.3.1.el6.x86_64
kernel-2.6.32-431.5.1.el6.x86_64
abrt-addon-kerneloops-2.0.8-21.el6.x86_64
kernel-firmware-2.6.32-431.11.2.el6.noarch
kernel-devel-2.6.32-431.11.2.el6.x86_64
---

--- Additional comment from Ian Wienand on 2014-05-06 20:37:33 EDT ---

Another data point from the neutron side ... all of the TAP devices on the neutron server had disappeared (not showing in ip link). However, all of the dnsmasq processes for the various networks were still running and listening on these non-existent tap devices.

--- Additional comment from Ian Wienand on 2014-05-07 00:53:07 EDT ---

Ok, I managed to recreate this problem by deleting an attached ns, e.g., in one window do

---
[root@rhel ~]# ip netns add testing
[root@rhel ~]# ls -l /var/run/netns/testing
-r--r--r--. 1 root root 0 May 7 14:44 /var/run/netns/testing
[root@rhel ~]# ip netns exec testing bash -c "while [ 1 ]; do echo "hi"; sleep 5; done"
---

in another window, remove it

---
[root@rhel ~]# ip netns del testing
Cannot remove /var/run/netns/testing: Device or resource busy
[root@rhel ~]# ls -l /var/run/netns/testing
----------. 1 root root 0 May 7 14:44 /var/run/netns/testing
---

note it has disappeared from /proc/mounts

---
[root@rhel ~]# cat /proc/mounts
rootfs / rootfs rw 0 0
proc /proc proc rw,relatime 0 0
sysfs /sys sysfs rw,seclabel,relatime 0 0
devtmpfs /dev devtmpfs rw,seclabel,relatime,size=951024k,nr_inodes=237756,mode=755 0 0
devpts /dev/pts devpts rw,seclabel,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /dev/shm tmpfs rw,seclabel,relatime 0 0
/dev/mapper/vg_rhel-lv_root / ext4 rw,seclabel,relatime,barrier=1,data=ordered 0 0
none /selinux selinuxfs rw,relatime 0 0
devtmpfs /dev devtmpfs rw,seclabel,relatime,size=951024k,nr_inodes=237756,mode=755 0 0
/proc/bus/usb /proc/bus/usb usbfs rw,relatime 0 0
/dev/sda1 /boot ext4 rw,seclabel,relatime,barrier=1,data=ordered 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
cgroup /cgroup/cpuset cgroup rw,relatime,cpuset 0 0
cgroup /cgroup/cpu cgroup rw,relatime,cpu 0 0
cgroup /cgroup/cpuacct cgroup rw,relatime,cpuacct 0 0
cgroup /cgroup/memory cgroup rw,relatime,memory 0 0
cgroup /cgroup/devices cgroup rw,relatime,devices 0 0
cgroup /cgroup/freezer cgroup rw,relatime,freezer 0 0
cgroup /cgroup/net_cls cgroup rw,relatime,net_cls 0 0
cgroup /cgroup/blkio cgroup rw,relatime,blkio 0 0
/etc/auto.misc /misc autofs rw,relatime,fd=7,pgrp=1314,timeout=300,minproto=5,maxproto=5,indirect 0 0
-hosts /net autofs rw,relatime,fd=13,pgrp=1314,timeout=300,minproto=5,maxproto=5,indirect 0 0
---

looking at the patch for iproute netns support

---
+static int netns_delete(int argc, char **argv)
+{
+        const char *name;
+        char netns_path[MAXPATHLEN];
+
+        if (argc < 1) {
+                fprintf(stderr, "No netns name specified\n");
+                return -1;
+        }
+
+        name = argv[0];
+        snprintf(netns_path, sizeof(netns_path), "%s/%s", NETNS_RUN_DIR, name);
+        umount2(netns_path, MNT_DETACH);
+        if (unlink(netns_path) < 0) {
+                fprintf(stderr, "Cannot remove %s: %s\n",
+                        netns_path, strerror(errno));
+                return -1;
+        }
+        return 0;
+}
---

I really think that umount2 call should check its return value before it goes and does the unlink

--- Additional comment from Ian Wienand on 2014-05-07 01:10:09 EDT ---

I see upstream is the same, so maybe this highlights an invalid assumption that the rhel kernel invalidates [1]? I couldn't find a lot of discussion; [2] was the only relevant thread

[1] http://git.kernel.org/cgit/linux/kernel/git/shemminger/iproute2.git/tree/ip/ipnetns.c#n377
[2] http://marc.info/?l=linux-netdev&m=137962865905031&w=2

--- Additional comment from Petr Šabata on 2014-05-12 19:42:30 EDT ---

Hmm, what build are you using? There was a recent fix related to netns for RHOS4 in bug #1062685. Could that possibly resolve your issue too?

--- Additional comment from Ian Wienand on 2014-05-13 15:33:06 EDT ---

Sorry, forgot to paste the route version

---
[root@host03 ~]# rpm -qa | grep iproute
iproute-2.6.32-130.el6ost.netns.3.x86_64
---

so we should have the fix for bug#1062685.

I guess the question is if the behaviour in comment#2 is a bug or a feature...

--- Additional comment from Pavel Šimerda (pavlix) on 2015-04-20 09:15:35 EDT ---

(In reply to Ian Wienand from comment #5)
> Sorry, forgot to paste the route version
>
> ---
> [root@host03 ~]# rpm -qa | grep iproute
> iproute-2.6.32-130.el6ost.netns.3.x86_64
> ---
>
> so we should have the fix for bug#1062685.

So what's the current status of the bug report from your point of view?

> I guess the question is if the behaviour in comment#2 is a bug or a
> feature...

I don't have details on desired behavior of busy network namespaces but I don't think iproute can affect that.
If you're pursuing a fix/change in kernel behavior, please switch the bug to kernel or start a new bug report.
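
For reference, the ordering proposed in comment #2 (check the result of umount2 and only then unlink) can be sketched as below. This is a Python illustration, not a patch to iproute's C code; the names libc_umount2 and netns_delete_checked and the injectable umount2/unlink parameters are hypothetical. Treating EINVAL from umount2 as "not a mount point, carry on" is an assumption of this sketch. Note that this report does not establish that the check would have prevented the failure: in the reproduction the entry did vanish from /proc/mounts, and it was the unlink that reported "Device or resource busy".

```python
import ctypes
import ctypes.util
import errno
import os

MNT_DETACH = 2  # lazy-unmount flag from <sys/mount.h>


def libc_umount2(path, flags):
    """Call umount2(2) through libc; raise OSError on failure."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    if libc.umount2(path.encode(), flags) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), path)


def netns_delete_checked(path, umount2=libc_umount2, unlink=os.unlink):
    """Detach the namespace mount, then unlink only if the detach worked."""
    try:
        umount2(path, MNT_DETACH)
    except OSError as exc:
        # Assumption of this sketch: EINVAL means the path was not a
        # mount point, so there is nothing to detach and the leftover
        # file may still be removed. Any other error stops here.
        if exc.errno != errno.EINVAL:
            raise
    unlink(path)
```

The umount2 and unlink parameters exist only so the ordering can be exercised without root; called with the defaults, the function needs CAP_SYS_ADMIN just as ip netns del does.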
The original ticket's reporter claimed he can't reproduce the issue anymore, so closing this one as well.
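
Should the symptom come back, the state reproduced in comment #2 (a name left under /var/run/netns that no longer appears in /proc/mounts) can be looked for with a short script. A rough Python sketch, assuming that a healthy named namespace shows up as a mount point in /proc/mounts, as the reproduction implies; the function names and the sample mount line are made up for illustration:

```python
import os


def mount_points(mounts_text):
    """Return the set of mount points listed in /proc/mounts-style text."""
    points = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            points.add(fields[1])  # second field is the mount point
    return points


def find_stale_netns(names, mounts_text, run_dir="/var/run/netns"):
    """Return the names whose file under run_dir is not a mount point."""
    mounted = mount_points(mounts_text)
    return [n for n in names if os.path.join(run_dir, n) not in mounted]


# Made-up sample: "good" is still mounted, "testing" is only a leftover file.
sample = (
    "proc /proc proc rw,relatime 0 0\n"
    "proc /var/run/netns/good proc rw,relatime 0 0\n"
)
print(find_stale_netns(["good", "testing"], sample))
```

On a real host the inputs would be os.listdir("/var/run/netns") and the contents of /proc/mounts. Where /var/run is a symlink to /run, run_dir would need to be "/run/netns" to match the paths in the mount table.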