Bug 1320578

Summary: "seting the network namespace failed: Invalid argument" from ip netns commands
Product: Red Hat Enterprise Linux 7 Reporter: Rashid Khan <rkhan>
Component: iproute Assignee: Phil Sutter <psutter>
Status: CLOSED WORKSFORME QA Contact: BaseOS QE Security Team <qe-baseos-security>
Severity: medium Docs Contact:
Priority: medium    
Version: 7.3 CC: aortega, atragler, iwienand, kzhang, lhh, lwang, majopela, oblaut, psutter, rhos-maint, rkhan, sclewis, wfoster, yeylon
Target Milestone: rc Keywords: ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1095015 Environment:
Last Closed: 2016-03-31 12:33:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On: 1095015    
Bug Blocks:    

Description Rashid Khan 2016-03-23 14:27:26 UTC
+++ This bug was initially created as a clone of Bug #1095015 +++

Description of problem:

Neutron started to misbehave quite badly today in oslab. One of the symptoms was a seeming inability to work with network namespaces; for example, the l3-agent.log was filling up with things like

---
2014-05-06 21:17:44.387 18955 ERROR neutron.agent.l3_agent [-] Failed synchronizing routers
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent Traceback (most recent call last):
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent   File "/usr/lib/python2.6/site-packages/neutron/agent/l3_agent.py", line 765, in _sync_routers_task
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent     self._process_routers(routers, all_routers=True)
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent   File "/usr/lib/python2.6/site-packages/neutron/agent/l3_agent.py", line 708, in _process_routers
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent     self._router_added(r['id'], r)
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent   File "/usr/lib/python2.6/site-packages/neutron/agent/l3_agent.py", line 337, in _router_added
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent     self._create_router_namespace(ri)
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent   File "/usr/lib/python2.6/site-packages/neutron/agent/l3_agent.py", line 307, in _create_router_namespace
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent     ip_wrapper.netns.execute(['sysctl', '-w', 'net.ipv4.ip_forward=1'])
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent   File "/usr/lib/python2.6/site-packages/neutron/agent/linux/ip_lib.py", line 467, in execute
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent     check_exit_code=check_exit_code)
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent   File "/usr/lib/python2.6/site-packages/neutron/agent/linux/utils.py", line 75, in execute
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent     raise RuntimeError(m)
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent RuntimeError: 
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent Command: ['sudo', 'ip', 'netns', 'exec', 'qrouter-857330af-e8c1-4cc5-b900-086596210244', 'sysctl', '-w', 'net.ipv4.ip_forward=1']
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent Exit code: 255
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent Stdout: ''
2014-05-06 21:17:44.387 18955 TRACE neutron.agent.l3_agent Stderr: 'seting the network namespace failed: Invalid argument\n'
---

Looking a little closer at the namespace for one network: it exists but can't be used

---
[root@host03 log]# ip netns list | grep dhcp-20803109-5bea-4638-8b02-bcdbace3f0b2 
qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2

[root@host03 log]#  /sbin/ip netns exec qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2 ip -o link show 
seting the network namespace failed: Invalid argument
---

strace of this shows

---
[root@host03 log]# strace  /sbin/ip netns exec qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2 ip -o link show 
execve("/sbin/ip", ["/sbin/ip", "netns", "exec", "qdhcp-20803109-5bea-4638-8b02-bc"..., "ip", "-o", "link", "show"], [/* 25 vars */]) = 0
open("/var/run/netns/qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2", O_RDONLY) = 4
syscall_308(0x4, 0x40000000, 0xffffffffffffffff, 0, 0x622d323062382d38, 0x7fff2bbc1771, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de, 0x42b7de) = -1 (errno 22)
write(2, "seting the network namespace fai"..., 54seting the network namespace failed: Invalid argument
) = 54
exit_group(-1)                          = ?
---

Unfortunately strace in RHEL isn't built against the OpenStack kernel, so it doesn't know about the setns call (it shows up as syscall_308), but we can see it's calling setns() on fd 4, which was opened for "/var/run/netns/qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2", with CLONE_NEWNET (0x40000000), which seems about right.
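
For reference, the raw strace line can be decoded by hand: on x86_64, syscall 308 is setns, and the second argument 0x40000000 is the CLONE_NEWNET flag. The following is only a minimal sketch of that decoding. The helper names `nstype_name` and `describe_setns_call` are made up for illustration and are not part of strace or iproute.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Flag values as found in <linux/sched.h>; repeated here so the sketch
 * builds without depending on feature-test macros. */
#ifndef CLONE_NEWUTS
#define CLONE_NEWUTS 0x04000000
#endif
#ifndef CLONE_NEWIPC
#define CLONE_NEWIPC 0x08000000
#endif
#ifndef CLONE_NEWNET
#define CLONE_NEWNET 0x40000000
#endif

/* Hypothetical helper: map a setns() nstype argument to a readable name. */
static const char *nstype_name(unsigned long nstype)
{
        switch (nstype) {
        case 0:            return "any";
        case CLONE_NEWUTS: return "CLONE_NEWUTS";
        case CLONE_NEWIPC: return "CLONE_NEWIPC";
        case CLONE_NEWNET: return "CLONE_NEWNET";
        default:           return "unknown";
        }
}

/* Hypothetical helper: render the first two syscall_308 arguments
 * the way a setns-aware strace might show them. */
static void describe_setns_call(char *buf, size_t len, int fd,
                                unsigned long nstype)
{
        assert(buf != NULL && len > 0);
        snprintf(buf, len, "setns(%d, %s)", fd, nstype_name(nstype));
}
```

Feeding it the first two arguments from the trace above (0x4 and 0x40000000) renders the call as a setns on fd 4 with CLONE_NEWNET. The remaining arguments printed by strace are just register contents, since strace did not know how many arguments the call takes.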

EINVAL looks like it could come from two places: proc_ns_fget() or the type check in setns()

---
struct file *proc_ns_fget(int fd)
{
        struct file *file;

        file = fget(fd);
        if (!file)
                return ERR_PTR(-EBADF);

        if (file->f_op != &ns_file_operations)
                goto out_invalid;

        return file;

out_invalid:
        fput(file);
        return ERR_PTR(-EINVAL);
}

...

SYSCALL_DEFINE2(setns, int, fd, int, nstype)
{
        const struct proc_ns_operations *ops;
        struct task_struct *tsk = current;
        struct nsproxy *new_nsproxy;
        struct proc_inode *ei;
        struct file *file;
        int err;

        if (!capable(CAP_SYS_ADMIN))
                return -EPERM;

        file = proc_ns_fget(fd);
        if (IS_ERR(file))
                return PTR_ERR(file);

        err = -EINVAL;
        ei = PROC_I(file->f_dentry->d_inode);
        ops = ei->ns_ops;
        if (nstype && (ops->type != nstype))
                goto out;

...
}
---

At the time of the problem, the file had no permission bits set

---
[root@host03 netns]# ls -l /var/run/netns/qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2
----------. 1 root root 0 Mar 27 03:03 /var/run/netns/qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2
---

After the system was rebooted, it came back to 444

---
[root@host03 netns]# ls -l /var/run/netns/qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2
-r--r--r--. 1 root root 0 May  6 21:25 /var/run/netns/qdhcp-20803109-5bea-4638-8b02-bcdbace3f0b2
---
Does this suggest the namespace somehow became detached?

wfoster and I went through the logs pretty carefully; there is nothing to suggest what would have changed the permissions, or how. The only network-related thing in dmesg (which unfortunately isn't timestamped on this system) was

---
lo: Disabled Privacy Extensions
__ratelimit: 376 callbacks suppressed
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
tapa3f71897-7b: no IPv6 routers present
qr-7dbfe943-f4: no IPv6 routers present
tap6b31d415-c7: no IPv6 routers present
device qg-56760887-0c entered promiscuous mode
__ratelimit: 275 callbacks suppressed
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
Neighbour table overflow.
tap312416e7-51: no IPv6 routers present
qg-56760887-0c: no IPv6 routers present
---

versions:

---
[root@host03 netns]# rpm -qa | grep neutron
openstack-neutron-2013.2.3-7.el6ost.noarch
python-neutron-2013.2.3-7.el6ost.noarch
openstack-neutron-openvswitch-2013.2.3-7.el6ost.noarch
python-neutronclient-2.3.4-1.el6ost.noarch

[root@host03 netns]# rpm -qa | grep kernel
kernel-headers-2.6.32-431.11.2.el6.x86_64
kernel-2.6.32-431.11.2.el6.x86_64
libreport-plugin-kerneloops-2.0.9-19.el6.x86_64
dracut-kernel-004-336.el6_5.2.noarch
kernel-2.6.32-431.3.1.el6.x86_64
kernel-2.6.32-431.5.1.el6.x86_64
abrt-addon-kerneloops-2.0.8-21.el6.x86_64
kernel-firmware-2.6.32-431.11.2.el6.noarch
kernel-devel-2.6.32-431.11.2.el6.x86_64
---

--- Additional comment from Ian Wienand on 2014-05-06 20:37:33 EDT ---

Another data point from the neutron side ... all of the TAP devices on the neutron server had disappeared (not showing in ip link). However, all of the dnsmasq processes for the various networks were still running and listening on these non-existent tap devices.

--- Additional comment from Ian Wienand on 2014-05-07 00:53:07 EDT ---

OK, I managed to recreate this problem by deleting a namespace that still has a process attached

e.g., in one window do

---
[root@rhel ~]# ip netns add testing
[root@rhel ~]# ls -l /var/run/netns/testing 
-r--r--r--. 1 root root 0 May  7 14:44 /var/run/netns/testing
[root@rhel ~]# ip netns exec testing bash -c "while [ 1 ]; do echo "hi"; sleep 5; done"
---

in another window, remove it

---
[root@rhel ~]# ip netns del testing
Cannot remove /var/run/netns/testing: Device or resource busy
[root@rhel ~]# ls -l /var/run/netns/testing 
----------. 1 root root 0 May  7 14:44 /var/run/netns/testing
---

note it has disappeared from /proc/mounts

---
[root@rhel ~]# cat /proc/mounts 
rootfs / rootfs rw 0 0
proc /proc proc rw,relatime 0 0
sysfs /sys sysfs rw,seclabel,relatime 0 0
devtmpfs /dev devtmpfs rw,seclabel,relatime,size=951024k,nr_inodes=237756,mode=755 0 0
devpts /dev/pts devpts rw,seclabel,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /dev/shm tmpfs rw,seclabel,relatime 0 0
/dev/mapper/vg_rhel-lv_root / ext4 rw,seclabel,relatime,barrier=1,data=ordered 0 0
none /selinux selinuxfs rw,relatime 0 0
devtmpfs /dev devtmpfs rw,seclabel,relatime,size=951024k,nr_inodes=237756,mode=755 0 0
/proc/bus/usb /proc/bus/usb usbfs rw,relatime 0 0
/dev/sda1 /boot ext4 rw,seclabel,relatime,barrier=1,data=ordered 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
cgroup /cgroup/cpuset cgroup rw,relatime,cpuset 0 0
cgroup /cgroup/cpu cgroup rw,relatime,cpu 0 0
cgroup /cgroup/cpuacct cgroup rw,relatime,cpuacct 0 0
cgroup /cgroup/memory cgroup rw,relatime,memory 0 0
cgroup /cgroup/devices cgroup rw,relatime,devices 0 0
cgroup /cgroup/freezer cgroup rw,relatime,freezer 0 0
cgroup /cgroup/net_cls cgroup rw,relatime,net_cls 0 0
cgroup /cgroup/blkio cgroup rw,relatime,blkio 0 0
/etc/auto.misc /misc autofs rw,relatime,fd=7,pgrp=1314,timeout=300,minproto=5,maxproto=5,indirect 0 0
-hosts /net autofs rw,relatime,fd=13,pgrp=1314,timeout=300,minproto=5,maxproto=5,indirect 0 0
---
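
The same check can be done programmatically rather than by eyeballing the output. Below is a small sketch using the glibc mntent API. The function name `path_in_mount_table` is made up for illustration, and the mount table path is a parameter only so that the helper can be pointed at a copy of the table.

```c
#include <assert.h>
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if mount_point is listed in the given
 * mount table (e.g. "/proc/mounts"), 0 if it is not, and -1 if the
 * table cannot be opened. */
static int path_in_mount_table(const char *table, const char *mount_point)
{
        struct mntent *ent;
        FILE *fp;
        int found = 0;

        assert(table != NULL && mount_point != NULL);
        fp = setmntent(table, "r");
        if (!fp)
                return -1;
        while ((ent = getmntent(fp)) != NULL) {
                if (strcmp(ent->mnt_dir, mount_point) == 0) {
                        found = 1;
                        break;
                }
        }
        endmntent(fp);
        return found;
}
```

For the state shown above, calling it with "/proc/mounts" and "/var/run/netns/testing" would be expected to return 0 even though the file itself still exists.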

looking at the patch for iproute netns support


---
+static int netns_delete(int argc, char **argv)
+{
+       const char *name;
+       char netns_path[MAXPATHLEN];
+
+       if (argc < 1) {
+               fprintf(stderr, "No netns name specified\n");
+               return -1;
+       }
+
+       name = argv[0];
+       snprintf(netns_path, sizeof(netns_path), "%s/%s", NETNS_RUN_DIR, name);
+       umount2(netns_path, MNT_DETACH);
+       if (unlink(netns_path) < 0) {
+               fprintf(stderr, "Cannot remove %s: %s\n",
+                       netns_path, strerror(errno));
+               return -1;
+       }
+       return 0;
+}
---

I really think that umount2 call should check its return value before it goes and does the unlink

--- Additional comment from Ian Wienand on 2014-05-07 01:10:09 EDT ---

I see upstream is the same, so maybe this highlights an assumption in the upstream code [1] that the RHEL kernel invalidates?

I couldn't find a lot of discussion, [2] was the only relevant thread

[1] http://git.kernel.org/cgit/linux/kernel/git/shemminger/iproute2.git/tree/ip/ipnetns.c#n377
[2] http://marc.info/?l=linux-netdev&m=137962865905031&w=2

--- Additional comment from Petr Šabata on 2014-05-12 19:42:30 EDT ---

Hmm, what build are you using?

There was a recent fix related to netns for RHOS4 in bug #1062685.  Could that possibly resolve your issue too?

--- Additional comment from Ian Wienand on 2014-05-13 15:33:06 EDT ---

Sorry, forgot to paste the iproute version

---
[root@host03 ~]# rpm -qa | grep iproute
iproute-2.6.32-130.el6ost.netns.3.x86_64
---

so we should have the fix for bug#1062685.

I guess the question is if the behaviour in comment#2 is a bug or a feature...

--- Additional comment from Pavel Šimerda (pavlix) on 2015-04-20 09:15:35 EDT ---

(In reply to Ian Wienand from comment #5)
> Sorry, forgot to paste the iproute version
> 
> ---
> [root@host03 ~]# rpm -qa | grep iproute
> iproute-2.6.32-130.el6ost.netns.3.x86_64
> ---
> 
> so we should have the fix for bug#1062685.

So what's the current status of the bug report from your point of view?

> I guess the question is if the behaviour in comment#2 is a bug or a
> feature...

I don't have details on desired behavior of busy network namespaces but I don't think iproute can affect that. If you're pursuing a fix/change in kernel behavior, please switch the bug to kernel or start a new bug report.

Comment 2 Phil Sutter 2016-03-31 12:33:23 UTC
The original ticket's reporter claimed he can't reproduce the issue anymore, so closing this one as well.