Description of problem:

Nested bind mounts start to trigger a kernel soft lockup and eventually a deadlock at around 20 iterations, during either mount or umount:

# for i in `seq 1 20`; do mount -o bind /root/ /mnt/; done
# for i in `seq 1 20`; do umount /mnt/; done

[  361.301885] NMI backtrace for cpu 0
[  361.302352] CPU: 0 PID: 29 Comm: kworker/0:1 Not tainted 3.10.0-327.10.1.el7.x86_64 #1
[  361.303062] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
[  361.303882] Workqueue: events qxl_fb_work [qxl]
[  361.304379] task: ffff88013943d080 ti: ffff8801395ec000 task.ti: ffff8801395ec000
[  361.305057] RIP: 0010:[<ffffffff811e0b33>]  [<ffffffff811e0b33>] prune_super+0x23/0x170
[  361.305784] RSP: 0000:ffff8801395ef5c0  EFLAGS: 00000206
[  361.306318] RAX: 0000000000000080 RBX: ffff8801394243b0 RCX: 0000000000000000
[  361.306973] RDX: 0000000000000000 RSI: ffff8801395ef710 RDI: ffff8801394243b0
[  361.307628] RBP: ffff8801395ef5e8 R08: 0000000000000000 R09: 0000000000000040
[  361.308282] R10: 0000000000000000 R11: 0000000000000220 R12: ffff8801395ef710
[  361.308939] R13: ffff880139424000 R14: ffff8801395ef710 R15: 0000000000000000
[  361.309599] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
[  361.310320] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  361.310893] CR2: 00007fea0c55dc3d CR3: 00000000b770d000 CR4: 00000000003406f0
[  361.311561] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  361.312227] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  361.312894] Stack:
[  361.313223]  0000000000000400 ffff8801395ef710 ffff8801394243b0 0000000000000258
[  361.313950]  0000000000000000 ffff8801395ef688 ffffffff8117c46b 0000000000000000
[  361.314858]  ffff8801395ef630 ffffffff811d5e21 ffff8801395ef720 0000000000000036
[  361.315615] Call Trace:
[  361.315995]  [<ffffffff8117c46b>] shrink_slab+0xab/0x300
[  361.316559]  [<ffffffff811d5e21>] ? vmpressure+0x21/0x90
[  361.317114]  [<ffffffff8117f6a2>] do_try_to_free_pages+0x3c2/0x4e0
[  361.317727]  [<ffffffff8117f8bc>] try_to_free_pages+0xfc/0x180
[  361.318309]  [<ffffffff811735bd>] __alloc_pages_nodemask+0x7fd/0xb90
[  361.318927]  [<ffffffff811b4429>] alloc_pages_current+0xa9/0x170
[  361.319513]  [<ffffffff811be9ec>] new_slab+0x2ec/0x300
[  361.320045]  [<ffffffff8163220f>] __slab_alloc+0x315/0x48f
[  361.320596]  [<ffffffff811e064c>] ? get_empty_filp+0x5c/0x1a0
[  361.321161]  [<ffffffff811c0fb3>] kmem_cache_alloc+0x193/0x1d0
[  361.321735]  [<ffffffff811e064c>] ? get_empty_filp+0x5c/0x1a0
[  361.322295]  [<ffffffff811e064c>] get_empty_filp+0x5c/0x1a0
[  361.322845]  [<ffffffff811e07ae>] alloc_file+0x1e/0xf0
[  361.323362]  [<ffffffff81182773>] __shmem_file_setup+0x113/0x1f0
[  361.323940]  [<ffffffff81182860>] shmem_file_setup+0x10/0x20
[  361.324496]  [<ffffffffa039f5ab>] drm_gem_object_init+0x2b/0x40 [drm]
[  361.325103]  [<ffffffffa0422c3d>] qxl_bo_create+0x7d/0x190 [qxl]
[  361.325680]  [<ffffffffa042798c>] ? qxl_release_list_add+0x5c/0xc0 [qxl]
[  361.326299]  [<ffffffffa0424066>] qxl_alloc_bo_reserved+0x46/0xb0 [qxl]
[  361.326912]  [<ffffffffa0424fde>] qxl_image_alloc_objects+0xae/0x140 [qxl]
[  361.327544]  [<ffffffffa042556e>] qxl_draw_opaque_fb+0xce/0x3c0 [qxl]
[  361.328145]  [<ffffffffa0421ee2>] qxl_fb_dirty_flush+0x1a2/0x260 [qxl]
[  361.328754]  [<ffffffffa0421fb9>] qxl_fb_work+0x19/0x20 [qxl]
[  361.329306]  [<ffffffff8109d5db>] process_one_work+0x17b/0x470
[  361.329865]  [<ffffffff8109e3ab>] worker_thread+0x11b/0x400
[  361.330393]  [<ffffffff8109e290>] ? rescuer_thread+0x400/0x400
[  361.330940]  [<ffffffff810a5acf>] kthread+0xcf/0xe0
[  361.331419]  [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
[  361.332005]  [<ffffffff81645998>] ret_from_fork+0x58/0x90
[  361.332513]  [<ffffffff810a5a00>] ? kthread_create_on_node+0x140/0x140
[  361.333093] Code: 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 49 89 f6 41 55 4c 8d af 50 fc ff ff 41 54 53 4c 8b 46 08 48 89 fb <4d> 85 c0 74 09 f6 06 80 0f 84 2f 01 00 00 48 8b 83 80 fc ff ff
[  361.335906] Kernel panic - not syncing: hung_task: blocked tasks

Version-Release number of selected component (if applicable):
kernel-3.10.0-327.10.1.el7.x86_64

How reproducible:
always
This test is insane. Are you aware that the mount table expands by a power of two with each mount command? For example, on a live system, your test results in a mount table with over 1 million entries:

# for i in `seq 1 20`; do mount -o bind /root/ /mnt/; done
# mount | wc -l
1048606
#

If I unmount them all and then run each command manually, you can see the progression:

# mount | wc -l
31
# mount -o bind /root/ /mnt/
# mount | wc -l
32
# mount -o bind /root/ /mnt/
# mount | wc -l
34
# mount -o bind /root/ /mnt/
# mount | wc -l
38
# mount -o bind /root/ /mnt/
# mount | wc -l
46
# mount -o bind /root/ /mnt/
# mount | wc -l
62
#
# mount | tail -31
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /mnt type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
/dev/mapper/rhel_dell--prt5600--01-root on /root type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
#

In the supplied vmcore, the mount table contains over 4 million entries (actually 4194334), which means the mount command was actually run 22
times. For example, if I continue running the mount command on my live system from the "seq 1 20" point shown above:

# mount -o bind /root/ /mnt/
# mount | wc -l
2097182
# mount -o bind /root/ /mnt/
# mount | wc -l
4194334
#

Here is the beginning of the 4 million mount table entries in the vmcore:

crash> mount
     MOUNT           SUPERBLK     TYPE        DEVNAME      DIRNAME
ffff88013b026100 ffff880139a68800 rootfs      rootfs       /
ffff88013568ac00 ffff880132dbb000 sysfs       sysfs        /sys
ffff88013568a600 ffff880139a6b000 proc        proc         /proc
ffff88013568a400 ffff880139420000 devtmpfs    devtmpfs     /dev
ffff88013568a300 ffff880132eb0000 securityfs  securityfs   /sys/kernel/security
ffff880136aec100 ffff880132dbb800 tmpfs       tmpfs        /dev/shm
ffff880136aec200 ffff880139423800 devpts      devpts       /dev/pts
ffff880136aec300 ffff880132dbc000 tmpfs       tmpfs        /run
ffff880136aec400 ffff880132dbc800 tmpfs       tmpfs        /sys/fs/cgroup
ffff880136aec500 ffff880132dbd000 cgroup      cgroup       /sys/fs/cgroup/systemd
ffff880136aec600 ffff880132dbd800 pstore      pstore       /sys/fs/pstore
ffff880132e5c700 ffff880132dbe800 cgroup      cgroup       /sys/fs/cgroup/cpu,cpuacct
ffff880132e5c800 ffff880132dbe000 cgroup      cgroup       /sys/fs/cgroup/perf_event
ffff880132e5c900 ffff880132dbf000 cgroup      cgroup       /sys/fs/cgroup/freezer
ffff880132e5ca00 ffff880132dbf800 cgroup      cgroup       /sys/fs/cgroup/memory
ffff880132e5cb00 ffff880132e20000 cgroup      cgroup       /sys/fs/cgroup/hugetlb
ffff880132e5cc00 ffff880132e20800 cgroup      cgroup       /sys/fs/cgroup/devices
ffff880132e5cd00 ffff880132e21000 cgroup      cgroup       /sys/fs/cgroup/net_cls
ffff880132e5ce00 ffff880132e21800 cgroup      cgroup       /sys/fs/cgroup/blkio
ffff880132e5cf00 ffff880132e22000 cgroup      cgroup       /sys/fs/cgroup/cpuset
ffff88003544ed00 ffff88003565d800 configfs    configfs     /sys/kernel/config
ffff8801356af900 ffff880034f0b000 xfs         /dev/sda1    /
ffff88003542c800 ffff88003565a800 rpc_pipefs  rpc_pipefs   /var/lib/nfs/rpc_pipefs
ffff8800357a1d00 ffff880139425000 selinuxfs   selinuxfs    /sys/fs/selinux
ffff880034b0a000 ffff880034848800 autofs      systemd-1    /proc/sys/fs/binfmt_misc
ffff8800356d6100 ffff880139424800 mqueue      mqueue       /dev/mqueue
ffff8800354f5d00 ffff880034849000 hugetlbfs   hugetlbfs    /dev/hugepages
ffff8801397a0f00 ffff880139a6e000 debugfs     debugfs      /sys/kernel/debug
ffff880132f62700 ffff880034f0c800 nfsd        nfsd         /proc/fs/nfsd
ffff8800b9c3ac00 ffff880034f0b000 xfs         /dev/sda1    /var/lib/docker/devicemapper
ffff8800356d8100 ffff8800b1168800 tmpfs       tmpfs        /run/user/1000
ffff8800348dc600 ffff880034f0b000 xfs         /dev/sda1    /tmp
ffff8800bb6a3d00 ffff880034f0b000 xfs         /dev/sda1    /tmp/root
ffff8800bb6a3800 ffff880034f0b000 xfs         /dev/sda1    /root
ffff8800a7c6fe00 ffff880034f0b000 xfs         /dev/sda1    /tmp/root
ffff8800354f5b00 ffff880034f0b000 xfs         /dev/sda1    /root
ffff8800354f5600 ffff880034f0b000 xfs         /dev/sda1    /tmp/root
ffff8800b74a9200 ffff880034f0b000 xfs         /dev/sda1    /root
ffff880034b13800 ffff880034f0b000 xfs         /dev/sda1    /tmp/root
ffff880034b13400 ffff880034f0b000 xfs         /dev/sda1    /root
ffff880034b13500 ffff880034f0b000 xfs         /dev/sda1    /tmp/root
ffff8800349a3b00 ffff880034f0b000 xfs         /dev/sda1    /root
ffff8800349a3800 ffff880034f0b000 xfs         /dev/sda1    /tmp/root
ffff8800349a3400 ffff880034f0b000 xfs         /dev/sda1    /root
ffff8800b74c6c00 ffff880034f0b000 xfs         /dev/sda1    /tmp/root
ffff8800b74c6100 ffff880034f0b000 xfs         /dev/sda1    /root
ffff88013573e000 ffff880034f0b000 xfs         /dev/sda1    /tmp/root
ffff88013573e100 ffff880034f0b000 xfs         /dev/sda1    /root
ffff88013573e300 ffff880034f0b000 xfs         /dev/sda1    /tmp/root
ffff8800b9dfb800 ffff880034f0b000 xfs         /dev/sda1    /root
ffff8800b9dfb000 ffff880034f0b000 xfs         /dev/sda1    /tmp/root
ffff8800b9dfb600 ffff880034f0b000 xfs         /dev/sda1    /root
ffff8800b9dfbd00 ffff880034f0b000 xfs         /dev/sda1    /tmp/root
ffff8800b9dfbe00 ffff880034f0b000 xfs         /dev/sda1    /root
ffff8800b9dfb300 ffff880034f0b000 xfs         /dev/sda1    /tmp/root
ffff8800354d1500 ffff880034f0b000 xfs         /dev/sda1    /root
ffff8800354d1700 ffff880034f0b000 xfs         /dev/sda1    /tmp/root
...

The test system has 2 cpus and ~4GB of memory.
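The doubling is a direct consequence of shared mount propagation: each bind of /root onto /mnt also propagates into every existing peer copy, so each mount command adds 2^(i-1) new entries. A quick arithmetic sanity check of the counts above (a sketch; `base=31` is the entry count before the first bind, taken from the session shown):

```shell
# With shared propagation, mount command i adds 2^(i-1) entries,
# so after n binds the table holds base + 2^n - 1 entries.
base=31
count() { echo $(( base + (1 << $1) - 1 )); }

count 1    # 32       -- matches the manual progression above
count 5    # 62
count 20   # 1048606  -- the "over 1 million entries" after seq 1 20
count 22   # 4194334  -- the vmcore's table, i.e. 22 mount commands
```

This is also how we can tell from the vmcore alone that the mount command ran exactly 22 times.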
The mounts cause the slab cache to balloon until it consumes 86% of memory:

crash> kmem -i
                 PAGES        TOTAL      PERCENTAGE
    TOTAL MEM   970581       3.7 GB         ----
         FREE    21617      84.4 MB    2% of TOTAL MEM
         USED   948964       3.6 GB   97% of TOTAL MEM
       SHARED      705       2.8 MB    0% of TOTAL MEM
      BUFFERS        0            0    0% of TOTAL MEM
       CACHED     4201      16.4 MB    0% of TOTAL MEM
         SLAB   841799       3.2 GB   86% of TOTAL MEM

   TOTAL SWAP        0            0         ----
    SWAP USED        0            0  100% of TOTAL SWAP
    SWAP FREE        0            0    0% of TOTAL SWAP

 COMMIT LIMIT   485290       1.9 GB         ----
    COMMITTED    97824     382.1 MB   20% of TOTAL LIMIT
crash>

And the system is trying desperately to free memory: the runnable tasks on the system are trying to allocate memory, 15 of which have ended up in shrink_slab() and called cond_resched() because they cannot free any memory:

crash> foreach RU bt | grep -e PID -e shrink_slab
PID: 0      TASK: ffffffff81951440  CPU: 0   COMMAND: "swapper/0"
PID: 0      TASK: ffff880139b62280  CPU: 1   COMMAND: "swapper/1"
PID: 29     TASK: ffff88013943d080  CPU: 0   COMMAND: "kworker/0:1"
 #5 [ffff8801395ef5f0] shrink_slab at ffffffff8117c46b
PID: 34     TASK: ffff880132ce0b80  CPU: 1   COMMAND: "khungtaskd"
PID: 35     TASK: ffff880132ce1700  CPU: 1   COMMAND: "kswapd0"
 #5 [ffff880132cebca8] shrink_slab at ffffffff8117c46b
PID: 323    TASK: ffff8800356e1700  CPU: 1   COMMAND: "kworker/1:2"
 #5 [ffff88013283fa48] shrink_slab at ffffffff8117c46b
PID: 414    TASK: ffff88003494ae00  CPU: 1   COMMAND: "systemd-journal"
 #3 [ffff880135bb3980] shrink_slab at ffffffff8117c59c
PID: 476    TASK: ffff8800bb602e00  CPU: 0   COMMAND: "auditd"
 #3 [ffff8800bb64f980] shrink_slab at ffffffff8117c59c
PID: 563    TASK: ffff8800bb607300  CPU: 0   COMMAND: "systemd-logind"
 #3 [ffff8800b9c9f980] shrink_slab at ffffffff8117c59c
PID: 567    TASK: ffff880132ebae00  CPU: 0   COMMAND: "NetworkManager"
 #3 [ffff8800b9d2b980] shrink_slab at ffffffff8117c59c
PID: 579    TASK: ffff880138205c00  CPU: 1   COMMAND: "gssproxy"
 #3 [ffff8800b9ca3980] shrink_slab at ffffffff8117c59c
PID: 581    TASK: ffff880138205080  CPU: 0   COMMAND: "irqbalance"
 #3 [ffff8801397bf980] shrink_slab at ffffffff8117c59c
PID: 610    TASK: ffff880132f35080  CPU: 1   COMMAND: "crond"
 #3 [ffff8800b9def980] shrink_slab at ffffffff8117c59c
PID: 1135   TASK: ffff8800b9e32e00  CPU: 1   COMMAND: "tuned"
 #3 [ffff880134593980] shrink_slab at ffffffff8117c59c
PID: 1495   TASK: ffff8800b760c500  CPU: 1   COMMAND: "master"
 #3 [ffff88003497f980] shrink_slab at ffffffff8117c59c
PID: 1511   TASK: ffff8800bb5b2280  CPU: 0   COMMAND: "pickup"
PID: 1512   TASK: ffff8800bb5b0b80  CPU: 1   COMMAND: "qmgr"
 #3 [ffff880034c07980] shrink_slab at ffffffff8117c59c
PID: 2041   TASK: ffff88013544a280  CPU: 1   COMMAND: "docker"
PID: 2069   TASK: ffff8800b771dc00  CPU: 0   COMMAND: "docker"
 #3 [ffff8801352c3980] shrink_slab at ffffffff8117c59c
PID: 2217   TASK: ffff8800b7609700  CPU: 0   COMMAND: "sshd"
 #3 [ffff8800b76d3980] shrink_slab at ffffffff8117c59c
crash>

There is no "fix" for a root user who is abusing the system in this manner. There is no way the system can recover, because there is no memory left to be reclaimed or allocated. And, as expected, the same thing happens on an upstream kernel, so it is not even a "rhel-7.3" issue. In my opinion, this BZ should be CLOSED/NOTABUG.
hello. vladis from prodsec is here.

1) testing this potential flaw from a security point of view for rhel-7: the mount table indeed expands by a power of two with each mount command when run by root:

# uname -r
3.10.0-327.18.2.el7.x86_64
# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-327.18.2.el7.x86_64 ... namespace.enable=1
# cat /proc/mounts | wc -l
29
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
30
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
32
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
36
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
44
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
60
# tail -6 /proc/mounts
/dev/mapper/rhel_rhel7-root /mnt xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /root xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /mnt xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /root xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /mnt xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /root xfs rw,relatime,attr2,inode64,noquota 0 0

but just one entry is added per mount when run inside root's mount namespace:

# unshare -m
# cat /proc/mounts | wc -l
60
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
61
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
62
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
63
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
64
# tail -6 /proc/mounts
/dev/mapper/rhel_rhel7-root /root xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /root xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /mnt xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /mnt xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /mnt xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/mapper/rhel_rhel7-root /mnt xfs rw,relatime,attr2,inode64,noquota 0 0
# logout
# cat /proc/mounts | wc -l
60

a mount or mount+user namespace cannot be entered by a non-root user, and a user namespace entered by a non-root user does not allow mounting:

# su - testuser
$ unshare -m
unshare: unshare failed: Operation not permitted
$ unshare -U -r -m
unshare: unshare failed: Operation not permitted
$ unshare -U -r
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
mount: permission denied
# logout

so exponential mount table growth is possible by root only; user/mount namespaces do not allow entering or mounting, and so this is not a security flaw (a root user has many simpler ways to destroy the system).

2) on the other hand, this behavior is not reproducible in rhel-6:

# uname -r
2.6.32-573.22.1.el6.x86_64
# cat /proc/mounts | wc -l
10
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
11
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
12
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
13
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
14
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
15
# tail /proc/mounts
tmpfs /dev/shm tmpfs rw,relatime 0 0
/dev/vda3 / ext4 rw,relatime,barrier=1,data=ordered 0 0
/proc/bus/usb /proc/bus/usb usbfs rw,relatime 0 0
/dev/vda1 /boot ext4 rw,relatime,barrier=1,data=ordered 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw,relatime 0 0
/dev/vda3 /mnt ext4 rw,relatime,barrier=1,data=ordered 0 0
/dev/vda3 /mnt ext4 rw,relatime,barrier=1,data=ordered 0 0
/dev/vda3 /mnt ext4 rw,relatime,barrier=1,data=ordered 0 0
/dev/vda3 /mnt ext4 rw,relatime,barrier=1,data=ordered 0 0
/dev/vda3 /mnt ext4 rw,relatime,barrier=1,data=ordered 0 0

so in some sense this exponential growth of the mount table is a kind of regression.
3) on the third hand, the same behavior is present in the latest upstream kernel: the mount table also expands by a power of two with each mount command when run by root:

# uname -r
4.6.0-vladis
# cat /proc/mounts | wc -l
30
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
31
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
33
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
37
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
45
# tail -5 /proc/mounts
/dev/mapper/fedora_feraw-root /root ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/fedora_feraw-root /mnt ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/fedora_feraw-root /root ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/fedora_feraw-root /mnt ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/fedora_feraw-root /root ext4 rw,seclabel,relatime,data=ordered 0 0

and again, just one entry is added per mount when run inside a mount namespace entered from a non-root user:

# su - testuser
$ unshare -U -r -m
# cat /proc/mounts | wc -l
45
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
46
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
47
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
48
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
49
# mount -o bind /root/ /mnt/ && cat /proc/mounts | wc -l
50
# tail -6 /proc/mounts
/dev/mapper/fedora_feraw-root /root ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/fedora_feraw-root /mnt ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/fedora_feraw-root /mnt ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/fedora_feraw-root /mnt ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/fedora_feraw-root /mnt ext4 rw,seclabel,relatime,data=ordered 0 0
/dev/mapper/fedora_feraw-root /mnt ext4 rw,seclabel,relatime,data=ordered 0 0

so i do not think this exponential growth of the mount table is easily fixable in rhel-7, as (afaiu) we first need to fix it upstream (and get Linus to accept it).
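The contrast between the two sessions above comes down to mount propagation: under the shared propagation that systemd sets up on the host, each bind is replicated into every peer of /mnt and /root, while inside a fresh mount namespace (whose mounts are private) each bind adds exactly one entry. A toy shell model of the two counting behaviors (an illustration only, not a reproducer; the starting count of 29 is taken from the rhel-7 session above):

```shell
# shared: each bind also propagates into all existing peer copies -> doubling
# private: each bind adds a single mount table entry -> linear
shared=29; peers=1
private=29
for i in 1 2 3 4 5; do
  shared=$(( shared + peers ))   # new entries = current peer count
  peers=$(( peers * 2 ))         # every bind doubles the peer group
  private=$(( private + 1 ))     # one entry per bind, no propagation
done
echo "shared after 5 binds:  $shared"   # 60 -- matches the root session
echo "private after 5 binds: $private"  # 34 -- linear, as inside unshare -m
```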
This bug now revolves around the user namespace. http://seclists.org/oss-sec/2016/q3/75

First of all, this patch alone from Al Viro helps significantly with the soft lockup/NMI watchdog under a huge mount table:

=================
On Thu, Jul 14, 2016 at 05:11:13PM -0400, CAI Qian wrote:
> Tested it on this large memory machine. consumed 1.5G memory to create 8388640
> entries in the mount table. Immediately afterwards, NMI watchdog/soft-lockup
> kicked in and the kernel is dead.

Cute... Doesn't look like an OOM, though - more like hash chains growing long enough for hash insertions to happen while we'd been searching. Which triggers repeated lookup, etc. There's your livelock...

Actually, it's not even hash insertions; clone_mnt() does

	lock_mount_hash();
	list_add_tail(&mnt->mnt_instance, &sb->s_mounts);
	unlock_mount_hash();

and _that_ is an overkill. We don't need write_seqlock(&mount_lock) for that; read_seqlock_excl() is enough. No need to bump the count on that one. I suspect that this alone will make the situation with livelocks much better; it's not the only place where we would be fine with read_seqlock_excl(), but it's the easiest source of rapid bumps. Won't do anything about OOM, obviously...

diff --git a/fs/namespace.c b/fs/namespace.c
index 419f746..5ced8e3 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1013,9 +1013,9 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root,
 	mnt->mnt.mnt_root = dget(root);
 	mnt->mnt_mountpoint = mnt->mnt.mnt_root;
 	mnt->mnt_parent = mnt;
-	lock_mount_hash();
+	read_seqlock_excl(&mount_lock);
 	list_add_tail(&mnt->mnt_instance, &sb->s_mounts);
-	unlock_mount_hash();
+	read_sequnlock_excl(&mount_lock);
 	if ((flag & CL_SLAVE) ||
 	    ((flag & CL_SHARED_TO_SLAVE) && IS_MNT_SHARED(old))) {
===================

The patch does seem to improve the system response time a lot once the mount count is over 8 million. Although it still triggered the NMI/soft lockup during bind mount, the system seems to run fine afterwards.
No significant deadlock anymore. As the mount table grows bigger, those NMI/soft lockups happen more often. Eventually, the NMI/soft lockup runs in a loop and the CPU can no longer process anything once 9G of memory is consumed by those bind mounts.
Secondly, we might need something like the patch below, which limits the number of mounts per mount namespace, so the admin can set an upper limit. https://lists.linuxfoundation.org/pipermail/containers/2016-July/037216.html
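For reference, the proposal linked above was later merged upstream as a per-mount-namespace limit exposed via the `fs.mount-max` sysctl (default 100000). Assuming a kernel that carries that patch, the admin-facing knob looks roughly like this (a configuration sketch; the limit value is arbitrary):

```shell
# Check the current per-namespace mount limit (kernels with the mount-max patch)
sysctl fs.mount-max
cat /proc/sys/fs/mount-max

# Lower the cap so a runaway bind-mount loop fails with ENOSPC
# long before it can exhaust memory (50000 chosen for illustration)
sysctl -w fs.mount-max=50000
```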
hello. vladis from prodsec is here again, analyzing the newly suggested exploit possibilities.

1) exploit described in http://seclists.org/oss-sec/2016/q3/65

> # docker run -it -v /mnt/:/mnt/:shared --cap-add=SYS_ADMIN rhel7 /bin/bash
> (inside container) # for i in `seq 1 20`; do mount -o bind /mnt/1 /mnt/2; done

still requires a privileged user giving the container --cap-add=SYS_ADMIN. as before, a privileged user has much simpler ways to do bad things to a system.

2) exploit with user namespaces described in http://seclists.org/oss-sec/2016/q3/75

> $ unshare -r -m --propagation shared
> # for i in `seq 1 30`; do mount -o bind ~/src/ ~/dst/; done

testing this on the latest RHEL-7.3-20160825.1-Server (beta) still shows that unprivileged mount namespaces are not enabled in rhel-7.3:

> $ uname -r
> 3.10.0-495.el7.x86_64
>
> $ unshare -r -m
> unshare: unshare failed: Operation not permitted
>
> $ unshare -r -m --propagation shared
> unshare: unshare failed: Operation not permitted

making the described exploit impossible. so, under the current constraints of rhel-7, it is probably unexploitable. yes, it is still exploitable on fedora/upstream kernels; bz1356472 tracks that.

3) still, there was a discussion about whether we can be sure that unprivileged mount namespaces will remain disabled in future rhel-7. containers/docker presses quite hard to get these features in, so we cannot be sure. it was therefore suggested to keep the y-stream rhel-7 tracker bz1322495 (this one) and probably cc: Eric Biederman. if this is solved, the tracker bzs for rhel-7/kernel-rt, mrg-2/realtime-kernel and rhel-7/arm-kernel should be created too.
per Andrew Vagin: https://lkml.org/lkml/2016/8/28/269
Patch(es) committed on kernel repository and an interim kernel build is undergoing testing
Patch(es) available on kernel-3.10.0-644.el7
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:1842