Bug 1473327
| Summary: | Brick Multiplexing: Seeing stale brick process when all gluster processes are stopped and then started with glusterd | | |
| --- | --- | --- | --- |
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Nag Pavan Chilakam <nchilaka> |
| Component: | glusterd | Assignee: | Atin Mukherjee <amukherj> |
| Status: | CLOSED ERRATA | QA Contact: | Nag Pavan Chilakam <nchilaka> |
| Severity: | low | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.3 | CC: | amukherj, rhinduja, rhs-bugs, storage-qa-internal, vbellur |
| Target Milestone: | --- | | |
| Target Release: | RHGS 3.3.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.8.4-36 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-09-21 05:04:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1465559 | | |
| Bug Blocks: | 1417151 | | |
Description
Nag Pavan Chilakam
2017-07-20 13:46:12 UTC
upstream patch : https://review.gluster.org/#/c/17840/

Steps to reproduce:
===================
1) Enable brick multiplexing, have, say, three volumes started, and run the loop below:

```
for i in {1..1000}; do
    pkill glusterfsd; pkill glusterfs; service glusterd stop
    ps -ef | grep glusterfsd
    echo "######### all gluster* down ###########"
    sleep 3
    service glusterd start
    sleep 10
    ps -ef | grep glusterfsd
    echo "######### all gluster* UP ###########"
    sleep 60
    echo "======= end of loop $i ======="
done
```

You should see the problem within the first few iterations, as below:

```
root     26788     1  0 19:23 ?  00:00:00 /usr/sbin/glusterfsd -s 10.70.35.45 --volfile-id vname_1.10.70.35.45.rhs-brick1-vname_1 -p /var/lib/glusterd/vols/vname_1/run/10.70.35.45-rhs-brick1-vname_1.pid -S /var/run/gluster/43cfdbc6a1a48639ba3d5e8f7dda93e8.socket --brick-name /rhs/brick1/vname_1 -l /var/log/glusterfs/bricks/rhs-brick1-vname_1.log --xlator-option *-posix.glusterd-uuid=44e38968-a30c-4e0f-a1ca-f701025335e6 --brick-port 49155 --xlator-option vname_1-server.listen-port=49155
root     26840     1  1 19:23 ?  00:00:00 /usr/sbin/glusterfsd -s 10.70.35.45 --volfile-id vname_2.10.70.35.45.rhs-brick2-vname_2 -p /var/lib/glusterd/vols/vname_2/run/10.70.35.45-rhs-brick2-vname_2.pid -S /var/run/gluster/1400341996df928682daaa9f3eaa2b7a.socket --brick-name /rhs/brick2/vname_2 -l /var/log/glusterfs/bricks/rhs-brick2-vname_2.log --xlator-option *-posix.glusterd-uuid=44e38968-a30c-4e0f-a1ca-f701025335e6 --brick-port 49156 --xlator-option vname_2-server.listen-port=49156
```

```
[root@dhcp35-45 ~]# gluster v status
Status of volume: vname_1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.45:/rhs/brick1/vname_1       49153     0          Y       27754
Brick 10.70.35.130:/rhs/brick1/vname_1      49152     0          Y       3870
Brick 10.70.35.122:/rhs/brick1/vname_1      49152     0          Y       30855
Self-heal Daemon on localhost               N/A       N/A        Y       27800
Self-heal Daemon on 10.70.35.23             N/A       N/A        Y       22918
Self-heal Daemon on 10.70.35.138            N/A       N/A        Y       14532
Self-heal Daemon on 10.70.35.112            N/A       N/A        Y       32335
Self-heal Daemon on 10.70.35.122            N/A       N/A        Y       30923
Self-heal Daemon on 10.70.35.130            N/A       N/A        Y       3938

Task Status of Volume vname_1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: vname_2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.45:/rhs/brick2/vname_2       49153     0          Y       27754
Brick 10.70.35.130:/rhs/brick2/vname_2      49152     0          Y       3870
Brick 10.70.35.122:/rhs/brick2/vname_2      49152     0          Y       30855
Self-heal Daemon on localhost               N/A       N/A        Y       27800
Self-heal Daemon on 10.70.35.23             N/A       N/A        Y       22918
Self-heal Daemon on 10.70.35.112            N/A       N/A        Y       32335
Self-heal Daemon on 10.70.35.130            N/A       N/A        Y       3938
Self-heal Daemon on 10.70.35.122            N/A       N/A        Y       30923
Self-heal Daemon on 10.70.35.138            N/A       N/A        Y       14532

Task Status of Volume vname_2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: vname_3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.45:/rhs/brick3/vname_3       49153     0          Y       27754
Brick 10.70.35.130:/rhs/brick3/vname_3      49152     0          Y       3870
Brick 10.70.35.122:/rhs/brick3/vname_3      49152     0          Y       30855
Self-heal Daemon on localhost               N/A       N/A        Y       27800
Self-heal Daemon on 10.70.35.23             N/A       N/A        Y       22918
Self-heal Daemon on 10.70.35.138            N/A       N/A        Y       14532
Self-heal Daemon on 10.70.35.130            N/A       N/A        Y       3938
Self-heal Daemon on 10.70.35.122            N/A       N/A        Y       30923
Self-heal Daemon on 10.70.35.112            N/A       N/A        Y       32335

Task Status of Volume vname_3
------------------------------------------------------------------------------
There are no active volume tasks
```
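As an illustration only (not part of the original report), a check along the following lines can flag glusterfsd processes that no longer match any brick pidfile maintained by glusterd, which is one way to spot the stale brick described above. It assumes the default pidfile layout used on this setup (`/var/lib/glusterd/vols/*/run/*.pid`), which may differ elsewhere:

```
# Sketch: list glusterfsd processes that no brick pidfile points at.
# Assumes the default pidfile layout under /var/lib/glusterd.
known_pids=$(cat /var/lib/glusterd/vols/*/run/*.pid 2>/dev/null | sort -u)
for pid in $(pgrep -x glusterfsd); do
    if ! grep -qw "$pid" <<< "$known_pids"; then
        echo "possible stale brick process: PID $pid"
        ps -o pid,etime,rss,args -p "$pid"
    fi
done
```

With brick multiplexing working as intended, every brick pidfile on the node is expected to name the same single glusterfsd, so any running glusterfsd outside that set is suspect.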
```
[root@dhcp35-45 ~]# gluster v info

Volume Name: vname_1
Type: Replicate
Volume ID: 9916529a-29f4-428c-bc0b-80008e314794
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.35.45:/rhs/brick1/vname_1
Brick2: 10.70.35.130:/rhs/brick1/vname_1
Brick3: 10.70.35.122:/rhs/brick1/vname_1
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
cluster.brick-multiplex: on

Volume Name: vname_2
Type: Replicate
Volume ID: 1bab2084-0272-4ad8-be71-1b4434dd1410
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.35.45:/rhs/brick2/vname_2
Brick2: 10.70.35.130:/rhs/brick2/vname_2
Brick3: 10.70.35.122:/rhs/brick2/vname_2
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
cluster.brick-multiplex: on

Volume Name: vname_3
Type: Replicate
Volume ID: ff65c115-91ca-4041-8d1c-c57d15ab5b36
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.70.35.45:/rhs/brick3/vname_3
Brick2: 10.70.35.130:/rhs/brick3/vname_3
Brick3: 10.70.35.122:/rhs/brick3/vname_3
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
cluster.brick-multiplex: on
```

Proposing as blocker, because the stale process keeps consuming more resources whether idle or under I/O. Before any I/O was triggered, its memory usage rose from 0.2% to 0.4%; with I/O (a linux untar on both vname_1 and vname_2) it slowly climbed to 1.2%.
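The figures below came from repeatedly running top by hand; a small loop along these lines (a sketch, with the PIDs from this particular run hardcoded) collects the same data at a fixed interval:

```
# Sketch: take the same snapshots as below every 30 seconds for an hour.
# 27754 and 27782 are the glusterfsd PIDs observed on this host.
for _ in $(seq 1 120); do
    date
    top -n 1 -b | egrep "glusterfsd|RES"
    # alternatively: ps -o pid,rss,%mem,%cpu,etime -p 27754,27782
    sleep 30
done
```

The snapshots captured during the run follow.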
```
[root@dhcp35-45 ~]# top -n 1 -b|egrep "glusterfsd|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27754 root      20   0 2102180  40292   3508 S   0.0  0.5   0:00.08 glusterfsd
27782 root      20   0 1020692  14180   3464 S   0.0  0.2   0:00.14 glusterfsd
[root@dhcp35-45 ~]# top -n 1 -b|egrep "glusterfsd|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27754 root      20   0 2102180  40292   3508 S   0.0  0.5   0:00.09 glusterfsd
27782 root      20   0 1020692  20040   3464 S   0.0  0.2   0:00.15 glusterfsd
[root@dhcp35-45 ~]# top -n 1 -b|egrep "glusterfsd|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27782 root      20   0 1020692  29020   4032 S  15.8  0.4   0:00.52 glusterfsd
27754 root      20   0 2102180  42864   3932 S   0.0  0.5   0:00.63 glusterfsd
[root@dhcp35-45 ~]# top -n 1 -b|egrep "glusterfsd|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27754 root      20   0 2233772  45224   4216 S  45.0  0.6   0:06.13 glusterfsd
27782 root      20   0 1086488  30580   4312 S  25.0  0.4   0:04.76 glusterfsd
[root@dhcp35-45 ~]# top -n 1 -b|egrep "glusterfsd|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27754 root      20   0 2233772  46016   4216 S  47.4  0.6   0:07.63 glusterfsd
27782 root      20   0 1086488  30844   4324 S  31.6  0.4   0:05.85 glusterfsd
[root@dhcp35-45 ~]# top -n 1 -b|egrep "glusterfsd|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27754 root      20   0 2233772  47336   4216 S  50.0  0.6   0:10.20 glusterfsd
27782 root      20   0 1086488  31636   4324 S  38.9  0.4   0:07.77 glusterfsd
[root@dhcp35-45 ~]# top -n 1 -b|egrep "glusterfsd|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27754 root      20   0 2233772  48128   4216 S  42.1  0.6   0:11.28 glusterfsd
27782 root      20   0 1086488  32428   4324 S  31.6  0.4   0:08.60 glusterfsd
[root@dhcp35-45 ~]# top -n 1 -b|egrep "glusterfsd|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27754 root      20   0 2233772  49184   4216 S  44.4  0.6   0:13.96 glusterfsd
27782 root      20   0 1086488  33220   4324 S  33.3  0.4   0:10.58 glusterfsd
[root@dhcp35-45 ~]# top -n 1 -b|egrep "glusterfsd|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27754 root      20   0 2233772  63148   4348 S  42.1  0.8   0:43.89 glusterfsd
27782 root      20   0 1152284  44568   4324 S  36.8  0.6   0:32.80 glusterfsd
[root@dhcp35-45 ~]# top -n 1 -b|egrep "glusterfsd|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27754 root      20   0 2233772  91112   4348 S  42.9  1.1   1:49.71 glusterfsd
27782 root      20   0 1152284  68064   4324 S  33.3  0.8   1:22.14 glusterfsd
[root@dhcp35-45 ~]# top -n 1 -b|egrep "glusterfsd|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27754 root      20   0 2233772  91112   4348 S  36.4  1.1   2:12.93 glusterfsd
27782 root      20   0 1152284  76472   4324 S  27.3  1.0   1:39.32 glusterfsd
[root@dhcp35-45 ~]# top -n 1 -b|egrep "glusterfsd|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27782 root      20   0 1152284  83096   4324 S  27.3  1.0   2:41.78 glusterfsd
27754 root      20   0 2233772  91112   4348 S   0.0  1.1   2:46.05 glusterfsd
[root@dhcp35-45 ~]# top -n 1 -b|egrep "glusterfsd|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27782 root      20   0 1152284  83096   4324 S  28.6  1.0   3:33.26 glusterfsd
27754 root      20   0 2233772  91280   4348 S   0.0  1.1   2:46.08 glusterfsd
[root@dhcp35-45 ~]# top -n 1 -b|egrep "glusterfsd|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27782 root      20   0 1152284  83360   4324 S  30.0  1.0   3:57.97 glusterfsd
27754 root      20   0 2233772  91280   4348 S   0.0  1.1   2:46.10 glusterfsd
[root@dhcp35-45 ~]# top -n 1 -b|egrep "glusterfsd|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27782 root      20   0 1152284  83360   4324 S  28.6  1.0   4:39.39 glusterfsd
27754 root      20   0 2233772  91280   4348 S   0.0  1.1   2:46.12 glusterfsd
[root@dhcp35-45 ~]# top -n 1 -b|egrep "glusterfsd|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27754 root      20   0 2234032  91280   4348 S  45.0  1.1   2:47.72 glusterfsd
27782 root      20   0 1152284  83360   4324 S  30.0  1.0   4:51.37 glusterfsd
[root@dhcp35-45 ~]# top -n 1 -b|egrep "glusterfsd|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27754 root      20   0 2234032  91804   4404 S  42.1  1.1   2:49.69 glusterfsd
27782 root      20   0 1152284  83360   4324 S  26.3  1.0   4:52.62 glusterfsd
[root@dhcp35-45 ~]# top -n 1 -b|egrep "glusterfsd|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27754 root      20   0 2234292  91804   4404 S  50.0  1.1   2:56.10 glusterfsd
27782 root      20   0 1152284  83360   4324 S  30.0  1.0   4:57.11 glusterfsd
[root@dhcp35-45 ~]# top -n 1 -b|egrep "glusterfsd|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27754 root      20   0 2234292  91804   4404 S  52.6  1.1   3:01.19 glusterfsd
27782 root      20   0 1152284  83360   4324 S  36.8  1.0   5:00.64 glusterfsd
[root@dhcp35-45 ~]# top -n 1 -b|egrep "glusterfsd|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27754 root      20   0 2234292  91804   4404 S  45.0  1.1   3:14.37 glusterfsd
27782 root      20   0 1152284  83360   4324 S  35.0  1.0   5:10.02 glusterfsd
[root@dhcp35-45 ~]# top -n 1 -b|egrep "glusterfsd|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27754 root      20   0 2234292  91804   4404 S  44.4  1.1   3:15.56 glusterfsd
27782 root      20   0 1152284  83360   4324 S  33.3  1.0   5:10.85 glusterfsd
[root@dhcp35-45 ~]# top -n 1 -b|egrep "glusterfsd|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27754 root      20   0 2234292  92068   4404 S  40.9  1.1   3:29.01 glusterfsd
27782 root      20   0 1152284  83360   4324 S  27.3  1.0   5:20.07 glusterfsd
[root@dhcp35-45 ~]# ps -ef|grep glusterfsd
root     27754     1 15 19:24 ?  00:03:33 /usr/sbin/glusterfsd -s 10.70.35.45 --volfile-id vname_1.10.70.35.45.rhs-brick1-vname_1 -p /var/lib/glusterd/vols/vname_1/run/10.70.35.45-rhs-brick1-vname_1.pid -S /var/run/gluster/43cfdbc6a1a48639ba3d5e8f7dda93e8.socket --brick-name /rhs/brick1/vname_1 -l /var/log/glusterfs/bricks/rhs-brick1-vname_1.log --xlator-option *-posix.glusterd-uuid=44e38968-a30c-4e0f-a1ca-f701025335e6 --brick-port 49153 --xlator-option vname_1-server.listen-port=49153
root     27782     1 23 19:24 ?  00:05:23 /usr/sbin/glusterfsd -s 10.70.35.45 --volfile-id vname_2.10.70.35.45.rhs-brick2-vname_2 -p /var/lib/glusterd/vols/vname_2/run/10.70.35.45-rhs-brick2-vname_2.pid -S /var/run/gluster/1400341996df928682daaa9f3eaa2b7a.socket --brick-name /rhs/brick2/vname_2 -l /var/log/glusterfs/bricks/rhs-brick2-vname_2.log --xlator-option *-posix.glusterd-uuid=44e38968-a30c-4e0f-a1ca-f701025335e6 --brick-port 49154 --xlator-option vname_2-server.listen-port=49154
root     28088 23676  0 19:47 pts/1  00:00:00 grep --color=auto glusterfsd
[root@dhcp35-45 ~]# top -n 1 -b|egrep "glusterfsd|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27754 root      20   0 2234292  92068   4404 S  45.0  1.1   3:44.37 glusterfsd
27782 root      20   0 1152284  83360   4324 S  30.0  1.0   5:30.90 glusterfsd
[root@dhcp35-45 ~]# top -n 1 -b|egrep "glusterfsd|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27754 root      20   0 2234292  92332   4404 S  42.9  1.2   3:58.29 glusterfsd
27782 root      20   0 1152284  83360   4324 S  33.3  1.0   5:40.82 glusterfsd
[root@dhcp35-45 ~]# top -n 1 -b|egrep "glusterfsd|RES"
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27754 root      20   0 2234292  93124   4404 S   0.0  1.2   5:32.95 glusterfsd
27782 root      20   0 1152284  83588   4328 S   0.0  1.0   6:19.33 glusterfsd
```

downstream patch : https://code.engineering.redhat.com/gerrit/#/c/113212

on_qa validation: on 3.8.4-36
Reran the steps from the description. I did see more than one brick process spawning, but they are not stale; rather, different volumes' bricks get different PIDs (two glusterfsds, one for each volume), which breaks the brick-multiplexing feature. There is already a bug for that, hence moving this to verified.
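As an aside on that observation (an illustrative sketch, not part of the original verification): with cluster.brick-multiplex enabled, all brick pidfiles on a node are expected to point at a single glusterfsd, so a check like the one below shows at a glance whether multiplexing is actually in effect or whether extra brick processes are running. It assumes the default pidfile layout used on this setup:

```
# Sketch: distinct brick PIDs vs. running glusterfsd processes.
# With multiplexing in effect, both should report a single PID per node.
sort -u /var/lib/glusterd/vols/*/run/*.pid
pgrep -xc glusterfsd
```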
Also, I hit a glusterd core while stopping volumes or stopping the glusterd service. For that crash I will raise a new bug if one does not already exist; the core and sosreports with logs are available at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/sosreport-for-glusterd-core-while-verifying-bz-1473327/

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774