Bug 1019908 - Error while executing action CommitRemoveGlusterVolumeBricks: Command execution failed return code: 146
Summary: Error while executing action CommitRemoveGlusterVolumeBricks: Command executi...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterfs
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: RHGS 2.1.2
Assignee: Vijaikumar Mallikarjuna
QA Contact: Lalatendu Mohanty
URL:
Whiteboard:
Depends On:
Blocks: 1044337
 
Reported: 2013-10-16 15:29 UTC by Prasanth
Modified: 2016-05-11 22:47 UTC
CC: 15 users

Fixed In Version: glusterfs-3.4.0.52rhs-1.el6rhs
Doc Type: Bug Fix
Doc Text:
Previously, the replace-brick command would become unresponsive and enter a deadlock situation after sending the commit acknowledgement to the other nodes in the cluster. This led to CLI command execution failures from glusterd with an ECONNREFUSED (146) error. With this fix, the replace-brick command no longer becomes unresponsive and works as expected.
Clone Of:
Cloned to: 1044337
Environment:
Last Closed: 2014-02-25 07:53:01 UTC
Embargoed:


Attachments
screenshot of error (124.47 KB, image/png)
2013-10-16 15:29 UTC, Prasanth


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2014:0208 0 normal SHIPPED_LIVE Red Hat Storage 2.1 enhancement and bug fix update #2 2014-02-25 12:20:30 UTC

Description Prasanth 2013-10-16 15:29:43 UTC
Created attachment 813010 [details]
screenshot of error

Description of problem:

Error while executing action CommitRemoveGlusterVolumeBricks: Command execution failed return code: 146

The following exception is seen in vdsm.log:

---------------
Thread-2183::ERROR::2013-10-16 20:50:17,200::BindingXMLRPC::990::vds::(wrapper) vdsm exception occured
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 979, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/gluster/api.py", line 53, in wrapper
    rv = func(*args, **kwargs)
  File "/usr/share/vdsm/gluster/api.py", line 290, in tasksList
    status = self.svdsmProxy.glusterTasksList(taskIds)
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
    **kwargs)
  File "<string>", line 2, in glusterTasksList
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
    raise convert_to_error(kind, result)
GlusterCmdExecFailedException: Command execution failed
return code: 2
Thread-2190::DEBUG::2013-10-16 20:50:23,958::task::579::TaskManager.Task::(_updateState) Task=`19d8776f-094c-4319-9abf-551b162641e9`::moving from state init -> state preparing
---------------

Version-Release number of selected component (if applicable):

[root@vm07 /]# rpm -qa |grep rhsc
rhsc-restapi-2.1.2-0.0.scratch.beta1.el6_4.noarch
rhsc-lib-2.1.2-0.0.scratch.beta1.el6_4.noarch
rhsc-cli-2.1.0.0-0.bb3a.el6rhs.noarch
rhsc-webadmin-portal-2.1.2-0.0.scratch.beta1.el6_4.noarch
rhsc-sdk-2.1.0.0-0.bb3a.el6rhs.noarch
rhsc-branding-rhs-3.3.0-1.0.master.201309200500.fc18.noarch
rhsc-backend-2.1.2-0.0.scratch.beta1.el6_4.noarch
rhsc-tools-2.1.2-0.0.scratch.beta1.el6_4.noarch
rhsc-dbscripts-2.1.2-0.0.scratch.beta1.el6_4.noarch
rhsc-setup-2.1.2-0.0.scratch.beta1.el6_4.noarch
rhsc-2.1.2-0.0.scratch.beta1.el6_4.noarch
rhsc-log-collector-2.1-0.1.el6rhs.noarch


[root@vm12 /]#  rpm -qa |grep vdsm
vdsm-4.13.0-17.gitdbbbacd.el6_4.x86_64
vdsm-python-4.13.0-17.gitdbbbacd.el6_4.x86_64
vdsm-python-cpopen-4.13.0-17.gitdbbbacd.el6_4.x86_64
vdsm-xmlrpc-4.13.0-17.gitdbbbacd.el6_4.noarch
vdsm-cli-4.13.0-17.gitdbbbacd.el6_4.noarch
vdsm-gluster-4.13.0-17.gitdbbbacd.el6_4.noarch
vdsm-reg-4.13.0-17.gitdbbbacd.el6_4.noarch


How reproducible: 


Steps to Reproduce:
1. Start remove brick on a distribute volume holding a good amount of data
2. Once the data migration is done, click on "Commit"
3. Try to execute some gluster commands from the CLI
4. The commands keep spinning for some time and finally fail

The only difference: some I/O was happening on the servers at the time, but on a different volume. (The equivalent gluster CLI sequence is sketched below.)
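
For reference, the remove-brick flow above maps to roughly this gluster CLI sequence. The volume and brick names are hypothetical placeholders, not taken from this report:

# Start removing a brick from a distribute volume; data is migrated off the brick
gluster volume remove-brick VOLNAME HOST:/bricks/brick1 start

# Poll until the data migration status shows completed
gluster volume remove-brick VOLNAME HOST:/bricks/brick1 status

# Commit the removal once migration is done; this is the step that failed
# here with return code 146
gluster volume remove-brick VOLNAME HOST:/bricks/brick1 commit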

Actual results: Commit remove brick fails


Expected results: Commit remove brick should succeed


Additional info: A screenshot taken at the time of the error is attached.
sosreports from the engine and the nodes will be attached soon.

Comment 1 Prasanth 2013-10-16 15:40:21 UTC
sosreports of engine and 2 nodes can be downloaded from: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/rhsc/1019908/

Comment 3 Bala.FA 2013-10-17 10:08:56 UTC
unable to download http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/rhsc/1019908/

Comment 4 Prasanth 2013-10-17 10:28:51 UTC
(In reply to Bala.FA from comment #3)
> unable to download
> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/rhsc/1019908/

It was a permission issue. Please try now.

Comment 5 Sahina Bose 2013-11-08 11:27:00 UTC
On server 2
From the vdsm log:
Thread-1941::ERROR::2013-10-16 20:49:08,225::BindingXMLRPC::990::vds::(wrapper) vdsm exception occured
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 979, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/gluster/api.py", line 53, in wrapper
    rv = func(*args, **kwargs)
  File "/usr/share/vdsm/gluster/api.py", line 193, in volumeRemoveBrickCommit
    replicaCount)
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
    **kwargs)
  File "<string>", line 2, in glusterVolumeRemoveBrickCommit
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
    raise convert_to_error(kind, result)
GlusterCmdExecFailedException: Command execution failed
return code: 146

From .cmd_log_history
[2013-10-16 15:21:20.626349]  : volume remove-brick vol1 vm13.lab.eng.blr.redhat.com:/home/3 commit : SUCCESS

[2013-10-16 15:20:47.970758] I [glusterd-pmap.c:271:pmap_registry_remove] 0-pmap: removing brick /home/3 on port 49154
[2013-10-16 15:20:47.987665] W [socket.c:522:__socket_rwv] 0-socket.management: writev on 10.70.36.85:992 failed (Broken pipe)
[2013-10-16 15:20:47.987690] I [socket.c:2235:socket_event_handler] 0-transport: disconnecting now
[2013-10-16 15:20:54.144921] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=2236 max=0 total=0
[2013-10-16 15:20:54.144968] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=124 max=0 total=0
[2013-10-16 15:20:54.144999] I [socket.c:2235:socket_event_handler] 0-transport: disconnecting now
[2013-10-16 15:20:54.145286] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-10-16 15:20:54.145301] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully

Bala, can you check this?

Comment 6 Sahina Bose 2013-11-12 10:00:56 UTC
Could not reproduce this after updating to the latest RHS U2 build.
# glusterfs --version
glusterfs 3.4.0.42.1u2rhs built on Nov 11 2013 05:03:23

vdsm
vdsm-4.13.0-21.el6rhs.x86_64

Tried the following:
I) Commit while another operation is in progress (a CLI sketch follows below)
1. Start remove brick on a volume
2. In parallel, start rebalance on another volume in the cluster
3. Click on commit remove brick (once migration is complete)
4. In parallel, run remove-brick status from the gluster CLI (gave an error that another operation is in progress)

-- Brick removal was committed

II) Commit while the brick process was down
1. Start remove brick on a volume
2. Once migration is complete, bring down the brick process of the brick being removed
3. Click on commit remove brick

-- Brick removal was committed
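
For reference, scenario I maps to roughly this CLI sequence. Volume and brick names are hypothetical placeholders, and the commit itself was driven from the Console rather than the CLI:

# Start remove-brick on one volume
gluster volume remove-brick vol1 HOST:/bricks/b1 start

# In parallel, start a rebalance on another volume in the same cluster
gluster volume rebalance vol2 start

# Once migration completes, commit from the Console; running the status
# query from the CLI at the same time is expected to report that another
# operation is in progress
gluster volume remove-brick vol1 HOST:/bricks/b1 status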

I'm moving this to ON_QA. Please reopen if you encounter it again.

Comment 7 Prasanth 2013-12-06 12:08:47 UTC
This issue is not seen in the cb7-cb10 builds, so I am marking it as verified.

Will reopen if I encounter it again.

Comment 8 Ben Turner 2013-12-12 17:01:16 UTC
I think we hit this today in BVT.  We were executing replace brick when we got the following error:

:: [   FAIL   ] :: Running 'gluster volume replace-brick hosdu rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick7 rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick8 start' (Expected 0, got 146)

After hitting that all subsequent tests failed with:

Connection failed. Please check if gluster daemon is operational.
:: [   FAIL   ] :: Running 'gluster volume replace-brick hosdu rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick7 rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick8 commit' (Expected 0, got 1)

My guess is that this issue is intermittent, as this is the first time I have seen it and Prasanth hasn't seen it for several builds. The package we were running was:

glusterfs-server-3.4.0.48geo-1386779541.el6.x86_64.rpm

I am going to queue up several replace-brick tests on the latest build and attempt to reproduce. I'll update with details as I have them. (The test sequence and a basic glusterd liveness check are sketched below.)
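
For reference, the failing BVT step and a basic liveness check look roughly like this. The replace-brick arguments are taken from the log above; the glusterd checks are generic suggestions, not the actual BVT harness:

# The replace-brick sequence BVT runs; 'start' returned exit code 146 here
gluster volume replace-brick hosdu rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick7 rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick8 start
gluster volume replace-brick hosdu rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick7 rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick8 commit

# If subsequent commands report "Connection failed. Please check if gluster
# daemon is operational", confirm glusterd is still running on the node
service glusterd status
pgrep -x glusterd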

Comment 9 Bala.FA 2013-12-13 05:37:19 UTC
Please attach sosreport.

Comment 10 Dusmant 2013-12-13 11:55:27 UTC
Do we support replace brick? I think that's being taken out... we should not run BVT tests for replace brick...

Comment 11 Ben Turner 2013-12-13 15:05:22 UTC
We were able to repro last night:

Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: [2013/12/12 21:20:51.430075,  0] lib/fault.c:47(fault_report)
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]:   ===============================================================
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: [2013/12/12 21:20:51.451292,  0] lib/fault.c:48(fault_report)
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]:   INTERNAL ERROR: Signal 6 in pid 8263 (3.6.9-160.7.el6rhs)
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]:   Please read the Trouble-Shooting section of the Samba3-HOWTO
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: [2013/12/12 21:20:51.451687,  0] lib/fault.c:50(fault_report)
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]:   
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]:   From: http://www.samba.org/samba/docs/Samba3-HOWTO.pdf
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: [2013/12/12 21:20:51.452078,  0] lib/fault.c:51(fault_report)
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]:   ===============================================================
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: [2013/12/12 21:20:51.452339,  0] lib/util.c:1117(smb_panic)
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]:   PANIC (pid 8263): internal error
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: [2013/12/12 21:20:51.735252,  0] lib/util.c:1221(log_stack_trace)
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]:   BACKTRACE: 11 stack frames:
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]:    #0 smbd(log_stack_trace+0x1a) [0x7f6ae51d34fa]
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]:    #1 smbd(smb_panic+0x2b) [0x7f6ae51d35cb]
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]:    #2 smbd(+0x41a054) [0x7f6ae51c4054]
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]:    #3 /lib64/libc.so.6(+0x32960) [0x7f6ae1070960]
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]:    #4 /lib64/libc.so.6(gsignal+0x35) [0x7f6ae10708e5]
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]:    #5 /lib64/libc.so.6(abort+0x175) [0x7f6ae10720c5]
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]:    #6 /lib64/libc.so.6(+0x707f7) [0x7f6ae10ae7f7]
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]:    #7 /lib64/libc.so.6(+0x76126) [0x7f6ae10b4126]
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]:    #8 /lib64/libc.so.6(+0x78c53) [0x7f6ae10b6c53]
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]:    #9 /usr/lib64/libtalloc.so.2(+0x2389) [0x7f6ae1c15389]
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]:    #10 /usr/lib64/libtalloc.so.2(+0x2323) [0x7f6ae1c15323]
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: [2013/12/12 21:20:51.811070,  0] lib/fault.c:372(dump_core)
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]:   dumping core in /var/log/samba/cores/smbd
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: 

I am attaching sosreports, but the cores are pretty large.  I'll post a link to them.

Comment 13 Vijaikumar Mallikarjuna 2013-12-16 10:07:47 UTC
Hi Ben,

From the sosreport, I see rdma errors in all the glusterd logs. I guess glusterd failed because of this. (Log-triage commands are sketched after the excerpt below.)


[2013-12-13 07:43:25.634291] W [socket.c:522:__socket_rwv] 0-management: readv on 10.70.36.249:24007 failed (No data available)
[2013-12-13 07:43:26.269897] W [socket.c:522:__socket_rwv] 0-management: readv on 10.70.36.252:24007 failed (No data available)
[2013-12-13 07:43:26.511746] E [socket.c:2158:socket_connect_finish] 0-management: connection to 10.70.36.252:24007 failed (Connection refused)
[2013-12-13 07:43:29.517576] E [socket.c:2158:socket_connect_finish] 0-management: connection to 10.70.36.249:24007 failed (Connection refused)
[2013-12-13 07:43:32.225933] W [glusterfsd.c:1099:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x37532e894d] (-->/lib64/libpthread.so.0() [0x3753607851] (-->/usr/sbin/glusterd(glusterfs_sigwaiter+0xcd) [0x4052ad]))) 0-: received signum (15), shutting down
[2013-12-13 19:23:29.354599] I [glusterfsd.c:2026:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.4.0.48geo (/usr/sbin/glusterd --pid-file=/var/run/glusterd.pid)
[2013-12-13 19:23:29.771320] I [glusterd.c:1148:init] 0-management: Using /var/lib/glusterd as working directory
[2013-12-13 19:23:29.771717] I [rpcsvc.c:2058:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: Configured rpc.outstanding-rpc-limit with value 64
[2013-12-13 19:23:30.221835] I [socket.c:3505:socket_init] 0-socket.management: SSL support is NOT enabled
[2013-12-13 19:23:30.221896] I [socket.c:3520:socket_init] 0-socket.management: using system polling thread
[2013-12-13 19:23:30.384476] C [rdma.c:4099:gf_rdma_init] 0-rpc-transport/rdma: Failed to get IB devices
[2013-12-13 19:23:30.384724] E [rpc-transport.c:333:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
[2013-12-13 19:23:30.384802] W [rpcsvc.c:1481:rpcsvc_transport_create] 0-rpc-service: cannot create listener, initing the transport failed
[2013-12-13 19:23:30.384854] I [rpcsvc.c:2058:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: Configured rpc.outstanding-rpc-limit with value 64
[2013-12-13 19:23:30.385028] I [socket.c:3505:socket_init] 0-socket.management: SSL support is NOT enabled
[2013-12-13 19:23:30.385049] I [socket.c:3520:socket_init] 0-socket.management: using system polling thread
[2013-12-13 19:23:38.766390] E [store.c:407:gf_store_handle_retrieve] 0-: Unable to retrieve store handle /var/lib/glusterd/glusterd.info, error: No such file or directory
[2013-12-13 19:23:38.766499] E [glusterd-store.c:1337:glusterd_retrieve_op_version] 0-: Unable to get store handle!
[2013-12-13 19:23:38.766553] E [store.c:407:gf_store_handle_retrieve] 0-: Unable to retrieve store handle /var/lib/glusterd/glusterd.info, error: No such file or directory
[2013-12-13 19:23:38.766592] E [glusterd-store.c:1438:glusterd_retrieve_uuid] 0-: Unable to get store handle!
[2013-12-13 19:23:38.766610] I [glusterd-store.c:1408:glusterd_restore_op_version] 0-management: Detected new install. Setting op-version to maximum : 2
[2013-12-13 19:23:38.766966] E [store.c:407:gf_store_handle_retrieve] 0-: Unable to retrieve store handle /var/lib/glusterd/options, error: No such file or directory
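
For anyone triaging similar failures, the relevant messages can be pulled out of the glusterd log with something like the following. The path is the default glusterd log location on RHS; adjust if your layout differs:

# Scan the glusterd log for transport and connection errors
grep -E 'rdma|Connection refused|No data available' /var/log/glusterfs/etc-glusterfs-glusterd.vol.log

# Check whether glusterd restarted around the failure window
grep 'Started running /usr/sbin/glusterd' /var/log/glusterfs/etc-glusterfs-glusterd.vol.log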

Comment 14 Vijaikumar Mallikarjuna 2013-12-16 12:27:33 UTC
We have a reproducible setup: glusterd hangs after starting replace-brick.

I am working on finding the root cause of the issue and will update on the status soon.

Comment 17 Lalatendu Mohanty 2013-12-16 14:19:08 UTC
After debugging the smbd crash further, it seems the crash is not related to the glusterd issue. Hence I am raising a new bug for the smbd issue.

Comment 18 Lalatendu Mohanty 2013-12-16 14:32:34 UTC
BZ 1043519 raised for the smbd crash issue.

Comment 19 Vijaikumar Mallikarjuna 2013-12-19 12:05:33 UTC
Patch: https://code.engineering.redhat.com/gerrit/17713

Comment 20 Prasanth 2013-12-21 10:41:35 UTC
Ben/Lala,

Could you please run the BVT tests and see if you are hitting this issue with glusterfs-3.4.0.52rhs-1.el6rhs?

Comment 21 Lalatendu Mohanty 2013-12-23 09:30:31 UTC
I ran BVT after the fix was merged into the downstream codebase, and the tests are passing fine.

Comment 22 Lalatendu Mohanty 2013-12-23 09:31:41 UTC
Marking the bug as verified, as BVT ran with the fix and the concerned test case is passing now.

Comment 23 Pavithra 2014-01-03 06:48:30 UTC
Please verify the edited doc text for technical accuracy.

Comment 24 Vijaikumar Mallikarjuna 2014-01-03 07:22:54 UTC
Doc text looks good to me.

Comment 26 errata-xmlrpc 2014-02-25 07:53:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html

