Created attachment 813010 [details]
Screenshot of the error

Description of problem:
Error while executing action CommitRemoveGlusterVolumeBricks: Command execution failed
return code: 146

The following exception is seen in vdsm.log:
---------------
Thread-2183::ERROR::2013-10-16 20:50:17,200::BindingXMLRPC::990::vds::(wrapper) vdsm exception occured
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 979, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/gluster/api.py", line 53, in wrapper
    rv = func(*args, **kwargs)
  File "/usr/share/vdsm/gluster/api.py", line 290, in tasksList
    status = self.svdsmProxy.glusterTasksList(taskIds)
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
    **kwargs)
  File "<string>", line 2, in glusterTasksList
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
    raise convert_to_error(kind, result)
GlusterCmdExecFailedException: Command execution failed
return code: 2

Thread-2190::DEBUG::2013-10-16 20:50:23,958::task::579::TaskManager.Task::(_updateState) Task=`19d8776f-094c-4319-9abf-551b162641e9`::moving from state init -> state preparing
---------------

Version-Release number of selected component (if applicable):
[root@vm07 /]# rpm -qa | grep rhsc
rhsc-restapi-2.1.2-0.0.scratch.beta1.el6_4.noarch
rhsc-lib-2.1.2-0.0.scratch.beta1.el6_4.noarch
rhsc-cli-2.1.0.0-0.bb3a.el6rhs.noarch
rhsc-webadmin-portal-2.1.2-0.0.scratch.beta1.el6_4.noarch
rhsc-sdk-2.1.0.0-0.bb3a.el6rhs.noarch
rhsc-branding-rhs-3.3.0-1.0.master.201309200500.fc18.noarch
rhsc-backend-2.1.2-0.0.scratch.beta1.el6_4.noarch
rhsc-tools-2.1.2-0.0.scratch.beta1.el6_4.noarch
rhsc-dbscripts-2.1.2-0.0.scratch.beta1.el6_4.noarch
rhsc-setup-2.1.2-0.0.scratch.beta1.el6_4.noarch
rhsc-2.1.2-0.0.scratch.beta1.el6_4.noarch
rhsc-log-collector-2.1-0.1.el6rhs.noarch

[root@vm12 /]# rpm -qa | grep vdsm
vdsm-4.13.0-17.gitdbbbacd.el6_4.x86_64
vdsm-python-4.13.0-17.gitdbbbacd.el6_4.x86_64
vdsm-python-cpopen-4.13.0-17.gitdbbbacd.el6_4.x86_64
vdsm-xmlrpc-4.13.0-17.gitdbbbacd.el6_4.noarch
vdsm-cli-4.13.0-17.gitdbbbacd.el6_4.noarch
vdsm-gluster-4.13.0-17.gitdbbbacd.el6_4.noarch
vdsm-reg-4.13.0-17.gitdbbbacd.el6_4.noarch

How reproducible:

Steps to Reproduce:
1. Start remove-brick on a distribute volume holding a good amount of data.
2. Once the data migration is done, click on "Commit".
3. Try to execute some gluster commands from the CLI.
4. The commit keeps spinning for some time and finally fails.

Note: some I/O was happening on the servers at the time, but on a different volume.

Actual results:
Commit remove-brick fails.

Expected results:
Commit remove-brick should succeed.

Additional info:
A screenshot taken at the time of the error is attached. sosreports from the engine and the nodes will be attached soon.
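For reference, an approximate gluster CLI equivalent of steps 1 and 2 above. This is only a sketch: the volume name, host, and brick path are placeholders, and the actual run was driven from the RHSC UI.

# Placeholders: "vol1" and the brick path are illustrative, not the actual setup.
gluster volume remove-brick vol1 server1.example.com:/bricks/b3 start

# Poll until the migration for the brick being removed shows "completed".
gluster volume remove-brick vol1 server1.example.com:/bricks/b3 status

# Equivalent of clicking "Commit" in the UI; this is the step that failed
# with return code 146 in this report.
gluster volume remove-brick vol1 server1.example.com:/bricks/b3 commit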
sosreports of engine and 2 nodes can be downloaded from: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/rhsc/1019908/
unable to download http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/rhsc/1019908/
(In reply to Bala.FA from comment #3)
> unable to download
> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/rhsc/1019908/

It was a permission issue. Please try now.
On server 2, from the vdsm log:

Thread-1941::ERROR::2013-10-16 20:49:08,225::BindingXMLRPC::990::vds::(wrapper) vdsm exception occured
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 979, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/gluster/api.py", line 53, in wrapper
    rv = func(*args, **kwargs)
  File "/usr/share/vdsm/gluster/api.py", line 193, in volumeRemoveBrickCommit
    replicaCount)
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
    **kwargs)
  File "<string>", line 2, in glusterVolumeRemoveBrickCommit
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 740, in _callmethod
    raise convert_to_error(kind, result)
GlusterCmdExecFailedException: Command execution failed
return code: 146

From .cmd_log_history:

[2013-10-16 15:21:20.626349] : volume remove-brick vol1 vm13.lab.eng.blr.redhat.com:/home/3 commit : SUCCESS

[2013-10-16 15:20:47.970758] I [glusterd-pmap.c:271:pmap_registry_remove] 0-pmap: removing brick /home/3 on port 49154
[2013-10-16 15:20:47.987665] W [socket.c:522:__socket_rwv] 0-socket.management: writev on 10.70.36.85:992 failed (Broken pipe)
[2013-10-16 15:20:47.987690] I [socket.c:2235:socket_event_handler] 0-transport: disconnecting now
[2013-10-16 15:20:54.144921] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=2236 max=0 total=0
[2013-10-16 15:20:54.144968] I [mem-pool.c:539:mem_pool_destroy] 0-management: size=124 max=0 total=0
[2013-10-16 15:20:54.144999] I [socket.c:2235:socket_event_handler] 0-transport: disconnecting now
[2013-10-16 15:20:54.145286] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-10-16 15:20:54.145301] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully

Bala, can you check this?
Could not reproduce this after updating to the latest RHS U2 build.

# glusterfs --version
glusterfs 3.4.0.42.1u2rhs built on Nov 11 2013 05:03:23

vdsm: vdsm-4.13.0-21.el6rhs.x86_64

Tried the following:

I) Commit while another operation is in progress
1. Start remove-brick on a volume.
2. In parallel, start rebalance on another volume in the cluster.
3. Click on commit remove-brick (once migration completed).
4. In parallel, run remove-brick status from the gluster CLI (it returned an error that another operation is in progress).
-- Brick removal was committed.

II) Commit while the brick process was down
1. Start remove-brick on a volume.
2. Once migration completes, bring down the brick process of the brick being removed.
3. Click on commit remove-brick.
-- Brick removal was committed.

I'm moving this to ON_QA. Please reopen if you encounter it again.
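A rough CLI sketch of how the brick process was brought down for scenario II. The volume name and brick path are placeholders; the brick PID can be read from the gluster volume status output.

# 'gluster volume status' lists each brick with its PID; note the PID of the
# brick that is being removed (names below are placeholders).
gluster volume status vol1

# Kill that brick's glusterfsd process before committing.
kill -15 <brick-pid>

# Then commit from the UI, or equivalently from the CLI:
gluster volume remove-brick vol1 server1:/bricks/b3 commit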
This issue is not seen in the cb7-cb10 builds, so marking it as verified. Will reopen if I encounter it again.
I think we hit this today in BVT. We were executing replace-brick when we got the following error:

:: [ FAIL ] :: Running 'gluster volume replace-brick hosdu rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick7 rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick8 start' (Expected 0, got 146)

After hitting that, all subsequent tests failed with:

Connection failed. Please check if gluster daemon is operational.

:: [ FAIL ] :: Running 'gluster volume replace-brick hosdu rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick7 rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick8 commit' (Expected 0, got 1)

My guess is that this issue is intermittent, as this is the first time I have seen it and Prasanth hasn't seen it for several builds.

The package we were running was glusterfs-server-3.4.0.48geo-1386779541.el6.x86_64.rpm.

I am going to queue up several replace-brick tests on the latest build and attempt to repro (rough sketch below). I'll update with details as I have them.
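The kind of loop I plan to queue up, as a sketch only. The volume and brick paths mirror the failing BVT command above, but the iteration count and the abort-between-iterations handling are illustrative, not the actual BVT test code.

# Repeatedly exercise replace-brick start on the same volume to see how often
# the "return code 146" failure shows up. Count and cleanup are illustrative.
for i in $(seq 1 20); do
    gluster volume replace-brick hosdu \
        rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick7 \
        rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick8 start
    rc=$?
    echo "iteration $i: replace-brick start returned $rc"
    [ $rc -eq 146 ] && break
    # Abort the pending replace-brick so the next iteration starts clean.
    gluster volume replace-brick hosdu \
        rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick7 \
        rhsauto022.lab.eng.blr.redhat.com:/bricks/hosdu_brick8 abort
done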
Please attach sosreport.
Do we support replace-brick? I think that's being taken out... we should not run BVT tests for replace-brick...
We were able to repro last night:

Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: [2013/12/12 21:20:51.430075, 0] lib/fault.c:47(fault_report)
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: ===============================================================
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: [2013/12/12 21:20:51.451292, 0] lib/fault.c:48(fault_report)
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: INTERNAL ERROR: Signal 6 in pid 8263 (3.6.9-160.7.el6rhs)
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: Please read the Trouble-Shooting section of the Samba3-HOWTO
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: [2013/12/12 21:20:51.451687, 0] lib/fault.c:50(fault_report)
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]:
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: From: http://www.samba.org/samba/docs/Samba3-HOWTO.pdf
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: [2013/12/12 21:20:51.452078, 0] lib/fault.c:51(fault_report)
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: ===============================================================
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: [2013/12/12 21:20:51.452339, 0] lib/util.c:1117(smb_panic)
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: PANIC (pid 8263): internal error
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: [2013/12/12 21:20:51.735252, 0] lib/util.c:1221(log_stack_trace)
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: BACKTRACE: 11 stack frames:
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: #0 smbd(log_stack_trace+0x1a) [0x7f6ae51d34fa]
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: #1 smbd(smb_panic+0x2b) [0x7f6ae51d35cb]
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: #2 smbd(+0x41a054) [0x7f6ae51c4054]
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: #3 /lib64/libc.so.6(+0x32960) [0x7f6ae1070960]
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: #4 /lib64/libc.so.6(gsignal+0x35) [0x7f6ae10708e5]
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: #5 /lib64/libc.so.6(abort+0x175) [0x7f6ae10720c5]
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: #6 /lib64/libc.so.6(+0x707f7) [0x7f6ae10ae7f7]
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: #7 /lib64/libc.so.6(+0x76126) [0x7f6ae10b4126]
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: #8 /lib64/libc.so.6(+0x78c53) [0x7f6ae10b6c53]
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: #9 /usr/lib64/libtalloc.so.2(+0x2389) [0x7f6ae1c15389]
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: #10 /usr/lib64/libtalloc.so.2(+0x2323) [0x7f6ae1c15323]
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: [2013/12/12 21:20:51.811070, 0] lib/fault.c:372(dump_core)
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]: dumping core in /var/log/samba/cores/smbd
Dec 12 21:20:51 rhsauto019 GlusterFS[8263]:

I am attaching sosreports, but the cores are pretty large. I'll post a link to them.
Hi Ben,

From the SOS report, I see an rdma error in all the glusterd logs. I guess glusterd failed because of this.

[2013-12-13 07:43:25.634291] W [socket.c:522:__socket_rwv] 0-management: readv on 10.70.36.249:24007 failed (No data available)
[2013-12-13 07:43:26.269897] W [socket.c:522:__socket_rwv] 0-management: readv on 10.70.36.252:24007 failed (No data available)
[2013-12-13 07:43:26.511746] E [socket.c:2158:socket_connect_finish] 0-management: connection to 10.70.36.252:24007 failed (Connection refused)
[2013-12-13 07:43:29.517576] E [socket.c:2158:socket_connect_finish] 0-management: connection to 10.70.36.249:24007 failed (Connection refused)
[2013-12-13 07:43:32.225933] W [glusterfsd.c:1099:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x37532e894d] (-->/lib64/libpthread.so.0() [0x3753607851] (-->/usr/sbin/glusterd(glusterfs_sigwaiter+0xcd) [0x4052ad]))) 0-: received signum (15), shutting down
[2013-12-13 19:23:29.354599] I [glusterfsd.c:2026:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.4.0.48geo (/usr/sbin/glusterd --pid-file=/var/run/glusterd.pid)
[2013-12-13 19:23:29.771320] I [glusterd.c:1148:init] 0-management: Using /var/lib/glusterd as working directory
[2013-12-13 19:23:29.771717] I [rpcsvc.c:2058:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: Configured rpc.outstanding-rpc-limit with value 64
[2013-12-13 19:23:30.221835] I [socket.c:3505:socket_init] 0-socket.management: SSL support is NOT enabled
[2013-12-13 19:23:30.221896] I [socket.c:3520:socket_init] 0-socket.management: using system polling thread
[2013-12-13 19:23:30.384476] C [rdma.c:4099:gf_rdma_init] 0-rpc-transport/rdma: Failed to get IB devices
[2013-12-13 19:23:30.384724] E [rpc-transport.c:333:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
[2013-12-13 19:23:30.384802] W [rpcsvc.c:1481:rpcsvc_transport_create] 0-rpc-service: cannot create listener, initing the transport failed
[2013-12-13 19:23:30.384854] I [rpcsvc.c:2058:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: Configured rpc.outstanding-rpc-limit with value 64
[2013-12-13 19:23:30.385028] I [socket.c:3505:socket_init] 0-socket.management: SSL support is NOT enabled
[2013-12-13 19:23:30.385049] I [socket.c:3520:socket_init] 0-socket.management: using system polling thread
[2013-12-13 19:23:38.766390] E [store.c:407:gf_store_handle_retrieve] 0-: Unable to retrieve store handle /var/lib/glusterd/glusterd.info, error: No such file or directory
[2013-12-13 19:23:38.766499] E [glusterd-store.c:1337:glusterd_retrieve_op_version] 0-: Unable to get store handle!
[2013-12-13 19:23:38.766553] E [store.c:407:gf_store_handle_retrieve] 0-: Unable to retrieve store handle /var/lib/glusterd/glusterd.info, error: No such file or directory
[2013-12-13 19:23:38.766592] E [glusterd-store.c:1438:glusterd_retrieve_uuid] 0-: Unable to get store handle!
[2013-12-13 19:23:38.766610] I [glusterd-store.c:1408:glusterd_restore_op_version] 0-management: Detected new install. Setting op-version to maximum : 2
[2013-12-13 19:23:38.766966] E [store.c:407:gf_store_handle_retrieve] 0-: Unable to retrieve store handle /var/lib/glusterd/options, error: No such file or directory
We have a reproducible setup: glusterd hangs after starting replace-brick. I am working on finding the root cause of the issue and will update on the status soon.
After debugging the smbd crash further, it seems the smbd crash is not related to the glusterd issue. Hence, raising a new bug for the smbd issue.
BZ 1043519 raised for the smbd crash issue.
Patch: https://code.engineering.redhat.com/gerrit/17713
Ben/Lala,

Could you please run the BVT tests and see if you still hit this issue with glusterfs-3.4.0.52rhs-1.el6rhs?
I ran BVT after the fix was merged into the downstream code, and the tests are passing fine.
Marking the bug as verified, since BVT ran with the fix and the test case in question is now passing.
Please verify the edited doc text for technical accuracy.
doctext looks good to me.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html