Bug 1462508 - [Stress] : Worker crashed with "[Errno 16] Device or resource busy" [NEEDINFO]
[Stress] : Worker crashed with "[Errno 16] Device or resource busy"
Status: NEW
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: geo-replication (Show other bugs)
3.3
x86_64 Linux
unspecified Severity high
: ---
: ---
Assigned To: Aravinda VK
Rahul Hinduja
: ZStream
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2017-06-18 04:57 EDT by Ambarish
Modified: 2017-06-28 05:06 EDT (History)
7 users (show)

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
khiremat: needinfo? (rhinduja)
khiremat: needinfo? (avishwan)


Attachments (Terms of Use)

  None (edit)
Description Ambarish 2017-06-18 04:57:47 EDT
Description of problem:
-----------------------


4 Node Master
2 Node Slave.

Geo-rep syncing was I.P.

Was running Bonnie,smallfiles,iozone and kernel untar from 8 FUSE mounts.

Worker crashed with the following traceback :

[2017-06-16 18:05:19.641469] E [repce(/bricks2/A1):207:__call__] RepceClient: call 28951:139803919714112:1497636314.17 (entry_ops) failed on peer with OSError
[2017-06-16 18:05:19.641887] E [syncdutils(/bricks2/A1):296:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 204, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 780, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1566, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 570, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1204, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1111, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 994, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 935, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 16] Device or resource busy
[2017-06-16 18:05:19.644381] I [syncdutils(/bricks2/A1):237:finalize] <top>: exiting.



Version-Release number of selected component (if applicable):
-------------------------------------------------------------

3.8.4-28

How reproducible:
-----------------

1/1


Actual results:
---------------

Worker crashed.

Expected results:
-----------------

No worker crashed.

Additional info:
----------------


*MAster* :

 
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 29f18f45-c822-4c0e-84ef-737e128e0368
Status: Started
Snapshot Count: 0
Number of Bricks: 12 x 2 = 24
Transport-type: tcp
Bricks:
Brick1: gqas005.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick2: gqas013.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick4: gqas008.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick5: gqas005.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick6: gqas013.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick7: gqas006.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick8: gqas008.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick9: gqas005.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick10: gqas013.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick11: gqas006.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick12: gqas008.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick13: gqas005.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick14: gqas013.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick15: gqas006.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick16: gqas008.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick17: gqas005.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick18: gqas013.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick19: gqas006.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick20: gqas008.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick21: gqas005.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick22: gqas013.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick23: gqas006.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick24: gqas008.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Options Reconfigured:
nfs.disable: off
transport.address-family: inet
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 50000
server.event-threads: 4
client.event-threads: 4
geo-replication.indexing: on
geo-replication.ignore-pid-check: on
changelog.changelog: on
cluster.enable-shared-storage: enable
[root@gqas013 glusterfs]# 

*Slave* :

[root@gqas015 ~]# gluster v info
 
Volume Name: butcher
Type: Distributed-Disperse
Volume ID: 6de155ee-2200-44bb-a8ed-bdae5acf348f
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x (4 + 2) = 24
Transport-type: tcp
Bricks:
Brick1: gqas014.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick2: gqas015.sbu.lab.eng.bos.redhat.com:/bricks1/A1
Brick3: gqas014.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick4: gqas015.sbu.lab.eng.bos.redhat.com:/bricks2/A1
Brick5: gqas014.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick6: gqas015.sbu.lab.eng.bos.redhat.com:/bricks3/A1
Brick7: gqas014.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick8: gqas015.sbu.lab.eng.bos.redhat.com:/bricks4/A1
Brick9: gqas014.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick10: gqas015.sbu.lab.eng.bos.redhat.com:/bricks5/A1
Brick11: gqas014.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick12: gqas015.sbu.lab.eng.bos.redhat.com:/bricks6/A1
Brick13: gqas014.sbu.lab.eng.bos.redhat.com:/bricks7/A1
Brick14: gqas015.sbu.lab.eng.bos.redhat.com:/bricks7/A1
Brick15: gqas014.sbu.lab.eng.bos.redhat.com:/bricks8/A1
Brick16: gqas015.sbu.lab.eng.bos.redhat.com:/bricks8/A1
Brick17: gqas014.sbu.lab.eng.bos.redhat.com:/bricks9/A1
Brick18: gqas015.sbu.lab.eng.bos.redhat.com:/bricks9/A1
Brick19: gqas014.sbu.lab.eng.bos.redhat.com:/bricks10/A1
Brick20: gqas015.sbu.lab.eng.bos.redhat.com:/bricks10/A1
Brick21: gqas014.sbu.lab.eng.bos.redhat.com:/bricks11/A1
Brick22: gqas015.sbu.lab.eng.bos.redhat.com:/bricks11/A1
Brick23: gqas014.sbu.lab.eng.bos.redhat.com:/bricks12/A1
Brick24: gqas015.sbu.lab.eng.bos.redhat.com:/bricks12/A1
Options Reconfigured:
network.inode-lru-limit: 50000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
transport.address-family: inet
nfs.disable: off
[root@gqas015 ~]# 



[root@gqas013 glusterfs]# gluster volume geo-replication testvol gqas014.sbu.lab.eng.bos.redhat.com::butcher  status  detail
 
MASTER NODE                           MASTER VOL    MASTER BRICK    SLAVE USER    SLAVE                                          SLAVE NODE                            STATUS     CRAWL STATUS    LAST_SYNCED    ENTRY    DATA    META    FAILURES    CHECKPOINT TIME        CHECKPOINT COMPLETED    CHECKPOINT COMPLETION TIME   
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
gqas013.sbu.lab.eng.bos.redhat.com    testvol       /bricks1/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas014.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7299    0       0           2017-06-16 05:32:22    No                      N/A                          
gqas013.sbu.lab.eng.bos.redhat.com    testvol       /bricks2/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas014.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7321    0       0           2017-06-16 05:32:22    No                      N/A                          
gqas013.sbu.lab.eng.bos.redhat.com    testvol       /bricks3/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas014.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7303    0       0           2017-06-16 05:32:22    No                      N/A                          
gqas013.sbu.lab.eng.bos.redhat.com    testvol       /bricks4/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas014.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            8192     7318    0       0           2017-06-16 05:32:22    No                      N/A                          
gqas013.sbu.lab.eng.bos.redhat.com    testvol       /bricks5/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas014.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7308    0       0           2017-06-16 05:32:22    No                      N/A                          
gqas013.sbu.lab.eng.bos.redhat.com    testvol       /bricks6/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas014.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7293    0       0           2017-06-16 05:32:22    No                      N/A                          
gqas005.sbu.lab.eng.bos.redhat.com    testvol       /bricks1/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas015.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A                          
gqas005.sbu.lab.eng.bos.redhat.com    testvol       /bricks2/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas015.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A                          
gqas005.sbu.lab.eng.bos.redhat.com    testvol       /bricks3/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas015.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A                          
gqas005.sbu.lab.eng.bos.redhat.com    testvol       /bricks4/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas015.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A                          
gqas005.sbu.lab.eng.bos.redhat.com    testvol       /bricks5/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas015.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A                          
gqas005.sbu.lab.eng.bos.redhat.com    testvol       /bricks6/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas015.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A                          
gqas008.sbu.lab.eng.bos.redhat.com    testvol       /bricks1/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas014.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A                          
gqas008.sbu.lab.eng.bos.redhat.com    testvol       /bricks2/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas014.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A                          
gqas008.sbu.lab.eng.bos.redhat.com    testvol       /bricks3/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas014.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7310    0       0           2017-06-16 05:32:22    No                      N/A                          
gqas008.sbu.lab.eng.bos.redhat.com    testvol       /bricks4/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas014.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A                          
gqas008.sbu.lab.eng.bos.redhat.com    testvol       /bricks5/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas014.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A                          
gqas008.sbu.lab.eng.bos.redhat.com    testvol       /bricks6/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas014.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A                          
gqas006.sbu.lab.eng.bos.redhat.com    testvol       /bricks1/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas015.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7305    0       0           2017-06-16 05:32:22    No                      N/A                          
gqas006.sbu.lab.eng.bos.redhat.com    testvol       /bricks2/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas015.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            8192     7311    0       0           2017-06-16 05:32:22    No                      N/A                          
gqas006.sbu.lab.eng.bos.redhat.com    testvol       /bricks3/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas015.sbu.lab.eng.bos.redhat.com    Passive    N/A             N/A            N/A      N/A     N/A     N/A         N/A                    N/A                     N/A                          
gqas006.sbu.lab.eng.bos.redhat.com    testvol       /bricks4/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas015.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7311    0       0           2017-06-16 05:32:22    No                      N/A                          
gqas006.sbu.lab.eng.bos.redhat.com    testvol       /bricks5/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas015.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7309    0       0           2017-06-16 05:32:22    No                      N/A                          
gqas006.sbu.lab.eng.bos.redhat.com    testvol       /bricks6/A1     root          gqas014.sbu.lab.eng.bos.redhat.com::butcher    gqas015.sbu.lab.eng.bos.redhat.com    Active     Hybrid Crawl    N/A            0        7311    0       0           2017-06-16 05:32:22    No                      N/A                          
[root@gqas013 glusterfs]#
Comment 2 Ambarish 2017-06-18 05:14:44 EDT
I found these ones to be unique :

On master (gqas013) :

Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 204, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 780, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1566, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 570, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1204, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1111, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 994, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 935, in process_change
    failures = self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 16] Device or resource busy




And thsi on one of the slaves (gqas014):

[2017-06-18 08:55:59.804363] E [repce(slave):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 725, in entry_ops
    er = entry_purge(entry, gfid)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 618, in entry_purge
    if not matching_disk_gfid(gfid, entry):
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 650, in matching_disk_gfid
    disk_gfid = cls.gfid_mnt(entry)
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 392, in gfid_mnt
    cls.GX_GFID_CANONICAL_LEN], [ENOENT], [ESTALE])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 495, in errno_wrap
    return call(*arg)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 55, in lgetxattr
    return cls._query_xattr(path, siz, 'lgetxattr', attr)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 47, in _query_xattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 16] Device or resource busy
Comment 4 Kotresh HR 2017-06-19 02:45:09 EDT
I see the following errors from slave mount logs during the same time.

[2017-06-16 18:05:19.761267] W [fuse-bridge.c:471:fuse_entry_cbk] 0-glusterfs-fuse: 1069: MKDIR() /.gfid/01dd8259-061e-4cf5-be56-84a6f9d4c8d4 => -1 (Operation not permitted)
[2017-06-16 18:05:19.784205] W [fuse-bridge.c:471:fuse_entry_cbk] 0-glusterfs-fuse: 1096: MKDIR() /.gfid/b9f67339-5128-46ca-84a3-fb7d27748503 => -1 (Operation not permitted)

Not sure which xlator returned EBUSY. Strangely that's not logged. But this happens in rsync mode, which tries to create entry when the entry is not created.
We can avoid this by taking in patch [1]

1. Take in patch [1]
2. Also add logic to retry on EBUSY during lgetxattr in above path.


[1] https://review.gluster.org/#/c/16010/

Aravinda, Rahul,

What do you think?

Note You need to log in before you can comment on or make changes to this bug.