Bug 1728183 - SMBD thread panics on file operations from Windows, OS X and Linux when using vfs_glusterfs
Keywords:
Status: NEW
Alias: None
Product: GlusterFS
Classification: Community
Component: gluster-smb
Version: 6
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Anoop C S
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-07-09 09:14 UTC by ryan
Modified: 2019-11-21 11:04 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:


Attachments
Windows error 01 (29.46 KB, image/png)
2019-07-09 09:14 UTC, ryan
Windows error 02 (22.76 KB, image/png)
2019-07-09 09:15 UTC, ryan
Samba Debug 10 logs (3.87 MB, text/plain)
2019-07-09 09:15 UTC, ryan
Gluster client logs (70.67 KB, text/plain)
2019-07-09 09:16 UTC, ryan
Screenshot of Windows 10 error (23.74 KB, image/png)
2019-10-17 14:47 UTC, ryan

Description ryan 2019-07-09 09:14:27 UTC
Created attachment 1588661 [details]
Windows error 01

Description of problem:
SMBD thread panics when a file operation is performed from a Windows, Linux or OS X client and the share is using the glusterfs VFS module, either on its own or in conjunction with others, e.g.:
>    vfs objects = catia fruit streams_xattr glusterfs


Gluster volume info:
Volume Name: mcv01
Type: Distributed-Replicate
Volume ID: 1580ab45-0a14-4f2f-8958-b55b435cdc47
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: mcn01:/mnt/h1a/mcv01_data
Brick2: mcn02:/mnt/h1b/mcv01_data
Brick3: mcn01:/mnt/h2a/mcv01_data
Brick4: mcn02:/mnt/h2b/mcv01_data
Options Reconfigured:
features.quota-deem-statfs: on
nfs.disable: on
features.inode-quota: on
features.quota: on
cluster.brick-multiplex: off
cluster.server-quorum-ratio: 50%


Version-Release number of selected component (if applicable):
Gluster 6.3
Samba 4.10.6-5

How reproducible:
Every time

Steps to Reproduce:
1. Mount share as mapped drive
2. Write to share or read from share

Actual results:
Multiple error messages, attached to bug
In OS X or Linux, running 'dd if=/dev/zero of=/mnt/share/test.dat bs=1M count=100' results in a hang. Tailing OS X console logs reveals that the share is timing out.

Expected results:
File operation is successful

Additional info:
Gluster client logs, and SMB debug 10 logs attached

Comment 1 ryan 2019-07-09 09:15:08 UTC
Created attachment 1588662 [details]
Windows error 02

Comment 2 ryan 2019-07-09 09:15:48 UTC
Created attachment 1588663 [details]
Samba Debug 10 logs

Comment 3 ryan 2019-07-09 09:16:25 UTC
Created attachment 1588664 [details]
Gluster client logs

Comment 4 ryan 2019-07-09 11:47:56 UTC
Tested on Gluster 6.1 with the same issue.
Gluster 5.6 works fine.

Comment 5 Anoop C S 2019-07-19 10:19:41 UTC
(In reply to ryan from comment #0)
> Created attachment 1588661 [details]
> Windows error 01
> 
> Description of problem:
> SMBD thread panics when a file operation performed from a Windows, Linux or
> OS X client when the share is using the glusterfs VFS module, either on its
> own, or in conjunction with others i.e.:
> >    vfs objects = catia fruit streams_xattr glusterfs
> 
> 
> Gluster volume info:
> Volume Name: mcv01
> Type: Distributed-Replicate
> Volume ID: 1580ab45-0a14-4f2f-8958-b55b435cdc47
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 2 x 2 = 4
> Transport-type: tcp
> Bricks:
> Brick1: mcn01:/mnt/h1a/mcv01_data
> Brick2: mcn02:/mnt/h1b/mcv01_data
> Brick3: mcn01:/mnt/h2a/mcv01_data
> Brick4: mcn02:/mnt/h2b/mcv01_data
> Options Reconfigured:
> features.quota-deem-statfs: on
> nfs.disable: on
> features.inode-quota: on
> features.quota: on
> cluster.brick-multiplex: off
> cluster.server-quorum-ratio: 50%
> 
> 
> Version-Release number of selected component (if applicable):
> Gluster 6.3
> Samba 4.10.6-5
> 
> How reproducible:
> Every time
> 
> Steps to Reproduce:
> 1. Mount share as mapped drive
> 2. Write to share or read from share
> 
> Actual results:
> Multiple error messages, attached to bug
> In OS X or Linux, running 'dd if=/dev/zero of=/mnt/share/test.dat bs=1M
> count=100' results in a hang. Tailing OS X console logs reveals that the
> share is timing out.

This is weird. Can you post your smb.conf?

Comment 6 ryan 2019-07-19 10:29:03 UTC
Hi Anoop,

It's very odd; I've got a feeling it's something related to the upgrade/downgrade process I've been using to test different versions of Gluster for the various bug tickets I have open.

Currently I'm using the following script to upgrade/downgrade (this one upgrades to 6):
yum remove centos-release-gluster* -y
yum install centos-release-gluster6 -y
yum remove glusterfs* -y
yum install glusterfs-server* -y
yum install sernet-samba-vfs-glusterfs -y
systemctl stop glusterd
systemctl stop glusterfsd
sed -i 's/operating-version=.*/operating-version=60000/gi' /var/lib/glusterd/glusterd.info
systemctl stop glusterfsd
systemctl restart glusterd
gluster volume set all cluster.op-version 60000
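
A few read-only checks around the script above may help catch op-version mismatches (a sketch; the file path matches the sed line in the script, adjust for your environment):

```shell
# Cluster-wide op-version as glusterd currently reports it:
gluster volume get all cluster.op-version

# The local daemon's view, i.e. the value the sed line rewrites:
grep operating-version /var/lib/glusterd/glusterd.info

# The maximum op-version the installed binaries support (the script's
# 60000 should not exceed this):
gluster volume get all cluster.max-op-version
```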

Could you flag any issues with this, and in particular recommend a way of downgrading?

SMB config:
[global]
security = ADS
workgroup = MAGENTA
realm = MAGENTA.LOCAL
netbios name = MAGENTANAS01
max protocol = SMB3
min protocol = SMB2
ea support = yes
clustering = yes
server signing = no
max log size = 10000
glusterfs:loglevel = 7
log file = /var/log/samba/log-%M.smbd
logging = file
log level = 2
template shell = /sbin/nologin
winbind offline logon = false
winbind refresh tickets = yes
winbind enum users = Yes
winbind enum groups = Yes
allow trusted domains = yes
passdb backend = tdbsam
idmap cache time = 604800
idmap negative cache time = 300
winbind cache time = 604800
idmap config magenta:backend = rid
idmap config magenta:range = 10000-999999
idmap config * : backend = tdb
idmap config * : range = 3000-7999
guest account = nobody
map to guest = bad user
force directory mode = 0777
force create mode = 0777
create mask = 0777
directory mask = 0777
hide unreadable = no
store dos attributes = no
unix extensions = no
load printers = no
printing = bsd
printcap name = /dev/null
disable spoolss = yes
glusterfs:volfile_server = localhost
kernel share modes = No
strict locking = auto
oplocks = yes
durable handles = yes
kernel oplocks = no
posix locking = no
level2 oplocks = no
readdir_attr:aapl_rsize = yes
readdir_attr:aapl_finder_info = no
readdir_attr:aapl_max_access = no
fruit:aapl = yes

[QC]
guest ok = no
read only = no
vfs objects = glusterfs
glusterfs:volume = mcv01
path = "/data/qc_only"
valid users = @"QC_ops"
recycle:repository = .recycle
recycle:keeptree = yes
recycle:versions = yes
recycle:directory_mode = 0770
recycle:subdir_mode = 0777
glusterfs:logfile = /var/log/samba/glusterfs-mcv01.%M.log

[QC-GlusterFuse]
guest ok = no
read only = no
vfs objects = glusterfs_fuse
path = "/mnt/mcv01/data/qc_only"
valid users = @"QC_ops"
recycle:repository = .recycle
recycle:keeptree = yes
recycle:versions = yes
recycle:directory_mode = 0770
recycle:subdir_mode = 0777
glusterfs:logfile = /var/log/samba/glusterfs-mcv01.%M.log

[QC-FUSE]
guest ok = no
read only = no
path = "/mnt/mcv01/data/qc_only"
valid users = @"QC_ops"
recycle:repository = .recycle
recycle:keeptree = yes
recycle:versions = yes
recycle:directory_mode = 0770
recycle:subdir_mode = 0777
glusterfs:logfile = /var/log/samba/glusterfs-mcv01-fuse.%M.log
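
Before reloading with a config like the above, a syntax pass may be worthwhile (testparm and smbcontrol are standard Samba utilities; the path shown is the usual default, adjust if yours differs):

```shell
# Parse and dump the effective configuration; -s skips the interactive prompt.
testparm -s /etc/samba/smb.conf

# Ask all running smbd processes to reload the config without dropping clients.
smbcontrol all reload-config
```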

______

Many thanks,
Ryan

Comment 7 ryan 2019-09-02 13:54:01 UTC
Anyone able to offer some assistance with this?
We're still seeing the issue on two of our servers after upgrading to Gluster 6.5 and Samba 4.10.7.

Comment 8 ryan 2019-10-17 14:46:40 UTC
Trying to copy a file with a Windows 10 client results in the transfer failing with an error (see screenshot).
Looking through the smb logs shows this:

  mag-desktop-01 (ipv4:10.0.3.12:57488) connect to service Grading initially as user editor01 (uid=2000, gid=2900) (pid 296596)
[2019/10/17 14:09:35.784481,  2] ../../source3/smbd/smbXsrv_open.c:675(smbXsrv_open_global_verify_record)
  smbXsrv_open_global_verify_record: key 'FA7F6275' server_id 296320 does not exist.
[2019/10/17 14:09:35.784509,  1] ../../librpc/ndr/ndr.c:422(ndr_print_debug)
       &global_blob: struct smbXsrv_open_globalB
          version                  : SMBXSRV_VERSION_0 (0)
          seqnum                   : 0x00000002 (2)
          info                     : union smbXsrv_open_globalU(case 0)
          info0                    : *
              info0: struct smbXsrv_open_global0
                  db_rec                   : NULL
                  server_id: struct server_id
                      pid                      : 0x0000000000048580 (296320)
                      task_id                  : 0x00000000 (0)
                      vnn                      : 0xffffffff (4294967295)
                      unique_id                : 0x3f2d4bc50a3ad530 (4552378107993707824)
                  open_global_id           : 0xfa7f6275 (4202652277)
                  open_persistent_id       : 0x00000000fa7f6275 (4202652277)
                  open_volatile_id         : 0x0000000037cfd301 (936366849)
                  open_owner               : S-1-5-21-3658843901-2482107748-408451428-1000
                  open_time                : Thu Oct 17 14:09:36 2019 BST
                  create_guid              : aea7fead-f0de-11e9-b036-b88584997125
                  client_guid              : aea7fb79-f0de-11e9-b036-b88584997125
                  app_instance_id          : 00000000-0000-0000-0000-000000000000
                  disconnect_time          : NTTIME(0)
                  durable_timeout_msec     : 0x0000ea60 (60000)
                  durable                  : 0x01 (1)
                  backend_cookie           : DATA_BLOB length=452
  [0000] 56 46 53 5F 44 45 46 41   55 4C 54 5F 44 55 52 41   VFS_DEFA ULT_DURA
  [0010] 42 4C 45 5F 43 4F 4F 4B   49 45 5F 4D 41 47 49 43   BLE_COOK IE_MAGIC
  [0020] 20 20 20 20 20 20 20 20   20 20 20 20 20 20 20 20
  [0030] 00 00 00 00 00 00 00 00   96 89 8E 03 00 00 00 00   ........ ........
  [0040] 39 E0 DF F2 31 0E 74 B0   00 00 00 00 00 00 00 00   9...1.t. ........
  [0050] 00 00 02 00 04 00 02 00   00 00 10 37 00 00 00 00   ........ ...7....
  skipping zero buffer bytes
  [0080] 96 89 8E 03 00 00 00 00   39 E0 DF F2 31 0E 74 B0   ........ 9...1.t.
  [0090] FF 81 00 00 00 00 00 00   01 00 00 00 00 00 00 00   ........ ........
  [00A0] D0 07 00 00 00 00 00 00   54 0B 00 00 00 00 00 00   ........ T.......
  [00B0] 00 00 00 00 00 00 00 00   7A E3 01 37 00 00 00 00   ........ z..7....
  [00C0] F3 67 A8 5D 00 00 00 00   00 00 00 00 00 00 00 00   .g.].... ........
  [00D0] F3 67 A8 5D 00 00 00 00   00 00 00 00 00 00 00 00   .g.].... ........
  [00E0] F3 67 A8 5D 00 00 00 00   00 00 00 00 00 00 00 00   .g.].... ........
  [00F0] F3 67 A8 5D 00 00 00 00   00 00 00 00 00 00 00 00   .g.].... ........
  [0100] 99 7F 00 00 00 00 00 00   08 00 00 00 00 00 00 00   ........ ........
  [0110] 00 00 00 00 00 00 00 00   06 00 00 00 00 00 00 00   ........ ........
  [0120] 06 00 00 00 2F 64 61 74   61 00 00 00 3C 00 00 00   ..../dat a...<...
  [0130] 00 00 00 00 3C 00 00 00   6E 65 77 20 66 6F 6C 64   ....<... new fold
  [0140] 65 72 20 66 72 6F 6D 20   6D 61 63 2F 77 38 6B 76   er from  mac/w8kv
  [0150] 76 2D 73 74 68 2D 34 38   66 70 73 2D 31 30 74 6F   v-sth-48 fps-10to
  [0160] 31 72 65 64 63 6F 64 65   5F 46 46 2E 52 44 43 2E   1redcode _FF.RDC.
  [0170] 7A 69 70 00 00 00 00 00   00 00 00 00 00 00 00 00   zip..... ........
  [0180] 00 00 00 00 00 00 00 00   F3 67 A8 5D 00 00 00 00   ........ .g.]....
  [0190] 00 00 00 00 00 00 00 00   F3 67 A8 5D 00 00 00 00   ........ .g.]....
  [01A0] 00 00 00 00 00 00 00 00   F3 67 A8 5D 00 00 00 00   ........ .g.]....
  [01B0] 00 00 00 00 00 00 00 00   F3 67 A8 5D 00 00 00 00   ........ .g.]....
  [01C0] 00 00 00 00                                        ....
                  channel_sequence         : 0x0000 (0)
                  channel_generation       : 0x0000000000000000 (0)
[2019/10/17 14:09:35.785374,  3] ../../source3/smbd/smb2_create.c:800(smbd_smb2_create_send)

Comment 9 ryan 2019-10-17 14:47:11 UTC
Created attachment 1626823 [details]
Screenshot of Windows 10 error

Comment 10 Anoop C S 2019-10-22 10:58:09 UTC
What are your current Samba and GlusterFS versions?

(In reply to ryan from comment #8)
> Trying to copy a file with a windows 10 client results in the transfer
> failing with error (See screenshot).

* Does it happen every time you attempt a copy of the same file?
* Is it something specific to a file/directory type?

>   mag-desktop-01 (ipv4:10.0.3.12:57488) connect to service Grading initially
> as user editor01 (uid=2000, gid=2900) (pid 296596)

I don't see a share named [Grading] in the smb.conf from comment #6. If that's newly added, were there any changes to the global parameters?

Comment 11 ryan 2019-10-22 15:32:19 UTC
Hi Anoop,

Versions:
Gluster = 6.5
Samba = 4.10.8

> Does it happen every time you attempt a copy of the same file?
> Is it something specific to a file/directory type?
Yes, this happens with any file being copied, or a new file being created (write fails). Happens in multiple directories 100% of the time.


In an effort to reduce the variables in play, I'd changed the config. Complete config below:

[global]
security = user
username map script = /bin/echo
max protocol = SMB3
min protocol = SMB2
ea support = yes
clustering = no
server signing = no
max log size = 10000
glusterfs:loglevel = 5
log file = /var/log/samba/log-%M.smbd
logging = file
log level = 3
template shell = /sbin/nologin
passdb backend = tdbsam
guest account = nobody
map to guest = bad user
force directory mode = 0777
force create mode = 0777
create mask = 0777
directory mask = 0777
hide unreadable = no
unix extensions = no
load printers = no
printing = bsd
printcap name = /dev/null
disable spoolss = yes
glusterfs:volfile_server = localhost
kernel share modes = No

[Grading]
read only = no
guest ok = yes
vfs objects = catia fruit streams_xattr glusterfs
glusterfs:volume = mcv01
path = "/data"
valid users = "nobody" @"audio" @"QC_ops" @"MAGENTA\domain admins" @"MAGENTA\domain users" @"nas_users"
glusterfs:logfile = /var/log/samba/glusterfs-mcv01.%M.log


Best,
Ryan

Comment 12 ryan 2019-10-30 13:57:23 UTC
Hi Anoop,

Did you get a chance to look into this?
Can I assist in any way?

Comment 13 ryan 2019-11-05 10:04:27 UTC
Hi Anoop,

I believe we have found the cause of this; however, we require some assistance with the workaround.
When running at op-version 40100 with Gluster 6.5 we don't have any issues.
However, when running at the maximum cluster op-version of 60000 we get lots of panics in the SMB logs.
I contacted Sernet about this, and it seems the issue is because they still compile the VFS against Gluster 3.12.
We're going to try testing with a package compiled against 6.5 to see if the issue goes away.

In the meantime, is it possible to downgrade the op-version?

Many thanks,
Ryan

Comment 14 Anoop C S 2019-11-18 10:50:08 UTC
(In reply to ryan from comment #13)
> and it seems the issue is because they still compile the VFS against Gluster 3.12.

GFAPI uses symbol versions. Unless some API got removed (zero chance of this happening), every old version of a modified API must still be present in newer GlusterFS. Assuming the Samba version is maintained, I am curious how such an incompatibility can lead to panics.
 
> We're going to try testing with a package compiled against 6.5 to see if the
> issue goes away.

How did it go?

> In the meantime, is it possible to downgrade the op-version?

I would suggest staying at the maximum available op-version to make use of the latest features in updated GlusterFS.

Comment 15 ryan 2019-11-18 10:58:22 UTC
Hi Anoop,

Below were the test versions and results

Gluster 4.1 (op-version 40100) + Sernet Samba Gluster VFS (Built against Gluster 3.12) = PASS
Gluster 6.5 (op-version 60000) + Sernet Samba Gluster VFS (Built against Gluster 3.12) = FAIL
Gluster 6.5 (op-version 40100) + Sernet Samba Gluster VFS (Built against Gluster 3.12) = PASS
Gluster 6.5 (op-version 60000) + Sernet Samba Gluster VFS (Built against Gluster 6.5) = PASS

The VFS packages Sernet compiled for us against Gluster 6.5 have resolved this issue.
I also downgraded the op-version by modifying the vol config files, which allowed the VFS built against Gluster 3.12 to work, fixing the issue.

Please let me know if you need any more info/data.
Best regards,
Ryan

Comment 16 Anoop C S 2019-11-18 12:01:14 UTC
(In reply to ryan from comment #15)
> Hi Anoop,
> 
> Below were the test versions and results
> 
> Gluster 4.1 (op-version 40100) + Sernet Samba Gluster VFS (Built against
> Gluster 3.12) = PASS

Expected.

> Gluster 6.5 (op-version 60000) + Sernet Samba Gluster VFS (Built against
> Gluster 3.12) = FAIL
> Gluster 6.5 (op-version 40100) + Sernet Samba Gluster VFS (Built against
> Gluster 3.12) = PASS

Just as the GlusterFS VFS module based on v3.12 works fine with op-version 40100, I would expect it to work with op-version 60000 too. Otherwise it needs some investigation.

> Gluster 6.5 (op-version 60000) + Sernet Samba Gluster VFS (Built against
> Gluster 6.5) = PASS

Fine.
 
> The VFS packages compiled for us by Sernet, against Gluster 6.5 has resolved
> this issue for us.
> I also downgraded the op-version by modifying the vol config files, which
> resulted in the Gluster 3.12 VFS, which fixed the issue.

Good.

> Please let me know if you need any more info/data.

I remember that you were blocked in testing bz #1680085 due to this bug. Can you re-visit bz #1680085 now?

Comment 17 Kaleb KEITHLEY 2019-11-19 14:54:47 UTC
IMO the results you see are consistent with the design of the versioned symbols in gfapi; i.e. that old programs (and other consumers of gfapi such as the Samba glusterfs VFS) that were originally compiled and linked with old libraries can be used with newer versions of gfapi without having to rebuild and relink.

For now this does imply that the cluster needs to run at the op-version associated with 3.12 (or close to it) if you're using a gluster VFS that was linked with the 3.12 libgfapi.
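
The versioned-symbol mechanism described above can be inspected directly (an illustrative sketch; the module and library paths vary by distribution and packaging):

```shell
# Which versioned gfapi symbols does the Samba glusterfs module reference?
readelf --dyn-syms /usr/lib64/samba/vfs/glusterfs.so | grep 'GFAPI_'

# Which symbol versions does the installed libgfapi export? A 6.x library
# still exports the old GFAPI_3.x versions, which is why a module built
# against 3.12 loads against it.
readelf --dyn-syms /usr/lib64/libgfapi.so.0 | grep -o 'GFAPI_[0-9.]*' | sort -u
```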

Comment 18 ryan 2019-11-20 11:40:20 UTC
Hi Kaleb,

Thanks for confirming.
Is there a recommended way of downgrading the op-version, other than editing the vol file?

Best,
Ryan

Comment 19 Niels de Vos 2019-11-21 09:31:13 UTC
I'm missing a little detail in this bug report. Compiling the vfs_gluster Samba module against glusterfs-3.12 results in a binary that can be used with glusterfs-6.x (on the Gluster client, i.e. the Samba server). It is not clear to me what version of the Gluster client was used in the test of comment #15. Did it match the version of the Gluster server, or was it kept at 3.12?

Comment 20 ryan 2019-11-21 09:40:25 UTC
Hi Niels,

Please see the revised comment; does this answer your question?

Gluster Server 4.1 (op-version 40100) + Sernet Samba Gluster VFS (Built against Gluster Client 3.12) = PASS
Gluster Server 6.5 (op-version 60000) + Sernet Samba Gluster VFS (Built against Gluster Client 3.12) = FAIL
Gluster Server 6.5 (op-version 40100) + Sernet Samba Gluster VFS (Built against Gluster Client 3.12) = PASS
Gluster Server 6.5 (op-version 60000) + Sernet Samba Gluster VFS (Built against Gluster Client 6.5) = PASS

Best,
Ryan

Comment 21 Niels de Vos 2019-11-21 10:14:01 UTC
Does that also mean the Gluster client packages on the Samba server are kept at the "Built against Gluster Client" version?

This is not a requirement from a libgfapi gluster-bindings perspective. It is expected to work correctly when Samba is compiled against glusterfs-3.12 but the resulting vfs_gluster module (built against Gluster client 3.12) is run on a system that has only the glusterfs-6.x versions installed. The built Samba/vfs_gluster binary should be compatible with glusterfs-6.x. It is recommended that Gluster clients and Gluster servers run the same Gluster version (even when Samba/vfs_gluster is built with an older version of Gluster).

Comment 22 ryan 2019-11-21 10:20:42 UTC
In our test case, the Gluster client and Samba server are on the same nodes as the Gluster server, so they all run the same version as the server.

Best,
Ryan

Comment 23 Niels de Vos 2019-11-21 11:04:52 UTC
Thanks!

In that case I'm really surprised to hear that different op-versions can cause a panic in Samba... Anoop would be the best person to help with this.

