Bug 1315201 - [GSS] - smbd crashes on 3.1.1 with samba-vfs 4.1
Summary: [GSS] - smbd crashes on 3.1.1 with samba-vfs 4.1
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: samba
Version: rhgs-3.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.1.3
Assignee: Anoop C S
QA Contact: Vivek Das
URL:
Whiteboard:
Duplicates: 1314834
Depends On: 1317940 1319374 1319989
Blocks: 1299184
 
Reported: 2016-03-07 09:01 UTC by Mukul Malhotra
Modified: 2019-11-14 07:32 UTC
CC: 20 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 1301120
Environment:
Last Closed: 2016-06-23 05:10:50 UTC
Embargoed:


Attachments (Terms of Use)
Core dump file (14.93 MB, application/x-xz)
2016-03-07 09:04 UTC, Mukul Malhotra
Patch that applies on 3.1.2 rhgs (10.86 KB, patch)
2016-03-26 15:39 UTC, Poornima G


Links:
Red Hat Product Errata RHBA-2016:1240 (SHIPPED_LIVE): Red Hat Gluster Storage 3.1 Update 3 - 2016-06-23 08:51:28 UTC

Description Mukul Malhotra 2016-03-07 09:01:20 UTC
+++ This bug was initially created as a clone of Bug #1301120 +++

Description of problem:
Hi!

We are hitting the same problems as reported in bug 1234877.

smbd goes into a panic every 6 minutes and produces a core dump.

smbd[27140]: [2016/01/22 16:58:22.581586,  0] ../lib/util/fault.c:78(fault_report)
Jan 22 16:58:22 ch-mb-ph-gfs-01 smbd[27140]:  ===============================================================
Jan 22 16:58:22 ch-mb-ph-gfs-01 smbd[27140]: [2016/01/22 16:58:22.581611,  0] ../lib/util/fault.c:79(fault_report)
Jan 22 16:58:22 ch-mb-ph-gfs-01 smbd[27140]:  INTERNAL ERROR: Signal 6 in pid 27140 (4.2.3)
Jan 22 16:58:22 ch-mb-ph-gfs-01 smbd[27140]:  Please read the Trouble-Shooting section of the Samba HOWTO
Jan 22 16:58:22 ch-mb-ph-gfs-01 smbd[27140]: [2016/01/22 16:58:22.581622,  0] ../lib/util/fault.c:81(fault_report)
Jan 22 16:58:22 ch-mb-ph-gfs-01 smbd[27140]:  ===============================================================
Jan 22 16:58:22 ch-mb-ph-gfs-01 smbd[27140]: [2016/01/22 16:58:22.581629,  0] ../source3/lib/util.c:788(smb_panic_s3)
Jan 22 16:58:22 ch-mb-ph-gfs-01 smbd[27140]:  PANIC (pid 27140): internal error
Jan 22 16:58:22 ch-mb-ph-gfs-01 smbd[27140]: [2016/01/22 16:58:22.581807,  0] ../source3/lib/util.c:899(log_stack_trace)
Jan 22 16:58:22 ch-mb-ph-gfs-01 smbd[27140]:  BACKTRACE: 14 stack frames:
Jan 22 16:58:22 ch-mb-ph-gfs-01 smbd[27140]:   #0 /lib64/libsmbconf.so.0(log_stack_trace+0x1a) [0x7f2db2310cea]
Jan 22 16:58:22 ch-mb-ph-gfs-01 smbd[27140]:   #1 /lib64/libsmbconf.so.0(smb_panic_s3+0x20) [0x7f2db2310dc0]
Jan 22 16:58:22 ch-mb-ph-gfs-01 smbd[27140]:   #2 /lib64/libsamba-util.so.0(smb_panic+0x2f) [0x7f2db41608cf]
Jan 22 16:58:22 ch-mb-ph-gfs-01 smbd[27140]:   #3 /lib64/libsamba-util.so.0(+0x1aae6) [0x7f2db4160ae6]
Jan 22 16:58:22 ch-mb-ph-gfs-01 smbd[27140]:   #4 /lib64/libpthread.so.0(+0xf100) [0x7f2db4389100]
Jan 22 16:58:22 ch-mb-ph-gfs-01 smbd[27140]:   #5 /lib64/libc.so.6(gsignal+0x37) [0x7f2db09bf5f7]
Jan 22 16:58:22 ch-mb-ph-gfs-01 smbd[27140]:   #6 /lib64/libc.so.6(abort+0x148) [0x7f2db09c0ce8]
Jan 22 16:58:22 ch-mb-ph-gfs-01 smbd[27140]:   #7 /lib64/libc.so.6(+0x75317) [0x7f2db09ff317]
Jan 22 16:58:22 ch-mb-ph-gfs-01 smbd[27140]:   #8 /lib64/libc.so.6(+0x7cfe1) [0x7f2db0a06fe1]
Jan 22 16:58:22 ch-mb-ph-gfs-01 smbd[27140]:   #9 /lib64/libglusterfs.so.0(gf_timer_call_cancel+0x52) [0x7f2d9bc77652]
Jan 22 16:58:22 ch-mb-ph-gfs-01 smbd[27140]:   #10 /lib64/libglusterfs.so.0(gf_log_inject_timer_event+0x37) [0x7f2d9bc58de7]
Jan 22 16:58:22 ch-mb-ph-gfs-01 smbd[27140]:   #11 /lib64/libglusterfs.so.0(gf_timer_proc+0x10b) [0x7f2d9bc7781b]
Jan 22 16:58:22 ch-mb-ph-gfs-01 smbd[27140]:   #12 /lib64/libpthread.so.0(+0x7dc5) [0x7f2db4381dc5]
Jan 22 16:58:22 ch-mb-ph-gfs-01 smbd[27140]:   #13 /lib64/libc.so.6(clone+0x6d) [0x7f2db0a8021d]
Jan 22 16:58:22 ch-mb-ph-gfs-01 smbd[27140]: [2016/01/22 16:58:22.582688,  0] ../source3/lib/dumpcore.c:318(dump_core)
Jan 22 16:58:22 ch-mb-ph-gfs-01 smbd[27140]:  dumping core in /var/log/samba/cores/smbd
Jan 22 16:58:22 ch-mb-ph-gfs-01 smbd[27140]: 
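
(For triage, a full backtrace can usually be pulled from such a core with gdb, provided the matching samba and glusterfs debuginfo packages are installed. A minimal sketch; the core file name is illustrative, derived from the "dumping core in /var/log/samba/cores/smbd" line above:)

# gdb -batch -ex 'thread apply all bt full' /usr/sbin/smbd /var/log/samba/cores/smbd/core.27140 > /tmp/smbd-backtrace.txt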


Version-Release number of selected component (if applicable):
cat /etc/redhat-release 
CentOS Linux release 7.2.1511 (Core)

The gluster and samba packages come from the CentOS repos.

rpm -qa | grep gluster
glusterfs-fuse-3.7.6-1.el7.x86_64
glusterfs-coreutils-0.0.1-0.1.git0c86f7f.el7.x86_64
centos-release-gluster37-1.0-4.el7.centos.noarch
glusterfs-3.7.6-1.el7.x86_64
glusterfs-server-3.7.6-1.el7.x86_64
samba-vfs-glusterfs-4.2.3-11.el7_2.x86_64
glusterfs-client-xlators-3.7.6-1.el7.x86_64
glusterfs-cli-3.7.6-1.el7.x86_64
glusterfs-libs-3.7.6-1.el7.x86_64
glusterfs-api-3.7.6-1.el7.x86_64

rpm -qa | grep samba
samba-libs-4.2.3-11.el7_2.x86_64
samba-client-libs-4.2.3-11.el7_2.x86_64
samba-vfs-glusterfs-4.2.3-11.el7_2.x86_64
samba-common-4.2.3-11.el7_2.noarch
samba-4.2.3-11.el7_2.x86_64
samba-common-tools-4.2.3-11.el7_2.x86_64
samba-common-libs-4.2.3-11.el7_2.x86_64

Volume Name: ch-online
Type: Replicate
Volume ID: 9f91a44a-edd9-401c-9ecc-a40e7e01332c
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: ch-mb-ph-gfs-01:/gfs/brick1/brick
Brick2: ch-mb-ph-gfs-02:/gfs/brick1/brick
Options Reconfigured:
cluster.lookup-optimize: on
performance.stat-prefetch: off
cluster.ensure-durability: on
performance.normal-prio-threads: 16
performance.high-prio-threads: 32
performance.cache-size: 1024MB
performance.io-thread-count: 32
cluster.lookup-unhashed: off
server.allow-insecure: on
performance.readdir-ahead: on
client.bind-insecure: on
client.event-threads: 8
storage.owner-uid: 10003
storage.owner-gid: 10007

cat /etc/glusterfs/glusterd.vol
volume management
    type mgmt/glusterd
    option working-directory /var/lib/glusterd
    option transport-type socket,rdma
    option transport.socket.keepalive-time 10
    option transport.socket.keepalive-interval 2
    option transport.socket.read-fail-log off
    option ping-timeout 0
    option event-threads 1
    option rpc-auth-allow-insecure on
#   option base-port 49152

cat /etc/samba/smb.conf
[global]
    netbios name = ch-mb-ph-samba
    idmap backend = tdb2
    private dir = /mnt/ch-online/.smblock/
    workgroup = mediabank
    server string = Samba Server Version %v
    log file = /var/log/samba/%m.log
    max log size = 50
    security = user
    map to guest = Bad Password
    printing = bsd
    printcap name = /dev/null

[customer-data]
    path = /customer-data
    read only = no
    browseable = yes
    guest ok = no
    kernel share modes = no
    force user = mediabank-service
    create mask = 4770
    directory mask = 4770
    valid users = mediabank-service
    vfs objects = glusterfs
    glusterfs:loglevel = 7
    glusterfs:volume = ch-online
    glusterfs:volfile_server = localhost
    glusterfs:logfile = /var/log/samba/glusterfs-customer-data.%M.log

[MBFileExchangeMTBCH]
    path = /customer-data/CHMEDIATEC/FileExchange
    read only = no
    browseable = yes
    guest ok = no
    kernel share modes = no
    force user = mediabank-service
    create mask = 4770
    directory mask = 4770
    valid users = mediabank-service dvb
    vfs objects = glusterfs
    glusterfs:loglevel = 7
    glusterfs:volume = ch-online
    glusterfs:volfile_server = localhost
    glusterfs:logfile = /var/log/samba/glusterfs-fileexchange.%M.log

[postprodMTBCH]
    path = /customer-data/postprod
    read only = no
    browseable = yes
    guest ok = no
    kernel share modes = no
    force user = mediabank-service
    create mask = 4770
    directory mask = 4770
    valid users = mediabank-service postprod dvb
    vfs objects = glusterfs
    glusterfs:loglevel = 7
    glusterfs:volume = ch-online
    glusterfs:volfile_server = localhost
    glusterfs:logfile = /var/log/samba/glusterfs-postprod.%M.log
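
(As a side note, the share definitions above can be sanity-checked with Samba's own testparm before restarting services; a quick sketch:)

# testparm -s /etc/samba/smb.conf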

How reproducible:
Just start the smb service and have users access the different shares. There is no need for any heavy load to trigger this issue.

Steps to Reproduce:
1.
2.
3.

Actual results:
smbd crashes repeatedly and dumps core.

Expected results:
smbd does not crash.

Additional info:

--- Additional comment from Anders Rydmell on 2016-01-25 08:09 EST ---



--- Additional comment from Niels de Vos on 2016-03-06 01:30:31 EST ---

Bug 1234877 was fixed in the Samba package; we'll need to find out whether samba-4.2.3-11 contains that patch. If this requires a change to the Samba RPM, please update the product (RHEL?) and component.
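
(One quick way to check, assuming the fix is referenced in the package changelog by its upstream bug number:)

# rpm -q --changelog samba | grep -i -B1 -A2 '11115'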

--- Additional comment from Anoop C S on 2016-03-07 02:45:02 EST ---

Samba 4.2.3 already contains the fix for the issue mentioned in the following upstream bug:

https://bugzilla.samba.org/show_bug.cgi?id=11115

However, the backtrace provided here is different from what we saw in https://bugzilla.redhat.com/show_bug.cgi?id=1234877 and needs some investigation. Therefore https://bugzilla.samba.org/show_bug.cgi?id=11115 is not related to this bug.

See my reply to the following thread:

http://www.gluster.org/pipermail/gluster-users/2016-February/025293.html

From a quick look at the backtrace in dmesg, I suspect a race between some glusterfs timer-related threads, but we still need to find the exact root cause.

Comment 2 Mukul Malhotra 2016-03-07 09:04:04 UTC
Created attachment 1133702 [details]
Core dump file

Comment 4 Mukul Malhotra 2016-03-07 09:07:26 UTC
*** Bug 1314834 has been marked as a duplicate of this bug. ***

Comment 5 Mukul Malhotra 2016-03-07 14:21:39 UTC
<anoopcs> Mukul, Regarding 1315201
<anoopcs> Mukul, Do we have any reproducer?
<Mukul> anoopcs, no, I have to verify with customer if he can reproduce in his end
<anoopcs> Mukul, Ok. I will comment in the bug with the patch that I suspect to be the fix for this crash.
<Mukul> anoopcs, OK

Comment 7 Anoop C S 2016-03-09 02:18:21 UTC
Hi Mukul,

Thanks for all your support. As a first step, I was finally able to root-cause why we were getting truncated core files every time.

Due to the absence of the LimitCORE parameter in the smb service file, systemd defaults the soft and hard limits for core dump files to 0. On top of that, there is a strange piece of code in Samba that sets the soft limit to the maximum of 16MB and the current soft limit (which will be 0). Thus Samba crashes always end up creating truncated 16MB core files. This limitation in Samba has recently been fixed upstream (https://git.samba.org/?p=samba.git;a=commit;h=58d3462bc58290d8eb5e554c6c59cf6b73ccf58a).

So at this moment, I would like to request the customer to modify the smb service file (/usr/lib/systemd/system/smb.service) to include the following under the [Service] section, then restart the smb/ctdb services and try accessing the shares:

LimitCORE=infinity

After restarting the services, as mentioned in the previous comment (https://bugzilla.redhat.com/show_bug.cgi?id=1315201#c6), verify that cat /proc/<smbd-pid>/limits shows 'Max core file size' as unlimited in both the soft and hard limit columns.

You can then attach the new cores, which I would expect to be complete.
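
(A systemd drop-in is an alternative that avoids editing the packaged unit file and survives package updates; a sketch, assuming a systemd-based setup:)

# mkdir -p /etc/systemd/system/smb.service.d
# printf '[Service]\nLimitCORE=infinity\n' > /etc/systemd/system/smb.service.d/limitcore.conf
# systemctl daemon-reload
# systemctl restart smb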

Comment 10 Mukul Malhotra 2016-03-09 12:08:33 UTC
Hello Anoop,

I have attached the fresh core files.

Thanks
Mukul

Comment 11 Anoop C S 2016-03-10 10:38:17 UTC
Hi Mukul,

Since we can't move forward with the first approach to generating complete core files, I will now put forward an alternate procedure. This procedure enables Samba to dump complete cores for all new and existing client connections.

Prerequisite:
If not already present, install the util-linux package in order to get the prlimit binary.

1. Run the following one-liner to raise the core limit on every running smbd process:
# for i in $(pgrep smbd); do prlimit --pid=$i --core=unlimited; done

2. Verify the changes made in step 1 for the soft and hard limits:
(scripted way)
# > /tmp/samba-core-file-size; for i in $(pgrep smbd); do grep "Max core file size" /proc/$i/limits | tr -s ' ' | cut -d ' ' -f5 >> /tmp/samba-core-file-size; done
# cat /tmp/samba-core-file-size
must display "unlimited" on every line (field 5 is the soft limit).
# > /tmp/samba-core-file-size; for i in $(pgrep smbd); do grep "Max core file size" /proc/$i/limits | tr -s ' ' | cut -d ' ' -f6 >> /tmp/samba-core-file-size; done
# cat /tmp/samba-core-file-size
must display "unlimited" on every line (field 6 is the hard limit).

OR

(manual way)
For each pid in the output of `pgrep smbd`, check that both the soft and hard limits in the 'Max core file size' row of `cat /proc/<pid>/limits` show as unlimited.
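
(An equivalent one-liner that prints the soft and hard limits per pid in a single pass; the awk field numbers assume the standard /proc/<pid>/limits column layout:)

# for i in $(pgrep smbd); do awk -v p=$i '/Max core file size/ {print p": soft="$5" hard="$6}' /proc/$i/limits; done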

Comment 19 Mukul Malhotra 2016-03-15 13:22:20 UTC
Hello,

Thanks, Anoop, for the analysis of the core dump.

Waiting for your update on any workaround or patch that can be provided to the customer.

Mukul

Comment 26 Mukul Malhotra 2016-03-16 15:15:41 UTC
Hello,

Thanks Anoop

Mukul

Comment 27 Mukul Malhotra 2016-03-21 09:35:11 UTC
Hello Anoop,

Can the test build be prioritized? The customer is waiting for it.

Thanks
Mukul

Comment 31 Mukul Malhotra 2016-03-22 17:36:06 UTC
Hello Michael,

I have corrected the $subject.

Thanks
Mukul

Comment 34 Poornima G 2016-03-23 10:24:30 UTC
Hi,

I have attached the fixes for the crashes reported in this bug; both patches need to be applied on top of 3.7.1.16 (the version the customer is running). Let me know if you need anything else.

Comment 36 Bipin Kunal 2016-03-23 10:36:45 UTC
Poornima,

Please provide the patch link as well on the case.

Thanks,
Bipin Kunal

Comment 38 Poornima G 2016-03-23 11:50:36 UTC
Links to the corresponding upstream patches:

http://review.gluster.org/#/c/13784
http://review.gluster.org/#/c/6459/

Comment 39 Poornima G 2016-03-23 13:23:58 UTC
As discussed/concluded, if the customer is OK with the downtime needed to update to 3.1.2 (gluster, samba, ctdb), then we can provide the hotfix for 3.1.2; otherwise, for 3.1.1.

Hotfix for 3.1.1:
The patches are attached in the BZ:
- https://bugzilla.redhat.com/show_bug.cgi?id=1315201#c32
- https://bugzilla.redhat.com/show_bug.cgi?id=1315201#c33

Hotfix for 3.1.2:
The upstream patches apply cleanly and hence can be cherry-picked from the locations below:
- http://review.gluster.org/#/c/6459/
- http://review.gluster.org/#/c/13784
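
(For reference, a typical way to cherry-pick a Gerrit change onto a local glusterfs source tree. The URL and refs below are illustrative of the standard Gerrit workflow, following the refs/changes/<last-two-digits>/<change>/<patchset> convention; the patchset number /1 is an assumption, so take the exact fetch command from the Download box on each review page:)

# git fetch http://review.gluster.org/glusterfs refs/changes/84/13784/1 && git cherry-pick FETCH_HEAD
# git fetch http://review.gluster.org/glusterfs refs/changes/59/6459/1 && git cherry-pick FETCH_HEAD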

Comment 42 Raghavendra Talur 2016-03-25 13:56:17 UTC
Reply to comment 40: it is OK to exclude the libglusterfs/src/unittest/log_mock.c changes from the patch.

Reply to comment 41: accessing the same volume simultaneously from two or more Samba nodes that are not part of a Samba CTDB cluster will lead to a lot of problems with locking etc. It is preferable to create a test volume with the same contents on the updated node.

I would like to wait for Poornima's comment before proceeding with the test.

Comment 43 Poornima G 2016-03-26 15:39:22 UTC
Created attachment 1140610 [details]
Patch that applies on 3.1.2 rhgs

This patch should be used instead of http://review.gluster.org/#/c/13784

Comment 44 Poornima G 2016-03-26 15:50:52 UTC
Mukul,

Please use the attached patch, and http://review.gluster.org/#/c/6459/ for the build.

The problem was that a file modified by the original patch (libglusterfs/src/unittest/log_mock.c) is not packaged.

Regarding the testing, as mentioned above, the same gluster volume should not be used by standalone Samba and clustered Samba simultaneously, as this can lead to data corruption. Either avoid using the volume simultaneously or export a test volume from the standalone Samba node.

Also, I see that the volume has some NFS options set. Is the volume being accessed by Samba and NFS simultaneously?
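
(One way to check from the server side whether gluster's built-in NFS server is exporting the volume is to look for "NFS Server" rows in the volume status output; in glusterfs 3.7, gNFS is enabled unless nfs.disable is set:)

# gluster volume status ch-online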

Comment 45 Bipin Kunal 2016-03-28 05:54:46 UTC
Thanks, Poornima and Raghavendra. I removed the log_mock.c changes from the patch and it works fine.


@Mukul: The test fix is now available. Check the build:
          https://brewweb.devel.redhat.com/taskinfo?taskID=10737756

Please ask the customer to test this fix, and inform them that this build is for testing purposes only and has not been fully tested.

The hotfix will be provided once the customer is satisfied with the fix. Please recommend the necessary measures to be taken before they upgrade.

I would also recommend using these RPMs to test basic functionality: upgrading, creating a new volume, using an existing volume, using Samba, etc.


-Regards,
Bipin

Comment 46 Mukul Malhotra 2016-04-04 15:36:30 UTC
Hello,

The customer wants the hotfix build https://brewweb.devel.redhat.com/taskinfo?taskID=10770475 to be tested by QE, since they will be applying the hotfix to their production environment.

So, can QE test the hotfix build before we provide it to the customer? They have an ETA of tomorrow, i.e. Tuesday.

Thanks
Mukul

Comment 56 Bhaskarakiran 2016-04-11 10:38:32 UTC
From the QE side, we have run a couple of regression passes using both Windows and Linux CIFS clients and tried to simulate transcoding/encoding using a tool (multiple files were used from multiple clients). No crashes were seen during these runs.

Comment 59 Bipin Kunal 2016-04-12 10:33:19 UTC
I have forwarded the hotfix to the customer based on comments 46 and 56.

Rejy, Please provide hot_fix_requested+

Comment 61 Vivek Das 2016-04-14 09:43:37 UTC
Adding to the QE testing: apart from the sanity tests covering both Windows and Linux CIFS clients, we have also run the Iozone tool from multiple clients, and moreover we ran a rigorous test of heavy I/O combined with repeated simultaneous connects and disconnects of the mounted share. No crashes were seen during this run.
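
(For reference, a representative Iozone invocation against a CIFS mount; the mount point and file sizes are illustrative, not the exact parameters used in QE's runs:)

# iozone -a -n 64M -g 1G -f /mnt/cifs-share/iozone.tmp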

Comment 62 Vivek Das 2016-04-26 12:00:44 UTC
Transcoding/encoding tests over video file formats, and a rigorous test of heavy I/O with repeated simultaneous connects and disconnects of the mounted share, were performed on a Windows client.
No crashes were seen during these runs.

Comment 64 errata-xmlrpc 2016-06-23 05:10:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240

