Description of problem:

A brick process crashed on one node, with the following error repeated in the brick log:

"E [socket.c:2495:socket_poller] 0-tcp.gluster-native-volume-3G-1-server: error in polling loop"

Version-Release number of selected component (if applicable):

GlusterFS 3.6.3

[root@gqas006 ssl]# rpm -qa | grep gluster
glusterfs-hadoop-distribution-glusterfs-hadoop-setup_hadoop-0.1-122.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_gluster_selfheal-0.1-6.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-setup_bigtop-0.2.1-24.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_user_mapred_job-0.1-4.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_file_dir_permissions-0.1-9.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_home_dir_listing-0.1-5.noarch
glusterfs-libs-3.6.3-1.fc20.x86_64
glusterfs-geo-replication-3.6.3-1.fc20.x86_64
glusterfs-resource-agents-3.5.3-1.fc20.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_default_block_size-0.1-4.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_multiuser_support-0.1-4.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_multiple_volumes-0.1-18.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_bigtop_hive-0.1-12.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_gridmix3-0.1-2.noarch
glusterfs-devel-3.6.3-1.fc20.x86_64
glusterfs-hadoop-distribution-glusterfs-hadoop-setup_common-0.2-119.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-setup_gluster-0.2-78.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-glusterd_tests-0.2-1.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_bigtop-0.1-7.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_special_char_in_path-0.1-2.noarch
glusterfs-debuginfo-3.6.2-1.fc20.x86_64
glusterfs-hadoop-distribution-glusterfs-hadoop-test_dfsio_io_exception-0.1-8.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_ldap-0.1-6.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_bigtop_hadoop_hcfs_fileappend-0.1-5.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_missing_dirs_create-0.1-4.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_sqoop-0.1-2.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_bigtop_hadoop_hcfs_quota-0.1-6.noarch
glusterfs-3.6.3-1.fc20.x86_64
glusterfs-cli-3.6.3-1.fc20.x86_64
glusterfs-rdma-3.6.3-1.fc20.x86_64
glusterfs-hadoop-2.1.2-2.fc20.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_bigtop_hadoop_hcfs_testcli-0.2-7.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_dfsio-0.1-2.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_multifilewc_null_pointer_exception-0.1-6.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_gluster_quota_selfheal-0.2-11.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_append_to_file-0.1-6.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_bigtop_hbase-0.1-4.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_shim_access_error_messages-0.1-6.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_bigtop_hadoop_mapreduce-0.1-6.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_bigtop_mahout-0.1-6.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_erroneous_multivolume_filepaths-0.1-4.noarch
glusterfs-fuse-3.6.3-1.fc20.x86_64
glusterfs-server-3.6.3-1.fc20.x86_64
glusterfs-hadoop-javadoc-2.1.2-2.fc20.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_groovy_sync-0.1-24.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-setup_rhs_georep-0.1-3.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_setting_working_directory-0.1-2.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_junit_shim-0.1-13.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-setup_hadoop_security-0.0.1-11.noarch
glusterfs-extra-xlators-3.6.3-1.fc20.x86_64
glusterfs-hadoop-distribution-glusterfs-hadoop-test_brick_sorted_order_of_filenames-0.1-2.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_fs_counters-0.1-11.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_generate_gridmix2_data-0.1-3.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_selinux_persistently_disabled-0.1-2.noarch
glusterfs-hadoop-distribution-glusterfs-hadoop-test_bigtop_pig-0.1-9.noarch
glusterfs-api-3.6.3-1.fc20.x86_64
glusterfs-api-devel-3.6.3-1.fc20.x86_64
[root@gqas006 ssl]#

How reproducible:

I am not certain what caused the crash. I will update with more details if I manage to reproduce it again.

Steps to Reproduce:
1. Create a 2x2 distributed-replicated volume and start it.
2. Create a private key and a public certificate for each server and client node.
3. Concatenate the certificates into a CA file, copy it to all server and client nodes, and set the volume options needed for SSL/TLS to work, as described in https://github.com/gluster/glusterfs/blob/master/doc/admin-guide/en-US/markdown/admin_ssl.md (see the command sketch below).
4. Mount the volume from the client.
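For reference, a minimal command sketch of steps 2-4 along the lines of the admin_ssl.md guide linked above. These are assumed commands, not the exact ones used on this setup; the CN value, the list of certificates, and the auth.ssl-allow value are illustrative:

# Step 2: on every server and client node, create a private key and a
# self-signed certificate in the locations GlusterFS expects.
openssl genrsa -out /etc/ssl/glusterfs.key 2048
openssl req -new -x509 -key /etc/ssl/glusterfs.key \
    -subj "/CN=$(hostname)" -out /etc/ssl/glusterfs.pem

# Step 3: concatenate every node's glusterfs.pem into one CA file,
# copy it to all server and client nodes, then enable SSL/TLS on the
# volume (volume name taken from this report).
cat node1.pem node2.pem node3.pem client1.pem > /etc/ssl/glusterfs.ca
gluster volume set gluster-native-volume-3G-1 client.ssl on
gluster volume set gluster-native-volume-3G-1 server.ssl on
gluster volume set gluster-native-volume-3G-1 auth.ssl-allow '*'

# Step 4: mount from the client.
mount -t glusterfs 10.16.156.12:/gluster-native-volume-3G-1 /mnt/glusterfs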
Actual results:

Bricks crash on some of the nodes.

Expected results:

Bricks should not crash.

Additional info:

[root@remote-gluster-server ~]# gluster volume status
Status of volume: gluster-native-volume-1G-1
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.16.156.12:/rhs/brick1/newvol7                  49159   Y       22418
Brick 10.16.156.15:/rhs/brick1/newvol7                  49159   Y       6059
Brick 10.16.156.24:/rhs/brick1/newvol7                  49170   Y       24581
Brick 10.16.156.24:/rhs/brick2/newvol7                  49171   Y       24605
NFS Server on localhost                                 2049    Y       24043
Self-heal Daemon on localhost                           N/A     Y       24050
NFS Server on gqas006.sbu.lab.eng.bos.redhat.com        2049    Y       7212
Self-heal Daemon on gqas006.sbu.lab.eng.bos.redhat.com  N/A     Y       7219
NFS Server on gqas009.sbu.lab.eng.bos.redhat.com        2049    Y       26026
Self-heal Daemon on gqas009.sbu.lab.eng.bos.redhat.com  N/A     Y       26033

Task Status of Volume gluster-native-volume-1G-1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: gluster-native-volume-3G-1
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.16.156.12:/rhs/brick1/newvol8                  49160   Y       13626
Brick 10.16.156.15:/rhs/brick1/newvol8                  N/A     N       31104  ---> Brick crashed
Brick 10.16.156.24:/rhs/brick1/newvol8                  N/A     N       14854
Brick 10.16.156.24:/rhs/brick2/newvol8                  49173   Y       14865
NFS Server on localhost                                 2049    Y       24043
Self-heal Daemon on localhost                           N/A     Y       24050
NFS Server on gqas006.sbu.lab.eng.bos.redhat.com        2049    Y       7212
Self-heal Daemon on gqas006.sbu.lab.eng.bos.redhat.com  N/A     Y       7219
NFS Server on gqas009.sbu.lab.eng.bos.redhat.com        2049    Y       26026
Self-heal Daemon on gqas009.sbu.lab.eng.bos.redhat.com  N/A     Y       26033

Task Status of Volume gluster-native-volume-3G-1
------------------------------------------------------------------------------
There are no active volume tasks

[root@remote-gluster-server ~]# yum info openssl
Installed Packages
Name        : openssl
Arch        : x86_64
Epoch       : 1
Version     : 1.0.1e
Release     : 42.fc20
Size        : 1.5 M
Repo        : installed
From repo   : fedora-updates
Summary     : Utilities from the general purpose cryptography library with TLS implementation
URL         : http://www.openssl.org/
License     : OpenSSL
Description : The OpenSSL toolkit provides support for secure communications between
            : machines. OpenSSL includes a certificate management tool and shared
            : libraries which provide various cryptographic algorithms and
            : protocols.
[root@remote-gluster-server ~]#

Brick log excerpt:

[2015-04-29 09:32:46.921692] E [socket.c:2495:socket_poller] 0-tcp.gluster-native-volume-3G-1-server: error in polling loop
[2015-04-29 09:32:47.927424] E [socket.c:2495:socket_poller] 0-tcp.gluster-native-volume-3G-1-server: error in polling loop
[2015-04-29 09:32:49.084098] E [socket.c:2495:socket_poller] 0-tcp.gluster-native-volume-3G-1-server: error in polling loop
[2015-04-29 09:32:49.242428] E [socket.c:2495:socket_poller] 0-tcp.gluster-native-volume-3G-1-server: error in polling loop
[2015-04-29 09:32:50.089756] E [socket.c:2495:socket_poller] 0-tcp.gluster-native-volume-3G-1-server: error in polling loop
[2015-04-29 09:32:50.250215] E [socket.c:2495:socket_poller] 0-tcp.gluster-native-volume-3G-1-server: error in polling loop

pending frames:
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2015-04-29 09:32:51
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.6.3
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xb2)[0x7f2022cac362]
/lib64/libglusterfs.so.0(gf_print_trace+0x32d)[0x7f2022cc385d]
/lib64/libc.so.6(+0x358f0)[0x7f2021cc68f0]
/lib64/libcrypto.so.10(sk_value+0x19)[0x7f20221323f9]
/lib64/libcrypto.so.10(+0x10126b)[0x7f202215026b]
/lib64/libcrypto.so.10(ASN1_item_ex_i2d+0x163)[0x7f2022154f03]
/lib64/libcrypto.so.10(+0x1061ff)[0x7f20221551ff]
/lib64/libcrypto.so.10(X509_NAME_cmp+0x5a)[0x7f202216963a]
/lib64/libcrypto.so.10(X509_check_issued+0x28)[0x7f202217b628]
/lib64/libcrypto.so.10(+0x11b8a5)[0x7f202216a8a5]
/lib64/libcrypto.so.10(X509_verify_cert+0xb4)[0x7f202216bfa4]
/lib64/libssl.so.10(ssl3_output_cert_chain+0x1a8)[0x7f2013bacb68]
/lib64/libssl.so.10(ssl3_send_server_certificate+0x35)[0x7f2013ba03d5]
/lib64/libssl.so.10(ssl3_accept+0xd1d)[0x7f2013ba184d]
/usr/lib64/glusterfs/3.6.3/rpc-transport/socket.so(+0x478a)[0x7f2013def78a]
/usr/lib64/glusterfs/3.6.3/rpc-transport/socket.so(+0x5e50)[0x7f2013df0e50]
/usr/lib64/glusterfs/3.6.3/rpc-transport/socket.so(+0xb159)[0x7f2013df6159]
/lib64/libpthread.so.0(+0x7ee5)[0x7f202243eee5]
/lib64/libc.so.6(clone+0x6d)[0x7f2021d85d1d]
---------

sos-reports: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/sosreport-gqas006.sbu.lab.eng.bos.redhat.com-20150504043010.tar.xz
I was unable to reproduce this (in 100 tries) on a Fedora 21 system with the 3.6.3-1 packages from download.gluster.org and OpenSSL 1.0.1j. I notice that you were using OpenSSL 1.0.1e. Before I downgrade my test system, or build a new one, can we please verify that 1.0.1e was the correct OpenSSL version to have on your test system?
Also, how exactly were the certificates generated? What SSL "subject" did you use?
It's possible that this is a manifestation of a multi-threading issue, which tends to show up in X509_verify_cert. See http://review.gluster.org/#/c/10075/ for details. That would explain the non-deterministic appearance of the bug. Perhaps we need to backport that patch to 3.6?
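For background on why a multi-threading bug can crash here: OpenSSL 1.0.x is not thread-safe unless the application registers locking and thread-ID callbacks, and each SSL connection's handshake runs in its own polling thread, so two concurrent handshakes can race on shared certificate state inside X509_verify_cert, which matches the backtrace above. The following is a minimal sketch of the standard OpenSSL 1.0.x multi-threading setup that the patch above is about; the function names here are illustrative, not the actual patch code:

#include <pthread.h>
#include <openssl/crypto.h>

/* One mutex per OpenSSL internal lock slot. */
static pthread_mutex_t *lock_array;

static void locking_func(int mode, int type, const char *file, int line)
{
        (void) file;
        (void) line;
        if (mode & CRYPTO_LOCK)
                pthread_mutex_lock(&lock_array[type]);
        else
                pthread_mutex_unlock(&lock_array[type]);
}

static unsigned long id_func(void)
{
        return (unsigned long) pthread_self();
}

/* Call once, before any SSL connections are created. */
static void ssl_thread_setup(void)
{
        int i, num = CRYPTO_num_locks();

        lock_array = OPENSSL_malloc(num * sizeof(pthread_mutex_t));
        for (i = 0; i < num; i++)
                pthread_mutex_init(&lock_array[i], NULL);

        CRYPTO_set_id_callback(id_func);
        CRYPTO_set_locking_callback(locking_func);
}

Because the race only fires when two handshakes overlap in time, the crash would appear non-deterministic, which fits what is reported here.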
REVIEW: http://review.gluster.org/10591 (socket: use OpenSSL multi-threading interfaces) posted (#1) for review on release-3.6 by Jeff Darcy (jdarcy)
After talking with Kaushal earlier, I came to know that the OpenSSL version to be used is 1.0.1e. The I/O data access path from a single client (without management SSL/TLS enabled) was working fine with GlusterFS 3.6.2. Do we need to use OpenSSL 1.0.1j? I am using Fedora 20.

Installed Packages
Name        : openssl
Arch        : x86_64
Epoch       : 1
Version     : 1.0.1e
Release     : 42.fc20
Size        : 1.5 M
Repo        : installed
From repo   : fedora-updates
Summary     : Utilities from the general purpose cryptography library with TLS implementation
URL         : http://www.openssl.org/
License     : OpenSSL
Description : The OpenSSL toolkit provides support for secure communications between
            : machines. OpenSSL includes a certificate management tool and shared
            : libraries which provide various cryptographic algorithms and
            : protocols.
[root@remote-gluster-server ~]#
REVIEW: http://review.gluster.org/10591 (socket: use OpenSSL multi-threading interfaces) posted (#2) for review on release-3.6 by Jeff Darcy (jdarcy)
This issue is being seen more frequently, so I am marking it as a blocker.
I think the OpenSSL version is a red herring. At the time I asked, I was still pretty much in the dark and trying to gather information; I hadn't yet realized that the symptom here closely matches the one that http://review.gluster.org/#/c/10075/ fixed in later versions. I've posted http://review.gluster.org/10591 as a 3.6 backport, plus http://review.gluster.org/10617 so that it can pass regression tests (nothing passes on release-3.6 right now because of changes to the test machines). Both *have* passed regression tests, and merely await review/merging.
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-v3.6.4, please open a new bug report. glusterfs-v3.6.4 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/gluster-users/2015-July/022826.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user