Bug 1467958

Summary: [GSS] MQ objects damaged on pushing loads of messages
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Cal Calhoun <ccalhoun>
Component: CNS-deployment
Assignee: Michael Adam <madam>
Status: CLOSED NOTABUG
QA Contact: Anoop <annair>
Severity: urgent
Docs Contact:
Priority: urgent
Version: rhgs-3.2
CC: abhishku, akhakhar, annair, bkunal, ccalhoun, hchiramm, jrivera, madam, mliyazud, mzywusko, pprakash, rhs-bugs, rreddy, rtalur, tcarlin, vbellur, vinug
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-07-18 21:41:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  sosreport for node 10.98.60.24
  sosreport for node 10.98.62.148
  sosreport for node 10.98.62.152

Description Cal Calhoun 2017-07-05 15:49:29 UTC
Created attachment 1294660 [details]
heketi topology file

Description of problem:

Originally there were two separate gluster clusters, each with three nodes (one in each AZ) and replica 3 volumes.  The gluster clusters were built on OCP and deployed with heketi.  When large numbers of messages are pushed to the MQ queue, the queue files (MQ objects) get damaged.  On the theory that insufficient resources on the gluster nodes were causing the damage, the architecture was changed to a single gluster cluster of 6 nodes (still with replica 3 volumes).  Running the same load test against the new layout reproduces the same MQ object damage.
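
For context, the cluster layout described above is driven by the attached heketi topology file.  A deployment of this shape is typically created and inspected with commands along the following lines; the namespace and file name are illustrative, not taken from this case:

  # Deploy CNS into an OCP project from a topology file (names are examples):
  cns-deploy -n mq-storage -g topology.json
  # Or, against an already-running heketi service, load and inspect the topology:
  heketi-cli topology load --json=topology.json
  heketi-cli topology info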

Version-Release number of selected component (if applicable):

gluster 3.8.4
IBM MQ v9

Have requested cns and heketi versions.
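
For reference, this kind of version information can usually be gathered with standard commands on the gluster/heketi pods and the MQ host; the package list below is illustrative:

  # On a gluster or heketi pod/node:
  rpm -q cns-deploy heketi heketi-client glusterfs-server
  gluster --version
  # On the MQ host, dspmqver prints the installed IBM MQ version:
  dspmqver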

How reproducible:

Happens with regularity during load testing.

Steps to Reproduce:

Routine writing of large numbers of messages to the MQ queue appears to be enough to cause the damage.  I've requested more quantitative information on the message volume that triggers the problem; a rough substitute is sketched below.
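
Until those figures arrive, an approximation of the workload can be generated with the IBM MQ sample put program; the queue and queue manager names are placeholders, and the message count is a guess rather than the customer's actual volume:

  # Push 100,000 small messages at a hypothetical queue TEST.QUEUE on queue manager QM1.
  # amqsput reads one message per line from stdin and ends at EOF or a blank line.
  seq 1 100000 | sed 's/^/load test message /' | amqsput TEST.QUEUE QM1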

Actual results:

MQ queue files are being reported as 'damaged'.  I have asked for additional detail on what that means in practice.

Expected results:

Would expect MQ to be able to write messages to the gluster volume(s) without damaging the queue files.
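
As a sketch of how damage could be confirmed and recovered from on the MQ side: the queue manager error logs normally record which object was marked damaged, and rcrmqobj can recreate a damaged local queue object at the cost of the messages on it.  The queue manager name, queue name, and log path below are assumptions (default /var/mqm installation):

  # Look for damaged-object entries in the queue manager error logs:
  grep -i damaged /var/mqm/qmgrs/QM1/errors/AMQERR0*.LOG
  # Recreate a damaged local queue object (this destroys its messages):
  rcrmqobj -m QM1 -t ql TEST.QUEUE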

Additional info:

I have requested new MQ logs.
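
Since the queue files live on a replica 3 gluster volume, it would also be worth capturing the volume's self-heal and split-brain state while the new MQ logs are gathered; the volume name below is a placeholder:

  # Run from one of the gluster pods/nodes; VOLNAME is hypothetical:
  gluster volume info VOLNAME
  gluster volume heal VOLNAME info
  gluster volume heal VOLNAME info split-brain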

These errors were posted to the case:

Tolerations:    <none>
Events:
  FirstSeen     LastSeen        Count   From                                                    SubobjectPath   Type            Reason          Message
  ---------     --------        -----   ----                                                    -------------   --------        ------          -------
  1h            25m             16      {kubelet ip-10-98-60-24.eu-west-1.compute.internal}                     Warning         FailedMount     MountVolume.SetUp failed for volume "kubernetes.io/secret/353351d2-5d9c-11e7-85a9-0a879126be0e-default-token-qo8l4" (spec.Name: "default-token-qo8l4") pod "353351d2-5d9c-11e7-85a9-0a879126be0e" (UID: "353351d2-5d9c-11e7-85a9-0a879126be0e") with: Get https://internal-paperboyprj-techtest-master-1067458796.eu-west-1.elb.amazonaws.com:8443/api/v1/namespaces/storage-utif/secrets/default-token-qo8l4: net/http: TLS handshake timeout
  1h            4m              23      {kubelet ip-10-98-60-24.eu-west-1.compute.internal}                     Warning         FailedMount     MountVolume.SetUp failed for volume "kubernetes.io/secret/353351d2-5d9c-11e7-85a9-0a879126be0e-default-token-qo8l4" (spec.Name: "default-token-qo8l4") pod "353351d2-5d9c-11e7-85a9-0a879126be0e" (UID: "353351d2-5d9c-11e7-85a9-0a879126be0e") with: Get https://internal-paperboyprj-techtest-master-1067458796.eu-west-1.elb.amazonaws.com:8443/api/v1/namespaces/storage-utif/secrets/default-token-qo8l4: EOF

Tolerations:    <none>
Events:
  FirstSeen     LastSeen        Count   From                                                    SubobjectPath   Type            Reason          Message
  ---------     --------        -----   ----                                                    -------------   --------        ------          -------
  20h           32m             14      {kubelet ip-10-98-62-152.eu-west-1.compute.internal}                    Warning         FailedMount     MountVolume.SetUp failed for volume "kubernetes.io/secret/3569b8d3-5d9c-11e7-85a9-0a879126be0e-default-token-qo8l4" (spec.Name: "default-token-qo8l4") pod "3569b8d3-5d9c-11e7-85a9-0a879126be0e" (UID: "3569b8d3-5d9c-11e7-85a9-0a879126be0e") with: Get https://internal-paperboyprj-techtest-master-1067458796.eu-west-1.elb.amazonaws.com:8443/api/v1/namespaces/storage-utif/secrets/default-token-qo8l4: net/http: TLS handshake timeout
  20h           22m             30      {kubelet ip-10-98-62-152.eu-west-1.compute.internal}                    Warning         FailedMount     MountVolume.SetUp failed for volume "kubernetes.io/secret/3569b8d3-5d9c-11e7-85a9-0a879126be0e-default-token-qo8l4" (spec.Name: "default-token-qo8l4") pod "3569b8d3-5d9c-11e7-85a9-0a879126be0e" (UID: "3569b8d3-5d9c-11e7-85a9-0a879126be0e") with: Get https://internal-paperboyprj-techtest-master-1067458796.eu-west-1.elb.amazonaws.com:8443/api/v1/namespaces/storage-utif/secrets/default-token-qo8l4: EOF
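
These FailedMount warnings show the kubelets failing to fetch the default service-account token secret from the master API endpoint behind the ELB (TLS handshake timeout / EOF), which looks like an API-server reachability issue rather than anything on the gluster data path.  A sketch of how to re-check this from an affected node; the pod name is a placeholder and the namespace is taken from the events above:

  # Can the node reach the master API endpoint named in the events?
  curl -kv https://internal-paperboyprj-techtest-master-1067458796.eu-west-1.elb.amazonaws.com:8443/healthz
  # Recent events and pod status for the affected namespace:
  oc get events -n storage-utif
  oc describe pod <pod-name> -n storage-utif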

Comment 5 Cal Calhoun 2017-07-05 16:24:48 UTC
Version Information:

  cns-deploy-3.1.0-14.el7rhgs.x86_64
  heketi-3.1.0-14.el7rhgs.x86_64
  heketi-client-3.1.0-14.el7rhgs.x86_64

Comment 7 Cal Calhoun 2017-07-05 20:15:45 UTC
@Vijay: I'll attach the three that I have and ask for the others.

Comment 8 Cal Calhoun 2017-07-05 20:20:53 UTC
Created attachment 1294737 [details]
sosreport for node 10.98.60.24

Comment 9 Cal Calhoun 2017-07-05 20:21:54 UTC
Created attachment 1294738 [details]
sosreport for node 10.98.62.148

Comment 10 Cal Calhoun 2017-07-05 20:22:36 UTC
Created attachment 1294739 [details]
sosreport for node 10.98.62.152