Description of problem
======================

Sometimes the "Create Cluster" task fails without a clear reason. The lack of good logs makes it hard to debug, but it may be connected with a collectd/influxdb auth failure.

This BZ is concerned with *the fact that the create cluster operation fails* in this way every now and then. There is another BZ 1296215 which deals with the fact that there is no good evidence of what actually happens in similar cases.

Version-Release number of selected component
============================================

usm server machine:

~~~
# rpm -qa 'rhscon*'
rhscon-core-0.0.6-0.1.alpha1.el7.x86_64
rhscon-ui-0.0.6-0.1.alpha1.el7.noarch
rhscon-ceph-0.0.4-0.1.alpha1.el7.x86_64
# rpm -qa 'salt*'
salt-2015.5.5-1.el7.noarch
salt-master-2015.5.5-1.el7.noarch
# rpm -qa 'influxdb*'
influxdb-0.9.5.1-1.x86_64
# rpm -qa | grep mongo
mongodb-server-2.6.5-4.el7ost.x86_64
mongodb-2.6.5-4.el7ost.x86_64
~~~

node machines (OSD only):

~~~
# rpm -qa 'rhscon*'
rhscon-agent-0.0.3-0.2.alpha1.el7.noarch
# rpm -qa 'salt*'
salt-minion-2015.5.5-1.el7.noarch
salt-2015.5.5-1.el7.noarch
# rpm -qa collectd
collectd-5.4.2-1.el7ost.x86_64
~~~

node machines (OSD and MON):

~~~
# rpm -qa 'rhscon*'
rhscon-agent-0.0.3-0.2.alpha1.el7.noarch
# rpm -qa 'salt*'
salt-minion-2015.5.5-1.el7.noarch
salt-master-2015.5.5-1.el7.noarch
salt-2015.5.5-1.el7.noarch
# rpm -qa collectd
collectd-5.4.2-1.el7ost.x86_64
~~~

How reproducible
================

It happens now and then; I have noticed this in 2 cases out of 8 so far.

Steps to Reproduce
==================

1. Install skyring on the server and prepare a few hosts for cluster setup
2. Accept all nodes
3. Create a new cluster using the "Create Cluster" wizard

Actual results
==============

The Create Cluster task fails quickly after it is started (without a clear description of the problem available from the usm, as noted in BZ 1296215).

Skyring console log shows this (note that the operation started at 10:34 and failed immediately):

~~~
[negroni] Started POST /api/v1/clusters
ESC[36m2016-01-21T10:34:57.829+01:00 DEBUG cluster.go:173 POST_Clusters]ESC[0m Task Created: %!(EXTRA string=fb46db72-4718-4aca-a93d-c58e82817d83)
[negroni] Completed 202 Accepted in 6.674096ms
ESC[31m2016-01-21T10:34:57.837+01:00 ERROR utils.go:133 FailTask]ESC[0m Cluster creation failed: CreateCluster(): function failed at python side
[negroni] Started GET /api/v1/events
[negroni] Completed 200 OK in 1.562161ms
[negroni] Started GET /api/v1/tasks/fb46db72-4718-4aca-a93d-c58e82817d83
[negroni] Completed 200 OK in 4.224033ms
[negroni] Started GET /views/modal/custom-modal.html
[negroni] Completed 304 Not Modified in 117.358µs
[negroni] Started GET /api/v1/users/me
[negroni] Completed 200 OK in 4.308777ms
[negroni] Started GET /api/v1/clusters
[negroni] Completed 200 OK in 2.523943ms
[negroni] Started GET /api/v1/tasks/fb46db72-4718-4aca-a93d-c58e82817d83
[negroni] Completed 200 OK in 3.7848ms
[negroni] Started GET /api/v1/unmanaged_nodes
[negroni] Completed 200 OK in 93.116745ms
[negroni] Started GET /api/v1/unmanaged_nodes
[negroni] Completed 200 OK in 95.89419ms
~~~

Note that all log examples presented here are cut to show only the timeframe around the create cluster failure.
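The failed task can also be inspected directly via the same REST API endpoint the web UI polls (visible in the console log above). A minimal sketch, assuming an already authenticated session cookie; the server name and cookie file are placeholders:

~~~
# <usm-server> and session.cookie are placeholders, the task UUID is taken
# from the console log above
curl -s -b session.cookie \
    "http://<usm-server>/api/v1/tasks/fb46db72-4718-4aca-a93d-c58e82817d83" \
    | python -m json.tool
~~~

In the failing runs this does not reveal anything beyond the generic failure message quoted above, which is what BZ 1296215 tracks.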
There is no good evidence on the usm server:

* nothing special in /var/log/mongodb/mongodb.log:

~~~
2016-01-21T10:32:47.868+0100 [conn8] authenticate db: skyring { authenticate: 1, nonce: "xxx", user: "admin", key: "xxx" }
2016-01-21T10:33:07.871+0100 [conn8] authenticate db: skyring { authenticate: 1, nonce: "xxx", user: "admin", key: "xxx" }
2016-01-21T10:33:17.869+0100 [conn8] authenticate db: skyring { authenticate: 1, nonce: "xxx", user: "admin", key: "xxx" }
2016-01-21T10:33:37.878+0100 [conn8] authenticate db: skyring { authenticate: 1, nonce: "xxx", user: "admin", key: "xxx" }
2016-01-21T10:33:47.866+0100 [conn8] authenticate db: skyring { authenticate: 1, nonce: "xxx", user: "admin", key: "xxx" }
2016-01-21T10:34:07.867+0100 [conn8] authenticate db: skyring { authenticate: 1, nonce: "xxx", user: "admin", key: "xxx" }
2016-01-21T10:34:17.867+0100 [conn8] authenticate db: skyring { authenticate: 1, nonce: "xxx", user: "admin", key: "xxx" }
2016-01-21T10:34:37.866+0100 [conn8] authenticate db: skyring { authenticate: 1, nonce: "xxx", user: "admin", key: "xxx" }
2016-01-21T10:34:47.871+0100 [conn8] authenticate db: skyring { authenticate: 1, nonce: "xxx", user: "admin", key: "xxx" }
2016-01-21T10:34:57.831+0100 [conn3] authenticate db: skyring { authenticate: 1, nonce: "xxx", user: "admin", key: "xxx" }
2016-01-21T10:34:59.840+0100 [conn8] authenticate db: skyring { authenticate: 1, nonce: "xxx", user: "admin", key: "xxx" }
2016-01-21T10:35:07.870+0100 [conn8] authenticate db: skyring { authenticate: 1, nonce: "xxx", user: "admin", key: "xxx" }
2016-01-21T10:35:17.255+0100 [conn8] authenticate db: skyring { authenticate: 1, nonce: "xxx", user: "admin", key: "xxx" }
2016-01-21T10:35:32.257+0100 [conn8] authenticate db: skyring { authenticate: 1, nonce: "xxx", user: "admin", key: "xxx" }
2016-01-21T10:35:37.255+0100 [conn8] authenticate db: skyring { authenticate: 1, nonce: "xxx", user: "admin", key: "xxx" }
2016-01-21T10:35:47.252+0100 [conn8] authenticate db: skyring { authenticate: 1, nonce: "xxx", user: "admin", key: "xxx" }
2016-01-21T10:36:02.252+0100 [conn8] authenticate db: skyring { authenticate: 1, nonce: "xxx", user: "admin", key: "xxx" }
~~~

* nothing special in /var/log/skyring/bigfin.log:

~~~
2016-01-21T10:34:57.837+01:00 ERROR utils.go:133 FailTask] Cluster creation failed: CreateCluster(): function failed at python side
~~~

* nothing special in /var/log/skyring/bigfin.log:

~~~
2016-01-21T10:34:57.829+01:00 DEBUG cluster.go:173 POST_Clusters] Task Created: %!(EXTRA string=fb46db72-4718-4aca-a93d-c58e82817d83)
~~~

* nothing special in /var/log/influxdb/influxdb.log:

~~~
[wal] 2016/01/21 10:33:51 Flush due to idle. Flushing 10 series with 10 points and 620 bytes from partition 1
[wal] 2016/01/21 10:33:51 write to index of partition 1 took 4.263082ms
[wal] 2016/01/21 10:34:01 Flush due to idle. Flushing 10 series with 10 points and 620 bytes from partition 1
[wal] 2016/01/21 10:34:01 write to index of partition 1 took 8.919023ms
[wal] 2016/01/21 10:34:11 Flush due to idle. Flushing 10 series with 10 points and 620 bytes from partition 1
[wal] 2016/01/21 10:34:11 write to index of partition 1 took 6.906756ms
[wal] 2016/01/21 10:34:21 Flush due to idle. Flushing 10 series with 10 points and 620 bytes from partition 1
[wal] 2016/01/21 10:34:21 write to index of partition 1 took 9.810116ms
[wal] 2016/01/21 10:34:31 Flush due to idle. Flushing 10 series with 10 points and 620 bytes from partition 1
[wal] 2016/01/21 10:34:31 write to index of partition 1 took 5.614884ms
[wal] 2016/01/21 10:34:41 Flush due to idle. Flushing 10 series with 10 points and 620 bytes from partition 1
[wal] 2016/01/21 10:34:41 write to index of partition 1 took 5.868523ms
[wal] 2016/01/21 10:34:51 Flush due to idle. Flushing 10 series with 10 points and 620 bytes from partition 1
[wal] 2016/01/21 10:34:51 write to index of partition 1 took 13.605395ms
[wal] 2016/01/21 10:35:01 Flush due to idle. Flushing 10 series with 10 points and 620 bytes from partition 1
[wal] 2016/01/21 10:35:01 write to index of partition 1 took 7.658229ms
~~~

On the node machines:

* there is nothing in /var/log/skynet/skynet.log (besides the 1st init entry)
* there is no event in /var/log/salt/minion.log during the failure:

~~~
2016-01-21 10:23:17,869 [salt.utils.schedule][INFO ][325] Running scheduled job: __mine_interval
2016-01-21 10:23:17,871 [salt.utils.schedule][DEBUG ][325] schedule: This job was scheduled with jid_include, adding to cache (jid_include defaults to True)
2016-01-21 10:23:17,871 [salt.utils.schedule][DEBUG ][325] schedule: This job was scheduled with a max number of 2
2016-01-21 10:23:17,890 [salt.utils.schedule][DEBUG ][7567] schedule.handle_func: adding this job to the jobcache with data {'fun': 'mine.update', 'jid': '20160121102317878909', 'pid': 7567, 'id': 'mbukatov-usm1-node1.os1.phx2.redhat.com', 'schedule': '__mine_interval'}
2016-01-21 10:23:17,893 [salt.crypt ][DEBUG ][7567] Re-using SAuth for ('/etc/salt/pki/minion', 'mbukatov-usm1-node1.os1.phx2.redhat.com', 'tcp://172.16.180.90:4506')
2016-01-21 10:23:17,895 [salt.crypt ][DEBUG ][7567] Loaded minion key: /etc/salt/pki/minion/minion.pem
2016-01-21 10:23:17,909 [salt.utils.event ][DEBUG ][7567] MinionEvent PUB socket URI: ipc:///var/run/salt/minion/minion_event_b09f10ffda_pub.ipc
2016-01-21 10:23:17,909 [salt.utils.event ][DEBUG ][7567] MinionEvent PULL socket URI: ipc:///var/run/salt/minion/minion_event_b09f10ffda_pull.ipc
2016-01-21 10:23:17,912 [salt.utils.event ][DEBUG ][7567] Sending event - data = {'_stamp': '2016-01-21T09:23:17.911837'}
2016-01-21 10:23:17,913 [salt.minion ][DEBUG ][325] Handling event 'salt/event/new_client\n\n\x81\xa6_stamp\xba2016-01-21T09:23:17.911837'
2016-01-21 10:23:17,915 [salt.utils.event ][DEBUG ][7567] Sending event - data = {'_stamp': '2016-01-21T09:23:17.915037', 'clear': False, 'cmd': '_mine', 'tok': '4K\x98\xbc\xf9\xb9\xdaWh\xcf~\x99\xba9\x10\xf3\xdd\xaf\x81?\xf2`\xc3j|\xec\xb7\xb5\n\x90\xc0\t\xf9\xfa\xce\x05P`{}\x8b\x14\xd8\xcby\x94o0\xce\xfa%\xc6\x86\x07T\xb6\x9a\xe8\xe3\x9e\xc7\x9a\x8c\rbry\xf1\xc9\xf2J\x82\x9a>\xc4\xda\xd4\x9e\xaeo\xca\xe3\xc7A\xa7S\x8f\xc1y\xd8\xbd\xde\xb5<*\x8f36"}\n=tZ\xa8\xe5\xb1\xe6-|\xad\x0c\x00D\xb5\xfeq\x96\x10\x00\x0b\xe1\xe3\xb6&\x00#\xe5F\x03\x8e2\x0bN\x7f\xfb\xdcg\xc4\x92W\x17\xda\xce\xc1\xa94\xca\x14\xb7S\x91\x85\xec\x059\xdfs\xed$\xf0E\x899\xc2\x15SV\xfb \xb7]\xa2\xc879\xb7\xca\xe1Is\xfa\xa2\xad\xef\xc2\x0eXM)\xe9+\x15\x9f\x1b\x99\xe6[x\xc3\xd9\x17\xed\xa6\xe4~@A\x89}\x08\xd5WvV\xf3\x89\x82|\x14\x0c\x95q\xc2M\xf0r7\x05\x98\xc8Lg\xdc\x8ca\xa9\xcf\xf6?5d\x91\xc4\t\xd6\xb4\x8f"\xbdG\xbd\xe7\xe6O\xc2', 'data': {}, 'id': 'mbukatov-usm1-node1.os1.phx2.redhat.com'}
2016-01-21 10:23:17,916 [salt.minion ][DEBUG ][325] Handling event '_minion_mine\n\n\x86\xa6_stamp\xba2016-01-21T09:23:17.915037\xa5clear\xc2\xa3cmd\xa5_mine\xa3tok\xda\x01\x004K\x98\xbc\xf9\xb9\xdaWh\xcf~\x99\xba9\x10\xf3\xdd\xaf\x81?\xf2`\xc3j|\xec\xb7\xb5\n\x90\xc0\t\xf9\xfa\xce\x05P`{}\x8b\x14\xd8\xcby\x94o0\xce\xfa%\xc6\x86\x07T\xb6\x9a\xe8\xe3\x9e\xc7\x9a\x8c\rbry\xf1\xc9\xf2J\x82\x9a>\xc4\xda\xd4\x9e\xaeo\xca\xe3\xc7A\xa7S\x8f\xc1y\xd8\xbd\xde\xb5<*\x8f36"}\n=tZ\xa8\xe5\xb1\xe6-|\xad\x0c\x00D\xb5\xfeq\x96\x10\x00\x0b\xe1\xe3\xb6&\x00#\xe5F\x03\x8e2\x0bN\x7f\xfb\xdcg\xc4\x92W\x17\xda\xce\xc1\xa94\xca\x14\xb7S\x91\x85\xec\x059\xdfs\xed$\xf0E\x899\xc2\x15SV\xfb \xb7]\xa2\xc879\xb7\xca\xe1Is\xfa\xa2\xad\xef\xc2\x0eXM)\xe9+\x15\x9f\x1b\x99\xe6[x\xc3\xd9\x17\xed\xa6\xe4~@A\x89}\x08\xd5WvV\xf3\x89\x82|\x14\x0c\x95q\xc2M\xf0r7\x05\x98\xc8Lg\xdc\x8ca\xa9\xcf\xf6?5d\x91\xc4\t\xd6\xb4\x8f"\xbdG\xbd\xe7\xe6O\xc2\xa4data\x80\xa2id\xda\x00\'mbukatov-usm1-node1.os1.phx2.redhat.com'
2016-01-21 10:23:17,917 [salt.crypt ][DEBUG ][325] Re-using SAuth for ('/etc/salt/pki/minion', 'mbukatov-usm1-node1.os1.phx2.redhat.com', 'tcp://172.16.180.90:4506')
2016-01-21 10:23:19,944 [salt.crypt ][DEBUG ][7567] Re-using SAuth for ('/etc/salt/pki/minion', 'mbukatov-usm1-node1.os1.phx2.redhat.com', 'tcp://172.16.180.90:4506')
2016-01-21 11:23:17,281 [salt.utils.schedule][INFO ][325] Running scheduled job: __mine_interval
2016-01-21 11:23:17,283 [salt.utils.schedule][DEBUG ][325] schedule: This job was scheduled with jid_include, adding to cache (jid_include defaults to True)
2016-01-21 11:23:17,284 [salt.utils.schedule][DEBUG ][325] schedule: This job was scheduled with a max number of 2
2016-01-21 11:23:17,307 [salt.utils.schedule][DEBUG ][11671] schedule.handle_func: adding this job to the jobcache with data {'fun': 'mine.update', 'jid': '20160121112317293530', 'pid': 11671, 'id': 'mbukatov-usm1-node1.os1.phx2.redhat.com', 'schedule': '__mine_interval'}
2016-01-21 11:23:17,351 [salt.crypt ][DEBUG ][11671] Re-using SAuth for ('/etc/salt/pki/minion', 'mbukatov-usm1-node1.os1.phx2.redhat.com', 'tcp://172.16.180.90:4506')
2016-01-21 11:23:17,353 [salt.crypt ][DEBUG ][11671] Loaded minion key: /etc/salt/pki/minion/minion.pem
2016-01-21 11:23:17,384 [salt.utils.event ][DEBUG ][11671] MinionEvent PUB socket URI: ipc:///var/run/salt/minion/minion_event_b09f10ffda_pub.ipc
2016-01-21 11:23:17,384 [salt.utils.event ][DEBUG ][11671] MinionEvent PULL socket URI: ipc:///var/run/salt/minion/minion_event_b09f10ffda_pull.ipc
2016-01-21 11:23:17,386 [salt.utils.event ][DEBUG ][11671] Sending event - data = {'_stamp': '2016-01-21T10:23:17.386092'}
2016-01-21 11:23:17,388 [salt.minion ][DEBUG ][325] Handling event 'salt/event/new_client\n\n\x81\xa6_stamp\xba2016-01-21T10:23:17.386092'
2016-01-21 11:23:17,390 [salt.utils.event ][DEBUG ][11671] Sending event - data = {'_stamp': '2016-01-21T10:23:17.389795', 'clear': False, 'cmd': '_mine', 'tok': '4K\x98\xbc\xf9\xb9\xdaWh\xcf~\x99\xba9\x10\xf3\xdd\xaf\x81?\xf2`\xc3j|\xec\xb7\xb5\n\x90\xc0\t\xf9\xfa\xce\x05P`{}\x8b\x14\xd8\xcby\x94o0\xce\xfa%\xc6\x86\x07T\xb6\x9a\xe8\xe3\x9e\xc7\x9a\x8c\rbry\xf1\xc9\xf2J\x82\x9a>\xc4\xda\xd4\x9e\xaeo\xca\xe3\xc7A\xa7S\x8f\xc1y\xd8\xbd\xde\xb5<*\x8f36"}\n=tZ\xa8\xe5\xb1\xe6-|\xad\x0c\x00D\xb5\xfeq\x96\x10\x00\x0b\xe1\xe3\xb6&\x00#\xe5F\x03\x8e2\x0bN\x7f\xfb\xdcg\xc4\x92W\x17\xda\xce\xc1\xa94\xca\x14\xb7S\x91\x85\xec\x059\xdfs\xed$\xf0E\x899\xc2\x15SV\xfb \xb7]\xa2\xc879\xb7\xca\xe1Is\xfa\xa2\xad\xef\xc2\x0eXM)\xe9+\x15\x9f\x1b\x99\xe6[x\xc3\xd9\x17\xed\xa6\xe4~@A\x89}\x08\xd5WvV\xf3\x89\x82|\x14\x0c\x95q\xc2M\xf0r7\x05\x98\xc8Lg\xdc\x8ca\xa9\xcf\xf6?5d\x91\xc4\t\xd6\xb4\x8f"\xbdG\xbd\xe7\xe6O\xc2', 'data': {}, 'id': 'mbukatov-usm1-node1.os1.phx2.redhat.com'}
2016-01-21 11:23:17,391 [salt.minion ][DEBUG ][325] Handling event '_minion_mine\n\n\x86\xa6_stamp\xba2016-01-21T10:23:17.389795\xa5clear\xc2\xa3cmd\xa5_mine\xa3tok\xda\x01\x004K\x98\xbc\xf9\xb9\xdaWh\xcf~\x99\xba9\x10\xf3\xdd\xaf\x81?\xf2`\xc3j|\xec\xb7\xb5\n\x90\xc0\t\xf9\xfa\xce\x05P`{}\x8b\x14\xd8\xcby\x94o0\xce\xfa%\xc6\x86\x07T\xb6\x9a\xe8\xe3\x9e\xc7\x9a\x8c\rbry\xf1\xc9\xf2J\x82\x9a>\xc4\xda\xd4\x9e\xaeo\xca\xe3\xc7A\xa7S\x8f\xc1y\xd8\xbd\xde\xb5<*\x8f36"}\n=tZ\xa8\xe5\xb1\xe6-|\xad\x0c\x00D\xb5\xfeq\x96\x10\x00\x0b\xe1\xe3\xb6&\x00#\xe5F\x03\x8e2\x0bN\x7f\xfb\xdcg\xc4\x92W\x17\xda\xce\xc1\xa94\xca\x14\xb7S\x91\x85\xec\x059\xdfs\xed$\xf0E\x899\xc2\x15SV\xfb \xb7]\xa2\xc879\xb7\xca\xe1Is\xfa\xa2\xad\xef\xc2\x0eXM)\xe9+\x15\x9f\x1b\x99\xe6[x\xc3\xd9\x17\xed\xa6\xe4~@A\x89}\x08\xd5WvV\xf3\x89\x82|\x14\x0c\x95q\xc2M\xf0r7\x05\x98\xc8Lg\xdc\x8ca\xa9\xcf\xf6?5d\x91\xc4\t\xd6\xb4\x8f"\xbdG\xbd\xe7\xe6O\xc2\xa4data\x80\xa2id\xda\x00\'mbukatov-usm1-node1.os1.phx2.redhat.com'
2016-01-21 11:23:17,393 [salt.crypt ][DEBUG ][325] Re-using SAuth for ('/etc/salt/pki/minion', 'mbukatov-usm1-node1.os1.phx2.redhat.com', 'tcp://172.16.180.90:4506')
2016-01-21 11:23:19,420 [salt.crypt ][DEBUG ][11671] Re-using SAuth for ('/etc/salt/pki/minion', 'mbukatov-usm1-node1.os1.phx2.redhat.com', 'tcp://172.16.180.90:4506')
~~~

* it seems that collectd has an auth problem:

~~~
# journalctl -u collectd --since="2016-01-21 10:33" --until="2016-01-21 10:36"
-- Logs begin at Wed 2016-01-20 07:41:57 CET, end at Thu 2016-01-21 12:00:02 CET. --
Jan 21 10:33:02 mbukatov-usm1-node1.os1.phx2.redhat.com collectd[5402]: Collectd-InfluxDB.py: Error when sending metrics influxdb ([Errno 13] Permission denied)
Jan 21 10:33:32 mbukatov-usm1-node1.os1.phx2.redhat.com collectd[5402]: Collectd-InfluxDB.py: Error when sending metrics influxdb ([Errno 13] Permission denied)
Jan 21 10:34:02 mbukatov-usm1-node1.os1.phx2.redhat.com collectd[5402]: Collectd-InfluxDB.py: Error when sending metrics influxdb ([Errno 13] Permission denied)
Jan 21 10:34:32 mbukatov-usm1-node1.os1.phx2.redhat.com collectd[5402]: Collectd-InfluxDB.py: Error when sending metrics influxdb ([Errno 13] Permission denied)
Jan 21 10:35:02 mbukatov-usm1-node1.os1.phx2.redhat.com collectd[5402]: Collectd-InfluxDB.py: Error when sending metrics influxdb ([Errno 13] Permission denied)
Jan 21 10:35:32 mbukatov-usm1-node1.os1.phx2.redhat.com collectd[5402]: Collectd-InfluxDB.py: Error when sending metrics influxdb ([Errno 13] Permission denied)
~~~

Expected results
================

The Create Cluster operation finishes successfully.

Additional info
===============

When this issue happens, I can actually retry the Create Cluster operation via the usm web UI, but it always immediately fails in the same way. Moreover, given the lack of details, I'm not entirely sure about the root cause of this issue.
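As a side note on the suspected collectd/influxdb auth failure: the [Errno 13] errors reported by collectd could be narrowed down by exercising the InfluxDB HTTP API manually from the affected node. A minimal sketch, where the user, password and database name are placeholders that would have to be taken from the collectd InfluxDB plugin configuration on the node:

~~~
# placeholders: <usm-server>, <user>, <password>, <db>
curl -i -G -u '<user>:<password>' \
    "http://<usm-server>:8086/query?db=<db>" \
    --data-urlencode "q=SHOW MEASUREMENTS"
# HTTP 200 would suggest the InfluxDB credentials are fine and the EACCES
# seen by the collectd python plugin comes from the node itself (e.g. a
# denied outgoing connection), while HTTP 401 would point to an actual
# InfluxDB authentication problem
~~~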
This is a very old bug, filed before the integration with the new ceph-installer. If there is a cluster creation failure, I can find the reason why it failed (example in the attachment).

Last verified with:

ceph-ansible-1.0.5-31.el7scon.noarch
ceph-installer-1.0.14-1.el7scon.noarch
rhscon-ceph-0.0.37-1.el7scon.x86_64
rhscon-core-0.0.37-1.el7scon.x86_64
rhscon-core-selinux-0.0.37-1.el7scon.noarch
rhscon-ui-0.0.51-1.el7scon.noarch
Created attachment 1185152 [details]
cluster creation failure