Created attachment 1132231 [details]
screenshot of Hosts page (initializing circle is shown there)

Description of problem
======================

When I ran the "accept all" task, all "Accept Node" tasks finished, but the
"Initialize node" tasks were never started, and when I look at the Hosts
page, I see all nodes stuck in the "Initializing ..." state.

Version-Release number of selected component
============================================

rhscon-ceph-0.0.6-8.el7.x86_64
rhscon-core-0.0.8-7.el7.x86_64
rhscon-ui-0.0.16-1.el7.noarch
rhscon-agent-0.0.3-2.el7.noarch

How reproducible
================

5 % (I hit this only once)

Steps to Reproduce
==================

1. Install USM according to the documentation
2. Accept all nodes
3. Check the Tasks and Hosts pages

Actual results
==============

The Tasks page shows all "Accept Node" tasks as finished, but doesn't show
any "Initialize" task. The Hosts page shows all nodes in the
"Initializing ..." phase (the state icon is spinning in a circle).

Expected results
================

Initialize tasks are actually started and finish with success. In case of a
problem, the admin can kill the stuck task. Moreover, some timeout should be
introduced.

Additional info
===============

I let it run in this state for a few days, but nothing changed. All nodes
were still "Initializing ...". Based on a discussion with the developers,
a solution (timeouts, the ability to reinitialize a node) has already been
implemented or is coming soon. This BZ is here for the QE team to test this
use case properly.
Is it reproducible? We are not seeing this issue.
I have just reproduced this issue again with the current build.

ceph storage machines:

* rhscon-agent-0.0.6-1.el7scon.noarch

RHSC 2.0 machine:

* ceph-ansible-1.0.5-5.el7scon.noarch
* ceph-installer-1.0.6-1.el7scon.noarch
* rhscon-ceph-0.0.13-1.el7scon.x86_64
* rhscon-core-0.0.16-1.el7scon.x86_64
* rhscon-ui-0.0.29-1.el7scon.noarch

This time, *the outcome is not as bad* as it was when I hit this issue for
the first time, but it's bad enough to reopen this bug for further
investigation and tracking.

Here is the updated *Actual results* section:

The Tasks page shows all "Accept Node" tasks as finished, but doesn't show
an "Initialize" task for 3 hosts (out of 7 hosts in total). First, the Hosts
page showed these 3 hosts in the "Initializing ..." phase (the state icon is
spinning in a circle), even though there was no "Initialize" task for them.
After some time, the state changed from "Initializing ..." to "Failed" and
the icon changed to a red cross in a circle. So the timeout feature (as
Nishanth points out in comment 8) works, and the node no longer remains in
the "Initializing ..." state forever.

So the main differences and the good news are:

1) the timeout feature actually works and moved the host into the "Failed"
   state
2) there is a "Reinitialize" action available in the left popup menu

The bad news is:

1) The root cause of this issue remains: sometimes the Initialize task is
   not properly started for a host, so it is not visible in the Tasks list.
   This is bad because the user can miss the issue entirely, since there is
   no error in either the Tasks or Events list.
2) While there is a "Reinitialize" action available, when I click on it,
   nothing happens with the node, so it's not possible to actually
   reinitialize the node from the UI.
Additional problems: when I try to reinitialize a failed host via the
"Reinitialize" button, I can find an entry about this in the Events list,
but:

* it is classified as an "Info" event, even though the message states:
  "Failed to reinitialize node: dhcp-126-79.test.example.com" (I would
  expect something like "Fail" or "Error" as the type of this event)
* no details are available
Digging deeper, I see the following error in the salt minion log on one
machine which failed to initialize:

```
2016-05-10 14:43:20,229 [salt.crypt       ][INFO    ][1108] Waiting 10 seconds before retry.
2016-05-10 14:43:30,251 [salt.crypt       ][DEBUG   ][1108] Decrypting the current master AES key
2016-05-10 14:43:30,251 [salt.crypt       ][DEBUG   ][1108] Loaded minion key: /etc/salt/pki/minion/minion.pem
2016-05-10 14:43:30,261 [salt.crypt       ][ERROR   ][1108] The master failed to decrypt the random minion token
2016-05-10 14:43:30,261 [salt.crypt       ][CRITICAL][1108] The Salt Master server's public key did not authenticate!
The master may need to be updated if it is a version of Salt lower than
2015.5.5, or
If you are confident that you are connecting to a valid Salt Master, then
remove the master public key and restart the Salt Minion.
The master public key can be found at:
/etc/salt/pki/minion/minion_master.pub
2016-05-10 14:43:30,262 [salt.cli.daemons ][INFO    ][1108] The salt minion is shut down
```
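For reference, the manual recovery that the log message itself suggests
(remove the cached master key and restart the minion) can be sketched as
follows. This is only an illustration of the log's advice, not a verified
fix for this bug; a scratch directory stands in for the real
/etc/salt/pki/minion path so the steps can be exercised safely:

```shell
# Recovery steps suggested by the salt minion log above (a sketch).
# A scratch directory stands in for /etc/salt/pki/minion; on a real
# minion you would operate on the real path.
PKI_DIR=$(mktemp -d)                       # stand-in for /etc/salt/pki/minion
echo "stale key material" > "$PKI_DIR/minion_master.pub"

# Remove the cached master public key that failed to authenticate.
rm -f "$PKI_DIR/minion_master.pub"

# On a real minion, restart the service so it re-fetches the master key:
# systemctl restart salt-minion

[ ! -f "$PKI_DIR/minion_master.pub" ] && echo "cached master key removed"
rm -rf "$PKI_DIR"
```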
Created attachment 1155784 [details]
screenshot of Tasks List page

Attaching new screenshots:

rhsc.bz-1313770.01.png shows: there are only 4 "Initialize Node" tasks in
the Tasks list, while there are 7 machines in total.
Created attachment 1155785 [details]
screenshot of Hosts page

Attaching new screenshots:

rhsc.bz-1313770.04.png shows some nodes in a failed state.
This seems to be related to the workflow I use when maintaining my virtual
machines:

1) deploy the qe machines via jenkins (creates machines, installs usm
   packages)
2) shut down all the machines
3) create a snapshot
4) start the machines again
5) start the testing: accept hosts, ...
Additional information extending comment 10:

The error message in /var/log/salt/minion on an affected machine suggests
that I need to delete the salt master public key there and retry, but the
key is the same on all hosts (no matter whether the Initialize task was
created for the host or not):

~~~
$ ansible -i mbukatov-usm3.hosts usm_nodes -m shell -a 'md5sum /etc/salt/pki/minion/minion_master.pub'
mbukatov-usm3-mon1.usersys.redhat.com | SUCCESS | rc=0 >>
0c86e018c85368edcdd69faaa8995621  /etc/salt/pki/minion/minion_master.pub

mbukatov-usm3-node4.usersys.redhat.com | SUCCESS | rc=0 >>
0c86e018c85368edcdd69faaa8995621  /etc/salt/pki/minion/minion_master.pub

mbukatov-usm3-node2.usersys.redhat.com | SUCCESS | rc=0 >>
0c86e018c85368edcdd69faaa8995621  /etc/salt/pki/minion/minion_master.pub

mbukatov-usm3-mon2.usersys.redhat.com | SUCCESS | rc=0 >>
0c86e018c85368edcdd69faaa8995621  /etc/salt/pki/minion/minion_master.pub

mbukatov-usm3-node1.usersys.redhat.com | SUCCESS | rc=0 >>
0c86e018c85368edcdd69faaa8995621  /etc/salt/pki/minion/minion_master.pub

mbukatov-usm3-node3.usersys.redhat.com | SUCCESS | rc=0 >>
0c86e018c85368edcdd69faaa8995621  /etc/salt/pki/minion/minion_master.pub

mbukatov-usm3-mon3.usersys.redhat.com | SUCCESS | rc=0 >>
0c86e018c85368edcdd69faaa8995621  /etc/salt/pki/minion/minion_master.pub
~~~

That said, I'm able to fix the problem on an affected host by:

* restarting salt-minion there
* running "Reinitialize" from the usm web ui

This time, the "Initialize Node" task for the machine is actually created
and succeeds.
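The key-consistency check above can also be scripted without ansible as a
plain checksum comparison. A minimal local sketch; the fixture files below
stand in for the real paths (/etc/salt/pki/master/master.pub on the master,
/etc/salt/pki/minion/minion_master.pub on each minion), and the node names
are placeholders:

```shell
# Compare the master public key against each minion's cached copy by
# checksum; a mismatch would justify the "remove the master public key"
# advice from the minion log. Fixture files stand in for the real PKI
# paths so the sketch is safe to run anywhere.
workdir=$(mktemp -d)
echo "master public key material" > "$workdir/master.pub"
for host in node1 node2 node3; do          # placeholder host names
    mkdir -p "$workdir/$host"
    cp "$workdir/master.pub" "$workdir/$host/minion_master.pub"
done

master_sum=$(md5sum "$workdir/master.pub" | cut -d' ' -f1)
for host in node1 node2 node3; do
    cached_sum=$(md5sum "$workdir/$host/minion_master.pub" | cut -d' ' -f1)
    if [ "$cached_sum" = "$master_sum" ]; then
        echo "$host: cached master key OK"
    else
        echo "$host: cached master key MISMATCH"
    fi
done
rm -rf "$workdir"
```

On real machines the inner loop would run over ssh (or an ansible ad-hoc
command, as above) instead of local fixture directories.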
This looks more like an issue with your infrastructure. If the minion or
master has an old key cached, the salt communication channel will not come
up (even though the node gets accepted), and the salt-master will not emit
the event on which the initialization process starts.

As per comment 15, I am not sure about the impact of steps 2, 3, and 4.
Are you starting the machines from the snapshot?

In any case, it is a prerequisite from the USM standpoint that the minions
reach USM cleanly, without any errors (as in comment 10), before starting
the acceptance process.
(In reply to Nishanth Thomas from comment #17)
> This looks more of an issue with your infrastructure. If the minion or
> master has an old key cached, the the salt-communication will not be
> up(even though the node gets accepted) for communication and the event
> will not be emitted by salt-master on which the initialization process
> starts.

This looks like a corner case, that's true. But it is not an issue with my
infrastructure (see details below). Moreover, when it happens, RHSC 2.0 is
unable even to notice and properly inform the user about the problem, which
leads to many inconsistencies. I still consider this a valid issue.

> As per 'Comment 15' I am not sure the impact of step 2,3, and 4. Are you
> starting machine from the snapshot?

All I did (from the salt or RHSC 2.0 point of view) was restart all
machines of the cluster (including the usm server and the ceph storage
servers). The fact that I took a snapshot of a turned-off machine has no
effect on the machine itself, because this was done while the machines were
powered down.
I did a little experiment here to check this.

Run "Accept All" on a freshly provisioned cluster, 3 times:

* 1 node failed (the Initialize task was shown, so ok), the rest was ok
* all nodes accepted with success (both accept and initialize tasks)
* all nodes accepted with success (both accept and initialize tasks)

Run "Accept All" on a freshly provisioned cluster which I had just rebooted
(I restarted all machines, including the RHSC 2.0 server itself), 3 times:

* 1 node stuck in "accepting" (reproducing this BZ), the rest was ok;
  this means that for the one stuck node:
  - the accept task succeeded
  - but there was no initialize task for this node
  - there was no event for this
  - salt-minion on that node shut itself down
* 1 node stuck in "accepting" (reproducing this BZ), the rest was ok
* 4 nodes stuck in "accepting" (reproducing this BZ), the rest was ok

Run "Accept All" on a freshly provisioned cluster which I had just rebooted
(this time I restarted all machines of the cluster except the RHSC 2.0
server), 3 times:

* 1 node stuck in "accepting" (reproducing this BZ), the rest was ok
* 3 nodes stuck in "accepting" (reproducing this BZ), the rest was ok
* 3 nodes stuck in "accepting" (reproducing this BZ), the rest was ok

This is clear evidence that rebooting the nodes before running "Accept All"
increases the likelihood that this issue happens.

> In any case its a pre-requisite from the USM standpoint that the minions
> reached USM cleanly without any errors(as in comment 10) , before starting
> the acceptance process.

Ok, that's true and I don't object to it here. On the other hand, I would
expect RHSC 2.0 to be able to *notice* that the prerequisites are not
fulfilled and to report this problem to the user in a clear way. Right now,
I see that the node is stuck in the "Accepting" state even though the
"Accept" task finished with success, there is no error/event reported, and
the "Initialize" task is missing without any explanation ...
For the record: versions of salt packages.

On the RHSC 2.0 machine:

salt-master-2015.5.5-1.el7.noarch
salt-2015.5.5-1.el7.noarch

On the storage server machines:

salt-minion-2015.5.5-1.el7.noarch
salt-2015.5.5-1.el7.noarch
Based on a request from the dev team, I rechecked the issue, with the
following results.

01: initial state
=================

I have rebooted all machines which were just provisioned by our USM QE
jenkins. On the USM server I see:

~~~
# salt-key --finger-all
Local Keys:
master.pem:  96:da:8c:88:71:81:fd:af:66:ea:08:ac:f7:2b:41:78
master.pub:  58:da:a6:dd:ec:65:5a:67:80:00:00:f9:bb:94:b4:6e
Unaccepted Keys:
dhcp-126-79.lab.eng.brq.redhat.com:  54:d5:be:1a:e2:85:14:32:bb:f7:c9:ab:e0:03:68:3d
dhcp-126-80.lab.eng.brq.redhat.com:  b2:f8:03:36:22:24:04:49:5c:90:95:bb:77:98:ad:d8
dhcp-126-83.lab.eng.brq.redhat.com:  e3:46:f4:29:81:ac:1e:29:f4:4a:29:be:5a:c5:40:55
dhcp-126-81.lab.eng.brq.redhat.com:  b4:c0:31:84:74:81:e7:21:28:b7:dd:f2:1a:d4:1b:03
dhcp-126-82.lab.eng.brq.redhat.com:  3e:6f:85:a9:a0:ea:0c:c2:5c:0a:67:a8:44:1e:57:e4
dhcp-126-85.lab.eng.brq.redhat.com:  b0:88:da:20:6a:a1:3b:d3:df:6e:74:98:69:cd:f2:59
dhcp-126-84.lab.eng.brq.redhat.com:  ef:ec:48:47:c5:45:72:73:7b:a4:73:db:89:3d:ff:a8
~~~

And I have verified that the master has cached the keys correctly: every
key the salt master stores in /etc/salt/pki/master/minions_pre/${host}
matches the particular /etc/salt/pki/minion/minion.pub on that ${host}.
02: Running "Accept All"
========================

Then I start the "Accept All" task via the USM UI:

~~~
# salt-key --finger-all
Local Keys:
master.pem:  96:da:8c:88:71:81:fd:af:66:ea:08:ac:f7:2b:41:78
master.pub:  58:da:a6:dd:ec:65:5a:67:80:00:00:f9:bb:94:b4:6e
Accepted Keys:
dhcp-126-79.lab.eng.brq.redhat.com:  54:d5:be:1a:e2:85:14:32:bb:f7:c9:ab:e0:03:68:3d
dhcp-126-80.lab.eng.brq.redhat.com:  b2:f8:03:36:22:24:04:49:5c:90:95:bb:77:98:ad:d8
dhcp-126-83.lab.eng.brq.redhat.com:  e3:46:f4:29:81:ac:1e:29:f4:4a:29:be:5a:c5:40:55
dhcp-126-81.lab.eng.brq.redhat.com:  b4:c0:31:84:74:81:e7:21:28:b7:dd:f2:1a:d4:1b:03
dhcp-126-82.lab.eng.brq.redhat.com:  3e:6f:85:a9:a0:ea:0c:c2:5c:0a:67:a8:44:1e:57:e4
dhcp-126-85.lab.eng.brq.redhat.com:  b0:88:da:20:6a:a1:3b:d3:df:6e:74:98:69:cd:f2:59
dhcp-126-84.lab.eng.brq.redhat.com:  ef:ec:48:47:c5:45:72:73:7b:a4:73:db:89:3d:ff:a8
~~~

And everything looks good. But in the meantime, I hit the issue: on 3
nodes, the salt minion shut itself down:

~~~
# salt '*' test.ping
dhcp-126-82.lab.eng.brq.redhat.com:
    True
dhcp-126-79.lab.eng.brq.redhat.com:
    True
dhcp-126-83.lab.eng.brq.redhat.com:
    True
dhcp-126-85.lab.eng.brq.redhat.com:
    True
dhcp-126-84.lab.eng.brq.redhat.com:
    Minion did not return. [Not connected]
dhcp-126-81.lab.eng.brq.redhat.com:
    Minion did not return. [Not connected]
dhcp-126-80.lab.eng.brq.redhat.com:
    Minion did not return. [Not connected]
~~~

On one affected machine:

~~~
mbukatov-usm3-mon2.usersys.redhat.com | FAILED | rc=3 >>
● salt-minion.service - The Salt Minion
   Loaded: loaded (/usr/lib/systemd/system/salt-minion.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Wed 2016-06-08 19:21:17 CEST; 5min ago
  Process: 893 ExecStart=/usr/bin/salt-minion (code=exited, status=0/SUCCESS)
 Main PID: 893 (code=exited, status=0/SUCCESS)

Jun 08 19:21:07 dhcp-126-80.lab.eng.brq.redhat.com salt-minion[893]: [INFO    ] Waiting 10 seconds before retry.
Jun 08 19:21:17 dhcp-126-80.lab.eng.brq.redhat.com salt-minion[893]: [DEBUG   ] Decrypting the current master AES key
Jun 08 19:21:17 dhcp-126-80.lab.eng.brq.redhat.com salt-minion[893]: [DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
Jun 08 19:21:17 dhcp-126-80.lab.eng.brq.redhat.com salt-minion[893]: [ERROR   ] The master failed to decrypt the random minion token
Jun 08 19:21:17 dhcp-126-80.lab.eng.brq.redhat.com salt-minion[893]: [CRITICAL] The Salt Master server's public key did not authenticate!
Jun 08 19:21:17 dhcp-126-80.lab.eng.brq.redhat.com salt-minion[893]: The master may need to be updated if it is a version of Salt lower than 2015.5.5, or
Jun 08 19:21:17 dhcp-126-80.lab.eng.brq.redhat.com salt-minion[893]: If you are confident that you are connecting to a valid Salt Master, then remove the master public key and restart the Salt Minion.
Jun 08 19:21:17 dhcp-126-80.lab.eng.brq.redhat.com salt-minion[893]: The master public key can be found at:
Jun 08 19:21:17 dhcp-126-80.lab.eng.brq.redhat.com salt-minion[893]: /etc/salt/pki/minion/minion_master.pub
Jun 08 19:21:17 dhcp-126-80.lab.eng.brq.redhat.com salt-minion[893]: [INFO    ] The salt minion is shut down
~~~

If I recheck the keys now, all keys are ok: every key the salt master
stores in /etc/salt/pki/master/minions_pre/${host} matches the particular
/etc/salt/pki/minion/minion.pub on that ${host}. Moreover, the
/etc/salt/pki/minion/minion_master.pub file on each salt minion host
matches /etc/salt/pki/master/master.pub on the salt master machine.

So it seems that it's not a problem with the keys.

03: Restarting salt-minions
===========================

So now I restart salt-minion on all storage nodes.
And I see that:

1) salt-minion is up and running everywhere:

~~~
# salt '*' test.ping
dhcp-126-79.lab.eng.brq.redhat.com:
    True
dhcp-126-85.lab.eng.brq.redhat.com:
    True
dhcp-126-84.lab.eng.brq.redhat.com:
    True
dhcp-126-82.lab.eng.brq.redhat.com:
    True
dhcp-126-83.lab.eng.brq.redhat.com:
    True
dhcp-126-81.lab.eng.brq.redhat.com:
    True
dhcp-126-80.lab.eng.brq.redhat.com:
    True
~~~

2) All the keys are still ok (I did the same checks as in the previous
   step).

So it all works fine after all, and no key was changed during this process.
Again, based on this, it seems to me that it's not a problem with the keys.

04: Reinitialize
================

If I click on "Reinitialize" on the Hosts page now, a new Initialize task
is created and finishes with success for all affected hosts.

So it all works after all. On the affected storage servers (salt minion
machines), I just needed to:

* restart the salt-minion service
* run "Reinitialize" via the USM web UI
After closer examination, I noticed that while the keys haven't changed,
the configuration of the affected nodes differs a bit compared to the nodes
which initialized successfully.

On the salt minion machines which initialized with success, the following
lines were added to the /etc/salt/minion.d/_schedule.conf file:

~~~
schedule:
  __mine_interval: {function: mine.update, jid_include: true, maxrunning: 2, minutes: 60}
~~~

On the nodes which failed to initialize, the _schedule.conf file hasn't
been changed at all.

Then, after I restarted the affected salt minion services, I noticed that
the _schedule.conf file was changed in the same way as had previously
happened on the successful machines only.

So it looks like, for some reason, this configuration step randomly fails
on some salt minion machines when I reboot them before running "Accept All"
via RHSC 2.0.
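The difference described above can be checked per node with a simple grep
for the mine.update schedule entry. A minimal sketch; the fixture files
below stand in for /etc/salt/minion.d/_schedule.conf on a successfully
initialized node and on a failed one:

```shell
# Distinguish initialized from uninitialized nodes by the presence of the
# mine.update schedule entry in _schedule.conf, as observed above.
# Fixture files stand in for /etc/salt/minion.d/_schedule.conf.
workdir=$(mktemp -d)

# Node that initialized successfully: salt added the schedule entry.
cat > "$workdir/good.conf" <<'EOF'
schedule:
  __mine_interval: {function: mine.update, jid_include: true, maxrunning: 2, minutes: 60}
EOF

# Node that failed to initialize: the file was never changed (empty here).
: > "$workdir/bad.conf"

for conf in good bad; do
    if grep -q 'mine\.update' "$workdir/$conf.conf"; then
        echo "$conf: schedule entry present (node initialized)"
    else
        echo "$conf: schedule entry missing (node NOT initialized)"
    fi
done
rm -rf "$workdir"
```

On real nodes the same grep would be run (e.g. over ssh or ansible) against
the real _schedule.conf path on each host.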
So based on my investigation, I see 2 problems here:

1) The root cause is the missing reconfiguration of the salt minion service
   (the changes to /etc/salt/minion.d/_schedule.conf).

   Question: which component is responsible for this step?

2) RHSC 2.0 fails to notice the issue. Instead of reporting immediately
   that the Initialize task failed to start, it reports *nothing at all*,
   as if the start of the Initialize task were merely delayed, keeping the
   task spinner icon running until it times out. This is not acceptable.
   The console should be able to detect the problem immediately (which
   *should be possible* - the Initialize task failed to start, after all).

I consider the 2nd point more severe. I just wasted a lot of time debugging
this problem only because the console *didn't properly report the problem*.

=> increasing severity to HIGH
Question for problem #1: which component is responsible for the
configuration changes made to the /etc/salt/minion.d/_schedule.conf file?
(In reply to Martin Bukatovic from comment #24)
> So based on my investigation, I see 2 problems here:
>
> 1) The root cause is missing reconfiguration of salt minion service
>    (changes of /etc/salt/minion.d/_schedule.conf).
>
>    Question: which component is responsible for this step?

RHSC is not managing this conf file. It is updated by salt itself after a
successful acceptance.

> 2) RHSC 2.0 fails to notice the issue. And instead of presenting that
>    Initialize task failed to start immediately, it reports *nothing at
>    all* as if the start of Initialize task were delayed while having task
>    spinner icon running until it timeouts. This is not acceptable.
>    Console should be able to detect the problem immediately (which
>    *should be possible* - the initialize task failed to start after all).

One thing to understand here is that RHSC starts managing the node only
once it is accepted and initialized. There are two steps involved:

1. Accepting the node (just the keys) - a local operation which happens at
   the USM server. Acceptance will succeed even if the salt minion is down.
2. Once acceptance is done, the master tries to establish communication
   with the minion. Once the channel is up, the master sends an event,
   which is consumed by USM to start the initialization.

> I consider this 2nd point more severe. I just wasted lot of time
> debugging this problem just because console *didn't properly report the
> problem*.

In this case the channel will never come up (as the salt minion has shut
down), so USM keeps waiting for the event and finally times out. Neither
USM nor the salt master can detect this unless the minion is accepted and
initialized. What we could do is generate an audit log (Info event) stating
that the initialization failed once the node acceptance is marked as
failed.

> => increasing severity to HIGH
I had tried to reproduce the problem, but everything works fine.

Steps followed:

1. Installed RHSC on a fresh RHEL machine.
2. Configured everything on fresh storage nodes.
3. After the curl command, checked on the server that it shows all
   unaccepted nodes, with the help of the salt-key -L command.
4. Restarted all the storage nodes and the server.
5. Ran RHSC as a service.
6. Accepted all the unmanaged nodes from the UI.
7. After acceptance, the nodes initialized successfully.
Closing this bug as per the discussion and comment 26
The problem is still valid; reproduced on:

rhscon-core-0.0.39-1.el7scon.x86_64
rhscon-core-selinux-0.0.39-1.el7scon.noarch
rhscon-ceph-0.0.39-1.el7scon.x86_64
rhscon-ui-0.0.51-1.el7scon.noarch
salt-2015.5.5-1.el7.noarch
salt-master-2015.5.5-1.el7.noarch
salt-selinux-0.0.39-1.el7scon.noarch
salt-minion-2015.5.5-1.el7.noarch

~~~
# cat /var/log/salt/minion
...
2016-08-03 15:04:07,772 [salt.crypt       ][DEBUG   ][1001] Loaded minion key: /etc/salt/pki/minion/minion.pem
2016-08-03 15:04:07,793 [salt.crypt       ][ERROR   ][1001] The master failed to decrypt the random minion token
2016-08-03 15:04:07,794 [salt.crypt       ][CRITICAL][1001] The Salt Master server's public key did not authenticate!
The master may need to be updated if it is a version of Salt lower than
2015.5.5, or
If you are confident that you are connecting to a valid Salt Master, then
remove the master public key and restart the Salt Minion.
The master public key can be found at:
/etc/salt/pki/minion/minion_master.pub
2016-08-03 15:04:07,794 [salt.cli.daemons ][INFO    ][1001] The salt minion is shut down
...
~~~

The task times out on the usm side without any indication of where the
issue is. At least some indication that the salt master lost contact with
the salt minion would be helpful.