|
| 1 | +:title: Troubleshooting Deis |
| 2 | +:description: Resolutions for common issues encountered when running Deis. |
| 3 | + |
| 4 | +.. _troubleshooting_deis: |
| 5 | + |
| 6 | +Troubleshooting Deis |
| 7 | +==================== |
| 8 | + |
| 9 | +Common issues that users have run into when provisioning Deis are detailed below. |
| 10 | + |
| 11 | +A deis-store component fails to start |
| 12 | +------------------------------------- |
| 13 | + |
| 14 | +The store component is the most complex component of Deis. As such, there are many ways for it to fail. |
| 15 | +Recall that the store components represent Ceph services as follows: |
| 16 | + |
| 17 | +* ``store-monitor``: http://ceph.com/docs/firefly/man/8/ceph-mon/ |
| 18 | +* ``store-daemon``: http://ceph.com/docs/firefly/man/8/ceph-osd/ |
| 19 | +* ``store-gateway``: http://ceph.com/docs/firefly/radosgw/ |
| 20 | +* ``store-metadata``: http://ceph.com/docs/firefly/man/8/ceph-mds/ |
| 21 | +* ``store-volume``: a system service which mounts a `Ceph FS`_ volume to be used by the controller and logger components |
| 22 | + |
| 23 | +Log output for store components can be viewed with ``deisctl status store-<component>`` (such as |
| 24 | +``deisctl status store-volume``). Additionally, the Ceph health can be queried by entering |
| 25 | +a store container with ``nse deis-store-monitor`` and then issuing a ``ceph -s``. This should output the |
| 26 | +health of the cluster like: |
| 27 | + |
| 28 | +.. code-block:: console |
| 29 | +
|
| 30 | + cluster 6506db0c-9eae-4bb6-a40a-95954dd3c4c3 |
| 31 | + health HEALTH_OK |
| 32 | + monmap e3: 3 mons at {deis-1=172.17.8.100:6789/0,deis-2=172.17.8.101:6789/0,deis-3=172.17.8.102:6789/0}, election epoch 8, quorum 0,1,2 deis-1,deis-2,deis-3 |
| 33 | + osdmap e7: 3 osds: 3 up, 3 in |
| 34 | + pgmap v14: 192 pgs, 3 pools, 0 bytes data, 0 objects |
| 35 | + 19378 MB used, 28944 MB / 49200 MB avail |
| 36 | + 192 active+clean |
| 37 | +
|
| 38 | +If you see ``HEALTH_OK``, this means everything is working as it should. |
| 39 | +Note also ``monmap e3: 3 mons at...`` which means all three monitor containers are up and responding, |
| 40 | +and ``osdmap e7: 3 osds: 3 up, 3 in`` which means all three daemon containers are up and running. |
| 41 | + |
| 42 | +We can also see from the ``pgmap`` that we have 192 placement groups, all of which are ``active+clean``. |
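The checks described above can be sketched as a small script. This is a minimal illustration that assumes the plain-text ``ceph -s`` layout shown in the sample output; the function names ``summarize`` and ``looks_healthy`` are hypothetical and are not part of Deis or Ceph.

```python
import re

# Sample `ceph -s` output, copied from the healthy cluster shown above.
SAMPLE = """\
    cluster 6506db0c-9eae-4bb6-a40a-95954dd3c4c3
     health HEALTH_OK
     monmap e3: 3 mons at {deis-1=172.17.8.100:6789/0,deis-2=172.17.8.101:6789/0,deis-3=172.17.8.102:6789/0}, election epoch 8, quorum 0,1,2 deis-1,deis-2,deis-3
     osdmap e7: 3 osds: 3 up, 3 in
      pgmap v14: 192 pgs, 3 pools, 0 bytes data, 0 objects
            19378 MB used, 28944 MB / 49200 MB avail
                 192 active+clean
"""


def _first_int(match, group=1):
    """Return the requested regex group as an int, or None if there was no match."""
    return int(match.group(group)) if match else None


def summarize(status_text):
    """Extract the fields discussed above from plain-text `ceph -s` output."""
    health = re.search(r"health\s+(HEALTH_\w+)", status_text)
    mons = re.search(r"monmap e\d+: (\d+) mons", status_text)
    osds = re.search(r"osdmap e\d+: (\d+) osds: (\d+) up, (\d+) in", status_text)
    pgs = re.search(r"pgmap v\d+: (\d+) pgs", status_text)
    clean = re.search(r"^\s*(\d+) active\+clean\s*$", status_text, re.MULTILINE)
    return {
        "health": health.group(1) if health else None,
        "mons": _first_int(mons),
        "osds_total": _first_int(osds, 1),
        "osds_up": _first_int(osds, 2),
        "osds_in": _first_int(osds, 3),
        "pgs_total": _first_int(pgs),
        "pgs_clean": _first_int(clean),
    }


def looks_healthy(summary):
    """True only if health is OK, every OSD is up, and every PG is active+clean."""
    return (
        summary["health"] == "HEALTH_OK"
        and summary["osds_total"] is not None
        and summary["osds_up"] == summary["osds_total"]
        and summary["pgs_total"] is not None
        and summary["pgs_clean"] == summary["pgs_total"]
    )


if __name__ == "__main__":
    print(looks_healthy(summarize(SAMPLE)))
```

In practice the text would come from running ``ceph -s`` inside the monitor container rather than from a hard-coded sample.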
| 43 | + |
| 44 | +For additional information on troubleshooting Ceph, see `troubleshooting`_. Common issues with |
| 45 | +specific store components are detailed below. |
| 46 | + |
| 47 | +store-monitor |
| 48 | +~~~~~~~~~~~~~ |
| 49 | + |
| 50 | +The monitor is the first store component to start, and is required for any of the other store |
| 51 | +components to function properly. If a ``deisctl list`` indicates that any of the monitors are failing, |
| 52 | +it is likely due to a host issue. A common failure scenario is inadequate |
| 53 | +free storage on the host node. In that case, monitors will fail with errors similar to: |
| 54 | + |
| 55 | +.. code-block:: console |
| 56 | +
|
| 57 | + Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053693 7fd0586a6700 0 mon.deis-staging-node1@0(leader).data_health(6) update_stats avail 1% total 5960684 used 56655 |
| 58 | + Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053770 7fd0586a6700 -1 mon.deis-staging-node1@0(leader).data_health(6) reached critical levels of available space on |
| 59 | + Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053772 7fd0586a6700 0 ** Shutdown via Data Health Service ** |
| 60 | + Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053821 7fd056ea3700 -1 mon.deis-staging-node1@0(leader) e3 *** Got Signal Interrupt *** |
| 61 | + Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053834 7fd056ea3700 1 mon.deis-staging-node1@0(leader) e3 shutdown |
| 62 | + Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.054000 7fd056ea3700 0 quorum service shutdown |
| 63 | + Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.054002 7fd056ea3700 0 mon.deis-staging-node1@0(shutdown).health(6) HealthMonitor::service_shutdown 1 services |
| 64 | + Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.054065 7fd056ea3700 0 quorum service shutdown |
| 65 | +
|
| 66 | +This is typically only an issue when deploying Deis on bare metal, as most cloud providers have adequately |
| 67 | +large volumes. |
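Free space on a host can be checked before the monitor reaches this state. The sketch below uses only the Python standard library. The ``/`` path and the 5% threshold are assumptions chosen for illustration and are not taken from the Deis or Ceph configuration; point the check at whichever filesystem holds the monitor's data on your hosts.

```python
import shutil


def percent_free(total_bytes, free_bytes):
    """Return free space as a percentage of total space."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * free_bytes / total_bytes


def check_free_space(path="/", min_percent=5.0):
    """Return (percent_free, ok) for the filesystem containing `path`.

    `min_percent` is an illustrative threshold (an assumption), not a
    value read from the cluster.
    """
    usage = shutil.disk_usage(path)
    pct = percent_free(usage.total, usage.free)
    return pct, pct >= min_percent


if __name__ == "__main__":
    pct, ok = check_free_space("/")
    state = "ok" if ok else "LOW"
    print("free space on /: %.1f%% (%s)" % (pct, state))
```

The log excerpt above reports ``avail 1%``, which a check like this would flag well before the monitor shuts itself down.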
| 68 | + |
| 69 | +store-daemon |
| 70 | +~~~~~~~~~~~~ |
| 71 | + |
| 72 | +The daemons are responsible for actually storing the data on the filesystem. The cluster is configured |
| 73 | +to allow writes with just one daemon running, but the cluster will be running in a degraded state, so |
| 74 | +restoring all daemons to a running state as quickly as possible is paramount. |
| 75 | + |
| 76 | +Daemons can be safely restarted with ``deisctl restart store-daemon``, but this will restart all daemons, |
| 77 | +resulting in downtime of the storage cluster until the daemons recover. Alternatively, issuing a |
| 78 | +``sudo systemctl restart deis-store-daemon`` on the host of the failing daemon will restart just |
| 79 | +that daemon. |
| 80 | + |
| 81 | +store-gateway |
| 82 | +~~~~~~~~~~~~~ |
| 83 | + |
| 84 | +The gateway runs Apache and a FastCGI server to communicate with the cluster. Restarting the gateway |
| 85 | +will result in a short downtime for the registry component (and will prevent the database from |
| 86 | +backing up), but those components should recover as soon as the gateway comes back up. |
| 87 | + |
| 88 | +store-metadata |
| 89 | +~~~~~~~~~~~~~~ |
| 90 | + |
| 91 | +The metadata servers are required for the **volume** to function properly. Only one is active at |
| 92 | +any one time, and the rest operate as hot standbys. The monitors will promote a standby metadata |
| 93 | +server should the active one fail. |
| 94 | + |
| 95 | +store-volume |
| 96 | +~~~~~~~~~~~~ |
| 97 | + |
| 98 | +Without functioning monitors, daemons, and metadata servers, the volume service will likely hang |
| 99 | +indefinitely (or restart constantly). If the controller or logger happens to be running on a host with a |
| 100 | +failing store-volume, application logs will be lost until the volume recovers. |
| 101 | + |
| 102 | +Note that store-volume requires CoreOS >= 471.1.0 for the CephFS kernel module. |
| 103 | + |
| 104 | +Any component fails to start |
| 105 | +---------------------------- |
| 106 | + |
| 107 | +Use ``deisctl status <component>`` to view the status of the component. |
| 108 | +You can also use ``deisctl journal <component>`` to tail logs for a component, or ``deisctl list`` |
| 109 | +to list all components. |
| 110 | + |
| 111 | +Failed initializing SSH client |
| 112 | +------------------------------ |
| 113 | + |
| 114 | +A ``deisctl`` command fails with: ``Failed initializing SSH client: ssh: handshake failed: ssh: unable to authenticate``. |
| 115 | +Did you remember to add your SSH key to the ssh-agent? ``ssh-add -L`` should list the key you used |
| 116 | +to provision the servers. If it's not there, run ``ssh-add -K /path/to/your/key``. |
| 117 | + |
| 118 | +All the given peers are not reachable |
| 119 | +------------------------------------- |
| 120 | + |
| 121 | +A ``deisctl`` command fails with: ``All the given peers are not reachable (Tried to connect to each peer twice and failed)``. |
| 122 | +The most common cause of this issue is that a `new discovery URL <https://discovery.etcd.io/new>`_ |
| 123 | +wasn't generated and updated in ``contrib/coreos/user-data`` before the cluster was launched. |
| 124 | +Each Deis cluster must have a unique discovery URL, or else ``etcd`` will try and fail to connect to old hosts. |
| 125 | +Try destroying the cluster and relaunching it with a fresh discovery URL. |
| 126 | + |
| 127 | +You can use ``make discovery-url`` to automatically fetch a new discovery URL. |
| 128 | + |
| 129 | +Other issues |
| 130 | +------------ |
| 131 | + |
| 132 | +Running into something not detailed here? Please `open an issue`_ or hop into #deis on Freenode IRC and we'll help! |
| 133 | + |
| 134 | +.. _`Ceph FS`: https://ceph.com/docs/firefly/cephfs/ |
| 135 | +.. _`open an issue`: https://github.com/deis/deis/issues/new |
| 136 | +.. _`troubleshooting`: http://docs.ceph.com/docs/firefly/rados/troubleshooting/ |
| 137 | + |