|
| 1 | +:title: Troubleshooting deis-store |
| 2 | +:description: Resolutions for common issues with deis-store and Ceph. |
| 3 | + |
| 4 | +.. _troubleshooting-store: |
| 5 | + |
| 6 | +Troubleshooting deis-store |
| 7 | +========================== |
| 8 | + |
| 9 | +The store component is the most complex component of Deis. As such, there are many ways for it to fail. |
| 10 | +Recall that the store components represent Ceph services as follows: |
| 11 | + |
| 12 | +* ``store-monitor``: http://ceph.com/docs/giant/man/8/ceph-mon/ |
| 13 | +* ``store-daemon``: http://ceph.com/docs/giant/man/8/ceph-osd/ |
| 14 | +* ``store-gateway``: http://ceph.com/docs/giant/radosgw/ |
| 15 | +* ``store-metadata``: http://ceph.com/docs/giant/man/8/ceph-mds/ |
| 16 | +* ``store-volume``: a system service which mounts a `Ceph FS`_ volume to be used by the controller and logger components |
| 17 | + |
| 18 | +Log output for store components can be viewed with ``deisctl status store-<component>`` (such as |
| 19 | +``deisctl status store-volume``). Additionally, the Ceph health can be queried by using the ``deis-store-admin`` |
| 20 | +administrative container to access the cluster. |
| 21 | + |
| 22 | +.. _using-store-admin: |
| 23 | + |
| 24 | +Using store-admin |
| 25 | +----------------- |
| 26 | + |
| 27 | +``deis-store-admin`` is an optional component that is helpful when diagnosing problems with ``deis-store``. |
| 28 | +It contains the ``ceph`` client and writes the necessary Ceph configuration files so it always has the |
| 29 | +most up-to-date configuration for the cluster. |
| 30 | + |
| 31 | +To use ``deis-store-admin``, install and start it with ``deisctl``: |
| 32 | + |
| 33 | +.. code-block:: console |
| 34 | +
|
| 35 | + $ deisctl install store-admin |
| 36 | + $ deisctl start store-admin |
| 37 | +
|
| 38 | +The container will now be running on all hosts in the cluster. Log into any of the hosts, enter |
| 39 | +the container with ``nse deis-store-admin``, and then issue a ``ceph -s`` to query the cluster's health. |
| 40 | + |
| 41 | +The output should be similar to the following: |
| 42 | + |
| 43 | +.. code-block:: console |
| 44 | +
|
| 45 | + core@deis-1 ~ $ nse deis-store-admin |
| 46 | + root@deis-1:/# ceph -s |
| 47 | + cluster 20038e38-4108-4e79-95d4-291d0eef2949 |
| 48 | + health HEALTH_OK |
| 49 | + monmap e3: 3 mons at {deis-1=172.17.8.100:6789/0,deis-2=172.17.8.101:6789/0,deis-3=172.17.8.102:6789/0}, election epoch 16, quorum 0,1,2 deis-1,deis-2,deis-3 |
| 50 | + mdsmap e10: 1/1/1 up {0=deis-2=up:active}, 2 up:standby |
| 51 | + osdmap e36: 3 osds: 3 up, 3 in |
| 52 | + pgmap v2096: 1344 pgs, 12 pools, 369 MB data, 448 objects |
| 53 | + 24198 MB used, 23659 MB / 49206 MB avail |
| 54 | + 1344 active+clean |
| 55 | +
|
| 56 | +If you see ``HEALTH_OK``, this means everything is working as it should. |
| 57 | +Note also ``monmap e3: 3 mons at...``, which means all three monitor containers are up and responding;
| 58 | +``mdsmap e10: 1/1/1 up...`` together with ``2 up:standby``, which means one metadata container is active and the other two are on standby;
| 59 | +and ``osdmap e36: 3 osds: 3 up, 3 in``, which means all three daemon containers are up and running.
| 60 | + |
| 61 | +We can also see from the ``pgmap`` that we have 1344 placement groups, all of which are ``active+clean``. |
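
When checking several hosts, it can help to reduce the ``ceph -s`` output to a single word. The following is a minimal sketch, not part of Deis or Ceph: the function name ``ceph_health_verdict`` is hypothetical, and it only looks for the ``HEALTH_OK``, ``HEALTH_WARN``, and ``HEALTH_ERR`` strings in the text it reads from standard input.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not shipped with Deis or Ceph):
# reads the text printed by `ceph -s` on standard input and prints a
# one-word verdict based on the health string it finds.
ceph_health_verdict() {
    status=$(cat)
    case "$status" in
        *HEALTH_OK*)   echo "ok" ;;
        *HEALTH_WARN*) echo "warn" ;;
        *HEALTH_ERR*)  echo "error" ;;
        *)             echo "unknown" ;;   # no recognizable health line
    esac
}
```

Inside the ``deis-store-admin`` container, this could be used as ``ceph -s | ceph_health_verdict``. Anything other than ``ok`` suggests reading the full ``ceph -s`` output.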
| 62 | + |
| 63 | +For additional information on troubleshooting Ceph, see `troubleshooting`_. Common issues with |
| 64 | +specific store components are detailed below. |
| 65 | + |
| 66 | +store-monitor |
| 67 | +------------- |
| 68 | + |
| 69 | +The monitor is the first store component to start, and is required for any of the other store |
| 70 | +components to function properly. If a ``deisctl list`` indicates that any of the monitors are failing, |
| 71 | +it is likely due to a host issue. A common failure scenario is not having
| 72 | +adequate free storage on the host node. In that case, monitors will fail with errors similar to:
| 73 | + |
| 74 | +.. code-block:: console |
| 75 | +
|
| 76 | + Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053693 7fd0586a6700 0 mon.deis-staging-node1@0(leader).data_health(6) update_stats avail 1% total 5960684 used 56655 |
| 77 | + Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053770 7fd0586a6700 -1 mon.deis-staging-node1@0(leader).data_health(6) reached critical levels of available space on |
| 78 | + Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053772 7fd0586a6700 0 ** Shutdown via Data Health Service ** |
| 79 | + Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053821 7fd056ea3700 -1 mon.deis-staging-node1@0(leader) e3 *** Got Signal Interrupt *** |
| 80 | + Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053834 7fd056ea3700 1 mon.deis-staging-node1@0(leader) e3 shutdown |
| 81 | + Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.054000 7fd056ea3700 0 quorum service shutdown |
| 82 | + Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.054002 7fd056ea3700 0 mon.deis-staging-node1@0(shutdown).health(6) HealthMonitor::service_shutdown 1 services |
| 83 | + Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.054065 7fd056ea3700 0 quorum service shutdown |
| 84 | +
|
| 85 | +This is typically only an issue when deploying Deis on bare metal, as most cloud providers have adequately |
| 86 | +large volumes. |
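
Because the monitor log above reports ``avail 1%`` before shutting down, it may be worth checking free space on the host before restarting a failed monitor. The following is a rough sketch under stated assumptions: ``avail_percent`` is a hypothetical helper, it expects ``df -P`` output for a single filesystem on standard input, and which filesystem actually holds the monitor data depends on your installation.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not part of Deis): reads `df -P`
# output for one filesystem on standard input and prints the percentage
# of space still available, truncated to a whole number.
avail_percent() {
    # `df -P` columns: filesystem, 1024-blocks, used, available, capacity, mount point.
    # Line 1 is the header, so line 2 holds the values.
    awk 'NR == 2 && $2 > 0 { printf "%d\n", ($4 * 100) / $2 }'
}
```

For example, ``df -P / | avail_percent`` prints the available percentage for the root filesystem. Checking ``/`` is an assumption here; substitute the path where the store data lives on your hosts.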
| 87 | + |
| 88 | +store-daemon |
| 89 | +------------ |
| 90 | + |
| 91 | +The daemons are responsible for actually storing the data on the filesystem. The cluster is configured |
| 92 | +to allow writes with just one daemon running, but the cluster will be running in a degraded state, so |
| 93 | +restoring all daemons to a running state as quickly as possible is paramount. |
| 94 | + |
| 95 | +Daemons can be safely restarted with ``deisctl restart store-daemon``, but this will restart all daemons, |
| 96 | +resulting in downtime of the storage cluster until the daemons recover. Alternatively, issuing a |
| 97 | +``sudo systemctl restart deis-store-daemon`` on the host of the failing daemon will restart just |
| 98 | +that daemon. |
| 99 | + |
| 100 | +store-gateway |
| 101 | +------------- |
| 102 | + |
| 103 | +The gateway runs Apache and a FastCGI server to communicate with the cluster. Restarting the gateway |
| 104 | +will result in a short downtime for the registry component (and will prevent the database from |
| 105 | +backing up), but those components should recover as soon as the gateway comes back up. |
| 106 | + |
| 107 | +store-metadata |
| 108 | +-------------- |
| 109 | + |
| 110 | +The metadata servers are required for the **volume** to function properly. Only one is active at |
| 111 | +any one time, and the rest operate as hot standbys. The monitors will promote a standby metadata |
| 112 | +server should the active one fail. |
| 113 | + |
| 114 | +store-volume |
| 115 | +------------ |
| 116 | + |
| 117 | +Without functioning monitors, daemons, and metadata servers, the volume service will likely hang |
| 118 | +indefinitely (or restart constantly). If the controller or logger happens to be running on a host with a
| 119 | +failing store-volume, application logs will be lost until the volume recovers. |
| 120 | + |
| 121 | +Note that store-volume requires CoreOS >= 471.1.0 for the CephFS kernel module. |
| 122 | + |
| 123 | +.. _`Ceph FS`: https://ceph.com/docs/giant/cephfs/ |
| 124 | +.. _`troubleshooting`: http://docs.ceph.com/docs/giant/rados/troubleshooting/ |