Commit cfae0e3

docs(*): reference deis-store-admin for admin tasks
1 parent 326f1a3 commit cfae0e3

3 files changed

Lines changed: 138 additions & 101 deletions

docs/managing_deis/add_remove_host.rst

Lines changed: 6 additions & 6 deletions
@@ -25,11 +25,11 @@ Inspecting health
2525
-----------------
2626

2727
Before we begin, we should check the state of the Ceph cluster to be sure it's healthy.
28-
We can do this by logging into any machine in the cluster, entering a store container, and then querying Ceph:
28+
To do this, we use ``deis-store-admin`` - see :ref:`using-store-admin`.
2929

3030
.. code-block:: console
3131
32-
core@deis-1 ~ $ nse deis-store-monitor
32+
core@deis-1 ~ $ nse deis-store-admin
3333
root@deis-1:/# ceph -s
3434
cluster 20038e38-4108-4e79-95d4-291d0eef2949
3535
health HEALTH_OK
@@ -130,7 +130,7 @@ on any host in the cluster (except the one we're removing). In this example, I a
130130

131131
.. code-block:: console
132132
133-
core@deis-2 ~ $ nse deis-store-monitor
133+
core@deis-2 ~ $ nse deis-store-admin
134134
root@deis-2:/# ceph osd out 2
135135
marked out osd.2.
136136
@@ -178,7 +178,7 @@ Back inside a store container on ``deis-2``, we can finally remove the OSD:
178178

179179
.. code-block:: console
180180
181-
core@deis-2 ~ $ nse deis-store-monitor
181+
core@deis-2 ~ $ nse deis-store-admin
182182
root@deis-2:/# ceph osd crush remove osd.2
183183
removed item id 2 name 'osd.2' from crush map
184184
root@deis-2:/# ceph auth del osd.2
@@ -196,7 +196,7 @@ That's it! If we inspect the health, we see that there are now 3 osds again, and
196196

197197
.. code-block:: console
198198
199-
core@deis-2 ~ $ nse deis-store-monitor
199+
core@deis-2 ~ $ nse deis-store-admin
200200
root@deis-2:/# ceph -s
201201
cluster 20038e38-4108-4e79-95d4-291d0eef2949
202202
health HEALTH_OK
@@ -231,7 +231,7 @@ Back on another host, we can again enter a store container and then remove this
231231

232232
.. code-block:: console
233233
234-
core@deis-2 ~ $ nse deis-store-monitor
234+
core@deis-2 ~ $ nse deis-store-admin
235235
root@deis-2:/# ceph mon remove deis-1
236236
removed mon.deis-1 at 172.17.8.100:6789/0, there are now 3 monitors
237237
2014-11-04 06:57:59.712934 7f04bc942700 0 monclient: hunting for new mon
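
The OSD-removal commands visible in the hunks above (``ceph osd out``, ``ceph osd crush remove``, ``ceph auth del``) can be collected into a small dry-run helper. This is a minimal sketch, not part of the commit: the function name ``osd_removal_commands`` is hypothetical, and it only prints the commands shown in this diff. The hunks omit the context between these commands, so the output is not a complete removal procedure.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the OSD-removal commands that appear in
# the diff above for a given numeric OSD id. It does not run ceph itself.
osd_removal_commands() {
  osd_id="$1"
  # Reject empty or non-numeric ids so a typo never produces a bad command.
  case "$osd_id" in
    ''|*[!0-9]*)
      echo "usage: osd_removal_commands <numeric-osd-id>" >&2
      return 1
      ;;
  esac
  printf 'ceph osd out %s\n' "$osd_id"
  printf 'ceph osd crush remove osd.%s\n' "$osd_id"
  printf 'ceph auth del osd.%s\n' "$osd_id"
}
```

Running ``osd_removal_commands 2`` prints the three commands used for ``osd.2`` in the example, which can then be reviewed and issued one at a time from inside the ``deis-store-admin`` container.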

docs/troubleshooting_deis/index.rst

Lines changed: 8 additions & 95 deletions
@@ -6,6 +6,13 @@
66
Troubleshooting Deis
77
====================
88

9+
:Release: |version|
10+
:Date: |today|
11+
12+
.. toctree::
13+
14+
troubleshooting-store
15+
916
Common issues that users have run into when provisioning Deis are detailed below.
1017

1118
Logging in to the cluster
@@ -37,99 +44,7 @@ which will lead to issues running Deis successfully.
3744
A deis-store component fails to start
3845
-------------------------------------
3946

40-
The store component is the most complex component of Deis. As such, there are many ways for it to fail.
41-
Recall that the store components represent Ceph services as follows:
42-
43-
* ``store-monitor``: http://ceph.com/docs/giant/man/8/ceph-mon/
44-
* ``store-daemon``: http://ceph.com/docs/giant/man/8/ceph-osd/
45-
* ``store-gateway``: http://ceph.com/docs/giant/radosgw/
46-
* ``store-metadata``: http://ceph.com/docs/giant/man/8/ceph-mds/
47-
* ``store-volume``: a system service which mounts a `Ceph FS`_ volume to be used by the controller and logger components
48-
49-
Log output for store components can be viewed with ``deisctl status store-<component>`` (such as
50-
``deisctl status store-volume``). Additionally, the Ceph health can be queried by entering
51-
a store container with ``nse deis-store-monitor`` and then issuing a ``ceph -s``. This should output the
52-
health of the cluster like:
53-
54-
.. code-block:: console
55-
56-
core@deis-1 ~ $ nse deis-store-monitor
57-
root@deis-1:/# ceph -s
58-
cluster 20038e38-4108-4e79-95d4-291d0eef2949
59-
health HEALTH_OK
60-
monmap e3: 3 mons at {deis-1=172.17.8.100:6789/0,deis-2=172.17.8.101:6789/0,deis-3=172.17.8.102:6789/0}, election epoch 16, quorum 0,1,2 deis-1,deis-2,deis-3
61-
mdsmap e10: 1/1/1 up {0=deis-2=up:active}, 2 up:standby
62-
osdmap e36: 3 osds: 3 up, 3 in
63-
pgmap v2096: 1344 pgs, 12 pools, 369 MB data, 448 objects
64-
24198 MB used, 23659 MB / 49206 MB avail
65-
1344 active+clean
66-
67-
If you see ``HEALTH_OK``, this means everything is working as it should.
68-
Note also ``monmap e3: 3 mons at...`` which means all three monitor containers are up and responding,
69-
``mdsmap e10: 1/1/1 up...`` which means all three metadata containers are up and responding,
70-
and ``osdmap e7: 3 osds: 3 up, 3 in`` which means all three daemon containers are up and running.
71-
72-
We can also see from the ``pgmap`` that we have 1344 placement groups, all of which are ``active+clean``.
73-
74-
For additional information on troubleshooting Ceph, see `troubleshooting`_. Common issues with
75-
specific store components are detailed below.
76-
77-
store-monitor
78-
~~~~~~~~~~~~~
79-
80-
The monitor is the first store component to start, and is required for any of the other store
81-
components to function properly. If a ``deisctl list`` indicates that any of the monitors are failing,
82-
it is likely due to a host issue. Common failure scenarios include not
83-
having adequate free storage on the host node - in that case, monitors will fail with errors similar to:
84-
85-
.. code-block:: console
86-
87-
Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053693 7fd0586a6700 0 mon.deis-staging-node1@0(leader).data_health(6) update_stats avail 1% total 5960684 used 56655
88-
Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053770 7fd0586a6700 -1 mon.deis-staging-node1@0(leader).data_health(6) reached critical levels of available space on
89-
Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053772 7fd0586a6700 0 ** Shutdown via Data Health Service **
90-
Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053821 7fd056ea3700 -1 mon.deis-staging-node1@0(leader) e3 *** Got Signal Interrupt ***
91-
Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053834 7fd056ea3700 1 mon.deis-staging-node1@0(leader) e3 shutdown
92-
Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.054000 7fd056ea3700 0 quorum service shutdown
93-
Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.054002 7fd056ea3700 0 mon.deis-staging-node1@0(shutdown).health(6) HealthMonitor::service_shutdown 1 services
94-
Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.054065 7fd056ea3700 0 quorum service shutdown
95-
96-
This is typically only an issue when deploying Deis on bare metal, as most cloud providers have adequately
97-
large volumes.
98-
99-
store-daemon
100-
~~~~~~~~~~~~
101-
102-
The daemons are responsible for actually storing the data on the filesystem. The cluster is configured
103-
to allow writes with just one daemon running, but the cluster will be running in a degraded state, so
104-
restoring all daemons to a running state as quickly as possible is paramount.
105-
106-
Daemons can be safely restarted with ``deisctl restart store-daemon``, but this will restart all daemons,
107-
resulting in downtime of the storage cluster until the daemons recover. Alternatively, issuing a
108-
``sudo systemctl restart deis-store-daemon`` on the host of the failing daemon will restart just
109-
that daemon.
110-
111-
store-gateway
112-
~~~~~~~~~~~~~
113-
114-
The gateway runs Apache and a FastCGI server to communicate with the cluster. Restarting the gateway
115-
will result in a short downtime for the registry component (and will prevent the database from
116-
backing up), but those components should recover as soon as the gateway comes back up.
117-
118-
store-metadata
119-
~~~~~~~~~~~~~~
120-
121-
The metadata servers are required for the **volume** to function properly. Only one is active at
122-
any one time, and the rest operate as hot standbys. The monitors will promote a standby metadata
123-
server should the active one fail.
124-
125-
store-volume
126-
~~~~~~~~~~~~
127-
128-
Without functioning monitors, daemons, and metadata servers, the volume service will likely hang
129-
indefinitely (or restart constantly). If the controller or logger happen to be running on a host with a
130-
failing store-volume, application logs will be lost until the volume recovers.
131-
132-
Note that store-volume requires CoreOS >= 471.1.0 for the CephFS kernel module.
47+
For information on troubleshooting a ``deis-store`` component, see :ref:`troubleshooting-store`.
13348

13449
Any component fails to start
13550
----------------------------
@@ -178,6 +93,4 @@ Other issues
17893

17994
Running into something not detailed here? Please `open an issue`_ or hop into #deis on Freenode IRC and we'll help!
18095

181-
.. _`Ceph FS`: https://ceph.com/docs/giant/cephfs/
18296
.. _`open an issue`: https://github.com/deis/deis/issues/new
183-
.. _`troubleshooting`: http://docs.ceph.com/docs/giant/rados/troubleshooting/
Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
1+
:title: Troubleshooting deis-store
2+
:description: Resolutions for common issues with deis-store and Ceph.
3+
4+
.. _troubleshooting-store:
5+
6+
Troubleshooting deis-store
7+
==========================
8+
9+
The store component is the most complex component of Deis. As such, there are many ways for it to fail.
10+
Recall that the store components represent Ceph services as follows:
11+
12+
* ``store-monitor``: http://ceph.com/docs/giant/man/8/ceph-mon/
13+
* ``store-daemon``: http://ceph.com/docs/giant/man/8/ceph-osd/
14+
* ``store-gateway``: http://ceph.com/docs/giant/radosgw/
15+
* ``store-metadata``: http://ceph.com/docs/giant/man/8/ceph-mds/
16+
* ``store-volume``: a system service which mounts a `Ceph FS`_ volume to be used by the controller and logger components
17+
18+
Log output for store components can be viewed with ``deisctl status store-<component>`` (such as
19+
``deisctl status store-volume``). Additionally, the Ceph health can be queried by using the ``deis-store-admin``
20+
administrative container to access the cluster.
21+
22+
.. _using-store-admin:
23+
24+
Using store-admin
25+
-----------------
26+
27+
``deis-store-admin`` is an optional component that is helpful when diagnosing problems with ``deis-store``.
28+
It contains the ``ceph`` client and writes the necessary Ceph configuration files so it always has the
29+
most up-to-date configuration for the cluster.
30+
31+
To use ``deis-store-admin``, install and start it with ``deisctl``:
32+
33+
.. code-block:: console
34+
35+
$ deisctl install store-admin
36+
$ deisctl start store-admin
37+
38+
The container will now be running on all hosts in the cluster. Log into any of the hosts, enter
39+
the container with ``nse deis-store-admin``, and then issue a ``ceph -s`` to query the cluster's health.
40+
41+
The output should be similar to the following:
42+
43+
.. code-block:: console
44+
45+
core@deis-1 ~ $ nse deis-store-admin
46+
root@deis-1:/# ceph -s
47+
cluster 20038e38-4108-4e79-95d4-291d0eef2949
48+
health HEALTH_OK
49+
monmap e3: 3 mons at {deis-1=172.17.8.100:6789/0,deis-2=172.17.8.101:6789/0,deis-3=172.17.8.102:6789/0}, election epoch 16, quorum 0,1,2 deis-1,deis-2,deis-3
50+
mdsmap e10: 1/1/1 up {0=deis-2=up:active}, 2 up:standby
51+
osdmap e36: 3 osds: 3 up, 3 in
52+
pgmap v2096: 1344 pgs, 12 pools, 369 MB data, 448 objects
53+
24198 MB used, 23659 MB / 49206 MB avail
54+
1344 active+clean
55+
56+
If you see ``HEALTH_OK``, this means everything is working as it should.
57+
Note also ``monmap e3: 3 mons at...`` which means all three monitor containers are up and responding,
58+
``mdsmap e10: 1/1/1 up...`` with ``2 up:standby`` which means all three metadata containers are up and responding (one active, two on standby),
59+
and ``osdmap e36: 3 osds: 3 up, 3 in`` which means all three daemon containers are up and running.
60+
61+
We can also see from the ``pgmap`` that we have 1344 placement groups, all of which are ``active+clean``.
62+
63+
For additional information on troubleshooting Ceph, see `troubleshooting`_. Common issues with
64+
specific store components are detailed below.
65+
66+
store-monitor
67+
-------------
68+
69+
The monitor is the first store component to start, and is required for any of the other store
70+
components to function properly. If a ``deisctl list`` indicates that any of the monitors are failing,
71+
it is likely due to a host issue. A common failure scenario is not
72+
having adequate free storage on the host node. In that case, monitors will fail with errors similar to:
73+
74+
.. code-block:: console
75+
76+
Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053693 7fd0586a6700 0 mon.deis-staging-node1@0(leader).data_health(6) update_stats avail 1% total 5960684 used 56655
77+
Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053770 7fd0586a6700 -1 mon.deis-staging-node1@0(leader).data_health(6) reached critical levels of available space on
78+
Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053772 7fd0586a6700 0 ** Shutdown via Data Health Service **
79+
Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053821 7fd056ea3700 -1 mon.deis-staging-node1@0(leader) e3 *** Got Signal Interrupt ***
80+
Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053834 7fd056ea3700 1 mon.deis-staging-node1@0(leader) e3 shutdown
81+
Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.054000 7fd056ea3700 0 quorum service shutdown
82+
Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.054002 7fd056ea3700 0 mon.deis-staging-node1@0(shutdown).health(6) HealthMonitor::service_shutdown 1 services
83+
Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.054065 7fd056ea3700 0 quorum service shutdown
84+
85+
This is typically only an issue when deploying Deis on bare metal, as most cloud providers have adequately
86+
large volumes.
87+
88+
store-daemon
89+
------------
90+
91+
The daemons are responsible for actually storing the data on the filesystem. The cluster is configured
92+
to allow writes with just one daemon running, but the cluster will be running in a degraded state, so
93+
restoring all daemons to a running state as quickly as possible is paramount.
94+
95+
Daemons can be safely restarted with ``deisctl restart store-daemon``, but this will restart all daemons,
96+
resulting in downtime of the storage cluster until the daemons recover. Alternatively, issuing a
97+
``sudo systemctl restart deis-store-daemon`` on the host of the failing daemon will restart just
98+
that daemon.
99+
100+
store-gateway
101+
-------------
102+
103+
The gateway runs Apache and a FastCGI server to communicate with the cluster. Restarting the gateway
104+
will result in a short downtime for the registry component (and will prevent the database from
105+
backing up), but those components should recover as soon as the gateway comes back up.
106+
107+
store-metadata
108+
--------------
109+
110+
The metadata servers are required for the **volume** to function properly. Only one is active at
111+
any one time, and the rest operate as hot standbys. The monitors will promote a standby metadata
112+
server should the active one fail.
113+
114+
store-volume
115+
------------
116+
117+
Without functioning monitors, daemons, and metadata servers, the volume service will likely hang
118+
indefinitely (or restart constantly). If the controller or logger happens to be running on a host with a
119+
failing store-volume, application logs will be lost until the volume recovers.
120+
121+
Note that store-volume requires CoreOS >= 471.1.0 for the CephFS kernel module.
122+
123+
.. _`Ceph FS`: https://ceph.com/docs/giant/cephfs/
124+
.. _`troubleshooting`: http://docs.ceph.com/docs/giant/rados/troubleshooting/
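
The health checks described under "Using store-admin" can be scripted. The following is a minimal sketch, assuming the ``ceph -s`` output has the layout shown in the example above. The function names ``ceph_health`` and ``all_osds_up_in`` are hypothetical helpers, not part of Deis or Ceph, and other Ceph releases may format the status differently.

```shell
#!/bin/sh
# Hypothetical helpers that read `ceph -s` output on stdin.
# Assumes the status layout shown in the example output above.

# Print the health token (for example HEALTH_OK).
ceph_health() {
  awk '$1 == "health" { print $2; exit }'
}

# Succeed only if the osdmap line reports every OSD as both up and in.
# The line is expected to look like: osdmap e36: 3 osds: 3 up, 3 in
all_osds_up_in() {
  awk '$1 == "osdmap" { ok = ($3 == $5 && $3 == $7); exit }
       END { exit (ok ? 0 : 1) }'
}

# Example use inside the deis-store-admin container:
#   ceph -s | ceph_health
#   ceph -s | all_osds_up_in || echo "some OSDs are down or out"
```

Parsing the text output keeps the sketch dependency-free, but it is tied to this exact layout; anything other than ``HEALTH_OK`` should still be investigated by reading the full ``ceph -s`` output.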
