Commit 6d43571

docs(*): add Troubleshooting Deis docs
1 parent 8ec4b31 commit 6d43571

5 files changed

Lines changed: 148 additions & 42 deletions

CONTRIBUTING.md

Lines changed: 8 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -14,6 +14,13 @@ right to make the contribution.
1414

1515
# Support Channels
1616

17+
Before opening a new issue, it's helpful to search the project - it's likely that another user
18+
has already reported the issue you're facing, or it's a known issue that we're already aware of.
19+
20+
Additionally, see the [Troubleshooting Deis][troubleshooting] documentation for common issues.
21+
22+
Our official support channels are:
23+
1724
- GitHub issues: https://github.com/deis/deis/issues/new
1825
- IRC: #[deis](irc://irc.freenode.org:6667/#deis) IRC channel on freenode.org
1926

@@ -88,3 +95,4 @@ For more details see the [commit style guide][style-guide].
8895

8996
[dco]: DCO
9097
[style-guide]: http://docs.deis.io/en/latest/contributing/standards/#commit-style-guide
98+
[troubleshooting]: http://docs.deis.io/en/latest/troubleshooting_deis/

README.md

Lines changed: 2 additions & 15 deletions
@@ -197,21 +197,8 @@ Learn how to [hack on Deis](http://docs.deis.io/en/latest/contributing/hacking/)
197197

198198
## Troubleshooting
199199

200-
Common issues that users have run into when provisioning Deis are detailed below.
201-
202-
#### When running a `deisctl` command - 'Failed initializing SSH client: ssh: handshake failed: ssh: unable to authenticate'
203-
Did you remember to add your SSH key to the ssh-agent? `ssh-add -L` should list the key you used to provision the servers. If it's not there, `ssh-add -K /path/to/your/key`.
204-
205-
#### When running a `deisctl` command - 'All the given peers are not reachable (Tried to connect to each peer twice and failed)'
206-
The most common cause of this issue is that a [new discovery URL](https://discovery.etcd.io/new) wasn't generated and updated in `contrib/coreos/user-data` before the cluster was launched. Each Deis cluster must have a unique discovery URL, or else `etcd` will try and fail to connect to old hosts. Try destroying the cluster and relaunching the cluster with a fresh discovery URL.
207-
208-
#### A Deis component fails to start
209-
Use `deisctl status <component>` to view the status of the component. You can also use `deisctl journal <component>` to tail logs for a component, or `deisctl list` to list all components.
210-
211-
The most common cause of services failing to start are sporadic issues with Docker Hub. We are exploring workarounds and are working with the Docker team to improve Docker Hub reliability. In the meantime, try starting the service again with `deisctl restart <component>`.
212-
213-
### Any other issues
214-
Running into something not detailed here? Please [open an issue](https://github.com/deis/deis/issues/new) or hop into [#deis](https://botbot.me/freenode/deis/) and we'll help!
200+
See the [Troubleshooting Deis](http://docs.deis.io/en/latest/troubleshooting_deis/) documentation for
201+
assistance with common issues.
215202

216203
## License
217204

docs/managing_deis/operational_tasks.rst

Lines changed: 0 additions & 27 deletions
@@ -6,31 +6,6 @@
66
Operational tasks
77
~~~~~~~~~~~~~~~~~
88

9-
Inspecting store
10-
================
11-
It is sometimes helpful to query the :Ref:`Store` component to ask about the health of the Ceph cluster.
12-
To do this, log into any machine running a ``store-monitor`` or ``store-daemon`` service. Then,
13-
``nse deis-store-monitor`` or ``nse deis-store-daemon`` and issue a ``ceph -s``. This should output the
14-
health of the cluster like:
15-
16-
.. code-block:: console
17-
18-
cluster 6506db0c-9eae-4bb6-a40a-95954dd3c4c3
19-
health HEALTH_OK
20-
monmap e3: 3 mons at {deis-1=172.17.8.100:6789/0,deis-2=172.17.8.101:6789/0,deis-3=172.17.8.102:6789/0}, election epoch 8, quorum 0,1,2 deis-1,deis-2,deis-3
21-
osdmap e7: 3 osds: 3 up, 3 in
22-
pgmap v14: 192 pgs, 3 pools, 0 bytes data, 0 objects
23-
19378 MB used, 28944 MB / 49200 MB avail
24-
192 active+clean
25-
26-
If you see ``HEALTH_OK``, this means everything is working as it should.
27-
Note also ``monmap e3: 3 mons at...`` which means all three monitor containers are up and responding,
28-
and ``osdmap e7: 3 osds: 3 up, 3 in`` which means all three daemon containers are up and running.
29-
30-
We can also see from the ``pgmap`` that we have 192 placement groups, all of which are ``active+clean``.
31-
32-
For additional information on troubleshooting Ceph, see `troubleshooting`_.
33-
349
Managing users
3510
==============
3611

@@ -49,5 +24,3 @@ You can use the ``deis perms`` command to promote a user to an administrator:
4924
.. code-block:: console
5025
5126
$ deis perms:create john --admin
52-
53-
.. _`troubleshooting`: http://docs.ceph.com/docs/firefly/rados/troubleshooting/

docs/toctree.rst

Lines changed: 1 addition & 0 deletions
@@ -15,6 +15,7 @@ This documentation has the following resources:
1515
installing_deis/index
1616
using_deis/index
1717
managing_deis/index
18+
troubleshooting_deis/index
1819
contributing/index
1920
reference/index
2021
faq
Lines changed: 137 additions & 0 deletions
@@ -0,0 +1,137 @@
1+
:title: Troubleshooting Deis
2+
:description: Resolutions for common issues encountered when running Deis.
3+
4+
.. _troubleshooting_deis:
5+
6+
Troubleshooting Deis
7+
====================
8+
9+
Common issues that users have run into when provisioning Deis are detailed below.
10+
11+
A deis-store component fails to start
12+
-------------------------------------
13+
14+
The store component is the most complex component of Deis. As such, there are many ways for it to fail.
15+
Recall that the store components represent Ceph services as follows:
16+
17+
* ``store-monitor``: http://ceph.com/docs/firefly/man/8/ceph-mon/
18+
* ``store-daemon``: http://ceph.com/docs/firefly/man/8/ceph-osd/
19+
* ``store-gateway``: http://ceph.com/docs/firefly/radosgw/
20+
* ``store-metadata``: http://ceph.com/docs/firefly/man/8/ceph-mds/
21+
* ``store-volume``: a system service which mounts a `Ceph FS`_ volume to be used by the controller and logger components
22+
23+
Log output for store components can be viewed with ``deisctl status store-<component>`` (such as
24+
``deisctl status store-volume``). Additionally, the Ceph health can be queried by entering
25+
a store container with ``nse deis-store-monitor`` and then running ``ceph -s``. This should print the
26+
health of the cluster, for example:
27+
28+
.. code-block:: console
29+
30+
cluster 6506db0c-9eae-4bb6-a40a-95954dd3c4c3
31+
health HEALTH_OK
32+
monmap e3: 3 mons at {deis-1=172.17.8.100:6789/0,deis-2=172.17.8.101:6789/0,deis-3=172.17.8.102:6789/0}, election epoch 8, quorum 0,1,2 deis-1,deis-2,deis-3
33+
osdmap e7: 3 osds: 3 up, 3 in
34+
pgmap v14: 192 pgs, 3 pools, 0 bytes data, 0 objects
35+
19378 MB used, 28944 MB / 49200 MB avail
36+
192 active+clean
37+
38+
If you see ``HEALTH_OK``, this means everything is working as it should.
39+
Note also ``monmap e3: 3 mons at...`` which means all three monitor containers are up and responding,
40+
and ``osdmap e7: 3 osds: 3 up, 3 in`` which means all three daemon containers are up and running.
41+
42+
We can also see from the ``pgmap`` that we have 192 placement groups, all of which are ``active+clean``.
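
The health check described above can be scripted. The following is a minimal sketch, assuming the output of ``ceph -s`` contains a ``health HEALTH_...`` line as in the sample; the function name ``summarize_ceph_status`` is hypothetical and not part of Deis or Ceph.

```shell
# Hypothetical helper: reads `ceph -s` output on stdin and reports whether
# the cluster is healthy. Assumes a line of the form "health HEALTH_...",
# as in the sample output above.
summarize_ceph_status() {
  # Take the second field of the first line whose first field is "health".
  health=$(awk '$1 == "health" { print $2; exit }')
  case "$health" in
    HEALTH_OK) echo "ok: $health"; return 0 ;;
    "")        echo "unknown: no health line found"; return 2 ;;
    *)         echo "attention: $health"; return 1 ;;
  esac
}

# Usage (inside the container entered with `nse deis-store-monitor`):
#   ceph -s | summarize_ceph_status
```

Any status other than ``HEALTH_OK`` returns a non-zero exit code, so the function can be used as a guard in a larger script.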
43+
44+
For additional information on troubleshooting Ceph, see `troubleshooting`_. Common issues with
45+
specific store components are detailed below.
46+
47+
store-monitor
48+
~~~~~~~~~~~~~
49+
50+
The monitor is the first store component to start, and is required for any of the other store
51+
components to function properly. If a ``deisctl list`` indicates that any of the monitors are failing,
52+
it is likely due to a host issue. A common failure scenario is inadequate free
53+
storage on the host node, in which case monitors fail with errors similar to:
54+
55+
.. code-block:: console
56+
57+
Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053693 7fd0586a6700 0 mon.deis-staging-node1@0(leader).data_health(6) update_stats avail 1% total 5960684 used 56655
58+
Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053770 7fd0586a6700 -1 mon.deis-staging-node1@0(leader).data_health(6) reached critical levels of available space on
59+
Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053772 7fd0586a6700 0 ** Shutdown via Data Health Service **
60+
Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053821 7fd056ea3700 -1 mon.deis-staging-node1@0(leader) e3 *** Got Signal Interrupt ***
61+
Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053834 7fd056ea3700 1 mon.deis-staging-node1@0(leader) e3 shutdown
62+
Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.054000 7fd056ea3700 0 quorum service shutdown
63+
Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.054002 7fd056ea3700 0 mon.deis-staging-node1@0(shutdown).health(6) HealthMonitor::service_shutdown 1 services
64+
Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.054065 7fd056ea3700 0 quorum service shutdown
65+
66+
This is typically only an issue when deploying Deis on bare metal, as most cloud providers have adequately
67+
large volumes.
68+
69+
store-daemon
70+
~~~~~~~~~~~~
71+
72+
The daemons are responsible for actually storing the data on the filesystem. The cluster is configured
73+
to allow writes with just one daemon running, but the cluster will be running in a degraded state, so
74+
restoring all daemons to a running state as quickly as possible is paramount.
75+
76+
Daemons can be safely restarted with ``deisctl restart store-daemon``, but this will restart all daemons,
77+
resulting in downtime of the storage cluster until the daemons recover. Alternatively, issuing a
78+
``sudo systemctl restart deis-store-daemon`` on the host of the failing daemon will restart just
79+
that daemon.
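
As a rough illustration of the single-host restart, the sketch below restarts a unit only when systemd reports it as not active. The function name ``restart_if_inactive`` is hypothetical; ``systemctl is-active --quiet`` is a standard systemd subcommand, and the unit name is the one used above.

```shell
# Hypothetical helper: restarts the given systemd unit only if it is not
# currently active. Run it on the host of the failing daemon.
restart_if_inactive() {
  unit="$1"
  if systemctl is-active --quiet "$unit"; then
    echo "$unit is active; nothing to do"
  else
    echo "$unit is not active; restarting"
    sudo systemctl restart "$unit"
  fi
}

# Usage:
#   restart_if_inactive deis-store-daemon
```

Checking first avoids restarting a healthy daemon and needlessly degrading the cluster.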
80+
81+
store-gateway
82+
~~~~~~~~~~~~~
83+
84+
The gateway runs Apache and a FastCGI server to communicate with the cluster. Restarting the gateway
85+
will result in a short downtime for the registry component (and will prevent the database from
86+
backing up), but those components should recover as soon as the gateway comes back up.
87+
88+
store-metadata
89+
~~~~~~~~~~~~~~
90+
91+
The metadata servers are required for the **volume** to function properly. Only one is active at
92+
any one time, and the rest operate as hot standbys. The monitors will promote a standby metadata
93+
server should the active one fail.
94+
95+
store-volume
96+
~~~~~~~~~~~~
97+
98+
Without functioning monitors, daemons, and metadata servers, the volume service will likely hang
99+
indefinitely (or restart constantly). If the controller or logger happen to be running on a host with a
100+
failing store-volume, application logs will be lost until the volume recovers.
101+
102+
Note that store-volume requires CoreOS >= 471.1.0 for the CephFS kernel module.
103+
104+
Any component fails to start
105+
----------------------------
106+
107+
Use ``deisctl status <component>`` to view the status of the component.
108+
You can also use ``deisctl journal <component>`` to tail logs for a component, or ``deisctl list``
109+
to list all components.
110+
111+
Failed initializing SSH client
112+
------------------------------
113+
114+
A ``deisctl`` command fails with: 'Failed initializing SSH client: ssh: handshake failed: ssh: unable to authenticate'.
115+
Did you remember to add your SSH key to the ssh-agent? ``ssh-add -L`` should list the key you used
116+
to provision the servers. If it's not there, add it with ``ssh-add -K /path/to/your/key``.
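
The ``ssh-add -L`` check can also be automated. This is a minimal sketch, assuming the public half of the key is available as a file in the usual OpenSSH format; the function name ``key_in_agent`` and the file path are placeholders.

```shell
# Hypothetical helper: succeeds if the public key stored in file $1 is
# currently loaded in the ssh-agent, according to `ssh-add -L`.
key_in_agent() {
  pubkey_file="$1"
  # Field 2 of an OpenSSH public key line is the base64-encoded key blob.
  blob=$(awk 'NR == 1 { print $2 }' "$pubkey_file")
  [ -n "$blob" ] || return 2
  ssh-add -L 2>/dev/null | grep -qF -- "$blob"
}

# Usage (the path is a placeholder):
#   key_in_agent /path/to/your/key.pub || echo "key is not loaded in the agent"
```

The comparison uses the key blob rather than the trailing comment, because the comment may differ between the file and the agent's listing.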
117+
118+
All the given peers are not reachable
119+
-------------------------------------
120+
121+
A ``deisctl`` command fails with: 'All the given peers are not reachable (Tried to connect to each peer twice and failed)'.
122+
The most common cause of this issue is that a `new discovery URL <https://discovery.etcd.io/new>`_
123+
wasn't generated and updated in ``contrib/coreos/user-data`` before the cluster was launched.
124+
Each Deis cluster must have a unique discovery URL, or else ``etcd`` will try and fail to connect to old hosts.
125+
Try destroying the cluster and relaunching it with a fresh discovery URL.
126+
127+
You can use ``make discovery-url`` to automatically fetch a new discovery URL.
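
For illustration only, the manual equivalent might look like the sketch below. It assumes the user-data file already contains a ``https://discovery.etcd.io/<token>`` URL to replace; the function name ``set_discovery_url`` is hypothetical, and this is not necessarily what the ``make`` target does.

```shell
# Hypothetical helper: rewrites every https://discovery.etcd.io/<token> URL
# in file $1 to the URL given as $2. Assumes the file already contains such
# a URL; adjust the pattern if your user-data differs.
set_discovery_url() {
  file="$1"
  new_url="$2"
  tmp=$(mktemp) || return 1
  # Use "|" as the sed delimiter because the URL contains slashes.
  sed "s|https://discovery\.etcd\.io/[A-Za-z0-9]*|$new_url|g" "$file" > "$tmp" &&
    cat "$tmp" > "$file"    # write back in place, keeping the file's permissions
  rc=$?
  rm -f "$tmp"
  return $rc
}

# Usage: fetch a fresh URL, then write it into the user-data file.
#   set_discovery_url contrib/coreos/user-data "$(curl -s https://discovery.etcd.io/new)"
```

The URL must be updated before the cluster is launched; changing it afterwards does not repair an existing cluster.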
128+
129+
Other issues
130+
------------
131+
132+
Running into something not detailed here? Please `open an issue`_ or hop into #deis on Freenode IRC and we'll help!
133+
134+
.. _`Ceph FS`: https://ceph.com/docs/firefly/cephfs/
135+
.. _`open an issue`: https://github.com/deis/deis/issues/new
136+
.. _`troubleshooting`: http://docs.ceph.com/docs/firefly/rados/troubleshooting/
137+
