Merge pull request #2953 from carmstrong/docs-ceph_quorum

carmstrong · carmstrong · commit 901523fc77e7 · 2015-01-26T09:30:06.000-08:00
docs(*): add Ceph quorum documentation
diff --git a/docs/managing_deis/add_remove_host.rst b/docs/managing_deis/add_remove_host.rst
@@ -25,11 +25,11 @@ Inspecting health
 -----------------
 
 Before we begin, we should check the state of the Ceph cluster to be sure it's healthy.
-We can do this by logging into any machine in the cluster, entering a store container, and then querying Ceph:
+To do this, we use ``deis-store-admin`` - see :ref:`using-store-admin`.
 
 .. code-block:: console
 
-    core@deis-1 ~ $ nse deis-store-monitor
+    core@deis-1 ~ $ nse deis-store-admin
     root@deis-1:/# ceph -s
         cluster 20038e38-4108-4e79-95d4-291d0eef2949
          health HEALTH_OK
@@ -111,6 +111,8 @@ that the store services on this host will be leaving the cluster.
 In this example we're going to remove the first node in our cluster, deis-1.
 That machine has an IP address of ``172.17.8.100``.
 
+.. _removing_an_osd:
+
 Removing an OSD
 ~~~~~~~~~~~~~~~
 
@@ -130,7 +132,7 @@ on any host in the cluster (except the one we're removing). In this example, I a
 
 .. code-block:: console
 
-    core@deis-2 ~ $ nse deis-store-monitor
+    core@deis-2 ~ $ nse deis-store-admin
     root@deis-2:/# ceph osd out 2
     marked out osd.2.
 
@@ -178,7 +180,7 @@ Back inside a store container on ``deis-2``, we can finally remove the OSD:
 
 .. code-block:: console
 
-    core@deis-2 ~ $ nse deis-store-monitor
+    core@deis-2 ~ $ nse deis-store-admin
     root@deis-2:/# ceph osd crush remove osd.2
     removed item id 2 name 'osd.2' from crush map
     root@deis-2:/# ceph auth del osd.2
@@ -196,7 +198,7 @@ That's it! If we inspect the health, we see that there are now 3 osds again, and
 
 .. code-block:: console
 
-    core@deis-2 ~ $ nse deis-store-monitor
+    core@deis-2 ~ $ nse deis-store-admin
     root@deis-2:/# ceph -s
         cluster 20038e38-4108-4e79-95d4-291d0eef2949
          health HEALTH_OK
@@ -231,7 +233,7 @@ Back on another host, we can again enter a store container and then remove this
 
 .. code-block:: console
 
-    core@deis-2 ~ $ nse deis-store-monitor
+    core@deis-2 ~ $ nse deis-store-admin
     root@deis-2:/# ceph mon remove deis-1
     removed mon.deis-1 at 172.17.8.100:6789/0, there are now 3 monitors
     2014-11-04 06:57:59.712934 7f04bc942700  0 monclient: hunting for new mon
diff --git a/docs/managing_deis/index.rst b/docs/managing_deis/index.rst
@@ -18,6 +18,7 @@ Managing Deis
     operational_tasks
     platform_logging
     platform_monitoring
+    recovering-ceph-quorum
     security_considerations
     ssl-endpoints
     upgrading-deis
diff --git a/docs/managing_deis/recovering-ceph-quorum.rst b/docs/managing_deis/recovering-ceph-quorum.rst
@@ -0,0 +1,49 @@
+:title: Recovering Ceph quorum
+:description: Additional information for recovering clusters once Ceph has lost quorum.
+
+.. _recovering-ceph-quorum:
+
+Recovering Ceph quorum
+======================
+
+Ceph relies on `Paxos`_ to maintain a quorum among monitor services so that they agree on cluster state.
+In some cases Ceph can lose quorum, such as when hosts are added to and removed from the cluster in
+quick succession without removing the old hosts from Ceph (see :ref:`add_remove_host`).
+
+A telltale sign of quorum loss is that ``ceph -s``, the command for querying cluster health, times out
+with monitor faults on every host in the cluster.
+
+.. important::
+
+    Ceph refusing to do anything when it has lost quorum is a safety precaution to prevent you
+    from losing data. Attempting to recover from this situation requires knowledge about the state
+    of your cluster, and should only be attempted if data loss is not considered catastrophic (such as
+    when a recent backup is available). When in doubt, consult the Ceph and Deis communities for
+    assistance. Deis recommends regular backups to minimize impact should an issue like this occur.
+    For more information, see :ref:`backing_up_data`.
+
+The instructions below are intentionally vague, as each recovery scenario will be unique. They are
+intended only to point users in the right direction for recovery.
+
+To recover from Ceph quorum loss:
+
+#. Suspect quorum loss because ``ceph -s`` shows nothing but timeouts and/or monitor faults
+#. Following :ref:`using-store-admin`, use the Ceph `admin socket`_ to query the `mon status`_ and confirm that there are enough stale entries to prevent Ceph from gaining quorum
+#. Stop the platform with ``deisctl stop platform`` so components stop trying to write data to store (note that instead, manually stopping all components except router will allow application containers to remain up, unaffected)
+#. Clean up stale entries in ``/deis/store/hosts`` so that dead monitors are not written out to clients
+#. Update ``/deis/store/monSetupLock`` to point to the healthy monitor -- note that this isn't strictly necessary, as this value is only used if wiping clean and starting a fresh cluster from scratch with no data, but it's good cleanup
+#. Start the healthy monitor and use the admin socket to get the current state of the cluster
+#. Given the cluster state as the monitor sees it, use `monmaptool`_ to manually remove stale monitor entries from the monmap (e.g. ``monmaptool --rm mon.<hostname> --clobber /etc/ceph/monmap``)
+#. Stop the healthy monitor and use ``deis-store-admin`` to inject the prepared monmap into the monitor with ``ceph-mon -i <hostname> --inject-monmap /etc/ceph/monmap``
+#. Start the monitor and ensure it achieves quorum by itself (use ``ceph -s`` and/or query mon_status on the admin socket)
+#. Start the other monitors and ensure they connect
+#. Start the OSDs with ``deisctl start store-daemon``
+#. Observe the OSD map with ``ceph osd dump`` -- for each OSD that is no longer with us, follow :ref:`removing_an_osd` -- take care to ensure that the data is relocated (watch the health with ``ceph -w``) before marking another OSD as ``out``
+#. Once the OSD map reflects the now-healthy OSDs, start the remaining store services in order: ``deisctl start store-metadata`` and ``deisctl start store-gateway``
+#. Confirm that the cluster is healthy with the metadata servers added, and then start ``store-volume`` with ``deisctl start store-volume``
+#. Start the remaining services with ``deisctl start platform``
+
+.. _`admin socket`: http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/#using-the-monitor-s-admin-socket
+.. _`mon status`: http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/#understanding-mon-status
+.. _`monmaptool`: http://ceph.com/docs/master/man/8/monmaptool/
+.. _`Paxos`: http://en.wikipedia.org/wiki/Paxos_%28computer_science%29
diff --git a/docs/troubleshooting_deis/index.rst b/docs/troubleshooting_deis/index.rst
@@ -6,6 +6,13 @@
 Troubleshooting Deis
 ====================
 
+:Release: |version|
+:Date: |today|
+
+.. toctree::
+
+    troubleshooting-store
+
 Common issues that users have run into when provisioning Deis are detailed below.
 
 Logging in to the cluster
@@ -37,99 +44,7 @@ which will lead to issues running Deis successfully.
 A deis-store component fails to start
 -------------------------------------
 
-The store component is the most complex component of Deis. As such, there are many ways for it to fail.
-Recall that the store components represent Ceph services as follows:
-
-* ``store-monitor``: http://ceph.com/docs/giant/man/8/ceph-mon/
-* ``store-daemon``: http://ceph.com/docs/giant/man/8/ceph-osd/
-* ``store-gateway``: http://ceph.com/docs/giant/radosgw/
-* ``store-metadata``: http://ceph.com/docs/giant/man/8/ceph-mds/
-* ``store-volume``: a system service which mounts a `Ceph FS`_ volume to be used by the controller and logger components
-
-Log output for store components can be viewed with ``deisctl status store-<component>`` (such as
-``deisctl status store-volume``). Additionally, the Ceph health can be queried by entering
-a store container with ``nse deis-store-monitor`` and then issuing a ``ceph -s``. This should output the
-health of the cluster like:
-
-.. code-block:: console
-
-    core@deis-1 ~ $ nse deis-store-monitor
-    root@deis-1:/# ceph -s
-        cluster 20038e38-4108-4e79-95d4-291d0eef2949
-         health HEALTH_OK
-         monmap e3: 3 mons at {deis-1=172.17.8.100:6789/0,deis-2=172.17.8.101:6789/0,deis-3=172.17.8.102:6789/0}, election epoch 16, quorum 0,1,2 deis-1,deis-2,deis-3
-         mdsmap e10: 1/1/1 up {0=deis-2=up:active}, 2 up:standby
-         osdmap e36: 3 osds: 3 up, 3 in
-          pgmap v2096: 1344 pgs, 12 pools, 369 MB data, 448 objects
-                24198 MB used, 23659 MB / 49206 MB avail
-                1344 active+clean
-
-If you see ``HEALTH_OK``, this means everything is working as it should.
-Note also ``monmap e3: 3 mons at...`` which means all three monitor containers are up and responding,
-``mdsmap e10: 1/1/1 up...`` which means all three metadata containers are up and responding,
-and ``osdmap e7: 3 osds: 3 up, 3 in`` which means all three daemon containers are up and running.
-
-We can also see from the ``pgmap`` that we have 1344 placement groups, all of which are ``active+clean``.
-
-For additional information on troubleshooting Ceph, see `troubleshooting`_. Common issues with
-specific store components are detailed below.
-
-store-monitor
-~~~~~~~~~~~~~
-
-The monitor is the first store component to start, and is required for any of the other store
-components to function properly. If a ``deisctl list`` indicates that any of the monitors are failing,
-it is likely due to a host issue. Common failure scenarios include not
-having adequate free storage on the host node - in that case, monitors will fail with errors similar to:
-
-.. code-block:: console
-
-  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053693 7fd0586a6700  0 mon.deis-staging-node1@0(leader).data_health(6) update_stats avail 1% total 5960684 used 56655
-  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053770 7fd0586a6700 -1 mon.deis-staging-node1@0(leader).data_health(6) reached critical levels of available space on
-  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053772 7fd0586a6700  0 ** Shutdown via Data Health Service **
-  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053821 7fd056ea3700 -1 mon.deis-staging-node1@0(leader) e3 *** Got Signal Interrupt ***
-  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053834 7fd056ea3700  1 mon.deis-staging-node1@0(leader) e3 shutdown
-  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.054000 7fd056ea3700  0 quorum service shutdown
-  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.054002 7fd056ea3700  0 mon.deis-staging-node1@0(shutdown).health(6) HealthMonitor::service_shutdown 1 services
-  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.054065 7fd056ea3700  0 quorum service shutdown
-
-This is typically only an issue when deploying Deis on bare metal, as most cloud providers have adequately
-large volumes.
-
-store-daemon
-~~~~~~~~~~~~
-
-The daemons are responsible for actually storing the data on the filesystem. The cluster is configured
-to allow writes with just one daemon running, but the cluster will be running in a degraded state, so
-restoring all daemons to a running state as quickly as possible is paramount.
-
-Daemons can be safely restarted with ``deisctl restart store-daemon``, but this will restart all daemons,
-resulting in downtime of the storage cluster until the daemons recover. Alternatively, issuing a
-``sudo systemctl restart deis-store-daemon`` on the host of the failing daemon will restart just
-that daemon.
-
-store-gateway
-~~~~~~~~~~~~~
-
-The gateway runs Apache and a FastCGI server to communicate with the cluster. Restarting the gateway
-will result in a short downtime for the registry component (and will prevent the database from
-backing up), but those components should recover as soon as the gateway comes back up.
-
-store-metadata
-~~~~~~~~~~~~~~
-
-The metadata servers are required for the **volume** to function properly. Only one is active at
-any one time, and the rest operate as hot standbys. The monitors will promote a standby metadata
-server should the active one fail.
-
-store-volume
-~~~~~~~~~~~~
-
-Without functioning monitors, daemons, and metadata servers, the volume service will likely hang
-indefinitely (or restart constantly). If the controller or logger happen to be running on a host with a
-failing store-volume, application logs will be lost until the volume recovers.
-
-Note that store-volume requires CoreOS >= 471.1.0 for the CephFS kernel module.
+For information on troubleshooting a ``deis-store`` component, see :ref:`troubleshooting-store`.
 
 Any component fails to start
 ----------------------------
@@ -178,6 +93,4 @@ Other issues
 
 Running into something not detailed here? Please `open an issue`_ or hop into #deis on Freenode IRC and we'll help!
 
-.. _`Ceph FS`: https://ceph.com/docs/giant/cephfs/
 .. _`open an issue`: https://github.com/deis/deis/issues/new
-.. _`troubleshooting`: http://docs.ceph.com/docs/giant/rados/troubleshooting/
diff --git a/docs/troubleshooting_deis/troubleshooting-store.rst b/docs/troubleshooting_deis/troubleshooting-store.rst
@@ -0,0 +1,130 @@
+:title: Troubleshooting deis-store
+:description: Resolutions for common issues with deis-store and Ceph.
+
+.. _troubleshooting-store:
+
+Troubleshooting deis-store
+==========================
+
+The store component is the most complex component of Deis. As such, there are many ways for it to fail.
+Recall that the store components represent Ceph services as follows:
+
+* ``store-monitor``: http://ceph.com/docs/giant/man/8/ceph-mon/
+* ``store-daemon``: http://ceph.com/docs/giant/man/8/ceph-osd/
+* ``store-gateway``: http://ceph.com/docs/giant/radosgw/
+* ``store-metadata``: http://ceph.com/docs/giant/man/8/ceph-mds/
+* ``store-volume``: a system service which mounts a `Ceph FS`_ volume to be used by the controller and logger components
+
+Log output for store components can be viewed with ``deisctl status store-<component>`` (such as
+``deisctl status store-volume``). Additionally, the Ceph health can be queried by using the ``deis-store-admin``
+administrative container to access the cluster.
+
+.. _using-store-admin:
+
+Using store-admin
+-----------------
+
+``deis-store-admin`` is an optional component that is helpful when diagnosing problems with ``deis-store``.
+It contains the ``ceph`` client and writes the necessary Ceph configuration files so it always has the
+most up-to-date configuration for the cluster.
+
+To use ``deis-store-admin``, install and start it with ``deisctl``:
+
+.. code-block:: console
+
+    $ deisctl install store-admin
+    $ deisctl start store-admin
+
+The container will now be running on all hosts in the cluster. Log into any of the hosts, enter
+the container with ``nse deis-store-admin``, and then issue a ``ceph -s`` to query the cluster's health.
+
+The output should be similar to the following:
+
+.. code-block:: console
+
+    core@deis-1 ~ $ nse deis-store-admin
+    root@deis-1:/# ceph -s
+        cluster 20038e38-4108-4e79-95d4-291d0eef2949
+         health HEALTH_OK
+         monmap e3: 3 mons at {deis-1=172.17.8.100:6789/0,deis-2=172.17.8.101:6789/0,deis-3=172.17.8.102:6789/0}, election epoch 16, quorum 0,1,2 deis-1,deis-2,deis-3
+         mdsmap e10: 1/1/1 up {0=deis-2=up:active}, 2 up:standby
+         osdmap e36: 3 osds: 3 up, 3 in
+          pgmap v2096: 1344 pgs, 12 pools, 369 MB data, 448 objects
+                24198 MB used, 23659 MB / 49206 MB avail
+                1344 active+clean
+
+If you see ``HEALTH_OK``, this means everything is working as it should.
+Note also ``monmap e3: 3 mons at...`` which means all three monitor containers are up and responding,
+``mdsmap e10: 1/1/1 up...`` which means all three metadata containers are up and responding,
+and ``osdmap e36: 3 osds: 3 up, 3 in`` which means all three daemon containers are up and running.
+
+We can also see from the ``pgmap`` that we have 1344 placement groups, all of which are ``active+clean``.
+
+For additional information on troubleshooting Ceph, see `troubleshooting`_. Common issues with
+specific store components are detailed below.
+
+.. note::
+
+    If all of the ``ceph`` client commands seem to be hanging and the output is solely monitor
+    faults, the cluster may have lost quorum and manual intervention is necessary to recover.
+    For more information, see :ref:`recovering-ceph-quorum`.
+
+store-monitor
+-------------
+
+The monitor is the first store component to start, and is required for any of the other store
+components to function properly. If a ``deisctl list`` indicates that any of the monitors are failing,
+it is likely due to a host issue. Common failure scenarios include not
+having adequate free storage on the host node - in that case, monitors will fail with errors similar to:
+
+.. code-block:: console
+
+  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053693 7fd0586a6700  0 mon.deis-staging-node1@0(leader).data_health(6) update_stats avail 1% total 5960684 used 56655
+  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053770 7fd0586a6700 -1 mon.deis-staging-node1@0(leader).data_health(6) reached critical levels of available space on
+  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053772 7fd0586a6700  0 ** Shutdown via Data Health Service **
+  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053821 7fd056ea3700 -1 mon.deis-staging-node1@0(leader) e3 *** Got Signal Interrupt ***
+  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.053834 7fd056ea3700  1 mon.deis-staging-node1@0(leader) e3 shutdown
+  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.054000 7fd056ea3700  0 quorum service shutdown
+  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.054002 7fd056ea3700  0 mon.deis-staging-node1@0(shutdown).health(6) HealthMonitor::service_shutdown 1 services
+  Oct 29 20:04:00 deis-staging-node1 sh[1158]: 2014-10-29 20:04:00.054065 7fd056ea3700  0 quorum service shutdown
+
+This is typically only an issue when deploying Deis on bare metal, as most cloud providers have adequately
+large volumes.
+
+store-daemon
+------------
+
+The daemons are responsible for actually storing the data on the filesystem. The cluster is configured
+to allow writes with just one daemon running, but the cluster will be running in a degraded state, so
+restoring all daemons to a running state as quickly as possible is paramount.
+
+Daemons can be safely restarted with ``deisctl restart store-daemon``, but this will restart all daemons,
+resulting in downtime of the storage cluster until the daemons recover. Alternatively, issuing a
+``sudo systemctl restart deis-store-daemon`` on the host of the failing daemon will restart just
+that daemon.
+
+store-gateway
+-------------
+
+The gateway runs Apache and a FastCGI server to communicate with the cluster. Restarting the gateway
+will result in a short downtime for the registry component (and will prevent the database from
+backing up), but those components should recover as soon as the gateway comes back up.
+
+store-metadata
+--------------
+
+The metadata servers are required for the **volume** to function properly. Only one is active at
+any one time, and the rest operate as hot standbys. The monitors will promote a standby metadata
+server should the active one fail.
+
+store-volume
+------------
+
+Without functioning monitors, daemons, and metadata servers, the volume service will likely hang
+indefinitely (or restart constantly). If the controller or logger happens to be running on a host with a
+failing store-volume, application logs will be lost until the volume recovers.
+
+Note that store-volume requires CoreOS >= 471.1.0 for the CephFS kernel module.
+
+.. _`Ceph FS`: https://ceph.com/docs/giant/cephfs/
+.. _`troubleshooting`: http://docs.ceph.com/docs/giant/rados/troubleshooting/