Merge pull request #4277 from Joshua-Anderson/node-failover

mboersma · mboersma · commit a08d4333f306 · 2015-08-19T10:57:13.000-06:00
docs(understanding_deis): add documentation on node failover
diff --git a/docs/managing_deis/production_deployments.rst b/docs/managing_deis/production_deployments.rst
@@ -31,6 +31,8 @@ Running Deis without Ceph
 See :ref:`running-deis-without-ceph` for details on removing this operational
 complexity.
 
+.. _preseeding_continers:
+
 Preseeding containers
 ---------------------
 
diff --git a/docs/understanding_deis/failover.rst b/docs/understanding_deis/failover.rst
@@ -0,0 +1,55 @@
+:title: Node Failover in Deis
+:description: Describes how Deis nodes failover
+
+.. _failover:
+
+Failover
+========
+
+Three Node Cluster
+------------------
+
+Losing One of Three Nodes
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Losing one of three nodes will have the following effects:
+
+- Ceph will enter a health warn state but will continue to function.
+- Anything scheduled on the downed node will be rescheduled to the other two nodes.
+  If your remaining nodes don't have the resources to run the new units, this could
+  take down the entire platform
+- When you scale up to three nodes again, Ceph and Etcd will still think one member is down.
+  You will need to manually remove the downed node from Ceph and Etcd.
+
+Losing Two of Three Nodes
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Losing two of three nodes will have the following effects:
+
+- Ceph will enter a degraded state and go into read-only mode.
+- Etcd will enter a degraded state and go into read-only mode.
+- Anything scheduled on the downed node will be rescheduled to remaining node.
+  If your remaining node doesn't have the resources to run the new units, this could
+  take down the entire platform.
+- When you scale up to three nodes again, Ceph and Etcd will still think two members are down.
+  You will need to manually remove the downed nodes from Ceph and Etcd.
+
+Larger Clusters
+---------------
+
+If you have more than three nodes, Deis can tolerate node failure without issue.
+Here are a few things to keep in mind:
+
+- You have to manually remove downed nodes from Etcd and Ceph. Ceph and Etcd think downed nodes
+  might still be functioning but out of communication with the main cluster. If you don't remove
+  downed nodes, they could eventually outnumber running nodes. This will cause Ceph and etcd to go
+  into read only mode to prevent a split brained cluster.
+- Ceph on Deis stores three replicas of all data. If a node goes down, Ceph doesn't replicate the data on
+  that node because it expects the node will come back. Manually removing the node will resolve this.
+- You should use the preseed script to automatically download the control and data plane on every node.
+  This way if a unit is rescheduled (like if a node goes down) it just had to be started, not downloaded,
+  reducing failover time to seconds, not minutes. See :ref:`preseeding_continers` for further details.
+- If the database is rescheduled, it has to go through a recovery process wherever it is rescheduled, causing
+  controller downtime (generally less than a minute).
+- User apps should be scaled to reside on multiple hosts. That way, if one node goes down your app will continue to
+  function without downtime.
diff --git a/docs/understanding_deis/index.rst b/docs/understanding_deis/index.rst
@@ -11,3 +11,4 @@ Understanding Deis
     concepts
     architecture
     components
+    failover