Commit d7002c4

Joshua Anderson (Joshua-Anderson) authored and committed
docs(understanding_deis): add documentation on node failover
1 parent ae64f49 commit d7002c4

3 files changed

Lines changed: 58 additions & 0 deletions

docs/managing_deis/production_deployments.rst

Lines changed: 2 additions & 0 deletions
@@ -31,6 +31,8 @@ Running Deis without Ceph
 See :ref:`running-deis-without-ceph` for details on removing this operational
 complexity.
 
+.. _preseeding_continers:
+
 Preseeding containers
 ---------------------
 

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
:title: Node Failover in Deis
:description: Describes how Deis nodes fail over

.. _failover:

Failover
========

Three Node Cluster
------------------

Losing One of Three Nodes
^^^^^^^^^^^^^^^^^^^^^^^^^

Losing one of three nodes will have the following effects:

- Ceph will enter a ``HEALTH_WARN`` state but will continue to function.
- Anything scheduled on the downed node will be rescheduled to the other two nodes.
  If your remaining nodes don't have the resources to run the new units, this could
  take down the entire platform.
- When you scale up to three nodes again, Ceph and etcd will still think one member is down.
  You will need to manually remove the downed node from Ceph and etcd.

Losing Two of Three Nodes
^^^^^^^^^^^^^^^^^^^^^^^^^

Losing two of three nodes will have the following effects:

- Ceph will enter a degraded state and go into read-only mode.
- etcd will enter a degraded state and go into read-only mode.
- Anything scheduled on the downed nodes will be rescheduled to the remaining node.
  If your remaining node doesn't have the resources to run the new units, this could
  take down the entire platform.
- When you scale up to three nodes again, Ceph and etcd will still think two members are down.
  You will need to manually remove the downed nodes from Ceph and etcd.
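The difference between losing one node and losing two comes down to majority quorum: a cluster keeps accepting writes only while more than half of its registered members are reachable. The following is a minimal sketch of that arithmetic, assuming a simple majority rule. The function names are illustrative and are not part of Deis, etcd, or Ceph.

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def has_quorum(members: int, alive: int) -> bool:
    """True while the reachable members still form a majority."""
    return alive >= quorum_size(members)


if __name__ == "__main__":
    # Walk through the two three-node scenarios described above.
    for lost in (1, 2):
        alive = 3 - lost
        state = "writable" if has_quorum(3, alive) else "read-only"
        print(f"3-node cluster, {lost} lost, {alive} alive: {state}")
```

Under this model a three-node cluster needs two live members, so it survives one failure but not two.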

Larger Clusters
---------------

If you have more than three nodes, Deis can tolerate the failure of a node without issue.
Here are a few things to keep in mind:

- You have to manually remove downed nodes from etcd and Ceph. Both assume a downed node
  might still be functioning but out of communication with the main cluster. If you don't remove
  downed nodes, they could eventually outnumber running nodes. This will cause Ceph and etcd to go
  into read-only mode to prevent a split-brain cluster.
- Ceph on Deis stores three replicas of all data. If a node goes down, Ceph doesn't re-replicate the data on
  that node because it expects the node to come back. Manually removing the node will resolve this.
- You should use the preseed script to automatically download the control and data plane on every node.
  This way, if a unit is rescheduled (for example, when a node goes down), it only has to be started, not downloaded,
  reducing failover time to seconds, not minutes. See :ref:`preseeding_continers` for further details.
- If the database is rescheduled, it has to go through a recovery process wherever it is rescheduled, causing
  controller downtime (generally less than a minute).
- User apps should be scaled to reside on multiple hosts. That way, if one node goes down, your app will continue to
  function without downtime.
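To see why unremoved members matter, here is a small simulation of a cluster in which one node fails per round and a fresh node joins to replace it. It is a simplified model, not Deis or etcd code: it assumes simple majority quorum, that every joining node becomes a voting member, and that dead members keep counting toward the membership total until an operator removes them. The function and parameter names are hypothetical.

```python
from typing import Optional


def quorum_size(registered: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return registered // 2 + 1


def rounds_until_quorum_loss(
    initial: int, remove_dead: bool, max_rounds: int = 50
) -> Optional[int]:
    """Return the round in which a failure breaks quorum, or None if it never does.

    Each round one node fails, then a replacement node joins. When
    remove_dead is False, the failed node stays on the member list.
    """
    registered = alive = initial
    for round_number in range(1, max_rounds + 1):
        alive -= 1  # one node fails
        if alive < quorum_size(registered):
            return round_number
        if remove_dead:
            registered -= 1  # operator removes the dead member
        alive += 1  # replacement node joins
        registered += 1
    return None


if __name__ == "__main__":
    print("stale members kept:", rounds_until_quorum_loss(5, remove_dead=False))
    print("stale members removed:", rounds_until_quorum_loss(5, remove_dead=True))
```

In this model a five-node cluster that never removes dead members loses quorum on the fourth failure, even though five nodes are running before each failure. With dead members removed promptly, the same sequence of failures never breaks quorum.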

docs/understanding_deis/index.rst

Lines changed: 1 addition & 0 deletions
@@ -11,3 +11,4 @@ Understanding Deis
 concepts
 architecture
 components
+failover
