46 commits
1684b9e
Add test for services/meta/config
jasonjoo2010 Jun 30, 2020
578c9a6
Refactor migration
jasonjoo2010 Jul 8, 2020
ec8a7f8
Add test for parser
jasonjoo2010 Jul 20, 2020
c453c26
add ignore and fix migrate client reference
jasonjoo2010 Jul 20, 2020
479ecdd
add logdir parameter to metad
jasonjoo2010 Aug 17, 2020
080db48
- Add support logging into directory to metad
jasonjoo2010 Aug 24, 2020
42b5ec4
Complete metad-ctl, support add/update/remove/status on cluster
jasonjoo2010 Aug 25, 2020
40437ca
Changes:
jasonjoo2010 Sep 4, 2020
452d319
Mainly optimize remote iterators over persistent connections.
jasonjoo2010 Sep 10, 2020
7a0debe
Polish logging of metad
jasonjoo2010 Sep 21, 2020
27bbfae
Polish log
jasonjoo2010 Sep 21, 2020
feeca1f
Complete shard copying feature
jasonjoo2010 Sep 21, 2020
9f4eac2
Introduce dump and restore for disaster recovery for metad
jasonjoo2010 Sep 22, 2020
51ec0cb
Polish logging for metad
jasonjoo2010 Oct 5, 2020
6041c2b
Deal with short connections / Introduce reclaiming to storage of metad
jasonjoo2010 Oct 5, 2020
8e50c41
tweak a little on badger db
jasonjoo2010 Oct 5, 2020
96aa315
Tweak badger db for metad
jasonjoo2010 Oct 6, 2020
6c1573b
Fine the logging when checksum failed
jasonjoo2010 Oct 9, 2020
12238e6
Introduce freezed state to node in data store
jasonjoo2010 Oct 9, 2020
636b690
Fix bug breaking consensus of meta servers
jasonjoo2010 Oct 12, 2020
f9d5784
Upgrade to latest stable influxdb
jasonjoo2010 Oct 12, 2020
59ea0d0
Fix for nasty denpendency issue on etcd
jasonjoo2010 Oct 12, 2020
034cb70
- Fix bug of adding new meta node
jasonjoo2010 Oct 13, 2020
05e0d17
Polish documentation
jasonjoo2010 Oct 13, 2020
7648c3d
Add Meta_Cluster_Maintenance.md
jasonjoo2010 Oct 13, 2020
344cc19
Add documentation of maintenance on data cluster
jasonjoo2010 Oct 13, 2020
f055d14
Fix typo
jasonjoo2010 Oct 13, 2020
89460a0
Polish documentation
jasonjoo2010 Oct 13, 2020
b2b3e5b
Show more information of shard in influxd-ctl
jasonjoo2010 Oct 13, 2020
c4866dc
- Introduce storage info to metad-ctl to check space consumption of e…
jasonjoo2010 Oct 15, 2020
fb734ef
Polish logging
jasonjoo2010 Oct 15, 2020
34b377e
Change default configuration of influxd
jasonjoo2010 Oct 15, 2020
28130a5
Adjust documentation for latest command tool
jasonjoo2010 Oct 15, 2020
27ecfc3
Fix inconsistent issue when creating/updating user
jasonjoo2010 Oct 27, 2020
bac1976
Add connection pools monitoring through logs
jasonjoo2010 Nov 2, 2020
98627b9
Improve the precision of cost in creating connections
jasonjoo2010 Nov 2, 2020
e26ed1d
Fix possible overlap bug computing cost
jasonjoo2010 Nov 2, 2020
f9927bf
Ajust default configurations
jasonjoo2010 Nov 4, 2020
ac432ee
Optimze logging content when write shard failed
jasonjoo2010 Nov 4, 2020
1eddafb
Refactor interator request processing
jasonjoo2010 Nov 11, 2020
be74a69
Fix empty iterator problem
jasonjoo2010 Nov 13, 2020
002addc
Refactor service module making it more stable
jasonjoo2010 Nov 16, 2020
f8a3e26
Add ignore
jasonjoo2010 Nov 16, 2020
1375a2a
Polish cluster executor logic
jasonjoo2010 Nov 16, 2020
4f77af7
Make DeletedAt of shard group consistent
jasonjoo2010 Dec 13, 2020
0c7b3e3
Make it possible to reload meta peers from meta server when pings failed
jasonjoo2010 Mar 28, 2021
6 changes: 6 additions & 0 deletions .gitignore
@@ -0,0 +1,6 @@
.history/
.DS_Store
cmd/influxd-ctl/influxd-ctl
cmd/influxd/influxd
cmd/metad/metad
sync_simulation
94 changes: 94 additions & 0 deletions Data_Cluster_Maintenance.md
@@ -0,0 +1,94 @@
# Data Cluster Maintenance

## Get Status of Cluster

### Node List

Use the following command to list all data nodes in the cluster (whether alive or dead):

```shell
influxd-ctl -s ip:port node list
```

Here `ip:port` is the **TCP address** of any **alive** node in the cluster.

Sample output:

```shell
Nodes:
4 http://:8092 tcp://127.0.0.1:8082
5 http://:8093 tcp://127.0.0.1:8083
6 http://:8094 tcp://127.0.0.1:8084
7 http://:8095 tcp://127.0.0.1:8085
8 http://:8096 tcp://127.0.0.1:8086
9 http://:8091 tcp://127.0.0.1:8081
15 http://:8097 tcp://127.0.0.1:8087
```

### Shards on Node

Use the following command to list all available shards (IDs only) on a specific node:

```shell
influxd-ctl -s ip:port shard node <node-id>
```

Output:

```shell
Shards on node 15:
[513 549 556 575 578 580 582 585 593 594]
```

### Shards of Retention Policy

```shell
influxd-ctl -s ip:port shard list <database> <retention policy>
```

### Single Shard Info

```shell
influxd-ctl -s ip:port shard info <shard id>
```

Output:

```shell
Shard: 594
Database: _internal
Retention Policy: monitor
Nodes: [15]
```

## Restart Node

Feel free to restart any node as long as the **hinted handoff** (hh) service is enabled
on every other node. Data blocks that would have been replicated to it are cached and
replication is retried once the node is back online. Any failed query sent to this node
will be retried on other replicas.
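
How you restart influxd depends on your deployment; the sketch below assumes systemd and then uses `influxd-ctl node list` (shown above) to confirm the node rejoined:

```shell
# Restart the data node (systemd is an assumption; adapt to your setup)
systemctl restart influxd

# Once it is back up, confirm the node still appears in the cluster
influxd-ctl -s ip:port node list
```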

## Add New Node

Adding a node is simple: configure the new instance and start it, and it will appear in
the node list. A minimal sketch is shown below.
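
A minimal sketch, assuming the new node is started with a prepared configuration file at `/etc/influxdb/influxd.conf` (the path and startup method are assumptions):

```shell
# Start the new data node with its configuration
influxd -config /etc/influxdb/influxd.conf

# From any alive node, verify the newcomer appears in the node list
influxd-ctl -s ip:port node list
```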

## Remove Node

1. Remove it from the cluster configuration through `influxd-ctl node remove` (see the sketch below)
2. Stop the instance
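
A sketch of the removal, assuming `node remove` takes the node id reported by `node list` (the exact argument form is an assumption; check the tool's help output):

```shell
# Look up the id of the node to remove
influxd-ctl -s ip:port node list

# Remove it from the cluster (node-id argument form is assumed)
influxd-ctl -s ip:port node remove <node-id>

# Finally, stop the influxd process on that node
```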

## Replace Node

Replacement is more involved. In the following steps, we call the instance to be replaced
`A` and the new one `B`; a command sketch follows the list.

1. Add B into the cluster
2. Freeze both A and B through `influxd-ctl node freeze`
3. Truncate shards and wait a while to make sure there are no further writes on A and B
4. Get all shards on A through `influxd-ctl shard node`
5. Copy them from A to B through `influxd-ctl shard copy`
6. Check progress through `influxd-ctl shard status`
7. Verify that the actual data directories were copied correctly
8. Remove A from the cluster
9. Unfreeze B so it accepts creation of new shards again
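
A command-level sketch of the replacement, with `<id-A>` and `<id-B>` standing for the node ids of A and B. The argument forms for `node freeze`, `shard copy`, and `node remove` are assumptions based on the subcommand names above; check the tool's help output for the exact syntax:

```shell
# Step 2: freeze both nodes so no new shards are created on them (argument form assumed)
influxd-ctl -s ip:port node freeze <id-A>
influxd-ctl -s ip:port node freeze <id-B>

# Step 4: list the shards currently held by A
influxd-ctl -s ip:port shard node <id-A>

# Step 5: copy each shard from A to B (argument form assumed)
influxd-ctl -s ip:port shard copy <shard-id> <id-A> <id-B>

# Step 6: watch the copy progress
influxd-ctl -s ip:port shard status

# Step 8: remove A once every shard is present on B (argument form assumed)
influxd-ctl -s ip:port node remove <id-A>
```
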
79 changes: 79 additions & 0 deletions Meta_Cluster_Maintenance.md
@@ -0,0 +1,79 @@
# Meta Cluster Maintenance

## Get Status of Cluster

```shell
metad-ctl status -s ip:port
```

Sample output:

```shell
Cluster:
Leader: 3
Term: 8
Committed: 4685619
Applied: 4685619

Nodes:
1 Follower 127.0.0.1:2345 StateReplicate 4685619=>4685620
2 Follower 127.0.0.1:2346 StateReplicate 4685619=>4685620 Vote(3)
3 Leader 127.0.0.1:2347 StateReplicate 4685619=>4685620 Vote(3)
```

## Restart Node

### Restart Follower

Feel free to restart any follower. The only thing to take care of is to restart one node
at a time and to make sure the cluster becomes healthy again before moving on to the next
one, as in the sketch below.
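
For example (the restart command assumes systemd; the status check uses the command shown above):

```shell
# Restart a single follower (systemd is an assumption; adapt to your setup)
systemctl restart metad

# Wait until the cluster reports a healthy state again before touching the next node
metad-ctl status -s ip:port
```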

### Restart Leader

Restarting the leader should follow these steps (a command sketch follows the list):

1. Kill the leader
2. Check the status of the cluster to confirm a new leader has been elected
3. Start it again; it will rejoin as a follower
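
A sketch of the leader restart, assuming metad is managed by systemd (adapt the stop/start commands to your environment):

```shell
# 1. Stop the current leader
systemctl stop metad

# 2. From another node, confirm a new leader has been elected
metad-ctl status -s ip:port

# 3. Start the old leader again; it rejoins as a follower
systemctl start metad
```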

## Add New Node

Adding a node should also be done in two steps:

1. Add it into the configuration using `metad-ctl add` specifying `id` and `addr`
2. Start the new, empty meta node

Add one node at a time. If you want to add multiple nodes, just repeat the two steps above for each node. A minimal command sketch is shown below.
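
A sketch of the two steps. The flag names for `metad-ctl add` are assumptions (the text above only says it takes an `id` and an `addr`); check the command's help output for the actual syntax:

```shell
# 1. Register the new member with the existing cluster (flag names assumed)
metad-ctl add -s ip:port -id 4 -addr 127.0.0.1:2348

# 2. Start the new, empty meta node with its own configuration
metad -config /etc/influxdb/metad.conf
```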

## Remove Node

1. Kill it
2. Remove it from the configuration using `metad-ctl remove` (see the sketch below)
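
A sketch, assuming `metad-ctl remove` identifies the member by its id (the flag form is an assumption; check the command's help output):

```shell
# 1. Stop the metad process on the node being removed, then:
# 2. drop it from the cluster configuration (id argument form assumed)
metad-ctl remove -s ip:port -id 4
```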

## Replace Node

There are two strategies to replace an existing node.

The first is the remove-add strategy: remove the old node first, following the steps in
`Remove Node`, then add a new node following the steps in `Add New Node`. The advantage
is that the new node can reuse the **address** / **id** of the removed one.

The other is the add-remove strategy: first add a new node into the cluster, then remove
the old one. It may be safer than the first strategy, but you can't reuse the same id or
address because both nodes will be up at the same time for a while.
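
For example, the add-remove strategy boils down to the commands already described, run against any reachable member (flag forms are assumed as above, and the new node must use a fresh id and address):

```shell
# Add the replacement member and start it (flag names assumed)
metad-ctl add -s ip:port -id 5 -addr 127.0.0.1:2349
metad -config /etc/influxdb/metad-new.conf

# Once it has caught up, stop the old metad process, then remove the old member
metad-ctl remove -s ip:port -id 2

# Confirm the cluster is healthy with the new membership
metad-ctl status -s ip:port
```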

## Recover from Disaster

If something bad happens and the cluster can no longer achieve consensus, or it cannot
work anymore for some other reason, here is how to get it back.

First, decide which node's storage you want to recover from. Use the command
`metad -config <configuration> -dump a.db` to dump that storage to the file `a.db`.

Second, boot up the first node from that dump with `metad -config <configuration> -restore a.db`.
Now you have a single-instance cluster. Then follow the `Add New Node` steps to add the
rest of the nodes one by one. The whole sequence is sketched below.
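
Putting the recovery together as one sequence (the `-dump` and `-restore` flags come from the text above; the configuration paths are placeholders):

```shell
# 1. Dump the storage of the node you want to recover from
metad -config /etc/influxdb/metad.conf -dump a.db

# 2. Boot the first node of the rebuilt cluster from that dump
metad -config /etc/influxdb/metad.conf -restore a.db

# 3. Verify it is up as a single-instance cluster
metad-ctl status -s ip:port

# 4. Add the remaining meta nodes one by one, following "Add New Node"
```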