Handling Failures

Now that we have a highly resilient setup, we can see how having the right architecture for both the cluster and the application allows us to tolerate failures.
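
Before breaking anything, it helps to confirm how the pieces are actually spread out. The commands below are a quick sanity check (a sketch; the app=elasticsearch label selector is an assumption, adjust it and the namespace to match your deployment):

#check which zone each node belongs to (topology.kubernetes.io/zone is the standard well-known zone label)
kubectl get nodes -L topology.kubernetes.io/zone

#check how the Elasticsearch pods are spread across those nodes
kubectl get pods -l app=elasticsearch -o wide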

Scenario: Handling Zone Failures

In this simulation, we will take out a full zone from the cluster and see how the application remains in a running state even after losing a full zone. The illustration below shows what we are going to do.

Handling Zone Failures

The video below shows how we can do this; follow the instructions underneath if you want to do the same.

simulate_zone_failure_small.mov

Instructions (Simulate a zone failure)

  1. Let's insert some data into our "test" index

#insert some data

curl -X POST "http://$esip:9200/test/_doc/" -H 'Content-Type: application/json' -d'
{
"user" : "mo",
"post_date" : "2021-06-13T20:12:12",
"message" : "testing resiliency"
}
'

#we should receive a response similar to the below

{"_index":"test","_type":"_doc","_id":"TdZAD3oBAp40j8h6zdHs","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":0,"_primary_term":1}       

#Let's do a quick test to retrieve the data

curl "http://$esip:9200/test/_search?q=user:mo*"

#we should receive something similar to the below

{"took":706,"timed_out":false,"_shards":{"total":3,"successful":3,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":1.0,"hits":[{"_index":"test","_type":"_doc","_id":"TdZAD3oBAp40j8h6zdHs","_score":1.0,"_source":
{
"user" : "mo",
"post_date" : "2021-06-13T20:12:12",
"message" : "testing resiliency"
}
}]}}

#another test, which we will use later, is to return the HTTP status code only

$ curl -s -o /dev/null -w "%{http_code}" "http://$esip:9200/test/_search?q=user:mo*"
200
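
Before moving on, it is also worth checking where the shards of the "test" index actually live, since the whole point of the exercise is that a copy of the data survives the loss of a zone. The commands below are a quick way to do that (a sketch using Elasticsearch's _cat APIs; the exact columns vary a little between versions):

#see which nodes hold the primary and the replica of our "test" index
curl "http://$esip:9200/_cat/shards/test?v"

#and list the data nodes those shards can be allocated to
curl "http://$esip:9200/_cat/nodes?v"

If shard allocation awareness is working as intended, the primary and its replica should sit on nodes in different zones, so losing one zone still leaves a full copy of the data.
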
  2. Run the script below in a separate shell to keep an eye on the application; it simply loops and prints the status code of the curl call we used earlier
while true
do
curl -s -o /dev/null -w "%{http_code}" http://$esip:9200/test/_search\?q\=user:mo\*
echo
done

#you should see something like the below "hopefully"!
200
200
200
200
  3. In another shell, watch the nodes to check on their status; keep the command running so we can see the nodes change from "Ready" to "NotReady"
kubectl get nodes -w 
NAME                                 STATUS   ROLES   AGE    VERSION
aks-espoolz1-37272235-vmss000000     Ready    agent   3d7h   v1.21.1
aks-espoolz1-37272235-vmss000001     Ready    agent   3d7h   v1.21.1
aks-espoolz2-37272235-vmss000000     Ready    agent   3d7h   v1.21.1
aks-espoolz2-37272235-vmss000001     Ready    agent   3d7h   v1.21.1
aks-espoolz3-37272235-vmss000000     Ready    agent   3d7h   v1.21.1
aks-espoolz3-37272235-vmss000001     Ready    agent   3d7h   v1.21.1
aks-systempool-37272235-vmss000000   Ready    agent   3d7h   v1.21.1
aks-systempool-37272235-vmss000001   Ready    agent   3d7h   v1.21.1
aks-systempool-37272235-vmss000002   Ready    agent   3d7h   v1.21.1
  4. Now SSH into your nodes and stop the kubelet. There are many ways to SSH into your nodes, such as using a bastion/jump host (which is what you should do in practice), but for the sake of simplicity I'll use kubectl-node_shell.

Note: If you use kubectl-node_shell to open a shell into the nodes, you will be logged out once you stop the kubelet, which is expected. Don't panic; leave things as they are for 5-6 minutes and the AKS diagnostics agent will restart the kubelet for you.
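
Before stopping the kubelet, you can double check which nodes actually sit in zone 1. The command below is one way to do that (assuming the standard AKS agentpool node label; espoolz1 is the zone-1 node pool name used in this walkthrough):

#list the zone-1 node pool together with its zone label
kubectl get nodes -l agentpool=espoolz1 -L topology.kubernetes.io/zone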

#We will take out the nodes in Zone1 
$ kubectl-node_shell aks-espoolz1-37272235-vmss000000
#once inside the node run the below
systemctl stop kubelet

#do the same for the other node 
$ kubectl-node_shell aks-espoolz1-37272235-vmss000001
$ systemctl stop kubelet

#by now you should be logged out of the nodes, but notice that your application is still running
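
A couple of optional checks to confirm what just happened (a sketch; cluster health may stay yellow for a while until Elasticsearch finishes reallocating replicas):

#the zone-1 nodes should report "NotReady" after a minute or so
kubectl get nodes

#the monitoring loop should still be printing 200, and the Elasticsearch cluster itself should not be red
curl "http://$esip:9200/_cluster/health?pretty"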

Summary

We have simulated a full zone failure and demonstrated how our application remained in a running state. This is the power of having the right architecture for both your cluster and your application.

Please continue to the next section: Handling Cluster Upgrades.