Our hands are not for boredom: restoring a Rook cluster in K8s





We have already talked about how and why we like Rook: to a significant extent, it simplifies working with storage in Kubernetes clusters. However, this simplicity brings certain difficulties with it. We hope this new material will help you understand those difficulties before they even show up.



And to make the read more interesting, we start with the consequences of a hypothetical problem in the cluster.



"Everything is lost!"



Imagine that you once configured and launched Rook in your K8s cluster, were pleased with how it worked, and then at some “wonderful” moment the following happens:





When was the rook-ceph-operator pod launched? Not as long ago as the cluster was deployed: the pod has been restarted recently. Why does that matter? Because the Rook operator decided to create a new cluster... How can we now restore the cluster and the data in it?



To begin with, let's take the longer and more interesting route: a careful investigation of Rook's “internals” and a step-by-step restoration of its components. Of course, there is also a shorter and more correct way: using backups. As you know, admins come in two kinds: those who don't make backups and those who already do... But more on that after the investigation.



A little practice, or the long way



Inspecting and restoring the monitors



So, let's look at the list of ConfigMaps: it should contain rook-ceph-config and rook-config-override, which are needed for a backup. They appear upon successful deployment of the cluster.
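For reference, a quick way to check that they are there (the rook-ceph namespace here is an assumption; use the namespace of your Rook cluster):

 kubectl -n rook-ceph get configmap rook-ceph-config rook-config-override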



NB: In newer versions, after this PR was merged, ConfigMaps have ceased to be an indicator of a successful cluster deployment.



To perform further actions, we need a hard reboot of all the servers that have RBD images mounted (ls /dev/rbd*). It must be done via sysrq (or “on foot” to the data center). This requirement stems from the need to detach the mounted RBDs, which a regular reboot cannot do (it will unsuccessfully try to unmount them normally).
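For reference, a hard reboot via sysrq can be triggered like this (run as root on each affected node):

 # enable the sysrq interface and reboot immediately, without unmounting filesystems
 echo 1 > /proc/sys/kernel/sysrq
 echo b > /proc/sysrq-trigger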



As the saying goes, a theater begins at the cloakroom, and a Ceph cluster begins with its monitors. Let's take a look at them.



Rook mounts the following entities in the monitor pod:



 Volumes:
  rook-ceph-config:
   Type:        ConfigMap (a volume populated by a ConfigMap)
   Name:        rook-ceph-config
  rook-ceph-mons-keyring:
   Type:        Secret (a volume populated by a Secret)
   SecretName:  rook-ceph-mons-keyring
  rook-ceph-log:
   Type:        HostPath (bare host directory volume)
   Path:        /var/lib/rook/kube-rook/log
  ceph-daemon-data:
   Type:        HostPath (bare host directory volume)
   Path:        /var/lib/rook/mon-a/data
 Mounts:
  /etc/ceph from rook-ceph-config (ro)
  /etc/ceph/keyring-store/ from rook-ceph-mons-keyring (ro)
  /var/lib/ceph/mon/ceph-a from ceph-daemon-data (rw)
  /var/log/ceph from rook-ceph-log (rw)





Let's see what's in the rook-ceph-mons-keyring secret:



 kind: Secret
 data:
   keyring: LongBase64EncodedString=





We decode it and get a regular keyring with permissions for the admin and the monitors:



 [mon.]
        key = AQAhT19dlUz0LhBBINv5M5G4YyBswyU43RsLxA==
        caps mon = "allow *"
 [client.admin]
        key = AQAhT19d9MMEMRGG+wxIwDqWO1aZiZGcGlSMKp==
        caps mds = "allow *"
        caps mon = "allow *"
        caps osd = "allow *"
        caps mgr = "allow *"





Remember it. Now look at the keyring in the rook-ceph-admin-keyring secret:



 kind: Secret
 data:
   keyring: anotherBase64EncodedString=





What is in it?



 [client.admin]
        key = AQAhT19d9MMEMRGG+wxIwDqWO1aZiZGcGlSMKp==
        caps mds = "allow *"
        caps mon = "allow *"
        caps osd = "allow *"
        caps mgr = "allow *"





The same. Let's keep looking... Here, for example, is the rook-ceph-mgr-a-keyring secret:



 [mgr.a]
        key = AQBZR19dbVeaIhBBXFYyxGyusGf8x1bNQunuew==
        caps mon = "allow *"
        caps mds = "allow *"
        caps osd = "allow *"





Finally, we find a few more keys in the rook-ceph-mon Secret:



 kind: Secret
 data:
   admin-secret: AQAhT19d9MMEMRGG+wxIwDqWO1aZiZGcGlSMKp==
   cluster-name: a3ViZS1yb29r
   fsid: ZmZiYjliZDMtODRkOS00ZDk1LTczNTItYWY4MzZhOGJkNDJhCg==
   mon-secret: AQAhT19dlUz0LhBBINv5M5G4YyBswyU43RsLxA==





And this is the original set of keys from which all the secrets described above are derived.
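The values in this secret are base64-encoded; to read one of them (the rook-ceph namespace here is an assumption), something like this works:

 kubectl -n rook-ceph get secret rook-ceph-mon -o jsonpath='{.data.fsid}' | base64 -d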



As you know (see dataDirHostPath in the documentation), Rook stores this data in two places. So let's go to the nodes and look at the keyrings in the directories that are mounted into the monitor and OSD pods. To do this, find /var/lib/rook/mon-a/data/keyring on the nodes and look at it:



 # cat /var/lib/rook/mon-a/data/keyring
 [mon.]
        key = AXAbS19d8NNUXOBB+XyYwXqXI1asIzGcGlzMGg==
        caps mon = "allow *"





Surprise: the key here turned out to be different, not the one in the Kubernetes secret.



What about the admin keyring? We also have it:



 # cat /var/lib/rook/kube-rook/client.admin.keyring
 [client.admin]
        key = AXAbR19d8GGSMUBN+FyYwEqGI1aZizGcJlHMLgx=
        caps mds = "allow *"
        caps mon = "allow *"
        caps osd = "allow *"
        caps mgr = "allow *"





Here is the problem. There was a failure: the cluster was “recreated”... but in reality it was not.



It becomes clear that the secrets now hold newly generated keyrings, which do not belong to our old cluster. Therefore:
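What follows is only a rough sketch of the idea, not an exact recipe: put the old keys from the node back into the Kubernetes secrets so that the monitors come up with the old cluster's credentials. The rook-ceph namespace and the exact patch commands are assumptions; adapt them to your cluster.

 NS=rook-ceph
 # old keys found on the node (the first "key = ..." entry in each file)
 MON_KEY=$(awk '$1 == "key" {print $3; exit}' /var/lib/rook/mon-a/data/keyring)
 ADMIN_KEY=$(awk '$1 == "key" {print $3; exit}' /var/lib/rook/kube-rook/client.admin.keyring)

 # put them back into the rook-ceph-mon secret...
 kubectl -n "$NS" patch secret rook-ceph-mon --type merge -p \
   "{\"data\":{\"mon-secret\":\"$(printf '%s' "$MON_KEY" | base64 -w0)\",\"admin-secret\":\"$(printf '%s' "$ADMIN_KEY" | base64 -w0)\"}}"
 # ...and update rook-ceph-mons-keyring, rook-ceph-admin-keyring and the mgr keyring
 # with the old keyring files in the same way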





The miracle does not take long: the monitors appear and start up. Hooray, that's a start!



Restoring the OSDs



We go into the rook-operator pod: ceph mon dump shows that all the monitors are in place, and ceph -s shows that they are in quorum. However, if you look at the OSD tree (ceph osd tree), you will see something strange in it: OSDs have started to appear, but they are empty. It turns out that they also need to be restored somehow. But how?
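For reference, these checks can be run via kubectl exec; the namespace and the deployment to exec into are assumptions (a toolbox pod with the ceph CLI works just as well):

 kubectl -n rook-ceph exec -it deploy/rook-ceph-operator -- ceph mon dump
 kubectl -n rook-ceph exec -it deploy/rook-ceph-operator -- ceph -s
 kubectl -n rook-ceph exec -it deploy/rook-ceph-operator -- ceph osd tree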



Meanwhile, the ConfigMaps we needed so much have appeared: rook-ceph-config and rook-config-override, as well as a number of ConfigMaps with names of the form rook-ceph-osd-$nodename-config. Let's look at them:



 kind: ConfigMap
 data:
   osd-dirs: '{"/mnt/osd1":16,"/mnt/osd2":18}'





Everything is wrong, everything is mixed up!



We scale the operator deployment to zero, delete the generated OSD Deployments, and fix these ConfigMaps. But where do we get the correct map of OSDs to nodes?
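A sketch of those first two steps, plus one way to find the right mapping; the namespace, labels and paths are assumptions based on this article:

 kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas=0
 kubectl -n rook-ceph delete deploy -l app=rook-ceph-osd

 # on each node: the whoami file inside an OSD data directory holds that OSD's id
 cat /mnt/osd1/whoami /mnt/osd1/*/whoami 2>/dev/null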





We conclude that osd0 is the OSD we need, and it was specified as /mnt/osd1 in the ConfigMap (we use directory-based OSDs).
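Fixing the mapping then comes down to editing the node's ConfigMap; a hedged sketch (the node name node1 is hypothetical and the OSD ids are only illustrative):

 kubectl -n rook-ceph patch configmap rook-ceph-osd-node1-config --type merge \
   -p '{"data":{"osd-dirs":"{\"/mnt/osd1\":0,\"/mnt/osd2\":2}"}}'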



Step by step, we check all the nodes and edit the ConfigMaps. Once that is done, we can start the Rook operator pod again and read its logs. And everything in them is wonderful:





Let's go back to the Rook operator pod and check that the cluster is alive... Yes, we made a small mistake with our conclusions about the OSD names on some nodes! No matter: we fix the ConfigMaps again, delete the extra directories created for the new OSDs, and arrive at the long-awaited HEALTH_OK state!



Check the images in the pool:



 # rbd ls -p kube
 pvc-9cfa2a98-b878-437e-8d57-acb26c7118fb
 pvc-9fcc4308-0343-434c-a65f-9fd181ab103e
 pvc-a6466fea-bded-4ac7-8935-7c347cff0d43
 pvc-b284d098-f0fc-420c-8ef1-7d60e330af67
 pvc-b6d02124-143d-4ce3-810f-3326cfa180ae
 pvc-c0800871-0749-40ab-8545-b900b83eeee9
 pvc-c274dbe9-1566-4a33-bada-aabeb4c76c32
 …





Everything is in place - the cluster is saved!



Not too lazy to make backups, or the fast way



If backups of Rook were made, the recovery procedure becomes much simpler and boils down to the following (a rough command sketch follows the list):



  1. Scale the Rook operator deployment to zero;
  2. Delete all deployments except the Rook operator;
  3. Restore all secrets and ConfigMaps from the backup;
  4. Restore the contents of /var/lib/rook/mon-* on the nodes;
  5. Restore (if they were lost) the CephCluster, CephFilesystem, CephBlockPool, CephNFS and CephObjectStore CRDs;
  6. Scale the Rook operator deployment back to 1.
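A rough sketch of these steps in commands; the namespace, labels and file names are assumptions, and step 4 is shown only as a comment since it happens on the nodes:

 kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas=0
 kubectl -n rook-ceph delete deploy -l app=rook-ceph-osd
 kubectl -n rook-ceph delete deploy -l app=rook-ceph-mon
 kubectl -n rook-ceph delete deploy -l app=rook-ceph-mgr
 kubectl apply -f secrets-configmaps.yaml -f ceph-crds.yaml   # from your backup
 # ...restore /var/lib/rook/mon-* on the nodes from the backup...
 kubectl -n rook-ceph scale deploy rook-ceph-operator --replicas=1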


Useful Tips



Make backups!
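A minimal sketch of what such a backup might cover, assuming the rook-ceph namespace and the /var/lib/rook dataDirHostPath used in this article:

 NS=rook-ceph
 BACKUP_DIR=/root/rook-backup/$(date +%F)
 mkdir -p "$BACKUP_DIR"

 # Kubernetes objects: secrets, ConfigMaps and the Ceph CRDs
 kubectl -n "$NS" get secret,configmap -o yaml > "$BACKUP_DIR/secrets-configmaps.yaml"
 kubectl -n "$NS" get cephcluster,cephblockpool,cephfilesystem,cephnfs,cephobjectstore -o yaml \
   > "$BACKUP_DIR/ceph-crds.yaml"

 # monitor data (run on every node that hosts a mon)
 tar czf "$BACKUP_DIR/var-lib-rook-mon.tar.gz" /var/lib/rook/mon-*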



And to avoid situations where you have to restore from them:



  1. Before large-scale work on the cluster that involves rebooting servers, scale the Rook operator to zero so that it does not do anything unnecessary.
  2. Add nodeAffinity to the monitors in advance.
  3. Pay attention to configuring the ROOK_MON_HEALTHCHECK_INTERVAL and ROOK_MON_OUT_TIMEOUT timeouts in advance (see the example below).
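A hedged example of setting those timeouts on the operator (the namespace and values here are just placeholders):

 kubectl -n rook-ceph set env deploy/rook-ceph-operator \
   ROOK_MON_HEALTHCHECK_INTERVAL=45s ROOK_MON_OUT_TIMEOUT=15m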


Instead of a conclusion



There is no denying that Rook, being an additional “layer” in the overall scheme of organizing storage in Kubernetes, simplifies some things as much as it adds new difficulties and potential problems to the infrastructure. All that remains is the “small” matter of making a deliberate, informed choice between these risks on the one hand and the benefits the solution brings in your particular case on the other.



By the way, the section “Adopt an existing Rook Ceph cluster into a new Kubernetes cluster” was recently added to the Rook documentation. It describes in more detail what needs to be done to move existing data into a new Kubernetes cluster or to restore a cluster that has collapsed for one reason or another.



PS



Read also in our blog:





