How to fix a full /data disk in Prometheus Server deployed with the Helm chart

The prometheus-server pod has two containers: prometheus-server-configmap-reload and prometheus-server.

Currently prometheus-server has a single 20GiB disk, and it filled up in sixty days. We need to resize this pvc, or replace it, to get at least 60GiB.

In order to do this, we need to:

1. stop the prometheus-server
2. back up the current volume data to another pvc
3. delete the current prometheus-server pvc
4. recreate the prometheus-server pvc with the new size
5. restore the backup to this new prometheus-server pvc
6. start the previously stopped prometheus-server deployment

Here are the details (the "k" is an alias for "kubectl"):

1. The prometheus-server needs to be stopped

---
First we need to get information about the current prometheus-server deployment:

```
$ k -n monitoring get deployment
NAME                                    READY   UP-TO-DATE   AVAILABLE   AGE
grafana                                 1/1     1            1           60d
grafana-nginx-ingress-controller        2/2     2            2           60d
grafana-nginx-ingress-default-backend   1/1     1            1           60d
prometheus-alertmanager                 1/1     1            1           60d
prometheus-kube-state-metrics           1/1     1            1           60d
prometheus-pushgateway                  1/1     1            1           60d
prometheus-server                       0/1     1            0           60d
```

Check details for prometheus-server deployment:

```
$ k -n monitoring describe deployment prometheus-server
Name:                   prometheus-server
Namespace:              monitoring
CreationTimestamp:      Tue, 29 Oct 2019 20:50:30 -0400
Labels:                 app=prometheus
                        chart=prometheus-9.2.0
                        component=server
                        heritage=Tiller
                        release=prometheus
Annotations:            deployment.kubernetes.io/revision: 1
Selector:               app=prometheus,component=server,release=prometheus
Replicas:               1 desired | 1 updated | 1 total | 0 available | 1 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy: 1 max unavailable, 1 max surge
Pod Template:
Labels:           app=prometheus
                    chart=prometheus-9.2.0
                    component=server
                    heritage=Tiller
                    release=prometheus
Service Account: prometheus-server
Containers:
   prometheus-server-configmap-reload:
    Image:      jimmidyson/configmap-reload:v0.2.2
    Port:       <none>
    Host Port: <none>
    Args:
      --volume-dir=/etc/config
      --webhook-url=http://127.0.0.1:9090/-/reload
    Environment: <none>
    Mounts:
      /etc/config from config-volume (ro)
   prometheus-server:
    Image:      prom/prometheus:v2.13.1
    Port:       9090/TCP
    Host Port: 0/TCP
    Args:
      --storage.tsdb.retention.time=15d
      --config.file=/etc/config/prometheus.yml
      --storage.tsdb.path=/data
      --web.console.libraries=/etc/prometheus/console_libraries
      --web.console.templates=/etc/prometheus/consoles
      --web.enable-lifecycle
    Liveness:     http-get http://:9090/-/healthy delay=30s timeout=30s period=10s #success=1 #failure=3
    Readiness:    http-get http://:9090/-/ready delay=30s timeout=30s period=10s #success=1 #failure=3
    Environment: <none>
    Mounts:
      /data from storage-volume (rw)
      /etc/config from config-volume (rw)
Volumes:
   config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-server
    Optional: false
   storage-volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName: prometheus-server
    ReadOnly:   false
Conditions:
Type           Status Reason
----           ------ ------
Available      True    MinimumReplicasAvailable
OldReplicaSets: prometheus-server-65d76f67cf (1/1 replicas created)
NewReplicaSet:   <none>
Events:          <none>

```
We don't need to delete the prometheus-server deployment. Instead, we set its replicas to zero; this deletes the current pod and releases its volumes.

```
$ kubectl -n monitoring scale deployment prometheus-server --replicas=0
deployment.extensions/prometheus-server scaled
```
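Before touching the volume, it helps to confirm that the pod is really gone, since the volume is only released once the pod has terminated. Here is a minimal sketch, assuming the labels shown in the describe output above; the function name, the retry count and the `POLL_SECONDS` variable are my own and not part of the chart.

```shell
# Hypothetical helper: poll until no prometheus-server pod is left.
# Usage: wait_for_server_pods_gone [namespace] [max_tries]
wait_for_server_pods_gone() {
  ns="${1:-monitoring}"
  max_tries="${2:-30}"
  try=0
  while [ "$try" -lt "$max_tries" ]; do
    # "-o name" prints one "pod/<name>" line per match and nothing when none match
    pods="$(kubectl -n "$ns" get pods -l app=prometheus,component=server -o name)"
    if [ -z "$pods" ]; then
      return 0
    fi
    try=$((try + 1))
    sleep "${POLL_SECONDS:-2}"
  done
  echo "pods still present: $pods" >&2
  return 1
}
```

Calling `wait_for_server_pods_gone monitoring 30` returns 0 as soon as the selector matches no pods, and 1 if they are still there after the last try.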

2. Back up the current volume data to another pvc

---
Check the current pvc for prometheus-server:

```
$ k -n monitoring get pvc
NAME                      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
alpinebox-recovery-pvc    Bound    pvc-c86f431e-2a3b-11ea-9bfe-524e272515fe   30Gi       RWO            managed-premium   19m
grafana                   Bound    pvc-822d1613-fab6-11e9-b365-4aa5ceef3b39   20Gi       RWO            managed-premium   60d
prometheus-alertmanager   Bound    pvc-44e4eedd-faaf-11e9-b365-4aa5ceef3b39   10Gi       RWO            managed-premium   60d
prometheus-pushgateway    Bound    pvc-44e6c0d8-faaf-11e9-b365-4aa5ceef3b39   10Gi       RWO            managed-premium   60d
prometheus-server         Bound    pvc-44eae0e4-faaf-11e9-b365-4aa5ceef3b39   20Gi       RWO            managed-premium   60d

$ k -n monitoring describe pvc prometheus-server
Name:          prometheus-server
Namespace:     monitoring
StorageClass: managed-premium
Status:        Bound
Volume:        pvc-44eae0e4-faaf-11e9-b365-4aa5ceef3b39
Labels:        app=prometheus
               chart=prometheus-9.2.0
               component=server
               heritage=Tiller
               release=prometheus
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/azure-disk
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      20Gi
Access Modes: RWO
VolumeMode:    Filesystem
Mounted By:    alpinebox
Events:        <none>

$ k -n monitoring get pvc prometheus-server -o yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/azure-disk
  creationTimestamp: "2019-10-30T00:50:29Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app: prometheus
    chart: prometheus-9.2.0
    component: server
    heritage: Tiller
    release: prometheus
  name: prometheus-server
  namespace: monitoring
  resourceVersion: "1457021"
  selfLink: /api/v1/namespaces/monitoring/persistentvolumeclaims/prometheus-server
  uid: 44eae0e4-faaf-11e9-b365-4aa5ceef3b39
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: managed-premium
  volumeMode: Filesystem
  volumeName: pvc-44eae0e4-faaf-11e9-b365-4aa5ceef3b39
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 20Gi
  phase: Bound
```

As you can see, the current disk size is 20GiB. Now I'll create a new pvc larger than 20GiB to hold the backup.

```
# alpinebox-recovery-pvc.yml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: alpinebox-recovery-pvc
  namespace: monitoring
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 30Gi
  storageClassName: managed-premium

$ kubectl apply -f alpinebox-recovery-pvc.yml
```

Now we need a temporary Alpine pod that mounts both volumes and where rsync can be installed.

```
# alpinebox.yml
apiVersion: v1
kind: Pod
metadata:
  name: alpinebox
  namespace: monitoring
spec:
  containers:
  - name: alpinebox
    image: alpine:3.5
    command:
      - sleep
      - "3600"
    volumeMounts:
    - mountPath: /data-old
      name: storage-volume-old
    - mountPath: /data-new
      name: storage-volume-new
  restartPolicy: Never
  volumes:
  - name: storage-volume-old
    persistentVolumeClaim:
      claimName: prometheus-server
  - name: storage-volume-new
    persistentVolumeClaim:
      claimName: alpinebox-recovery-pvc

$ kubectl apply -f alpinebox.yml
```
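Before opening a shell in the pod, it may be worth checking that it is Ready and that both volumes are mounted where we expect them. This is only a sketch: the function name is hypothetical, and it assumes the pod name and mount paths from alpinebox.yml above.

```shell
# Hypothetical helper: wait for the temporary pod, then show both mounts.
# Usage: check_alpinebox_mounts [namespace]
check_alpinebox_mounts() {
  ns="${1:-monitoring}"
  # Block until the pod reports the Ready condition (or give up after 120s)
  kubectl -n "$ns" wait --for=condition=Ready pod/alpinebox --timeout=120s || return 1
  # Print size and usage of the old and the new volume
  kubectl -n "$ns" exec alpinebox -- df -h /data-old /data-new
}
```

In the `df` output, /data-old should show up as the nearly full 20GiB disk and /data-new as the empty 30GiB one.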

As you can see, the existing prometheus-server pvc is mounted at */data-old* and the new, empty recovery pvc is mounted at */data-new*.

Now we need to sync */data-old* into the empty */data-new* mountpoint.

```
$ k -n monitoring exec -it alpinebox -- sh
apk update
apk add rsync
rsync -avzHS --progress /data-old/* /data-new/
# wait about 20 minutes for it to complete
exit
```
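Before trusting the copy, a quick sanity check can be run inside the alpinebox pod once rsync has finished. The helper below is my own sketch and not part of the original procedure; it only compares the number of regular files on both sides, so it is a rough check and no substitute for a checksum comparison.

```shell
# Hypothetical check: compare the number of regular files under two directories.
# Usage (inside the pod): same_file_count /data-old /data-new
same_file_count() {
  # tr strips the padding that some wc implementations add around the count
  src_count="$(find "$1" -type f | wc -l | tr -d ' ')"
  dst_count="$(find "$2" -type f | wc -l | tr -d ' ')"
  echo "source: $src_count files, destination: $dst_count files"
  [ "$src_count" -eq "$dst_count" ]
}
```

The function returns a non-zero status when the counts differ, which is a sign that the rsync should be run again before anything gets deleted.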
The data from the prometheus-server database is now saved on the new recovery pvc, so we can delete the old pvc.

3. Delete the current prometheus-server pvc

---
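The data has been saved. Now we need to remove the prometheus-server pvc and re-create it with the new size. The alpinebox pod still mounts this pvc, and the pvc-protection finalizer postpones the removal while a pod is using it, so delete the pod first.

```
# release the pvc held by the temporary pod (it is re-created in step 5)
$ kubectl delete -f alpinebox.yml
$ kubectl -n monitoring delete pvc prometheus-server
```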

4. Recreate the prometheus-server pvc

---

```
# prometheus-server-pvc.yml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  labels:
    app: prometheus
    chart: prometheus-9.2.0
    component: server
    heritage: Tiller
    release: prometheus
  name: prometheus-server
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 60Gi
  storageClassName: managed-premium

$ kubectl -n monitoring apply -f prometheus-server-pvc.yml
```
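It is worth confirming that the re-created pvc is bound and has the new size before restoring anything onto it. This is a small sketch with a hypothetical function name. It assumes the storage class binds immediately, as the listings above suggest; with a class that waits for the first consumer, the pvc would stay Pending until a pod mounts it.

```shell
# Hypothetical helper: succeed only if the prometheus-server pvc is Bound
# with the expected capacity.
# Usage: pvc_is_bound_with 60Gi
pvc_is_bound_with() {
  expected="$1"
  # jsonpath prints "<phase> <capacity>", for example "Bound 60Gi"
  status="$(kubectl -n monitoring get pvc prometheus-server \
    -o jsonpath='{.status.phase}{" "}{.status.capacity.storage}')"
  echo "prometheus-server pvc: $status"
  [ "$status" = "Bound $expected" ]
}
```

`pvc_is_bound_with 60Gi` prints the current state and returns non-zero if the pvc is not yet Bound or has a different size.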

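5. Restore the backup to the new prometheus-server pvc

---
Now we need to restore the backup data to the new prometheus-server pvc. The alpinebox pod mounts whichever pvc is named prometheus-server at */data-old*, so this time the new 60GiB disk is at */data-old* and the copy goes from */data-new* to */data-old*.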

```
$ k apply -f alpinebox.yml

$ k -n monitoring exec -it alpinebox -- sh
apk update
apk add rsync
rsync -avzHS --progress /data-new/* /data-old/
# wait about 20 minutes for it to complete
exit

$ k delete -f alpinebox.yml

```

6. Start the previously stopped prometheus-server deployment

---

```
$ kubectl -n monitoring scale deployment prometheus-server --replicas=1

$ kubectl -n monitoring logs -f prometheus-server-65d76f67cf-7htdv -c prometheus-server
level=info ts=2019-12-29T22:09:57.517Z caller=main.go:332 msg="Starting Prometheus" version="(version=2.13.1, branch=HEAD, revision=6f92ce56053866194ae5937012c1bec40f1dd1d9)"
level=info ts=2019-12-29T22:09:57.517Z caller=main.go:333 build_context="(go=go1.13.1, user=root@88e419aa1676, date=20191017-13:15:01)"
level=info ts=2019-12-29T22:09:57.517Z caller=main.go:334 host_details="(Linux 4.15.0-1059-azure #64-Ubuntu SMP Fri Sep 13 17:02:44 UTC 2019 x86_64 prometheus-server-65d76f67cf-7htdv (none))"
level=info ts=2019-12-29T22:09:57.518Z caller=main.go:335 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-12-29T22:09:57.518Z caller=main.go:336 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2019-12-29T22:09:57.530Z caller=web.go:450 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2019-12-29T22:09:57.530Z caller=main.go:657 msg="Starting TSDB ..."
level=info ts=2019-12-29T22:09:57.566Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1575676800000 maxt=1575741600000 ulid=01DVH2NWPSVX5Y5VQ7HA5SXESY
level=info ts=2019-12-29T22:09:57.569Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1575741600000 maxt=1575806400000 ulid=01DVK0FFTNN6BX0W6W3TQECMZY
level=info ts=2019-12-29T22:09:57.573Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1575806400000 maxt=1575871200000 ulid=01DVMY8TZJ1HZZYD371EDZJ5CZ
level=info ts=2019-12-29T22:09:57.576Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1575871200000 maxt=1575936000000 ulid=01DVPW2KCC86JEHD9YPHCA1DW2
level=info ts=2019-12-29T22:09:57.579Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1575936000000 maxt=1576000800000 ulid=01DVRSW4GZRKT9BF2VH7TD1SNF
.......
```
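The pod name changes every time the deployment is scaled up, so the `logs` command above needs the current name. As an alternative, here is a sketch that waits for the rollout and reads the logs through the deployment instead. The function name and the timeout are my own choices.

```shell
# Hypothetical helper: wait for the rollout, then show recent server logs.
# Usage: show_server_startup [namespace]
show_server_startup() {
  ns="${1:-monitoring}"
  # Returns non-zero if the deployment is not available within the timeout
  kubectl -n "$ns" rollout status deployment/prometheus-server --timeout=300s || return 1
  # Addressing the deployment lets kubectl pick the pod for us
  kubectl -n "$ns" logs deployment/prometheus-server -c prometheus-server --tail=20
}
```

If the restore worked, the log should contain the same "found healthy block" lines as above.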

If all is OK, you can delete the temporary recovery pvc:

```
$ kubectl -n monitoring get pvc
NAME                      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
alpinebox-recovery-pvc    Bound    pvc-c86f431e-....-11ea-9bfe-524e272515fe   30Gi       RWO            managed-premium   9h
grafana                   Bound    pvc-822d1613-....-11e9-b365-4aa5ceef3b39   20Gi       RWO            managed-premium   60d
prometheus-alertmanager   Bound    pvc-44e4eedd-....-11e9-b365-4aa5ceef3b39   10Gi       RWO            managed-premium   60d
prometheus-pushgateway    Bound    pvc-44e6c0d8-....-11e9-b365-4aa5ceef3b39   10Gi       RWO            managed-premium   60d
prometheus-server         Bound    pvc-75536a49-....-11ea-9bfe-524e272515fe   60Gi       RWO            managed-premium   8h

$ kubectl -n monitoring get pods
NAME                                                     READY   STATUS    RESTARTS   AGE
grafana-676f46565c-tqpzl                                 1/1     Running   0          39d
grafana-nginx-ingress-controller-5778fc5dcb-7vchz        1/1     Running   0          60d
grafana-nginx-ingress-controller-5778fc5dcb-kkmml        1/1     Running   0          60d
grafana-nginx-ingress-default-backend-7f879557f8-zvkm8   1/1     Running   0          60d
prometheus-alertmanager-788958f7c7-7rgdx                 2/2     Running   0          60d
prometheus-kube-state-metrics-55fb55b9db-8gmqt           1/1     Running   0          59d
prometheus-node-exporter-cqlql                           1/1     Running   0          60d
prometheus-node-exporter-k4xqf                           1/1     Running   0          60d
prometheus-node-exporter-p8cpj                           1/1     Running   0          60d
prometheus-pushgateway-699f55c47-8v7jq                   1/1     Running   0          60d
prometheus-server-65d76f67cf-jxl4k                       2/2     Running   0          9m24s

$ kubectl delete -f alpinebox-recovery-pvc.yml
```
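Since the original disk filled up without anyone noticing, it may be useful to check from time to time how full the new one is. The sketch below is an assumption-laden example and not part of the original procedure: the function name is hypothetical, it assumes a kubectl recent enough to accept `deployment/<name>` as an exec target, and it assumes the container image ships a `df` that understands `-P`.

```shell
# Hypothetical helper: print the usage of /data as a bare percentage, e.g. "31".
data_usage_percent() {
  # "df -P" keeps each filesystem on one line; column 5 is the usage in percent
  kubectl -n monitoring exec deployment/prometheus-server -c prometheus-server -- df -P /data \
    | awk 'NR==2 { sub("%", "", $5); print $5 }'
}
```

The result can then be compared with a threshold of your choice, for example `[ "$(data_usage_percent)" -lt 80 ]`.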

Thanks for reading :)
