Prometheus Server Pod with high load is frequently evicted (how to fix)
Recently I deployed Prometheus, Grafana, Alertmanager and PushGateway using the official Helm chart. In this case the k8s cluster is in production, so TLS is required. I also have many clusters running Helm 2.x in production, and some modifications were needed to use a different private key and certificate per cluster. In the end I rewrote my .bashrc and a few other files to include something like this (only the relevant sections are shown):
```
# ~/.bashrc
....
# https://github.com/ahmetb/kubectl-aliases
[ -f ~/.kubectl_aliases ] && source ~/.kubectl_aliases
....
# kubernetes specific
export KUBE_EDITOR="nano"
source <(kubectl completion bash) # set up autocomplete in the current shell; the bash-completion package must be installed first
....
```
And in another file:
```
# ~/.bash_aliases
# note for cluster AKS
# Error: incompatible versions client[v2.16.1] server[v2.13.1]
# https://github.com/helm/helm/releases/tag/v2.13.1
# https://get.helm.sh/helm-v2.13.1-linux-amd64.tar.gz
# https://medium.com/nuvo-group-tech/configure-helm-tls-communication-with-multiple-kubernetes-clusters-5e58674352e2
alias tls='cluster=$(kubectl config view -o jsonpath="{.clusters[].name}" --minify); echo -n "--tls --tls-cert $(helm home)/tls/$cluster/cert.pem --tls-key $(helm home)/tls/$cluster/key.pem"'
function helmet() {
helm "$@" $(tls)
}
```
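With this in place, any helm subcommand can be run through helmet and the per-cluster TLS flags are appended automatically. A sketch of what the wrapper expands to, assuming the current kubectl context points at MyFirstCluster:
```
$ helmet ls
# is roughly equivalent to:
$ helm ls --tls \
    --tls-cert ~/.helm/tls/MyFirstCluster/cert.pem \
    --tls-key ~/.helm/tls/MyFirstCluster/key.pem
```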
My ~/.helm folder:
```
.helm/
├── cache
│   └── archive
│       ├── apache-6.0.3.tgz
│       ├── cert-manager-v0.11.0.tgz
│       ├── cert-manager-v0.8.0.tgz
│       ├── external-dns-2.9.0.tgz
│       ├── grafana-3.8.3.tgz
│       ├── grafana-4.0.0.tgz
│       ├── jenkins-1.5.4.tgz
│       ├── kubernetes-dashboard-1.5.2.tgz
│       ├── loki-0.17.0.tgz
│       ├── loki-0.22.0.tgz
│       ├── loki-stack-0.16.0.tgz
│       ├── loki-stack-0.17.0.tgz
│       ├── mariadb-6.2.2.tgz
│       ├── mysql-0.19.0.tgz
│       ├── mysql-1.3.0.tgz
│       ├── mysql-1.4.0.tgz
│       ├── nginx-ingress-1.24.3.tgz
│       ├── nginx-ingress-1.24.4.tgz
│       ├── nginx-ingress-1.24.5.tgz
│       ├── nginx-ingress-1.24.6.tgz
│       ├── prometheus-operator-6.0.0.tgz
│       ├── prometheus-operator-6.21.0.tgz
│       ├── prometheus-operator-6.21.1.tgz
│       ├── prometheus-operator-6.7.2.tgz
│       ├── promtail-0.13.0.tgz
│       ├── promtail-0.16.0.tgz
│       ......
│       └── wordpress-5.9.8.tgz
├── plugins
├── repository
│   ├── cache
│   │   ├── bitnami-index.yaml
│   │   ├── jetstack-index.yaml
│   │   ├── local-index.yaml
│   │   ├── loki-index.yaml
│   │   └── stable-index.yaml
│   ├── local
│   │   └── index.yaml
│   └── repositories.yaml
├── starters
└── tls
    ├── MyFirstCluster
    │   ├── ca.pem
    │   ├── cert.pem
    │   └── key.pem
    └── MySecondCluster
        ├── ca.pem
        ├── cert.pem
        └── key.pem
    ......
```
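Each cluster folder holds the client certificate and key that the --tls flags point at. To check which cluster a certificate was issued for and when it expires (requires openssl; MyFirstCluster is just the example from the tree above):
```
$ openssl x509 -in ~/.helm/tls/MyFirstCluster/cert.pem -noout -subject -enddate
```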
Overview
The initial Prometheus deployment was made using the stable/prometheus Helm chart.
```
$ helm repo update
$ helmet install --name prometheus stable/prometheus \
    --namespace monitoring \
    --set rbac.create=true \
    --set server.persistentVolume.enabled=true \
    --set server.persistentVolume.size=20Gi \
    --set server.persistentVolume.storageClass=managed-premium \
    --set alertmanager.persistentVolume.enabled=true \
    --set alertmanager.persistentVolume.size=20Gi \
    --set alertmanager.persistentVolume.storageClass=managed-premium \
    --set pushgateway.persistentVolume.enabled=true \
    --set pushgateway.persistentVolume.size=20Gi \
    --set pushgateway.persistentVolume.storageClass=managed-premium \
    --set server.terminationGracePeriodSeconds=360
```
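All those --set flags can also live in a values file, which is easier to keep in version control. A sketch of the same settings (the file name monitoring-values.yaml is just an example):
```
$ cat > monitoring-values.yaml <<'EOF'
rbac:
  create: true
server:
  terminationGracePeriodSeconds: 360
  persistentVolume:
    enabled: true
    size: 20Gi
    storageClass: managed-premium
alertmanager:
  persistentVolume:
    enabled: true
    size: 20Gi
    storageClass: managed-premium
pushgateway:
  persistentVolume:
    enabled: true
    size: 20Gi
    storageClass: managed-premium
EOF
$ helmet install --name prometheus stable/prometheus \
    --namespace monitoring -f monitoring-values.yaml
```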
Several PersistentVolumeClaims were created, each of 20 GiB.
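To verify them (a quick check; prometheus-server appears as claimName in the manifest below, the other claim names depend on the chart version):
```
$ kubectl -n monitoring get pvc
```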
Everything worked fine, but after one month the metric volume had grown so much that, whenever the prometheus-server pod was overloaded, Kubernetes killed it: the original liveness and readiness probes exceeded their failure thresholds. We need to adjust the Kubernetes probes.
Deployment review
The original deployment for prometheus-server is:
```
$ k -n monitoring edit deployment prometheus-server
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2019-10-30T00:50:30Z"
  generation: 3
  labels:
    app: prometheus
    chart: prometheus-9.2.0
    component: server
    heritage: Tiller
    release: prometheus
  name: prometheus-server
  namespace: monitoring
  resourceVersion: "13955013"
  selfLink: /apis/extensions/v1beta1/namespaces/monitoring/deployments/prometheus-server
  uid: 4561936a-faaf-11e9-b365-4aa5ceef3b39
spec:
  progressDeadlineSeconds: 2147483647
  replicas: 1
  revisionHistoryLimit: 2147483647
  selector:
    matchLabels:
      app: prometheus
      component: server
      release: prometheus
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: prometheus
        chart: prometheus-9.2.0
        component: server
        heritage: Tiller
        release: prometheus
    spec:
      containers:
      - args:
        - --volume-dir=/etc/config
        - --webhook-url=http://127.0.0.1:9090/-/reload
        image: jimmidyson/configmap-reload:v0.2.2
        imagePullPolicy: IfNotPresent
        name: prometheus-server-configmap-reload
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/config
          name: config-volume
          readOnly: true
      - args:
        - --storage.tsdb.retention.time=15d
        - --config.file=/etc/config/prometheus.yml
        - --storage.tsdb.path=/data
        - --web.console.libraries=/etc/prometheus/console_libraries
        - --web.console.templates=/etc/prometheus/consoles
        - --web.enable-lifecycle
        image: prom/prometheus:v2.13.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /-/healthy
            port: 9090
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 30
        name: prometheus-server
        ports:
        - containerPort: 9090
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /-/ready
            port: 9090
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 30
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/config
          name: config-volume
        - mountPath: /data
          name: storage-volume
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 65534
        runAsGroup: 65534
        runAsNonRoot: true
        runAsUser: 65534
      serviceAccount: prometheus-server
      serviceAccountName: prometheus-server
      terminationGracePeriodSeconds: 360
      volumes:
      - configMap:
          defaultMode: 420
          name: prometheus-server
        name: config-volume
      - name: storage-volume
        persistentVolumeClaim:
          claimName: prometheus-server
status:
  conditions:
  - lastTransitionTime: "2019-10-30T00:50:30Z"
    lastUpdateTime: "2019-10-30T00:50:30Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 3
  replicas: 1
  unavailableReplicas: 1
  updatedReplicas: 1
```
The sections that matter here are the two probes:
```
....
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /-/healthy
            port: 9090
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 30
        name: prometheus-server
        ports:
        - containerPort: 9090
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /-/ready
            port: 9090
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 30
....
```
Now, after sixty days, the TSDB has more than 5 million rows and the server takes more than six minutes to be fully up and responsive (startup is slow mostly because Prometheus has to replay its WAL).
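The arithmetic explains the restart loop (probe numbers from the manifest above; the six-minute startup was measured on this cluster):
```
# the liveness probe gives up after roughly:
#   initialDelaySeconds + failureThreshold * periodSeconds
#   = 30 + 3 * 10 = 60 seconds
# but a cold start with ~60 days of data needs ~360 seconds,
# so the kubelet kills the container long before Prometheus is
# healthy and the pod restarts in a loop
```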
Fixes
To inspect the prometheus-server pod:
```
$ k -n monitoring get pods
NAME                                                      READY   STATUS    RESTARTS   AGE
grafana-676f46565c-tqpzl                                  1/1     Running   0          39d
grafana-nginx-ingress-controller-5778fc5dcb-7vchz         1/1     Running   0          60d
grafana-nginx-ingress-controller-5778fc5dcb-kkmml         1/1     Running   0          60d
grafana-nginx-ingress-default-backend-7f879557f8-zvkm8    1/1     Running   0          60d
prometheus-alertmanager-788958f7c7-7rgdx                  2/2     Running   0          61d
prometheus-kube-state-metrics-55fb55b9db-8gmqt            1/1     Running   0          59d
prometheus-node-exporter-cqlql                            1/1     Running   0          61d
prometheus-node-exporter-k4xqf                            1/1     Running   0          61d
prometheus-node-exporter-p8cpj                            1/1     Running   0          61d
prometheus-pushgateway-699f55c47-8v7jq                    1/1     Running   0          61d
prometheus-server-745f77d49b-v77ll                        2/2     Running   3          154m
$ k -n monitoring describe pod prometheus-server-745f77d49b-v77ll
Name:           prometheus-server-745f77d49b-v77ll
Namespace:      monitoring
Priority:       0
Node:           aks-nodepool1-20238707-0/10.244.0.4
Start Time:     Sun, 29 Dec 2019 21:35:57 -0400
Labels:         app=prometheus
                chart=prometheus-9.2.0
                component=server
                heritage=Tiller
                pod-template-hash=745f77d49b
                release=prometheus
Annotations:    <none>
Status:         Running
IP:             10.244.0.7
IPs:            <none>
Controlled By:  ReplicaSet/prometheus-server-745f77d49b
Containers:
  prometheus-server-configmap-reload:
    Container ID:  docker://3237d4bf6957a76ba67021dc25da55475646041bd9b5829e3113402c9457f1f5
    Image:         jimmidyson/configmap-reload:v0.2.2
    Image ID:      docker-pullable://jimmidyson/configmap-reload@sha256:befec9f23d2a9da86a298d448cc9140f56a457362a7d9eecddba192db1ab489e
    Port:          <none>
    Host Port:     <none>
    Args:
      --volume-dir=/etc/config
      --webhook-url=http://127.0.0.1:9090/-/reload
    State:          Running
      Started:      Sun, 29 Dec 2019 21:36:04 -0400
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/config from config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-server-token-v4jg2 (ro)
  prometheus-server:
    Container ID:  docker://87cec70c5bc65ad2db5d2e0b69b90c256a8a3c7cd56383bb08e15c486f91ffeb
    Image:         prom/prometheus:v2.13.1
    Image ID:      docker-pullable://prom/prometheus@sha256:0a8caa2e9f19907608915db6e62a67383fe44b9876a467b297ee6f64e51dd58a
    Port:          9090/TCP
    Host Port:     0/TCP
    Args:
      --storage.tsdb.retention.time=15d
      --config.file=/etc/config/prometheus.yml
      --storage.tsdb.path=/data
      --web.console.libraries=/etc/prometheus/console_libraries
      --web.console.templates=/etc/prometheus/consoles
      --web.enable-lifecycle
    State:          Running
      Started:      Sun, 29 Dec 2019 21:36:55 -0400
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sun, 29 Dec 2019 21:36:24 -0400
      Finished:     Sun, 29 Dec 2019 21:36:25 -0400
    Ready:          True
    Restart Count:  3
    Liveness:       http-get http://:9090/-/healthy delay=600s timeout=1s period=30s #success=1 #failure=3
    Readiness:      http-get http://:9090/-/ready delay=600s timeout=1s period=30s #success=1 #failure=20
    Environment:    <none>
    Mounts:
      /data from storage-volume (rw)
      /etc/config from config-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-server-token-v4jg2 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-server
    Optional:  false
  storage-volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  prometheus-server
    ReadOnly:   false
  prometheus-server-token-v4jg2:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-server-token-v4jg2
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>
```
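The probe settings can also be read straight from the deployment instead of scrolling through describe output; a jsonpath sketch (the index [1] assumes prometheus-server is the second container, as in the manifest above):
```
$ kubectl -n monitoring get deployment prometheus-server \
    -o jsonpath='{.spec.template.spec.containers[1].livenessProbe}'
```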
To check the logs of the prometheus server container:
```
$ kubectl -n monitoring logs -f prometheus-server-745f77d49b-v77ll -c prometheus-server
level=info ts=2019-12-30T01:53:14.999Z caller=compact.go:441 component=tsdb msg="compact blocks" count=3 mint=1577037600000 maxt=1577059200000 ulid=01DXA83MACD5PP4XHX7TC2K0BS sources="[01DXA7GB0XH5E69931MBAXWCPH 01DXA7GZP72WW93TYKKZDX35ZD 01DXA7HKN5BZF5AK1GTVZWK5M9]" duration=3.499215571s
level=info ts=2019-12-30T01:53:20.402Z caller=compact.go:441 component=tsdb msg="compact blocks" count=3 mint=1577059200000 maxt=1577080800000 ulid=01DXA83SCFC0HQDWSNYXK4TXVV sources="[01DXA7J7H1ENJS14EFGBNYE8P2 01DXA7JT9J72SX1PCVWD23HQFT 01DXA7KDEPGQ6MRT8AXQSVYSJ0]" duration=3.71530483s
level=info ts=2019-12-30T01:53:25.600Z caller=compact.go:441 component=tsdb msg="compact blocks" count=3 mint=1577080800000 maxt=1577102400000 ulid=01DXA83YJMBA7759C5QD1CEDWJ sources="[01DXA7M0EKGW2V92QZQSFFCAYJ 01DXA7MK45405TASTFHVJT02Q6 01DXA7MZR8EAB12T2C87EXWK7B]" duration=3.595686298s
level=info ts=2019-12-30T01:53:30.706Z caller=compact.go:441 component=tsdb msg="compact blocks" count=3 mint=1577102400000 maxt=1577124000000 ulid=01DXA843MY2J47EZ8WMADK7KWT sources="[01DXA7NA5P8G0ZKPYWKBY5MJ8G 01DXA7NGB1GNMA89QYMESVSGX3 01DXA7NTDZ65T5F794175R38M5]" duration=3.507732636s
level=info ts=2019-12-30T01:53:34.014Z caller=compact.go:441 component=tsdb msg="compact blocks" count=3 mint=1577124000000 maxt=1577145600000 ulid=01DXA848KKMPQQR3KR71AH5XDS sources="[01DXA7P04K1JXAQTNKKV4NY0D1 01DXA7P8N72MA1E70W3AREZP4N 01DXA7PCD44Q2FRC57MM438AF9]" duration=1.738807856s
level=info ts=2019-12-30T01:55:57.170Z caller=compact.go:441 component=tsdb msg="compact blocks" count=3 mint=1576972800000 maxt=1577037600000 ulid=01DXA84CAGGD83G33ZHWVXE09P sources="[01DWPDQBAK8Y3VNP2V4G2K4A8W 01DXA80KFNEJM5VRX6ECXZAZ39 01DXA8333RXBRJY4NBJRADJC99]" duration=2m21.08970652s
level=info ts=2019-12-30T01:56:04.508Z caller=compact.go:441 component=tsdb msg="compact blocks" count=3 mint=1577037600000 maxt=1577102400000 ulid=01DXA88R0Q37C0643WKG7VM38Z sources="[01DXA83MACD5PP4XHX7TC2K0BS 01DXA83SCFC0HQDWSNYXK4TXVV 01DXA83YJMBA7759C5QD1CEDWJ]" duration=5.38087382s
level=info ts=2019-12-30T03:00:03.264Z caller=compact.go:496 component=tsdb msg="write block" mint=1577664000000 maxt=1577671200000 ulid=01DXABXYZCSJF71KZ4DWHP2SR5 duration=3.155977955s
level=info ts=2019-12-30T03:00:04.360Z caller=head.go:598 component=tsdb msg="head GC completed" duration=209.807162ms
level=info ts=2019-12-30T03:00:08.474Z caller=head.go:668 component=tsdb msg="WAL checkpoint complete" first=2703 last=2704 duration=4.113514824s
level=info ts=2019-12-30T03:00:12.754Z caller=compact.go:441 component=tsdb msg="compact blocks" count=3 mint=1577102400000 maxt=1577152800000 ulid=01DXABY7XS038KHWDTZ83DPPZ9 sources="[01DXA843MY2J47EZ8WMADK7KWT 01DXA848KKMPQQR3KR71AH5XDS 01DXA7PK4TV7BBRA81W59QH8WS]" duration=3.481380438s
.....
```
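During a cold start the same log stream shows the WAL replay that makes startup so slow. A hedged filter (the exact message text varies between Prometheus versions):
```
$ kubectl -n monitoring logs prometheus-server-745f77d49b-v77ll -c prometheus-server \
    | grep -iE 'replaying wal|wal segment loaded|server is ready'
```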
To kill an unresponsive pod:
```
$ kubectl -n monitoring delete pod prometheus-server-745f77d49b-gcfzt --force --grace-period 0
```
The solution, found by trial and error
Edit the prometheus-server deployment and adjust the probes:
```
$ kubectl -n monitoring edit deployment prometheus-server
```
```
....
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /-/healthy
            port: 9090
            scheme: HTTP
          initialDelaySeconds: 600
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 30
        name: prometheus-server
        ports:
        - containerPort: 9090
          protocol: TCP
        readinessProbe:
          failureThreshold: 20
          httpGet:
            path: /-/ready
            port: 9090
            scheme: HTTP
          initialDelaySeconds: 600
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 30
....
```
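Keep in mind that changes made with kubectl edit are reverted the next time the release is upgraded with Helm. If your chart version exposes the probe timings in values.yaml, prefer setting them there; the value names below are a sketch and must be checked against your chart version:
```
$ helmet upgrade prometheus stable/prometheus \
    --namespace monitoring \
    --reuse-values \
    --set server.livenessProbeInitialDelay=600 \
    --set server.readinessProbeInitialDelay=600
```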
After these modifications the prometheus-server pod works fine, and there are no more evicted pods here.
Note:
If you need to delete Evicted pods in your cluster:
```
$ cat delete-evicted-pods-all-namespaces.sh
#!/bin/sh
# based on https://gist.github.com/ipedrazas/9c622404fb41f2343a0db85b3821275d
# xargs -n2 runs one kubectl per pod so each name stays paired with its
# --namespace flag; -r (GNU xargs) skips the call when nothing matches
# delete all evicted pods from all namespaces
kubectl get pods --all-namespaces | grep Evicted | awk '{print $2 " --namespace=" $1}' | xargs -r -n2 kubectl delete pod
# delete all pods in ImagePullBackOff state from all namespaces
kubectl get pods --all-namespaces | grep 'ImagePullBackOff' | awk '{print $2 " --namespace=" $1}' | xargs -r -n2 kubectl delete pod
# or all at once: pods in ImagePullBackOff, ErrImagePull or Evicted state from all namespaces
kubectl get pods --all-namespaces | grep -E 'ImagePullBackOff|ErrImagePull|Evicted' | awk '{print $2 " --namespace=" $1}' | xargs -r -n2 kubectl delete pod
```
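On newer kubectl versions there is a cleaner alternative that avoids parsing text output, relying on the fact that evicted pods end up in phase Failed:
```
$ kubectl delete pods --all-namespaces --field-selector=status.phase=Failed
```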
Thanks for reading :)