I have deployed a simple dummy application that performs a number of "calculations" (sleeps) in parallel. It is deployed with Terraform / Terragrunt here: https://github.com/teticio/latency (see the 10-eks-rabbitmq section). It uses RabbitMQ (Bitnami Helm chart) together with Prometheus, Prometheus Operator and Prometheus Adapter (all Prometheus Community Helm charts) to provide a queue whose depth is exposed as a metric the HPA can scale on.
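For context, the queue depth reaches the HPA through a Prometheus Adapter custom metrics rule. The exact rule is in the repo above; as a rough, simplified sketch (the seriesQuery, label overrides and metricsQuery below are only indicative, not copied from the repo), it has this general shape:

rules:
  custom:
    - seriesQuery: 'rabbitmq_queue_messages_ready{namespace!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^rabbitmq_queue_messages_ready$"
        as: "rabbitmq_queue_messages_ready"
      metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'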

The HPA is defined as follows:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: calc-hpa
  namespace: default
spec:
  behavior:
    scaleDown:
      policies:
      - periodSeconds: 15
        type: Percent
        value: 100
      selectPolicy: Max
    scaleUp:
      policies:
      - periodSeconds: 15
        type: Pods
        value: 4
      - periodSeconds: 15
        type: Percent
        value: 100
      selectPolicy: Max
      stabilizationWindowSeconds: 0
  maxReplicas: 10
  metrics:
  - pods:
      metric:
        name: rabbitmq_queue_messages_ready
      target:
        averageValue: "10"
        type: AverageValue
    type: Pods
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: calc

When I load the queue with many requests, the HPA correctly scales up. Initially only 6 of the 10 required pods can be scheduled, but Karpenter correctly scales the cluster from 1 node to 2, and eventually all 10 pods are running.
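Here is what kubectl describe hpa calc-hpa shows at that point: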

Reference:                                  Deployment/calc
Metrics:                                    ( current / target )
  "rabbitmq_queue_messages_ready" on pods:  266 / 10
Min replicas:                               1
Max replicas:                               10
Behavior:
  Scale Up:
    Stabilization Window: 0 seconds
    Select Policy: Max
    Policies:
      - Type: Pods     Value: 4    Period: 15 seconds
      - Type: Percent  Value: 100  Period: 15 seconds
  Scale Down:
    Select Policy: Max
    Policies:
      - Type: Percent  Value: 100  Period: 15 seconds
Deployment pods:       10 current / 10 desired
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  recommended size matches current size
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from pods metric rabbitmq_queue_messages_ready
  ScalingLimited  True    TooManyReplicas   the desired replica count is more than the maximum replica count

However, once the queue has drained back down to 0, the current metric value drops to 0, yet the HPA never scales back down. Even hours later, the same describe output shows:

Reference:                                  Deployment/calc
Metrics:                                    ( current / target )
  "rabbitmq_queue_messages_ready" on pods:  0 / 10
Min replicas:                               1
Max replicas:                               10
Behavior:
  Scale Up:
    Stabilization Window: 0 seconds
    Select Policy: Max
    Policies:
      - Type: Pods     Value: 4    Period: 15 seconds
      - Type: Percent  Value: 100  Period: 15 seconds
  Scale Down:
    Select Policy: Max
    Policies:
      - Type: Percent  Value: 100  Period: 15 seconds
Deployment pods:       10 current / 10 desired
Conditions:
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    ReadyForNewScale    recommended size matches current size
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from pods metric rabbitmq_queue_messages_ready
  ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
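
For what it's worth, the raw value the HPA sees can also be inspected directly from the custom metrics API (a sketch, assuming Prometheus Adapter registers the metric under the standard custom.metrics.k8s.io/v1beta1 group):

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/rabbitmq_queue_messages_ready"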

What I don't understand is: if current / target is 0 / 10, why does it still report 10 desired? If I run

kubectl scale --replicas 1 deployment/calc

it works perfectly, deleting all but one of the pods and terminating the second node instance.
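
As I understand the HPA algorithm, with an AverageValue pods metric the recommendation should be roughly

desiredReplicas = ceil(currentReplicas * currentAverage / target) = ceil(10 * 0 / 10) = 0

which is then clamped up to minReplicas = 1, so I would expect the HPA to converge on 1 replica rather than stay at 10.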
