I have deployed a simple dummy application that performs a number of "calculations" (sleeps) in parallel. It is deployed with Terraform / Terragrunt here: https://github.com/teticio/latency (see the 10-eks-rabbitmq section). It uses RabbitMQ (Bitnami Helm chart) together with Prometheus, Prometheus Operator and Prometheus Adapter (all Prometheus Community Helm charts) to provide a queue whose depth is exposed as a metric the HPA can scale on.
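For context, the queue depth reaches the HPA through a Prometheus Adapter custom metrics rule. The exact rule is in the repo above; as a rough, simplified sketch (the seriesQuery, label overrides and metricsQuery below are only indicative, not copied from the repo), it has this general shape:

rules:
  custom:
    - seriesQuery: 'rabbitmq_queue_messages_ready{namespace!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^rabbitmq_queue_messages_ready$"
        as: "rabbitmq_queue_messages_ready"
      metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'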

The HPA is defined as follows:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: calc-hpa
  namespace: default
spec:
  behavior:
    scaleDown:
      policies:
      - periodSeconds: 15
        type: Percent
        value: 100
      selectPolicy: Max
    scaleUp:
      policies:
      - periodSeconds: 15
        type: Pods
        value: 4
      - periodSeconds: 15
        type: Percent
        value: 100
      selectPolicy: Max
      stabilizationWindowSeconds: 0
  maxReplicas: 10
  metrics:
  - pods:
      metric:
        name: rabbitmq_queue_messages_ready
      target:
        averageValue: "10"
        type: AverageValue
    type: Pods
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: calc

When I load the queue with many requests, the HPA correctly scales up. Initially only 6 of the 10 required pods can be scheduled, but Karpenter correctly scales the cluster from 1 node to 2, and eventually all 10 pods are running.
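Here is what kubectl describe hpa calc-hpa shows at that point: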

Reference:                                  Deployment/calc
Metrics:                                    ( current / target )
  "rabbitmq_queue_messages_ready" on pods:  266 / 10
Min replicas:                               1
Max replicas:                               10
Behavior:
  Scale Up:
    Stabilization Window: 0 seconds
    Select Policy: Max
    Policies:
      - Type: Pods     Value: 4    Period: 15 seconds
      - Type: Percent  Value: 100  Period: 15 seconds
  Scale Down:
    Select Policy: Max
    Policies:
      - Type: Percent  Value: 100  Period: 15 seconds
Deployment pods:       10 current / 10 desired
Conditions:
  Type            Status  Reason            Message
  ----            ------  ------            -------
  AbleToScale     True    ReadyForNewScale  recommended size matches current size
  ScalingActive   True    ValidMetricFound  the HPA was able to successfully calculate a replica count from pods metric rabbitmq_queue_messages_ready
  ScalingLimited  True    TooManyReplicas   the desired replica count is more than the maximum replica count

However, once the queue has drained back down to 0, the current metric value drops to 0, yet the HPA never scales back down. Even hours later, the same describe output shows:

Reference:                                  Deployment/calc
Metrics:                                    ( current / target )
  "rabbitmq_queue_messages_ready" on pods:  0 / 10
Min replicas:                               1
Max replicas:                               10
Behavior:
  Scale Up:
    Stabilization Window: 0 seconds
    Select Policy: Max
    Policies:
      - Type: Pods     Value: 4    Period: 15 seconds
      - Type: Percent  Value: 100  Period: 15 seconds
  Scale Down:
    Select Policy: Max
    Policies:
      - Type: Percent  Value: 100  Period: 15 seconds
Deployment pods:       10 current / 10 desired
Conditions:
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    ReadyForNewScale    recommended size matches current size
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from pods metric rabbitmq_queue_messages_ready
  ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
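
For what it's worth, the raw value the HPA sees can also be inspected directly from the custom metrics API (a sketch, assuming Prometheus Adapter registers the metric under the standard custom.metrics.k8s.io/v1beta1 group):

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/rabbitmq_queue_messages_ready"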

What I don't understand is: if current / target is 0 / 10, why does it still report 10 desired? If I run

kubectl scale --replicas 1 deployment/calc

it works perfectly, deleting all but one of the pods and terminating the second node instance.
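
As I understand the HPA algorithm, with an AverageValue pods metric the recommendation should be roughly

desiredReplicas = ceil(currentReplicas * currentAverage / target) = ceil(10 * 0 / 10) = 0

which is then clamped up to minReplicas = 1, so I would expect the HPA to converge on 1 replica rather than stay at 10.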
