Scaled Jobs

This guide covers configuring the pgt-scaledjob Helm chart for deploying event-driven jobs to Kubernetes. The chart creates KEDA ScaledJob resources that automatically scale job executions based on external event sources such as message queues, databases, or custom metrics.

When to Use ScaledJobs

ScaledJobs are ideal for:

- Queue processing: Process messages from SQS, RabbitMQ, Kafka, Azure Service Bus, etc.
- Event-driven workloads: Scale based on events from external systems
- Batch processing: Process items in batches with automatic scaling
- Background tasks: Execute tasks triggered by external events

For time-based scheduling, use CronJobs instead.

Prerequisites

Add the chart as a dependency in your Chart.yaml:

apiVersion: v2
name: my-scaledjob
version: 0.0.1
dependencies:
  - name: pgt-scaledjob
    version: 0.0.3
    repository: oci://public.ecr.aws/w9m9e0e9/pgt-helm-charts

After adding the dependency, run:

helm dependency update

ℹ️ KEDA Required
ScaledJobs require KEDA to be installed in your cluster. KEDA monitors your event sources and automatically creates Jobs when events are detected.
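
For orientation, a KEDA ScaledJob is a custom resource of roughly the following shape. This is a minimal sketch based on the KEDA `keda.sh/v1alpha1` API, not the chart's actual rendered output; the container name `worker` and the exact field layout the chart produces are assumptions.

```yaml
# Illustrative only: the manifest rendered by pgt-scaledjob may differ.
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: my-scaledjob
spec:
  pollingInterval: 30            # seconds between checks of the event source
  minReplicaCount: 0
  maxReplicaCount: 100
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTargetRef:                  # template for the Jobs that KEDA creates
    parallelism: 1
    completions: 1
    backoffLimit: 3
    template:
      spec:
        serviceAccountName: my-scaledjob-sa
        restartPolicy: OnFailure
        containers:
          - name: worker         # hypothetical container name
            image: public.ecr.aws/my-org/my-worker:1.0.0
  triggers:                      # event sources that KEDA polls
    - type: aws-sqs-queue
      metadata:
        queueURL: "https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue"
        queueLength: "5"
        awsRegion: "eu-west-1"
```

The chart values described in this guide map onto these fields, so you normally never write this resource by hand.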

Basic Configuration

Required Fields

pgt-scaledjob:
  # [Required] ScaledJob name - used as base name for all K8s resources
  name: my-scaledjob

  # [Required] Organisation name for labeling
  organisationName: my-org

  # [Required] KEDA triggers - what events trigger job creation
  scaledjob:
    triggers:
      - type: aws-sqs-queue
        metadata:
          queueURL: "https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue"
          queueLength: "5"
          awsRegion: "eu-west-1"
          identityOwner: operator

  # [Required] Container image configuration
  container:
    image:
      registry: public.ecr.aws
      repository: my-org/my-worker
      tag: "1.0.0"
    resources:
      limits:
        memory: 512Mi
      requests:
        memory: 512Mi  # Keep equal to the memory limit
        cpu: 100m

  # [Required] ServiceAccount name
  serviceAccount:
    name: my-scaledjob-sa

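⚠️ Memory Requests and Limits
Always set memory requests equal to memory limits. The scheduler then reserves as much memory as the container is allowed to use, so the pod is never placed on a node that cannot cover its limit. Note that Kubernetes assigns the Guaranteed Quality of Service (QoS) class, with its lower eviction and OOM kill priority, only when CPU requests also equal CPU limits for every container; with matching memory values but no CPU limit, the pod is classed as Burstable.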
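
If the Guaranteed QoS class is the goal, a sketch like the following sets a CPU limit equal to the CPU request in addition to the matching memory values. It assumes the chart passes `container.resources` through to the pod spec unchanged, including `limits.cpu`, which this guide does not show elsewhere.

```yaml
pgt-scaledjob:
  container:
    resources:
      limits:
        memory: 512Mi
        cpu: 100m      # assumption: the chart accepts limits.cpu
      requests:
        memory: 512Mi  # equal to the memory limit
        cpu: 100m      # equal to the CPU limit
```

A CPU limit can cause throttling under load, so whether Guaranteed QoS is worth it depends on the workload.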

Development Environment & Cost Optimization

When deploying ScaledJobs to non-production environments, consider these settings to reduce costs:

pgt-scaledjob:
  scaledjob:
    minReplicaCount: 0      # Scale to zero when no events
    maxReplicaCount: 5      # Limit concurrent jobs in dev
    pollingInterval: 60     # Check less frequently in dev

  affinity:
    nodeAffinity:
      preferSpotInstances: true  # Prefer running on spot instances

  container:
    resources:
      limits:
        memory: 256Mi  # Right-size for dev workloads
      requests:
        memory: 256Mi
        cpu: 50m

💡 Spot Instances
The preferSpotInstances: true setting prefers scheduling your workloads on spot instances when available, which can significantly reduce compute costs. If you're interested in enabling spot instances for your environment, please reach out to the platform team to discuss your requirements and ensure your applications are suitable for spot instance usage.

💰 Cost-Saving Tips for ScaledJobs
- Scale to zero: Set minReplicaCount: 0 to ensure no jobs run when there are no events to process
- Limit max replicas: Use a lower maxReplicaCount in non-production to prevent runaway scaling
- Increase polling interval: A longer pollingInterval reduces API calls to your event source
- Right-size resources: Development jobs often need fewer resources than production

KEDA Triggers

Triggers define what events cause KEDA to create new Jobs. KEDA supports many trigger types including message queues, databases, HTTP endpoints, and custom metrics.

AWS SQS Queue

Scale based on messages in an AWS SQS queue:

pgt-scaledjob:
  scaledjob:
    triggers:
      - type: aws-sqs-queue
        metadata:
          queueURL: "https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue"
          queueLength: "5"       # Create a job for every 5 messages
          awsRegion: "eu-west-1"
          identityOwner: operator  # Use KEDA operator's identity
        authenticationRef:
          name: my-scaledjob-trigger-auth  # Reference to TriggerAuthentication

Azure Service Bus Queue

Scale based on messages in an Azure Service Bus queue:

pgt-scaledjob:
  scaledjob:
    triggers:
      - type: azure-servicebus
        metadata:
          queueName: my-queue
          namespace: my-servicebus-namespace
          messageCount: "5"
        authenticationRef:
          name: my-scaledjob-trigger-auth

RabbitMQ Queue

Scale based on messages in a RabbitMQ queue:

pgt-scaledjob:
  scaledjob:
    triggers:
      - type: rabbitmq
        metadata:
          queueName: my-queue
          mode: QueueLength
          value: "5"
        authenticationRef:
          name: my-scaledjob-trigger-auth

Kafka Topic

Scale based on consumer lag in a Kafka topic:

pgt-scaledjob:
  scaledjob:
    triggers:
      - type: kafka
        metadata:
          bootstrapServers: kafka.example.com:9092
          consumerGroup: my-consumer-group
          topic: my-topic
          lagThreshold: "10"
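
The trigger above connects to Kafka without credentials. For a cluster that requires SASL, one possible sketch combines the chart's `triggerAuthentication.secretTargetRef` with the `sasl`, `username` and `password` authentication parameters of the KEDA Kafka scaler. The Secret `kafka-credentials` and its keys are hypothetical, and the `authenticationRef` name assumes the `<name>-trigger-auth` pattern used in the other examples of this guide.

```yaml
pgt-scaledjob:
  triggerAuthentication:
    enabled: true
    secretTargetRef:
      - parameter: sasl            # e.g. the value "scram_sha512" stored in the Secret
        name: kafka-credentials    # hypothetical Secret
        key: sasl
      - parameter: username
        name: kafka-credentials
        key: username
      - parameter: password
        name: kafka-credentials
        key: password

  scaledjob:
    triggers:
      - type: kafka
        metadata:
          bootstrapServers: kafka.example.com:9092
          consumerGroup: my-consumer-group
          topic: my-topic
          lagThreshold: "10"
        authenticationRef:
          name: my-scaledjob-trigger-auth
```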

PostgreSQL Query

Scale based on a PostgreSQL query result:

pgt-scaledjob:
  scaledjob:
    triggers:
      - type: postgresql
        metadata:
          query: "SELECT COUNT(*) FROM tasks WHERE status = 'pending'"
          targetQueryValue: "5"
          connectionFromEnv: DATABASE_URL
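
With `connectionFromEnv`, KEDA reads the connection string from an environment variable defined on the job's container, so `DATABASE_URL` has to exist there. A minimal sketch, reusing the `environmentVariables` syntax described later in this guide; the Secret `db-secret` and its key `url` are placeholders:

```yaml
pgt-scaledjob:
  environmentVariables:
    - name: DATABASE_URL           # must match connectionFromEnv below
      valueFrom:
        secretKeyRef:
          name: db-secret          # placeholder Secret holding the connection string
          key: url

  scaledjob:
    triggers:
      - type: postgresql
        metadata:
          query: "SELECT COUNT(*) FROM tasks WHERE status = 'pending'"
          targetQueryValue: "5"
          connectionFromEnv: DATABASE_URL
```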

💡 More Triggers
KEDA supports 50+ trigger types. See the KEDA Scalers documentation for the full list and configuration options.

Trigger Authentication

When triggers need to authenticate with external services, use TriggerAuthentication to provide credentials securely.

💡 Creating Cloud Identities
Before configuring trigger authentication, you need to create IAM roles or managed identities. See:
- AWS IAM Roles (IRSA): Configure IAM roles for EKS workloads
- Azure Workload Identity: Configure managed identities for AKS workloads

AWS with Pod Identity (IRSA)

For AWS services using IAM Roles for Service Accounts:

pgt-scaledjob:
  serviceAccount:
    name: my-scaledjob-sa
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/my-scaledjob-role

  triggerAuthentication:
    enabled: true
    podIdentity:
      provider: aws-eks

  scaledjob:
    triggers:
      - type: aws-sqs-queue
        metadata:
          queueURL: "https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue"
          queueLength: "5"
          awsRegion: "eu-west-1"
          identityOwner: pod  # Use pod's identity (IRSA)
        authenticationRef:
          name: my-scaledjob-trigger-auth
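
For reference, the `authenticationRef` above points at a KEDA TriggerAuthentication resource. The sketch below shows what such a resource looks like in the KEDA API; that the chart names it `<name>-trigger-auth` is inferred from the examples in this guide rather than confirmed, and the rendered manifest may contain additional labels or fields.

```yaml
# Illustrative only: shape of the TriggerAuthentication referenced by the trigger.
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: my-scaledjob-trigger-auth  # assumed naming pattern: <name>-trigger-auth
spec:
  podIdentity:
    provider: aws-eks
```

If the name in `authenticationRef` does not match an existing TriggerAuthentication in the same namespace, KEDA cannot authenticate the trigger.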

Azure with Workload Identity

For Azure services using Workload Identity:

pgt-scaledjob:
  serviceAccount:
    name: my-scaledjob-sa
    annotations:
      azure.workload.identity/client-id: "00000000-0000-0000-0000-000000000000"
      azure.workload.identity/tenant-id: "00000000-0000-0000-0000-000000000000"

  pod:
    metadata:
      labels:
        azure.workload.identity/use: "true"

  triggerAuthentication:
    enabled: true
    podIdentity:
      provider: azure-workload
      identityId: "00000000-0000-0000-0000-000000000000"

  scaledjob:
    triggers:
      - type: azure-servicebus
        metadata:
          queueName: my-queue
          namespace: my-servicebus-namespace
          messageCount: "5"
        authenticationRef:
          name: my-scaledjob-trigger-auth

Credentials from Secrets

For services requiring username/password or connection strings:

pgt-scaledjob:
  triggerAuthentication:
    enabled: true
    secretTargetRef:
      - parameter: host
        name: rabbitmq-credentials
        key: connection-string

  scaledjob:
    triggers:
      - type: rabbitmq
        metadata:
          queueName: my-queue
          mode: QueueLength
          value: "5"
        authenticationRef:
          name: my-scaledjob-trigger-auth
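
The `secretTargetRef` entry above expects a Kubernetes Secret named `rabbitmq-credentials` with a `connection-string` key in the same namespace. A minimal sketch of such a Secret follows; the host, user, password and vhost are placeholders, and in practice the Secret would more likely be created through pgt-secrets than committed as a manifest.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: rabbitmq-credentials
type: Opaque
stringData:
  # Placeholder AMQP URL; KEDA passes this value as the scaler's "host" parameter.
  connection-string: "amqp://user:password@rabbitmq.example.com:5672/vhost"
```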

Scaling Configuration

Polling Interval

How often KEDA checks the trigger source for events:

pgt-scaledjob:
  scaledjob:
    pollingInterval: 30  # Check every 30 seconds (default)

Replica Limits

Control the minimum and maximum number of concurrent jobs:

pgt-scaledjob:
  scaledjob:
    minReplicaCount: 0    # Scale to zero when no events (default)
    maxReplicaCount: 100  # Maximum concurrent jobs (default)

Scaling Strategy

Control how KEDA creates jobs based on events:

pgt-scaledjob:
  scaledjob:
    # Options: default, accurate, eager
    scalingStrategy: default

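| Strategy | Behaviour |
| --- | --- |
| default | Creates jobs for the desired scale minus the jobs already running |
| accurate | Creates jobs to match the events waiting in the queue; suited to sources whose reported length excludes messages already being processed |
| eager | Uses all available slots up to maxReplicaCount so that waiting events are picked up as quickly as possible |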
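
To make the difference between the default and accurate strategies concrete, here is a worked example written as an annotated values fragment. The formulas in the comments are taken from the KEDA ScaledJob documentation and should be checked against the KEDA version in your cluster; the numbers are invented for illustration.

```yaml
pgt-scaledjob:
  scaledjob:
    maxReplicaCount: 10
    scalingStrategy: accurate

# Scenario (illustrative numbers):
#   the trigger reports 4 waiting events with a target of 1 event per job,
#   so the desired scale is 4; 3 jobs are already running and 0 are pending.
#
#   default:  desired - running  = 4 - 3 = 1 new job
#   accurate: desired + running  = 4 + 3 = 7, which does not exceed 10,
#             so desired - pending = 4 - 0 = 4 new jobs
#             (if desired + running exceeded maxReplicaCount, KEDA would
#              create maxReplicaCount - running jobs instead)
```

Which strategy fits depends on whether your event source counts messages that are already being processed as part of its queue length.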

Job Configuration

Job Execution Settings

pgt-scaledjob:
  scaledjob:
    parallelism: 1          # Pods running in parallel per job
    completions: 1          # Successful completions required
    backoffLimit: 3         # Retries before marking job as failed
    activeDeadlineSeconds: 600  # Maximum job duration (optional)
    restartPolicy: OnFailure    # OnFailure or Never

Job History

Configure how many completed jobs to retain:

pgt-scaledjob:
  scaledjob:
    successfulJobsHistoryLimit: 3  # Keep last 3 successful jobs
    failedJobsHistoryLimit: 1      # Keep last 1 failed job

Container Configuration

Command and Arguments

Override the container's default entrypoint and arguments:

pgt-scaledjob:
  container:
    image:
      registry: public.ecr.aws
      repository: my-org/my-worker
      tag: "1.0.0"
    command: ["python", "process_queue.py"]
    args: ["--queue", "my-queue", "--verbose"]

Environment Variables

Direct Environment Variables

pgt-scaledjob:
  environmentVariables:
    - name: QUEUE_URL
      value: "https://sqs.eu-west-1.amazonaws.com/123456789012/my-queue"
    - name: DATABASE_URL
      valueFrom:
        secretKeyRef:
          name: db-secret
          key: url

Load from ConfigMap or Secret

pgt-scaledjob:
  environmentVariablesFrom:
    - configMapRef:
        name: worker-config
    - secretRef:
        name: worker-secrets
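
With `configMapRef`, every key in the referenced ConfigMap becomes an environment variable in the job's container. A minimal sketch of a matching ConfigMap; the keys `LOG_LEVEL` and `BATCH_SIZE` are hypothetical examples:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: worker-config
data:
  LOG_LEVEL: info      # exposed to the container as $LOG_LEVEL
  BATCH_SIZE: "10"     # values must be strings, so numbers are quoted
```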

External Secrets

The pgt-scaledjob chart includes pgt-secrets as a subchart for fetching secrets from AWS Secrets Manager or Azure Key Vault. For full configuration options, see the PGT Secrets documentation.

pgt-scaledjob:
  serviceAccount:
    name: my-scaledjob-sa
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/my-scaledjob-role

  pgt-secrets:
    enabled: true
    organisationName: my-org
    serviceAccount:
      create: false
      name: my-scaledjob-sa
    aws:
      enabled: true
      secretRegion: eu-west-1
    items:
      - secretStoreName: scaledjob-store
        kubernetesSecretName: scaledjob-secrets
        data:
          - secretKey: database-url
            remoteRef:
              key: prod/scaledjob/database
              property: url
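
Once the secret has been synced, the job can consume it like any other Kubernetes Secret. The following sketch wires the `database-url` key of `scaledjob-secrets` into an environment variable, mirroring the pattern used in the complete examples of this guide; the variable name `DATABASE_URL` is illustrative.

```yaml
pgt-scaledjob:
  environmentVariables:
    - name: DATABASE_URL
      valueFrom:
        secretKeyRef:
          name: scaledjob-secrets  # kubernetesSecretName from the pgt-secrets item
          key: database-url        # secretKey from the pgt-secrets item
```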

Volume Mounts

Mount ConfigMaps or Secrets as files:

pgt-scaledjob:
  volumes:
    - kubernetesSecretName: tls-certs
      mountPath: /etc/ssl/certs
      readOnly: true
    - kubernetesConfigMapName: worker-config
      mountPath: /etc/config
      readOnly: true

Pod Configuration

Labels and Annotations

pgt-scaledjob:
  pod:
    metadata:
      labels:
        app: my-scaledjob
        environment: production
      annotations:
        logs.example.io/enabled: "true"

Tolerations

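To allow pods to run on tainted nodes, add tolerations. A toleration permits scheduling on matching nodes but does not by itself pin pods to them: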

pgt-scaledjob:
  pod:
    spec:
      tolerations:
        - key: node.playgroundtech.io/os-windows
          operator: Exists
          effect: NoExecute

Prometheus PodMonitor

Enable metrics scraping for ScaledJobs that expose metrics during execution:

pgt-scaledjob:
  podMonitor:
    enabled: true
    path: /metrics
    port: "9090"
    interval: 30s

Complete Examples

AWS SQS Queue Processor

Process messages from an SQS queue with IRSA authentication:

pgt-scaledjob:
  name: sqs-processor
  organisationName: acme-corp

  scaledjob:
    pollingInterval: 15
    maxReplicaCount: 50
    minReplicaCount: 0
    scalingStrategy: accurate
    successfulJobsHistoryLimit: 10
    failedJobsHistoryLimit: 5
    parallelism: 1
    completions: 1
    backoffLimit: 3
    activeDeadlineSeconds: 300
    restartPolicy: OnFailure
    triggers:
      - type: aws-sqs-queue
        metadata:
          queueURL: "https://sqs.eu-west-1.amazonaws.com/123456789012/orders-queue"
          queueLength: "1"
          awsRegion: "eu-west-1"
          identityOwner: pod
        authenticationRef:
          name: sqs-processor-trigger-auth

  triggerAuthentication:
    enabled: true
    podIdentity:
      provider: aws-eks

  container:
    image:
      registry: public.ecr.aws
      repository: acme-corp/order-processor
      tag: "2.1.0"
    command: ["python", "process_order.py"]
    resources:
      limits:
        memory: 1Gi
      requests:
        memory: 1Gi
        cpu: 250m

  serviceAccount:
    name: sqs-processor-sa
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/sqs-processor-role

  environmentVariables:
    - name: QUEUE_URL
      value: "https://sqs.eu-west-1.amazonaws.com/123456789012/orders-queue"
    - name: LOG_LEVEL
      value: info
    - name: DATABASE_URL
      valueFrom:
        secretKeyRef:
          name: sqs-processor-secrets
          key: database-url

  pgt-secrets:
    enabled: true
    organisationName: acme-corp
    serviceAccount:
      create: false
      name: sqs-processor-sa
    aws:
      enabled: true
      secretRegion: eu-west-1
    items:
      - secretStoreName: sqs-processor-store
        kubernetesSecretName: sqs-processor-secrets
        data:
          - secretKey: database-url
            remoteRef:
              key: prod/sqs-processor/database
              property: connection_string

Azure Service Bus Queue Processor

Process messages from an Azure Service Bus queue with Workload Identity:

pgt-scaledjob:
  name: servicebus-processor
  organisationName: acme-corp

  scaledjob:
    pollingInterval: 30
    maxReplicaCount: 20
    minReplicaCount: 0
    successfulJobsHistoryLimit: 5
    failedJobsHistoryLimit: 3
    backoffLimit: 3
    restartPolicy: OnFailure
    triggers:
      - type: azure-servicebus
        metadata:
          queueName: notifications-queue
          namespace: acme-servicebus
          messageCount: "5"
        authenticationRef:
          name: servicebus-processor-trigger-auth

  triggerAuthentication:
    enabled: true
    podIdentity:
      provider: azure-workload
      identityId: "00000000-0000-0000-0000-000000000000"

  container:
    image:
      registry: acmeacr.azurecr.io
      repository: notification-processor
      tag: "1.5.0"
    resources:
      limits:
        memory: 512Mi
      requests:
        memory: 512Mi
        cpu: 100m

  serviceAccount:
    name: servicebus-processor-sa
    annotations:
      azure.workload.identity/client-id: "00000000-0000-0000-0000-000000000000"
      azure.workload.identity/tenant-id: "00000000-0000-0000-0000-000000000000"

  environmentVariables:
    - name: SERVICEBUS_NAMESPACE
      value: "acme-servicebus"
    - name: QUEUE_NAME
      value: "notifications-queue"

  pgt-secrets:
    enabled: true
    organisationName: acme-corp
    serviceAccount:
      create: false
      name: servicebus-processor-sa
    azure:
      enabled: true
      managedIdentity:
        useWorkloadIdentity: true
    items:
      - secretStoreName: servicebus-store
        kubernetesSecretName: servicebus-secrets
        azure:
          vaultUrl: https://acme-keyvault.vault.azure.net/
        data:
          - secretKey: smtp-password
            remoteRef:
              key: smtp-password

RabbitMQ Queue Processor

Process messages from a RabbitMQ queue:

pgt-scaledjob:
  name: rabbitmq-processor
  organisationName: acme-corp

  scaledjob:
    pollingInterval: 10
    maxReplicaCount: 30
    scalingStrategy: accurate
    backoffLimit: 5
    restartPolicy: OnFailure
    triggers:
      - type: rabbitmq
        metadata:
          queueName: tasks-queue
          mode: QueueLength
          value: "1"
        authenticationRef:
          name: rabbitmq-processor-trigger-auth

  triggerAuthentication:
    enabled: true
    secretTargetRef:
      - parameter: host
        name: rabbitmq-credentials
        key: connection-string

  container:
    image:
      registry: public.ecr.aws
      repository: acme-corp/task-processor
      tag: "3.0.0"
    resources:
      limits:
        memory: 256Mi
      requests:
        memory: 256Mi
        cpu: 100m

  serviceAccount:
    name: rabbitmq-processor-sa
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/rabbitmq-processor-role

  environmentVariables:
    - name: RABBITMQ_URL
      valueFrom:
        secretKeyRef:
          name: rabbitmq-credentials
          key: connection-string

  pgt-secrets:
    enabled: true
    organisationName: acme-corp
    serviceAccount:
      create: false
      name: rabbitmq-processor-sa
    aws:
      enabled: true
      secretRegion: eu-west-1
    items:
      - secretStoreName: rabbitmq-store
        kubernetesSecretName: rabbitmq-credentials
        data:
          - secretKey: connection-string
            remoteRef:
              key: prod/rabbitmq/credentials
              property: connection_string

Troubleshooting

Use Argo CD to investigate issues with ScaledJobs.

Viewing ScaledJob Status

1. Navigate to your application in the Argo CD UI
2. Locate the ScaledJob resource in the application tree
3. Click on the ScaledJob to view its details, including:
   - Current replica count
   - Trigger status
   - Last scale time

Viewing Job Executions

1. In the Argo CD application tree, look for Job resources created by the ScaledJob
2. Click on a Job to see its status and completion time
3. Expand the Job to see its Pods

Checking Pod Logs

1. In the application tree, find the Pod created by a Job
2. Click on the Pod resource
3. Select the Logs tab to view container output

Checking TriggerAuthentication

1. Locate the TriggerAuthentication resource in the application tree
2. Verify the authentication configuration matches your trigger requirements

Common Issues

Jobs not being created:

- Verify KEDA is installed and running in the cluster
- Check if the trigger source has events/messages
- Verify TriggerAuthentication credentials are correct
- Check KEDA operator logs for trigger errors

Jobs failing repeatedly:

- Check Pod logs for error messages
- Verify secrets and ConfigMaps are correctly configured
- Ensure the ServiceAccount has required permissions
- Check if backoffLimit is too low

Authentication errors:

- Verify ServiceAccount annotations for IRSA/Workload Identity
- Check TriggerAuthentication references match the trigger configuration
- Ensure IAM roles/managed identities have correct permissions

Scaling issues:

- Check maxReplicaCount isn't limiting scaling
- Verify pollingInterval is appropriate for your use case
- Review scalingStrategy for your workload pattern

Values Reference

| Value | Type | Default | Description |
| --- | --- | --- | --- |
| name | string | nil | Required. ScaledJob name |
| organisationName | string | nil | Required. Organisation name |
| scaledjob.pollingInterval | int | 30 | How often KEDA checks triggers (seconds) |
| scaledjob.successfulJobsHistoryLimit | int | 3 | Successful jobs to retain |
| scaledjob.failedJobsHistoryLimit | int | 1 | Failed jobs to retain |
| scaledjob.maxReplicaCount | int | 100 | Maximum concurrent jobs |
| scaledjob.minReplicaCount | int | 0 | Minimum jobs to maintain |
| scaledjob.scalingStrategy | string | default | default, accurate, or eager |
| scaledjob.parallelism | int | 1 | Parallel pods per job |
| scaledjob.completions | int | 1 | Required successful completions |
| scaledjob.backoffLimit | int | 3 | Retries before job failure |
| scaledjob.activeDeadlineSeconds | int | nil | Maximum job duration |
| scaledjob.restartPolicy | string | OnFailure | OnFailure or Never |
| scaledjob.triggers | list | [] | Required. KEDA trigger configurations |
| affinity.nodeAffinity.preferSpotInstances | bool | false | Prefer scheduling on spot instances |
| container.image.registry | string | nil | Required. Container registry |
| container.image.repository | string | nil | Required. Image repository |
| container.image.tag | string | nil | Required. Image tag |
| container.command | list | [] | Container entrypoint override |
| container.args | list | [] | Container arguments |
| serviceAccount.name | string | nil | Required. ServiceAccount name |
| serviceAccount.annotations | object | {} | ServiceAccount annotations |
| triggerAuthentication.enabled | bool | false | Enable TriggerAuthentication |
| triggerAuthentication.secretTargetRef | list | [] | Secrets for authentication |
| triggerAuthentication.podIdentity | object | {} | Pod identity configuration |
| environmentVariables | list | [] | Environment variables |
| environmentVariablesFrom | list | [] | Load env vars from ConfigMap/Secret |
| volumes | list | [] | Volume mounts |
| podMonitor.enabled | bool | false | Enable PodMonitor |
| pgt-secrets.enabled | bool | false | Enable external secrets |
