The workload consists of large batch-processing simulation jobs that read 15–20 GB per job from Amazon S3, write results back to S3, are not time sensitive, and can tolerate interruptions. These characteristics are ideal for using EC2 Spot Instances, which provide steep discounts compared to On-Demand pricing in exchange for potential interruptions.
AWS Batch is designed specifically for batch and simulation workloads. It can dynamically provision compute resources, manage job queues, retry failed jobs, and integrate tightly with EC2 Spot Instances. An AWS Batch compute environment configured for EC2 Spot Instances with the SPOT_CAPACITY_OPTIMIZED allocation strategy lets AWS Batch automatically draw from the Spot capacity pools with the most available capacity, which are the pools least likely to be interrupted, while still paying Spot prices.
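As a rough illustration of the compute environment described above, the following sketch builds the request parameters that boto3's `create_compute_environment` call accepts. The helper name, the environment name, the subnet and security group IDs, the instance role, and the vCPU ceiling are all hypothetical placeholders, not values taken from the question.

```python
def spot_compute_environment(name, subnets, security_group_ids,
                             instance_role, max_vcpus=256):
    """Build keyword arguments for a Spot-only managed compute environment.

    All argument values are supplied by the caller; nothing here is
    specific to the scenario in the question.
    """
    return {
        "computeEnvironmentName": name,
        "type": "MANAGED",   # AWS Batch provisions and scales the instances
        "state": "ENABLED",
        "computeResources": {
            "type": "SPOT",  # Spot Instances only, no On-Demand capacity
            "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
            "minvCpus": 0,   # scale to zero when the queue is empty
            "maxvCpus": max_vcpus,
            "instanceTypes": ["optimal"],  # let Batch choose among instance families
            "subnets": list(subnets),
            "securityGroupIds": list(security_group_ids),
            "instanceRole": instance_role,
        },
    }


# Hypothetical identifiers, for illustration only.
params = spot_compute_environment(
    name="simulation-spot-env",
    subnets=["subnet-aaaa1111", "subnet-bbbb2222"],
    security_group_ids=["sg-cccc3333"],
    instance_role="ecsInstanceRole",
)

# With AWS credentials configured, the dictionary could be passed on as:
#   import boto3
#   boto3.client("batch").create_compute_environment(**params)
```

Listing several subnets in different Availability Zones and allowing a broad set of instance types gives the allocation strategy more capacity pools to choose from.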
Option B leverages these capabilities. It creates an AWS Batch compute environment that uses only Spot Instances and applies the SPOT_CAPACITY_OPTIMIZED allocation strategy to select capacity across instance types with lower interruption rates. Because the simulations are not time sensitive and can withstand interruptions, the risk of Spot interruptions is acceptable, and AWS Batch can automatically retry interrupted jobs when the job definition includes a retry strategy. This combination provides the most cost-effective solution while minimizing the operational overhead of job scheduling and capacity management.
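The retry behavior mentioned above is configured on the job definition rather than applied implicitly. A minimal sketch of a retry strategy that retries only when the host instance was reclaimed might look like the following. The helper name and the attempt count are illustrative, and the `"Host EC2*"` pattern assumes the status reason that AWS Batch typically reports when a Spot Instance is terminated.

```python
def spot_retry_strategy(attempts=5):
    """Build a retryStrategy block for an AWS Batch job definition.

    Retries jobs whose host instance was terminated (for example by a
    Spot reclamation) and exits immediately on any other failure.
    """
    if not 1 <= attempts <= 10:
        # AWS Batch accepts between 1 and 10 attempts per job.
        raise ValueError("attempts must be between 1 and 10")
    return {
        "attempts": attempts,
        "evaluateOnExit": [
            # Assumed status reason prefix for a terminated host instance.
            {"onStatusReason": "Host EC2*", "action": "RETRY"},
            # Any other failure is treated as a real error and not retried.
            {"onReason": "*", "action": "EXIT"},
        ],
    }


retry = spot_retry_strategy(attempts=5)
```

The resulting dictionary would be supplied as the `retryStrategy` argument when registering the job definition. Because each retry restarts the job, the simulation should either be safe to rerun from the beginning or checkpoint its progress to S3.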
Option A uses Lambda with Step Functions. Lambda caps each invocation at 15 minutes of runtime and 10 GB of memory, and also limits ephemeral storage and payload size, making it a poor fit for processing 15–20 GB of data per job. In addition, running large-scale, long-running simulations as Lambda invocations would cost more than running them on EC2 Spot Instances.
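To make the size mismatch concrete, the sketch below compares a job's input size with Lambda's per-invocation memory and ephemeral storage ceilings. The constants reflect the published Lambda quotas as understood at the time of writing and should be treated as assumptions to verify; the function name is hypothetical.

```python
# Assumed Lambda per-invocation quotas; verify against current AWS documentation.
LAMBDA_MAX_TIMEOUT_SECONDS = 900         # 15 minutes
LAMBDA_MAX_MEMORY_MB = 10_240            # 10 GB
LAMBDA_MAX_EPHEMERAL_STORAGE_MB = 10_240  # 10 GB of /tmp


def fits_in_lambda(input_gb):
    """Return True if the whole input could be held in memory or in /tmp.

    This ignores streaming the object in chunks, which avoids the size
    limits but still has to finish within the 15-minute timeout.
    """
    input_mb = input_gb * 1024
    return (input_mb <= LAMBDA_MAX_MEMORY_MB
            or input_mb <= LAMBDA_MAX_EPHEMERAL_STORAGE_MB)


for size_gb in (15, 20):
    print(f"{size_gb} GB input fits in a single invocation: {fits_in_lambda(size_gb)}")
```

Both ends of the 15–20 GB range exceed the limits, so each job would have to be split or streamed, which adds complexity that a container running on an EC2 instance does not need.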
Option C mixes On-Demand and Spot Instances. While this can provide reliability for more time-critical workloads, it is less cost-effective than a pure Spot configuration when the workload explicitly tolerates interruptions and is not time sensitive.
Option D uses Amazon EKS with a mix of On-Demand and Spot Instances. While this can be made to work, it requires managing Kubernetes clusters, node groups, and job scheduling, which introduces more operational complexity than using AWS Batch, and the inclusion of On-Demand capacity reduces cost-effectiveness compared to pure Spot-based batch processing.
Therefore, using AWS Batch with EC2 Spot Instances and the SPOT_CAPACITY_OPTIMIZED allocation strategy (option B) is the most cost-effective solution for this interruption-tolerant batch-processing workload.
[References: AWS documentation on AWS Batch for running batch and simulation jobs. AWS guidance on using EC2 Spot Instances and the SPOT_CAPACITY_OPTIMIZED allocation strategy for cost-effective, interruption-tolerant workloads.]