The current architecture uses a fleet of small EC2 instances that poll SQS for URLs and write results to EFS. Because URLs arrive only occasionally and each crawl takes 10 seconds or less, the instances spend most of their time idle waiting for messages, while the fleet and the EFS file system continue to accrue compute and storage charges.
This workload is naturally event-driven: each URL in the SQS queue triggers a short-lived crawl. AWS Lambda is well suited to such event-driven, short-duration tasks. Lambda integrates natively with Amazon SQS as an event source: Lambda polls the queue on your behalf and invokes the function with batches of messages as they arrive, scaling concurrency up and down with queue depth. When there are no messages, there are no invocations and no compute charges, which eliminates the cost of idle EC2 instances.
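A minimal sketch of what the crawler function could look like when driven by the SQS event source. The event shape ("Records", "body") is what Lambda passes for SQS triggers; the assumption that each message body is a bare URL, and the simple urllib fetch, are illustrative placeholders for the real crawling logic.

```python
import urllib.request

def lambda_handler(event, context):
    # The SQS event source delivers a batch of messages under the "Records" key.
    for record in event["Records"]:
        url = record["body"]  # assumption: each message body is a single URL to crawl
        with urllib.request.urlopen(url, timeout=10) as response:
            page = response.read()
        # Persisting the result to S3 is sketched separately below; here we only
        # report what was fetched.
        print(f"Crawled {url}: {len(page)} bytes")
    # Returning normally marks the batch as processed, so Lambda deletes the
    # messages from the queue.
```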
For storing crawl results, Amazon S3 is a highly durable, low-cost object store. The EFS file system mounted on every instance becomes unnecessary once each Lambda invocation writes its output directly to S3 as objects (for example, CSV files). S3 charges only for the storage consumed and the requests made, with no continuously running file system infrastructure to maintain.
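A sketch of writing one crawl's output to S3 as a CSV object with boto3. The bucket name and the URL-based key scheme are placeholders chosen for illustration; any naming convention works.

```python
import csv
import io
import boto3

s3 = boto3.client("s3")

def store_result(url, rows, bucket="crawl-results-bucket"):  # bucket name is a placeholder
    """Serialize the crawl rows as CSV and store them as a single S3 object."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    # Derive an object key from the URL; this is just one possible scheme.
    key = f"results/{url.replace('://', '_').replace('/', '_')}.csv"
    s3.put_object(Bucket=bucket, Key=key, Body=buffer.getvalue().encode("utf-8"))
    return key
```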
Option B replaces the idle fleet of EC2 instances with a Lambda function that reads from SQS. This aligns compute usage with actual work and takes advantage of serverless scaling and pricing.
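Wiring the queue to the function is a single event source mapping. The sketch below uses boto3; the queue ARN, function name, and batch size are placeholder values.

```python
import boto3

lambda_client = boto3.client("lambda")

# Connect the existing SQS queue to the crawler function. Lambda manages the
# pollers and scales invocations with queue depth; no instances to keep warm.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:crawl-queue",  # placeholder ARN
    FunctionName="web-crawler",  # placeholder function name
    BatchSize=10,  # up to 10 messages per invocation
)
```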
Option E modifies the web-crawling process to store results in Amazon S3 instead of EFS, removing the need to maintain an EFS file system and its associated costs.
Option A increases the instance size to m5.8xlarge, which adds far more capacity and cost than this light workload needs. Cutting the number of instances by 50% does not offset the much higher per-instance price, and the remaining fleet would still sit idle between messages, so overall cost would likely increase.
Option C uses Amazon Neptune, which is a managed graph database service. It is not needed for storing simple CSV output; it is more complex and more expensive than S3 for this use case.
Option D uses Aurora Serverless MySQL. While Aurora Serverless can automatically scale, using a relational database to store simple crawl outputs adds unnecessary cost and operational complexity compared to S3 objects.
Therefore, moving the crawling logic to Lambda triggered by SQS (option B) and writing results directly to Amazon S3 (option E) meets the requirements in the most cost-effective way.