Spring Boot with Apache Spark for Scalable Data Processing
Combine Spring Boot and Apache Spark to build scalable and maintainable big data processing pipelines
As data volumes grow, applications need scalable data processing capabilities. While Spring Boot excels at building RESTful services and microservices, Apache Spark is a powerful engine for distributed computation across big data sets.
Combining the two allows you to build maintainable, cloud-ready data pipelines that are easy to monitor, secure, and deploy. In this post, we explore how to integrate Spring Boot with Apache Spark to build batch and streaming data processors in Java.
Why Integrate Spring Boot with Spark?
Apache Spark provides:
- Distributed processing for massive datasets
- In-memory computation for high performance
- APIs for batch, streaming, and machine learning
- Support for multiple data sources (CSV, Parquet, Kafka, JDBC, etc.)
Spring Boot complements Spark by:
- Managing configuration and dependency injection
- Exposing job controls via REST APIs
- Facilitating security, monitoring, and deployment
Project Setup
Add the following dependencies to your pom.xml:
<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.5.0</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.5.0</version>
        <scope>provided</scope>
    </dependency>
</dependencies>
Mark the Spark dependencies as provided if you’re deploying to a Spark cluster, since the cluster supplies those libraries at runtime.
Initializing SparkSession with Spring Boot
Create a @Configuration class that exposes a SparkSession bean:
@Configuration
public class SparkConfig {

    @Bean
    public SparkSession sparkSession() {
        return SparkSession.builder()
                .appName("SpringBootSparkApp")
                .master("local[*]") // or use yarn/mesos/k8s in production
                .getOrCreate();
    }
}
This ensures Spark is initialized once and can be injected anywhere.
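Tuning these settings per environment is easier when they live in application.properties instead of code. Here is a variation of the configuration above as a minimal sketch; the spark.app-name and spark.master property names are illustrative assumptions, not standard keys:

@Configuration
public class SparkConfig {

    // Hypothetical property names; define them in application.properties, e.g.
    // spark.app-name=SpringBootSparkApp and spark.master=local[*]
    @Value("${spark.app-name:SpringBootSparkApp}")
    private String appName;

    @Value("${spark.master:local[*]}")
    private String master;

    @Bean
    public SparkSession sparkSession() {
        return SparkSession.builder()
                .appName(appName)
                .master(master)
                .getOrCreate();
    }
}

The defaults after the colon keep local development working even when the properties are not set.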
Creating a Spark Job Service
Define a service class that uses Spark for data transformation:
@Service
public class SparkJobService {

    private final SparkSession sparkSession;

    public SparkJobService(SparkSession sparkSession) {
        this.sparkSession = sparkSession;
    }

    public Dataset<Row> processCsv(String filePath) {
        Dataset<Row> input = sparkSession.read()
                .option("header", true)
                .csv(filePath);

        return input.groupBy("category")
                .agg(functions.avg("price").alias("avg_price"));
    }
}
Exposing Jobs via REST API
Use Spring Boot’s @RestController to expose Spark jobs as HTTP endpoints:
@RestController
@RequestMapping("/api/jobs")
public class JobController {

    private final SparkJobService sparkJobService;

    public JobController(SparkJobService sparkJobService) {
        this.sparkJobService = sparkJobService;
    }

    @GetMapping("/summary")
    public ResponseEntity<String> runJob(@RequestParam String file) {
        Dataset<Row> result = sparkJobService.processCsv(file);
        result.show(); // for testing
        return ResponseEntity.ok("Job completed");
    }
}
This allows remote triggering of Spark jobs from web clients or schedulers.
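If callers need the aggregated rows rather than a status string, one option is to serialize the result and return it from the same controller. A minimal sketch; the /summary-data path and method name are illustrative, and this assumes small result sets:

@GetMapping("/summary-data")
public ResponseEntity<List<String>> runJobWithResults(@RequestParam String file) {
    Dataset<Row> result = sparkJobService.processCsv(file);
    // toJSON() renders each Row as a JSON string; collectAsList() pulls all
    // rows to the driver, so keep the result set small
    List<String> rows = result.toJSON().collectAsList();
    return ResponseEntity.ok(rows);
}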
Streaming Data with Spark Structured Streaming
You can also process streaming data with Spark. Note that the Kafka source used below additionally requires the spark-sql-kafka-0-10_2.12 connector dependency:
public void processStream(String kafkaServer) throws TimeoutException, StreamingQueryException {
    Dataset<Row> stream = sparkSession.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", kafkaServer)
            .option("subscribe", "events")
            .load();

    Dataset<Row> messages = stream.selectExpr("CAST(value AS STRING)");

    messages.writeStream()
            .format("console")
            .start()
            .awaitTermination();
}
This enables ingestion from Kafka, with transformation and output to console, storage, or databases.
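For durable output, you would typically swap the console sink for a file or table sink plus a checkpoint location. A minimal sketch that could replace the writeStream call above; both paths are placeholders:

// Write each micro-batch to Parquet files, tracking progress in a checkpoint
// directory so the stream can recover after a restart
messages.writeStream()
        .format("parquet")
        .option("path", "/data/output/events")
        .option("checkpointLocation", "/data/checkpoints/events")
        .outputMode("append")
        .start();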
Packaging and Running Your Application
To build a runnable JAR:
mvn clean package
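This produces a runnable JAR when the Spring Boot Maven plugin is part of the build (it is included by default in projects generated from start.spring.io); a minimal declaration for reference:

<build>
    <plugins>
        <plugin>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-maven-plugin</artifactId>
        </plugin>
    </plugins>
</build>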
Submit your Spring Boot JAR to Spark:
spark-submit \
  --class org.example.YourApp \
  --master local[*] \
  target/spring-boot-spark-app.jar
Or deploy your app as a microservice and use it to trigger distributed Spark jobs running on a cluster.
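For example, if you externalized the master URL as in the earlier configuration sketch, the same JAR can run as an ordinary Spring Boot service and be pointed at a cluster by overriding that property on the command line (the cluster URL below is a placeholder):

java -jar target/spring-boot-spark-app.jar --spark.master=spark://spark-master:7077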
Best Practices
- Use SparkSession as a singleton to avoid reinitialization
- Tune Spark configurations via application properties or SparkConf
- Validate and sanitize all input file paths to prevent injection (see the sketch after this list)
- Log and monitor Spark job metrics using Prometheus or Spark UI
- Separate job orchestration from core logic for modularity
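One way to handle the path-validation point is to constrain incoming file paths to a known base directory before handing them to Spark. A minimal sketch that could be added to SparkJobService, using java.nio.file; the base directory is an illustrative assumption:

// Resolve the user-supplied path against an allowed base directory and reject
// anything that escapes it (e.g. via "../"); BASE_DIR is an example location
private static final Path BASE_DIR = Paths.get("/data/input").toAbsolutePath().normalize();

public Dataset<Row> processCsvSafely(String filePath) {
    Path resolved = BASE_DIR.resolve(filePath).normalize();
    if (!resolved.startsWith(BASE_DIR)) {
        throw new IllegalArgumentException("File path is outside the allowed directory");
    }
    return processCsv(resolved.toString());
}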
Conclusion
By combining Spring Boot’s flexibility with Apache Spark’s scalability, you can build powerful data processing pipelines that are robust, testable, and production-ready. Whether you’re building an ETL pipeline, a batch job runner, or a streaming analytics engine, this integration pattern brings the best of both ecosystems together.
Start small with local development and scale up to cluster-based processing as your workload grows.