As the volume of data grows, applications need to be equipped with scalable data processing capabilities. While Spring Boot excels at building RESTful services and microservices, Apache Spark is a powerful engine for distributed computation across big data sets.

Combining the two allows you to build maintainable, cloud-ready data pipelines that are easy to monitor, secure, and deploy. In this post, we explore how to integrate Spring Boot with Apache Spark to build batch and streaming data processors in Java.


Why Integrate Spring Boot with Spark?

Apache Spark provides:

  • Distributed processing for massive datasets
  • In-memory computation for high performance
  • APIs for batch, streaming, and machine learning
  • Support for multiple data sources (CSV, Parquet, Kafka, JDBC, etc.)

Spring Boot complements Spark by:

  • Managing configuration and dependency injection
  • Exposing job controls via REST APIs
  • Facilitating security, monitoring, and deployment

Project Setup

Add the following dependencies to your pom.xml:

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.5.0</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.5.0</version>
        <scope>provided</scope>
    </dependency>
</dependencies>

Set the Spark dependencies to provided scope if you are deploying to a Spark cluster that already supplies Spark on its classpath; for local or embedded runs, keep the default compile scope so the classes are packaged with your application.
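One hedged way to manage this is a Maven profile that flips the scope per environment. The property name spark.scope and the profile id local below are arbitrary choices, and each Spark dependency above would declare <scope>${spark.scope}</scope>:

<properties>
    <spark.scope>provided</spark.scope>
</properties>

<profiles>
    <!-- Activate for local runs with: mvn clean package -Plocal -->
    <profile>
        <id>local</id>
        <properties>
            <spark.scope>compile</spark.scope>
        </properties>
    </profile>
</profiles>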


Initializing SparkSession with Spring Boot

Create a @Configuration class to expose a SparkSession bean:

import org.apache.spark.sql.SparkSession;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class SparkConfig {

    @Bean
    public SparkSession sparkSession() {
        return SparkSession.builder()
            .appName("SpringBootSparkApp")
            .master("local[*]") // use yarn, kubernetes, or a spark:// standalone URL in production
            .getOrCreate();
    }
}

This ensures Spark is initialized once and can be injected anywhere.


Creating a Spark Job Service

Define a service class that uses Spark for data transformation:

@Service
public class SparkJobService {

    private final SparkSession sparkSession;

    public SparkJobService(SparkSession sparkSession) {
        this.sparkSession = sparkSession;
    }

    public Dataset<Row> processCsv(String filePath) {
        Dataset<Row> input = sparkSession.read()
            .option("header", true)
            .csv(filePath);

        return input.groupBy("category")
            .agg(functions.avg("price").alias("avg_price"));
    }
}

Exposing Jobs via REST API

Use Spring Boot’s @RestController to expose Spark jobs as HTTP endpoints:

@RestController
@RequestMapping("/api/jobs")
public class JobController {

    private final SparkJobService sparkJobService;

    public JobController(SparkJobService sparkJobService) {
        this.sparkJobService = sparkJobService;
    }

    @GetMapping("/summary")
    public ResponseEntity<String> runJob(@RequestParam String file) {
        Dataset<Row> result = sparkJobService.processCsv(file);
        result.show(); // for testing
        return ResponseEntity.ok("Job completed");
    }
}

This allows remote triggering of Spark jobs from web clients or schedulers.
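For example, a web client or scheduler could trigger the job with a simple HTTP call (the host, port, and CSV path below are illustrative):

curl "http://localhost:8080/api/jobs/summary?file=/data/products.csv"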


Streaming Data with Spark Structured Streaming

You can also process streaming data with Spark:

public void processStream(String kafkaServer) throws TimeoutException, StreamingQueryException {
    Dataset<Row> stream = sparkSession.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", kafkaServer)
        .option("subscribe", "events")
        .load();

    // Kafka delivers the payload as bytes; cast it to a string for processing
    Dataset<Row> messages = stream.selectExpr("CAST(value AS STRING)");

    messages.writeStream()
        .format("console")
        .start()
        .awaitTermination();
}

This enables ingestion from Kafka, with transformation and output to the console, storage, or databases. Note that the Kafka source requires the spark-sql-kafka-0-10 connector (matching your Spark and Scala versions) on the classpath.
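As a minimal sketch of a different sink (the method name, output path, and checkpoint location are illustrative, not part of the example above), the same messages could be persisted as Parquet files instead of printed to the console:

public void writeToParquet(Dataset<Row> messages) throws TimeoutException {
    messages.writeStream()
        .format("parquet")
        .option("path", "/data/output/events")                     // destination directory (placeholder)
        .option("checkpointLocation", "/data/checkpoints/events")  // required for fault-tolerant streaming
        .outputMode("append")                                      // file sinks support append mode
        .start();
}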


Packaging and Running Your Application

To build a runnable JAR:

mvn clean package

Submit the JAR to Spark. Note that spark-submit expects a flat classpath, so the Spring Boot repackaged JAR (which nests dependencies under BOOT-INF/lib) may not work directly; building a shaded JAR with the Maven Shade Plugin is a common workaround:

spark-submit \
  --class org.example.YourApp \
  --master local[*] \
  target/spring-boot-spark-app.jar

Or deploy your app as a microservice and use it to trigger distributed Spark jobs running on a cluster.


Best Practices

  • Use SparkSession as a singleton to avoid reinitialization
  • Tune Spark configurations via application properties or SparkConf (see the sketch after this list)
  • Validate and sanitize all input file paths to prevent injection
  • Log and monitor Spark job metrics using Prometheus or Spark UI
  • Separate job orchestration from core logic for modularity
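
As a minimal sketch of the configuration point above (the property names spark.app-name, spark.master, and spark.shuffle-partitions are assumptions, not a standard Spring namespace), the SparkConfig class from earlier could read its settings from application.properties instead of hard-coding them:

import org.apache.spark.sql.SparkSession;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class SparkConfig {

    // Hypothetical property names; define them in application.properties or the environment
    @Value("${spark.app-name:SpringBootSparkApp}")
    private String appName;

    @Value("${spark.master:local[*]}")
    private String master;

    @Value("${spark.shuffle-partitions:200}")
    private String shufflePartitions;

    @Bean
    public SparkSession sparkSession() {
        return SparkSession.builder()
            .appName(appName)
            .master(master) // e.g. a spark:// or k8s:// URL in production
            .config("spark.sql.shuffle.partitions", shufflePartitions)
            .getOrCreate();
    }
}

With this in place, the same build can run locally with sensible defaults and point at a cluster in production purely through configuration.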

Conclusion

By combining Spring Boot’s flexibility with Apache Spark’s scalability, you can build powerful data processing pipelines that are robust, testable, and production-ready. Whether you’re building an ETL pipeline, a batch job runner, or a streaming analytics engine, this integration pattern brings the best of both ecosystems together.

Start small with local development and scale up to cluster-based processing as your workload grows.