SpringBatch-ETL: High-Performance ETL Pipeline

Optimizing data flow between Java and the transformation engine

The Challenge

When migrating legacy e-commerce platforms to modern systems, data volume quickly becomes a bottleneck. SpringBatch-ETL had to migrate 73 million rows at ~1.8M rows/min while managing a specific technical constraint: PHP data deserialization.

This data, stored in the database in a PHP-specific serialized format, had to be converted into usable JSON structures by Java mid-migration, without collapsing the system’s overall performance.

Technical Choices

Why Java 21 & Spring Batch?

Java 21: Chosen for its modern concurrency support and strong static typing, both essential for complex transformations.
Spring Batch: Provides native “chunk-based” orchestration. It secures the data flow with robust transaction management, allowing the process to resume exactly where it left off in case of an interruption.
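
The chunk-and-resume idea can be illustrated without the framework. The sketch below is plain Java, not Spring Batch's API: ChunkSketch, Checkpoint and run are hypothetical names, and the in-memory checkpoint stands in for the persisted job state a real batch framework would keep in a database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Illustrative only: mimics chunk-oriented processing with a restart checkpoint. */
final class ChunkSketch {

    /** Stands in for persisted job state (assumption: kept in memory here). */
    static final class Checkpoint {
        int nextIndex = 0;
    }

    /**
     * Processes items in chunks of chunkSize, starting from the checkpoint.
     * The checkpoint advances only after a chunk has been written, so a
     * failure mid-chunk leaves it pointing at the start of that chunk.
     */
    static void run(List<String> items, int chunkSize, Checkpoint cp,
                    Consumer<List<String>> writer) {
        while (cp.nextIndex < items.size()) {
            int end = Math.min(cp.nextIndex + chunkSize, items.size());
            List<String> chunk = new ArrayList<>(items.subList(cp.nextIndex, end));
            writer.accept(chunk); // may throw: nothing is recorded for this chunk
            cp.nextIndex = end;   // "commit": record progress after a successful write
        }
    }
}
```

Calling run again with the same Checkpoint after a failure restarts at the first uncommitted chunk, which is the behavior described above.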

Moving from HTTP to TCP

At the project’s start, I considered creating a standard PHP API to delegate the deserialization. However, processing speed immediately became a major hurdle. HTTP, while ubiquitous, adds per-request overhead (headers to build and parse, plus connection handshakes whenever connections are not reused) that is poorly suited to millions of single-row requests issued on the fly.

I switched to a Swoole daemon communicating directly via the TCP protocol. By stripping away unnecessary application layers, I achieved a high-performance deserialization tool capable of handling massive throughput when coupled with Spring Batch’s parallelism.
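
As a rough illustration of what the Java side of such a link can look like, here is a minimal sketch of a client that keeps one socket open and exchanges length-prefixed frames. The wire format (a 4-byte big-endian length followed by a UTF-8 body) is an assumption made for the example, not the project's documented protocol, and FramedTcpClient is a hypothetical name.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Illustrative client: one long-lived socket, length-prefixed frames (assumed protocol). */
final class FramedTcpClient implements AutoCloseable {
    private final Socket socket;
    private final DataInputStream in;
    private final DataOutputStream out;

    FramedTcpClient(String host, int port) throws IOException {
        this.socket = new Socket(host, port);
        this.socket.setTcpNoDelay(true); // small frames: do not wait to coalesce them
        this.in = new DataInputStream(new BufferedInputStream(socket.getInputStream()));
        this.out = new DataOutputStream(new BufferedOutputStream(socket.getOutputStream()));
    }

    /** Writes a 4-byte big-endian length, then the UTF-8 body, and flushes. */
    static void writeFrame(DataOutputStream out, String payload) throws IOException {
        byte[] body = payload.getBytes(StandardCharsets.UTF_8);
        out.writeInt(body.length);
        out.write(body);
        out.flush();
    }

    /** Reads one frame written in the same format. */
    static String readFrame(DataInputStream in) throws IOException {
        byte[] body = new byte[in.readInt()];
        in.readFully(body);
        return new String(body, StandardCharsets.UTF_8);
    }

    /** Sends one PHP-serialized value and returns the daemon's reply (expected to be JSON). */
    synchronized String convert(String phpSerialized) throws IOException {
        writeFrame(out, phpSerialized);
        return readFrame(in);
    }

    @Override
    public void close() throws IOException {
        socket.close();
    }
}
```

The connection is opened once and reused for every row, so the per-request cost is reduced to the frame header and the payload itself.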

The Problem: The Overhead of CLI Scripts

Using isolated PHP scripts launched via command line for every row would have turned a 2-hour migration into a 20-hour ordeal, primarily due to the overhead of booting the PHP interpreter for each call.

The Solution: Swoole Micro-service & Parallelism

I designed a hybrid architecture where the Java engine drives the data logic while delegating the raw transformation:

Swoole Daemon: A PHP container runs in the background and keeps a TCP connection open, so the PHP interpreter is booted once rather than once per row.
Composite Processor: On the Java side, the pipeline uses a processor chain. The first processor sends the raw data to the daemon via TCP and retrieves the JSON; subsequent processors handle column mapping and locale conversions.
Multi-threading: The system processes multiple tables simultaneously through a thread pool (ThreadPoolTaskExecutor), fully exploiting available CPU capacity.
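
The processor chain and the per-table parallelism can be sketched in plain Java. This is not the project's code: PipelineSketch, compose and processTables are hypothetical names, java.util.concurrent's fixed thread pool stands in for Spring's ThreadPoolTaskExecutor, and rows are simplified to strings.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

final class PipelineSketch {

    /** Chains processors left to right, in the spirit of a composite processor. */
    @SafeVarargs
    static <T> UnaryOperator<T> compose(UnaryOperator<T>... steps) {
        return input -> {
            T current = input;
            for (UnaryOperator<T> step : steps) {
                current = step.apply(current);
            }
            return current;
        };
    }

    /** Runs the same chain over several tables at once, one task per table. */
    static Map<String, List<String>> processTables(Map<String, List<String>> tables,
                                                   UnaryOperator<String> chain,
                                                   int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<String, Future<List<String>>> pending = new LinkedHashMap<>();
            for (Map.Entry<String, List<String>> table : tables.entrySet()) {
                Callable<List<String>> task =
                        () -> table.getValue().stream().map(chain).collect(Collectors.toList());
                pending.put(table.getKey(), pool.submit(task));
            }
            Map<String, List<String>> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<List<String>>> entry : pending.entrySet()) {
                results.put(entry.getKey(), entry.getValue().get()); // waits for that table
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

In the real pipeline the first step of the chain would be the TCP call to the daemon and the later steps the column mapping and locale conversions; here any UnaryOperator can play those roles.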

Technical Deep Dive: Dynamic Configuration

To avoid coding a specific Job for every table, I made the engine entirely generic via the application.yml file:

Declarative Mapping: Tables and column renames (e.g., owner_id -> product_id) are simply listed so the Job can be built dynamically at startup.
Environment Injection: All resources (JVM memory, InnoDB limits, Swoole ports) are managed via a .env file, making deployment across different infrastructures seamless.
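
To make the declarative idea concrete, here is a minimal sketch of how a rename map and an environment-driven setting might be applied once loaded. ConfigSketch, rename and intSetting are hypothetical helpers, and the variable name SWOOLE_PORT is an assumption; the project's actual YAML keys and .env names are not shown in this write-up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ConfigSketch {

    /** Returns a copy of the row with columns renamed; unlisted columns keep their name. */
    static Map<String, Object> rename(Map<String, Object> row, Map<String, String> renames) {
        Map<String, Object> result = new LinkedHashMap<>();
        row.forEach((column, value) -> result.put(renames.getOrDefault(column, column), value));
        return result;
    }

    /** Reads an integer setting from an environment-style map, with a fallback. */
    static int intSetting(Map<String, String> env, String key, int fallback) {
        String raw = env.get(key);
        return (raw == null || raw.isBlank()) ? fallback : Integer.parseInt(raw.trim());
    }
}
```

For example, rename(row, Map.of("owner_id", "product_id")) moves the value stored under owner_id to product_id and leaves every other column untouched, while intSetting(System.getenv(), "SWOOLE_PORT", 9501) would pick up a port from the environment or fall back to the given default.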

Lessons Learned

Flow Optimization: The real-world impact of network protocols (TCP vs. HTTP) on bulk data processing.
MySQL Performance: Tuning innodb_buffer_pool_size and max_allowed_packet parameters to support massive batch writes.
Interoperability: Effectively managing communication between a Java runtime and an asynchronous PHP service.

Retrospective & Future

This project has proven its stability during critical migrations in Docker environments. To scale further, I am currently working on a transition to the cloud.

The goal is to enable full horizontal scaling: by deploying the engine on Kubernetes, I can use the declarative mapping to run multiple instances in parallel, each processing different tables. This approach will drastically reduce migration times, even for databases reaching hundreds of millions of rows.
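
One possible way to split the work, sketched under assumptions: each instance knows its own index and the total instance count (for example through environment variables) and takes every n-th table from the declared list. TablePartitioner and tablesFor are hypothetical names, and this round-robin scheme is an illustration, not a description of the planned implementation.

```java
import java.util.ArrayList;
import java.util.List;

final class TablePartitioner {

    /** Round-robin split: instance i of n takes the tables at positions i, i+n, i+2n, ... */
    static List<String> tablesFor(List<String> allTables, int instanceIndex, int instanceCount) {
        if (instanceCount <= 0 || instanceIndex < 0 || instanceIndex >= instanceCount) {
            throw new IllegalArgumentException("instanceIndex must be in [0, instanceCount)");
        }
        List<String> mine = new ArrayList<>();
        for (int i = instanceIndex; i < allTables.size(); i += instanceCount) {
            mine.add(allTables.get(i));
        }
        return mine;
    }
}
```

Because every instance reads the same declarative list, the split needs no coordination between instances, although it does not balance tables by size.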