In a recent production use case, we had to compare millions of claim records—both old and new—to detect changes and deletions. This problem, typically solved using Hadoop MapReduce or Azure Spark, came with notable infrastructure cost and complexity. Here’s how we replaced all that with a scalable, memory-efficient pure Java process, and why this approach is now running in production.

📥 Step 1: Processing Incoming Claim Files

The pipeline begins when two files arrive—one with old claims, one with new. Each file contains millions of records, which we process in chunks sized by a configurable THRESHOLD. As soon as a chunk reaches that threshold, a new Java thread is spawned to handle it:

  • Each thread sorts its block of records.
  • The sorted data is written to intermediate files.

This multi-threaded file-level sorting keeps memory use in check and distributes CPU load without spinning up a full Spark cluster.
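The chunked sort can be sketched as follows. This is a minimal illustration, not the production code: the class and file names (`ChunkSorter`, `run-N.txt`), the THRESHOLD value, and the assumption that each record is one line sorted by its full text are all hypothetical simplifications.

```java
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkSorter {
    static final int THRESHOLD = 100_000; // configurable chunk size (illustrative value)

    // Reads the input in THRESHOLD-sized chunks; each chunk is sorted on a
    // worker thread and written out as an intermediate "run" file.
    public static List<Path> sortInChunks(Path input, Path workDir) throws Exception {
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<Path>> futures = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            List<String> chunk = new ArrayList<>(THRESHOLD);
            String line;
            int runId = 0;
            while ((line = reader.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() >= THRESHOLD) {
                    futures.add(submitSort(pool, chunk, workDir, runId++));
                    chunk = new ArrayList<>(THRESHOLD); // hand off; start a fresh chunk
                }
            }
            if (!chunk.isEmpty()) futures.add(submitSort(pool, chunk, workDir, runId));
        }
        List<Path> runs = new ArrayList<>();
        for (Future<Path> f : futures) runs.add(f.get()); // wait for all sorted runs
        pool.shutdown();
        return runs;
    }

    private static Future<Path> submitSort(ExecutorService pool, List<String> chunk,
                                           Path dir, int id) {
        return pool.submit(() -> {
            chunk.sort(Comparator.naturalOrder()); // sort by record key (whole line here)
            Path run = dir.resolve("run-" + id + ".txt");
            Files.write(run, chunk); // intermediate sorted file
            return run;
        });
    }
}
```

Only one chunk is held in memory per worker thread at a time, which is what keeps the footprint bounded regardless of input size.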

🧪 Step 2: External Merge Sort

Once all chunks are written to disk in sorted form, we run an external merge sort—a k-way merge over the sorted runs—to collate them efficiently. This avoids holding large datasets in memory while delivering a completely sorted stream of claims ready for comparison.
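A standard way to implement the merge phase is a min-heap holding the head record of each run, so only one line per run is in memory at once. This sketch assumes the same one-record-per-line layout as above; the class name and method shape are illustrative, not the production API.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalMerge {
    // K-way merge: one open reader per sorted run, coordinated by a min-heap
    // keyed on each reader's current line.
    public static void merge(List<Path> runs, Path output) throws IOException {
        record Head(String line, BufferedReader reader) {}
        PriorityQueue<Head> heap = new PriorityQueue<>(Comparator.comparing(Head::line));
        for (Path run : runs) {
            BufferedReader r = Files.newBufferedReader(run);
            String first = r.readLine();
            if (first != null) heap.add(new Head(first, r)); else r.close();
        }
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            while (!heap.isEmpty()) {
                Head h = heap.poll();          // smallest record across all runs
                out.write(h.line());
                out.newLine();
                String next = h.reader().readLine();
                if (next != null) heap.add(new Head(next, h.reader()));
                else h.reader().close();       // this run is exhausted
            }
        }
    }
}
```

Memory use is proportional to the number of runs, not the number of records, which is the property that lets this step scale to millions of claims.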

🔄 Step 3: Comparing Records with ForkJoin + Producer-Consumer Design

With sorted streams ready, we shift to the core comparison logic using Java’s ForkJoinPool:

  • We run two producers, one for each file.
  • They push sorted records into a shared ArrayBlockingQueue.
  • A consumer thread reads from the queue and compares records in real time.

Whenever it finds a delta (a changed record) or a deletion, it writes that to corresponding output files—cleanly separating changed and removed claims.
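The comparison step can be sketched as a merge-join over the two sorted streams. A few labeled deviations from the description above: the article mentions a single shared ArrayBlockingQueue and a ForkJoinPool, but for clarity this sketch gives each producer its own ArrayBlockingQueue (so the consumer can walk both streams in key order) and uses plain threads. The `"key|payload"` record layout and the EOF sentinel are hypothetical.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class SortedStreamComparator {
    private static final String EOF = "\u0000EOF"; // poison-pill sentinel (hypothetical)

    public static void compare(Iterable<String> oldSorted, Iterable<String> newSorted,
                               List<String> changed, List<String> deleted)
            throws InterruptedException {
        BlockingQueue<String> oldQ = new ArrayBlockingQueue<>(1024);
        BlockingQueue<String> newQ = new ArrayBlockingQueue<>(1024);
        Thread p1 = producer(oldSorted, oldQ);
        Thread p2 = producer(newSorted, newQ);
        p1.start();
        p2.start();

        String o = oldQ.take(), n = newQ.take();
        while (!o.equals(EOF) || !n.equals(EOF)) {
            if (n.equals(EOF) || (!o.equals(EOF) && key(o).compareTo(key(n)) < 0)) {
                deleted.add(o);              // key only in the old file: a deletion
                o = oldQ.take();
            } else if (o.equals(EOF) || key(o).compareTo(key(n)) > 0) {
                n = newQ.take();             // key only in the new file: an insert; skipped here
            } else {
                if (!o.equals(n)) changed.add(n); // same key, different payload: a delta
                o = oldQ.take();
                n = newQ.take();
            }
        }
        p1.join();
        p2.join();
    }

    private static Thread producer(Iterable<String> src, BlockingQueue<String> q) {
        return new Thread(() -> {
            try {
                for (String rec : src) q.put(rec); // blocks when the queue is full
                q.put(EOF);                        // signal end of this stream
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
    }

    // Keys compared lexicographically for simplicity; real claim keys would need
    // a comparator matching the sort order used in the earlier steps.
    private static String key(String rec) { return rec.split("\\|", 2)[0]; }
}
```

Because both inputs are fully sorted, the consumer never buffers more than one record per stream—the bounded queues provide backpressure between the file readers and the comparison logic.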

💡 Why Not Spark?

Initially, management leaned toward spinning up Azure Spark pools to solve this. But the cost per run and the effort required to manage large-scale Spark infrastructure didn’t justify the occasional nature of the job. Our analysis found:

  • Spark’s operational cost would be 30–40% higher.
  • Resource ramp-up time was excessive.
  • Custom record comparison logic fit Java better than Spark transformations.

So we took the lean route—a POC using pure Java—and it proved to be:

✅ Faster to deploy
✅ Easier to maintain
✅ Far more cost-effective
✅ Fully scalable (millions of records handled without crashes or timeouts)

🏁 Result: Java in Production

The Java implementation passed all performance tests and is now live in production. We’ve eliminated the need for cluster orchestration, reduced compute billing, and retained full flexibility in how records are chunked, sorted, and compared.

This experience reaffirmed that you don’t always need heavy-duty distributed frameworks to solve big problems—especially when you’ve got thoughtful threading, queue design, and memory control on your side.