Got ETL? Meet the reader-processor-writer pattern, along with all the pre-built implementations, scheduling, chunking, and retry features you might need.
I think those who are drawn to Spring Batch are right to use it. Its paradigm is sensible and encourages developers to design well and not reinvent things. It is reliable, robust, and relatively easy to use.
I have found, however, that many people reach for Spring Batch as an excellent technical solution while completely missing the business impact. ETL jobs suck!
Many businesses totally neglect the repercussions of copying and renaming data between all the systems in their company. To me, this kind of ETL reinforces bad data practices, delays meaningful standardization, and inhibits future analytics work. Extensive data duplication leads to data quality issues, which often lead to added costs for master data management solutions. It's also very wasteful, especially if you are paying for lots of Oracle products. Don't be that company that names tables in hipster speak (USR_CNTCT_INF) to cut down on data waste and then ETLs everything all over the company!
If you can get the job done with Spring Batch, you can get the job done in a similar read-process-write paradigm in just about any messaging framework. This encourages loose coupling and enables continuous data transfer (I avoid using the term real-time so as not to confuse with actual real-time systems). If you really miss the pre-built readers and writers, take a look at Apache Camel.
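To make the comparison concrete, here is a minimal sketch of the same read-process-write loop in a messaging style. The queue here is a plain in-memory stand-in for a broker subscription, and all the names are illustrative, not from any particular framework:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// A minimal read-process-write loop in messaging style: records are consumed
// continuously from a queue (a stand-in for a broker subscription), transformed
// by a "processor" step, and appended to a "writer" sink.
public class ReadProcessWrite {

    public static List<String> run(List<String> records) {
        Queue<String> source = new ArrayDeque<>(records); // reader: the subscription
        List<String> sink = new ArrayList<>();
        String record;
        while ((record = source.poll()) != null) {
            String processed = record.toUpperCase(); // processor step
            sink.add(processed);                     // writer step
        }
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a", "b"))); // [A, B]
    }
}
```

In a real system the `while` loop never ends; records are processed as they arrive rather than on a schedule, which is exactly the continuous-transfer property described above.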
Many features are easy to replicate in a streaming/messaging system. Scheduling? Who cares, it's streaming. Retry? Make a retry queue. Failure and error handling? Dead letter queue. Partitioning? Just add more workers. That last one is actually much easier than in Spring Batch.
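The retry-queue and dead-letter-queue ideas can be sketched in a few lines. This is an in-memory illustration only; the queues, the `MAX_ATTEMPTS` limit, and the failure condition are all made up for the example, standing in for real broker topics:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch of retry and dead-letter handling in a messaging style.
// Failed records are re-enqueued with an attempt count; records that
// keep failing end up on a dead-letter queue for later inspection.
public class RetryDlqSketch {
    static final int MAX_ATTEMPTS = 3;

    record Msg(String value, int attempts) {}

    // A "processor" that fails on certain records (illustrative condition).
    static String process(String record) {
        if (record.contains("bad")) throw new IllegalArgumentException("unparseable: " + record);
        return record.toUpperCase();
    }

    public static List<String> run(List<String> input, List<String> deadLetters) {
        Deque<Msg> queue = new ArrayDeque<>();
        input.forEach(r -> queue.add(new Msg(r, 0)));
        List<String> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            Msg msg = queue.poll();
            try {
                out.add(process(msg.value()));
            } catch (RuntimeException e) {
                int attempts = msg.attempts() + 1;
                if (attempts >= MAX_ATTEMPTS) {
                    deadLetters.add(msg.value());          // dead-letter queue
                } else {
                    queue.add(new Msg(msg.value(), attempts)); // retry queue
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> dlq = new ArrayList<>();
        List<String> ok = run(List.of("a", "bad1", "b"), dlq);
        System.out.println(ok);  // [A, B]
        System.out.println(dlq); // [bad1]
    }
}
```

With a real broker the retry and dead-letter queues are just additional topics, so the same shape carries over directly.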
There's also a whole host of stream processing and analytics capabilities you probably want out of your batch job but can't get from it. Say you have two data loads that somehow need to be related. You now need a third batch job to do a join after the first two run. Plus this all requires scheduling or polling and much consideration of efficiency.
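In a streaming model that join needs no third job: each side updates per-key state as records arrive, and a joined result is emitted the moment both sides have been seen. Here is a minimal sketch, using plain `HashMap`s as a stand-in for a stream processor's keyed state store (all names are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a continuous keyed join between two record streams.
// Records that arrive on one side before their match are buffered;
// when the matching key shows up on the other side, a joined result
// is emitted immediately -- no scheduled third job required.
public class StreamJoinSketch {
    private final Map<String, String> leftBuffer = new HashMap<>();
    private final Map<String, String> rightBuffer = new HashMap<>();
    private final List<String> joined = new ArrayList<>();

    public void onLeft(String key, String value) {
        String other = rightBuffer.remove(key);
        if (other != null) joined.add(key + ":" + value + "|" + other);
        else leftBuffer.put(key, value);
    }

    public void onRight(String key, String value) {
        String other = leftBuffer.remove(key);
        if (other != null) joined.add(key + ":" + other + "|" + value);
        else rightBuffer.put(key, value);
    }

    public List<String> results() { return joined; }

    public static void main(String[] args) {
        StreamJoinSketch join = new StreamJoinSketch();
        join.onLeft("42", "order");
        join.onRight("42", "payment"); // both sides seen: emit immediately
        System.out.println(join.results()); // [42:order|payment]
    }
}
```

A production system would also need windowing or state expiry so the buffers don't grow forever, but the core idea is this simple.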
Please don't be scared away from streaming or messaging systems, and please don't reach for Spring Batch solely because you are already struggling with batch jobs. Make the leap toward messaging.
I don't want to disparage Spring Batch; it's great at what it does. I have just seen too many batch jobs that would be significantly better as streaming architectures. There's also a huge push these days behind event streaming, which is a topic for another post.
Easy Batch is a framework that aims to simplify batch processing with Java. It was specifically designed for simple, single-task ETL jobs. Writing batch applications requires a lot of boilerplate code: reading, writing, filtering, parsing and validating data, logging, and reporting, to name a few. The idea is to free you from these tedious tasks and let you focus on your batch application's logic.
How does it work?
Easy Batch jobs are simple processing pipelines. Records are read in sequence from a data source, processed in a pipeline, and written in batches to a data sink.
The framework provides the Record and Batch APIs to abstract data format and process records in a consistent way regardless of the data source/sink type.
Let's see a quick example. Suppose you have the following tweets.csv file:
```
id,user,message
1,foo,hello
2,bar,@foo hi!
```

and you want to transform these tweets to XML format. Here is how you can do that with Easy Batch:
```java
Path inputFile = Paths.get("tweets.csv");
Path outputFile = Paths.get("tweets.xml");

Job job = new JobBuilder<String, String>()
        .reader(new FlatFileRecordReader(inputFile))
        .filter(new HeaderRecordFilter<>())
        .mapper(new DelimitedRecordMapper<>(Tweet.class, "id", "user", "message"))
        .marshaller(new XmlRecordMarshaller<>(Tweet.class))
        .writer(new FileRecordWriter(outputFile))
        .batchSize(10)
        .build();

JobExecutor jobExecutor = new JobExecutor();
JobReport report = jobExecutor.execute(job);
jobExecutor.shutdown();
```

This example creates a job that:

- reads records from the tweets.csv file
- filters out the header record
- maps each delimited record to a Tweet object
- marshals each tweet to XML
- writes the output to tweets.xml, in batches of 10 records
At the end of execution, you get a report with statistics and metrics about the job run (execution time, number of errors, etc). All the boilerplate code of resource I/O, iterating through the data source, filtering and parsing records, mapping data to the domain object Tweet, writing output, and reporting is handled by Easy Batch. Your code becomes declarative, intuitive, and easy to read, understand, test, and maintain.