Stream Processing

Batch Processing

Ajay Yadav

3 min readJan 8, 2022

Batch processing is the procedure of transforming a set of inputs to another set of outputs.

The primary illustrations of batch processing are Reporting and Billing.

The concern with batch processing is that it is slow and the outcome is reflected after a long time (hours or days)

Stream Processing

Stream processing is the handling of an event instantly as it occurs. A stream can be illustrated as a steady flow of data.

In batch processing, input and output are typically files. In-stream processing, input is a group of records. Record is also known as an Event.

An event is an immutable object which contains the facts of something that happened in the past. In a streaming system, topics and streams are the groupings of the related events.

Batch processing is written once and read by multiple jobs for processing. Stream processing, an event is generated by one producer and consumed by numerous consumers.

Datastore for Stream processing

We require a datastore to store the events generated by the producer and consumers to poll the events from the database that emerged after the prior poll by the consumer.

Anticipation from system

Processing delay should be minimal
Polling speed should be high

We have to discover a database that satisfies the anticipation of the system.

One strategy is to utilize a relational database that supports triggers to respond to the change. Despite they have restricted capability for stream processing.

Specialized tools for stream processing: Messaging Systems

Messaging System

Fundamental concept

The producer sends the message
The consumer receives the message

There are myriad messaging systems in the market with diverse capabilities.

A few questions to distinguish between the messaging system

What happens if producers produce messages faster than the consumer can ingest?

Following the appropriate prospects

The system can drop messages
Buffer messages in a queue
Apply backpressure or flow control (block producer from sending additional messages)

What happens if the nodes of messaging system crashed?

Durability demands further resources for writing to the disk and replication. It can be decided based on the system requirements.

Direct communication

Several systems don’t utilize message brokers and directly establish a connection between producer and consumer with alternatives

Financial industry use UDP multicast
A brokers-less messaging library such as Zero MQ
StatsD use UDP communication to send metrics
Webhook: Consumer exposes a service endpoint that a producer can invoke when an event occurs.

Message broker

A message broker is a data store optimized for handling message streams.

Features

Data is centralized with the broker
Client handling ( connect, disconnect and crash )
Durability management

Message processing with a message broker is asynchronous.

Database vs Message broker

Message brokers are not suitable for long-term data storage. A database usually keeps data for the long term albeit the message broker deletes it as it is consumed by the consumer.
Message broker assumes that their working set is fairly small.
The database provides various ways of searching the data which is not the case with the message brokers.

Multiple consumer handling

Load balancing
Fanout

Stream Processing

Batch Processing

Stream Processing

Datastore for Stream processing

Messaging System

Written by Ajay Yadav