Batch processing is the procedure of transforming a set of inputs into another set of outputs.
The classic examples of batch processing are reporting and billing.
The concern with batch processing is that it is slow: the outcome is only reflected after a long time (hours or days).
Stream processing is the handling of an event as soon as it occurs. A stream can be pictured as a continuous flow of data.
In batch processing, input and output are typically files. In stream processing, the input is a sequence of records. A record is also known as an event.
An event is an immutable object that contains the facts of something that happened in the past. In a streaming system, topics and streams are groupings of related events.
In batch processing, a file is written once and then read by multiple jobs. In stream processing, an event is generated by one producer and consumed by numerous consumers.
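These ideas can be sketched in a few lines of Python. This is a minimal illustration, not a real streaming system: the frozen dataclass models an immutable event, the list models an append-only log, and each consumer keeps its own read offset, so the same events can be consumed independently by many consumers. All names here are illustrative.

```python
from dataclasses import dataclass
import time

# An event is an immutable record of a fact that happened in the past.
@dataclass(frozen=True)
class Event:
    key: str
    value: str
    timestamp: float

# An append-only log: written once by the producer, read by many consumers.
log = []

def produce(key, value):
    log.append(Event(key, value, time.time()))

# Each consumer tracks its own read position (offset), so consumers
# do not interfere with each other.
offsets = {"billing": 0, "analytics": 0}

def consume(consumer):
    events = log[offsets[consumer]:]
    offsets[consumer] = len(log)
    return events

produce("page_view", "/home")
produce("page_view", "/pricing")

first = [e.value for e in consume("billing")]    # sees both events
again = [e.value for e in consume("billing")]    # nothing new since last read
other = [e.value for e in consume("analytics")]  # independently sees both events
```

Note that mutating an `Event` field raises an error, which is the point: a consumer can only read facts, never rewrite history.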
Datastore for Stream processing
We require a datastore to hold the events generated by the producer; consumers then poll the datastore for the events that arrived after their prior poll.
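The polling pattern can be sketched against an ordinary relational table, using an auto-incrementing id as the "position" a consumer remembers between polls. This is an assumption-laden sketch (table and column names are made up), not a recommended design:

```python
import sqlite3

# Use a relational table as a naive event store.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)"
)

def produce(payload):
    db.execute("INSERT INTO events (payload) VALUES (?)", (payload,))

def poll(last_seen_id):
    # Fetch only the rows that appeared after the consumer's prior poll.
    rows = db.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    new_last = rows[-1][0] if rows else last_seen_id
    return rows, new_last

produce("order_created")
produce("order_paid")

rows, last = poll(0)      # first poll returns both events
rows2, last = poll(last)  # nothing new yet, so this poll returns no rows
```

The weakness is visible in the code: the consumer only learns about new events when it polls, so the processing delay is bounded by the polling interval, and frequent polling puts load on the database.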
Expectations from the system
- Processing delay should be minimal
- Polling speed should be high
We have to find a database that satisfies these expectations.
One strategy is to utilize a relational database that supports triggers to respond to changes. However, relational databases have only limited capability for stream processing.
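As a concrete sketch of the trigger strategy, the snippet below uses SQLite (via Python's standard `sqlite3` module) to copy every change into an events table that a consumer could then read. The `orders`/`order_events` schema is purely illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE order_events (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    order_id INTEGER,
    status TEXT
);
-- The trigger reacts to a change by recording it as an event.
CREATE TRIGGER orders_changed AFTER INSERT ON orders
BEGIN
    INSERT INTO order_events (order_id, status) VALUES (NEW.id, NEW.status);
END;
""")

db.execute("INSERT INTO orders (id, status) VALUES (1, 'created')")
events = db.execute("SELECT order_id, status FROM order_events").fetchall()
```

This works for small cases, but it illustrates the limitation: consumers still have to poll the `order_events` table, and the trigger adds write overhead to every transaction.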
Specialized tools for stream processing: Messaging Systems
- The producer sends the message
- The consumer receives the message
There are myriad messaging systems in the market with diverse capabilities.
A few questions to distinguish between messaging systems:
What happens if producers produce messages faster than the consumers can ingest them?
The possible options are:
- The system can drop messages
- Buffer messages in a queue
- Apply backpressure or flow control (block producer from sending additional messages)
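The three options above can be demonstrated with Python's thread-safe bounded queue. This is a toy sketch, not any particular broker's behavior:

```python
import queue

# Option 2: a bounded buffer that holds at most two messages.
buf = queue.Queue(maxsize=2)

def produce(msg):
    try:
        buf.put(msg, block=False)  # try to buffer the message
        return "buffered"
    except queue.Full:
        return "dropped"           # option 1: drop when the buffer is full

# Option 3 (backpressure) would instead be buf.put(msg, block=True),
# which blocks the producer until a consumer frees space in the queue.

results = [produce(f"m{i}") for i in range(3)]
# The first two messages fit in the buffer; the third is dropped.
```

Which option is right depends on the workload: dropping suits metrics where losing a sample is tolerable, while backpressure suits billing-like workloads where every message matters.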
What happens if the nodes of the messaging system crash?
- Durability demands extra resources for writing to disk and for replication, so the trade-off should be decided based on the system requirements.
Several systems don’t use message brokers and instead establish a direct connection between producer and consumer, for example:
- The financial industry uses UDP multicast
- Broker-less messaging libraries such as ZeroMQ
- StatsD uses UDP communication to send metrics
- Webhook: Consumer exposes a service endpoint that a producer can invoke when an event occurs.
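The webhook pattern can be sketched with Python's standard library alone: the consumer runs a small HTTP endpoint, and the producer invokes it when an event occurs. The endpoint path and event payload below are made up for illustration:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []

# Consumer side: a minimal webhook endpoint.
class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        received.append(json.loads(body))
        self.send_response(204)  # acknowledge with "no content"
        self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), WebhookHandler)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# Producer side: invoke the consumer's endpoint when an event occurs.
event = json.dumps({"type": "order_paid", "order_id": 42}).encode()
req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/webhook",
    data=event,
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
server.shutdown()
```

Note the inversion: with a webhook the producer pushes to the consumer, so the consumer must be reachable and available whenever events occur.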
A message broker is a data store optimized for handling message streams.
- Data is centralized with the broker
- Client handling (connect, disconnect, and crash)
- Durability management
Message processing with a message broker is asynchronous.
Database vs Message broker
- Message brokers are not suitable for long-term data storage. A database usually keeps data for the long term, whereas a message broker deletes a message once it has been consumed.
- Message brokers assume that their working set is fairly small.
- A database provides various ways of searching the data, which is not the case with message brokers.
Multiple consumer handling
- Load balancing: each message is delivered to one of the consumers, so the consumers share the work of processing the messages.
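Load balancing across consumers can be sketched with a shared queue and worker threads; each message is taken by exactly one consumer. This is an in-process toy, with `None` used as a hypothetical shutdown sentinel:

```python
import queue
import threading

q = queue.Queue()
processed = {"c1": [], "c2": []}
lock = threading.Lock()

def consumer(name):
    # Each message is taken off the shared queue by exactly one consumer.
    while True:
        msg = q.get()
        if msg is None:  # sentinel: stop this consumer
            break
        with lock:
            processed[name].append(msg)

threads = [threading.Thread(target=consumer, args=(n,)) for n in ("c1", "c2")]
for t in threads:
    t.start()

for i in range(10):
    q.put(i)
for _ in threads:   # one sentinel per consumer
    q.put(None)
for t in threads:
    t.join()

# Every message is processed exactly once, by one consumer or the other.
total = sorted(processed["c1"] + processed["c2"])
```

The split of messages between `c1` and `c2` is nondeterministic, but together they cover the whole stream without duplicates, which is exactly the load-balancing guarantee.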