Big data technology is booming and moving fast, and one of its major growth areas is the need to process and analyze real-time data, which benefits many industries, from IoT to social network analysis.
But real-time data has its challenges, which the industry has answered with many different solutions.
In this post we are going to cover the technical solutions in AWS for streaming real-time data.

Problems with real-time data processing

Real-time data brings its own challenges and problems. Here are some of them:

  1. Processing massive amounts of streaming data:
    For example, processing a single hashtag on Twitter can mean receiving thousands of tweets per second.
  2. Data loss:
    A data producer sends records that can be lost for a variety of reasons.
  3. Message reordering:
    With so much data coming in, it is hard to keep track of the order of records whose arrival times may differ by only a fraction of a second.
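
One common way to deal with the last two problems is to have each producer attach a sequence number to its records, so the consumer can put them back in order and notice gaps. The following is a minimal sketch in plain Python, not tied to any AWS service; the ReorderBuffer class and the seq numbering are hypothetical names introduced only for this illustration.

```python
class ReorderBuffer:
    """Releases records in sequence order and reports gaps (possible loss)."""

    def __init__(self, first_seq=0):
        self.next_seq = first_seq   # sequence number we expect next
        self.pending = {}           # out-of-order records, keyed by seq

    def push(self, seq, payload):
        """Accept one record; return the payloads that are now ready, in order."""
        if seq < self.next_seq or seq in self.pending:
            return []               # duplicate or already delivered: drop it
        self.pending[seq] = payload
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready

    def missing(self):
        """Sequence numbers we are still waiting for (late or lost records)."""
        if not self.pending:
            return []
        return [s for s in range(self.next_seq, max(self.pending))
                if s not in self.pending]


buf = ReorderBuffer()
arrivals = [(0, "a"), (2, "c"), (1, "b"), (1, "b"), (4, "e")]
delivered = []
for seq, payload in arrivals:
    delivered.extend(buf.push(seq, payload))

print(delivered)       # ['a', 'b', 'c']
print(buf.missing())   # [3]
```

Record 3 has not arrived, so record 4 is held back; a real consumer would also need a timeout to decide when a late record should be treated as lost.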

Real-time streaming solution requirements

A stream processing solution has to solve different challenges:

  • Process millions of events per second in real time.
  • Correlate data from multiple streams.
  • Respond to events in real time.
  • Scale up/out as volumes increase in size and complexity.
  • Integrate rapidly with existing infrastructure and data sources.
  • Integrate with other systems that will use the data.
  • Give end users ad-hoc, continuous query access.

Streaming solutions on AWS

AWS offers several options for streaming real-time data, and each has its own features. We are going to compare them to give you an idea of which one to pick for your next project. Kinesis and SQS / SNS are provided directly by Amazon, while Kafka is open-source software that you run yourself on AWS. Other options are available from the AWS Marketplace, which is a very active community, so you can check whether your preferred streaming software is already available there.

  1. Amazon Kinesis
  2. Apache Kafka
  3. Amazon SQS / SNS

And let us start with the simplest: SQS / SNS.

Amazon SQS / SNS

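Amazon SQS is a simple (“dumb”) message queuing service. It behaves like a first-in, first-out (FIFO) data store, although only SQS FIFO queues guarantee strict ordering; standard queues deliver messages in best-effort order. If you are coming from a Windows environment, it is similar to MSMQ.
SQS on its own is pull-based, but you can build a push-based messaging service by combining it with SNS, the Amazon Simple Notification Service.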

  • SQS/SNS Pros:

    1. Both SQS and SNS are native AWS services, which means they are elastic and scale to the level you need.
    2. They are managed by Amazon, so you don’t have to operate them yourself.
    3. They are the easiest of the three services to configure.
    4. Data is replicated across multiple Availability Zones automatically.
  • SQS/SNS Cons:

    1. They are very simple compared to the other solutions, so matching what the others provide takes a lot of customization and coding.
    2. They are more expensive than the others in this list when it comes to big data.
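
To give a feel for how little code SQS needs, here is a minimal sketch of a producer and a polling consumer. It assumes the boto3 library and an existing queue; the helper names send_event and drain_once, and the JSON message format, are choices made for this example. The functions take the SQS client as an argument so they can be tried against a stub as well as the real service.

```python
import json


def send_event(sqs, queue_url, event):
    """Put one JSON-encoded event on the queue."""
    return sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(event))


def drain_once(sqs, queue_url, handler, max_messages=10, wait_seconds=10):
    """Long-poll once, handle each message, and delete it after success."""
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=max_messages,  # SQS allows at most 10 per call
        WaitTimeSeconds=wait_seconds,      # long polling, at most 20 seconds
    )
    handled = 0
    for message in response.get("Messages", []):
        handler(json.loads(message["Body"]))
        # A message that is not deleted reappears after its visibility timeout.
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=message["ReceiptHandle"])
        handled += 1
    return handled


# Usage against a real queue (assumes boto3 is installed, AWS credentials are
# configured, and queue_url holds the URL of a queue you own):
#   import boto3
#   sqs = boto3.client("sqs")
#   send_event(sqs, queue_url, {"hashtag": "#aws", "count": 1})
#   drain_once(sqs, queue_url, print)
```

If an SNS topic feeds the queue, each message body arrives wrapped in an SNS JSON envelope unless raw message delivery is enabled on the subscription, so the handler would need to unwrap it first.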

Amazon Kinesis

Amazon Kinesis is a platform for streaming real-time data on AWS, offering services to store, query, analyze and deliver real-time data.

  • Kinesis Pros:

    1. Streams huge volumes of data (terabytes per hour).
    2. Managed by Amazon, so you don’t have to operate it yourself.
    3. The cheapest of the three when it comes to data size.
    4. Data is replicated across multiple Availability Zones automatically.
  • Kinesis Cons:

    1. More complex to set up than SQS.
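
Part of that extra complexity is that a Kinesis stream is divided into shards, and every record carries a partition key that decides which shard receives it. Kinesis hashes the key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains that value. The sketch below imitates this routing in plain Python; it assumes the shards split the hash space into equal, contiguous ranges, and shard_for_key is a hypothetical helper, not part of any AWS SDK.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit integer


def shard_for_key(partition_key, shard_count):
    """Return the index of the shard that would receive this partition key,
    assuming the shards divide the hash space into equal, contiguous ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * shard_count // HASH_SPACE


for key in ["device-1", "device-2", "device-1"]:
    print(key, "->", shard_for_key(key, shard_count=4))
```

Records with the same partition key always land on the same shard, which is what lets Kinesis preserve their relative order. With boto3, the key is passed as the PartitionKey argument of the Kinesis client's put_record call.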

Apache Kafka

Apache Kafka is a publish-subscribe message broker that can handle hundreds of megabytes per second and supports clustering for scaling out.

  • Kafka Pros:

    1. Huge community support.
    2. Easier to install and configure than Kinesis.
    3. More stable than Kinesis.
  • Kafka Cons:

    1. You need to set it up and configure it yourself on EC2.
    2. If you need redundancy across multiple Availability Zones, you have to set it up yourself.
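
To make the publish-subscribe model concrete, here is a toy in-memory model of one topic read by independent consumer groups. It is only an illustration of the idea, not the Kafka client API: ToyTopic and its methods are hypothetical, and a real topic is split into partitions spread across a cluster of brokers.

```python
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for a single-partition topic (not the Kafka API)."""

    def __init__(self):
        self.log = []                      # append-only list of messages
        self.offsets = defaultdict(int)    # next offset to read, per group

    def publish(self, message):
        self.log.append(message)
        return len(self.log) - 1           # offset assigned to the message

    def poll(self, group, max_messages=10):
        """Return unread messages for this group and advance its offset."""
        start = self.offsets[group]
        batch = self.log[start:start + max_messages]
        self.offsets[group] = start + len(batch)
        return batch


topic = ToyTopic()
for tweet in ["first", "second", "third"]:
    topic.publish(tweet)

print(topic.poll("analytics"))      # ['first', 'second', 'third']
print(topic.poll("archiver", 2))    # ['first', 'second']
print(topic.poll("archiver"))       # ['third']
```

Unlike an SQS queue, where a message is deleted once a consumer has handled it, the log keeps every message, and each group only tracks how far it has read.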

How to choose

If your requirements are simple and your data volume is small, you don’t need Kinesis or Kafka, because Amazon SQS/SNS is simpler to use.
If you need to handle huge volumes of data and more complex functionality, choose either Kafka or Kinesis. Both fit many complex situations, and Kinesis was in fact developed after Kafka.
If you want less work and a gentler learning curve, Kinesis should be your choice. Kinesis is managed by AWS and can be scaled automatically using AWS auto-scaling services.
If you prefer the support of a huge community, Kafka is the choice, but Kafka scales through its clustering features, which you have to configure yourself.
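
The rules of thumb above can be summed up as a small decision function. This is only a sketch of this post's guidance; the function name and its three yes/no inputs are made up for the example, and a real decision would weigh more factors, such as cost and existing tooling.

```python
def choose_streaming_service(high_volume, complex_processing, prefer_managed):
    """Encode this post's rules of thumb as a simple decision."""
    if not high_volume and not complex_processing:
        return "Amazon SQS/SNS"   # simple needs, small data
    if prefer_managed:
        return "Amazon Kinesis"   # less setup work, managed by AWS
    return "Apache Kafka"         # self-managed, large community


print(choose_streaming_service(False, False, True))   # Amazon SQS/SNS
print(choose_streaming_service(True, True, True))     # Amazon Kinesis
print(choose_streaming_service(True, False, False))   # Apache Kafka
```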

This is a sheet that I took from the AWS documentation. It shows the differences between these three technologies, plus one that I didn’t mention, DynamoDB Streams:


List of posts

This post is the second in a series of posts about big data. The other posts are:

  1. Introduction to big data.
  2. Streaming real time data (this post).
  3. Hadoop & MapReduce.
  4. ElasticSearch (coming).
  5. Spark and data analysis with Python (coming).
  6. Hive and data processing (coming).
  7. Practical example