How to set up a Kafka cluster with Docker on Amazon EC2
In this example we are going to set up a two-node Kafka cluster using Amazon EC2 instances. Although some configuration details are specific to AWS, the instructions in this tutorial can easily be generalized to larger clusters and to other deployment environments.
⚠️ WARNING ⚠️: as soon as your EC2 instances are running, your Amazon account will be billed.
Set up EC2 instances
First of all, you need to set up two EC2 instances. I won't describe here the details of how to set up an AWS account (if you don't already have one) and EC2 instances, but you can find plenty of resources online (see, e.g., the official documentation).
In general, for a production environment, you should evaluate carefully:
- the instance type (e.g., `t2.large`)
- the size and type of storage (e.g., 500 GB `st1`)
The following instructions assume that your instances are running Amazon Linux. However, any other Linux distribution can be used.
Install Docker and Docker Compose
Connect to both instances via SSH and run:

```sh
sudo yum install docker
sudo systemctl enable docker.service
sudo systemctl start docker.service
```
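Optionally, to run `docker` commands without `sudo`, you can add your user to the `docker` group (a small sketch, assuming the default `ec2-user` account; you will need to log out and back in for it to take effect):

```sh
# Allow ec2-user to talk to the Docker daemon without sudo
sudo usermod -aG docker ec2-user
```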
In this post I will also use Docker Compose to start Docker containers. To install it, see the official instructions; it should be enough to run something like the following:
```sh
sudo curl -L "https://github.com/docker/compose/releases/download/1.28.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
```
where `1.28.2` is the latest version at the time of writing (February 2021).
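To verify the installation, you can check the reported version, which should match the release you downloaded:

```sh
/usr/local/bin/docker-compose --version
# should print something like: docker-compose version 1.28.2
```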
Set up public IP addresses
To access the Kafka nodes from outside, you need to set up Elastic IP addresses for both instances. From the instance configuration page, take note of the following fields:
- Public IPv4 address
- Public IPv4 DNS
- Private IPv4 addresses
In this example, I will use the following fake addresses:
Node A
- Public IPv4 address: `A.B.C.D`
- Public IPv4 DNS: `ec2-A-B-C-D.us-east-1.compute.amazonaws.com`
- Private IPv4 address: `172.1.1.1`

Node B
- Public IPv4 address: `E.F.G.H`
- Public IPv4 DNS: `ec2-E-F-G-H.us-east-1.compute.amazonaws.com`
- Private IPv4 address: `172.2.2.2`
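As an alternative to the console, each instance can read its own addresses from the EC2 instance metadata service; these endpoints return the same values as the fields above:

```sh
# Run on the instance itself; each call prints one of the fields above
curl http://169.254.169.254/latest/meta-data/public-ipv4
curl http://169.254.169.254/latest/meta-data/public-hostname
curl http://169.254.169.254/latest/meta-data/local-ipv4
```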
Set up VPC
You need to configure your network to allow the two nodes to talk to each other, and to allow clients to reach the Kafka cluster nodes from the outside.
First of all, find the security group of your EC2 instances (e.g., `sg-1234567890`). Then go to VPC > SECURITY > Security Groups and select that security group ID. Now click on the Edit inbound rules button. You have to add the following rules:
| Source | Type | Protocol | Port range |
|---|---|---|---|
| Security group `sg-1234567890` | All traffic | All | All |
| Your client IP address(es) | Custom TCP | TCP | 2181 (for Zookeeper) |
| Your client IP address(es) | Custom TCP | TCP | 9092 (for Kafka) |
The first rule could be more specific, i.e., you could open only ports 2181, 2888 and 3888. In the other two rules, you have to set the IP address of the client (or clients) that will connect to the Kafka cluster.
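If you prefer the command line, the same rules can be added with the AWS CLI. A sketch, assuming the CLI is installed and configured; `203.0.113.10` is a placeholder for your client IP:

```sh
SG=sg-1234567890            # your security group ID
CLIENT_IP=203.0.113.10/32   # placeholder: replace with your client IP

# Allow all traffic between instances in the same security group
aws ec2 authorize-security-group-ingress --group-id "$SG" \
  --protocol all --source-group "$SG"

# Allow external clients to reach Zookeeper and Kafka
aws ec2 authorize-security-group-ingress --group-id "$SG" \
  --protocol tcp --port 2181 --cidr "$CLIENT_IP"
aws ec2 authorize-security-group-ingress --group-id "$SG" \
  --protocol tcp --port 9092 --cidr "$CLIENT_IP"
```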
Configure Docker containers
Prepare the host
In this example I am using Docker images from Bitnami, but this configuration can be adapted very easily to other images.
Bitnami images run in user mode, specifically with user ID `1001`. To allow Kafka and Zookeeper to persist files on your host filesystem, you have to map two host directories into the Docker containers (every time a Docker container is deleted and recreated, the data in its filesystem is lost).
Assuming that you have two directories, `/data/zookeeper` and `/data/kafka`, where you are going to store Zookeeper and Kafka files, respectively, run the following commands:
```sh
sudo chown 1001:1001 /data/zookeeper/
sudo chown 1001:1001 /data/kafka/
```
These commands change the ownership of the two directories, granting the container processes permission to read and write them.
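If the directories do not exist yet, you can create them first and then verify the result:

```sh
# Create the data directories (safe to re-run thanks to -p)
sudo mkdir -p /data/zookeeper /data/kafka
# After the chown commands above, both should be owned by UID/GID 1001
ls -ld /data/zookeeper /data/kafka
```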
Zookeeper and Kafka setup
We have to start one instance of Zookeeper and one instance of Kafka on each cluster node.
As mentioned above, we are using Docker Compose to manage container creation. Use the two files linked below as templates to create your configuration files (one for each node).
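To give an idea of their shape, here is a minimal sketch of what the node A file could look like. It assumes the Bitnami `zookeeper` and `kafka` images with a plaintext setup; the image tags, addresses, and paths are placeholders to adapt to your environment:

```yaml
version: '2'

services:
  zookeeper:
    image: 'bitnami/zookeeper:3'
    ports:
      - '2181:2181'   # client connections
      - '2888:2888'   # quorum communication
      - '3888:3888'   # leader election
    volumes:
      - '/data/zookeeper:/bitnami/zookeeper'
    environment:
      - ZOO_SERVER_ID=1
      - ZOO_SERVERS=172.1.1.1:2888:3888,172.2.2.2:2888:3888
      - ALLOW_ANONYMOUS_LOGIN=yes

  kafka:
    image: 'bitnami/kafka:2'
    ports:
      - '9092:9092'
    volumes:
      - '/data/kafka:/bitnami/kafka'
    environment:
      - KAFKA_BROKER_ID=1
      - KAFKA_CFG_ZOOKEEPER_CONNECT=172.1.1.1:2181,172.2.2.2:2181
      # A single listener used for both inter-broker and client traffic
      - KAFKA_CFG_LISTENERS=EXTERNAL://:9092
      - KAFKA_CFG_ADVERTISED_LISTENERS=EXTERNAL://ec2-A-B-C-D.us-east-1.compute.amazonaws.com:9092
      - KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=EXTERNAL:PLAINTEXT
      - KAFKA_CFG_INTER_BROKER_LISTENER_NAME=EXTERNAL
      - KAFKA_CFG_NUM_PARTITIONS=1
      - KAFKA_CFG_DEFAULT_REPLICATION_FACTOR=2
      - KAFKA_CFG_OFFSETS_TOPIC_REPLICATION_FACTOR=2
      - KAFKA_CFG_AUTO_CREATE_TOPICS_ENABLE=true
      - ALLOW_PLAINTEXT_LISTENER=yes
    depends_on:
      - zookeeper
```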
Let’s go through the relevant details of the configuration.
In the `zookeeper` section:
- Each node is assigned a unique `ZOO_SERVER_ID` number.
- In the `ZOO_SERVERS` field you have to list all the addresses of your cluster, in the position matching the `ZOO_SERVER_ID` number assigned above. Private IP addresses are OK here, since the communication between servers is only internal. In this field you have to change the IP addresses to match your configuration.
In the `kafka` section:
- `KAFKA_BROKER_ID` is the unique identifier of the Kafka instance.
- `KAFKA_CFG_NUM_PARTITIONS` is set here to 1 by default, but in real applications you should create topics with many partitions (see below for an explanation).
- All configurations related to the replication factor are set to 2, the minimum that provides redundancy. In this case we have a two-node cluster, but with larger clusters you could increase it.
- The configuration of listeners is crucial. The article Kafka Listeners – Explained covers everything about this topic. In this example I create a single listener (named `EXTERNAL`) that is used both for internal (between cluster nodes) and external (between clients and servers) communication. However, this only works here in AWS because the public DNS name resolves to the public IP address on the external interface and to the private IP address on the internal interface. In this field you have to change the DNS names to match your configuration.
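The second node's file differs only in the per-node values. A sketch of just the settings that change, using the fake addresses above (not a complete file):

```yaml
# Node B: the per-node values that change with respect to node A
services:
  zookeeper:
    environment:
      - ZOO_SERVER_ID=2   # matches the second entry in ZOO_SERVERS
  kafka:
    environment:
      - KAFKA_BROKER_ID=2
      - KAFKA_CFG_ADVERTISED_LISTENERS=EXTERNAL://ec2-E-F-G-H.us-east-1.compute.amazonaws.com:9092
```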
Start containers
Now put the first Docker Compose configuration file on the first instance, and the other configuration file on the second instance. Place them in some directory, rename them to `docker-compose.yml`, and from inside that directory run:

```sh
sudo /usr/local/bin/docker-compose up -d
```
on both machines. Now, by running `docker ps`, you should see both the Zookeeper and the Kafka containers running on both instances. If not, you will need to debug the problem by checking the containers' output with `docker logs`.
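For example, assuming the service names from the Compose sketch above (`zookeeper` and `kafka`; adjust to your file), you can follow a container's output like this:

```sh
# List running containers and their names
sudo docker ps
# Tail the logs of the kafka service defined in docker-compose.yml
sudo /usr/local/bin/docker-compose logs -f kafka
```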
Check if it’s working
If you have configured everything correctly, both nodes should now talk to each other, and you should be able to reach both of them.
To test the connection to Kafka, you can use a handy command-line tool named kafkacat. To avoid installing it, since we are already using Docker, let's execute it as a container.
From the external host/network you have configured in the Set up VPC section, try to reach node 1:
```sh
docker run --rm --tty confluentinc/cp-kafkacat \
  kafkacat \
  -b ec2-A-B-C-D.us-east-1.compute.amazonaws.com:9092 \
  -L
```
and node 2:
```sh
docker run --rm --tty confluentinc/cp-kafkacat \
  kafkacat \
  -b ec2-E-F-G-H.us-east-1.compute.amazonaws.com:9092 \
  -L
```
Both commands should run without errors, with output similar to the following:
```
Metadata for all topics (from broker 1: ec2-A-B-C-D.us-east-1.compute.amazonaws.com:9092/1):
 2 brokers:
  broker 1 at ec2-A-B-C-D.us-east-1.compute.amazonaws.com:9092
  broker 2 at ec2-E-F-G-H.us-east-1.compute.amazonaws.com:9092
...
```
Create topics
In the configuration, I have set `KAFKA_CFG_AUTO_CREATE_TOPICS_ENABLE` to `true` to let you create topics automatically (with a default of 1 partition). However, since using multiple partitions is the only way to parallelize Kafka I/O operations, I suggest creating topics manually and choosing carefully how many partitions to assign to each topic.
The best replication factor and number of partitions for your topic depend on many parameters, such as:
- cluster size
- node instances hardware
- network speed
- application needs
I'm not covering the details of this choice here. Let's say, for example, that you want to create a topic with 10 partitions and a replication factor of 2. From a client with Kafka installed (e.g., `brew install kafka` on macOS), run the following:
```sh
kafka-topics --create \
  --zookeeper ec2-A-B-C-D.us-east-1.compute.amazonaws.com:2181 \
  --replication-factor 2 --partitions 10 \
  --topic my-test-topic
```
Again, the client machine must have access to the Zookeeper port.
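To double-check the result, you can also describe the topic with the same tool; the output should list 10 partitions spread across the two brokers:

```sh
kafka-topics --describe \
  --zookeeper ec2-A-B-C-D.us-east-1.compute.amazonaws.com:2181 \
  --topic my-test-topic
```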
If you run the `kafkacat` commands described above again, you should be able to see your new topic in the command output.
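As a final end-to-end check, you can produce a few messages to the new topic through one node and consume them back through the other. A sketch, again using kafkacat in Docker:

```sh
# Produce: type a few lines on stdin, then press Ctrl-D
docker run --rm --interactive confluentinc/cp-kafkacat \
  kafkacat -b ec2-A-B-C-D.us-east-1.compute.amazonaws.com:9092 \
  -t my-test-topic -P

# Consume from the beginning and exit once the end is reached
docker run --rm confluentinc/cp-kafkacat \
  kafkacat -b ec2-E-F-G-H.us-east-1.compute.amazonaws.com:9092 \
  -t my-test-topic -C -o beginning -e
```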
Conclusions
We have configured a working Kafka cluster that you can use in production for real applications.
The configuration described here works for Amazon EC2 (see in particular the setup of the Kafka listeners), but it can easily be adapted to any other deployment environment.