[KAFKA] KAFKA 1

read

Kafka에 대해

INFRA

N사 인턴을 할 때, 처음으로 로깅처리에 대해 중점적으로 개발한 적이 있다. 단순히 그전까지 파일 날짜별로, logging 타입별로 저장하는 형태를 유지해왔는데, 기업에서는 어떻게 처리하는 지 매우 궁금했었다. 아마, 진짜 Kafka를 해당 회사 안에서 쓰는지는 모르겠지만, 적어도 내가 인턴을 할 때는 해당사항에 대해 배우는 것을 권했고, 그러다보니 면접 시에 항상 물어보는 것 같아, 이번에 문서를 총체적으로 정리해 보기로 했다. 그리고 정리하며 내가 헤깔려하는 용어들도 같이 정리해보려 한다.

KAFKA 란?

Kafka® is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies

기본적으로 Apache에서 소개하는 Kafka다. 해석을 해보자면, Kafka®는 실시간 데이터 파이프 라인 및 스트리밍 앱을 제작하는 데 사용됩니다. 수평적으로 확장 가능하고 결함 감내 기능이 있으며 매우 빠르며 수천 개의 회사에서 생산되고 있습니다.

파이프 라인 이라 하면, 데이터를 처리할 때, 출력이 다음 단계에 있어서의 입력으로 연결되는 구조를 말한다.

즉, 수평적으로 확장하며, 결함 감내 (부분 결함이 생기면 그 부분에 한해 성능은 떨어지지만, STOP 되진 않는다.) 기능이 있는 분산형 메세징 시스템이라 볼 수 있다.

기본적인 성격을 알아보도록 하자.

이 역시, Apache 사이트에서 설명하는 것으로 시작을 해보도록 하겠다.

distributed streaming platform 가장 크게 적혀진 성격이다. 분산형 스트리밍 플랫폼 이 의미를 하나씩 알아보자.

key

스트리밍 플랫폼에 3가지 KEY 가 있는데 다음과 같이 말한다.

Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.

메세지 큐 혹은 enterprise 메세징 시스템과 유사하게 레코드 스트림을 Publish(게시)와 subscribe (구독) 할 수 있다.

Store streams of records in a fault-tolerant durable way.

결함감내의 내구성 방식으로 레코드의 스트림을 저장.

먼저, 스트리밍 데이터 라 하면 수천 개의 데이터 소스에서 연속적으로 생성되는 데이터를 의미하는데, 레코드 는 처리의 기본 단위이므로 처리의 기본단위가 연속적으로 이어지는 것을 레코드의 스트림이라 볼 수 있다.

Process streams of records as they occur.

레코드 스트림이 발생하면 처리.

그리고 다음과 같이 2가지 방향으로 자주 사용한다.

Building real-time streaming data pipelines that reliably get data between systems or applications

시스템 또는 응용 프로그램간에 데이터를 신뢰성있게 얻는 실시간 스트리밍 데이터 파이프 라인 구축

Building real-time streaming applications that transform or react to the streams of data

데이터 스트림을 변환하거나 이에 반응하는 실시간 스트리밍 애플리케이션 구축

이게 상당히 어렵게 느껴지지만, 어떻게 동작하는지 Bottom Up으로 쉽게 알아가 보도록 하려한다.

일단, 3가지의 Kafka 개념이 설명된다.

Kafka is run as a cluster on one or more servers that can span multiple datacenters.

카프카는 여러 데이터센터로 확장할 수 있는 하나 혹은 그 이상의 서버들의 클러스터로 실행된다.

The Kafka cluster stores streams of records in categories called topics.

카프카 클러스터는 레코드 스트림을 TOPIC이라는 이름의 카테고리 안에 저장한다.

Each record consists of a key, a value, and a timestamp.

각 레코드는 키, 값, 타임 스탬프로 구성된다.

그리고 4개의 핵심 API가 제공된다.

The Producer API allows an application to publish a stream of records to one or more Kafka topics.

Producer API 는 application이 레코드 스트림을 하나 혹은 그 이상의 Kafka Topic들로 게시할 수 있게 한다.

The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.

Consumer API 는 application이 하나 혹은 그 이상의 TOPIC을 구독하고, 생성된 레코드 스트림을 처리할 수 있게 한다.

The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.

Streams API 는 application이 stream processor로써 작동하게 하고, 하나 혹은 그 이상의 TOPIC에서 Input Stream을 소비하고 하나 혹은 그 이상의 TOPIC의 Output Stream으로 생성해, 효과적으로 Input Stream에서 Output Stream으로 변환하게 한다.

The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.

Connector API는 Kafka TOPIC을 존재하는 application이나 data system과 연결하여, 재사용 가능한 Producer나 consumer를 구축하고 실행할 수 있게 한다. 예를 들어, 관계형 데이터베이스와의 connector는 모든 테이블의 변경을 캡쳐할 수 있다.

Topic

KAFKA

앞선 부분이 개론이자 어려운 내용이다.

내가 이해하고 있는 데로 간단히 말해보면, 하나 혹은 그 이상의 서버들을 모아 클러스터로 실행되는 Kafka는 사용자가 원하는 방향으로 TOPIC이라는 단어의 카테고리로 나누어 스트림을 저장한다. 즉, A와 B를 원하는 사용자는 A라는 TOPIC과 B라는 TOPIC으로 나누어, 각각의 스트림을 저장하는 것이다. 또한, 저장될 때 레코드는 키, 값, 타임 스탬프로 저장되어, 검색이 용이할 것으로 생각된다.

4가지 API는 한가지씩 분석해보면, Producer의 경우, 앞서 말한 TOPIC의 저장(게시)을 행하는 주체일 것이고, Consumer는 그런 TOPIC을 가져와(구독) 처리하는 주체일 것이다. Streams의 경우는 입출력을 담당할 것이고, Connector의 경우, 흔히, APP이나 DB와의 연결을 관리하는 부분일 것이다.

그럼 TOPIC과 Log에 대해 알아보자.

TOPIC을 처음에 다음과 같이 말한다. core abstraction Kafka provides for a stream of records

레코드 스트림을 제공하는 Kafka의 핵심 추상화

A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.

토픽은 레코드가 Publish(게시)되는 카테고리 또는 피드 이름이다. 항상 다중 Subscriber(구독자)이고, 작성된 데이터를 구독하는 consumer가 0, 1, 혹은 많을 수 있다.

For each topic, the Kafka cluster maintains a partitioned log.

각 TOPIC에 대해, 카프카 클러스터는 파티션으로 나누어진 Log로 유지한다.

TOPICS

Each partition is an ordered, immutable sequence of records that is continually appended to — a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.

각 파티션은 순서가 있고, 지속적으로 추가 되며 변하지 않는 코드 순서를 가진다. (a structured commit log)

파티션 안의 레코드는 파티션 내의 각 레코드를 고유하게 식별하는 offset이라는 순차 ID 번호가 각각 지정됩니다.

The Kafka cluster durably persists all published records — whether or not they have been consumed — using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size so storing data for a long time is not a problem.

카프카 클러스터는 구성 가능한 보존 기간을 사용해 publish된 레코드들이 consume이 되든 안되든, 모두 영속적으로 유지한다.

예를 들어, 이틀로 보존 기간이 설정되어 있을 때, publish 된 이후 이틀간은 consume할 수 있고 이후에는 사용 공간을 위해 폐기된다. Kafka의 성능은 데이터 크기와 관련하여 사실상 일정하므로 데이터를 오랫동안 저장하는 것은 문제가 되지 않는다.

consumer

In fact, the only metadata retained on a per-consumer basis is the offset or position of that consumer in the log. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads records, but, in fact, since the position is controlled by the consumer it can consume records in any order it likes. For example a consumer can reset to an older offset to reprocess data from the past or skip ahead to the most recent record and start consuming from "now".

실제로는 consumer 기준으로 유지되는 유일한 메타 데이터는 로그에서 해당 소비자의 offset 또는 위치이다.

이 offset은 consumer에 의해 제어된다: 일반적으로 consumer는 선형적으로 offset을 읽어드리지만, 실제로는 consumer에 의해 offset이 제어되므로 선호하는 순서로 consume할 수 있다.

예를 들어, 소비자는 과거의 데이터를 다시 처리하기 위해 오래된 offset으로 재설정하거나 가장 최근의 레코드로 건너 뛰고 “지금”부터 consume할 수도 있다.

This combination of features means that Kafka consumers are very cheap—they can come and go without much impact on the cluster or on other consumers. For example, you can use our command line tools to "tail" the contents of any topic without changing what is consumed by any existing consumers.

이러한 기능의 조합은 Kafka consumer가 매우 경제적이라는 것을 의미한다. 클러스터나 다른 consumer들에게 큰 영향을 미치지 않고 조작 할 수 있다. 예를 들어, 기존 consumer가 consume하는 것을 변경하지 않고, 커맨드 라인 도구의 “tail”을 사용하여 어떤 TOPIC의 내용을 알 수 있다.

본래 tail 명령어는 텍스트 또는 출력될 수 있는 파일의 끝 부분의 특정 열을 출력하는 명령어이지만 -f 옵션을 사용할 경우 특정 파일에 대해 크기가 커지는 만큼 추가 내용을 출력한다. 즉, 위 상황에서 추가된 로그를 출력한다.

The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second they act as the unit of parallelism—more on that in a bit.

로그의 파티션은 여러 목적으로 제공된다. 첫번째로, 하나의 서버에 맞는 크기 이상으로 확장할 수 있게 한다. 각 개별 파티션은 호스트하는 서버에 적합해야 하지만 TOPIC은 많은 파티션을 가지므로, 임의의 양의 데이터를 다룰 수 있다. 두번째로, 병렬단위인 것처럼 행동한다.

Distribution

The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.

로그의 파티션은 Kafka 클러스터의 서버를 통해 분할되며 각 서버는 데이터를 처리하고 파티션 공유에 대한 요청을 처리합니다. 각 파티션은 결함 감내를 위해, 구성 가능한 수의 서버에 복제됩니다.

Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.

각 파티션은 Leader 역할을 하는 서버와 Follower 역할을 하는 0개 이상의 서버로 구성된다. Leader는 Follower가 수동적으로 Leader를 복제할 동안에, 모든 read와 write 요청을 처리한다. 만약, Leader에 결함이 생기면 Follower중 하나가 자동적으로 새로운 리더가 된다. 각 서버는 일부 파티션의 Leader와 다른 파티션의 Follower의 역할을 한다. 따라서, 클러스터 내에서 balance는 매우 잘 맞다.

Geo-Replication

Kafka MirrorMaker provides geo-replication support for your clusters. With MirrorMaker, messages are replicated across multiple datacenters or cloud regions. You can use this in active/passive scenarios for backup and recovery; or in active/active scenarios to place data closer to your users, or support data locality requirements.

Kafka MirrorMaker는 클러스터에 지리적 복제 지원을 제공한다. MirrorMaker를 사용하면 메시지가 여러 데이터 센터 또는 클라우드 지역에 복제된다. 이것은 백업 및 복구를 위해 능동/수동 시나리오에서 사용할 수 있다. 또한 활성/활성 시나리오에서 데이터를 사용자 가까이에 배치하거나 데이터 지역성 요구 사항을 지원할 수 있습니다.

데이터베이스의 지리적 분할과 비슷한 것 같다.

Producers

Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the record). More on the use of partitioning in a second!

Producer는 선택한 Topic을 publish(게시) 한다. Producer는 Topic 내의 어떤 파티션에 할당할 레코드를 선택해야 한다. 이는 로드 균형을 맞추기 위해 라운드 로빈 방식으로 수행되거나 어떤 의미적 파티션 함수 (레코드의 일부 키를 기반으로 하는)에 따라 수행 될 수 있다. 두 번째 방식을 파티셔닝에 더 많이 사용한다!

Consumers

Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.

Consumer는 Consumer 그룹 이름으로 자신의 레이블을 지정하고, TOPIC에 Publish된 각 레코드는 각 Subscribing Consumer 그룹 내의 하나의 Consumer 인스턴스에 전달된다. Consumer 인스턴스는 별도의 프로세스 또는 별도의 시스템에 있을 수 있다.

풀어 말하면, Consumer는 소속된 그룹 이름으로 자신을 라벨링하고, TOPIC에 저장된 레코드는 이를 구독하는 Consumer 그룹 안에 하나의 Consumer 인스턴스로 전달되는 것이다.

If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.

만약, 모든 Consumer 인스턴스가 같은 Consumer 그룹을 가지면, 레코드는 Consumer 인스턴스들보다 더 효율적으로 Load Balance 된다.

If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.

만약, 모든 Consumer 인스턴스가 다른 Consumer 그룹을 가지면, 각 레코드는 모든 Consumer 프로세스들에게 Broadcast 할 것이다.

consumer

A two server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.

두 개의 Consumer 그룹이 있는 네 개의 파티션 (P0-P3)을 호스팅하는 두 대의 서버를 가진 Kafka 클러스터이다. Consumer 그룹 A에는 두 개의 Consumer 인스턴스가 있고 그룹 B에는 네 개의 인스턴스가 있다.

More commonly, however, we have found that topics have a small number of consumer groups, one for each "logical subscriber". Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.

그렇지만, 일반적으로, TOPIC이 각각의 “logical subscriber”에 대해 하나씩 적은 수의 Consumer 그룹을 갖는 것을 볼 수 있다. 각 그룹은 확장성 및 결함 감내를 위한 많은 Consumer 인스턴스로 구성됩니다. 이것은 Consumer가 단일 프로세스 대신, Consumer들의 클러스터인 publish-subscribe 형태 이상 이하도 아니다.

The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time. This process of maintaining membership in the group is handled by the Kafka protocol dynamically. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.

Kafka에서 Consumption이 구현되는 방식은 로그의 파티션을 Consumer 인스턴스로 나누어, 각 인스턴스가 어느 시점에서든 파티션의 “공정한 공유”를 독점사용자가 되게 하는 것이다. 그룹내의 구성원을 유지하는 이 과정은 Kafka 프로토콜에 의해 동적으로 처리된다. 만일 새 인스턴스가 그룹에 들어오면 그룹의 다른 구성원으로부터 일부 파티션을 넘겨 받는다; 만약 인스턴스가 죽으면, 해당 파티션은 남은 인스턴스들에게 분배될 것이다.

Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over records this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.

Kafka는 한 TOPIC의 다른 파티션 사이가 아닌, 한 파티션 내의 레코드에 대해서만 전체 순서를 제공한다.

대부분의 application에서는 키 단위로 데이터를 분할하는 기능과 함께 파티션 단위의 순서만으로 충분하다. 그러나, 레코드에 대한 전체 순서가 필요한 경우 이는 하나의 파티션만 있는 TOPIC만 수행 할 수 있다. 단, 이는 Consumer 그룹당 하나의 Consumer 프로세스를 의미한다.

Multi-tenancy (다중 차용)

You can deploy Kafka as a multi-tenant solution. Multi-tenancy is enabled by configuring which topics can produce or consume data. There is also operations support for quotas. Administrators can define and enforce quotas on requests to control the broker resources that are used by clients. For more information, see the security documentation.

Kafka를 Multi-tenancy 솔루션으로 배포 할 수 있다. Multi-tenancy는 데이터를 생성하거나 소비 할 수 있는 TOPIC을 구성하여 사용할 수 있다. 할당량에 대한 운영 지원도 있다. 관리자는 클라이언트가 사용하는 브로커 자원을 제어하라는 요청에 대해 할당량을 정의하고 시행 할 수 있다. 자세한 내용은 보안 설명서를 참조해라.

Guarantees

At a high-level Kafka gives the following guarantees

높은 레벨의 Kafka는 다음과 같은 보증을 제공한다.

Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.

Producer가 특정 TOPIC 파티션으로 보낸 메시지는 전송된 순서대로 추가된다. 즉, 레코드 M1이 레코드 M2와 동일한 Producer에 의해 보내지고 M1이 먼저 보내지면 M1은 M2보다 더 낮은 Offset을 가지며 로그에서 더 일찍 나타난다.

A consumer instance sees records in the order they are stored in the log. For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.

Consumer 인스턴스는 로그에 저장된 순서대로 레코드를 본다.

For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.

N개의 복제가 있는 TOPIC의 경우 로그에 커밋 된 레코드를 손실하지 않고 최대 N-1 개의 서버 오류를 허용합니다.

More details on these guarantees are given in the design section of the documentation.

보증에 대해 자세한 사항은 Design section의 문서에 나온다.

Kafka as a Messaging System

How does Kafka's notion of streams compare to a traditional enterprise messaging system?

Kafka의 스트림 개념은 기존 엔터프라이즈 메시징 시스템과 어떻게 비교 되는가?

Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers may read from a server and each record goes to one of them; in publish-subscribe the record is broadcast to all consumers. Each of these two models has a strength and a weakness. The strength of queuing is that it allows you to divide up the processing of data over multiple consumer instances, which lets you scale your processing. Unfortunately, queues aren't multi-subscriber—once one process reads the data it's gone. Publish-subscribe allows you broadcast data to multiple processes, but has no way of scaling processing since every message goes to every subscriber.

전통적 메시지 시스템은 두가지 모델이 있다: queuing 과 publish-subscribe. 큐의 경우, Consumer 풀은 서버에서 읽을 수 있으며 각 레코드는 그 중 하나에 저장한다; publish-subscribe의 경우, 레코드는 모든 Consumer에게 브로드캐스트 한다.

각 두가지 모델은 강점과 약점이 있다. queuing의 강점은 여러 Consumer 인스턴스에서 데이터 처리를 분할할 수 있으므로 처리 규모를 확장 할 수 있다. 아쉽게도, queue는 다중 subscriber가 아니다. 한번 한 프로세스가 읽은 데이터는 사라진다. Publish-subscribe는 여러 프로세스에 데이터를 브로드 캐스트 할 수 있지만 모든 메시지가 모든 Consumer에게 전달되므로 처리를 조정할 방법이 없다.

The consumer group concept in Kafka generalizes these two concepts. As with a queue the consumer group allows you to divide up processing over a collection of processes (the members of the consumer group). As with publish-subscribe, Kafka allows you to broadcast messages to multiple consumer groups.

Kafka의 Consumer 그룹 개념은 이 두 개념을 일반화 한다. queue에서와 마찬가지로 Consumer 그룹은 프로세스 모음(Consumer 그룹의 구성원)을 통해 처리를 나눌 수 있다. publish-subscribe처럼 Kafka는 여러 Consumer 그룹에 메세지를 브로드 캐스트 할 수 있다.

The advantage of Kafka's model is that every topic has both these properties—it can scale processing and is also multi-subscriber—there is no need to choose one or the other.

Kafka 모델의 장점은 모든 TOPIC이 이러한 속성을 모두 갖추고 있다는 것이다. 즉, 처리 규모를 조정할 수 있고 다중 Subscriber 이기도 하므로 둘 중 하나를 선택할 필요가 없다.

Kafka has stronger ordering guarantees than a traditional messaging system, too.

Kafka는 전통적인 메시징 시스템보다 강력한 순서 보증 또한 제공한다.

A traditional queue retains records in-order on the server, and if multiple consumers consume from the queue then the server hands out records in the order they are stored. However, although the server hands out records in order, the records are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the records is lost in the presence of parallel consumption. Messaging systems often work around this by having a notion of "exclusive consumer" that allows only one process to consume from a queue, but of course this means that there is no parallelism in processing.

전통적인 queue는 서버에서 순서대로 레코드를 보유하고, 여러 Consumer가 큐에서 Consume하는 경우 서버는 저장된 순서대로 레코드를 전달한다. 그러나 서버가 레코드를 순서대로 전달하더라도 레코드는 비동기적으로 소비자에게 전달되므로 서로 다른 소비자에게 순서가 맞지 않을 수 있다. 이것은 사실상 병렬 Consumption이 발생하면 레코드의 순서가 손실된다는 것을 의미한다. 메시징 시스템은 queue에서 하나의 프로세스만 사용할 수 있는 “독점 Consumer”라는 개념을 사용하여이 문제를 해결할 수 있지만, 처리 과정에서 병렬 처리가 없다는 것을 의미한다.

Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances in a consumer group than partitions.

Kafka는 이것을 더 잘한다. TOPIC 내에서 병렬 처리 개념(파티션)을 가짐으로써 Kafka는 Consumer 프로세스 풀에 대해 순서 보증과 로드 밸런싱을 모두 제공 할 수 있다. 이는 TOPIC의 파티션을 Consumer 그룹의 Consumer에게 할당하여 각 파티션이 그룹의 정확히 한 Consumer에 의해 Consume되도록 수행된다. 이렇게하면 Consumer가 해당 파티션의 유일한 독자이고 순서대로 데이터를 사용하게 된다. 파티션이 많으므로 많은 Consumer 인스턴스에서 로드 밸런싱을 유지한다. 그러나 Consumer 그룹에는 파티션보다 더 많은 Consumer 인스턴스가 있을 수 없다.

Kafka as a Storage System

Any message queue that allows publishing messages decoupled from consuming them is effectively acting as a storage system for the in-flight messages. What is different about Kafka is that it is a very good storage system.

분리된 Publishing 메시지가 메시지를 Consume하지 못하게 하는 메시지 queue는 사실상 메시지의 저장 시스템으로 작동한다. Kafka가 다른 점은 그것이 매우 훌륭한 저장 시스템이라는 것이다.

즉, 따로 Consume하지 않는 Publish 된 메세지 queue들은 저장 시스템으로서 동작한다.

Data written to Kafka is written to disk and replicated for fault-tolerance. Kafka allows producers to wait on acknowledgement so that a write isn't considered complete until it is fully replicated and guaranteed to persist even if the server written to fails.

Kafka에 기록 된 데이터는 디스크에 기록되고 결함 감내을 위해 복제된다. Kafka는 Producer가 승인을 기다리게 한다. 따라서 쓰기가 완료될 때까지 쓰기가 완료되지 않은 것으로 간주되어 서버가 실패한 경우에도 쓰기가 유지된다.

The disk structures Kafka uses scale well—Kafka will perform the same whether you have 50 KB or 50 TB of persistent data on the server.

scale well-Kafka를 사용하는 디스크 구조 Kafka는 서버에 50KB 또는 50TB의 영구 데이터를 가지고 있더라도 동일하게 수행한다.

As a result of taking storage seriously and allowing the clients to control their read position, you can think of Kafka as a kind of special purpose distributed filesystem dedicated to high-performance, low-latency commit log storage, replication, and propagation.

스토리지를 중요하게 생각하고 클라이언트가 읽기 위치를 제어할 수 있게 된 결과, Kafka는 고성능, 낮은 대기 시간의 커밋 로그 저장, 복제 및 전달 전용의 일종의 특수 목적의 분산 파일 시스템으로 생각할 수 있다.

For details about the Kafka's commit log storage and replication design, please read this page.

Kafka의 커밋 로그 저장 및 복제 설계에 대한 자세한 내용은 이 페이지를 참조하라.

Kafka for Stream Processing

It isn't enough to just read, write, and store streams of data, the purpose is to enable real-time processing of streams.

데이터 스트림을 읽고, 쓰고, 저장하는 것만으로는 충분하지 않다. 목적은 스트림의 실시간 처리를 가능하게하는 것이다.

In Kafka a stream processor is anything that takes continual streams of data from input topics, performs some processing on this input, and produces continual streams of data to output topics.

Kafka에서 스트림 프로세서는 입력 TOPIC에서 연속적인 데이터 스트림을 가져 와서 이 입력에 대한 일부 처리를 수행하고 TOPIC을 출력하기 위해 지속적인 데이터 스트림을 생성하는 모든 것이다.

For example, a retail application might take in input streams of sales and shipments, and output a stream of reorders and price adjustments computed off this data.

예를 들어, retail application은 판매 및 출하의 입력 스트림을 받아, 이 데이터에서 계산된 재주문 및 가격 조정 스트림을 출력 할 수 있다.

It is possible to do simple processing directly using the producer and consumer APIs. However for more complex transformations Kafka provides a fully integrated Streams API. This allows building applications that do non-trivial processing that compute aggregations off of streams or join streams together.

Producer와 Consumer API를 사용하여 직접 간단한 처리를 수행 할 수 있다. 그러나 보다 복잡한 변환의 경우 Kafka는 완전히 통합 된 Streams API를 제공한다. 따라서 스트림에서 집계를 계산하거나 스트림을 함께 결합하는 중요하지 않은 처리를 하는 Application을 작성할 수 있다.

This facility helps solve the hard problems this type of application faces: handling out-of-order data, reprocessing input as code changes, performing stateful computations, etc.

이 기능은 이 유형의 application이 직면 한 어려운 문제를 해결하는 데 도움이 된다. 즉, 순서가 잘못된 데이터 처리, 코드 변경 사항으로 입력 재 처리, 상태 계산 등이다.

The streams API builds on the core primitives Kafka provides: it uses the producer and consumer APIs for input, uses Kafka for stateful storage, and uses the same group mechanism for fault tolerance among the stream processor instances.

스트림 API는 Kafka가 제공하는 핵심 기본 요소를 기반으로 한다. 입력에 Producer와 Consumer API를 사용하고, 상태 저장을 위해 Kafka를 사용하며, 스트림 프로세서 인스턴스 간의 결함 감내를 위해 동일한 그룹 메커니즘을 사용한다.

Putting the Pieces Together

This combination of messaging, storage, and stream processing may seem unusual but it is essential to Kafka's role as a streaming platform.

메시징, 스토리지 및 스트림 처리의 이러한 결합은 드문 것처럼 보일 수 있지만 스트리밍 플랫폼으로서의 카프카의 역할에 필수적이다.

A distributed file system like HDFS allows storing static files for batch processing. Effectively a system like this allows storing and processing historical data from the past.

HDFS와 같은 분산 파일 시스템을 사용하면 일괄 처리를 위해 정적 파일을 저장할 수 있다. 사실상 이와 같은 시스템을 사용하면 과거의 기록 데이터를 저장하고 처리 할 수 있다.

A traditional enterprise messaging system allows processing future messages that will arrive after you subscribe. Applications built in this way process future data as it arrives.

기존 엔터프라이즈 메시징 시스템을 사용하면 가입 한 후에 도착할 향후 메시지를 처리 할 수 있다. 이런 식으로 작성된 application은 도착하는대로 미래의 데이터를 처리한다.

Kafka combines both of these capabilities, and the combination is critical both for Kafka usage as a platform for streaming applications as well as for streaming data pipelines.

Kafka는이 두 가지 기능을 모두 갖추고 있으며 스트리밍 application과 스트리밍 데이터 파이프 라인의 플랫폼으로 Kafka를 사용하는 데 있어 이 두 가지 기능이 모두 중요하다.

By combining storage and low-latency subscriptions, streaming applications can treat both past and future data the same way. That is a single application can process historical, stored data but rather than ending when it reaches the last record it can keep processing as future data arrives. This is a generalized notion of stream processing that subsumes batch processing as well as message-driven applications.

스토리지 및 대기 시간이 짧은 Subscription을 결합하여 스트리밍 application은 과거 및 미래 데이터를 동일한 방식으로 처리 할 수 있다. 즉, 단일 application에서 기록 된 저장된 데이터를 처리 할 수 있지만, 마지막 레코드에 도달 할 때 종료하지 않고 이후 데이터가 도착할 때 처리를 유지할 수 있다. 이는 메시지 처리 응용 프로그램뿐만 아니라 일괄 처리를 포함하는 스트림 처리의 일반화 된 개념이다.

Likewise for streaming data pipelines the combination of subscription to real-time events make it possible to use Kafka for very low-latency pipelines; but the ability to store data reliably make it possible to use it for critical data where the delivery of data must be guaranteed or for integration with offline systems that load data only periodically or may go down for extended periods of time for maintenance. The stream processing facilities make it possible to transform data as it arrives.

마찬가지로 스트리밍 데이터 파이프 라인의 경우 실시간 이벤트에 가입하면 매우 짧은 지연 시간의 파이프 라인에 Kafka를 사용할 수 있다. 데이터를 안정적으로 저장하는 기능은 데이터 전달을 보장해야하는 중요한 데이터 또는 주기적으로 데이터를로드하는 오프라인 시스템과의 통합을 위해 사용하거나 유지 관리를 위해 오랜 기간 동안 중단 될 수 있다. 스트림 처리 설비는 도착하는대로 데이터를 변환 할 수있게 한다.

For more information on the guarantees, APIs, and capabilities Kafka provides see the rest of the documentation.

Kafka가 제공하는 보증, API 및 기능에 대한 자세한 내용은 나머지 설명서를 참조하라.

이렇게 문서 해석을 마쳤다. 다음엔 내가 받은 면접 질문에 대해 하나씩 알아보도록 하겠다.