Streamlining Data Ingestion with Pulsar and Kafka Connectors
Integrate Apache Pulsar with Kafka Connect to build flexible and scalable data ingestion pipelines
Modern data systems rely on ingesting large volumes of data from diverse sources—databases, logs, cloud services, and sensors. To handle this complexity at scale, tools like Apache Pulsar and Kafka Connect provide extensible frameworks for streamlined, real-time ingestion.
In this post, you’ll learn how to use Apache Pulsar with Kafka Connect to build modular, scalable data ingestion pipelines that combine Kafka Connect’s rich connector ecosystem with Pulsar’s cloud-native messaging architecture.
Why Combine Pulsar with Kafka Connect?
While Pulsar has its own native connector framework (Pulsar IO), integrating it with Kafka Connect offers:
- Access to 100+ pre-built connectors (Debezium, JDBC, Elasticsearch, etc.)
- Easy integration with enterprise systems and cloud services
- Consistent data pipeline architecture
- Minimal custom code
This hybrid approach enables organizations to ingest from virtually any source directly into Pulsar topics.
Architecture Overview
[Source Systems (DB, REST, MQ)]
↓
[Kafka Connect with Source Connectors]
↓
[Apache Pulsar Sink Connector]
↓
[Pulsar Topics → Stream Processing / Analytics / ML]
You can also do the reverse—sink data from Pulsar into other systems using Kafka-compatible sink connectors.
Setting Up Kafka Connect with Pulsar Sink
Apache Pulsar integrates with Kafka Connect through its Kafka Connect adapter: a sink connector that lets a Kafka Connect cluster write records directly into Pulsar topics.
Step 1: Deploy Kafka Connect (standalone or distributed)
bin/connect-distributed.sh config/connect-distributed.properties
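If you don’t already have a worker config, a minimal connect-distributed.properties looks roughly like this. These are standard Kafka Connect worker properties, but the hosts, group ID, and paths are placeholders to adapt:
# connect-distributed.properties (sketch; adapt hosts and names)
# Kafka cluster backing Connect's internal topics
bootstrap.servers=localhost:9092
# Unique ID for this Connect cluster
group.id=pulsar-connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# Internal topics for offsets, connector configs, and status
offset.storage.topic=connect-offsets
config.storage.topic=connect-configs
status.storage.topic=connect-status
# Directory scanned for connector plugin JARs
plugin.path=/opt/kafka/plugins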
Step 2: Install the Pulsar Sink Connector JAR
Download from: https://github.com/streamnative/pulsar-kafka-connect/releases
Place it in the Kafka Connect plugins path.
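The “plugins path” is whatever plugin.path points to in your worker properties. A sketch, assuming the /opt/kafka/plugins directory from the worker config above (the JAR file name pattern is illustrative; use the name of the release you downloaded):
mkdir -p /opt/kafka/plugins/pulsar-kafka-connect
cp pulsar-kafka-connect-*.jar /opt/kafka/plugins/pulsar-kafka-connect/
# Restart the Connect worker so it rescans the plugin path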
Step 3: Define your connector config
{
  "name": "jdbc-pulsar-sink",
  "config": {
    "connector.class": "org.apache.pulsar.io.kafka.connect.PulsarSinkConnector",
    "topics": "source-topic",
    "pulsar.service.url": "pulsar://localhost:6650",
    "pulsar.topic": "persistent://public/default/db-stream",
    "tasks.max": "1"
  }
}
Submit it:
curl -X POST -H "Content-Type: application/json" \
--data @jdbc-pulsar-sink.json \
http://localhost:8083/connectors
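Kafka Connect’s REST API lets you confirm the deployment; these endpoints are part of the standard API:
# List all registered connectors
curl http://localhost:8083/connectors
# Inspect this connector's state and its tasks
curl http://localhost:8083/connectors/jdbc-pulsar-sink/status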
Ingesting from Relational Databases (Debezium + Pulsar)
Use the Debezium source connector to capture change data capture (CDC) events:
- Configure Debezium (MySQL/Postgres) as Kafka Connect source
- Set Pulsar as sink topic via Pulsar Sink Connector
- CDC events from your DB stream into Pulsar
[MySQL DB] → [Debezium Source] → [Kafka Connect] → [Pulsar Sink] → [Pulsar Topic]
This enables real-time ingestion and replication without writing custom integration code.
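As a sketch, registering a Debezium MySQL source might look like the following (Debezium 2.x property names; the hostname, credentials, and database names are placeholders):
{
  "name": "mysql-cdc-source",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql.internal",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "<secret>",
    "database.server.id": "184054",
    "topic.prefix": "dbserver1",
    "database.include.list": "inventory",
    "schema.history.internal.kafka.bootstrap.servers": "localhost:9092",
    "schema.history.internal.kafka.topic": "schema-changes.inventory",
    "tasks.max": "1"
  }
}
Debezium writes change events to topics named <topic.prefix>.<database>.<table> (e.g., dbserver1.inventory.customers); list those topics in the Pulsar sink’s topics property to complete the pipeline.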
Ingesting from Cloud Services
With Kafka Connect’s cloud connectors, you can stream data from:
- Amazon S3
- Google Pub/Sub
- Azure Blob Storage
- Salesforce, SAP, and more
Using the same Pulsar sink connector setup, all ingested events land in Pulsar topics ready for stream processing.
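The Pulsar side of the pipeline does not change: register the cloud source connector of your choice, then point a second instance of the Pulsar sink at its output topics. A sketch, with the connector name and topic names as placeholders:
{
  "name": "s3-pulsar-sink",
  "config": {
    "connector.class": "org.apache.pulsar.io.kafka.connect.PulsarSinkConnector",
    "topics": "s3-events",
    "pulsar.service.url": "pulsar://localhost:6650",
    "pulsar.topic": "persistent://public/default/s3-stream",
    "tasks.max": "2"
  }
}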
Pulsar IO vs Kafka Connect
| Feature | Pulsar IO | Kafka Connect Integration |
|---|---|---|
| Connector library | Smaller but growing | 100+ mature connectors available |
| Deployment | Integrated into the Pulsar cluster | External Kafka Connect service |
| Ecosystem | Pulsar-native | Kafka-native (requires adaptation) |
| Use case fit | Pulsar-first architectures | Hybrid / migration / interop scenarios |
Use Pulsar IO for Pulsar-native solutions; use the Kafka Connect integration when you need to reuse existing Kafka pipelines or connectors.
Monitoring and Observability
- Use Pulsar Admin to monitor sink topic throughput
- Monitor Kafka Connect REST API for connector status
- Add Prometheus exporters for Pulsar and Connect metrics
- Visualize with Grafana dashboards
Watch for:
- Lag in source connectors
- Pulsar topic write latency
- Connector failures or restarts
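A few concrete checks (the pulsar-admin command and the Connect status endpoint are standard; the connector and topic names match the examples above):
# Pulsar: per-topic throughput, backlog, and storage stats
pulsar-admin topics stats persistent://public/default/db-stream

# Kafka Connect: connector and task state (FAILED tasks include a stack trace)
curl http://localhost:8083/connectors/jdbc-pulsar-sink/status

# Pulsar brokers expose Prometheus metrics on their HTTP service port
curl http://localhost:8080/metrics/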
Best Practices
- Use a schema registry (e.g., Confluent Schema Registry or Apicurio Registry) to manage schema evolution
- Align message formats across systems (e.g., JSON, Avro)
- Use topic naming conventions to organize pipelines
- Deploy connectors as containers for better portability
- Enable dead-letter topics to handle transformation or delivery failures
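On the last point, Kafka Connect provides framework-level error handling for sink connectors, so no custom code is needed. Adding these standard properties to the "config" block of the sink shown earlier routes failed records to a dead-letter topic (the DLQ topic name is a placeholder):
"errors.tolerance": "all",
"errors.deadletterqueue.topic.name": "dlq-jdbc-pulsar-sink",
"errors.deadletterqueue.topic.replication.factor": "1",
"errors.deadletterqueue.context.headers.enable": "true"
The dead-letter queue is a Kafka topic in the Connect cluster’s backing Kafka, which you can monitor or replay separately.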
Conclusion
By integrating Apache Pulsar with Kafka Connect, you can rapidly build end-to-end, real-time ingestion pipelines that pull from virtually any data source into Pulsar’s scalable messaging backbone. This combination offers the best of both worlds: the extensive Kafka connector ecosystem and the cloud-native, multi-tenant architecture of Pulsar.
Whether you’re modernizing legacy ingestion flows or building cloud-native data platforms, this approach accelerates development and streamlines operations.