What is CDC (Change Data Capture)?
- Change Data Capture (CDC) is the process of observing changes in a Database and making them available to other systems, typically as a stream of events. For example, as data is written to the Database, you can detect the change events and use them to keep a search index up to date.
- CDC does this by detecting row-level changes in Database source tables, which are characterized as Insert, Update, and Delete events. It then notifies any other systems or services that rely on the same data. Change notifications are delivered in the same order in which the changes were made in the Database. As a result, CDC guarantees that all parties interested in a given data set are reliably notified of each change and can respond appropriately, either by refreshing their own copy of the data or by triggering business processes.
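To make this concrete, here is a minimal Python sketch of how a downstream system might apply an ordered stream of row-level change events to its own copy of the data. The event shape (op, key, after) is a simplified assumption for illustration; real CDC tools each define their own event format.

```python
# A simplified stand-in for the row-level change events a CDC tool emits.
# Real formats (e.g., Debezium envelopes) carry more metadata.
change_events = [
    {"op": "insert", "key": 1, "after": {"id": 1, "name": "Alice"}},
    {"op": "update", "key": 1, "after": {"id": 1, "name": "Alicia"}},
    {"op": "insert", "key": 2, "after": {"id": 2, "name": "Bob"}},
    {"op": "delete", "key": 2, "after": None},
]

replica = {}  # the downstream system's copy, e.g., a search index

# Events must be applied in the order they occurred in the source Database;
# replaying them out of order would leave the replica inconsistent.
for event in change_events:
    if event["op"] in ("insert", "update"):
        replica[event["key"]] = event["after"]
    elif event["op"] == "delete":
        replica.pop(event["key"], None)

print(replica)  # {1: {'id': 1, 'name': 'Alicia'}}
```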
Importance of Kafka CDC
- Change Data Capture refers to a collection of techniques that let you detect and record data that has changed in your Database so that you can act on it later. CDC can help you streamline and optimize your data and application infrastructures.
- Enterprises turn to Change Data Capture when an Apache Kafka system requires continuous, real-time data ingestion from corporate Databases.
The following are the main reasons why Kafka CDC is superior to other methods:
1. Kafka is a messaging system that lets you process events and deliver data to applications in real time. Kafka CDC turns Databases into streaming data sources, delivering new transactions to Kafka as they happen rather than batching them and causing delays for Kafka consumers.
2. When performed non-intrusively by reading the Database's redo or transaction logs, Kafka CDC has minimal impact on source systems. Log-based Kafka CDC avoids degrading the performance of, or modifying, your production sources.
3. Kafka CDC makes better use of your network bandwidth by continually transferring only the changed data rather than vast quantities of data in batches.
4. When you transfer changed data continuously rather than relying on Database snapshots, you capture what happened in the periods between snapshots. This granular data flow enables downstream Analytics systems to produce richer, more accurate insights.
Types of Kafka CDC
Kafka CDC allows you to capture everything currently in the Database as well as any fresh data changes. There are two types of Kafka CDC:
1) Query-Based Kafka CDC: Query-based Kafka CDC pulls fresh data from the Database using a Database query. The query includes a predicate to determine what has changed, based on a timestamp column, an incrementing identifier column, or both. Query-based Kafka CDC is provided by the JDBC connector for Kafka Connect, which is offered as a fully managed service in Confluent or as a self-managed connector (a hedged configuration sketch follows this list).
2) Log-Based Kafka CDC: Log-based Kafka CDC leverages the Database's transaction log to extract the details of every modification performed. The implementation and specifics of the transaction log differ from Database to Database, but all are built on the same concept: the transaction log records every modification made to the Database.
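To illustrate the query-based approach, here is a minimal sketch that registers the Confluent JDBC source connector with a self-managed Kafka Connect worker through its REST API. The connector class is real; the connection URL, table, column names, and topic prefix are placeholder assumptions you would replace with your own values.

```python
import requests  # assumes the requests package is installed

connector = {
    "name": "jdbc-source-customers",  # hypothetical connector name
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        # Hypothetical source Database and table.
        "connection.url": "jdbc:postgresql://db-host:5432/shop",
        "connection.user": "cdc_user",
        "connection.password": "********",
        "table.whitelist": "customers",
        # Detect changes using both an incrementing ID and a timestamp column.
        "mode": "timestamp+incrementing",
        "incrementing.column.name": "id",
        "timestamp.column.name": "updated_at",
        "topic.prefix": "jdbc-",  # topics become e.g. jdbc-customers
        "poll.interval.ms": "5000",
    },
}

# Register the connector with the Kafka Connect worker's REST API
# (assumed here to run on localhost:8083, the default port).
resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
print(resp.json())
```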
Kafka CDC Using the IBM InfoSphere Data Replication Management Tool
Prerequisites: IBM InfoSphere Data Replication Management Console, Confluent Cloud
Follow these steps to configure Kafka CDC using the IBM InfoSphere Data Replication tool:
1. Search for the Management Console in the Start Menu.
2. Log in to the Management Console.
3. Go to Access Manager.
4. Right-click on the empty row area of Datastore Management to open the option for creating a new Datastore.
5. Add the Datastore details and ping the Datastore to verify they are correct.
6. Add the connection parameters for the Datastore.
7. The new Datastore now appears. In this example, the target Datastore has been configured.
8. Assign the Datastore to the User.
9. After assigning the Datastore to the User, go to Configuration; the Datastore is shown at the bottom of the frame. Right-click on the newly created Datastore and connect.
10. Add another Datastore as the source Datastore, assign it to the User, and connect to it.
11. Right-click on the project and click New Subscription.
12. Add the name, description, and source and target Datastores for the new subscription.
13. Click OK; a prompt box opens with the option to map tables.
14. Click Yes to check for the source's table mappings.
15. An option for Kafka mapping appears; select Multiple Kafka Mappings and click Next.
16. Multiple databases are then listed.
17. Click Specify Filter to find your database.
18. After clicking OK in the Specify Filter dialog, the matching database is shown.
19. Clicking the database shows the tables in that database.
20. Enter the name of the table in the search box; the matching table name appears.
21. Check the table and click Next.
22. After clicking Next, the table mapping starts.
23. You will then be able to see the mapped table.
24. Double-clicking the mapped table opens another panel showing the recent mapping.
25. Clicking New Derived Column gives you the option to add a new column that combines multiple source columns for use in the target.
26. You can add an expression for the derived field, if needed.
27. After adding an expression, click Verify to check whether the expression is valid.
28. You can then confirm that the field has been added to the column mapping.
29. Next, right-click on the subscription and click Kafka Properties.
30. Add the configuration for the Kafka properties and click OK.
31. In the same way, add the User Exit properties.
32. After adding the User Exit properties, you can start mirroring, which starts the subscription process.
33. After mirroring completes, you can view the data via Confluent. Log in to Confluent Cloud and view the topic from the beginning; the topic name follows the topic prefix you added in the Kafka properties configuration. A consumer sketch for verifying the data is shown below.
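As a hedged illustration of this final verification step, the following Python sketch uses the confluent-kafka client to read a topic from the beginning on Confluent Cloud. The bootstrap server, API key and secret, and topic name are placeholders; the actual topic name is derived from the topic prefix configured in step 30.

```python
from confluent_kafka import Consumer

# Confluent Cloud connection settings; the endpoint and credentials below
# are placeholders for your own cluster's values.
consumer = Consumer({
    "bootstrap.servers": "pkc-xxxxx.us-east-1.aws.confluent.cloud:9092",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<API_KEY>",
    "sasl.password": "<API_SECRET>",
    "group.id": "cdc-verification",
    "auto.offset.reset": "earliest",  # read the topic from the beginning
})

# Hypothetical topic name; the real one follows the topic prefix you set
# in the subscription's Kafka properties.
consumer.subscribe(["myprefix.SOURCEDB.CUSTOMERS"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        print(msg.key(), msg.value())
finally:
    consumer.close()
```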
Conclusion:
I hope this post gives you an idea of what Change Data Capture is, why Kafka CDC matters, and how to configure it with Kafka and the IBM Management Console. Let me know if you have any questions.