Introduction
Building and testing a real-time fraud detection application requires a continuous stream of realistic data. But generating that data can be a challenge. That's why we recently created the Datagen CLI, a simple tool that helps you create believable fake data using the FakerJS API.
In this blog post, we'll explore how to use the Datagen CLI to simulate a streaming data use case like a fraud detection app. We'll show you how to install and configure the tool, create a schema for the data, and send the data to a Kafka topic. By the end of this tutorial, you'll be able to generate your own realistic streaming data for testing and development purposes.
Prerequisites
- Basic knowledge of Kafka and streaming data
- Node.js and npm installed on your machine
- A Kafka cluster set up and running
Installation and Setup
First, install the Datagen CLI using npm:
npm install -g @materializeinc/datagen
Create a .env file in your working directory with the necessary Kafka and Schema Registry environment variables. Replace the placeholder values with your actual settings:
# Kafka Brokers
KAFKA_BROKERS=
# For Kafka SASL Authentication:
SASL_USERNAME=
SASL_PASSWORD=
SASL_MECHANISM=
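Before running the tool, it can help to check which settings in the .env file are still empty. The following is a minimal Python sketch, not part of the Datagen CLI: parse_env and unset_keys are hypothetical helpers, the parser assumes plain KEY=VALUE lines with # comments, and localhost:9092 is only a placeholder broker address.

```python
def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    env = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def unset_keys(env: dict) -> list:
    """Return the keys whose values are still empty, in file order."""
    return [key for key, value in env.items() if not value]


# Placeholder content: only the broker address has been filled in.
sample = """# Kafka Brokers
KAFKA_BROKERS=localhost:9092
# For Kafka SASL Authentication:
SASL_USERNAME=
SASL_PASSWORD=
SASL_MECHANISM=
"""
print(unset_keys(parse_env(sample)))
```

Whether the SASL values need to be set depends on how your cluster handles authentication.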
Creating a Schema for Fraud Detection Data
To generate realistic data for a fraud detection app, we need to define a schema that includes relevant fields like transaction ID, user ID, timestamp, and transaction amount. Let's create a JSON schema called transactions.json:
[
    {
        "_meta": {
            "topic": "transactions"
        },
        "transaction_id": "faker.datatype.uuid()",
        "user_id": "faker.datatype.number({min: 1, max: 10000})",
        "timestamp": "faker.date.between('2023-01-01', '2023-12-31')",
        "amount": "faker.finance.amount(0, 10000, 2)",
        "is_fraud": "faker.datatype.boolean()"
    }
]
This schema generates a stream of transaction data with random transaction IDs, user IDs, timestamps, and amounts. We've also added a field called is_fraud that randomly marks transactions as fraudulent.
Generating and Sending Data to Kafka
Now that we have our schema, we can use the Datagen CLI to generate data and send it to a Kafka topic. Use the following command to generate an infinite stream of transactions in JSON format:
datagen \
    -s transactions.json \
    -f json \
    -n -1 \
    -dr
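The -s flag points to the schema file and -f sets the output format. The -n flag specifies the number of messages to generate; we've set it to -1 to generate an infinite stream of data. The -dr flag enables dry run mode, which prints the data to the console instead of sending it to Kafka. This is useful for testing and debugging. Once the output looks right, remove -dr to produce the records to your Kafka cluster.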
Example output:
✔ Dry run: Skipping record production...
Topic: transactions
Record key: null
Payload: {"transaction_id":"b86d1d57-a650-4680-843d-06179f1c4c2e","user_id":5127,"timestamp":"2023-09-02T03:26:28.194Z","amount":"6904.40","is_fraud":false}
✔ Dry run: Skipping record production...
Topic: transactions
Record key: null
Payload: {"transaction_id":"719fe62a-322c-4b58-89f9-e380e2f3552d","user_id":2757,"timestamp":"2023-09-30T06:40:37.378Z","amount":"3375.15","is_fraud":true}
Press Ctrl+C to stop producing data.
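Note that in the example output the amount arrives as a string and the timestamp as an ISO 8601 string, so a consumer will usually want to convert them before doing any arithmetic. Here is a minimal Python sketch of that conversion. It is not part of the Datagen CLI: parse_transaction is a hypothetical helper, and it assumes payloads shaped like the ones printed above.

```python
import json
from datetime import datetime
from decimal import Decimal


def parse_transaction(payload: str) -> dict:
    """Convert one JSON payload into typed Python values."""
    record = json.loads(payload)
    return {
        "transaction_id": record["transaction_id"],
        "user_id": int(record["user_id"]),
        # Swap the trailing "Z" for an explicit UTC offset so that
        # datetime.fromisoformat also accepts it on older Python 3 versions.
        "timestamp": datetime.fromisoformat(record["timestamp"].replace("Z", "+00:00")),
        # Decimal avoids floating-point rounding on monetary values.
        "amount": Decimal(record["amount"]),
        "is_fraud": bool(record["is_fraud"]),
    }


# Payload copied from the example output above.
payload = (
    '{"transaction_id":"b86d1d57-a650-4680-843d-06179f1c4c2e","user_id":5127,'
    '"timestamp":"2023-09-02T03:26:28.194Z","amount":"6904.40","is_fraud":false}'
)
tx = parse_transaction(payload)
print(tx["amount"], tx["timestamp"].isoformat())
```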
Enriching the Data
To enrich the datagen schema example, we can add more fields related to geolocation and other attributes that can be useful for fraud detection. Update the JSON input schema as follows:
[
    {
        "_meta": {
            "topic": "transactions",
            "key": "id"
        },
        "id": "faker.datatype.uuid()",
        "user_id": "faker.datatype.number({min: 1, max: 1000})",
        "amount": "faker.finance.amount(1, 5000, 2)",
        "currency": "faker.finance.currencyCode()",
        "timestamp": "faker.date.past(1, '2023-01-01').getTime()",
        "is_fraud": "faker.datatype.boolean({likelihood: 5})",
        "ip_address": "faker.internet.ip()",
        "location": {
            "latitude": "faker.address.latitude()",
            "longitude": "faker.address.longitude()"
        },
        "device": {
            "id": "faker.datatype.uuid()",
            "type": "faker.helpers.arrayElement(['mobile', 'tablet', 'desktop'])",
            "os": "faker.helpers.arrayElement(['ios', 'android', 'windows', 'macos', 'linux', 'other'])"
        },
        "merchant_id": "faker.datatype.number({min: 1, max: 500})"
    }
]
In this enriched schema, we've added:
- ip_address: An IP address related to the transaction.
- location: An object containing latitude and longitude.
- device: An object containing device information such as ID, type, and operating system.
- merchant_id: A unique ID representing the merchant involved in the transaction.
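One way the location field can feed a fraud check is a distance comparison between two transactions from the same user. The following is a minimal Python sketch, not part of the Datagen CLI: distance_km is a hypothetical helper, and the haversine formula with a mean Earth radius of 6371 km is an assumed choice for illustration. The coordinates pass through float(), so the helper works whether they are serialized as strings or as numbers.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant for this sketch


def distance_km(loc_a: dict, loc_b: dict) -> float:
    """Great-circle distance between two location objects (haversine formula)."""
    lat1 = math.radians(float(loc_a["latitude"]))
    lon1 = math.radians(float(loc_a["longitude"]))
    lat2 = math.radians(float(loc_b["latitude"]))
    lon2 = math.radians(float(loc_b["longitude"]))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Two hypothetical points on the equator, one degree of longitude apart.
here = {"latitude": "0.0", "longitude": "0.0"}
there = {"latitude": "0.0", "longitude": "1.0"}
print(round(distance_km(here, there), 1))
```

A large distance covered in a short time could be treated as a signal worth reviewing. Keep in mind that the generated coordinates are random, so most pairs of records will look far apart.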
Relationship between Transactions and Users
The Datagen CLI can also generate data for related entities.
For example, we can extend the schema to include a users topic that contains user information. We can then use the user_id field in the transactions topic to join the two topics together:
[
    {
        "_meta": {
            "topic": "users",
            "key": "id",
            "relationships": [
                {
                    "topic": "transactions",
                    "parent_field": "id",
                    "child_field": "user_id",
                    "records_per": 10
                }
            ]
        },
        "id": "faker.datatype.number({min: 1, max: 1000})",
        "name": "faker.name.fullName()",
        "email": "faker.internet.email()",
        "registered_at": "faker.date.past(5, '2023-01-01').getTime()"
    },
    {
        "_meta": {
            "topic": "transactions",
            "key": "id"
        },
        "id": "faker.datatype.uuid()",
        "user_id": "faker.datatype.number(100)",
        "amount": "faker.finance.amount(1, 5000, 2)",
        "currency": "faker.finance.currencyCode()",
        "timestamp": "faker.date.between('relationship.registered_at', new Date()).getTime()",
        "is_fraud": "faker.datatype.boolean({likelihood: 5})",
        "ip_address": "faker.internet.ip()",
        "location": {
            "latitude": "faker.address.latitude()",
            "longitude": "faker.address.longitude()"
        },
        "device": {
            "id": "faker.datatype.uuid()",
            "type": "faker.helpers.arrayElement(['mobile', 'tablet', 'desktop'])",
            "os": "faker.helpers.arrayElement(['ios', 'android', 'windows', 'macos', 'linux', 'other'])"
        },
        "merchant_id": "faker.datatype.number({min: 1, max: 500})"
    }
]
The data will be produced to the users and transactions topics. Each record in the transactions topic contains a user_id field that references the id field of a record in the users topic. The timestamp expression in the transactions schema refers to the registered_at field of the users topic, so that a transaction is meant to be dated after its user registered; registered_at itself is not copied into the transactions records.
An example of the data produced:
...
Topic: users
Record key: 602
Payload: {"id":602,"name":"Mr. Jennie Prohaska","email":"[email protected]","registered_at":1591058898886}
Topic: transactions
Record key: 417f6a6b-d7c5-47a0-a013-93f2aee94941
Payload: {"user_id":602,"id":"417f6a6b-d7c5-47a0-a013-93f2aee94941","amount":"1760.29","currency":"MZN","timestamp":1680516946423,"is_fraud":true,"ip_address":"240.254.28.18","location":{"latitude":"60.3920","longitude":"11.9718"},"device":{"id":"1e57d9d2-de48-4bf2-9131-e70ceb5f4fee","type":"mobile","os":"macos"},"merchant_id":20}
...
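The relationship between the two topics can be illustrated with a small in-memory join. This is a Python sketch, not something the Datagen CLI provides: join_on_user is a hypothetical helper, and the sample records are trimmed copies of the example output above.

```python
def join_on_user(users: list, transactions: list) -> list:
    """Attach user details to each transaction whose user_id matches a user id."""
    users_by_id = {user["id"]: user for user in users}
    joined = []
    for tx in transactions:
        user = users_by_id.get(tx["user_id"])
        if user is None:
            continue  # no matching user record seen yet
        joined.append({
            **tx,
            "user_name": user["name"],
            "registered_at": user["registered_at"],
        })
    return joined


# Trimmed copies of the records in the example output.
users = [{"id": 602, "name": "Mr. Jennie Prohaska", "registered_at": 1591058898886}]
transactions = [{
    "user_id": 602,
    "id": "417f6a6b-d7c5-47a0-a013-93f2aee94941",
    "amount": "1760.29",
    "timestamp": 1680516946423,
}]

for row in join_on_user(users, transactions):
    # The transaction should be dated after the user registered.
    print(row["id"], row["user_name"], row["timestamp"] >= row["registered_at"])
```

In a streaming setup the same join would typically be expressed in the stream processor or database rather than in application code.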
Testing Your Fraud Detection App
With realistic streaming data available, you can now test your fraud detection app using the data generated by the Datagen CLI. Consume the data from the transactions Kafka topic and implement your fraud detection logic, which may involve analyzing transaction patterns, comparing with historical data, or applying machine learning models.
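As one example of analyzing transaction patterns, a simple velocity rule flags a user who makes too many transactions within a short window. The following is a minimal Python sketch under stated assumptions: VelocityRule, its thresholds, and the rule itself are hypothetical and only for illustration, timestamps are epoch milliseconds as in the enriched schema, and records are assumed to arrive roughly in timestamp order (randomly generated timestamps will not be).

```python
from collections import defaultdict, deque


class VelocityRule:
    """Flag a user with more than max_count transactions within window_ms."""

    def __init__(self, max_count: int = 3, window_ms: int = 60_000):
        self.max_count = max_count
        self.window_ms = window_ms
        self._recent = defaultdict(deque)  # user_id -> recent timestamps

    def is_suspicious(self, tx: dict) -> bool:
        window = self._recent[tx["user_id"]]
        window.append(tx["timestamp"])
        # Drop timestamps that have fallen out of the sliding window.
        while window and tx["timestamp"] - window[0] > self.window_ms:
            window.popleft()
        return len(window) > self.max_count


rule = VelocityRule(max_count=3, window_ms=60_000)
# Four transactions from one user within a few seconds: only the fourth is flagged.
flags = [
    rule.is_suspicious({"user_id": 602, "timestamp": ts})
    for ts in (0, 1_000, 2_000, 3_000)
]
print(flags)
```

In a real consumer, each message read from the transactions topic would be parsed and passed to is_suspicious; the generated is_fraud field can then serve as a label to compare against, keeping in mind that it is assigned at random.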
As a next step, you can use Materialize to create a materialized view of the transactions data. This will allow you to query the data in real time and build a fraud detection dashboard.
Conclusion
The Datagen CLI is a simple tool for generating realistic streaming data for testing and development purposes. In this tutorial, we showed how to use the Datagen CLI to generate data for a fraud detection use case. With this knowledge, you can create your own schemas and generate data for various streaming data applications.