Tesey DataFlow lets you process dataflows in both batch and streaming modes on any execution engine supported by Apache Beam, including Apache Spark, Apache Flink, Apache Samza, etc.
Clone the repository and install the package with:

```sh
mvn clean install
```
- Describe the endpoints used to process data in `endpoints.yaml`, similar to the following:
```yaml
endpoints:
  - name: authorization
    type: kafka
    schemaPath: avro/authorization.avsc
    format: avro
    options:
      - name: topic
        value: authorization
      - name: bootstrapServers
        value: kafka-cp-kafka-headless:9092
  - name: transaction
    type: kafka
    schemaPath: avro/transaction.avsc
    format: avro
    options:
      - name: topic
        value: transaction
      - name: bootstrapServers
        value: kafka-cp-kafka-headless:9092
  - name: summary
    type: kafka
    schemaPath: avro/summary.avsc
    format: avro
    options:
      - name: topic
        value: summary
      - name: bootstrapServers
        value: kafka-cp-kafka-headless:9092
  - name: customer
    type: kafka
    schemaPath: avro/customer.avsc
    format: avro
    options:
      - name: topic
        value: customer
      - name: bootstrapServers
        value: kafka-cp-kafka-headless:9092
  - name: report
    type: kafka
    schemaPath: avro/report.avsc
    format: avro
    options:
      - name: topic
        value: report
      - name: bootstrapServers
        value: kafka-cp-kafka-headless:9092
  - name: groupedAuthorizationsByCnum
    type: kafka
    schemaPath: avro/groupedAuthorizationsByCnum.avsc
    format: avro
    options:
      - name: topic
        value: groupedAuthorizationsByCnum
      - name: bootstrapServers
        value: kafka-cp-kafka-headless:9092
```

- Describe the dataflows that should be processed in `application.yaml`, similar to the following:
```yaml
dataflows:
  - name: authorizationStream
    source: authorization
    isFirst: true
    window: 60000
  - name: transactionStream
    source: transaction
    window: 60000
  - name: summaryStream
    source: authorizationStream
    select: "authorizationStream.operationId, authorizationStream.cnum, authorizationStream.amount, authorizationStream.currency, authorizationStream.authTime, transaction.entryId, transaction.entryTime"
    window: 60000
    join:
      dataflow: transactionStream
      where: "authorizationStream.operationId = transaction.operationId"
    sink: summary
  - name: customerStream
    source: customer
    window: 60000
  - name: reportStream
    source: summaryStream
    select: "summaryStream.operationId, summaryStream.cnum, customerStream.firstName, customerStream.lastName, summaryStream.amount, summaryStream.currency, summaryStream.authTime, summaryStream.entryId, summaryStream.entryTime"
    join:
      dataflow: customerStream
      where: "summaryStream.cnum = customerStream.cnum"
    sink: report
  - name: groupedAuthorizationsByCnumStream
    source: summaryStream
    select: "authorizationStream.cnum, SUM(authorizationStream.amount) AS total_amount"
    groupBy: "authorizationStream.cnum"
    sink: groupedAuthorizationsByCnum
```

- Submit the application as in the examples below, specifying the paths to `endpoints.yaml` and `application.yaml` in the options `endpointConfigFilePath` and `dataflowConfigFilePath` respectively.
For example, on Apache Spark:

```sh
spark-submit \
--class org.tesey.dataflow.DataflowProcessor \
--master yarn \
--deploy-mode cluster \
target/tesey-dataflow-1.0-SNAPSHOT.jar \
--runner=SparkRunner \
--streaming=true \
--endpointConfigFilePath=configs/endpoints.yaml \
--dataflowConfigFilePath=configs/application.yaml
```

Or on Apache Flink:

```sh
./bin/flink run \
-m yarn-cluster \
-c org.tesey.dataflow.DataflowProcessor target/tesey-dataflow-1.0-SNAPSHOT.jar \
--runner=FlinkRunner \
--streaming=true \
--endpointConfigFilePath=configs/endpoints.yaml \
--dataflowConfigFilePath=configs/application.yaml
```

Each endpoint in `endpoints.yaml` is described by the following fields:

| Field | Description |
|---|---|
| `name` (string) | The name used to identify the endpoint |
| `type` (string) | The type of the endpoint; the currently supported types are `kafka`, `file`, and `jms` |
| `schemaPath` (string) | The path to the Avro schema that corresponds to the structure of the ingested/exported records |
| `format` (string) | The format of the ingested/exported data; the currently supported formats are `avro` and `parquet` |
| `options` | A set of options that depends on the endpoint type |
The supported endpoint types are:

- `kafka`: the endpoint type used to read/write messages in Apache Kafka topics
- `file`: the endpoint type used to read/write files on HDFS or in object storage such as S3, GS, etc.
- `jms`: the endpoint type used to read/write messages in JMS queues

Options for the `kafka` endpoint type:

| Name | Description |
|---|---|
| `topic` (string) | The name of the Kafka topic |
| `bootstrapServers` (string) | A comma-separated list of host:port pairs giving the addresses of the Kafka brokers |
Options for the `file` endpoint type:

| Name | Description |
|---|---|
| `pathToDataset` (string) | The path to the ingested/exported files |
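For illustration, a `file` endpoint definition might look like the following sketch, mirroring the structure of the Kafka endpoints above; the endpoint name, schema path, and dataset path are hypothetical:

```yaml
endpoints:
  - name: transactionArchive       # hypothetical endpoint name
    type: file
    schemaPath: avro/transaction.avsc
    format: parquet                # avro and parquet are the supported formats
    options:
      - name: pathToDataset
        value: s3://my-bucket/archive/transactions   # hypothetical path; an HDFS path could be used instead
```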
Options for the `jms` endpoint type:

| Name | Description |
|---|---|
| `brokerHost` (string) | The JMS broker host |
| `brokerPort` (string) | The JMS broker port |
| `queueManager` (string) | The name of the queue manager |
| `messageChannel` (string) | The name of the message channel |
| `transportType` (integer) | The transport type |
| `queueName` (string) | The queue name |
| `rootTag` (string) | The tag of the root element (specified for the xml format) |
| `recordTag` (string) | The tag of a record element (specified for the xml format) |
| `xsltStylesheetPath` (string) | The path to an XSLT file |
| `charset` (string) | The character set |
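By way of illustration, a `jms` endpoint might be described as in the sketch below; all host, channel, and queue values are hypothetical placeholders, and `transportType: 1` is only an example value:

```yaml
endpoints:
  - name: paymentQueue             # hypothetical endpoint name
    type: jms
    schemaPath: avro/payment.avsc
    format: avro
    options:
      - name: brokerHost
        value: mq-broker           # hypothetical broker host
      - name: brokerPort
        value: 1414                # hypothetical broker port
      - name: queueManager
        value: QM1                 # hypothetical queue manager name
      - name: messageChannel
        value: DEV.APP.SVRCONN     # hypothetical message channel
      - name: transportType
        value: 1                   # example transport type
      - name: queueName
        value: DEV.QUEUE.1         # hypothetical queue name
```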
Each dataflow in `application.yaml` is described by the following fields:

| Field | Description |
|---|---|
| `name` (string) | The name used to identify the dataflow |
| `isFirst` (boolean) | A flag indicating that the dataflow should be processed first |
| `source` (string) | The name of the endpoint or dataflow to read data from |
| `select` (string) | A comma-separated list of selected fields |
| `filter` (string) | The filter predicate |
| `sink` (string) | The name of the endpoint that should be used as a sink to write records to |
| `groupBy` (string) | A comma-separated list of fields to group rows by |
| `window` (integer) | The window size in milliseconds |
| `join` | The join definition: `dataflow` names the dataflow to join with, and `where` gives the join condition (see the `summaryStream` example above) |
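The examples above do not use the `filter` field, so a minimal sketch is given below. It assumes the predicate syntax follows the same field-reference style as the `where` clauses above; the dataflow name, condition, and sink choice are hypothetical:

```yaml
dataflows:
  - name: largeAuthorizationStream               # hypothetical dataflow name
    source: authorization
    select: "authorization.operationId, authorization.cnum, authorization.amount"
    filter: "authorization.amount > 1000"        # hypothetical predicate keeping only large authorizations
    window: 60000
    sink: summary                                # hypothetical sink; its schema must match the selected fields
```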