Tesey DataFlow

Tesey DataFlow lets you process dataflows in both batch and streaming modes on any execution engine supported by Apache Beam, including Apache Spark, Apache Flink, Apache Samza, etc.

Installation

Clone the repository and install the package with:

mvn clean install
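For example, assuming the repository is hosted on GitHub under tesey-io/DataFlow:

```shell
# Assumed clone URL; adjust if the repository lives elsewhere
git clone https://github.com/tesey-io/DataFlow.git
cd DataFlow
mvn clean install
```

The build should produce the jar used in the submission commands below (target/tesey-dataflow-1.0-SNAPSHOT.jar).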

Usage

  1. Describe the endpoints used to process data in endpoints.yaml, similar to the following:
endpoints:
  - name: authorization
    type: kafka
    schemaPath: avro/authorization.avsc
    format: avro
    options:
    - name: topic
      value: authorization
    - name: bootstrapServers
      value: kafka-cp-kafka-headless:9092
  - name: transaction
    type: kafka
    schemaPath: avro/transaction.avsc
    format: avro
    options:
    - name: topic
      value: transaction
    - name: bootstrapServers
      value: kafka-cp-kafka-headless:9092
  - name: summary
    type: kafka
    schemaPath: avro/summary.avsc
    format: avro
    options:
    - name: topic
      value: summary
    - name: bootstrapServers
      value: kafka-cp-kafka-headless:9092
  - name: customer
    type: kafka
    schemaPath: avro/customer.avsc
    format: avro
    options:
    - name: topic
      value: customer
    - name: bootstrapServers
      value: kafka-cp-kafka-headless:9092
  - name: report
    type: kafka
    schemaPath: avro/report.avsc
    format: avro
    options:
    - name: topic
      value: report
    - name: bootstrapServers
      value: kafka-cp-kafka-headless:9092
  - name: groupedAuthorizationsByCnum
    type: kafka
    schemaPath: avro/groupedAuthorizationsByCnum.avsc
    format: avro
    options:
    - name: topic
      value: groupedAuthorizationsByCnum
    - name: bootstrapServers
      value: kafka-cp-kafka-headless:9092
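Each endpoint references an Avro schema through schemaPath. As a sketch (the field names and types below are assumptions inferred from the dataflow definitions in the next step, not the project's actual schema), avro/authorization.avsc could look like:

```json
{
  "type": "record",
  "name": "Authorization",
  "namespace": "org.tesey.dataflow.example",
  "doc": "Illustrative schema only; the real field set may differ",
  "fields": [
    {"name": "operationId", "type": "string"},
    {"name": "cnum",        "type": "string"},
    {"name": "amount",      "type": "double"},
    {"name": "currency",    "type": "string"},
    {"name": "authTime",    "type": "long"}
  ]
}
```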
  2. Describe the dataflows to be processed in application.yaml, similar to the following:
dataflows:
  - name: authorizationStream
    source: authorization
    isFirst: true
    window: 60000
  - name: transactionStream
    source: transaction
    window: 60000
  - name: summaryStream
    source: authorizationStream
    select: "authorizationStream.operationId, authorizationStream.cnum, authorizationStream.amount, authorizationStream.currency, authorizationStream.authTime, transaction.entryId, transaction.entryTime"
    window: 60000
    join:
      dataflow: transactionStream
      where: "authorizationStream.operationId = transaction.operationId"
    sink: summary
  - name: customerStream
    source: customer
    window: 60000
  - name: reportStream
    source: summaryStream
    select: "summaryStream.operationId, summaryStream.cnum, customerStream.firstName, customerStream.lastName, summaryStream.amount, summaryStream.currency, summaryStream.authTime, summaryStream.entryId, summaryStream.entryTime"
    join:
      dataflow: customerStream
      where: "summaryStream.cnum = customerStream.cnum"
    sink: report
  - name: groupedAuthorizationsByCnumStream
    source: summaryStream
    select: "authorizationStream.cnum, SUM(authorizationStream.amount) AS total_amount"
    groupBy: "authorizationStream.cnum"
    sink: groupedAuthorizationsByCnum
  3. Submit the application as shown below, specifying the paths to endpoints.yaml and application.yaml via the endpointConfigFilePath and dataflowConfigFilePath options, respectively:

Submit Spark Application

spark-submit \
--class org.tesey.dataflow.DataflowProcessor \
--master yarn \
--deploy-mode cluster \
target/tesey-dataflow-1.0-SNAPSHOT.jar \
--runner=SparkRunner \
--streaming=true \
--endpointConfigFilePath=configs/endpoints.yaml \
--dataflowConfigFilePath=configs/application.yaml

Submit Flink Application

./bin/flink run \
-m yarn-cluster \
-c org.tesey.dataflow.DataflowProcessor target/tesey-dataflow-1.0-SNAPSHOT.jar \
--runner=FlinkRunner \
--streaming=true \
--endpointConfigFilePath=configs/endpoints.yaml \
--dataflowConfigFilePath=configs/application.yaml
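For local experimentation it may also be possible to run the processor with Beam's DirectRunner instead of a cluster runner. The sketch below assumes the built jar bundles the Direct runner and that DataflowProcessor accepts standard Beam pipeline options:

```shell
# Sketch: local run via Beam's Direct runner (assumes it is on the classpath)
java -cp target/tesey-dataflow-1.0-SNAPSHOT.jar \
  org.tesey.dataflow.DataflowProcessor \
  --runner=DirectRunner \
  --streaming=true \
  --endpointConfigFilePath=configs/endpoints.yaml \
  --dataflowConfigFilePath=configs/application.yaml
```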

Endpoints specification

| Field | Type | Description |
| --- | --- | --- |
| name | string | The name used to identify the endpoint |
| type | string | The type of endpoint; supported types are listed under EndpointType below |
| schemaPath | string | The path to the Avro schema that describes the structure of the ingested/exported records |
| format | string | The format of the ingested/exported data; currently supported formats are avro and parquet |
| options | | The set of options; which options apply depends on the endpoint type |

EndpointType

The DataflowAggregator currently supports the following endpoint types:
  • kafka - the endpoint type used to read/write messages in Apache Kafka topics
  • file - the endpoint type used to read/write files in HDFS or in object storage such as S3, GS, etc.
  • jms - the endpoint type used to read/write messages in JMS queues

Kafka endpoint options

| Name | Type | Description |
| --- | --- | --- |
| topic | string | The name of the Kafka topic |
| bootstrapServers | string | A comma-separated list of host:port pairs addressing the Kafka brokers |

File endpoint options

| Name | Type | Description |
| --- | --- | --- |
| pathToDataset | string | The path of the ingested/exported files |
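For illustration, a file endpoint might be declared as follows; the endpoint name and dataset path are placeholders, not taken from the project:

```yaml
endpoints:
  - name: transactionArchive          # placeholder name
    type: file
    schemaPath: avro/transaction.avsc
    format: parquet
    options:
    - name: pathToDataset
      value: s3a://my-bucket/transactions   # placeholder path (HDFS paths work the same way)
```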

Jms endpoint options

| Name | Type | Description |
| --- | --- | --- |
| brokerHost | string | The JMS broker host |
| brokerPort | string | The JMS broker port |
| queueManager | string | The name of the queue manager |
| messageChannel | string | The name of the message channel |
| transportType | integer | The transport type |
| queueName | string | The queue name |
| rootTag | string | The tag of the root element (specify for the xml format) |
| recordTag | string | The tag of a record element (specify for the xml format) |
| xsltStylesheetPath | string | The path of an XSLT stylesheet file |
| charset | string | The character set |
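As a sketch, a JMS endpoint could be declared like this; the option names come from the table above, but every value is a placeholder:

```yaml
endpoints:
  - name: paymentQueue                # placeholder name
    type: jms
    schemaPath: avro/transaction.avsc
    format: avro
    options:
    - name: brokerHost
      value: mq-broker                # placeholder host
    - name: brokerPort
      value: 1414                     # placeholder port
    - name: queueManager
      value: QM1                      # placeholder queue manager
    - name: messageChannel
      value: DEV.APP.SVRCONN          # placeholder channel
    - name: transportType
      value: 1                        # placeholder transport type
    - name: queueName
      value: PAYMENTS.IN              # placeholder queue name
```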

Dataflows specification

| Field | Type | Description |
| --- | --- | --- |
| name | string | The name used to identify the dataflow |
| isFirst | boolean | Flag indicating that this dataflow should be processed first |
| source | string | The name of the source endpoint or upstream dataflow to read data from |
| select | string | A comma-separated list of selected fields |
| filter | string | The filter predicate |
| sink | string | The name of the endpoint to use as the sink for writing records |
| groupBy | string | A comma-separated list of fields to group rows by |
| window | integer | Window size in milliseconds |
| join | | The join specification, consisting of the fields below |
| join.dataflow | string | The name of the dataflow to join with this dataflow |
| join.where | string | The join predicate |
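The filter field is not exercised in the example application.yaml above. As a sketch (the predicate syntax and the threshold are assumptions), a filtered dataflow might look like:

```yaml
dataflows:
  - name: largeAuthorizationStream                 # hypothetical dataflow
    source: authorization
    select: "authorization.operationId, authorization.cnum, authorization.amount"
    filter: "authorization.amount > 1000"          # assumed predicate syntax
    window: 60000
    sink: summary
```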
