Tesey DataFlow

Tesey DataFlow lets you process dataflows in both batch and streaming modes on any execution engine supported by Apache Beam, including Apache Spark, Apache Flink, Apache Samza, etc.

Installation

Clone the repository and install the package with:

mvn clean install
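For example, assuming the repository is hosted on GitHub under tesey-io/DataFlow:

```shell
# Assumed clone URL; adjust if the repository lives elsewhere
git clone https://github.com/tesey-io/DataFlow.git
cd DataFlow
mvn clean install
```

The build should produce the jar used in the submission commands below (target/tesey-dataflow-1.0-SNAPSHOT.jar).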

Usage

  1. Describe the endpoints used to process data in endpoints.yaml, similar to the following:
endpoints:
  - name: authorization
    type: kafka
    schemaPath: avro/authorization.avsc
    format: avro
    options:
    - name: topic
      value: authorization
    - name: bootstrapServers
      value: kafka-cp-kafka-headless:9092
  - name: transaction
    type: kafka
    schemaPath: avro/transaction.avsc
    format: avro
    options:
    - name: topic
      value: transaction
    - name: bootstrapServers
      value: kafka-cp-kafka-headless:9092
  - name: summary
    type: kafka
    schemaPath: avro/summary.avsc
    format: avro
    options:
    - name: topic
      value: summary
    - name: bootstrapServers
      value: kafka-cp-kafka-headless:9092
  - name: customer
    type: kafka
    schemaPath: avro/customer.avsc
    format: avro
    options:
    - name: topic
      value: customer
    - name: bootstrapServers
      value: kafka-cp-kafka-headless:9092
  - name: report
    type: kafka
    schemaPath: avro/report.avsc
    format: avro
    options:
    - name: topic
      value: report
    - name: bootstrapServers
      value: kafka-cp-kafka-headless:9092
  - name: groupedAuthorizationsByCnum
    type: kafka
    schemaPath: avro/groupedAuthorizationsByCnum.avsc
    format: avro
    options:
    - name: topic
      value: groupedAuthorizationsByCnum
    - name: bootstrapServers
      value: kafka-cp-kafka-headless:9092
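Each endpoint references an Avro schema through schemaPath. As a sketch (the field names and types below are assumptions inferred from the dataflow definitions in the next step, not the project's actual schema), avro/authorization.avsc could look like:

```json
{
  "type": "record",
  "name": "Authorization",
  "namespace": "org.tesey.dataflow.example",
  "doc": "Illustrative schema only; the real field set may differ",
  "fields": [
    {"name": "operationId", "type": "string"},
    {"name": "cnum",        "type": "string"},
    {"name": "amount",      "type": "double"},
    {"name": "currency",    "type": "string"},
    {"name": "authTime",    "type": "long"}
  ]
}
```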
  2. Describe the dataflows to be processed in application.yaml, similar to the following:
dataflows:
  - name: authorizationStream
    source: authorization
    isFirst: true
    window: 60000
  - name: transactionStream
    source: transaction
    window: 60000
  - name: summaryStream
    source: authorizationStream
    select: "authorizationStream.operationId, authorizationStream.cnum, authorizationStream.amount, authorizationStream.currency, authorizationStream.authTime, transaction.entryId, transaction.entryTime"
    window: 60000
    join:
      dataflow: transactionStream
      where: "authorizationStream.operationId = transaction.operationId"
    sink: summary
  - name: customerStream
    source: customer
    window: 60000
  - name: reportStream
    source: summaryStream
    select: "summaryStream.operationId, summaryStream.cnum, customerStream.firstName, customerStream.lastName, summaryStream.amount, summaryStream.currency, summaryStream.authTime, summaryStream.entryId, summaryStream.entryTime"
    join:
      dataflow: customerStream
      where: "summaryStream.cnum = customerStream.cnum"
    sink: report
  - name: groupedAuthorizationsByCnumStream
    source: summaryStream
    select: "authorizationStream.cnum, SUM(authorizationStream.amount) AS total_amount"
    groupBy: "authorizationStream.cnum"
    sink: groupedAuthorizationsByCnum
  3. Submit the application as shown below, specifying the paths to endpoints.yaml and application.yaml via the endpointConfigFilePath and dataflowConfigFilePath options, respectively:

Submit Spark Application

spark-submit \
--class org.tesey.dataflow.DataflowProcessor \
--master yarn \
--deploy-mode cluster \
target/tesey-dataflow-1.0-SNAPSHOT.jar \
--runner=SparkRunner \
--streaming=true \
--endpointConfigFilePath=configs/endpoints.yaml \
--dataflowConfigFilePath=configs/application.yaml

Submit Flink Application

./bin/flink run \
-m yarn-cluster \
-c org.tesey.dataflow.DataflowProcessor target/tesey-dataflow-1.0-SNAPSHOT.jar \
--runner=FlinkRunner \
--streaming=true \
--endpointConfigFilePath=configs/endpoints.yaml \
--dataflowConfigFilePath=configs/application.yaml
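For local experimentation it may also be possible to run the processor with Beam's DirectRunner instead of a cluster runner. The sketch below assumes the built jar bundles the Direct runner and that DataflowProcessor accepts standard Beam pipeline options:

```shell
# Sketch: local run via Beam's Direct runner (assumes it is on the classpath)
java -cp target/tesey-dataflow-1.0-SNAPSHOT.jar \
  org.tesey.dataflow.DataflowProcessor \
  --runner=DirectRunner \
  --streaming=true \
  --endpointConfigFilePath=configs/endpoints.yaml \
  --dataflowConfigFilePath=configs/application.yaml
```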

Endpoints specification

| Field | Type | Description |
| --- | --- | --- |
| name | string | The name used to identify the endpoint |
| type | string | The type of endpoint; supported types are listed under EndpointType below |
| schemaPath | string | The path to the Avro schema that describes the structure of the ingested/exported records |
| format | string | The format of the ingested/exported data; currently supported formats are avro and parquet |
| options | | The set of options; which options apply depends on the endpoint type |

EndpointType

The DataflowAggregator currently supports the following endpoint types:
  • kafka - the endpoint type used to read/write messages in Apache Kafka topics
  • file - the endpoint type used to read/write files in HDFS or in object storage such as S3, GS, etc.
  • jms - the endpoint type used to read/write messages in JMS queues

Kafka endpoint options

| Name | Type | Description |
| --- | --- | --- |
| topic | string | The name of the Kafka topic |
| bootstrapServers | string | A comma-separated list of host:port pairs addressing the Kafka brokers |

File endpoint options

| Name | Type | Description |
| --- | --- | --- |
| pathToDataset | string | The path of the ingested/exported files |
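For illustration, a file endpoint might be declared as follows; the endpoint name and dataset path are placeholders, not taken from the project:

```yaml
endpoints:
  - name: transactionArchive          # placeholder name
    type: file
    schemaPath: avro/transaction.avsc
    format: parquet
    options:
    - name: pathToDataset
      value: s3a://my-bucket/transactions   # placeholder path (HDFS paths work the same way)
```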

Jms endpoint options

| Name | Type | Description |
| --- | --- | --- |
| brokerHost | string | The JMS broker host |
| brokerPort | string | The JMS broker port |
| queueManager | string | The name of the queue manager |
| messageChannel | string | The name of the message channel |
| transportType | integer | The transport type |
| queueName | string | The queue name |
| rootTag | string | The tag of the root element (specify for the xml format) |
| recordTag | string | The tag of a record element (specify for the xml format) |
| xsltStylesheetPath | string | The path of an XSLT stylesheet file |
| charset | string | The character set |
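As a sketch, a JMS endpoint could be declared like this; the option names come from the table above, but every value is a placeholder:

```yaml
endpoints:
  - name: paymentQueue                # placeholder name
    type: jms
    schemaPath: avro/transaction.avsc
    format: avro
    options:
    - name: brokerHost
      value: mq-broker                # placeholder host
    - name: brokerPort
      value: 1414                     # placeholder port
    - name: queueManager
      value: QM1                      # placeholder queue manager
    - name: messageChannel
      value: DEV.APP.SVRCONN          # placeholder channel
    - name: transportType
      value: 1                        # placeholder transport type
    - name: queueName
      value: PAYMENTS.IN              # placeholder queue name
```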

Dataflows specification

| Field | Type | Description |
| --- | --- | --- |
| name | string | The name used to identify the dataflow |
| isFirst | boolean | Flag indicating that this dataflow should be processed first |
| source | string | The name of the source endpoint or upstream dataflow to read data from |
| select | string | A comma-separated list of selected fields |
| filter | string | The filter predicate |
| sink | string | The name of the endpoint to use as the sink for writing records |
| groupBy | string | A comma-separated list of fields to group rows by |
| window | integer | Window size in milliseconds |
| join | | The join specification, consisting of the fields below |
| join.dataflow | string | The name of the dataflow to join with this dataflow |
| join.where | string | The join predicate |
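The filter field is not exercised in the example application.yaml above. As a sketch (the predicate syntax and the threshold are assumptions), a filtered dataflow might look like:

```yaml
dataflows:
  - name: largeAuthorizationStream                 # hypothetical dataflow
    source: authorization
    select: "authorization.operationId, authorization.cnum, authorization.amount"
    filter: "authorization.amount > 1000"          # assumed predicate syntax
    window: 60000
    sink: summary
```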
