@xinyuiscool
Contributor

Add Beam quick start, examples, and API docs.

Member

@dxichen dxichen left a comment

Minor comments, thanks for the docs!


{% endhighlight %}

To run this Beam program with Samza, you can simply provides "--runner=SamzaRunner" as a program argument. You can follow our [quick start](/startup/quick-start/{{site.version}}/beam.html) to set up your project and run different examples. For more details on writing the Beam program, please refer the comprehensive [Beam programming guide](https://beam.apache.org/documentation/programming-guide/).
Member

s/provides/provide
s/refer the comprehensive/refer to the..

Contributor Author

fixed.

```
$ deploy/examples/bin/run-beam-standalone.sh org.apache.beam.examples.WordCount \
--configFilePath=$PWD/deploy/examples/config/standalone.properties \
--inputFile=/Users/xiliu/opensource/samza-beam-examples/pom.xml --output=word-counts.txt \
```
Member

remove username, I have a patch for these docs here apache/samza-beam-examples#1

Contributor Author

I switched to KafkaWordCount, to avoid the batch problems we have.

```
$ deploy/examples/bin/run-beam-yarn.sh org.apache.beam.examples.WordCount \
--configFilePath=$PWD/deploy/examples/config/yarn.properties \
--inputFile=/Users/xiliu/opensource/samza-beam-examples/pom.xml \
```
Member

remove username

Contributor Author

Fixed by switching to kafka.


#### Samza SQL API examples
You can easily create a Samza job declaratively using
[Samza SQL](https://samza.apache.org/learn/tutorials/0.14/samza-sql.html).
Member

change version to latest

Contributor Author

fixed!


### Apache Beam - A Samza’s Perspective

The goal of Samza is to provide large-scale streaming processing capabilities with first-class state support. This does not contradict with Beam. In fact, while Samza lays out a solid foundation for large-scale stateful stream processing, Beam adds the cutting-edge stream processing API and model on top of it. The Beam API and model allows further optimization in the Samza platform, including multi-stage distributed computation and parallel processing on the per-key basis. The performance enhancements from these optimizations will benefit both Samza and its users. Samza can also further improve Beam model by providing various use cases. Adopting Beam provides a solid understanding of the latest data processing technology, and we believe Samza will benefit from it.

Contributor

s/Adopting Beam provides a solid understanding of the latest data processing technology/ Beam provides cutting-edge data processing capabilities.

Contributor Author

Fixed.


### Introduction

Apache Beam brings an easy-to-use, but powerful API and model for state-of-art stream and batch data processing with portability across a variety of languages. The Beam API and model has the following characteristics:
Contributor

Minor:
s/but powerful API/ powerful API

Contributor Author

It seems better to keep the "but". I removed the comma to improve readability.


- *Simple constructs, powerful semantics*: the whole beam API can be simply described by a `Pipeline` object, which captures all your data processing steps from input to output. Beam SDK supports over [20 data IOs](https://beam.apache.org/documentation/io/built-in/), and data transformations from simple [Map](https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/transforms/MapElements.html) to complex [Combines and Joins](https://beam.apache.org/releases/javadoc/2.11.0/index.html?org/apache/beam/sdk/transforms/Combine.html).

- *Strong consistency via event-time*: Beam provides advanced [event-time support](https://beam.apache.org/documentation/programming-guide/#watermarks-and-late-data) so you can perform windowing and aggregations based on when the events happen, instead of when they are consumed. The event-time mechanism improves the accuracy of processing results, and has repeatability when reprocessing the same data set.
Contributor

Minor:

  1. s/instead of when they are consumed/instead of arrival time?
  2. s/and has repeatability/and guarantees repeatability in results/

Contributor Author

fixed
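
For anyone reading this thread later, the event-time point in the bullet under review can be illustrated with a minimal sketch. It is plain Java and deliberately does not use the Beam SDK: the class name `EventTimeWindowSketch`, the method `countPerWindow`, and the sample timestamps are all hypothetical, chosen only for illustration. It assigns each event to a fixed window based on the event's own timestamp, so the per-window counts are the same no matter what order the events arrive in.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; none of these names come from the Beam SDK.
final class EventTimeWindowSketch {

    /**
     * Counts events per fixed window, keyed by window start time (epoch millis).
     * The window is derived from each event's own timestamp, never from arrival order.
     */
    static Map<Long, Integer> countPerWindow(List<Long> eventTimesMillis, long windowSizeMillis) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long eventTime : eventTimesMillis) {
            // Round the event time down to the start of its window.
            long windowStart = Math.floorDiv(eventTime, windowSizeMillis) * windowSizeMillis;
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical event times, listed in the order they happened.
        List<Long> inOrder = Arrays.asList(1_000L, 59_000L, 61_000L);
        // The same events arriving in a different order, e.g. when reprocessing.
        List<Long> outOfOrder = Arrays.asList(61_000L, 1_000L, 59_000L);

        Map<Long, Integer> first = countPerWindow(inOrder, 60_000L);
        Map<Long, Integer> second = countPerWindow(outOfOrder, 60_000L);

        System.out.println("Counts per one-minute window: " + first);
        System.out.println("Same result after reordering: " + first.equals(second));
    }
}
```

Beam's actual model also involves watermarks and late-data handling, which this sketch leaves out; it only shows why keying windows on event time makes results repeatable.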


1. Download and install [Apache Maven](http://maven.apache.org/download.cgi) by following Maven’s [installation guide](http://maven.apache.org/install.html) for your specific operating system.

1. A script named "grid" is included in this project which allows you to easily download and install Zookeeper, Kafka, and Yarn.
Contributor

I think all the individual line items in Setup (install JDK, install Maven, install grid) are numbered with 1. Maybe it would be better to give them the right ordering.

Contributor Author

Thanks for the catch. The install grid item shouldn't be marked as 1. I fixed it in the update.

@dxichen
Member

dxichen commented Mar 12, 2019

LGTM, thanks!

Contributor

@shanthoosh shanthoosh left a comment

LGTM.

@shanthoosh shanthoosh merged commit 6711a9f into apache:master Mar 12, 2019
asfgit pushed a commit that referenced this pull request Mar 12, 2019
* SAMZA-2124: Add Beam API doc to the website

* Address pr feedback
Zhangyx39 pushed a commit to Zhangyx39/samza that referenced this pull request Apr 3, 2019
* SAMZA-2124: Add Beam API doc to the website

* Address pr feedback