[Streaming] Examples using Twitter's Algebird library #480
…lgebird Conflicts: project/SparkBuild.scala
…text.twitterStream method
val stream = ssc.twitterStream(username, password, filters,
  StorageLevel.MEMORY_ONLY_SER)

val users = stream.map(status => status.getUser.getId)
A note about this: Algebird's CMS currently supports only Long inputs. Since it uses hashing under the hood, it should be possible to support any hashable input, as with HyperLogLog, but that is not the case today.
So for now this example works on user ids. Run over relatively short durations, it will not see very heavily skewed data (which is where the sketch is most useful). If we could take String inputs it would be more interesting, since we could do TopK on hashtags (for example), which are likely to be a lot more skewed.
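To make the point about Long inputs and merging concrete, here is a minimal sketch of a Count-Min Sketch over Long keys with a monoid-style `++`. It is an illustration only: `SimpleCMS`, `zero`, `bucket`, and the depth/width values are hypothetical names and parameters chosen for this note, and none of it is Algebird's implementation or API.

```scala
// Illustrative Count-Min Sketch over Long keys. This is a hypothetical
// stand-in for explanation only, not Algebird's implementation or API.
final case class SimpleCMS(width: Int, table: Vector[Vector[Long]]) {
  def depth: Int = table.length

  // Record `count` occurrences of `item` by bumping one counter per row.
  def add(item: Long, count: Long = 1L): SimpleCMS =
    copy(table = table.zipWithIndex.map { case (counts, row) =>
      val col = SimpleCMS.bucket(row, item, width)
      counts.updated(col, counts(col) + count)
    })

  // Minimum over rows: collisions only inflate counters,
  // so the estimate is never below the true count.
  def estimate(item: Long): Long =
    table.zipWithIndex.map { case (counts, row) =>
      counts(SimpleCMS.bucket(row, item, width))
    }.min

  // Monoid-style "plus": element-wise sum of two sketches of the same shape.
  def ++(other: SimpleCMS): SimpleCMS = {
    require(width == other.width && depth == other.depth, "sketch shapes must match")
    copy(table = table.zip(other.table).map { case (a, b) =>
      a.zip(b).map { case (x, y) => x + y }
    })
  }
}

object SimpleCMS {
  // The empty sketch acts as the monoid identity for ++.
  def zero(depth: Int, width: Int): SimpleCMS = {
    require(depth > 0 && width > 0, "depth and width must be positive")
    SimpleCMS(width, Vector.fill(depth, width)(0L))
  }

  // Per-row bucket: hash the (row, item) pair and map it into [0, width).
  def bucket(row: Int, item: Long, width: Int): Int =
    java.lang.Math.floorMod((row, item).hashCode, width)
}

object SimpleCMSDemo {
  def main(args: Array[String]): Unit = {
    val empty = SimpleCMS.zero(depth = 4, width = 64)
    // Two batches of user ids, sketched independently and then merged.
    val batch1 = Seq(42L, 42L, 7L).foldLeft(empty)((s, id) => s.add(id))
    val batch2 = Seq(42L, 99L).foldLeft(empty)((s, id) => s.add(id))
    val merged = batch1 ++ batch2
    println(s"estimate for user 42: ${merged.estimate(42L)}") // at least 3
  }
}
```

Because `++` is associative and `zero` is its identity, per-batch sketches can be combined in any grouping, which is what makes this kind of structure a natural fit for reducing over a stream. Supporting String keys in this toy version would only require changing the key type passed to `bucket`.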
This may be an important point that could confuse people. Can you add a line to the comment at the top?
Glad to see this pull request. I hope it helps CMS and HLL have more impact. I agree that the CMS interface is suboptimal right now. We are going to update it to support the same approach as HLL (probably in algebird 0.2.0). Let us know if there are any algorithms to add. I'd love to collaborate and share this code in Algebird (which we extracted from scalding).
<version>3.0.3</version>
<groupId>com.twitter</groupId>
<artifactId>algebird-core_2.9.2</artifactId>
<version>0.1.8</version>
0.1.9 is out!
@johnynek thanks for the comments! Looking forward to 0.2.0 in that case, since CMS with any hashable input will be neat. Also, if I find some time I'd be happy to try a scalding version of the example.
Thank you very much. This is a great addition.
1, Fix SPARK-1441: compile spark core error with hadoop 0.23.x
2, Fix SPARK-1491: maven hadoop-provided profile fails to build
3, Fix org.scala-lang: * ,org.apache.avro:* inconsistent versions dependency
4, A modified on the sql/catalyst/pom.xml,sql/hive/pom.xml,sql/core/pom.xml (Four spaces formatted into two spaces)
Author: witgo <[email protected]>
Closes mesos#480 from witgo/format_pom and squashes the following commits:
03f652f [witgo] review commit
b452680 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
bee920d [witgo] revert fix SPARK-1629: Spark Core missing commons-lang dependence
7382a07 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
6902c91 [witgo] fix SPARK-1629: Spark Core missing commons-lang dependence
0da4bc3 [witgo] merge master
d1718ed [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
e345919 [witgo] add avro dependency to yarn-alpha
77fad08 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
62d0862 [witgo] Fix org.scala-lang: * inconsistent versions dependency
1a162d7 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
934f24d [witgo] review commit
cf46edc [witgo] exclude jruby
06e7328 [witgo] Merge branch 'SparkBuild' into format_pom
99464d2 [witgo] fix maven hadoop-provided profile fails to build
0c6c1fc [witgo] Fix compile spark core error with hadoop 0.23.x
6851bec [witgo] Maintain consistent SparkBuild.scala, pom.xml
This PR adds two examples for streaming that use monoids from Twitter's Algebird library:
See https://groups.google.com/forum/?fromgroups=#!topic/spark-users/4ht9ndVaZQY