Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Streaming] Examples using Twitter's Algebird library #480

Merged
merged 7 commits into from
Feb 22, 2013
Merged

[Streaming] Examples using Twitter's Algebird library #480

merged 7 commits into from
Feb 22, 2013

Conversation

MLnick
Copy link
Contributor

@MLnick MLnick commented Feb 19, 2013

This PR adds two examples for streaming that use monoids from Twitter's Algebird library:

  • HyperLogLog for approximate distinct object counting with low memory overhead
  • CountMinSketch for approximating object frequency in a stream as well as TopK or "heavy hitters" estimation

See https://groups.google.com/forum/?fromgroups=#!topic/spark-users/4ht9ndVaZQY

val stream = ssc.twitterStream(username, password, filters,
StorageLevel.MEMORY_ONLY_SER)

val users = stream.map(status => status.getUser.getId)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A note about this: currently Algebird CMS only supports Long inputs. Since it uses hashing under the hood it should be possible to have any hashable input as with HyperLogLog, but not currently.

So for now this example works on user ids, so running it over relatively small durations will not result in very heavily-skewed data (which is where the sketch will be most useful). If we could take String inputs then it would be more interesting as we could do TopK on hashtags (for example) which is likely to be a lot more skewed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This maybe an important point that may confused people. Can you added a line to the comment at the top?

@johnynek
Copy link

Glad to see this pull req. Hope this helps CMS and HLL make more impacts.

I agree that the CMS interface is suboptimal now. We are going to update it to support the same approach as HLL (probably in algebird 0.2.0). Let us know if there are any algorithms to add. I'd love to collaborate and share this code in Algebird (which we extracted from scalding).

<version>3.0.3</version>
<groupId>com.twitter</groupId>
<artifactId>algebird-core_2.9.2</artifactId>
<version>0.1.8</version>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0.1.9 is out!

@MLnick
Copy link
Contributor Author

MLnick commented Feb 21, 2013

@johnynek thanks for the comments! Look forward to 0.2.0 in that case since CMS with any hashable inputs will be neat. Also if I find some time I'd be happy to try a scalding version of the example.

tdas added a commit that referenced this pull request Feb 22, 2013
[Streaming] Examples using Twitter's Algebird library
@tdas tdas merged commit cfa65eb into mesos:streaming Feb 22, 2013
@tdas
Copy link
Contributor

tdas commented Feb 22, 2013

Thank you very much. This is a great addition.

sarahgerweck pushed a commit to AtScaleInc/spark2 that referenced this pull request Jan 22, 2014
Handful of 0.9 fixes

This patch addresses a few fixes for Spark 0.9.0 based on the last release candidate.

@mridulm gets credit for reporting most of the issues here. Many of the fixes here are based on his work in mesos#477 and follow up discussion with him.
pwendell pushed a commit to andyk/mesos-spark that referenced this pull request May 5, 2014
1, Fix SPARK-1441: compile spark core error with hadoop 0.23.x
2, Fix SPARK-1491: maven hadoop-provided profile fails to build
3, Fix org.scala-lang: * ,org.apache.avro:* inconsistent versions dependency
4, A modified on the sql/catalyst/pom.xml,sql/hive/pom.xml,sql/core/pom.xml (Four spaces formatted into two spaces)

Author: witgo <[email protected]>

Closes mesos#480 from witgo/format_pom and squashes the following commits:

03f652f [witgo] review commit
b452680 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
bee920d [witgo] revert fix SPARK-1629: Spark Core missing commons-lang dependence
7382a07 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
6902c91 [witgo] fix SPARK-1629: Spark Core missing commons-lang dependence
0da4bc3 [witgo] merge master
d1718ed [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
e345919 [witgo] add avro dependency to yarn-alpha
77fad08 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
62d0862 [witgo] Fix org.scala-lang: * inconsistent versions dependency
1a162d7 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
934f24d [witgo] review commit
cf46edc [witgo] exclude jruby
06e7328 [witgo] Merge branch 'SparkBuild' into format_pom
99464d2 [witgo] fix maven hadoop-provided profile fails to build
0c6c1fc [witgo] Fix compile spark core error with hadoop 0.23.x
6851bec [witgo] Maintain consistent SparkBuild.scala, pom.xml
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants