Commit 9920c823 authored by blaunet

Initial commit

File added
tweets/*
scored_tweets/*
formated_tweets/*
concat_dataframe/*
Apache Storm is an effort undergoing incubation at the Apache Software
Foundation (ASF), sponsored by the Apache Incubator PMC.
Incubation is required of all newly accepted projects until a further review
indicates that the infrastructure, communications, and decision making process
have stabilized in a manner consistent with other successful ASF projects.
While incubation status is not necessarily a reflection of the completeness
or stability of the code, it does indicate that the project has yet to be
fully endorsed by the ASF.
Apache Storm
Copyright 2014 The Apache Software Foundation
This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).
This product includes software developed by Yahoo! Inc. (www.yahoo.com)
Copyright (c) 2012-2014 Yahoo! Inc.
YAML support provided by snakeyaml (http://code.google.com/p/snakeyaml/).
Copyright (c) 2008-2010 Andrey Somov
The Netty transport uses Netty
(https://netty.io/)
Copyright (C) 2011 The Netty Project
This product uses LMAX Disruptor
(http://lmax-exchange.github.io/disruptor/)
Copyright 2011 LMAX Ltd.
This product includes the Jetty HTTP server
(http://jetty.codehaus.org/jetty/).
Copyright 1995-2006 Mort Bay Consulting Pty Ltd
JSON (de)serialization by json-simple from
(http://code.google.com/p/json-simple).
Copyright (C) 2009 Fang Yidong and Chris Nokleberg
Alternative collection types provided by google-collections from
http://code.google.com/p/google-collections/.
Copyright (C) 2007 Google Inc.
Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation. Storm is simple, can be used with any programming language, [is used by many companies](https://github.com/nathanmarz/storm/wiki/Powered-By), and is a lot of fun to use!
The [Rationale page](https://github.com/nathanmarz/storm/wiki/Rationale) on the wiki explains what Storm is and why it was built. [This presentation](http://vimeo.com/40972420) is also a good introduction to the project.
Storm has a website at [storm-project.net](http://storm-project.net). Follow [@stormprocessor](https://twitter.com/stormprocessor) on Twitter for updates on the project.
## Documentation
Documentation and tutorials can be found on the [Storm wiki](http://github.com/nathanmarz/storm/wiki).
Developers and contributors should also take a look at our [Developer documentation](DEVELOPER.md).
## Getting help
__NOTE:__ The Google Groups account storm-user@googlegroups.com is now officially deprecated in favor of the Apache-hosted user/dev mailing lists.
### Storm Users
Storm users should send messages and subscribe to [user@storm.incubator.apache.org](mailto:user@storm.incubator.apache.org).
You can subscribe to this list by sending an email to [user-subscribe@storm.incubator.apache.org](mailto:user-subscribe@storm.incubator.apache.org). Likewise, you can cancel a subscription by sending an email to [user-unsubscribe@storm.incubator.apache.org](mailto:user-unsubscribe@storm.incubator.apache.org).
You can also [browse the archives of the storm-user mailing list](http://mail-archives.apache.org/mod_mbox/incubator-storm-user/).
### Storm Developers
Storm developers should send messages and subscribe to [dev@storm.incubator.apache.org](mailto:dev@storm.incubator.apache.org).
You can subscribe to this list by sending an email to [dev-subscribe@storm.incubator.apache.org](mailto:dev-subscribe@storm.incubator.apache.org). Likewise, you can cancel a subscription by sending an email to [dev-unsubscribe@storm.incubator.apache.org](mailto:dev-unsubscribe@storm.incubator.apache.org).
You can also [browse the archives of the storm-dev mailing list](http://mail-archives.apache.org/mod_mbox/incubator-storm-dev/).
### Which list should I send/subscribe to?
If you are using a pre-built binary distribution of Storm, then chances are you should send questions, comments, Storm-related announcements, etc. to [user@storm.incubator.apache.org](mailto:user@storm.incubator.apache.org).
If you are building Storm from source, developing new features, or otherwise hacking on the Storm source code, then [dev@storm.incubator.apache.org](mailto:dev@storm.incubator.apache.org) is more appropriate.
### What will happen with storm-user@googlegroups.com?
All existing messages will remain archived there, and can be accessed/searched [here](https://groups.google.com/forum/#!forum/storm-user).
New messages sent to storm-user@googlegroups.com will either be rejected/bounced or replied to with a message to direct the email to the appropriate Apache-hosted group.
### IRC
You can also come to the #storm-user room on [freenode](http://freenode.net/). You can usually find a Storm developer there to help you out.
## License
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
## Project lead
* Nathan Marz ([@nathanmarz](http://twitter.com/nathanmarz))
## Committers
* James Xu ([@xumingming](https://github.com/xumingming))
* Jason Jackson ([@jason_j](http://twitter.com/jason_j))
* Andy Feng ([@anfeng](https://github.com/anfeng))
* Flip Kromer ([@mrflip](https://github.com/mrflip))
* David Lao ([@davidlao2k](https://github.com/davidlao2k))
* P. Taylor Goetz ([@ptgoetz](https://github.com/ptgoetz))
* Derek Dagit ([@d2r](https://github.com/d2r))
* Robert Evans ([@revans2](https://github.com/revans2))
* Michael G. Noll ([@miguno](https://github.com/miguno))
## Contributors
* Christopher Bertels ([@bakkdoor](http://twitter.com/bakkdoor))
* Michael Montano ([@michaelmontano](http://twitter.com/michaelmontano))
* Dennis Zhuang ([@killme2008](https://github.com/killme2008))
* Trevor Smith ([@trevorsummerssmith](https://github.com/trevorsummerssmith))
* Ben Hughes ([@schleyfox](https://github.com/schleyfox))
* Alexey Kachayev ([@kachayev](https://github.com/kachayev))
* Haitao Yao ([@haitaoyao](https://github.com/haitaoyao))
* Dan Dillinger ([@ddillinger](https://github.com/ddillinger))
* Kang Xiao ([@xiaokang](https://github.com/xiaokang))
* Gabriel Grant ([@gabrielgrant](https://github.com/gabrielgrant))
* Travis Wellman ([@travisfw](https://github.com/travisfw))
* Kasper Madsen ([@KasperMadsen](https://github.com/KasperMadsen))
* Michael Cetrulo ([@git2samus](https://github.com/git2samus))
* Thomas Jack ([@tomo](https://github.com/tomo))
* Nicolas Yzet ([@nicoo](https://github.com/nicoo))
* Fabian Neumann ([@hellp](https://github.com/hellp))
* Soren Macbeth ([@sorenmacbeth](https://github.com/sorenmacbeth))
* Ashley Brown ([@ashleywbrown](https://github.com/ashleywbrown))
* Guanpeng Xu ([@herberteuler](https://github.com/herberteuler))
* Vinod Chandru ([@vinodc](https://github.com/vinodc))
* Martin Kleppmann ([@ept](https://github.com/ept))
* Evan Chan ([@velvia](https://github.com/velvia))
* Sjoerd Mulder ([@sjoerdmulder](https://github.com/sjoerdmulder))
* Yuta Okamoto ([@okapies](https://github.com/okapies))
* Barry Hart ([@barrywhart](https://github.com/barrywhart))
* Sergey Lukjanov ([@Frostman](https://github.com/Frostman))
* Ross Feinstein ([@rnfein](https://github.com/rnfein))
* Junichiro Takagi ([@tjun](https://github.com/tjun))
* Bryan Peterson ([@Lazyshot](https://github.com/Lazyshot))
* Sam Ritchie ([@sritchie](https://github.com/sritchie))
* Stuart Anderson ([@emblem](https://github.com/emblem))
* Lorcan Coyle ([@lorcan](https://github.com/lorcan))
* Andrew Olson ([@noslowerdna](https://github.com/noslowerdna))
* Gavin Li ([@lyogavin](https://github.com/lyogavin))
* Tudor Scurtu ([@tscurtu](https://github.com/tscurtu))
* Homer Strong ([@strongh](https://github.com/strongh))
* Sean Melody ([@srmelody](https://github.com/srmelody))
* Jake Donham ([@jaked](https://github.com/jaked))
* Ankit Toshniwal ([@ankitoshniwal](https://github.com/ankitoshniwal))
## Acknowledgements
YourKit is kindly supporting open source projects with its full-featured Java Profiler. YourKit, LLC is the creator of innovative and intelligent tools for profiling Java and .NET applications. Take a look at YourKit's leading software products: [YourKit Java Profiler](http://www.yourkit.com/java/profiler/index.jsp) and [YourKit .NET Profiler](http://www.yourkit.com/.net/profiler/index.jsp).
0.9.2-incubating
# Running Apache Storm Securely
The current release of Apache Storm offers no authentication or authorization.
It does not encrypt any data being sent across the network, and does not
attempt to restrict access to data stored on the local file system or in
Apache Zookeeper. As such, there are a number of precautions you may
want to take outside of Storm itself to be sure Storm is running securely.
The exact details of how to set up these precautions vary a lot and are beyond
the scope of this document.
## Network Security
It is generally a good idea to enable a firewall and restrict incoming network
connections to only those originating from the cluster itself and from trusted
hosts and services. A complete list of the ports Storm uses is below.
If the data your cluster is processing is sensitive, it might be best to set up
IPsec to encrypt all traffic being sent between the hosts in the cluster.
### Ports
| Default Port | Storm Config | Client Hosts/Processes | Server |
|--------------|--------------|------------------------|--------|
| 2181 | `storm.zookeeper.port` | Nimbus, Supervisors, and Worker processes | Zookeeper |
| 6627 | `nimbus.thrift.port` | Storm clients, Supervisors, and UI | Nimbus |
| 8080 | `ui.port` | Client Web Browsers | UI |
| 8000 | `logviewer.port` | Client Web Browsers | Logviewer |
| 3772 | `drpc.port` | External DRPC Clients | DRPC |
| 3773 | `drpc.invocations.port` | Worker Processes | DRPC |
| 670{0,1,2,3} | `supervisor.slots.ports` | Worker Processes | Worker Processes |
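
As a rough way to check that a firewall is doing what you expect, the sketch below tries a TCP connection to each default port in the table from whatever host it runs on. The function names and the timeout are illustrative assumptions, not part of Storm, and a cluster that overrides any port in `storm.yaml` would need the numbers adjusted.

```python
import socket

# Default ports copied from the table above; a deployment may override them.
DEFAULT_PORTS = {
    "storm.zookeeper.port": 2181,
    "nimbus.thrift.port": 6627,
    "ui.port": 8080,
    "logviewer.port": 8000,
    "drpc.port": 3772,
    "drpc.invocations.port": 3773,
}
SUPERVISOR_SLOT_PORTS = [6700, 6701, 6702, 6703]


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, filtered, timed out, or unresolvable: treat as not reachable.
        return False


def reachable_ports(host, timeout=1.0):
    """Map each config key (hypothetical labels for slot ports) to reachability."""
    results = {
        name: is_port_open(host, port, timeout)
        for name, port in DEFAULT_PORTS.items()
    }
    for port in SUPERVISOR_SLOT_PORTS:
        results["supervisor.slots.ports[%d]" % port] = is_port_open(host, port, timeout)
    return results
```

Running `reachable_ports("nimbus.example.com")` (a placeholder hostname) from an untrusted host should ideally report `False` for every entry; running it from inside the cluster shows which services are actually listening.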
### UI/Logviewer
The UI and logviewer processes provide a way to not only see what a cluster is
doing, but also manipulate running topologies. In general these processes should
not be exposed except to users of the cluster. It is often simplest to restrict
these ports to only accept connections from local hosts, and then front them with another web server,
like Apache httpd, that can authenticate/authorize incoming connections and
proxy the connection to the storm process. To make this work the ui process must have
logviewer.port set to the port of the proxy in its storm.yaml, while the logviewers
must have it set to the actual port that they are going to bind to.
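
The split between the two `logviewer.port` values can be easy to get backwards, so here is a small sketch of the rule just described. The role names and the proxy port `8443` are hypothetical choices for illustration; only `logviewer.port` and its default of `8000` come from this document.

```python
def logviewer_port_for(role, proxy_port, bind_port):
    """Return the logviewer.port value for a host's storm.yaml.

    role is "ui" for the host running the UI process (it must point at the
    proxy) or "logviewer" for hosts running a logviewer (they bind directly).
    """
    if role == "ui":
        return proxy_port
    if role == "logviewer":
        return bind_port
    raise ValueError("unknown role: %r" % (role,))


def storm_yaml_line(role, proxy_port, bind_port):
    """Render the single storm.yaml line for the given role."""
    return "logviewer.port: %d" % logviewer_port_for(role, proxy_port, bind_port)
```

With a proxy assumed to listen on 8443, `storm_yaml_line("ui", 8443, 8000)` gives the line for the UI host and `storm_yaml_line("logviewer", 8443, 8000)` gives the line for each logviewer host.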
### Nimbus
Nimbus's Thrift port should be locked down as it can be used to control the entire
cluster including running arbitrary user code on different nodes in the cluster.
Ideally access to it is restricted to nodes within the cluster and possibly some gateway
nodes that allow authorized users to log into them and run storm client commands.
### DRPC
Each DRPC server has two different ports. The invocations port is accessed by worker
processes within the cluster. The other port is accessed by external clients that
want to query the topology. The external port should be restricted to hosts that you
want to be able to do queries.
### Supervisors
Supervisors act only as clients, not as servers, and as such don't need special restrictions.
### Workers
Worker processes receive data from each other. There is the option to encrypt this data using
Blowfish by setting `topology.tuple.serializer` to `backtype.storm.security.serialization.BlowfishTupleSerializer`
and setting `topology.tuple.serializer.blowfish.key` to a secret key you want your topology to use.
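
A minimal sketch of producing the two settings with a freshly generated key follows. It assumes the serializer expects the key as a hex-encoded string, which you should verify against the Storm version you run; the function name and the 16-byte default are illustrative. Blowfish itself accepts keys of 4 to 56 bytes (32 to 448 bits).

```python
import secrets

BLOWFISH_SERIALIZER = "backtype.storm.security.serialization.BlowfishTupleSerializer"


def blowfish_topology_config(key_bytes=16):
    """Build the topology settings named above with a random key.

    Assumption: the key is passed as a hex string. Check this against the
    serializer in your Storm release before relying on it.
    """
    if not 4 <= key_bytes <= 56:
        raise ValueError("Blowfish keys must be between 4 and 56 bytes")
    return {
        "topology.tuple.serializer": BLOWFISH_SERIALIZER,
        # token_hex draws from the OS CSPRNG and returns 2 hex chars per byte.
        "topology.tuple.serializer.blowfish.key": secrets.token_hex(key_bytes),
    }
```

The resulting dictionary can be merged into the topology configuration at submit time. Treat the key like any other secret: anyone who can read the topology configuration can decrypt the tuples.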
### Zookeeper
Zookeeper uses other ports for communication within the ensemble, the details of which
are beyond the scope of this document. You should look at restricting Zookeeper access
as well, because Storm does not set up any ACLs for the data it writes to Zookeeper.
@echo off
@rem Licensed to the Apache Software Foundation (ASF) under one
@rem or more contributor license agreements. See the NOTICE file
@rem distributed with this work for additional information
@rem regarding copyright ownership. The ASF licenses this file
@rem to you under the Apache License, Version 2.0 (the
@rem "License"); you may not use this file except in compliance
@rem with the License. You may obtain a copy of the License at
@rem
@rem http://www.apache.org/licenses/LICENSE-2.0
@rem
@rem Unless required by applicable law or agreed to in writing, software
@rem distributed under the License is distributed on an "AS IS" BASIS,
@rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
@rem See the License for the specific language governing permissions and
@rem limitations under the License.
set STORM_HOME=%~dp0
for %%i in (%STORM_HOME%.) do (
set STORM_HOME=%%~dpi
)
if "%STORM_HOME:~-1%" == "\" (
set STORM_HOME=%STORM_HOME:~0,-1%
)
if not exist %STORM_HOME%\lib\storm*.jar (
@echo +================================================================+
@echo ^| Error: STORM_HOME is not set correctly ^|
@echo +----------------------------------------------------------------+
@echo ^| Please set your STORM_HOME variable to the absolute path of ^|
@echo ^| the directory that contains the storm distribution ^|
@echo +================================================================+
exit /b 1
)
set STORM_BIN_DIR=%STORM_HOME%\bin
if not defined STORM_CONF_DIR (
set STORM_CONF_DIR=%STORM_HOME%\conf
)
@rem
@rem setup java environment variables
@rem
if not defined JAVA_HOME (
set JAVA_HOME=c:\apps\java\openjdk7
)
if not exist %JAVA_HOME%\bin\java.exe (
echo Error: JAVA_HOME is incorrectly set.
goto :eof
)
set JAVA=%JAVA_HOME%\bin\java
set JAVA_HEAP_MAX=-Xmx1024m
@rem
@rem check envvars which might override default args
@rem
if defined STORM_HEAPSIZE (
set JAVA_HEAP_MAX=-Xmx%STORM_HEAPSIZE%m
)
@rem
@rem CLASSPATH initially contains %STORM_CONF_DIR%
@rem
set CLASSPATH=%STORM_HOME%\*;%STORM_CONF_DIR%
set CLASSPATH=%CLASSPATH%;%JAVA_HOME%\lib\tools.jar
@rem
@rem add libs to CLASSPATH
@rem
set CLASSPATH=%CLASSPATH%;%STORM_HOME%\lib\*
if not defined STORM_LOG_DIR (
set STORM_LOG_DIR=%STORM_HOME%\logs
)
if not defined STORM_LOGBACK_CONFIGURATION_FILE (
set STORM_LOGBACK_CONFIGURATION_FILE=%STORM_HOME%\logback\cluster.xml
)
%JAVA% -client -Dstorm.options= -Dstorm.conf.file= -cp %CLASSPATH% backtype.storm.command.config_value java.library.path > temp.txt
FOR /F "delims=" %%i in (temp.txt) do (
FOR /F "tokens=1,* delims= " %%a in ("%%i") do (
if %%a == VALUE: (
set JAVA_LIBRARY_PATH=%%b
goto :storm_opts)
)
)
:storm_opts
set STORM_OPTS=-Dstorm.options= -Dstorm.home=%STORM_HOME% -Djava.library.path=%JAVA_LIBRARY_PATH%
set STORM_OPTS=%STORM_OPTS% -Dlogback.configurationFile=%STORM_LOGBACK_CONFIGURATION_FILE%
set STORM_OPTS=%STORM_OPTS% -Dstorm.log.dir=%STORM_LOG_DIR%
del /F temp.txt
if not defined STORM_SERVER_OPTS (
set STORM_SERVER_OPTS=-server
)
if not defined STORM_CLIENT_OPTS (
set STORM_CLIENT_OPTS=-client
)
:eof
@echo off
@rem Licensed to the Apache Software Foundation (ASF) under one
@rem or more contributor license agreements. See the NOTICE file
@rem distributed with this work for additional information
@rem regarding copyright ownership. The ASF licenses this file
@rem to you under the Apache License, Version 2.0 (the
@rem "License"); you may not use this file except in compliance
@rem with the License. You may obtain a copy of the License at
@rem
@rem http://www.apache.org/licenses/LICENSE-2.0
@rem
@rem Unless required by applicable law or agreed to in writing, software
@rem distributed under the License is distributed on an "AS IS" BASIS,
@rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
@rem See the License for the specific language governing permissions and
@rem limitations under the License.
@rem The storm command script
@rem
@rem Environment Variables
@rem
@rem JAVA_HOME The Java implementation to use.
@rem