Apache Storm 2.0 distributed computing system available

A significant release of the distributed event processing system Apache Storm 2.0 has been published, notable for the transition to a new core implemented in Java instead of the previously used Clojure.

The project makes it possible to organize guaranteed processing of various kinds of events in real time. For example, Storm can be used to analyze data streams on the fly, run machine learning tasks, organize continuous computation, implement RPC, ETL, and so on. The system supports clustering and fault-tolerant configurations, offers a guaranteed data processing mode, and delivers performance sufficient to handle more than a million requests per second on a single cluster node.

Integration with various queuing systems and database technologies is supported. The Storm architecture is built around receiving and processing constantly updated, unstructured data streams with arbitrarily complex handlers that can be partitioned across different stages of the computation. The project was handed over to the Apache community after Twitter acquired BackType, the original developer of the framework. In practice, Storm was used at BackType to analyze how events spread through microblogs by comparing new tweets and the links they contain on the fly (for example, to assess how external links or announcements published on Twitter are relayed by other users).

Storm's functionality is often compared to the Hadoop platform, with the key difference that data is not kept in storage but arrives from outside and is processed in real time. Storm has no built-in storage layer, and an analytical query keeps being applied to incoming data until it is cancelled (where Hadoop runs finite-time MapReduce jobs, Storm uses continuously running "topologies"). Execution of handlers can be spread across several servers: Storm automatically parallelizes the work across different nodes of the cluster.
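
For illustration, a minimal sketch of such a topology built with the standard TopologyBuilder API: a spout feeds words into a bolt that runs in four parallel copies spread over the cluster. The spout and bolt here are illustrative (TestWordSpout is a test spout bundled with Storm; LengthBolt is defined inline), and exact packaging may differ between distributions.

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.testing.TestWordSpout;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class WordTopology {

        // Illustrative bolt: turns each incoming word into (word, length).
        public static class LengthBolt extends BaseBasicBolt {
            @Override
            public void execute(Tuple input, BasicOutputCollector collector) {
                String word = input.getString(0);
                collector.emit(new Values(word, word.length()));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word", "length"));
            }
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // Two copies of the test spout feed a stream of words.
            builder.setSpout("words", new TestWordSpout(), 2);
            // Four parallel copies of the bolt; Storm spreads them over the cluster nodes.
            builder.setBolt("length", new LengthBolt(), 4).shuffleGrouping("words");

            Config conf = new Config();
            conf.setNumWorkers(2); // two worker JVM processes across the cluster
            StormSubmitter.submitTopology("word-length", conf, builder.createTopology());
        }
    }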

The system was originally written in Clojure and runs on the JVM. The Apache Foundation launched an initiative to move Storm to a new core written in Java, and the results of that effort are offered in the Apache Storm 2.0 release. All base components of the platform have been rewritten in Java. Support for writing handlers in Clojure has been retained, but is now offered in the form of bindings. Storm 2.0.0 requires Java 8. The multithreading model has been completely redesigned, which yields a noticeable increase in performance (for some topologies, latency was reduced by 50-80%).

The new version also introduces a typed Streams API that allows handlers to be defined using functional-style operations. The new API is implemented on top of the regular core API and supports automatic consolidation of operations to optimize their processing. The Windowing API for window-based operations gained support for saving and restoring state to a backend.
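
A minimal sketch of what a pipeline on the new Streams API looks like (class and method names follow the Streams API as documented for 2.x and should be verified against the installed release; TestWordSpout is the test spout used above):

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.streams.Pair;
    import org.apache.storm.streams.StreamBuilder;
    import org.apache.storm.streams.mappers.ValueMapper;
    import org.apache.storm.testing.TestWordSpout;

    public class StreamsWordCount {
        public static void main(String[] args) throws Exception {
            StreamBuilder builder = new StreamBuilder();

            // Interpret the first field of each spout tuple as a String, then
            // chain functional-style operations over the resulting stream.
            builder.newStream(new TestWordSpout(), new ValueMapper<String>(0))
                   .filter(word -> word.length() > 3)    // drop short words
                   .mapToPair(word -> Pair.of(word, 1))  // key the stream by word
                   .countByKey()                         // rolling count per word
                   .print();                             // log the results

            StormSubmitter.submitTopology("streams-word-count", new Config(), builder.build());
        }
    }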

The scheduler that launches handlers can now take additional resources into account when making placement decisions, beyond CPU and memory, such as network and GPU resources. A large number of improvements have been made to the integration with the Kafka platform. The access control system has been extended with the ability to create administrator groups and delegation tokens. SQL and metrics support has also been improved, and the admin interface gained new commands for debugging cluster state.
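
As a rough sketch of how such resource hints look at the topology level: setCPULoad/setMemoryLoad are the long-standing resource-aware scheduler hints, while the addResource call and the "gpu.count" resource name are assumptions based on the generic resources mechanism described for 2.0 and should be checked against the release (LengthBolt is the illustrative bolt from the earlier sketch).

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.testing.TestWordSpout;
    import org.apache.storm.topology.TopologyBuilder;

    public class ResourceHints {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();

            builder.setSpout("words", new TestWordSpout(), 2)
                   .setCPULoad(25.0)       // hint: 25% of a core per executor
                   .setMemoryLoad(256.0);  // hint: 256 MB on-heap per executor

            builder.setBolt("length", new WordTopology.LengthBolt(), 4)
                   .shuffleGrouping("words")
                   .setCPULoad(100.0)
                   .setMemoryLoad(512.0)
                   // Generic resource hint (assumed name); the supervisors must be
                   // configured to advertise the same resource for it to be usable.
                   .addResource("gpu.count", 1.0);

            StormSubmitter.submitTopology("resource-aware", new Config(), builder.createTopology());
        }
    }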

Storm applications:

  • Processing streams of new data or database updates in real time;
  • Continuous computing: Storm can execute continuous queries and process continuous streams, delivering results to the client in real time;
  • Distributed remote procedure calls (DRPC): Storm can be used to parallelize the execution of resource-intensive queries. A job ("topology") in Storm is a function distributed across the nodes that waits for messages to process; on receiving a message, it processes it in its local context and returns the result. Examples of distributed RPC include processing search queries in parallel or performing operations on a large number of sets (see the client-side sketch after this list).
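
A sketch of the client side of such a DRPC call, assuming a DRPC topology is already running and has registered a "reach" function; the host name and function name are placeholders, and 3772 is the default DRPC port:

    import java.util.Map;

    import org.apache.storm.utils.DRPCClient;
    import org.apache.storm.utils.Utils;

    public class DrpcCall {
        public static void main(String[] args) throws Exception {
            Map<String, Object> conf = Utils.readStormConfig();

            // Placeholder host and function name; a running DRPC topology must
            // have registered "reach" for this call to succeed.
            DRPCClient client = new DRPCClient(conf, "drpc.example.org", 3772);
            try {
                String result = client.execute("reach", "https://example.com/page");
                System.out.println("reach = " + result);
            } finally {
                client.close();
            }
        }
    }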

Storm features:

  • A simple programming model that greatly simplifies real-time data processing;
  • Support for any programming language. Modules are available for Java, Ruby and Python, and adapting other languages is straightforward thanks to a very simple communication protocol that takes about 100 lines of code to implement;
  • Fault tolerance: to run a data processing job, a jar file with the code is submitted; Storm distributes this jar across the cluster nodes on its own, wires up the handlers associated with it, and sets up monitoring. When the job ends, the code is automatically deactivated on all nodes;
  • Horizontal scalability: all computation is performed in parallel, and when load on the cluster grows it is enough to simply add new nodes;
  • Reliability: Storm guarantees that every incoming message is fully processed at least once. A message is processed exactly once only if it passes through all handlers without errors; if problems occur, failed processing attempts are retried (see the acking sketch after this list);
  • Speed: Storm's code is written with performance in mind and relies on the ZeroMQ messaging system.
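
The at-least-once guarantee relies on tuple anchoring and acknowledgement in the handlers. A minimal sketch of a bolt that participates in it (the bolt and the "word" input field are illustrative):

    import java.util.Map;

    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    // Every emitted tuple is anchored to the input tuple, and the input is acked
    // only after a successful emit. If processing throws, the tuple is failed and
    // the spout will eventually replay it.
    public class UppercaseBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            try {
                String word = input.getStringByField("word"); // assumes an upstream "word" field
                // Anchoring: pass the input tuple so downstream failures propagate
                // back to the originating spout tuple.
                collector.emit(input, new Values(word.toUpperCase()));
                collector.ack(input);
            } catch (Exception e) {
                collector.fail(input); // triggers a replay attempt
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word_upper"));
        }
    }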

Source: opennet.ru
