Michael Factor, IBM Research
Editor’s note: This post was authored by Michael Factor, IBM Fellow.
As I sat on the plane returning from the OpenStack Summit in Tokyo, I couldn’t help but reflect on the tremendous value of open source. One might ask: why is IBM so interested in open source?
It’s not just about OpenStack and its various projects, such as Nova, Cinder, or Swift. The broad world of open source offers many benefits, and a number of these projects are being integrated with and used in IBM’s cloud, including OpenStack and others. Examples include Spark, Docker, Kafka, Elasticsearch, and Parquet, among others. Each of these projects brings value on its own – not least by enabling consistency and choice as to where a workload runs, since they can all be deployed in local, dedicated, and public clouds. But the really big value comes from combining projects to address real-world problems. It’s a definite case of the whole being greater than the sum of its parts. In short, 1+1=3.
We can look at all of these services as puzzle pieces that, when put together, solve real-world problems. In an ideal world, the different open source projects we need would all be available as services on public, private, and dedicated clouds, and integrating them would be as trivial as a point and a click.
For example, we should be able to take a message bus like Kafka and, with a simple configuration command and no coding, have it archive messages in an object store service. Once the data is in that service, another point-and-click could pull it into an analytics engine like Apache Spark.
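To give a feel for what such configuration-only archiving could look like, here is a hypothetical, Secor-style properties sketch; the property names are illustrative only and vary by tool and version:

```properties
# Kafka source: which broker to read from and which topics to archive
kafka.seed.broker.host=kafka-broker.example.com
secor.kafka.topic_filter=sensor-events.*

# Flush policy: upload an aggregated object once it reaches
# 64 MB or is 10 minutes old, whichever comes first
secor.max.file.size.bytes=67108864
secor.max.file.age.seconds=600

# Destination container in the object store
secor.swift.container=archived-messages
```

The point is that the developer expresses only *what* to archive and *where*; no consumer code needs to be written.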
The best (and worst) of open source times
It is the best of times for developers because of the wide array of open source puzzle pieces out there that can be used to build a solution. It is also the worst of times, since work is still needed to make sure these pieces snap together easily, and too often that burden falls on the developer.
But my colleagues at IBM’s Research lab in Haifa and I are working on integrating several of these projects – so we really can snap those puzzle pieces together!
For example, my colleague Gil Vernik, a cloud storage, security, and analytics expert in the lab, is enabling Tachyon to use Swift as its persistent underlying storage system. Tachyon is a very active, but relatively new, open source project that provides an in-memory file system with automated tiering. While Tachyon is general purpose, one of its best-known use cases is improving Spark’s performance over conventional stores. Because Tachyon is in-memory, it needs somewhere to persist data that is no longer being used, or when it runs out of space in its higher-performance tiers.
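From a user’s perspective, pointing Tachyon at a Swift under-filesystem is essentially a matter of configuration. The sketch below is illustrative; the exact property names and the authentication scheme depend on the Tachyon version and the Swift deployment:

```properties
# Use a Swift container as Tachyon's persistent under-filesystem
tachyon.underfs.address=swift://my-container.my-swift-provider/

# Credentials for the Swift identity endpoint (values are placeholders)
fs.swift.auth.url=https://identity.example.com/v2.0/tokens
fs.swift.tenant=my-tenant
fs.swift.user=my-user
fs.swift.password=changeme
```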
When Gil first started this work, the main choices for persisting data were HDFS or Amazon S3 (other options exist today). Both are good solutions, but each has its limitations: HDFS was not designed as a long-term, multi-tenant store, and Amazon’s S3 is part of a public cloud. By adding support for OpenStack Swift, there is now a multi-tenant object store under Tachyon that can provide a long-term, cost-effective persistent store.
And Guy Hadash, another colleague on our cloud team, is developing a solution to aggregate and store messages from the Kafka message bus in OpenStack Swift. The visual bookmarking site Pinterest created a small project called Secor, which knows how to subscribe to Kafka, aggregate messages, and then put them in Amazon’s S3. The messages are stored sequentially in an object, which can later be retrieved for batch processing.
Guy extended Secor so it could store messages via the Swift API. This gives Secor users the choice of running Kafka-Secor-Swift in all three deployment models: local, dedicated, and public. He then stored the data in Parquet format, rather than just as a list of messages. Parquet is designed for objects or files that contain tabular data (like CSV); the data is stored in a columnar format, enabling efficient compression and the retrieval of only selected columns from the store. Lastly, Guy annotated the created objects with metadata, such as the minimum and maximum values of various columns in the table. We have already implemented prototypes of all three steps. A patch for the first step is already part of the community Secor code, and we are starting to work with the community on the second.
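To illustrate the third step, here is a simplified, hypothetical sketch in Python (not Secor’s actual Java code) of computing min/max column metadata for a batch of tabular messages before the aggregated object is uploaded:

```python
def column_stats(rows):
    """Compute min/max metadata for each column of a batch of
    tabular messages (each message is a dict of column -> value)."""
    stats = {}
    for row in rows:
        for col, val in row.items():
            if col not in stats:
                stats[col] = {"min": val, "max": val}
            else:
                stats[col]["min"] = min(stats[col]["min"], val)
                stats[col]["max"] = max(stats[col]["max"], val)
    return stats

# Example: a batch of traffic-sensor readings
batch = [
    {"speed": 42.0, "density": 11},
    {"speed": 35.5, "density": 19},
    {"speed": 48.2, "density": 7},
]

# These ranges would be attached to the created object as user
# metadata, so later queries can skip whole objects whose ranges
# cannot possibly match a predicate.
print(column_stats(batch))
```

Attaching such ranges as object metadata lets an analytics engine prune objects that cannot contain matching rows without reading their contents.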
Then there’s what we are doing with our partners in the context of the EU COSMOS project. Paula Ta-Shma, a cloud security and analytics expert on our team, presented this work at the recent Spark Summit in Amsterdam. The goal of this work is to help improve the timeliness of Madrid’s city buses by automating their reactions to changes in traffic.
Data is available from thousands of static sensors. Using Node-RED (an open source project for defining data flows), the data is retrieved from the sensors and put on Kafka. As described above, Secor aggregates, formats, and annotates this data before storing it in Swift. Spark is then used to analyze the data: using standard Spark machine learning, the solution learns the expected values for speed and traffic density on different days and at different times. A complex event processing engine, which also subscribes to the messages from the sensors, takes the threshold values and can issue warnings to trigger corrective action, such as rerouting a bus or changing the traffic light pattern, when needed.
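The final, rule-based step can be sketched as follows. This is a hypothetical simplification in Python; in the actual system the thresholds are learned by Spark ML and the checks run inside a complex event processing engine:

```python
# Expected values per (day-of-week, hour), as might be learned
# offline by Spark ML from the archived sensor history.
expected = {
    ("Mon", 8): {"speed": 24.0, "density": 85.0},
    ("Mon", 14): {"speed": 41.0, "density": 40.0},
}

def check_reading(day, hour, speed, density, tolerance=0.25):
    """Return a warning string if a live reading deviates from the
    expected value by more than `tolerance` (fractional), else None."""
    exp = expected.get((day, hour))
    if exp is None:
        return None  # no learned baseline for this time slot
    if speed < exp["speed"] * (1 - tolerance):
        return "WARN: traffic slower than expected, consider rerouting"
    if density > exp["density"] * (1 + tolerance):
        return "WARN: congestion above expected, adjust traffic lights"
    return None

# A Monday 8am reading well below the expected speed triggers a warning
print(check_reading("Mon", 8, speed=15.0, density=90.0))
```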
These examples clearly show the tremendous value of putting together puzzle pieces from different open source projects to build something even better. We are continuing to work on integrating services and look forward to describing some of our other efforts in future blogs.
Labels: COSMOS, Gil Vernik, Guy Hadash, IBM Research - Haifa, IoT, Kafka, Michael Factor, open stack, Parquet, Paula Ta-shma, Secor, Spark, storage, Swift