Published on 28/06/2017 | Written by Steve Singer
There is a clear split between legacy and next-generation approaches to software development, writes Steve Singer…
Legacy vendors in the big data space generally have internal development organisations dedicated to building proprietary, bespoke software. It’s an approach that has worked well over the years, but it is being supplanted by open source. That’s because the big data market has always moved fast, and it has had an element of open source from the beginning: a large proportion of what the major Hadoop vendors deliver, for example, is based on open source. These vendors, and those building complementary technology, are best placed to take advantage of new big data trends (such as Spark Streaming) and to build solutions that add value for customers.

Traditional legacy and proprietary approaches to data integration still have their place. These vendors have solid products, reliable technology and well-funded development teams. However, their products are typically built on a traditional architecture that may not adapt easily to the big data environment. Such products may work effectively for businesses that are doing things the way they always have. For straightforward data integration, data quality and ETL requirements, where existing processes and approaches are being retained, there may be little need to move away from the proprietary vendors already entrenched in your information architecture.

The difficulty comes when organisations want to launch new projects, drive business transformation or refresh products. Such moments can bring concerns around cost and flexibility. For example, seeking additional functionality around big data ingestion can run up against legacy vendor licensing models. Buying perpetual software means major upfront costs, and once the decision is made it can be hard to modify or partially cancel the licence should business needs change (or if the software simply doesn’t work). In addition, traditional legacy architectures are often unwieldy, which makes it difficult for businesses to adapt to evolving big data projects and environments.

By contrast, a flexible, subscription-based open source environment offers multiple benefits for businesses that want to explore big data. Subscription models let an organisation dip a toe in, and because licensing costs are often lower, it can experiment without major overhead.

There’s more to it than the cost argument. The collaborative, partnership approach to product development associated with open source means the ability to tap into the work of whole communities of people, potentially accelerating the pace of innovation. Consider the latest high-impact big data Apache projects: each one has multiple organisations and individuals focused on its development, as well as on the creation of new projects.

Such are the benefits it delivers that open source is becoming a standard approach in the big data arena. It is helping to drive innovative new technologies like Apache Spark and, subsequently, Spark Streaming (a minimal sketch of which follows this article), as well as helping to fuel emerging projects like Apache Beam. While it is easy for open source vendors to support such projects, it often takes a major ‘crowbarring effort’ for legacy vendors to do so, and often, by the time they do, the rest of the world has moved on.

Just as the cloud has moved from disruptive force into the mainstream, the same shift is now happening with open source. That’s why growing numbers of businesses in the big data integration field are adopting an open-source-first approach.

Steve Singer is ANZ Country Manager, Talend.
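To ground the discussion, here is a minimal sketch of the classic Spark Streaming word count, in the style of the example popularised by the Apache Spark documentation. The local master setting and the localhost:9999 text source are assumptions for illustration only; this is a sketch of the open source API the article refers to, not any vendor’s product.

# Minimal Spark Streaming word count (PySpark).
# Assumes a local Spark installation and a process writing lines of
# text to localhost:9999 (for example: nc -lk 9999). Both are
# illustrative assumptions, not requirements from the article.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "NetworkWordCount")  # two local worker threads
ssc = StreamingContext(sc, 1)                      # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)    # ingest the text stream
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))   # per-batch word counts
counts.pprint()                                    # print each batch's counts

ssc.start()              # start the streaming computation
ssc.awaitTermination()   # run until interrupted

Each micro-batch is processed with the same operations used in batch Spark, which is part of why the framework spread so quickly through the open source community.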