How big is ‘big data’?

Projections about the growth of data significantly underestimate how much data is going to be created.

I came across a link to a new report from IDC called the 2010 Digital Universe Study. The report echoes what we've been telling our clients for the past year: the projections of the past few years about the growth of data significantly underestimate how much data is going to be created. Some highlights of the report:

• In 2010, the Digital Universe (a fancy term for all the data created by consumers and businesses on earth, including video, audio, documents, etc.) will grow to 1.2 zettabytes, or 1.2 million petabytes.

• By 2020, the Digital Universe will be 44 times as large as it was in 2009.

• Surprisingly, the number of objects (i.e., files that contain digital data) will increase faster than the total amount of data, due to smaller file sizes - even though lots of large video and audio files are being created, so are massive amounts of small files created by devices, sensors, etc.
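To put the headline numbers in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes, purely for illustration, that the 44-fold increase is spread evenly as constant compound growth over the 11 years from 2009 to 2020; the report itself does not say growth is that smooth, and the function names are mine. Under that assumption the Digital Universe grows by roughly 41 percent a year.

```python
# Figures quoted from the report highlights above.
GROWTH_FACTOR = 44       # size in 2020 relative to 2009
YEARS = 2020 - 2009      # 11 yearly steps

# 1 zettabyte = 10**21 bytes and 1 petabyte = 10**15 bytes.
PB_PER_ZB = 1_000_000


def implied_annual_growth(factor: float, years: int) -> float:
    """Constant yearly multiplier that compounds to `factor` over `years`.

    Assumption: growth is smooth and compounding, which is a simplification.
    """
    return factor ** (1 / years)


def zettabytes_to_petabytes(zb: float) -> float:
    """Convert zettabytes to petabytes (decimal units)."""
    return zb * PB_PER_ZB


if __name__ == "__main__":
    rate = implied_annual_growth(GROWTH_FACTOR, YEARS)
    print(f"Implied growth: about {rate - 1:.0%} per year")
    print(f"1.2 ZB = {zettabytes_to_petabytes(1.2):,.0f} PB")
```

A steady 41 percent a year may not sound dramatic, but compounding is what turns it into a 44-fold increase in little more than a decade.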

The report goes on to highlight some of the biggest issues the future torrent of data will pose:

Searching: How to find a digital needle in a gigantic data haystack? Most of the data will be unstructured, implying new kinds of searching mechanisms are required.

Data Tiers: If you thought Hierarchical Storage Management was important before, imagine how necessary it will be in the face of zettabytes of data. A strategy to define a layered approach to storage, based on historical use, immediacy of need, and cost of storage will be necessary.

Privacy and Compliance: How can the increasing requirements of privacy and compliance be controlled with so much data under management?

Headcount Mismatch: While the amount of data will increase 44 times, and the number of files will increase 67 times, the number of IT professionals available to manage it all will increase by only 1.4 times.
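The headcount mismatch is easier to appreciate as a per-person figure. The short Python sketch below simply divides the report's growth multiples by one another; the function names are my own, and it assumes the workload is shared evenly across staff, which is obviously a simplification.

```python
# Growth multiples quoted from the report.
DATA_GROWTH = 44.0    # total data
FILE_GROWTH = 67.0    # number of files
STAFF_GROWTH = 1.4    # headcount


def per_person_multiplier(workload_growth: float, staff_growth: float) -> float:
    """How much more each person must handle, assuming an even split."""
    return workload_growth / staff_growth


def average_file_size_change(data_growth: float, file_growth: float) -> float:
    """Ratio of the future average file size to today's average file size."""
    return data_growth / file_growth


if __name__ == "__main__":
    data_each = per_person_multiplier(DATA_GROWTH, STAFF_GROWTH)
    files_each = per_person_multiplier(FILE_GROWTH, STAFF_GROWTH)
    size_ratio = average_file_size_change(DATA_GROWTH, FILE_GROWTH)
    print(f"Data per person:  about {data_each:.0f}x today's level")
    print(f"Files per person: about {files_each:.0f}x today's level")
    print(f"Average file size: about {size_ratio:.0%} of today's")
```

In other words, each person ends up responsible for roughly 31 times as much data and about 48 times as many files, while the average file shrinks to around two-thirds of its current size.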

Cloud Computing: It's the Economics, Stupid

The report notes that by 2020, much of this data will be held in cloud environments or will be "touched by cloud," which means data that transits through a cloud service or is temporarily held in a cloud application. The report estimates that perhaps 15 percent of all data will be held in the cloud, and that around one-third will live in or pass through the cloud. Frankly, I think that underestimates what's going to be in the cloud, for this reason:

It's clear that the growth of data is accelerating, which is to say that much of it will be created late in the 2010-2020 decade. This means the average corporation is going to experience an increasing deluge of data: no matter how much it has already invested in storage, the amount it needs will keep growing faster as the decade goes on. That will require ever-increasing amounts of storage and an ever-increasing capital budget for storage devices, not to mention more headcount. There's a truism in economics that something that can't go on, won't go on. I just don't see most companies funding an ever-increasing number of storage devices and employees to manage them. Most companies can't afford the projected growth of storage, so they won't go down the road of on-site storage. Long before they reach the logical conclusion of how much capital and headcount the increased storage requires, they'll turn to specialized providers who have figured out how to manage enormous amounts of storage more cost-effectively.
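How back-loaded is that growth? As a rough illustration, the Python sketch below assumes constant compound growth that reaches 44 times the 2009 level by 2020, and treats the Digital Universe as a stock that accumulates over time. Both are my simplifications, not the report's model, and the function name is hypothetical.

```python
GROWTH_FACTOR = 44    # 2020 size relative to 2009, per the report
TOTAL_YEARS = 11      # 2009 through 2020


def share_added_in_last_years(factor: float, total_years: int, last_years: int) -> float:
    """Fraction of the final size that is added during the last `last_years`.

    Assumes constant compound growth from 1 to `factor` over `total_years`.
    """
    size_before = factor ** ((total_years - last_years) / total_years)
    return 1 - size_before / factor


if __name__ == "__main__":
    for n in (1, 3, 5):
        share = share_added_in_last_years(GROWTH_FACTOR, TOTAL_YEARS, n)
        print(f"Added in the final {n} year(s): about {share:.0%} of the 2020 total")
```

Even with a steady growth rate, close to two-thirds of the 2020 total arrives in the final three years. If growth actually accelerates, the storage bill is even more heavily weighted toward the end of the decade.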

Another reason the report underestimates how much data will be in the cloud is that much of the data will, increasingly, originate in the cloud, because of the use of SaaS applications and the hosting of custom applications in IaaS clouds. Just as the rate of change in storage amounts will increase through the decade, so too will the number of cloud applications - which means the data associated with those apps will be created in the cloud to begin with. Another way to look at this is: what proportion of applications do you think will reside in external cloud environments by 2020? I'm betting it's significantly more than 15 percent of all apps.

The report then turns to privacy and compliance issues and concludes that, despite the best efforts of IT groups, the proportion of data left inadequately protected will increase throughout the next decade, because the business units that fund central IT expenditure will not make enough investment available. Unless driven by specific legal requirements (e.g., Sarbanes-Oxley) or actual data breaches, data privacy and compliance run a poor second (and a long way back) to the functionality requirements of business units. And this is not to mention the use of external cloud providers by business units, which makes sidestepping IT data security requirements even easier.

The report concludes with some predictions:

• The increased complexity of managing digital information will be an incentive to move to cloud services.

• Within datacentres, expect continued pressure for datacentre automation, consolidation, and virtualisation.

• Expect more end-user self-service.

• Expect bottlenecks in key specialties such as security, information management, advanced content management, and real-time processing.

If you work in IT, you owe it to yourself to read this report and consider its implications. I may sound like a broken record, but the future of IT is going to look a lot different than the past - even the recent past. This report offers information to guide your strategy.


Bernard Golden is CEO of consulting firm HyperStratus and the author of Virtualization for Dummies.

