The ‘data lake’ concept is being touted by vendors as an essential component for capitalising on Big Data opportunities.
Gartner, however, points out there is little alignment between vendors about what comprises a data lake, or how to get value from it.
"In broad terms, data lakes are marketed as enterprise-wide data management platforms for analysing disparate sources of data in its native format," Nick Heudecker, research director at Gartner, says in a statement.
"The idea is simple: instead of placing data in a purpose-built data store, you move it into a data lake in its original format. This eliminates the upfront costs of data ingestion, like transformation. Once data is placed into the lake, it's available for analysis by everyone in the organisation."
However, while the marketing hype suggests audiences throughout an enterprise will leverage data lakes, this positioning assumes all those audiences are highly skilled at data manipulation and analysis, as data lakes lack semantic consistency and governed metadata.
"The need for increased agility and accessibility for data analysis is the primary driver for data lakes," says Andrew White, vice president at Gartner. "Nevertheless, while it is certainly true that data lakes can provide value to various parts of the organisation, the proposition of enterprise-wide data management has yet to be realised."
Data lakes focus on storing disparate data and ignore how or why data is used, governed, defined and secured. The data lake concept aims to solve two problems. The first is information silos: rather than having dozens of independently managed collections of data, an organisation can combine these sources in the unmanaged data lake. The consolidation theoretically results in increased information use and sharing, while cutting costs through server and licence reduction.
The second problem data lakes conceptually tackle pertains to Big Data. Big Data projects require a large amount of varied information. The information is so varied that it's not clear what it is when it is received, and confining it in something as structured as a data warehouse or relational database management system constrains future analysis.
"Addressing both of these issues with a data lake certainly benefits IT in the short term in that IT no longer has to spend time understanding how information is used — data is simply dumped into the data lake," says White.
"However, getting value out of the data remains the responsibility of the business end user. Of course, technology could be applied or added to the lake to do this, but without at least some semblance of information governance, the lake will end up being a collection of disconnected data pools or information silos all in one place."
Gartner lists at least three substantial risks organisations can face.
The most important, it says, is the inability to determine data quality, or to trace the lineage of findings by other analysts or users who have previously found value in the same data in the lake. By definition, a data lake accepts any data, without oversight or governance. Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a data swamp. And without metadata, every subsequent use of data means analysts start from scratch.
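To make the metadata point concrete, here is a minimal sketch in Python of what "descriptive metadata and a mechanism to maintain it" might look like in practice. It is an illustration only, not something Gartner or any vendor prescribes: the `DatasetRecord` fields, the `Catalogue` class and the file paths are all hypothetical names chosen for the example. The idea is that nothing enters the lake without a description and a recorded origin, so a later analyst can trace where a finding came from rather than starting from scratch.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DatasetRecord:
    """Descriptive metadata for one object in the lake (hypothetical schema)."""
    path: str
    owner: str
    description: str
    source_system: str
    derived_from: list[str] = field(default_factory=list)
    registered_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class Catalogue:
    """In-memory registry; a real deployment would persist this somewhere."""

    def __init__(self):
        self._records = {}

    def register(self, record):
        # Refuse data that arrives without a description.
        if not record.description.strip():
            raise ValueError("description is required")
        # Refuse derived data whose parents were never registered.
        missing = [p for p in record.derived_from if p not in self._records]
        if missing:
            raise ValueError(f"unregistered parent datasets: {missing}")
        self._records[record.path] = record

    def lineage(self, path):
        """Return every ancestor of `path`, nearest first (breadth-first)."""
        seen = []
        queue = list(self._records[path].derived_from)
        while queue:
            parent = queue.pop(0)
            if parent not in seen:
                seen.append(parent)
                queue.extend(self._records[parent].derived_from)
        return seen


# Example data: all paths, owners and systems below are made up.
catalogue = Catalogue()
catalogue.register(DatasetRecord(
    "raw/orders.csv", "sales-ops", "Daily order export", "erp"))
catalogue.register(DatasetRecord(
    "clean/orders.parquet", "analytics", "Deduplicated orders", "erp",
    derived_from=["raw/orders.csv"]))
catalogue.register(DatasetRecord(
    "reports/revenue.csv", "finance", "Monthly revenue summary", "analytics",
    derived_from=["clean/orders.parquet"]))

print(catalogue.lineage("reports/revenue.csv"))
# ['clean/orders.parquet', 'raw/orders.csv']
```

The registration check is the "mechanism to maintain it" in miniature: governance is enforced at the point of ingestion rather than reconstructed afterwards.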
Another risk is security and access control. Data can be placed into the data lake with no oversight of the contents. Many data lakes are being used for data whose privacy and regulatory requirements are likely to represent risk exposure. The security capabilities of central data lake technologies are still embryonic. These issues will not be addressed if left to non-IT personnel.
Finally, it says, performance aspects should not be overlooked. Tools and data interfaces simply cannot perform at the same level against a general-purpose store as they can against optimised and purpose-built infrastructure. For these reasons, Gartner recommends organisations focus on semantic consistency and performance in upstream applications and data stores instead of information consolidation in a data lake.
"There is always value to be found in data but the question your organisation has to address is this — do we allow or even encourage one-off, independent analysis of information in silos or a data lake, bringing said data together, or do we formalise to a degree that effort, and try to sustain the value-generating skills we develop?" says White. "If the decision tends toward the latter, it is beneficial to move beyond a data lake concept quite quickly in order to develop a more robust logical data warehouse strategy."