It is a ‘great time’ to be a data scientist, says Dr Rami Mukhtar, senior researcher, National ICT Australia (NICTA). A data scientist, he says, “is truly a unique fusion of an individual who has a very deep understanding of the science of machine learning, which is in itself a fusion of both the optimisation theory and probability in statistics”.
Mukhtar heads NICTA’s Big Data Analytics project, which he set up two years ago. The programme has developed and contributed ‘Scoobi’ to the open source ecosystem. Scoobi is becoming a popular productivity framework for the Hadoop big data storage and processing platform.
Big Data, he says, “is a science of fielding algorithms that enable machines to recognise complex patterns in data. It fuses machine learning with a very deep understanding of computer science and algorithms and that, of course, is key to being able to take this machine learning and deploy it in a very scalable way. That is what a data scientist is, somebody who can deploy machine learning in a scalable way.”
Big Data is “the opportunity to really collect data sources, both big and small, in their source form, or their raw form, in one location or one place, unencumbered by the boundaries of a business or the boundaries of information silos across the business”, says Mukhtar.
The process provides enterprises an “amazing opportunity to deploy analytics for the first time right across information management systems that may be fragmented across a business.”
He says the practice differs markedly from the past, when enterprises would start with the questions they wanted to ask and then have analysts structure or extract the variables or features most powerful in predicting the answers.
“In a big data mindset, it's quite different. You've got all of this data, now analysts can go right to the source data and say, what sort of algorithms can I run onto that data to actually distil out meaning from that data? We can now leverage advances in the sciences which enable us to do that in a more automated, rapid way.
“Now analysts can say, well I have got this idea. I've got this hypothesis, I want to test it out, go straight to the source data, deploy it and check it out. Does the hypothesis check out or not?”
He says a data scientist can come from disciplines that include computer science or statistics.
“But the one I see more commonly is a computer scientist to actually be trained in the science of machine learning. A postgraduate degree in machine learning would typically be a good combination with a computer science bachelor degree.”
For computer science graduates, “I think it is a very exciting area,” he says.
“If you feel you are a very analytical person and the IT side of computer science you find maybe a bit bland, then data science is a very interesting discipline.”
“It is really about taking the analytical mind and combining it with computer science skill to actually deliver a very powerful capability which is highly desired” in today’s enterprises, says Mukhtar, who has a PhD from the University of Melbourne and degrees in computer science and electrical engineering.
“Always look at your own backyard, because you've got heaps of data which you don't even know you have. Machine to machine data, transactional data. Collected - it's very valuable,” he says.
“It is not about changing the world, boiling oceans,” he says, on the key takeaway for CIOs. “You are not going to be tearing down your data warehouse. It is a matter of really annexing or adding to what you already have in your information strategy, augmenting it to give you net value.
“That is the crux of it – net value,” he says. “Organisations that make the leap now are clearly advantaged to other organisations that wait to be a follow up because right now, businesses which are data driven are extracting far more value than businesses that are intuition driven.”
Panelists from different sectors highlight how Big Data impacts their enterprises.
“We are getting more data from more sources that we need to integrate and increasingly those are more complicated. The data we're getting is quicker. We're used to monthly, daily, and we're now getting almost real-time data to be able to analyse,” says Phil Wickenden, vice president CRM, at GE Capital.
Wickenden says one of the company’s challenges as it tries to become more customer centric is the change in its systems from being very big, product-based systems to becoming much more digital. GE needs to be able to track across those systems, he says. “Instead of having one front end, customers can come into 20 different websites, and that's really good because they're actually getting a good service,” he says. “We need now to understand how much value we're getting out of that, and can we actually improve that by using that information.”
Another panelist, Matthew Long, business intelligence manager at Western Health, says his company is “doing big data slowly.”
“We’ve always just targeted structured information,” he says, and yet a lot of the information the company collects is unstructured. “It's all in doctors' notes etc. How do we then pull that in as well to really enrich the analytical process?”
He explains, “A good day for us is when nobody turns up. In the public sector and health, it's not about generating dollars. We've got a very finite amount of resources to work with and it's about improving what we can do with that very limited resource. So we're not trying to sort of get more clicks or more hits. What we're trying to do is really improve [work] flows.”
David Campbell, technical fellow at Microsoft, says it is interesting how much the answer to the question of who owns corporate data has shifted over time.
“If you told Boeing Aircraft 20 years ago that in order for them to build the next generation of planes they would have to share their designs with hundreds of partners around the globe they would have looked at you as if you were crazy because that was their core IP,” he says. “But the dynamics of business has shifted. They needed to do that and so how do they do that and also retain their degree of control?”
The trick, he says, is to ask: “Can I generate value in a way that I stay in control by sharing what [data] and in what form? That should be the question.”
Divina Paredes covered the Big Data Symposium in Sydney as a guest of Microsoft.