Big Data

Opportunities and challenges

  1. 60 pages
  2. English

About This Book

Despite the current hype around big data, there is no denying that its potential to benefit organisations, businesses and customers is enormous. The articles in this ebook aim to give practical guidance for all those who want to understand big data better and learn how to make the most of it. Topics range from big data analysis, mobile big data and managing unstructured data to technologies, governance and intellectual property and security issues surrounding big data.

1 WHERE ARE WE WITH BIG DATA?
Brian Runciman, Head of Editorial and Website Services at BCS, The Chartered Institute for IT, looks at what big data is all about.
INTRODUCTION
There have been many descriptions of big data of late – mostly metaphors or similes for ‘big’ (deluge, flood, explosion) – and not only is there a lot of talk about big data, there is also a lot of data. But what can we do with structured and unstructured data? Can we extract insights from it? Or is ‘big data’ just a marketing puff term?
There is absolutely no question that there is an awful lot more data around now than there was only a few years ago. IBM say that ‘every day we create 2.5 quintillion bytes of data – so much that 90 per cent of the data in the world today has been created in the last two years alone’.
SOURCES
Social media platforms produce huge quantities of data, both from individual network profiles and the content that influencers and the less influential alike produce. Short form blogging, link-sharing, expert blog comments, user forums, ‘likes’ and more all contain potentially useful information.
There is also data produced through sheer activity, for example machine-generated content in the form of device log files, which could be characterised as the ‘internet of things’. This would include output from such things as geo-tagging.
Yet more data can be mined from software-as-a-service and cloud applications – data that’s already in the cloud but mostly divorced from internal enterprise data. Another large, but at this stage largely untapped, area is the data languishing in legacy systems, which include things like medical records and customer correspondence.
CAVEATS
A post from BCS’s future blogger called into question some of the behind-the-scenes story: ‘For the big data commercial advocates, there must be algorithms that can trawl the data and create outcomes better, that is to say more cost effectively, than traditional advertising. Where is the evidence that such algorithms exist? How will these algorithms be created and evaluated and improved upon if they do exist? One problem is that in a huge data set, there may be many spurious correlations, and the difference between causation and correlation is hard to prove.’
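The point about spurious correlations can be illustrated with a small simulation. The sketch below is not taken from the blog post: it assumes purely random, unrelated series, and the number of series, the sample size and the threshold are arbitrary choices made for illustration. It counts how many pairs nonetheless look strongly correlated.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def count_spurious(num_series=200, samples=20, threshold=0.6, seed=1):
    """Count pairs of independent random series whose |r| exceeds threshold.

    All parameters are illustrative assumptions, not figures from the text.
    """
    rng = random.Random(seed)
    series = [[rng.gauss(0, 1) for _ in range(samples)]
              for _ in range(num_series)]
    pairs = list(combinations(series, 2))
    strong = sum(1 for a, b in pairs if abs(pearson(a, b)) > threshold)
    return strong, len(pairs)


if __name__ == "__main__":
    strong, total = count_spurious()
    print(f"{strong} of {total} unrelated pairs have |r| > 0.6")
```

Because every series is generated independently, any pair that clears the threshold does so by chance alone, and the number of such pairs grows with the number of series compared.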
As we would perhaps expect, the likes of IBM say that big data goes beyond hype: ‘While there is a lot of buzz about big data in the market, it isn’t hype. Plenty of customers are seeing tangible ROI using IBM solutions to address their big data challenges.’
Big Blue go on to quote a 20 per cent decrease in patient mortality by analysing streaming patient data in the health care arena; a telco that enjoyed a 92 per cent decrease in processing time by analysing networking and call data; and a whopping 99 per cent improved accuracy in placing power generation resources by analysing 2.8 petabytes of untapped data for a utilities organisation.
TOOLS
In times gone by, enterprises handled large data sets with relational databases and warehouses from proprietary suppliers. However, these just can’t handle the volumes of data now being produced. This has driven a trend towards open source alternatives such as Hadoop, which Wikipedia defines as ‘an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. It supports the running of applications on large clusters of commodity hardware.’
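To give a feel for the kind of application Hadoop runs, here is a minimal sketch of the MapReduce programming model with which Hadoop is commonly associated. It is an assumption-laden illustration, not Hadoop's API: everything runs in a single Python process, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (key, 1) pairs; here, one pair per word in a log line."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on an in-memory list."""
    mapped = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(run_job(["error disk full", "error network"]))
```

On a cluster, the map and reduce steps would each be spread across many machines, with the framework handling the grouping in between; the sketch only shows the shape of the computation.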
Wired recently reported on Cloudera – one of several companies that help build and use Hadoop applications – which is offering a Google-style search engine for Hadoop called, uninspiringly, Cloudera Search. Interestingly, Wired pointed to a recent Microsoft paper on whether customers really need to put all their data in Hadoop. It argued that ‘most companies don’t (have) data problems that justify the use of big clusters of servers. Even Yahoo and Facebook, two of the companies most associated with big data, are using clusters to solve problems that could actually be done on a single server.’
Despite that, interest is on the up and big organisations are taking advantage. A recent piece from The Sun Daily mentions that ‘analyst firm International Data Corp projects the global big data technology and services market will grow at a compound annual growth rate of 31.7 per cent – about seven times the rate of the overall information and communications technology market’.
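For readers unfamiliar with compound annual growth rates, the following sketch shows what a rate of 31.7 per cent implies. The starting value of 100 is an arbitrary index chosen for illustration, not a market figure from IDC, and the function names are hypothetical.

```python
import math


def project(value, annual_rate, years):
    """Value after compounding `annual_rate` once a year for `years` years."""
    return value * (1 + annual_rate) ** years


def years_to_double(annual_rate):
    """Years needed for a quantity to double at a fixed compound rate."""
    return math.log(2) / math.log(1 + annual_rate)


if __name__ == "__main__":
    rate = 0.317  # the CAGR quoted in the text
    for year in range(6):
        # 100 is an assumed starting index, not a real market size
        print(year, round(project(100, rate, year), 1))
    print("years to double:", round(years_to_double(rate), 2))
```

At this rate the index doubles in about two and a half years and is close to four times its starting value after five.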
The same article reports further investment in the perceived future of big data with announcements by Dell, Intel Corporation and Revolution Analytics of the Big Data Innovation Centre in Singapore. The new centre brings together expertise from all three organisations to provide training programmes, proof-of-concept capabilities and solution development support on big data and predictive analytic innovations catering to the Asian market.
HOW AND WHEN
The ‘when’ of embracing any new technology is massively variable depending on your organisation’s aims, business sector and so on. Some of the things that could affect your timing are neatly summed up by Redmond magazine in a recent article, simply by listing some of the possible motivators. They mention that you could utilise ‘CRM [customer relationship management] systems and data feeds to tweets mentioning their organisations that can alert them to a sudden problem with a product’. If this kind of real-time feedback is of benefit, then dipping a toe into the deluge of the big data waters is best done sooner rather than later.
Another area mentioned is ‘potential market opportunities spawned by an event’ – not as business-critical as product feedback, but important in a time of global austerity. Redmond magazine also mentions things such as online and big-box retailers using big data to automate their supply chains on the fly and law enforcement agencies analysing huge amounts of data to thwart potential crime and terror attacks. The scope and motivations vary widely, but potential benefits are both long and short-term.
As to how to go about it, some of the tools are mentioned above, often oriented around Hadoop. Microsoft recently launched Windows Azure HDInsight and Redmond magazine also cited VMware’s key application infrastructure and big data and analytics portfolio called Pivotal.
There’s plenty to read about, as the following list shows.
Further reading
Microsoft’s special report on using clusters for analytics: http://research.microsoft.com/apps/pubs/default.aspx?id=179615
Viktor Mayer-Schönberger and Kenneth Cukier, ‘Big Data’ review: http://www.bostonglobe.com/arts/books/2013/03/05/book-review-big-data-viktor-mayer-schonberger-and-kenneth-cukier/T6YC7rNqXHgWowaE1oD8vO/story.html
IBM on big data: www-01.ibm.com/software/data/bigdata
Wired on Cloudera: www.wired.com/wiredenterprise/2013/06/cloudera-search
The hardware perspective: www.techrepublic.com/blog/big-data-analytics/are-we-headed-for-a-platform-change-for-big-data/445?tag=content;blog-list-river
Big data sources: www.zdnet.com/top-10-categories-for-big-data-sources-and-mining-technologies-7000000926
Hadoop: http://en.wikipedia.org/wiki/Hadoop
Things you should know about implementing big data: http://redmondmag.com/articles/2013/05/01/buried-in-big-data.aspx
2 BIG DATA TECHNOLOGIES
Keith Gordon MBCS CITP, former Secretary of BCS Data Management Specialist Group and author of Principles of Data Management, looks at definitions of big data and the database models that have grown up around it.
Whether you live in an ‘IT bubble’ or not, it is very difficult nowadays to miss hearing of something called ‘big data’. Many of the emails hitting my inbox go further and talk about ‘big data technologies’. These fall into two camps: the technologies to store the data and the technologies required to analyse and make sense of the data.
So, what is big data? In an attempt to find out I attended a seminar put on by The Institution of Engineering and Technology (IET) in 2012. After listening to five speakers I was even more confused than I had been at the beginning of the day. Amongst the interpretations of the term ‘big data’ I heard on that day were:
  • Making the vast quantities of data that are held by the government publicly available – the ‘Open Data’ initiative. I am really not sure what ‘big’ means in this scenario!
  • For a future project, storing in a ‘hostile’ environment with no readily available power supply, and then analysing in slow time large quantities of very structured data of limited complexity. Here ‘big’ means ‘a lot of’.
  • For a telecoms company, analysing data available about a person’s previous web searches and tying that together with that person’s current location so that, for instance, if their searches have indicated they like Chinese food, they can be pinged with an advert for a nearby Chinese restaurant before they have walked past it. Here ‘big’ principally means ‘very fast’.
  • Trying to gain business intelligence from the mass of unstructured or semi-structured data an organisation has in its documents, emails and so on. Here ‘big’ equates to ‘complex’.
So, although there is no commonly accepted definition of big data, we can say that it is data that can be defined by some combination of the following five characteristics:
  • Volume – Where the amount of data to be stored and analysed is large enough to require special considerations.
  • Variety – Where the data consists of multiple types of data, potentially from multiple sources; here we need to consider structured data held in tables or objects for which the metadata is well defined, semi-structured data held as documents or similar where the metadata is contained internally (for example XML documents) or unstructured data, which can be photographs, video or any other form of binary data.
  • Velocity – Where the data is produced at high rates and operating on ‘stale’ data is not valuable.
  • Value – Where the data has perceived or quantifiable benefit to the enterprise or organisation using it.
  • Veracity – Where the correctness of the data can be assessed.
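The three kinds of data described under ‘Variety’ can be made concrete with a short sketch. The customer record, the XML document and the byte string below are all invented for illustration; the point is only where the metadata lives in each case.

```python
import xml.etree.ElementTree as ET

# Structured: the metadata (the schema) is defined outside the data itself.
schema = ("customer_id", "name", "country")
structured_row = {"customer_id": 42, "name": "A. Customer", "country": "UK"}

# Semi-structured: the document carries its own metadata as tags.
semi_structured = (
    "<customer id='42'>"
    "<name>A. Customer</name>"
    "<country>UK</country>"
    "</customer>"
)

# Unstructured: opaque binary content with no field names inside it.
# These bytes are an arbitrary stand-in for a photograph or video.
unstructured = bytes([0x00, 0x01, 0x02, 0x03])


def describe_xml(document):
    """List the element names found in an XML document, root first."""
    root = ET.fromstring(document)
    return [root.tag] + [child.tag for child in root]


if __name__ == "__main__":
    print("schema fields:", schema)
    print("tags found in document:", describe_xml(semi_structured))
    print("binary length only:", len(unstructured))
```

For the structured row the field names come from a separate schema, for the XML they can be read out of the document, and for the binary data nothing beyond its length is available without further analysis.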
Interestingly, I saw an article from The New York Times about a group that works for the council in New York. It was faced with the problem of finding the culprits who were polluting the sewers with old cooking fats. One department had details of where the sewers ran and where they were getting blocked, another department had maps of the city with details of all the restaurants and a third department had details of which restaurants had contracts with disposal companies for the removal of old cooking fats.
Putting this information together produced details of the restaurants that did not have disposal contracts, were close to the blockages and were, therefore, possible culprits. That was described as an application of big data, but there was no mention of any specific big data technologies. Was it just an application of common sense and good detective work?
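The combination of the three departments' data amounts to a filter and a proximity check, which can be sketched in a few lines. Everything in the sketch is hypothetical: the restaurant names, the coordinates, the radius and the function name `possible_culprits` are invented for illustration and do not come from the New York Times article.

```python
from math import dist


def possible_culprits(blockages, restaurants, contracted, radius):
    """Return restaurants with no disposal contract near any blockage.

    blockages   -- list of (x, y) locations of blocked sewers
    restaurants -- dict mapping restaurant name to its (x, y) location
    contracted  -- set of restaurant names that hold a disposal contract
    radius      -- how close counts as 'near' (same units as coordinates)
    """
    suspects = []
    for name, location in restaurants.items():
        if name in contracted:
            continue  # has a contract, so not a suspect
        if any(dist(location, site) <= radius for site in blockages):
            suspects.append(name)
    return sorted(suspects)


# Invented sample data, one structure per department.
blockages = [(0.0, 0.0), (5.0, 5.0)]
restaurants = {
    "Cafe A": (0.1, 0.2),
    "Diner B": (0.2, 0.1),
    "Grill C": (9.0, 9.0),
    "Bistro D": (5.1, 4.9),
}
contracted = {"Diner B"}

if __name__ == "__main__":
    print(possible_culprits(blockages, restaurants, contracted, radius=0.5))
```

Nothing here requires specialised technology at this scale, which is consistent with the question the paragraph ends on.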
THE TECHNOLOGIES
More recently, following the revelations from Edward Snowden, the American whistle-blower, The Washington Post had an article explaining how the National Security Agency is able to store and analyse the massive quantities of data it is collecting about the telephone, text and online conversations that are going on around the world. This was put down to the arrival, within the last few years, of big data technologies.
However, it is not just government agencies that are interested in big data. Large data-intensive companies, such as Amazon and Google, are taking the lead in some of the developments of the technologies to handle big data.
Our beloved SQL databases, based on the relational model of data, do not scale easily to handle the growing quantities of structured data and have only limited facilities for handling semi-structured and unstructured data. There is, therefore, a need for alternative storage models for data.
Collectively, databases built around these alternative storage models have become known as NoSQL databases, where this can mean ‘NotOnlySQL’ or ‘No,NeverSQL’ depending on the alternative storage model being considered (or, indeed, your perception of SQL as a database language).
There are over 150 different NoSQL databases available on the market. They all achieve performance gains by do...
