
Tim Matteson's Posts (14)

  • Top 10 Commercial Hadoop Platforms

    Guest blog post by Bernard Marr

    Hadoop – the software framework which provides the necessary tools to carry out Big Data analysis – is widely used in industry and commerce for many Big Data related tasks.

    It is open source, essentially meaning that it is free for anyone to use for any purpose, and can be modified for any use. While designed to be user-friendly, in its “raw” state it still needs considerable specialist knowledge to set up and run.

    Because of this a large number of commercial versions have come onto the market in recent years, as vendors have created their own versions designed to be more easily used, or supplied alongside consultancy services to get you crunching through your data in no time.

    These days, this is often provided in the form of “Hadoop-as-a-service” – all of the installation takes place within the vendor’s own cloud, with customers paying a subscription to access the services.

    Here’s a run-down, in no particular order, of 10 of the most popular or interesting commercial Hadoop platforms on the market today.


    Cloudera

    One of the first commercial Hadoop offerings and still the most popular, reportedly with more installations running than any of its competitors. Cloudera also contribute Impala, which brings real-time, massively parallel processing of Big Data to Hadoop.

    Amazon Web Services

    Open source Big Data frameworks may not be the first thing that springs to mind when you think of Amazon, but the retailer was also among the first to offer Hadoop in the cloud as part of its Amazon Web Services package. AWS is a hosted solution integrating Hadoop with Amazon’s Elastic Compute Cloud (EC2) and Simple Storage Service (S3) cloud-based data processing and storage services.


    Hortonworks

    Of the vendors listed here, Hortonworks is one of the few which offer 100% open source Hadoop technology without any proprietary (non-open) modifications. They were also the first to integrate support for Apache HCatalog, which creates “metadata” – data about data – simplifying the process of sharing your data across other layers of service such as Apache Hive or Pig.


    MapR

    Uses some differing concepts, such as native support for UNIX file systems rather than HDFS, meaning it will be more familiar to DBAs used to working in a UNIX environment. MapR Technologies is also spearheading development of the Apache Drill project, which provides advanced tools for interactive real-time querying of big datasets.


    IBM

    It might be a relative newcomer to the Hadoop ecosystem, but IBM has deep roots in the computing industry, particularly in distributed computing and data management. Its BigInsights package adds its proprietary analytics and visualization algorithms to the core Hadoop infrastructure.

    Microsoft HDInsight

    Engineered to run on Microsoft’s Azure cloud platform, Microsoft’s Hadoop package is based on Hortonworks’, and has the distinction of being the only big commercial Hadoop offering which runs in a Windows environment.

    Intel Distribution for Apache Hadoop

    Another giant of the tech world which has recently turned its attention towards Hadoop. Intel’s distribution adds the company’s Graph Builder and Analytics Toolkit functions to Hadoop, and the company claims that its security enhancements to the infrastructure offer added protection for your data.

    Datastax Enterprise Analytics

    Datastax offers its own distribution of the Apache Cassandra database management system on top of its Hadoop installation. It also includes custom proprietary systems to handle security, search, dashboard and visualization. Customers include Netflix, where it powers the recommendation engine by analyzing over 10 million data points every second!

    Teradata Enterprise Access for Hadoop

    Teradata offer hardware and software for implementing Big Data solutions, as well as their own Hadoop package, which is also based on the Hortonworks distribution. Proprietary technology supplied alongside the open source components includes their QueryGrid analytics engine and Viewpoint dashboard.

    Pivotal HD

    Pivotal was formed as a joint venture between storage system provider EMC and virtualization specialists VMware. Pivotal HD (Hadoop Distribution) forms part of the company’s Big Data Suite, which also includes database tools Greenplum and analytics platform Gemfire. Customers include China’s national rail operator, China Railway – sorting out the logistics for rail journeys for 3.5 billion passengers certainly qualifies as Big Data!

    Which others would you add to my list? What are your views on any of these? Please share in the comments below.

    DSC Resources

    Additional Reading

    Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

    Read more…
  • Ten top languages for crunching Big Data

    Guest blog post by Bernard Marr

    With an ever-growing number of businesses turning to Big Data and analytics to generate insights, there is a greater need than ever for people with the technical skills to apply analytics to real-world problems.

    Computer programming is still at the core of the skillset needed to create algorithms that can crunch through whatever structured or unstructured data is thrown at them. Certain languages have proven themselves better at this task than others. Here’s a brief overview of 10 of the most popular and widely used.

    Fractal landscape simulation requires a lot of computing (this one possibly produced with MATLAB)


    Julia

    Julia is a relative newcomer, having existed for only a few years. However, it is quickly gaining popularity, with data scientists praising both its flexibility and ease of use. Although designed as a “jack of all trades” language, able to cope with any sort of application, it is thought to be particularly efficient at utilizing the power of distributed systems such as Hadoop, frequently used in Big Data.

    Crowd-sourced data science website Kaggle is currently running a competition which doubles as a tutorial on getting started with Julia – it will show you how to use it to create algorithms designed to detect text characters, such as roadside graffiti, in Google Street View images.


    SAS

    The SAS language is the programming language behind the SAS (Statistical Analysis System) analytics platform, which has been used for statistical modelling since the 1960s and is still popular today after many years of updates and refinements. Unlike many of the other languages mentioned here, it isn’t open source, so it isn’t free. There is, however, a free University Edition designed for learners, available here.


    Python

    Python is one of the most popular open source (free) languages for working with the large and complicated datasets needed for Big Data. It has become very popular in recent years because it is both flexible and relatively easy to learn. Like most popular open source software it also has a large and active community dedicated to improving the product and making it popular with new users. A free Codecademy course will take you through the basics in 13 hours.


    R

    Like Python, R is hugely popular (one poll suggested that these two open source languages were between them used in nearly 85% of all Big Data projects) and supported by a large and helpful community. Where Python excels in simplicity and ease of use, R stands out for its raw number crunching power. Its widespread adoption means you are probably executing code written in R every day, as it was used to create algorithms behind Google, Facebook, Twitter and many other services. A free, online beginners’ course in programming R can be found here.


    SQL

    Although SQL is not designed for the task of handling messy, unstructured datasets of the type which Big Data often involves, there is still a need for structured, quantified data analytics in many organizations. Older and less sexy than Python or R, it was still used by 30% of organizations for their data crunching, according to one poll (the same one mentioned above!) and is a useful tool for any statistician. A free course which will teach you the basics of SQL programming is available here.


    Scala

    Scala is based on Java, and compiled code runs on the Java Virtual Machine, meaning it can be run on just about any platform. Just like Java it has become popular with data scientists and statisticians thanks to its powerful number-crunching abilities and scalability (hence the name!). A free course suitable for those with some basic experience of programming another language such as Java or Python is available here.


    MATLAB

    As the name suggests, MATLAB is designed for working with matrices, which makes it very good for statistical modelling and algorithm creation. It isn’t open source, so it doesn’t have the same volume of free community-driven support, but this is alleviated somewhat by its widespread use in academia: many will be introduced to it at college, and if not, there are ample resources online. Coursera offers Vanderbilt University’s Introduction to Programming with Matlab free of charge.


    HiveQL

    HiveQL is a query-based language for coding instructions to Apache Hive, designed to work on top of Apache Hadoop or other distributed storage platforms such as Amazon’s S3 file system. It is based on SQL, one of the oldest and most widely-used data programming languages, meaning it has been well adopted since its initial development by Facebook. It has since been passed to the Apache Foundation and given open source status. An intermediate level tutorial for those already familiar with SQL is available here.

    Pig Latin

    Another Hadoop-oriented, open source system, Pig Latin is the language layer of the Apache Pig platform, which is used to create Hadoop MapReduce jobs which sort and apply mathematical functions to large, distributed datasets. As with other newer languages, users can write functions in more established languages such as Python to carry out operations which are not natively supported. An online Pig tutorial can be found here.


    Go

    Go has been developed by Google and released under an open source licence. Its syntax is based on C, meaning many programmers will be familiar with it, which has aided its adoption. Although it is not specifically designed for statistical computing, its speed and familiarity, along with the fact it can call routines written in other languages (such as Python) to handle functions it can’t cope with itself, mean it is growing in popularity for data programming. An online introduction and tutorial can be found here.


    Read more…
  • Hadoop Security Issues and Best Practices

    Originally posted on Analytic Bridge

    The big data boom has given rise to a host of software tools and capabilities that enable companies to capture, manage, and analyze large sets of structured and unstructured data for actionable insights and competitive advantage. But with this new technology comes the challenge of keeping confidential information secure and private.

    Big data that resides within a Hadoop environment can contain sensitive data such as bank account details, credit card and other financial information, corporate business data, property information, and clients’ personal and security information.

    Because of the confidential nature of this data and the damage that can be done should it fall into the wrong hands, it is essential that it be protected from unauthorized access.

    Let’s look at some general Hadoop security issues, along with best practices to keep sensitive data protected and secure.

    Security concerns with Hadoop

    It wasn’t all that long ago that Hadoop in the enterprise was primarily deployed on-premises. As such, confidential data was safely confined in isolated clusters or data silos where security wasn’t a problem. But that quickly changed as Hadoop developed into Big Data as-a-Service (BDaaS), took to the cloud, and became surrounded by an ever-growing ecosystem of software and applications. And while these innovations have served to democratize data and bring Hadoop into the mainstream, they have also created new security concerns for organizations that now struggle to scale security in step with Hadoop’s rapid technological advances.

    For many companies Hadoop has developed into an enterprise data platform. That poses new security challenges as data that was once siloed is brought together in a vast data lake and made accessible to a variety of users across the organization. Among these challenges are:

    • Ensuring the proper authentication of users who access Hadoop.

    • Ensuring that authorized Hadoop users can only access the data that they are entitled to access.

    • Ensuring that data access histories for all users are recorded in accordance with compliance regulations and for other important purposes.

    • Ensuring the protection of data—both at rest and in transit—through enterprise-grade encryption.
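
    The second and third challenges above (authorization and access histories) can be sketched in a few lines of Python. This is only an illustrative toy, not how Hadoop itself enforces permissions: the user names, group names, paths and the `can_read` helper are all hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical users, groups and path permissions, for illustration only.
USER_GROUPS = {"alice": {"analysts"}, "bob": {"marketing"}}
PATH_PERMISSIONS = {
    "/data/finance": {"analysts"},
    "/data/campaigns": {"marketing", "analysts"},
}

audit_log = []  # every access attempt is recorded, whether allowed or denied


def can_read(user, path):
    """Return True if at least one of the user's groups may read the path."""
    groups = USER_GROUPS.get(user, set())        # unknown users have no groups
    allowed = PATH_PERMISSIONS.get(path, set())  # unknown paths allow nobody
    decision = bool(groups & allowed)
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "path": path,
        "allowed": decision,
    })
    return decision
```

    Calling `can_read("bob", "/data/finance")` returns `False` and still leaves an entry in `audit_log`, which is the point: denied attempts are part of the access history too.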

    Hadoop security best practices

    Clearly, today’s companies face formidable security challenges. And the stakes regarding data security are being raised ever higher as sensitive healthcare data, personal retail customer data, smart phone data, and social media and sentiment data become more and more a part of the big data mix. It’s time for companies to reevaluate the protection and safety of their data in Hadoop and to reacquaint themselves with the following Hadoop security best practices.

    1. Plan before you deploy – Big data protection strategies must be determined during the planning phase of the Hadoop deployment. Before moving any data into Hadoop it’s important to identify any confidential data elements, along with where those elements will reside in the Hadoop system. In addition, all company privacy policies and pertinent industry and governmental regulations must be taken into consideration during the planning phase in order to better identify and reduce compliance exposure risk.

    2. Don’t overlook basic security measures – Basic security measures can go a long way in meeting Hadoop security challenges. To ensure user identification and control user access to sensitive data it’s important to create users and groups and then map users to groups. Permissions should be assigned and locked down by groups, and the use of strong passwords should be strictly enforced. Fine-grained permissions should be assigned on a need-to-know basis only, and broad-stroke permissions should be avoided as much as possible.

    3. Choose the right remediation technique – When big data analytics needs require access to real data, as opposed to data that has been desensitized, there are two remediation techniques to choose from—encryption or masking. While masking offers the most secure remediation, encryption might be a better choice as it offers greater flexibility to meet evolving needs. Either way it’s important to ensure that the data protection solutions being considered are capable of supporting both remediation techniques. That way, both masked and unmasked versions of sensitive data can be kept in separate Hadoop directories if desired.

    4. Ensure that encryption integrates with access control – Once an encryption solution is chosen it must be made compatible with the organization’s access control technology. Otherwise, users with different credentials won’t have the appropriate, selective access to sensitive data in the Hadoop environment that they require.

    5. Monitor, detect and resolve issues – Even the best security models will be found wanting without the capability to detect non-compliance issues and suspected or actual security breaches and quickly resolve them. Organizations need to make sure that best practice monitoring and detection processes are in place.

    6. Ensure proper training and enforcement – To be fully effective, best practice policies and procedures with respect to data security in Hadoop must be frequently revisited in employee training and constantly supervised and enforced.
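
    To make the masking option from item 3 a little more concrete, here is a minimal Python sketch using only the standard library. The function names and the key are hypothetical, and real masking products offer far more than this; the point is simply that masked values cannot be turned back into the originals, which is why masking is the more secure but less flexible of the two techniques.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would come from a key management system.
MASKING_KEY = b"example-key-not-for-production"


def mask_card_number(card_number):
    """Hide all but the last four digits, keeping the overall length."""
    digits = card_number.replace(" ", "")
    return "*" * (len(digits) - 4) + digits[-4:]


def pseudonymize(value):
    """Replace a value with a keyed hash so equal inputs still match each other."""
    return hmac.new(MASKING_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
```

    A pseudonymized customer ID can still be used to join or count records, because the same input always produces the same output, while the masked card number is only useful for display.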

    Hadoop is enabling organizations to analyze vast and rich data stores and derive actionable insights that inform new and better products and services and help to create competitive advantage. But the benefits of Hadoop come with risks. Hopefully the above information will help organizations to gain a better understanding of the security and compliance issues associated with Hadoop and to implement best practices to keep sensitive data safe and secure going forward.

    Read more…
  • Guest blog post by Bill Vorhies

    Summary:  Just when you thought you had a handle on the database market it fragments again.  Here’s an overview to help you keep up.

    It’s taken a lot of effort to keep up with the changes in RDBMS and NoSQL database design.  We chose to live in this fast changing profession so it’s up to us to keep up.  But just about the time I thought I had a handle on this, the field fragments again.

    NoSQL:  Basically we’re talking about Hadoop here.  The architecture is MPP, horizontally scalable but only eventually consistent.  But the NoSQL name hardly fits anymore with the essential demise of MapReduce and the proliferation of all the SQL-on-Hadoop tools including Hive, Presto, Drill, Impala, and a few others I’ve probably missed here.

    RDBMS:  This hardly warrants a review since it remains the go-to architecture for many if not most applications.  Yes, even in this NoSQL age, architects may still prefer this model.  You get SQL and ACID.  What you don’t get is (easy) horizontal scaling on commodity hardware.
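
    For readers who want a reminder of what the “A” in ACID buys you, here is a small sketch using Python’s built-in `sqlite3` module. The table, account names and `transfer` helper are made up for illustration; the behaviour shown (both updates apply, or neither does) is the atomicity guarantee in general, not anything specific to one vendor.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move money between accounts; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit was rolled back too.
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

    An overdrawn transfer fails on the second statement, and the credit already made by the first statement disappears with it.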

    NewSQL (now “Avant Garde RDBMS”???):  Here’s where this starts to break down.  I used to think I had a good handle on NewSQL.  It was RDBMS on steroids because you got SQL, ACID, and distributed MPP architecture.  Now Gartner wants to change the name!  I wish they’d stop doing that. 

    In mid-November Gartner’s Adam Ronthal released a report entitled “When to Use New RDBMS Offerings in a Dynamic Data Environment,” in which he renames this category “Avant Garde” RDBMS.  He says, “emerging RDBMS vendors are pushing the boundaries of scalability, distributed processing, and hybrid on-premises and cloud deployments, offering new functionality and capabilities for information leaders.”  (Well, is it ‘emerging RDBMS’ or ‘Avant Garde’?  How about some consistency here?)  He goes on to predict “Through 2019, 70% of new projects requiring scale-out elasticity, distributed processing and hybrid cloud capabilities for relational applications, as well as multi-data-center transactional consistency, will prefer an emerging RDBMS over a traditional RDBMS.”  He may be completely correct, but what was wrong with just sticking with the NewSQL name?

    Hybrid Transactional Analytic Platforms (HTAPs):  Yes this is still a new category emerging over the last year or so.  Most of the developers don’t even call them DBs, preferring the term ‘platform’.  The common factor is that they are completely in-memory (mostly DRAM but potentially a little SSD around the edges for less active storage).  For more background see The Need for Speed.  The thing that truly sets these apart is that they are optimized for BOTH transactional and analytic tasks SIMULTANEOUSLY.  This left most of us scratching our heads but yes, full ACID and dually optimized.  Wasn’t supposed to be possible but here it is.  All the majors now offer these including SAP, Oracle, Microsoft, IBM, and Teradata.  And most of the developers that used to be NewSQL have entered this space as well including VoltDB, NuoDB, Clustrix, and MemSQL.

    Digging a little deeper into just how it is possible to do both these things at once I came across this December 2014 IDC report “The Analytic Transactional Data Platform:  Enabling the Real Time Enterprise”.  You may be able to find it gratis from one of the developers mentioned above.  If you’re a motivated architecture type, the report details five different architectural strategies used by the various developers to create this magic.  No doubt these tech strategies will mutate over time.

    Here’s one thing Gartner probably got right. “By 2017, all leading operational DBMSs will offer multiple data models, relational and NoSQL, in a single platform.”  None of the major players can afford to be without any of these capabilities.  So stay tuned.  Another major change is underway.


    December 3, 2015

    Bill Vorhies, President & Chief Data Scientist – Data-Magnum - © 2015, all rights reserved.


    About the author:  Bill Vorhies is President & Chief Data Scientist at Data-Magnum and has practiced as a data scientist and commercial predictive modeler since 2001.  Bill is also Editorial Director for Data Science Central.

    Read more…
  • Guest blog post by Khushbu Shah.

    Hadoop has continued to grow and develop ever since it was introduced in the market 10 years ago. Every new release and abstraction on Hadoop aims to address one drawback or another in data processing, storage and analysis. Apache Hive was introduced by Facebook to manage and process large datasets in Hadoop’s distributed storage. Apache Hive is an abstraction on Hadoop MapReduce and has its own SQL-like language, HiveQL. Cloudera Impala was developed to overcome the limited interactivity of SQL on Hadoop. It provides low-latency, high-performance SQL-like queries to process and analyze data, with only one condition: that the data be stored on Hadoop clusters.

    Data explosion in the past decade has not disappointed big data enthusiasts one bit. It has thrown up a number of challenges and created new industries which require continuous improvements and innovations in the way we leverage technology.

    Big Data keeps getting bigger. It continues to put pressure on existing data querying, processing and analytic platforms to improve their capabilities without compromising on quality and speed. A number of comparisons have been drawn and they often present contrasting results. Cloudera Impala and Apache Hive are being discussed as two fierce competitors vying for acceptance in the database querying space. While Hadoop has clearly emerged as the favorite data warehousing tool, the Cloudera Impala vs Hive debate refuses to settle down.

    Impala vs Hive: Difference between SQL on Hadoop components

    We try to dive deeper into the capabilities of Impala and Hive to see if there is a clear winner or are these two champions in their own rights on different turfs. We begin by prodding each of these individually before getting into a head to head comparison.

    Cloudera Impala

    Step aside, the SQL engines claiming to do parallel processing! Impala’s open source Massively Parallel Processing (MPP) SQL engine is here, armed with all the power to push you aside. The only condition it needs is data be stored in a cluster of computers running Apache Hadoop, which, given Hadoop’s dominance in data warehousing, isn’t uncommon. Cloudera Impala was announced on the world stage in October 2012 and after a successful beta run, was made available to the general public in May 2013.

    Cloudera Impala is an excellent choice for programmers running queries on HDFS and Apache HBase, as it doesn’t require data to be moved or transformed prior to processing. Cloudera Impala integrates easily with the Hadoop ecosystem, as its file and data formats, metadata, security and resource management frameworks are the same as those used by MapReduce, Apache Hive, Apache Pig and other Hadoop software. It is architected specifically to combine the strengths of Hadoop with the familiar SQL support and multi-user performance of a traditional database. Its unified resource management across frameworks has made it the de facto standard for open source interactive business intelligence tasks.

    Cloudera Impala has the following two technologies that give other processing languages a run for their money:

    Columnar Storage

    Data is stored in columnar fashion which achieves high compression ratio and efficient scanning.

    Columnar Storage in Cloudera Impala
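
    A small Python sketch may help show why a columnar layout compresses and scans well. The sample rows are made up, and run-length encoding is used here only as one simple, well-known compression technique; this is not a description of Impala’s actual on-disk format.

```python
from itertools import groupby

# Hypothetical rows of (country, device, clicks).
rows = [
    ("US", "mobile", 3),
    ("US", "mobile", 5),
    ("US", "desktop", 2),
    ("DE", "desktop", 7),
    ("DE", "desktop", 1),
]

# Column-oriented layout: one list per column instead of one tuple per row.
columns = {
    "country": [r[0] for r in rows],
    "device": [r[1] for r in rows],
    "clicks": [r[2] for r in rows],
}


def run_length_encode(values):
    """Collapse runs of equal neighbouring values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


# A query that needs only one column scans only that column's values.
total_clicks = sum(columns["clicks"])
encoded_country = run_length_encode(columns["country"])
```

    Because similar values sit next to each other in a column, the five country values shrink to two pairs, and summing the clicks never touches the country or device data at all.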


    Tree Architecture

    This is fundamental to attaining a massively parallel, distributed, multi-level serving tree: a query is pushed down the tree, and the results are then aggregated from the leaves.

    Tree Architecture of Impala
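
    The push-down-then-aggregate idea can be sketched as a short recursive function. The tree below, its shape and its data are hypothetical, and a real engine would run the children in parallel on separate machines; this toy runs them one after another simply to show the flow of the query and the partial results.

```python
def run_query(node, predicate):
    """Push a count query down the tree and sum partial results on the way up."""
    if "rows" in node:  # leaf: scan the local data
        return sum(1 for row in node["rows"] if predicate(row))
    # inner node: forward the query to every child, then combine the partial counts
    return sum(run_query(child, predicate) for child in node["children"])


# Hypothetical two-level tree: a root, two intermediate servers, four leaves.
tree = {
    "children": [
        {"children": [{"rows": [1, 8, 12]}, {"rows": [4, 15]}]},
        {"children": [{"rows": [20, 3]}, {"rows": [9, 11, 30]}]},
    ]
}

matches = run_query(tree, lambda value: value > 10)
```

    Each level only ever passes a single number upward, so the root never has to see the raw rows held by the leaves.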


    Impala massively improves on the performance parameters as it eliminates the need to migrate huge data sets to dedicated processing systems or convert data formats prior to analysis. Salient features of Impala include:

    • Hadoop Distributed File System (HDFS) and Apache HBase storage support
    • Recognizes Hadoop file formats, text, LZO, SequenceFile, Avro, RCFile and Parquet
    • Supports Hadoop Security (Kerberos authentication)
    • Fine-grained, role-based authorization with Apache Sentry
    • Can easily read metadata, ODBC driver and SQL syntax from Apache Hive

    Impala’s rise within a short span of little over 2 years can be gauged from the fact that Amazon Web Services and MapR have both added support for it.

    Apache Hive

    Initially developed by Facebook, Apache Hive is a data warehouse infrastructure built on the Hadoop platform for performing data-intensive tasks such as querying, analysis, processing and visualization. Apache Hive is versatile in its usage as it supports analysis of huge datasets stored in Hadoop’s HDFS and other compatible file systems such as Amazon S3. To keep traditional database query designers interested, it provides an SQL-like language (HiveQL) with schema on read and transparently converts queries to MapReduce, Apache Tez and Spark jobs. Other features of Hive include:

    • Indexing for accelerated processing
    • Support for different storage types such as plain text, RCFile, HBase, ORC and others
    • Metadata storage in RDBMS, bringing down time to perform semantic checks during query execution
    • Has SQL like queries that get implicitly converted into MapReduce, Tez or Spark jobs
    • Familiar built-in user-defined functions (UDFs) for manipulating strings and dates, plus other data-mining tools.

    If you are looking for an advanced analytics language which allows you to leverage your familiarity with SQL (without writing MapReduce jobs separately) then Apache Hive is definitely the way to go. HiveQL queries get converted into a corresponding MapReduce job anyway, which executes on the cluster and gives you the final output. Hive (and its underlying SQL-like language HiveQL) does have its limitations though, and if you have really fine-grained, complex processing requirements at hand you would definitely want to take a look at MapReduce.
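
    To give a feel for what that conversion saves you, the sketch below computes the same grouped count twice in Python: once as a declarative SQL query (run on SQLite purely as a stand-in, since the query shape is what matters) and once as hand-written map, shuffle and reduce steps. The table, column names and data are hypothetical, and this is not Hive’s actual query planner.

```python
import sqlite3
from collections import defaultdict

# Hypothetical page-view records: (visitor, page).
views = [("ann", "home"), ("bob", "home"), ("ann", "cart"), ("cat", "home"), ("bob", "cart")]

# Declarative version: the kind of query a HiveQL user would write.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE views (visitor TEXT, page TEXT)")
conn.executemany("INSERT INTO views VALUES (?, ?)", views)
sql_result = dict(conn.execute("SELECT page, COUNT(*) FROM views GROUP BY page"))


# Hand-written version: the map, shuffle and reduce steps such a query implies.
def map_phase(records):
    """Emit a (key, 1) pair for every record."""
    return [(page, 1) for _visitor, page in records]


def shuffle_phase(pairs):
    """Group all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


mapreduce_result = reduce_phase(shuffle_phase(map_phase(views)))
```

    Both routes give the same answer; the one-line query simply leaves the three phases to the engine.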

    Impala vs Hive – 4 Differences between the Hadoop SQL Components

    Benchmarks by both Cloudera (Impala’s vendor) and AMPLab have shown Impala to have a performance lead over Hive. Benchmarks are notoriously prone to bias from minor software tricks and hardware settings. However, it is worthwhile to take a deeper look at this consistently observed difference. The following reasons come to the fore as possible causes:

    1. Cloudera Impala, being a native query engine, avoids the startup overhead commonly seen in MapReduce/Tez-based jobs (MapReduce programs take time before all nodes are running at full capacity). In Hive, every query has this “cold start” problem, whereas Impala daemon processes are started at boot time and are always ready to process a query.
    2. Hadoop reuses JVM instances to partially reduce startup overhead, but this introduces another problem when large heaps are in use. The machines in the Cloudera benchmark have 384 GB of memory, which is a big challenge for the garbage collector of the reused JVM instances.
    3. MapReduce materializes all intermediate results, which enables better scalability and fault tolerance (while slowing down data processing). Impala streams intermediate results between executors (trading off scalability).
    4. Hive generates query expressions at compile time whereas Impala does runtime code generation for “big loops”.
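
    Point 3 in the list above, materializing versus streaming intermediate results, can be illustrated with plain Python lists and generators. This is an analogy under simplifying assumptions, with hypothetical function names, not a model of either engine’s internals.

```python
def scan(rows):
    """Stand-in for reading rows from storage, one at a time."""
    for row in rows:
        yield row


def materialized_pipeline(rows):
    """MapReduce style: each stage finishes and stores its full output first."""
    stage1 = list(scan(rows))                          # full intermediate result
    stage2 = [row for row in stage1 if row % 2 == 0]   # another full intermediate result
    stored_rows = len(stage1) + len(stage2)            # how much had to be kept around
    return sum(stage2), stored_rows


def streaming_pipeline(rows):
    """Impala style: rows flow through the stages without being stored in between."""
    evens = (row for row in scan(rows) if row % 2 == 0)  # lazy, nothing is stored
    return sum(evens)
```

    The materialized version could resume from `stage1` if the second stage failed, at the price of storing every intermediate row; the streaming version stores nothing in between but would have to start again from the scan.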

    Impala vs Hive-Performance


    The above graph demonstrates that Cloudera Impala is 6 to 69 times faster than Apache Hive. To conclude, Impala does have a number of performance-related advantages over Hive, but much depends on the kind of task at hand. That being said, Jamie Thomson has found some really interesting results through dumb querying published on, especially in terms of execution time. For all its performance-related advantages, Impala does have a few serious issues to consider. Being written in C/C++, it will not understand every format, especially those written in Java. If you are starting something fresh then Cloudera Impala would be the way to go, but when you have to take up an upgrade project where compatibility becomes as important a factor as (or maybe more important than) speed, Apache Hive would nudge ahead.

    In practical terms, Apache Hive and Cloudera Impala need not necessarily be competitors. As both query the same data stored in Hadoop and can share Hive’s metadata, there can be scenarios where you are able to use them together and get the best of both worlds – compatibility and performance. Hive is the more universal, versatile and pluggable of the two. Once data integration and storage have been done, Cloudera Impala can be called upon to unleash its brute processing power and give lightning-fast analytic results.

    Original article here.

    Read more…
  • Guest blog post by Bernard Marr

    Basically Spark is a framework - in the same way that Hadoop is - which provides a number of inter-connected platforms, systems and standards for Big Data projects.

    Like Hadoop, Spark is open-source and under the wing of the Apache Software Foundation. Essentially, open-source means the code can be freely used by anyone. Beyond that, it can also be altered by anyone to produce custom versions aimed at particular problems, or industries. Volunteer developers, as well as those working at companies which produce custom versions, constantly refine and update the core software adding more features and efficiencies. In fact Spark was the most active project at Apache last year. It was also the most active of all of the open source Big Data applications, with over 500 contributors from more than 200 organizations.


    Spark is seen by techies in the industry as a more advanced product than Hadoop - it is newer, and designed to work by processing data in chunks "in memory". This means it transfers data from the physical, magnetic hard discs into far-faster electronic memory where processing can be carried out far more quickly - up to 100 times faster in some operations.

    Spark has proven very popular and is used by many large companies for huge, multi-petabyte data storage and analysis. This has partly been because of its speed. Last year, Spark set a world record by completing a benchmark test involving sorting 100 terabytes of data in 23 minutes - the previous world record of 71 minutes being held by Hadoop.

    Additionally, Spark has proven itself to be highly suited to Machine Learning applications. Machine Learning is one of the fastest growing and most exciting areas of computer science, where computers are being taught to spot patterns in data, and adapt their behaviour based on automated modelling and analysis of whatever task they are trying to perform.

    It is designed from the ground up to be easy to install and use - if you have a background in computer science! In order to make it available to more businesses, many vendors provide their own versions (as with Hadoop) which are geared towards particular industries, or custom-configured for individual clients' projects, as well as associated consultancy services to get it up and running.

    Spark uses cluster computing for its computational (analytics) power as well as its storage. This means it can use resources from many computer processors linked together for its analytics. It's a scalable solution meaning that if more oomph is needed, you can simply introduce more processors into the system. With distributed storage, the huge datasets gathered for Big Data analysis can be stored across many smaller individual physical hard discs. This speeds up read/write operations, because the "head" which reads information from the discs has less physical distance to travel over the disc surface. As with processing power, more storage can be added when needed, and the fact it uses commonly available commodity hardware (any standard computer hard discs) keeps down infrastructure costs.
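
    The scale-out idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: threads inside one process stand in for the separate machines of a cluster, and the function names (split_into_partitions, partial_sum, distributed_sum) are hypothetical, not part of Spark or Hadoop.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Deal records round-robin into num_partitions lists."""
    partitions = [[] for _ in range(num_partitions)]
    for index, record in enumerate(records):
        partitions[index % num_partitions].append(record)
    return partitions


def partial_sum(partition):
    """Work done independently on one partition (one stand-in 'node')."""
    return sum(partition)


def distributed_sum(records, num_workers):
    """Sum all records by combining the per-partition results."""
    partitions = split_into_partitions(records, num_workers)
    # Each worker thread plays the role of one machine in the cluster.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(partial_sum, partitions))
    return sum(partials)


if __name__ == "__main__":
    data = list(range(1, 101))
    print(distributed_sum(data, num_workers=4))  # 5050
```

    Adding "more oomph" in this toy model just means raising num_workers; the data is re-split and the partial results are combined in the same way.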

    Unlike Hadoop, Spark does not come with its own file system - instead it can be integrated with many storage systems, including Hadoop's HDFS, MongoDB and Amazon's S3.

    Another element of the framework is Spark Streaming, which allows applications to be developed which perform analytics on streaming, real-time data - such as automatically analyzing video or social media data - on-the-fly, in real-time.
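
    Spark Streaming's classic model treats a live stream as a sequence of small batches. The sketch below mimics that idea in plain Python; it is not the Spark API. The event format (timestamp, value) and the names micro_batches and count_per_batch are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into fixed-width time windows."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # Start of the window this event falls into.
        window_start = int(timestamp // batch_seconds) * batch_seconds
        batches[window_start].append(value)
    return dict(sorted(batches.items()))


def count_per_batch(events, batch_seconds):
    """Run a small analytic (value counts) on each micro-batch."""
    return {
        start: Counter(values)
        for start, values in micro_batches(events, batch_seconds).items()
    }


if __name__ == "__main__":
    # Hypothetical stream of (seconds since start, hashtag) events.
    stream = [(0.5, "spark"), (1.2, "hadoop"), (2.1, "spark"),
              (2.9, "spark"), (4.0, "hadoop")]
    for start, counts in count_per_batch(stream, batch_seconds=2).items():
        print(start, dict(counts))
```

    A shorter batch interval gives fresher results at the cost of more frequent, smaller computations.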

    In fast changing industries such as marketing, real-time analytics has huge advantages, for example ads can be served based on a user's behavior at a particular time, rather than on historical behavior, increasing the chance of prompting an impulse purchase.

    So that's a brief introduction to Apache Spark - what it is, how it works, and why a lot of people think that it's the future. I hope you found it useful.

    Read more…
  • 10 Free Hadoop Tutorials


    Hadoop is an open-source framework that allows you to store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. This brief tutorial provides a quick introduction to Big Data, the MapReduce algorithm, and the Hadoop Distributed File System.

    Following is an extensive series of tutorials on developing Big-Data Applications with Hadoop. Since each section includes exercises and exercise solutions, this can also be viewed as a self-paced Hadoop training course. All the slides, source code, exercises, and exercise solutions are free for unrestricted use. Click on a section below to expand its content. The relatively few parts on IDE development and deployment use Eclipse, but of course none of the actual code is Eclipse-specific.

    Yahoo Developer Network

    This series of tutorial documents will walk you through many aspects of the Apache Hadoop system. You will be shown how to set up simple and advanced cluster configurations, use the distributed file system, and develop complex Hadoop MapReduce applications. Other related systems are also reviewed.


    In this tutorial we will be analyzing geolocation and truck data. We will import this data into HDFS and build derived tables in Hive. Then we will process the data using Pig and Hive. The processed data is then imported into Microsoft Excel where it can be visualized.


    HadoopTutorials is an online video tutorial series. The blog covers HDFS, MapReduce, data fundamentals and more in detail.

    Big Data is the latest buzzword in the IT industry. Apache’s Hadoop is a leading Big Data platform used by IT giants Yahoo, Facebook & Google. 'Big Data' is still data, but of enormous size: the term describes a collection of data that is huge and still growing exponentially with time. In short, such data is so large and complex that none of the traditional data management tools can store or process it efficiently.


    Organizations use Hadoop as a scalable framework for storing and processing massive volumes of data using a distributed computing model. From its roots as an open source Apache project, Hadoop has been tweaked and modified over the years by various users such as Yahoo!, EMC2, Apple, and Facebook to manage the huge amounts of digital data being created every second. If used correctly, this data can lead to game-changing decisions in business, technology, politics, and everyday life. That’s the reason why data, like gold or diamonds, is now being mined, stored, and processed nonstop by well-paid data scientists and other big data professionals.


    Essential knowledge for everyone associated with Big Data and Hadoop, aimed at non-geeks. This course builds a fundamental understanding of Big Data problems and of Hadoop as a solution. It takes you through Big Data problems with easy-to-understand examples, the history and advent of Hadoop from the days before it was even named Hadoop, and the "Hadoop magic" that makes it so unique and powerful.

    Mapr Academy

    This is an introductory level course about big data, Hadoop and the Hadoop ecosystem of products. Covered are a big data definition, details about the Hadoop core components, and examples of several common Hadoop use cases: enterprise data hub, large scale log analysis, and building recommendation engines.

    udemy by Jigar Vora

    This course is intended for people who want to know what big data is. It covers what big data is, how Hadoop supports Big Data concepts, and how Hadoop components such as Pig, Hive and MapReduce support analytics on large data sets.

    Originally posted on Data Science Central

    Read more…
  • Fake data science

    Books, certificates and graduate degrees in data science are spreading like mushrooms after the rain.

    Unfortunately, many are just a mirage: some old guys taking advantage of the new paradigm to quickly re-package some very old material (statistics, R programming) with the new label: data science.

    To add to the confusion, executives, decision makers building a new team of data scientists sometimes don't know exactly what they are looking for, ending up hiring pure tech geeks, computer scientists, or people lacking proper experience. The problem is compounded by HR who do not know better, producing job ads which always contain the same keywords: Java, Python, Map Reduce, R, NoSQL. As if a data scientist was a mix of these skills.

    Indeed, you can be a real data scientist and have none of these skills. NoSQL and MapReduce are not new concepts: many embraced them long before these keywords were created. But to be a data scientist, you also need:

    • business acumen
    • real big data expertise
    • the ability to sense the data
    • a distrust of models
    • knowledge of the curse of big data
    • ability to communicate, understand which problems management is trying to solve
    • ability to correctly assess lift or ROI on the salary paid to you
    • ability to quickly identify a simple, robust, scalable solution to a problem
    • being able to convince and drive management in the right direction, sometimes against their will, for the benefit of the company, its users and shareholders
    • a real passion for analytics
    • real applied experience with success stories
    • data architecture knowledge
    • data gathering and cleaning skills

    A data scientist is also a business analyst, statistician and computer scientist: a generalist in these three areas, with expertise in a few fields (e.g. robustness, design of experiments, algorithm complexity, dashboards and data visualization).

    Fake Data Science Examples

    Here are two examples of mislabeled data science products, and the reason why we are interested in creating a standard and best practices for data scientists. Not that these two products are bad; they do have a lot of intrinsic value. But they are not data science.

    1. eBook: An Introduction to Data Science

    Most of the book is about old statistical theory. Throughout the book, R is used to illustrate the various concepts. The entire book is about small data, with the exception of the last few chapters where you learn a bit of SQL (embedded in R code) and how to use an R package to extract tweets from Twitter, and create what the author calls a word cloud (it has nothing to do with cloud computing).

    Even the Twitter project is about small data anyway, and there's no distributed architecture (e.g. Map Reduce) in it. Indeed the book never talks about data architecture. Its level is elementary. Each chapter starts with a very short introduction in simple English (suitable for middle school students) about big data / data science, but these little data science excursions are out-of-context, and independent from the projects and technical presentations.

    I guess the author (Jeffrey Stanton) added these short paragraphs so that he could re-name his "Statistics with R" eBook as "Introduction to Data Science". But it's free and it's a nice, well written book to get high school students interested in statistics and programming. It's just that it has nothing to do with data science.

    2. Data Science Certificate

    Delivered by a respected public University (we won't mention the name). The advisory board is mostly senior technical guys, most have academic positions. The data scientist is presented as "a new type of data analyst": I strongly disagree with this. Data scientists are not junior people.

    This program has a strong data architecture and computer science flair, and this CS content is of great quality. That's a very important part of data science, but in my opinion, it covers only one third of data science. It has a bit of old statistics too and some nice statistics lessons on robustness and other stuff, but nothing about six sigma, approximate solutions, the Lorenz curve, the 80/20 rule and related stuff, cross-validation, design of experiments, modern pattern recognition, lift metrics, third party data, Monte Carlo simulations, life cycle of data science projects, and nothing found in an MBA curriculum. It requires knowledge of Java and Python for admission. It is also very expensive - several thousand dollars.

    To be admitted, you need to take a 90-minute test (multiple choices) with questions that only fresh graduates would be able to answer. Click here to see the admission test: could you pass? Ironically, this online test is the same for everyone (I double checked), so technically, you could first take it using a fake name, save the questionnaire, then pay someone to answer the questions, then take the test again but this time with your real name - and complete it in just 30 seconds and get all the answers correct! I guess they don't have a real data scientist on board to help them with fraud detection issues. In short, the admission process will eliminate most real data scientists (those with years of successful business experience) except the fraudsters.

    Originally posted on Analytic Bridge

    Read more…
  • The Emerging Data Stack and Mobile Access

    Originally posted on Data Science Central


    The emerging  "Data Stack" or "Data Layer" is in full transition and can be viewed and defined many different ways. The ability to capture, analyze and learn from data generated at unprecedented scale, combined with means to access that information, on demand, when relevant, creates business opportunities we are only just beginning to appreciate. 

    One way simply defines data in a three layer stack:

    • Internal Data: The data gathered into a data warehouse from the transactional systems of a company.
    • Contextual Data: The data from external sources that adds context to tell the whole story by adding spatial data, population, demographics, and so on.
    • The Integrated Data Model: The metadata that ties everything together to support advanced analysis.

    The top layer of the stack, internal data, is specific to an organization. The contextual layer comes from other sources. The integrated data model is for advanced data analytics applications.

    Another more complex way is represented in the above image.

    There are three data layer trends: data growth, web application user growth and the explosion of mobile computing. 

    Data growth [Big Data]. IDC estimates an organization's data will double every two years. Mining this raw data for valuable, actionable insights is challenging. Hadoop and related tools (HDFS, MapReduce, Cassandra and Hive) are batch-processing oriented and assist in analyzing large data sets.

    User growth [NoSQL]. Most new interactive software systems are accessed via browser. If available on the public Internet, these applications now have 2 billion potential users and a 24x7 uptime requirement. Regardless of dataset size, these software systems put unprecedented pressure on the data layer: massive user concurrency; need for predictable, low-latency random access to data to maintain a snappy interactive user experience; and the need for continuous operations, even during database maintenance. Couchbase and MongoDB are open source NoSQL technologies that meet the data management needs of interactive web applications. 

    Mobile computing growth [Mobile Sync]. Mobile devices are increasingly where we create and consume information. But data aggregation and processing will be accomplished in the cloud. IDC estimates that in 2015, 1.4 of the 4.9 zettabytes created that year will be "touched by the cloud." Delivering the right data to millions of mobile devices, when and where it is needed (and then getting it back again) is the mobile-cloud data sync challenge. 

    These three trends may constitute the future emerging modern data stack - one that supports the ebb and flow of information from web and mobile applications to the cloud.

    The key is to design and build a data warehouse / business intelligence (BI) architecture that provides a flexible, multi-faceted analytical ecosystem, optimized for efficient ingestion and analysis of large and diverse datasets. 

    Data comes from a variety of sources (internal, external, contextual, integrated): data directly created by users of web and mobile applications, observations and metadata related to the use of web and mobile applications, external data feeds, intermediate analysis results. The processing of this information creates information needed by user-facing applications and is fed into a NoSQL solution. 

    The NoSQL solution provides low-latency, random access to the data, meeting the needs of web applications. It also allows a mobile synchronization server quick, random access to data needed by mobile users.

    A Mobile Sync Server manages transient connections with mobile devices, delivering data to native mobile applications when and where it is needed; and receiving information in return.



    Read more…
  • Guest blog post by Tony Agresta

    Technology to store, manage, search and analyze Big Data leaps to the top of the agenda for Financial Institutions as enterprise NoSQL databases come of age.

    Financial Institutions are focused on initiatives to survive in a world where regulatory pressure, risk mitigation and increasing volumes of data continue to pressure legacy infrastructures. Improved operational efficiency and revenue generation are at the forefront of the agenda.

    Specific areas of concentration vary across regions of the world. Some common strategic initiatives in 2012 and 2013 include:

    • Infrastructure Improvements – Current IT infrastructure needs an overhaul if financial institutions are to be able to respond to increasing regulatory pressure, new product innovation and huge volumes of complex data. By-products of this change include but are not limited to improved data analysis, reporting, data visualization, business intelligence and predictive analytics.
    • Data Proliferation – The growth in mobile applications and social media presents unique challenges. As customer touch points expand, new sources of data with new forms of complexity proliferate. Initial approaches to data storage have resulted in silos, leading banks to consider innovative approaches to data consolidation.
    • Operational Efficiency – Eroding profit margins, a reliance on legacy systems that can’t handle the load and the need for real time applications have led financial institutions to focus on operational efficiency. As they make improvements in this area, resources are being shifted to revenue generating projects.
    • Data Security and Scalability – In the quest for security and scalability, CIOs and IT leaders are focusing on technology deployments where reliability and uptime are paramount. In turn, high performance transactional environments that utilize all forms of data will allow financial institutions to compete effectively against more nimble players.

    You can read the full article published in World Financial Review here:

    Read more…
  • Originally posted on Data Science Central

    Here's a different angle on a much analyzed question at the heart of our professional activities.  In this article, Steve Miller of Inquidia tackles how NoSQL has changed our traditional understanding of Predictive Analytics and Data Science.  You might also look back at our previous post How NoSQL Fundamentally Changed Machine Learning.

    Here's the beginning of Steve's take on this:

    My company, Inquidia Consulting, is currently engaged in/completing several predictive analytics and data science projects. While we distinguish PA from DS, there's often not a hard dividing line between the two with our customers. Indeed, though we demur, some now consider data science to be any application of statistical methods to business problems.

    For Inquidia, both PA and DS generally involve statistics and machine learning of some sort, often “climaxing” with predictive models trained and validated on existing data. The ultimate goal is to deploy the models to make go-forward predictions in a business process.

    Inquidia's PA work is usually more narrowly focused than its DS cousin, often as not a particular modeling task with relevant data identified in advance for a relatively short-term project. And the PA customer may suggest “theories” on what the final models might look like for us to test. R, Python and SAS are preferred PA platforms.

    DS projects, in contrast, are more comprehensive but nebulous, with substantial computation/data integration/wrangling, big (and perhaps unstructured) data , and exploration challenges that precede theorizing and  subsequent modeling. In many cases, DS work is shaped more by data programming than by modeling. The Cloud, Redshift, Hadoop/Impala, Spark, R and Python are Inquidia's usual suspect DS platforms.

    Read the entire article here.

    Read more…
  • Hadoop:- A soft Introduction

    Originally posted on Data Science Central

    What is Hadoop:
    Hadoop is a framework written in Java for running applications on large clusters of commodity hardware. It incorporates features similar to those of the Google File System and of MapReduce. HDFS is a highly fault-tolerant distributed file system and, like Hadoop itself, is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets (in the range of terabytes to zettabytes).

    Who uses Hadoop:
    Hadoop is mainly used by companies that deal with large amounts of data. They may need to process the data, perform analysis or generate reports. Currently many leading organizations, including Facebook, Yahoo, Amazon, IBM, Joost, PowerSet, New York Times and Veoh, are using Hadoop.

    Why Hadoop:
    MapReduce is Google's secret weapon: a way of breaking complicated problems apart and spreading them across many computers. Hadoop is an open source implementation of MapReduce, together with its own file system, HDFS (Hadoop Distributed File System).
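
    To make "breaking complicated problems apart" concrete, here is a minimal single-machine sketch of the MapReduce pattern, using the customary word-count example. It is plain Python rather than Hadoop's Java API, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are illustrative assumptions. On a real cluster the map and reduce calls would run on many machines in parallel.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over all documents."""
    mapped = []
    for document in documents:  # each document could go to a different node
        mapped.extend(map_phase(document))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big", "data science"]))
```

    Because each map call sees only one document and each reduce call sees only one key, the work can be spread across machines without the pieces needing to talk to each other.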

    Hadoop has defeated Super Computer in tera sort:
    A Hadoop cluster sorted 1 terabyte of data in 209 seconds, which beat the previous record of 297 seconds in the annual general purpose (Daytona) terabyte sort benchmark. The sort benchmark, which was created in 1998 by Jim Gray, specifies the input data (10 billion 100-byte records), which must be completely sorted and written to disk. This is the first time that either a Java or an open source program has won.
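
    The figures quoted above can be checked with a little arithmetic. The short script below uses only the numbers given in the text (10 billion 100-byte records, 209 and 297 seconds) and assumes decimal units (1 GB = 10^9 bytes); the helper name throughput_gb_per_s is hypothetical.

```python
RECORDS = 10_000_000_000   # 10 billion records, from the benchmark definition
RECORD_BYTES = 100         # each record is 100 bytes
TOTAL_BYTES = RECORDS * RECORD_BYTES  # 10**12 bytes, i.e. 1 terabyte


def throughput_gb_per_s(seconds):
    """Average sort throughput in decimal gigabytes per second."""
    return TOTAL_BYTES / seconds / 1e9


if __name__ == "__main__":
    print(round(throughput_gb_per_s(209), 2))  # Hadoop run: 4.78
    print(round(throughput_gb_per_s(297), 2))  # previous record: 3.37
    print(round(297 / 209, 2))                 # speed-up factor: 1.42
```

    In other words, the record-setting run was roughly 1.4 times faster than the previous best.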

    Europe’s Largest Ad Targeting Platform Uses Hadoop:
    Europe’s largest ad company gets over 100GB of data daily. Using a classical solution such as an RDBMS, it needed 5 days to analyze the data and generate reports, so it was running about a week behind. After a lot of research, the company started using Hadoop. The interesting fact is that it is now able to process the data and generate reports within 1 hour. That's the beauty of Hadoop.

    Leading Distributions of Hadoop:

    1. Apache Hadoop:
    The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
    Apache Hadoop Offers:

    • Hadoop Common – The common utilities that support the other Hadoop subprojects.
    • HDFS – A distributed file system that provides high throughput access to application data.
    • MapReduce – A software framework for distributed processing of large data sets on compute clusters.
    • Avro – A data serialization system.
    • Chukwa – A data collection system for managing large distributed systems.
    • HBase – A scalable, distributed database that supports structured data storage for large tables.
    • Hive – A data warehouse infrastructure that provides data summarization and ad hoc querying.
    • Mahout – A scalable machine learning and data mining library.
    • Pig – A high-level data-flow language and execution framework for parallel computation.
    • ZooKeeper – A high-performance coordination service for distributed applications.

    2. Cloudera Hadoop:
    Cloudera’s Distribution for Apache Hadoop (CDH) sets a new standard for Hadoop-based data management platforms. It is the most comprehensive platform available today and significantly accelerates deployment of Apache Hadoop in your organization. CDH is based on the most recent stable version of Apache Hadoop. It includes some useful patches backported from future releases, as well as improvements we have developed for our customers.

    Cloudera Hadoop Offers:
    • HDFS – Self healing distributed file system
    • MapReduce – Powerful, parallel data processing framework
    • Hadoop Common – a set of utilities that support the Hadoop subprojects
    • HBase – Hadoop database for random read/write access
    • Hive – SQL-like queries and tables on large datasets
    • Pig – Dataflow language and compiler
    • Oozie – Workflow for interdependent Hadoop jobs
    • Sqoop – Integrate databases and data warehouses with Hadoop
    • Flume – Highly reliable, configurable streaming data collection
    • Zookeeper – Coordination service for distributed applications
    • Hue – User interface framework and SDK for visual Hadoop applications

    Architecture of Hadoop:
    The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data.
    Source: Apache

    Name Node:
    The NameNode manages the namespace, file system metadata, and access control. There is exactly one NameNode in each cluster. The NameNode acts as the master and the DataNodes as slaves. It holds all the information about the data (i.e. the metadata).

    Data Node:
    The DataNode holds the actual file system data. Each DataNode manages its own locally attached storage (i.e., the node's hard disk) and stores a copy of some or all blocks in the file system. There are one or more DataNodes in each cluster.
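
    The division of labour between the NameNode and the DataNodes can be illustrated with a toy model. This is a sketch under stated assumptions, not HDFS code: the class and function names, the 4-byte block size, the replication factor of 2 and the round-robin placement are all invented for the example. The point it shows is that the NameNode holds only metadata (which blocks make up a file and where they live), while the DataNodes hold the bytes.

```python
class DataNode:
    """Stores the actual block contents on its own local storage."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Holds only metadata: which blocks form a file and where they are."""

    def __init__(self, node_names, replication=2):
        self.node_names = node_names
        self.replication = replication
        self.block_map = {}  # filename -> [(block_id, [node names])]

    def allocate(self, filename, num_blocks):
        entries = []
        for n in range(num_blocks):
            # Toy placement: round-robin over the DataNodes.
            holders = [self.node_names[(n + r) % len(self.node_names)]
                       for r in range(self.replication)]
            entries.append((f"{filename}#{n}", holders))
        self.block_map[filename] = entries
        return entries

    def locate(self, filename):
        return self.block_map[filename]


def write_file(name_node, data_nodes, filename, data, block_size=4):
    """Client side: ask the NameNode where to put blocks, then send them."""
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    placements = name_node.allocate(filename, len(chunks))
    for (block_id, holders), chunk in zip(placements, chunks):
        for holder in holders:
            data_nodes[holder].store(block_id, chunk)


def read_file(name_node, data_nodes, filename):
    """Client side: look up block locations, then read from a replica."""
    return b"".join(data_nodes[holders[0]].read(block_id)
                    for block_id, holders in name_node.locate(filename))


if __name__ == "__main__":
    nodes = {name: DataNode(name) for name in ["dn1", "dn2", "dn3"]}
    master = NameNode(list(nodes))
    write_file(master, nodes, "log.txt", b"hello hadoop")
    print(master.locate("log.txt"))
    print(read_file(master, nodes, "log.txt"))
```

    Because every block is stored on more than one DataNode, losing a single node in this model does not lose any data, which is the intuition behind the fault tolerance described earlier.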

    Install / Deploy Hadoop:
    Hadoop can be installed in 3 modes:
    1. Standalone mode:
    To deploy Hadoop in standalone mode, we just need to set the path of JAVA_HOME. In this mode there is no need to start the daemons or format the NameNode, as data is saved on the local disk.
    2. Pseudo Distributed mode:
    In this mode all the daemons (NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker) run on a single machine.
    3. Distributed mode:
    In this mode the NameNode and JobTracker daemons (and optionally the SecondaryNameNode) run on the master, while the DataNode and TaskTracker daemons run on the slaves.

    Source: Technology-Mania

    Read more…
  • 10 Hadoop Start ups to watch

    Everybody knows Cloudera, MapR, Splunk, 10Gen and Hortonworks. But what about Platfora or Hadapt? These 10 startups are my bet on which big data companies will probably be game-changers in 2012
    Read more…