All Posts (381)

Originally posted on Data Science Central

Thousands of articles and tutorials have been written about data science and machine learning. Hundreds of books, courses and conferences are available. You could spend months just figuring out what to do to get started, even to understand what data science is about.

In this short contribution, I share what I believe to be the most valuable resources - a small list of top resources and starting points. This will be most valuable to any data practitioner who has very little free time. 

Map-Reduce Explained

These resources cover data sets, algorithms, case studies, tutorials, cheat…

Read more…

5 Big Data Myths Businesses Should Know

Guest blog post by Larry Alton

Big data is seeping into every facet of our lives. Smart home gadgets are becoming part of the nervous systems of new and remodeled homes, and many renters are demanding these interconnected gadgets from landlords.

But nowhere has Big Data created a bigger buzz than in business. Companies of all sizes are collecting data at a seemingly insurmountable rate. Big data is larger than ever before.

We’ve collected more data in…

Read more…

Originally posted on Data Science Central

In this article, we have started providing answers to one of the largest collections of data science job interview questions ever published, and we will continue to add answers to most of these questions. Some answers link to solutions offered in my Wiley data science book: you can find this book here. The 91 job interview questions were originally published here with no answers, and we recently added 50 questions to identify a true data scientist, …

Read more…

Guest blog post by Syed Danish Ali

Review

The challenges of big data can be captured succinctly as follows[1],[2]:

  • Volume: ever-increasing volume that overwhelms traditional data-storage capacity
  • Variety: increasingly heterogeneous data, in many formats and types, is flooding the data environment
  • Velocity: more and more data is time-sensitive; frequent updates now replace reliance on old historical data, and real-time data is being generated by the Internet of Things, among other sources.
  • Veracity: how valid and reliable is the data? With so much data available, any point of view can be supported by selectively choosing data.

For volume, Map Reduce[3]…

Read more…

Originally posted on Data Science Central

This article introduces Mahout, a library for scalable machine learning, and studies potential applications through two Mahout projects. It was written by Linda Terlouw. Linda is a computer scientist who works on Data Science (Data Analysis, Data Visualization, Process Mining).

Apache Mahout is a library for scalable machine learning. Originally a subproject of Apache Lucene (a high-performance text search engine library), Mahout has progressed to be a top-level Apache project. …

Read more…

Intro to Bigdata Architecture

Originally posted on Data Science Central

Architecture:

In 2000, Seisint Inc. (now LexisNexis Group) developed a C++-based distributed file-sharing framework for data storage and query. The system stores and distributes structured, semi-structured, and unstructured data across multiple servers. Users can build queries in a C++ dialect called ECL. ECL uses an "apply schema on read" method to infer the structure of stored data when it…

Read more…

Guest blog post by Ankit Jain

Hadoop is a leading platform for big data analytics, and enterprises have been using Hadoop-based applications to perform a host of operations, including the management, evaluation, and storage of massive amounts of data. By saving time, Hadoop-as-a-service (HaaS) enables companies to operate in an extremely cost-effective manner.

The report reviews the global Hadoop-as-a-service market on the basis of geography and divides the international market into Europe, Asia Pacific, North America, and Rest of the World. North America is the largest market for HaaS and is anticipated to retain its lead through 2023. Europe follows next in line with a competitive HaaS…

Read more…

Originally posted on Data Science Central

The era of big data has witnessed a paradigm shift toward analytics. Today, it’s no longer sufficient to simply gather data from social media, IoT, and wearable devices without being able to manage or filter it. It is more about delivering the right data to the right person at the right time.

This trend is becoming crucial as data multiplies every day, pouring in from wearables, electronic gadgets, and other smart devices and machines. Such factors call for treating vast pools of structured and unstructured data with care and precision. This is precisely where invisible analytics comes in.

Big Data was the Past; 2015 is the Start Point to Take Analytics to the Next Level

By far, big data has remained as an enabler…

Read more…

Crime Analysis with Zeppelin, R & Spark

Guest blog post by Raghavan Madabusi

Apache Zeppelin, a web-based notebook, enables interactive data analytics including Data Ingestion, Data Discovery, and Data Visualization all in one place. The Zeppelin interpreter concept allows any language or data-processing backend to be plugged into Zeppelin. Currently, Zeppelin supports many interpreters such as Spark (Scala, Python, R, SparkSQL), Hive, JDBC, and others. Zeppelin can be configured with an existing Spark ecosystem and can share a SparkContext across Scala, Python, and R.

Check out the Use Case that demonstrates some of Zeppelin's capabilities...

Crime Analysis Zeppelin Dashboard:…

Read more…

Hadoop and In-Cluster Analytics: Whitepaper

Recent technology advances within the Apache Hadoop ecosystem have provided a big boost to Hadoop’s viability as an analytics environment. It is likely that you have a significant amount of data in Hadoop and a need to accurately analyze it. In this whitepaper we outline 3 points to help you avoid bumping up against the limits of the database you’re moving the data into—or how much of it you can afford to use. Thanks to the improvements with SQL-in-Hadoop technologies you can now leverage the power of the cluster you already have in place, expanding and accelerating what you can do while saving you time and resources.

Whitepaper offered courtesy of Looker

>> Download…

Read more…

Guest blog post by skumar T

Yarn Resource manager (The Yarn service Master component)

1) Controls the total resource capacity of the cluster

2) Whenever a container is needed in the cluster, it enforces a minimum container size, which is controlled by the YARN configuration property

-->yarn.scheduler.minimum-allocation-mb 1024 (This value changes based on cluster RAM capacity)

Description: The minimum allocation for every container request at the RM, in MBs. Memory requests lower than this won't take effect, and the specified value will get allocated at minimum

and similarly Max container size

-->yarn.scheduler.maximum-allocation-mb  8192 (This value changes based on cluster ram…
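
The effect of these two properties on a memory request can be sketched roughly as follows. This is a simplified illustration, not YARN code: the helper name `effective_container_mb` is hypothetical, the default values are the ones quoted above, and any rounding to allocation increments that a real scheduler may apply is not modeled. Rejecting requests above the maximum is represented here by a plain `ValueError`.

```python
# Hypothetical helper that mirrors the behavior described in the text.
# It is not part of YARN's API.
MIN_ALLOCATION_MB = 1024  # yarn.scheduler.minimum-allocation-mb (value quoted above)
MAX_ALLOCATION_MB = 8192  # yarn.scheduler.maximum-allocation-mb (value quoted above)


def effective_container_mb(requested_mb,
                           minimum_mb=MIN_ALLOCATION_MB,
                           maximum_mb=MAX_ALLOCATION_MB):
    """Return the memory (in MB) a container request would receive."""
    if requested_mb > maximum_mb:
        # Assumption: requests above the maximum are rejected, not capped.
        raise ValueError(
            f"requested {requested_mb} MB exceeds maximum {maximum_mb} MB"
        )
    # Requests below the minimum are raised to the minimum.
    return max(requested_mb, minimum_mb)


if __name__ == "__main__":
    for request in (512, 2048, 8192):
        print(request, "->", effective_container_mb(request))
```

With the values above, a 512 MB request is raised to the 1024 MB minimum, while a 2048 MB request is granted as asked.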

Read more…

Guest blog post by Ankit Jain

Since its inception in 2008, the global Hadoop market has grown at a tremendous pace. Valued at US$1.5 billion in 2012, the market is estimated to grow at a CAGR of 54.7% from 2012 to 2018 and could reach US$20.9 billion by the end of 2018. With the massive amount of data generated every day across major industries, the global Hadoop market is expected to keep growing significantly.
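
As a quick sanity check of the figures quoted above, compound annual growth can be computed as in the sketch below. The function names are hypothetical and the inputs are the rounded numbers from the text. Compounding US$1.5 billion at 54.7% over six years gives roughly US$20.6 billion, slightly below the quoted US$20.9 billion; the small gap may simply reflect rounding in the published figures.

```python
def project_value(start_value, cagr, years):
    """Compound start_value at a constant annual growth rate."""
    return start_value * (1 + cagr) ** years


def implied_cagr(start_value, end_value, years):
    """Constant annual growth rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1 / years) - 1


if __name__ == "__main__":
    years = 2018 - 2012  # six compounding periods
    projected = project_value(1.5, 0.547, years)
    rate = implied_cagr(1.5, 20.9, years)
    print(f"Projected 2018 value: US${projected:.2f} billion")
    print(f"CAGR implied by the quoted endpoints: {rate:.1%}")
```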

Why Hadoop?

Quite naturally, the mounting scale of unstructured data generated every single day by data-intensive industries such as telecommunication, banking and finance, social media, research, healthcare, and defence has led to the rising adoption of Hadoop solutions.

The major factors driving the need to adopt Hadoop are its cost-sensitive and scalable methodologies…

Read more…

Guest blog post by Tanmay Bhandari

Originally posted on Data Science Central

In the book Hadoop: The Definitive Guide, Tom White quotes Grace Hopper: “In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.” Hadoop has long been the data analytics system preferred by businesses everywhere. The recent arrival of the Spark engine, however, has given businesses an alternative to Hadoop for data analytics.

Much of the discussion among experts in the field of big data analytics is over which of the two data analytics engines, Hadoop or Spark, is the better…

Read more…

In any hiring process, a candidate with a professional certification always gets extra attention. Here are a few of the certifications in data science.

IBM Certified Data Architect -- Big Data

This training helps you master the skills needed to handle big data. A data architect is expected to know the different big data technologies, understand how they differ, and integrate them to solve business problems. The certification holder will be able to plan big data processors and contribute to hardware and software architecture planning. The course is certified by IBM under the IBM Big Data name and can help get your resume shortlisted for interviews.

EMC Data Scientist…

Read more…

Choice is a good thing, but too much choice can lead to confusion and to buyers taking a “wait-and-see” approach until the market coalesces around the eventual winners. Lack of choice was an important factor in how quickly and readily companies bought into the RDBMS movement 30 or so years ago. I believe that too much choice is holding companies back from buying into the Hadoop / NoSQL movement.

Read more…

70 MongoDB Interview Questions and Answers

Guest blog post by Laetitia Van Cauwenberge

According to Wikipedia, MongoDB is a cross-platform document-oriented database. Classified as a NoSQL database, MongoDB avoids the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster. MongoDB is developed by MongoDB Inc. and is published as free and open-source software. MongoDB is the fourth most popular type of database management system, and the most popular for document stores.

MongoDB is definitely a great skill to have on your…

Read more…

Hadoop, named after a toy elephant that belonged to the child of one of its inventors, is an open-source software framework. It is capable of storing colossal amounts of data and handling massive applications and jobs endlessly. Hadoop’s capabilities make it one of the most sought-after data platforms for successful businesses all over the world.

Hadoop Benefits

Because it can store and quickly process any type of data, Hadoop is lightyears ahead of the game in the open-source world. Data is increasing and changing every day due to social media inventions, new mobile devices, and technological advancements. Here are a few more benefits it offers:

  • Malleability - Unlike other databases, Hadoop does not need to process data before storing it. You can store as much as you need to and then process it later. That applies to images, videos, and text as well.
  • Failure tolerance - All of your data is protected against the…

Read more…

Guest blog post by Bill Vorhies

Summary:  What if I told you there’s a database in wide use today that does everything RDBMS and Hadoop can do but is 50 years old?  Never heard of MUMPS?  Check out these startling facts.

 

If you’ve never heard of MUMPS don’t feel like the lone ranger.  A colleague mentioned it to me and drove me to a bit of research.  What I found is really astounding.

  • MUMPS was born in 1966 to solve the problem of massive data flowing into multi-user systems in the healthcare industry.
  • It predates RDBMS but has all the features of NoSQL including (in its modern form) massive parallel processing, horizontal scaling, and runs on commodity…

Read more…
