Big Data Overview
Big data refers to data sets whose size is beyond the ability of commonly used software tools to capture, curate, process and manage within a tolerable elapsed time. Put differently, it is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The goal is to process such data within an acceptable time frame while preserving its intrinsic value. Four distinct application segments comprise the big data market.
Four big data segments are:
Design - Engineering collaboration
Discover - Core simulation, i.e. supplanting physical experimentation
Decide - Analytics
Deposit - Web 2.0 and data warehousing
Data keeps getting bigger. Industry analyst Doug Laney described the “3Vs” (volume, variety, and velocity) as the key data management challenges for enterprises, and the same “3Vs” have since been used by just about anyone attempting to define or describe big data.
Volume: Organizations collect huge amounts of data from a variety of sources, including business transactions, social media, and sensor or machine-to-machine data. In the past, storing it would have been a problem, but new technologies (such as Hadoop) have eased the burden.
Velocity: Data streams in at an unprecedented speed and must be dealt with in a timely manner. Sensors, RFID tags, and smart metering are driving the need to deal with torrents of data in near-real time.
Variety: Data comes in all types of formats, from structured, numeric data in traditional databases to unstructured text documents, email, pictures, video, audio, stock ticker data and financial transactions.
Why big data?
Three trends are disrupting the database status quo:
Big Users: Not that long ago, 1,000 daily users of an application was a lot and 10,000 was an extreme case. Today, with the growth in global communication and Internet use, the increased number of hours users spend online, and the growing popularity of tablets and smartphones, it is not uncommon for apps to have millions of users a day.
Big Data: Nowadays, data is easy to capture and to access through third parties such as Facebook, D&B (Dun & Bradstreet), and others. Personal user information, geo-location data, social graphs, user-generated content, machine logging data, and sensor-generated data are just a few examples of the ever-expanding array of data being captured.
Cloud Computing: Most new applications (both consumer and enterprise) use a three-tier internet architecture that runs in a public or private cloud and supports large numbers of users.
Why Is Big Data Important?
The importance of big data doesn’t revolve around how much data we have, but around what we do with it. We can take data from any source and analyze it to find knowledge and information that enable new product development and optimized offerings.
When we combine big data with high-powered analytics, we can accomplish business-related tasks such as:
Determine root causes of failures, issues, and defects in near-real time.
Generate coupons at the point of sale based on the customer’s buying habits.
Recalculate entire risk portfolios in minutes.
Detect anomalous or fraudulent behavior before it affects your organization.
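The last task in the list can be illustrated with a deliberately simple sketch. It flags transaction amounts that lie far from the median, using the median absolute deviation as a robust measure of spread. The function name, the sample amounts and the threshold of 3.5 are illustrative assumptions, not part of any particular product; real fraud detection relies on far richer models and far more data.

```python
from statistics import median

def flag_anomalies(amounts, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(amounts)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = median(abs(x - med) for x in amounts)
    if mad == 0:
        # At least half the values equal the median, so the score is undefined.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [x for x in amounts if abs(0.6745 * (x - med) / mad) > threshold]

# Hypothetical card transactions for one customer.
transactions = [12.5, 9.9, 14.2, 11.0, 13.3, 10.8, 950.0, 12.1]
suspicious = flag_anomalies(transactions)
print(suspicious)
```

The median-based score is used here instead of the mean and standard deviation because a single extreme value inflates the standard deviation and can hide itself.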
Background of Data Analytics
Big data analytics is the process of examining, cleaning, transforming and modeling large amounts of data of various types in order to discover information, draw conclusions and support decision making. Data mining is a data analysis technique that focuses on modeling and knowledge discovery for predictive rather than descriptive purposes. The varieties of data analysis, classified by statistical modeling and application, are:
Descriptive Statistics - What does the data represent?
Exploratory Data Analysis (EDA) - Discover new features
Confirmatory Data Analysis (CDA) - Confirm or falsify existing hypotheses
Predictive Analysis - Forecast or classify
Text Analysis - Linguistic extraction
Big data analytics can be done with the software tools commonly used as part of advanced analytics, such as predictive analysis and data mining. But the unstructured data sources used for big data analytics may not fit in traditional data warehouses, and traditional data warehouses may not be able to handle the processing demands posed by big data. The technologies associated with big data analytics include NoSQL databases, Hadoop and MapReduce. These technologies form the core of an open source software framework that supports the processing of large data sets across clustered systems.
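To make the MapReduce idea concrete, here is a minimal single-process sketch of the programming model, counting words across two documents. It is not Hadoop's API: the function names and the sample documents are assumptions chosen for illustration, and a real cluster would run the map and reduce steps on many machines in parallel.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}

documents = ["big data needs big storage", "big clusters process data"]
# On a real cluster each document (or block) would be mapped on a different node.
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle(pairs))
print(word_counts)
```

Because each map call looks at one document only and each reduce call looks at one key only, both phases can be spread across machines without coordination beyond the shuffle step.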
Challenges facing big data analytics initiatives include:
A lack of internal data analytics skills
The high cost of hiring experienced analytics professionals
Challenges in integrating Hadoop systems and data warehouses
Big Analytics delivers competitive advantage in two ways compared to the traditional analytical model. Big Analytics describes the efficient use of a simple model applied to volumes of data that would be too large for the traditional analytical environment. Research suggests that a simple algorithm with a large volume of data can be more accurate than a sophisticated algorithm with little data.
More generally, the term “analytics” refers to the use of information technology to harness statistics, algorithms and other tools of mathematics to improve decision-making. Guidance for analytics must recognize that the processing of data may not be linear and may involve the use of data from a wide array of sources. Principles of fair information practices may be applicable at different points in analytic processing. Guidance must be sufficiently flexible to serve the dynamic nature of analytics and the richness of the data to which it is applied.
The power and promise of analytics
Enterprise data and business intelligence.
Big Data analytics can improve network security.
Security professionals manage enterprise system risks by controlling access to systems, services, and applications; defending against external threats; protecting valuable data and assets from theft and loss; and monitoring the network to quickly detect and recover from an attack.
Big data analytics is particularly important to network monitoring, auditing and recovery.
Business Intelligence uses big data and analytics for these purposes.
Big data to address patient care issues and to reduce hospital readmission rates.
The focus is on the lack of follow-up with patients, medication management issues and insufficient coordination of care.
Data is preprocessed to correct any errors and to format it for analysis.
Analytics to Reduce the Student Dropout Rate (Educational Data).
Analytics applied to education data can help schools and school systems better understand how students learn and succeed.
Based on these insights, schools and school systems can take steps to enhance education environments and improve outcomes.
Assisted by analytics, educators can use data to assess and, when necessary, re-organize classes, identify students who need additional feedback or attention, and direct resources to the students who can benefit most from them.
Role of Distributed Systems in Big Data
A distributed system is a collection of independent computers that appears to its users as a single coherent system. It is a system in which components located on networked computers communicate and coordinate their actions only by passing messages. Distributed systems play an important role in managing the big data problems that prevail in today’s world. In the distributed approach, data is placed on multiple machines and made available to the user as if it were in a single system. A distributed system makes proper use of hardware and resources across multiple locations and multiple machines.
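The idea of placing data on multiple machines while presenting a single system can be sketched as follows. This is a toy model under stated assumptions: the class name `ShardedStore`, the node names and the use of in-memory dictionaries in place of networked servers are all hypothetical, and the example omits replication and failure handling entirely.

```python
import hashlib

class ShardedStore:
    """Toy key-value store that spreads records over several 'machines'."""

    def __init__(self, node_names):
        self.node_names = list(node_names)
        # Each node is modelled as a plain dict; a real node is a remote server.
        self.nodes = {name: {} for name in self.node_names}

    def _node_for(self, key):
        # A stable hash ensures every client maps a key to the same node.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.node_names[int(digest, 16) % len(self.node_names)]

    def put(self, key, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key):
        return self.nodes[self._node_for(key)].get(key)

store = ShardedStore(["node-a", "node-b", "node-c"])
for user_id in range(100):
    store.put(f"user:{user_id}", {"id": user_id})

# The caller never needs to know which node holds a record.
print(store.get("user:42"))
print({name: len(data) for name, data in store.nodes.items()})
```

Simple modulo placement like this moves most keys when a node is added or removed; production systems typically use consistent hashing and keep several replicas of each record.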
How do big data and distributed systems solve traditional scalability problems?
It is very rare to see an enterprise that relies completely on a centralized system, but there are still many organizations that keep a tight grip on their internal data center and renounce any more distribution than is absolutely necessary. This may be due to investments in existing infrastructure or to security concerns that arise from a risk-averse culture. Either way, the centralized system is becoming less and less feasible due to many unavoidable factors:
The growing number of client devices every year creates an increasingly complex array of endpoints to serve.
The amount and variety of data collected is expanding exponentially due to advances in social, mobile and embedded technology.
The need to process and analyze this data for business insights has become crucial in a competitive marketplace.
Continuous development and deployment require highly componentized systems for greater flexibility and agility.
The cost of scaling internally to provide the computing resources to keep up with demand while maintaining acceptable performance levels becomes too high to handle from both an administrative and an infrastructure standpoint.
A potential single point of failure is unacceptable in an era when decisions are made in real time; a loss of access to business data is a disaster, and end users don't tolerate downtime.
How does embracing a more distributed architecture address the issues mentioned above?
Different aspects of the distributed computing paradigm resolve different types of performance and availability issues. Here are a few examples:
Peer pressure is a good thing
The peer-to-peer distributed computing model keeps applications and data available without interruption even during a partial system failure. Some vendor SLAs (service-level agreements) guarantee high availability, i.e. about 99% uptime or higher, a feat which few if any enterprises can match using centralized computing. Automated failover mechanisms mean end users are often unaware that there is even a problem, since communication with servers is not compromised. What about latency issues? SLAs may be customized with specific performance metrics for response time and other factors that align with business objectives.
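A short calculation shows what an uptime percentage means in practice and why redundancy helps. The figures are plain arithmetic rather than any vendor's terms, and the second function assumes that replicas fail independently of one another, which real deployments only approximate.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def allowed_downtime_hours(availability_percent):
    """Hours per year a service may be down and still meet the SLA."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)

def combined_availability(node_availability, replicas):
    """Availability of a group that works while at least one replica is up.

    Assumes replicas fail independently (an idealisation).
    """
    return 1 - (1 - node_availability) ** replicas

for sla in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(sla)
    print(f"{sla}% uptime allows about {hours:.2f} hours of downtime per year")

# Two independent nodes at 99% each behave like a single 99.99% service.
print(f"{combined_availability(0.99, 2):.4f}")
```

At 99% uptime the permitted downtime is roughly 87.6 hours a year, which is why SLAs are usually compared by the number of nines they promise.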
The sky is the limit
The virtually unlimited scalability of cloud computing provides the ability to increase or decrease usage of infrastructure resources on demand. Instant, automated provisioning and de-provisioning of servers and other resources allow enterprises to perform better by ensuring that end-user access to applications keeps up with simultaneous, resource-intensive demand even when traffic spikes unexpectedly.
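Automated provisioning and de-provisioning can be pictured as a rule that recomputes the server count from the current load. The sketch below is a simplified assumption, not any cloud provider's autoscaling API: the function name, the per-server capacity of 250 requests per second and the traffic figures are all made up for illustration.

```python
import math

def servers_needed(requests_per_second, capacity_per_server,
                   min_servers=1, max_servers=100):
    """Number of servers to provision for the current load."""
    required = math.ceil(requests_per_second / capacity_per_server)
    # Clamp to the configured floor and ceiling.
    return max(min_servers, min(max_servers, required))

# Hypothetical traffic samples, including an unexpected spike.
traffic = [120, 450, 3800, 900, 60]
plan = [servers_needed(rps, capacity_per_server=250) for rps in traffic]
print(plan)
```

The server count follows the spike up and back down again, so capacity is paid for only while it is needed.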
Data is a big deal
The use of distributed systems also has implications for Big Data. The rise of NoSQL options gives enterprises an opportunity to bifurcate their data stream, accepting and fully utilizing both relational data via SQL databases and non-relational data via options such as MongoDB and MarkLogic. Meanwhile, SQL still has the edge when it comes to reporting functionality, security, and manageability. On the other hand, if you have scale problems that are hard or expensive to solve with traditional technologies, NoSQL helps to fill these needs in ways that were not available before.
Implementing native applications that run on thick clients relieves servers of some of their workload and can deliver a faster and more user-friendly experience (assuming there isn't a need to update data frequently between the client and the server). Using a tiered structure that divides responsibilities among the web, application and data servers lets organizations outsource any of these processes or layers that can be handled most effectively by a third-party vendor. This type of multi-tiered distributed computing can also be used to lessen the burden on internal servers even when deploying applications for thin clients such as mobile devices.
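The split between relational and non-relational data described in this section can be sketched with Python's standard library alone. An in-memory SQLite database stands in for a SQL system, and a list of JSON strings stands in for a document store; this is an assumption made to keep the example self-contained, and it does not use the MongoDB or MarkLogic APIs. The table, fields and records are invented.

```python
import json
import sqlite3

# Relational side: a fixed schema, queried with SQL for reporting.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 7.5)],
)
report = conn.execute(
    "SELECT customer, SUM(total) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(report)

# Document side: each record is self-describing and may carry different fields.
documents = [
    json.dumps({"type": "click", "user": "alice", "page": "/home"}),
    json.dumps({"type": "sensor", "device": "t-17", "celsius": 21.4, "tags": ["lab"]}),
]
events = [json.loads(doc) for doc in documents]
sensor_events = [event for event in events if event["type"] == "sensor"]
print(sensor_events)
```

The SQL side makes the aggregate report a one-line query, while the document side accepts records of different shapes without a schema change, which mirrors the trade-off outlined in the text.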
Bargain-basement pricing
Large-scale distributed virtualization technology has reached the point where third-party data center and cloud providers can squeeze every last drop of processing power out of their CPUs to drive costs down further than ever before. Even an enterprise-class private cloud may reduce overall costs if it is implemented appropriately. The number of cloud service vendors is increasing. In addition to lowering costs, relieving internal IT personnel of the administrative burden may free up resources for developing applications that improve performance in other ways.
Versatility in technology choices
A distributed architecture is able to serve as an umbrella for many different systems. According to Apache.org, Hadoop is just one example of a framework that can bring together a broad array of tools such as:
Hadoop Distributed File System (HDFS) - Provides high-throughput access to application data
Hadoop YARN - Job scheduling and cluster resource management
Hadoop MapReduce - Parallel processing of big data
Pig - High-level data-flow language for parallel computation
ZooKeeper - High-performance coordination service for distributed applications
"Big data", Jimmy Gutterman, 2009, Release 2.0, Issue 11
"Whats the big data? 12 definitions" at /whatsthebigdata.com
"How big data solve scalability problems?" at /theserverside.com/feature