Glossary of Terms related to Big Data – Alphabet C to D
Alphabet – C
Cassandra:
Named after the famous princess of Greek mythology, Cassandra is a popular open-source database management system from The Apache Software Foundation. Cassandra is the preferred choice for handling large volumes of data across distributed servers, especially when scalability and high availability without compromising performance are the prime requirements. Linear scalability and proven fault tolerance on commodity hardware or cloud infrastructure make Cassandra a perfect platform for mission-critical data.
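As a rough illustration, the sketch below connects to a Cassandra node using the DataStax Python driver (an assumption; any Cassandra client would do) and creates a keyspace replicated across three nodes plus a simple table. The contact point, keyspace and table names are made up.

```python
# A minimal sketch using the DataStax Python driver ("cassandra-driver").
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # node(s) to contact; hypothetical host
session = cluster.connect()

# Create a keyspace replicated across three nodes, then a simple table.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.users (
        user_id uuid PRIMARY KEY,
        name text,
        email text
    )
""")
cluster.shutdown()
```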
Clickstream Analytics:
A clickstream is the path a web surfer follows while moving through a website. Clickstream analytics is the process of collecting, analyzing and reporting the huge volumes of data generated as visitors click their way through a site.
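A minimal sketch of clickstream analysis with pandas, counting page views and reconstructing the path each visitor followed; the column names and sample events are hypothetical.

```python
import pandas as pd

clicks = pd.DataFrame([
    {"visitor": "v1", "page": "/home",     "ts": "2024-01-01 10:00:00"},
    {"visitor": "v1", "page": "/products", "ts": "2024-01-01 10:01:10"},
    {"visitor": "v1", "page": "/cart",     "ts": "2024-01-01 10:03:45"},
    {"visitor": "v2", "page": "/home",     "ts": "2024-01-01 10:02:00"},
])
clicks["ts"] = pd.to_datetime(clicks["ts"])

# Page popularity and the click path followed by each visitor.
print(clicks["page"].value_counts())
paths = clicks.sort_values("ts").groupby("visitor")["page"].apply(" -> ".join)
print(paths)
```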
Cloud Computing:
Cloud computing is essentially software and/or data hosted and running on remote servers and accessible on demand from anywhere on the internet. It is revolutionising the way data is stored and analysed. With the growth of big data, cloud computing has become increasingly important because of the dynamic approach it offers to data analysis.
Cluster Analysis:
Cluster analysis is an exploratory technique, the main task of exploratory data mining and a common technique for statistical data analysis. It is also called segmentation analysis or taxonomy analysis. It involves grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar, in some sense, to each other than to those in other groups (clusters).
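A minimal sketch of cluster analysis using k-means from scikit-learn; the data points are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],   # one natural group
                   [8.0, 8.2], [7.9, 8.1], [8.1, 7.8]])  # another group

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(model.labels_)           # cluster assigned to each point
print(model.cluster_centers_)  # centre of each cluster
```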
Cluster Computing:
Cluster computing is computing performed with a “cluster” of pooled resources from several machines. In a computer cluster, every node is set to work together on a single problem, controlled and scheduled by software. Clusters are deployed to improve performance and availability over that of a single computer, while typically being much more cost-effective than single computers of comparable speed or availability.
CoAP:
The Constrained Application Protocol (CoAP) is a specialized web transfer protocol for resource-constrained devices that can be translated to HTTP where needed. CoAP is used with constrained nodes and constrained networks in the “Internet of Things (IoT)”. It is designed for machine-to-machine (M2M) devices which are deeply embedded and have far less memory and power than traditional internet devices.
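A minimal sketch of a CoAP GET request using the aiocoap library (an assumption; other CoAP stacks exist). The target URI points at the public coap.me test server.

```python
import asyncio
from aiocoap import Context, Message, GET

async def main():
    # Create a client context and send a single confirmable GET request.
    protocol = await Context.create_client_context()
    request = Message(code=GET, uri="coap://coap.me/hello")
    response = await protocol.request(request).response
    print(response.code, response.payload)

asyncio.run(main())
```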
Comparative Analytics:
Comparative analytics draws comparisons among multiple processes, data sets or other objects using statistical techniques such as pattern analysis, filtering and decision-tree analytics. Comparative analysis is used extensively in the healthcare sector to bring together big data (from payers, providers, the supply chain and patients) in real time, helping to compare large volumes of medical records, documents and images for more effective and, hopefully, more accurate medical diagnoses. In other words, comparative analytics acts as a tool to convert big data into useful action.
Connection Analytics:
Connection analytics discovers the interrelated connections and influences between people, products and processes within a network in order to refine analytic results. It can be applied to networks of relationships and to business areas such as human resources, operations, marketing and security.
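A minimal sketch of connection analytics on a small made-up network of people, using the networkx library to rank the most connected (and therefore potentially most influential) nodes.

```python
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("Alice", "Bob"), ("Alice", "Carol"), ("Bob", "Carol"),
    ("Carol", "Dave"), ("Dave", "Eve"),
])

# Degree centrality: the share of the network each node is directly linked to.
centrality = nx.degree_centrality(g)
for node, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 2))
```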
Alphabet – D
DaaS:
In the world of data computing, DaaS stands for “Data as a Service” and is a cousin of SaaS, “Software as a Service”. DaaS is built on the idea that high-quality data can be provided to consumers on demand.
Dark Data:
Dark data refers to all the data that an organization gathers and processes but does not use for any meaningful purpose; hence it is “dark” and may never be analyzed. Call centre logs, social network feeds and meeting notes stored in the course of regular business activity, but hardly ever put to any useful purpose, can all be termed dark data.
Data Analyst:
A data analyst is responsible for working with the data that has been fed into and processed through an organization's systems. The data analyst intelligently digs into the data, organizes it, reviews it, validates it and prepares structured reports, often using Microsoft Office tools such as Excel, PowerPoint and sometimes Access, to help the organization take better business decisions.
Data Cleansing:
Data cleansing, also called data scrubbing, is the process of detecting, amending or removing data in a database that is incorrect, incomplete, improperly formatted or duplicated. Organizations use data scrubbing tools to systematically examine data for defects by using rules, algorithms and look-up tables, because dirty data leads to incorrect analysis and bad decisions.
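A minimal sketch of data cleansing with pandas: remove duplicates, normalise formatting and fill missing values. The records are made up for illustration.

```python
import pandas as pd

raw = pd.DataFrame({
    "name":  ["Alice", "alice ", "Bob", None],
    "email": ["a@x.com", "a@x.com", "b@x.com", "c@x.com"],
    "age":   [34, 34, None, 29],
})

clean = (
    raw.assign(name=raw["name"].str.strip().str.title())  # fix inconsistent formatting
       .drop_duplicates(subset="email")                   # remove duplicate records
       .fillna({"age": raw["age"].median()})              # fill missing ages
)
print(clean)
```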
Data Engineering:
Data engineering is the multi-disciplinary practice of engineering computing systems and software for the collection, storage and processing of data, so that information can later be extracted from that data through analysis by a data scientist.
Data Flow Management:
Data flow management is the specialized process of absorbing raw device data while managing the flow between several thousand producers and consumers, and then performing basic data enrichment, in-stream analysis, aggregation, splitting, schema translation, format conversion and other initial steps to prepare the data for further business processing.
Data Governance:
Data governance is the management of all the data an organization holds, ensuring that high data quality exists throughout the complete lifecycle of that data. Its key focus areas include availability, usability, consistency, data integrity and data security, for example within a data lake.
Data Integration:
Data integration is the combination of technical and business processes used to combine data from a variety of sources into a unified view, turning it into meaningful and valuable information for the user.
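A minimal sketch of data integration with pandas: customer records from a hypothetical CRM system are joined with orders from a hypothetical billing system into one unified view. All names and values are illustrative.

```python
import pandas as pd

# Customer records from a hypothetical CRM system.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "region": ["North", "South", "North"],
})
# Order records from a hypothetical billing system.
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "customer_id": [1, 1, 3],
    "amount": [99.0, 45.5, 250.0],
})

# Join the two sources on the shared customer_id key into one unified view.
unified = customers.merge(orders, on="customer_id", how="left")
print(unified.groupby(["customer_id", "name"])["amount"].sum())
```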
Data Lake:
A data lake is a method of storing large amounts of raw data within a single system or repository, in its native format. A data lake holds structured data from relational databases, semi-structured data, unstructured data (such as emails, documents and PDFs) and even binary data, thereby creating a centralized repository that accommodates all forms of data. From a data lake we can access enterprise-wide data as required, process it and make intelligent use of it.
Data Mining:
Data mining is the practice of generating new information by examining and analyzing large databases. It is the process of finding meaningful patterns and deriving insights from large sets of data using sophisticated pattern-recognition techniques. Data miners make use of statistics, machine learning algorithms and artificial intelligence to derive these patterns.
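A minimal sketch of one common data-mining technique, classification: a decision tree learns patterns from a labelled data set, here scikit-learn's bundled iris sample data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small labelled data set and hold out 30% for evaluation.
features, labels = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=0)

# Fit a shallow decision tree and check how well its patterns generalise.
tree = DecisionTreeClassifier(max_depth=3).fit(x_train, y_train)
print("accuracy on held-out data:", tree.score(x_test, y_test))
```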
Data Operationalization:
Data operationalization is the process of strictly defining variables so that they become measurable factors. In practice it means consistently combing through large volumes of data, finding the signals within the noise and delivering actionable information to business stakeholders to drive better outcomes.
Data Preparation:
Data preparation is the process of gathering data from different internal systems and external sources, then profiling, validating and consolidating it into a single file or data table, primarily for use in analysis. The primary objective of data preparation is to ensure that the gathered information is accurate and consistent before it is analyzed.
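A minimal sketch of data preparation with pandas: records from two hypothetical sources (an internal export and an external partner feed, built inline here for illustration) are profiled and consolidated into a single table.

```python
import pandas as pd

# Records gathered from two hypothetical sources.
internal = pd.DataFrame({"customer": ["Alice", "Bob"], "sales": [120.0, 75.5]})
external = pd.DataFrame({"customer": ["Carol"], "sales": [210.0]})

combined = pd.concat([internal, external], ignore_index=True)

# Quick profiling: shape, missing values and summary statistics.
print(combined.shape)
print(combined.isna().sum())
print(combined.describe())

combined.to_csv("prepared_sales.csv", index=False)  # one consolidated file
```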
Data Processing:
Data processing is the process of retrieving, transforming, analyzing or classifying information by a machine. It involves converting data into a usable and desired form.
Data Science:
Data science is an interdisciplinary field of scientific methods, processes, algorithms and systems to extract knowledge or insights from data in various forms, for solving analytically complex problems. It uses basic techniques and theories from mathematics, statistics, information science, and computer science, particularly from the sub-domains of machine learning, classification, cluster analysis, uncertainty quantification, computational science, data mining, databases, and visualization.
Data Scientist:
Data scientists are people with extraordinary skills for manipulating data. They love to play with huge masses of structured or unstructured data points and use their skills in mathematics, statistics and computer science to cleanse, polish and organize the data. They then apply their story-telling skills, analytical powers, industry knowledge, contextual understanding and scepticism of existing assumptions to uncover hidden solutions to business challenges.
Data Swamp:
A data swamp is an unstructured, out-of-control and ungoverned data lake where, due to a lack of process, standards and governance, data is hard to find, hard to use and is consumed out of context.
Data Validation:
Data validation is the process of checking that data has been cleansed and is clean, correct and useful before it is processed.
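A minimal sketch of rule-based data validation with pandas: checks for required fields, plausible age ranges and well-formed email addresses. The column names, sample records and rules are illustrative assumptions.

```python
import pandas as pd

records = pd.DataFrame({
    "email": ["a@x.com", "not-an-email", "b@x.com"],
    "age":   [34, -5, 29],
})

# Each rule must hold for every record before the data is processed further.
checks = {
    "email present":     records["email"].notna().all(),
    "email well formed": records["email"].str.contains("@").all(),
    "age in 0-120":      records["age"].between(0, 120).all(),
}
for rule, passed in checks.items():
    print(rule, "->", "OK" if passed else "FAILED")
```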
Data Virtualization:
Data virtualization is an approach to data management that aggregates data from different sources of information to develop a single, logical and virtual view of that information, so it can be easily accessed by applications, dashboards and portals without requiring technical details of where it is stored or how it is formatted.
Data Warehouse:
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of large amounts of data collected from various sources and used in support of management’s decision-making process.
Descriptive Analytics:
Descriptive analytics interprets historical data to understand the changes that have taken place in a business. The past refers to any point at which an event has occurred, whether one minute or one month ago. Descriptive analytics helps us learn from past behaviour and understand how it might influence future outcomes. For example, reporting that a company spent 5% on rent, 10% on utilities, 10% on insurance, 20% on wages, 8% on maintenance, 18% on training, 22% on marketing and the remainder on miscellaneous activities during 2017 is descriptive analytics.
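A quick worked version of the spending example above: the listed percentages sum to 93%, leaving 7% for miscellaneous activities.

```python
# Spending breakdown from the descriptive-analytics example above (in %).
spend = {"rent": 5, "utilities": 10, "insurance": 10, "wages": 20,
         "maintenance": 8, "training": 18, "marketing": 22}

misc = 100 - sum(spend.values())
print(f"miscellaneous: {misc}%")   # -> miscellaneous: 7%
for item, pct in spend.items():
    print(f"{item}: {pct}%")
```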
Device Layer:
The device layer provides visibility into internal network devices by gathering network device topology, interface and health metrics. With the device layer we get end-to-end visibility into application performance and richer network path metrics in a single pane of glass.
Dirty Data:
In a data warehouse, dirty data is a database record that contains errors. Dirty data can be caused by a number of factors, including duplicate records, incomplete or outdated data, and the improper parsing of record fields from disparate systems. Dirty data needs to be fixed as quickly as possible.
Continue to Next Part of Glossary of Terms E to N