Introduction to Big Data
What is big data ?
- Data that is too large or complex to be stored and processed on a single computer
Classify digital data ?
- Structured
- Semi-Structured
- Unstructured
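A minimal sketch of the three categories, using only the Python standard library. The sample records are invented for illustration.

```python
import csv
import io
import json

# Structured: fixed schema, every row has the same columns (relational table, CSV).
structured = io.StringIO("id,name,balance\n1,Asha,100\n2,Ravi,250\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing, but fields may differ per record (JSON, XML).
semi_structured = json.loads(
    '[{"id": 1, "name": "Asha"}, {"id": 2, "tags": ["vip"]}]'
)

# Unstructured: no predefined data model (free text, images, audio).
unstructured = "Customer called to say the ATM near the station is out of cash."

# Every structured row exposes the same set of columns.
same_columns = all(set(r) == set(rows[0]) for r in rows)

# Semi-structured records can carry different fields.
field_sets = [set(r) for r in semi_structured]
```

Here `same_columns` is true for the CSV rows, while the two JSON records in `field_sets` do not share the same fields.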
Characteristics of big data ?
- Volume
- Velocity
- Variety
What are the different velocities at which data is collected ?
- Batch
- Periodic
- Near Real-time
- Real-time
What are the different varieties of data ?
- Structured
- Semi-Structured
- Unstructured
What are the other 3 V’s of big data ?
- Veracity
- Volatility
- Variability
In addition to the traditional 3 V’s, some include a fourth V. What is it ?
- Veracity
What are the two different ways to classify data analytics ?
- basic, operational, advanced and monetized
- analytics 1.0, 2.0, 3.0
Explain the type of analytics done during the three chronological classes of analytics ?
- 1.0 : Descriptive
- 2.0 : Diagnostic
- 3.0 : Predictive and Prescriptive
Describe a typical data warehouse
?
Differentiate BI and BD ?
BI | BD |
---|---|
Descriptive, Diagnostic | Predictive |
Simple, clean, small datasets | Large, raw, complex, varied dataset |
What happened and Why | New insights |
What are the different types of DBs used in big data ? Give an example for each one
DB Type | Name |
---|---|
Key value | Redis, Riak |
Document | MongoDB, CouchDB |
Wide column | HBase, Cassandra |
Graph | Neo4j, InfiniteGraph |
What is a wide-column database ? A database that stores data in rows and columns, where the names and format of the columns can vary from row to row within the same table
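A rough sketch of the wide-column idea using nested dictionaries. This is a hypothetical in-memory stand-in, not the API of HBase or Cassandra; the `put` and `get` helpers and the sample keys are assumptions made for illustration.

```python
# row key -> {column name -> value}; each row may hold a different set of columns.
table = {}

def put(row_key, column, value):
    """Store one cell; the row is created on first write."""
    table.setdefault(row_key, {})[column] = value

def get(row_key, column, default=None):
    """Read one cell; a missing row or column yields the default."""
    return table.get(row_key, {}).get(column, default)

put("user:1", "name", "Asha")
put("user:1", "email", "asha@example.com")
put("user:2", "name", "Ravi")
put("user:2", "last_login", "2024-01-05")

columns_user1 = set(table["user:1"])   # name, email
columns_user2 = set(table["user:2"])   # name, last_login
```

The two rows live in the same table but carry different columns, and reading a column a row does not have simply returns the default.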
Draw the Apache Hadoop ecosystem
?
What is in-memory analytics ? Querying and processing data held in RAM instead of reading it from disk
What is In-Database Processing ? Integration of data analytics into the data warehouse, so the analysis runs where the data is stored instead of after exporting it
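A small sketch of the difference, using SQLite from the Python standard library as a stand-in for a warehouse. The `sales` table and its rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a warehouse
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("north", 80.0), ("south", 200.0)],
)

# Outside the database: pull every row to the client, then aggregate there.
totals_client = {}
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    totals_client[region] = totals_client.get(region, 0.0) + amount

# In-database: the engine aggregates and only the result leaves the database.
totals_in_db = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
```

Both approaches give the same totals, but the second moves one row per region instead of one row per sale.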
What is a symmetric multiprocessing system ? A symmetric multiprocessing (SMP) system has two or more homogeneous processors that share a centralized main memory (MM) and run under a single operating system
What is tightly coupled multiprocessing ? Another name for symmetric multiprocessing: the processors share main memory under a single operating system
What are the three types of multiprocessing architectures ?
- Shared Memory
- Shared Disk
- Shared Nothing
Explain Consistency ? All nodes should see the same data at the same time
Explain Availability ?
- Node failures do not prevent survivors from continuing to operate
- Every request receives a response, whether it succeeded or failed.
- Every client gets a response, regardless of the state of any individual node in the system.
Explain Partition Tolerance ?
- The system continues to operate despite network partition failures.
- Partition-tolerant systems can sustain any amount of network failure that doesn’t result in a failure of the entire network.
- Data records are sufficiently replicated across combinations of nodes and networks to keep the system up through intermittent outages.
What is the CAP theorem ? A distributed system can guarantee only two of Consistency, Availability and Partition Tolerance at the same time
When should you choose consistency and when availability ? Give examples
- Choose availability over consistency when your business requirements allow some flexibility around when data in the system synchronizes.
- Choose consistency over availability when your business requirements demand atomic reads and writes.
What are the different types of consistencies ?
- Strong
- Weak
- Eventual
What are the different variants of eventual consistency ?
- Monotonic Read
- Monotonic Write
- Read Your Writes
- Causal consistency
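Two of these variants, read your writes and monotonic reads, can be sketched with a client session that remembers the newest version it has seen. This is a minimal illustration, not how any particular database implements it; `Replica` and `Session` are hypothetical names.

```python
class Replica:
    def __init__(self):
        self.version = 0
        self.value = None

    def apply(self, version, value):
        # Replication may deliver updates late; only ever move forward.
        if version > self.version:
            self.version, self.value = version, value


class Session:
    """Client session enforcing read-your-writes and monotonic reads."""

    def __init__(self):
        self.min_version = 0  # newest version this client has written or read

    def write(self, primary, value):
        primary.apply(primary.version + 1, value)
        self.min_version = max(self.min_version, primary.version)

    def read(self, replica):
        # Returns None when the replica is too stale for this session;
        # a real client would retry on another replica.
        if replica.version < self.min_version:
            return None
        self.min_version = replica.version
        return replica.value


primary, lagging = Replica(), Replica()
session = Session()
session.write(primary, "v1")

stale_read = session.read(lagging)             # replica has not caught up yet
lagging.apply(primary.version, primary.value)  # replication arrives
fresh_read = session.read(lagging)
```

The first read is rejected because the lagging replica has not yet received the session's own write; once replication catches up, the same read succeeds.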
What is BASE ? Basically Available, Soft state, Eventual consistency
Differentiate ACID and BASE ?
ACID | BASE |
---|---|
Strong consistency, availability less important | Weak (eventual) consistency, availability first |
Complex mechanisms | Simple and Fast |
Notes
CAP Theorem
In the event of a network partition, a distributed system can choose to be either consistent or available, but not both
Simple Example
- Suppose we have two ATMs
- The supported operations are Deposit, Withdrawal, Check Balance
- There is no central DB, and the ATMs are connected by a network
- Assume that a network partition occurs. Now the ATMs have to choose between being available and being consistent
- Case 1: Availability
- If the ATMs choose to be available then they will operate even though they can’t communicate with each other
- Suppose your balance is 100 and both the ATMs have the same value now
- A network partition occurs
- You withdraw 80 from ATM A
- You go to ATM B and withdraw 80. It allows this transaction because, as far as it knows, your balance is still 100. ATM B chose to service the request even though it knew ATM A was unreachable. You have now withdrawn 160 from a balance of 100, so the two ATMs are inconsistent
- Case 2: Consistency
- In this case the ATMs refuse to serve requests (they are unavailable) until they can talk to one another again
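The two cases can be simulated with a short sketch. The `ATM` class, the `"AP"`/`"CP"` mode labels and the `run` helper are assumptions made for illustration, and replication is reduced to copying the balance to the peer.

```python
class ATM:
    def __init__(self, name, balance, mode):
        self.name = name
        self.balance = balance   # local copy of the account balance
        self.mode = mode         # "AP" = stay available, "CP" = stay consistent
        self.peer = None
        self.connected = True    # False while the network is partitioned

    def withdraw(self, amount):
        if not self.connected and self.mode == "CP":
            return False         # refuse service rather than risk inconsistency
        if amount > self.balance:
            return False         # insufficient funds as far as this ATM knows
        self.balance -= amount
        if self.connected:
            self.peer.balance = self.balance  # replicate to the other ATM
        return True


def run(mode):
    a, b = ATM("A", 100, mode), ATM("B", 100, mode)
    a.peer, b.peer = b, a
    a.connected = b.connected = False         # a network partition occurs
    results = [a.withdraw(80), b.withdraw(80)]
    paid_out = 80 * sum(results)              # total cash handed out
    return results, paid_out


ap_results, ap_paid = run("AP")
cp_results, cp_paid = run("CP")
```

With `"AP"` both withdrawals succeed and 160 is paid out against a balance of 100. With `"CP"` both are refused and nothing is paid out until the partition heals.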