Relational databases assume that all relationships are of similar importance; document databases assume that relationships form a hierarchical structure and that relationships between documents are less important.

NoSQL-type DBs get their power from the developer spending a lot more time and care in thinking about exactly how to access data. Many NoSQL databases loosen the constraints on what you can store in a given record, but in return they are a great deal more fussy about how you access records. If you want to skip careful design of how you access records, you want the relational DB.

If your data cannot be represented on literally a sheet of paper, NoSQL is the wrong data store for you. And I don't mean sheets of paper with references that say "now turn to page 64 for the diagram", no, I mean a sheet of paper per document. That is what a normalized record looks like in a document store.

Horizontal scaling is a distinct benefit of NoSQL, which is why companies like Netflix and Spotify use document databases.

  • RDBMSs more lend themselves to vertical scaling, which can get costly.

NoSQL databases fit better into the whole paradigm of distributed computing, and NoSQL databases make the most of cloud computing and storage. Cloud-based storage is an excellent cost-saving solution but requires data to be easily spread across multiple servers to scale up. Using commodity (affordable, smaller) hardware on-site or in the cloud saves you the hassle of additional software, and NoSQL databases like Cassandra are designed to be scaled across multiple data centers out of the box, without a lot of headaches.

Instead of reshaping data when a query is processed (as an RDBMS system does), a NoSQL database organizes data so that its shape in the database corresponds with what will be queried.

  • This is a key factor in increasing speed and scalability.

Generally, NoSQL databases sacrifice ACID compliance for scalability and processing speed.

Going from SQL to NoSQL is easier than going from NoSQL to SQL.

When all the other components of our application are fast and seamless, NoSQL databases prevent data from being the bottleneck.

  • Big data is contributing to the success of NoSQL databases, mainly because they handle data differently than traditional relational databases.

Types of NoSQL Databases

Document-Based Databases

Document-based databases store data as JSON documents. Each document is made up of key-value-pair-like structures:
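For example, a single "user" document might look like the following sketch (all field names here are hypothetical, chosen for illustration):

```javascript
// A hypothetical "user" document as it might be stored in a document database.
const userDoc = {
  _id: "u123",
  name: "Ada",
  email: "ada@example.com",
  addresses: [
    // Nested data lives inside the document itself,
    // rather than in a separate, joined table.
    { city: "London", zip: "N1" },
  ],
};
```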

Document-based databases are easy for developers to work with, since each document maps directly to an object, and JSON is a very common data format among web developers. They are very flexible and allow us to modify the structure at any time.

Ex. MongoDB, CouchDB, Couchbase, Amazon DocumentDB

Key-Value Database

Here, keys and values can be anything: strings, integers, or even complex objects. Key-value stores are highly partitionable and are the best fit for horizontal scaling. They can be really useful in session-oriented applications, where we try to capture the behavior of a customer within a particular session.

Key-value stores, in general, always maintain a certain number of replicas to offer reliability.

Ex. DynamoDB, Redis, Cassandra
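The session-storage access pattern above can be sketched as a simple get/put by key. Here a plain `Map` stands in for a real store such as Redis; the key format and function names are made up for illustration:

```javascript
// Sketch of session storage in a key-value store: one key per session,
// with the value stored as an opaque blob of session state.
const store = new Map(); // stand-in for a real key-value store

function saveSession(sessionId, state) {
  store.set(`session:${sessionId}`, JSON.stringify(state));
}

function loadSession(sessionId) {
  const raw = store.get(`session:${sessionId}`);
  return raw ? JSON.parse(raw) : null;
}

saveSession("abc", { user: "ada", cart: ["sku-1"] });
```

Because every operation addresses a single key, the data set partitions cleanly across nodes, which is what makes this model so easy to scale horizontally.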

Wide Column-Based Database

This type of database stores data in records similar to a relational database, but it has the ability to store very large numbers of dynamic columns. It groups columns logically into column families.

  • For example, in a relational database you have multiple tables, but in a wide-column database, instead of having multiple tables, we have multiple column families.

Ex. Cassandra



A key-value store approach to NoSQL

Cassandra's approach to data availability is as follows: instead of having one master node, it utilizes multiple masters inside a cluster. With multiple masters present, there is no single point of failure and no fear of downtime. This redundant model ensures high availability at all times.

Cassandra is designed to manage huge volumes of data across multiple nodes.

In contrast to relational databases, which organize data records in rows, Cassandra's data model is based on columns to provide faster data retrieval. The data is stored in the form of a hash.

Cassandra is designed to be scaled across multiple data centers out of the box, without a lot of headaches.

DynamoDB (Amazon)

DynamoDB is a fully managed, highly available, schemaless NoSQL database.

  • The DB engine can manage structured or semi-structured data, including JSON documents.
  • supports key–value and document data structures

It uses tables, though this concept is only loosely related to how tables are used in SQL.

  • Essentially, we would include all domain primitives of a single domain in a Dynamo table, and use GSIs (global secondary indexes) as our means of "JOINing" the domain primitives together.
    • Each GSI should represent an access pattern (i.e. how that data is accessed, such as "get all people by gender" or "get all last names by country").
  • One of the biggest mistakes people make with Dynamo is thinking that it's just a relational database with no relations. It's not.
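One common way to realize the single-table design described above is to give every item generic key attributes and let each GSI re-index them for a different access pattern. The attribute names below (`pk`, `sk`, `gsi1pk`, `gsi1sk`) are a widespread convention, not something DynamoDB requires:

```javascript
// Sketch of a single-table item: the table keys fetch the person directly,
// while the GSI keys serve a different access pattern
// ("get all people by country, sorted by last name").
const person = {
  pk: "PERSON#123",         // table partition key
  sk: "PROFILE",            // table sort key
  gsi1pk: "COUNTRY#NZ",     // GSI partition key (illustrative)
  gsi1sk: "LASTNAME#Smith", // GSI sort key (illustrative)
  firstName: "Jo",
  lastName: "Smith",
};
```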

To get the full benefits of Dynamo, you often need to design your data layer very carefully up-front. Dynamo is not recommended for a system whose design hasn't mostly stabilized.

When to use Dynamo

Dynamo is really good for a high read:write ratio (at least 4:1), meaning Dynamo is a good candidate for read-heavy applications.

  • By default, each DynamoDB table can use up to 40,000 read capacity units and 40,000 write capacity units per second (the default service quota).

Dynamo can support large, complex schemas, but they get more difficult to maintain and understand. Dynamo is a better candidate for applications with simpler schemas.

Dynamo offers effortless cross-region replication, which makes it a good candidate for apps that are distributed geographically.

Data is stored as partitions

Data is stored as a partitioned B-tree.

  • Items are distributed across 10 GB storage units, called partitions (physical storage internal to DynamoDB).
  • Each table has one or more partitions.
  • DynamoDB uses the partition key's value as input to an internal hash function; the output of the hash function determines the partition in which the item is stored.
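The key-to-partition mapping can be illustrated as hashing the key and reducing it modulo the number of partitions. DynamoDB's real hash function is internal and undocumented; this toy version only shows the idea:

```javascript
// Toy illustration of partition selection: hash the partition key,
// then map the hash onto one of N partitions. Not DynamoDB's real hash.
function pickPartition(partitionKey, numPartitions) {
  let h = 0;
  for (const ch of partitionKey) {
    h = (h * 31 + ch.codePointAt(0)) >>> 0; // simple rolling hash
  }
  return h % numPartitions;
}

// The same key always hashes to the same partition, so lookups by
// partition key go straight to the right storage unit.
const p = pickPartition("PERSON#123", 4);
```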

Unlike Redis, DynamoDB is immediately consistent and highly durable, centered around that single partitioned-B-tree data structure.

  • If you put something into DynamoDB, you’ll be able to read it back immediately and, for all practical purposes, you can assume that what you have put will never get lost.



Item: analogous to a row of a SQL table

Attribute: analogous to a column name of a SQL table


Before we can create/update a record in DynamoDB, a plain JS object needs to be converted into a DynamoDB Record.

  • Marshalling refers to converting a JavaScript object into a DynamoDB Record.



// { updated_at: 146548182, uuid: 'foo', status: 'new' }
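Marshalling a record like the one above can be sketched as follows. In practice you would use `marshall`/`unmarshall` from the AWS SDK's `@aws-sdk/util-dynamodb` package; this hand-rolled version covers only strings, numbers, booleans, and null, purely for illustration:

```javascript
// Minimal marshalling sketch: convert a plain JS object into DynamoDB's
// AttributeValue format ({ S: ... }, { N: ... }, { BOOL: ... }, ...).
function marshall(obj) {
  const toAttr = (v) => {
    if (v === null) return { NULL: true };
    if (typeof v === "string") return { S: v };
    if (typeof v === "number") return { N: String(v) }; // numbers travel as strings
    if (typeof v === "boolean") return { BOOL: v };
    throw new Error(`unsupported type: ${typeof v}`);
  };
  return Object.fromEntries(
    Object.entries(obj).map(([key, value]) => [key, toAttr(value)])
  );
}

const record = marshall({ updated_at: 146548182, uuid: "foo", status: "new" });
// → { updated_at: { N: "146548182" }, uuid: { S: "foo" }, status: { S: "new" } }
```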

Dynamo provides seamless integration with services such as Redshift (large-scale data analysis), Cognito (identity pools), Elastic MapReduce (EMR), Data Pipeline, Kinesis, and S3. It also has tight integration with AWS Lambda via Streams and aligns with the serverless philosophy: automatic scaling according to your application load, pay-for-what-you-use pricing, easy to get started with, and no servers to manage.



DynamoDB in practice

The general rule of thumb is to choose Dynamo for lower-throughput apps, as writes are expensive and strongly consistent reads are twice the cost of eventually consistent reads.

When to use DynamoDB?

  • When you are looking for a database that can handle simple key-value queries, but in very large numbers
  • When you are working with an OLTP workload like online ticket booking or banking, where the data needs to be highly consistent

When not to use DynamoDB?

  • In cases where you have to do computations on the data.
    • Relational databases run their queries close to the data, so if you’re trying to calculate the sum total value of orders per customer, then that rollup gets done while reading the data, and only the final summary (one row per customer) gets sent over the network. However, if you were to do this with DynamoDB, you’d have to get all the customer orders (one row per order), which involves a lot more data over the network, and then you have to do the rollup in your application, which is far away from the data.
  • If worried about high vendor lock-in.
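The rollup described above, done client-side: with DynamoDB you fetch every order item over the network and aggregate in the application, whereas a relational database would compute the same result close to the data with something like `SELECT customer_id, SUM(total) FROM orders GROUP BY customer_id`. Table and field names here are illustrative:

```javascript
// What the application has to do after fetching every order row
// from DynamoDB (one row per order comes over the network).
const orders = [
  { customerId: "c1", total: 10 },
  { customerId: "c1", total: 5 },
  { customerId: "c2", total: 7 },
];

// The rollup runs in the application, far away from the data.
const totals = orders.reduce((acc, o) => {
  acc[o.customerId] = (acc[o.customerId] || 0) + o.total;
  return acc;
}, {});
// → { c1: 15, c2: 7 }
```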

Pricing: $256/TB/month for storage

By default, you should start with DynamoDB’s on-demand pricing and only consider the provisioned capacity as cost optimization. On-demand costs $1.25 per million writes, and $0.25 per million reads.

  • Then, if your usage grows significantly, you will almost always want to consider moving to provisioned capacity (significant cost savings).
  • If you believe that on-demand pricing is too expensive, then DynamoDB will very likely be too expensive even with provisioned capacity. In that case, you might want to consider a relational database.
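A back-of-envelope calculation at the on-demand rates quoted above ($1.25 per million writes, $0.25 per million reads); the workload numbers are made up:

```javascript
// Hypothetical monthly workload: 10M writes, 100M reads.
const writes = 10_000_000;
const reads = 100_000_000;

const WRITE_RATE = 1.25; // $ per million write requests (on-demand)
const READ_RATE = 0.25;  // $ per million read requests (on-demand)

const cost = (writes / 1e6) * WRITE_RATE + (reads / 1e6) * READ_RATE;
// → 12.5 + 25 = $37.50/month for requests (storage billed separately)
```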

UE Resources


ElasticSearch

What is it?

ElasticSearch is an open-source, RESTful, distributed search and analytics engine built on Apache Lucene.

  • You can send data in the form of JSON documents to Elasticsearch using the API. Elasticsearch automatically stores the original document and adds a searchable reference to the document in the cluster’s index. You can then search and retrieve the document using the Elasticsearch API
  • ES is NoSQL and is more powerful, flexible, and faster than SQL's LIKE
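A search request to the REST API is a JSON body sent to `GET /<index>/_search`. The sketch below builds a basic `match` query in Elasticsearch's query DSL; the index and field names are illustrative:

```javascript
// A full-text "match" query body for Elasticsearch's _search endpoint.
const searchBody = {
  query: {
    match: { content: "nature photography" }, // full-text match on one field
  },
  size: 10, // return the top 10 hits by relevance score
};
```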

Why use it?

ES is typically used when you have:

  • high data volumes, and are likely to need multiple nodes to process the data
  • unstructured or semi-structured data (log files, text, ...); you ingest the raw data in its original form
  • data that is treated as a blob and thus never updated; it's ingested once, queried, and then purged according to some bulk retention policy (e.g. older than 30 days)
  • a need to access aggregate data more than individual records
  • a need to index in real time, allowing you to ingest high-throughput data streams and query that data quickly, which makes ES well-suited for applications that require constant updates and querying of rapidly changing data

When to use ElasticSearch?

  • If your use case requires a full-text search, including features like fuzzy matching, stemming (e.g. having the word "run" also match "runs", "running" etc), and relevance scoring.
  • If your use case involves chatbots that resolve most queries: when a person types something, there is a high chance of spelling mistakes, and you can make use of ElasticSearch's built-in fuzzy matching.
  • ElasticSearch is also useful for storing and analyzing log data.
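The fuzzy matching mentioned above is exposed in the query DSL via the `fuzziness` parameter of a `match` query; `"AUTO"` tolerates small typos based on term length. Index and field names are illustrative:

```javascript
// A match query that tolerates typos (e.g. "runnng" matching "running"),
// as sent in the body of GET /<index>/_search.
const fuzzyQuery = {
  query: {
    match: {
      title: {
        query: "runnng shoes", // note the deliberate typo
        fuzziness: "AUTO",     // allowed edit distance scales with term length
      },
    },
  },
};
```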

ElasticSearch scales horizontally with your requirements.

It forms part of the ELK stack (along with Logstash and Kibana), giving us log analysis, monitoring, and visualization in the context of application and server logs.

How does it work?

When you're searching for text, ES ranks search results based on how closely the phrase or words match. SQL doesn't do this nearly as well.

  • ES starts to shine when you start to do a lot of filtering

Secondary indexes are the raison d’être of search servers such as Elasticsearch.

Example Reddit post as an ElasticSearch JSON document:

  {
    "id": "abcdefg",
    "title": "Amazing subreddit for nature lovers!",
    "content": "Hey everyone!\n\nI just stumbled upon this incredible subreddit called NatureIsBeautiful and I can't stop scrolling through the posts.",
    "author": "nature_enthusiast23",
    "created_at": "2023-06-05T14:30:00Z",
    "subreddit": "NatureIsBeautiful",
    "upvotes": 1500,
    "comments": 87,
    "tags": ["nature", "photography", "community"],
    "url": ""
  }



Like MongoDB, CouchDB is a document-oriented NoSQL database, but Mongo and Couch diverge significantly in their implementations.

  • CouchDB uses the semi-structured JSON format for storing data. Queries to a CouchDB database are made via a RESTful HTTP API, with views defined in JavaScript.
  • MongoDB uses BSON, a JSON variant that stores data in a binary format. MongoDB uses its own query language that is distinct from SQL, although they have some similarities.

Like Mongo, Couch is schemaless.

CouchDB and MongoDB differ in their approach to the CAP theorem

  • CouchDB favors availability and partition tolerance
    • CouchDB uses eventual consistency. Clients can write to a single database node, and this information is guaranteed to eventually propagate to the rest of the database.
  • MongoDB prefers consistency and partition tolerance.
    • MongoDB uses strict consistency. The database uses a replica set to provide redundancy but at the cost of availability.

As of this writing, Google projects the cost of deploying CouchDB on GCP at $34.72 per month. This estimate is based on a 30 day, 24 hours per day usage in the Central US region, a VM instance with 2 vCPUs and 8 GB of memory, and 10GB of a standard persistent disk.


Every Couchbase node consists of a data service, index service, query service, and cluster-manager component. Starting with the 4.0 release, the three services (data, index, and query) can be distributed to run on separate nodes of the cluster if needed.

In the parlance of CAP Theorem, Couchbase is typically run as a CP system (consistency & partition tolerant)

Couchbase provides a SQL-like query language called N1QL for manipulating the JSON data stored in Couchbase.


PouchDB was created to help web developers build applications that work as well offline as they do online. It enables applications to store data locally while offline, then synchronize it with CouchDB and compatible servers when the application is back online, keeping the user's data in sync no matter where they next log in. Its API is inspired by CouchDB.