NoSQL

Relational databases assume that all relationships are of similar importance; document databases assume that relationships form a hierarchical structure within a document, and that relationships between documents are less important.

NoSQL-type DBs get their power from the developer spending a lot more time and care in thinking about exactly how to access data. Many NoSQL databases loosen the constraints on what you can store in a given record, but in return they are a great deal more fussy about how you access records. If you want to skip careful design of how you access records, you want the relational DB.

If your data cannot be represented on literally a sheet of paper, NoSQL is the wrong data store for you. And I don't mean sheets of paper with references that say "now turn to page 64 for the diagram"; I mean a sheet of paper per document. That is what a record looks like in a document store: self-contained and denormalized.

Horizontal scaling is a distinct benefit of NoSQL, which is why companies like Netflix and Spotify use document databases.

  • RDBMSs lend themselves more to vertical scaling, which can get costly.

NoSQL databases fit better into the paradigm of distributed computing, and they make the most of cloud computing and storage. Cloud-based storage is an excellent cost-saving solution, but it requires data to be easily spread across multiple servers to scale up. Using commodity (affordable, smaller) hardware on-site or in the cloud saves you the hassle of additional software, and NoSQL databases like Cassandra are designed to be scaled across multiple data centers out of the box, without a lot of headaches.

Instead of reshaping data when a query is processed (as an RDBMS system does), a NoSQL database organizes data so that its shape in the database corresponds with what will be queried.

  • This is a key factor in increasing speed and scalability.
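
For example, a blog post that a relational DB would assemble at query time by joining posts, authors, and comments tables can be stored pre-joined in a document store, already in the shape the "render post page" query needs (an illustrative document, not tied to any particular product):

{
  "title": "Why B-trees?",
  "author": { "name": "Ada", "handle": "ada_l" },
  "comments": [
    { "user": "carl", "text": "Great post!" },
    { "user": "dana", "text": "What about LSM trees?" }
  ]
}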

Generally, NoSQL databases sacrifice ACID compliance for scalability and processing speed.

Going from SQL to NoSQL is easier than going from NoSQL to SQL.

When all the other components of our application are fast and seamless, NoSQL databases prevent data from being the bottleneck.

  • Big data has contributed to the success of NoSQL databases, mainly because they handle data differently than traditional relational databases.

Types of NoSQL Databases

Document-Based Databases

Document-based databases store the data as JSON documents; each document is a structure of key-value pairs.
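
An illustrative document (field names made up):

{
  "_id": "user-123",
  "name": "Jane Doe",
  "email": "jane@example.com",
  "roles": ["admin", "editor"]
}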

Document-based databases are easy for developers to work with: documents map directly to objects, and JSON is a very common data format among web developers. They are very flexible and allow us to modify the structure at any time.

Ex. Mongo, Couch, Couchbase, DocumentDB(?)

Key-Value Database

Here, keys and values can be anything: strings, integers, or even complex objects. Key-value stores are highly partitionable and are the best at horizontal scaling. They can be really useful in session-oriented applications where we try to capture the behavior of a customer in a particular session.

Key-value stores, in general, always maintain a certain number of replicas to offer reliability.
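
A minimal sketch of session storage in a key-value store, using the node-redis client (v4 API); the key name, TTL, and payload are illustrative:

const { createClient } = require('redis');

async function saveSession() {
  const client = createClient();   // connects to localhost:6379 by default
  await client.connect();

  // store the session blob under a key, expiring after 30 minutes
  await client.set(
    'session:abc123',
    JSON.stringify({ userId: 42, cart: ['sku-1'] }),
    { EX: 1800 }
  );

  // fetch it back by key -- the only access path a key-value store guarantees
  const session = JSON.parse(await client.get('session:abc123'));
  await client.quit();
  return session;
}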

Ex. DynamoDB, Redis, Cassandra

Wide Column-Based Database

This database stores the data in records similar to any relational database but it has the ability to store very large numbers of dynamic columns. It groups the columns logically into column families.

  • For example, in a relational database you have multiple tables, but in a wide-column based database, instead of having multiple tables, we have multiple column families.

Ex. Cassandra
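
A sketch of a column family for time-series sensor data, created from Node via the cassandra-driver package (keyspace and table names are illustrative); each sensor_id is a partition, and each reading adds an ordered column within it:

const cassandra = require('cassandra-driver');

const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'demo',              // assumed to already exist
});

// One partition per sensor_id; within a partition, readings are stored
// as dynamically added columns, ordered by reading_time.
const ddl = `
  CREATE TABLE IF NOT EXISTS sensor_readings (
    sensor_id    text,
    reading_time timestamp,
    value        double,
    PRIMARY KEY (sensor_id, reading_time)
  )`;

client.execute(ddl).then(() => client.shutdown());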

Implementations

Cassandra

A key-value/wide-column approach to NoSQL

Cassandra's approach to data availability is as follows: Instead of having one master node, it utilizes multiple masters inside a cluster. With multiple masters present, there is no fear of any downtime. The redundant model ensures high availability at all times.

Cassandra is designed to manipulate huge data arrays across multiple nodes.

In contrast to a relational database organizing data records in rows, Cassandra’s data model is based on columns to provide faster data retrieval. The data is stored in the form of a hash.

Cassandra is designed to be scaled across multiple data centers out of the box, without a lot of headaches.

DynamoDB (Amazon)

Dynamo is a fully-managed, highly available schemaless NoSQL database.

  • The DB engine can manage structured or semi-structured data, including JSON documents.
  • supports key–value and document data structures

Uses tables, though this concept is loosely related to how tables are used in SQL.

  • essentially, we would include all domain primitives of a single domain in a Dynamo table, and use GSIs (global secondary indexes) as our means of "JOINing" the domain primitives together
    • each GSI should represent an access pattern (ie. how that data is accessed, such as "get all people by gender", "get all last names by country"); see the query sketch after this list
  • One of the biggest mistakes people make with Dynamo is thinking that it's just a relational database with no relations. It's not.
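
A sketch of the "get all people by gender" access pattern via a GSI, using aws-sdk v2 (table, index, and attribute names are illustrative):

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

// query a GSI whose partition key is `gender`
docClient.query({
  TableName: 'People',
  IndexName: 'gender-index',
  KeyConditionExpression: '#g = :g',
  ExpressionAttributeNames: { '#g': 'gender' },
  ExpressionAttributeValues: { ':g': 'female' },
}).promise()
  .then(res => console.log(res.Items));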

To get the full benefits of Dynamo, you often need to design your data layer very well up-front. Dynamo is not recommended for a system whose design hasn't mostly stabilized.

When to use Dynamo

Dynamo is really good for a high read:write ratio (at least 4:1), meaning Dynamo is a good candidate for read-heavy applications.

  • By default, each DynamoDB table is allocated 40,000 read capacity units and 40,000 write capacity units per second.

Dynamo can support large, complex schemas, but they become more difficult to maintain and understand. Dynamo is a better candidate for applications with simpler schemas.

Dynamo offers effortless cross-region replication. Therefore, it is a good candidate for apps that are distributed geographically.

Data is stored as partitions

Data is stored as a partitioned B-tree.

  • Items are distributed across 10-GB storage units, called partitions (physical storage internal to DynamoDB)
  • Each table has one or more partitions
  • DynamoDB uses the partition key’s value as input to an internal hash function; the output determines the partition in which the item is stored.
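
A sketch of how the partition key and sort key are declared (aws-sdk v2; table and attribute names are illustrative). The HASH key is what gets hashed to pick a partition; the RANGE key orders items within it, giving the B-tree layout:

const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB();

dynamodb.createTable({
  TableName: 'Orders',
  AttributeDefinitions: [
    { AttributeName: 'customerId', AttributeType: 'S' },
    { AttributeName: 'orderDate',  AttributeType: 'S' },
  ],
  KeySchema: [
    { AttributeName: 'customerId', KeyType: 'HASH' },   // partition key
    { AttributeName: 'orderDate',  KeyType: 'RANGE' },  // sort key
  ],
  BillingMode: 'PAY_PER_REQUEST',
}).promise();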

Unlike Redis, Dynamo is immediately consistent and highly durable, and it is centered around that single data structure (the partitioned B-tree).

  • If you put something into DynamoDB, you’ll be able to read it back immediately and, for all practical purposes, you can assume that what you have put will never get lost.

Terms

Items

Analogous to a row of a SQL table

Attribute

Analogous to a column name of a SQL table

Marshalling

Before we can Create/Update a record in DynamoDB, a plain JS object needs to be converted into a DynamoDB Record.

  • Marshalling refers to converting a JavaScript object into a DynamoDB Record; unmarshalling is the reverse.

Example

const AWS = require('aws-sdk');

// unmarshall converts a DynamoDB Record back into a plain JS object;
// marshall does the opposite
AWS.DynamoDB.Converter.unmarshall({
    "updated_at": {"N": "146548182"},
    "uuid": {"S": "foo"},
    "status": {"S": "new"}
})

// { updated_at: 146548182, uuid: 'foo', status: 'new' }

Dynamo provides seamless integration with services such as Redshift (large-scale data analysis), Cognito (identity pools), Elastic MapReduce (EMR), Data Pipeline, Kinesis, and S3. It also has tight integration with AWS Lambda via Streams and aligns with the serverless philosophy: automatic scaling according to your application load, pay-per-use pricing, easy to get started with, and no servers to manage.


Expression

DynamoDB operations are parameterized by expression strings: KeyConditionExpression (which items to read), FilterExpression (filter results after the read), ProjectionExpression (which attributes to return), ConditionExpression (guard a write), and UpdateExpression (how to modify an item). Attribute names and values are bound via ExpressionAttributeNames / ExpressionAttributeValues.
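
A sketch of an UpdateExpression guarded by a ConditionExpression (aws-sdk v2 DocumentClient; table and attribute names are illustrative):

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

docClient.update({
  TableName: 'People',
  Key: { uuid: 'foo' },
  // #s avoids clashing with the reserved word `status`
  UpdateExpression: 'SET #s = :s, updated_at = :now',
  ConditionExpression: 'attribute_exists(#s)',  // fail if the item has no status yet
  ExpressionAttributeNames: { '#s': 'status' },
  ExpressionAttributeValues: { ':s': 'active', ':now': Date.now() },
}).promise();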


DynamoDB in practice

The general rule of thumb is to choose Dynamo for low-throughput apps, as writes are expensive and strongly consistent reads are twice the cost of eventually consistent reads.

When to use DynamoDB?

  • In case you are looking for a database that can handle a very large number of simple key-value queries
  • In case you are working with an OLTP workload, like online ticket booking or banking, where the data needs to be highly consistent

When not to use DynamoDB?

  • In cases where you have to do computations on the data.
    • Relational databases run their queries close to the data, so if you’re trying to calculate the sum total value of orders per customer, then that rollup gets done while reading the data, and only the final summary (one row per customer) gets sent over the network. However, if you were to do this with DynamoDB, you’d have to get all the customer orders (one row per order), which involves a lot more data over the network, and then you have to do the rollup in your application, which is far away from the data.
  • If worried about high vendor lock-in.

Pricing: $256/TB/month (storage)

By default, you should start with DynamoDB’s on-demand pricing and only consider the provisioned capacity as cost optimization. On-demand costs $1.25 per million writes, and $0.25 per million reads.

  • Then, if your usage grows significantly, you will almost always want to consider moving to provisioned capacity (significant cost savings).
  • if you believe that on-demand pricing is too expensive, then DynamoDB will very likely be too expensive, even with provisioned capacity. In that case, you might want to consider a relational database.


Elasticsearch

What is it?

Elasticsearch is an open-source, RESTful, distributed search and analytics engine built on Apache Lucene.

  • You can send data in the form of JSON documents to Elasticsearch using the API.
    • Elasticsearch automatically stores the original document and adds a searchable reference to the document in the cluster’s index. You can then search and retrieve the document using the Elasticsearch API
  • Due to its distributed nature, documents are available across the nodes of the cluster.
    • Each document in an index belongs to one primary shard, and is copied to that shard's replicas.
      • ES selects which copy of a shard (primary or replica) a query should go to in a round-robin fashion
  • ES is NoSQL and is more powerful, flexible, and faster than SQL's LIKE
  • ES documents are heavily denormalized, resulting in documents that do not reference one another.

Example Reddit post as an Elasticsearch document:

{
  "id": "abcdefg",
  "title": "Amazing subreddit for nature lovers!",
  "content": "Hey everyone!\n\nI just stumbled upon this incredible subreddit called NatureIsBeautiful and I can't stop scrolling through the posts.",
  "author": "nature_enthusiast23",
  "created_at": "2023-06-05T14:30:00Z",
  "subreddit": "NatureIsBeautiful",
  "upvotes": 1500,
  "comments": 87,
  "tags": ["nature", "photography", "community"],
  "url": "https://www.reddit.com/r/NatureIsBeautiful/comments/abcdefg/amazing_subreddit_for_nature_lovers/"
}

Why use it?

ES is typically used when you have:

  • high data volumes, and are likely to need multiple nodes to process the data
  • unstructured or semi-structured data (log files, text, ...). You ingest the raw data in its original form.
  • the data is treated as a blob, and thus never updated. It’s ingested once, queried, and then purged according to some bulk retention policy (e.g. older than 30 days)
  • you need to access aggregate data more than individual records
  • you need to index in real time, allowing you to ingest high-throughput data streams and query that data quickly, making it well-suited for applications that require constant updates and querying of rapidly changing data

When to use ElasticSearch?

  • If your use case requires a full-text search, including features like fuzzy matching, stemming (e.g. having the word "run" also match "runs", "running" etc), and relevance scoring.
  • If your use case involves chatbots that resolve most queries: when a person types something, there is a high chance of spelling mistakes, and you can make use of Elasticsearch's built-in fuzzy matching (see the sketch below)
  • Elasticsearch is also useful for storing and analyzing log data
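
A sketch of a fuzzy full-text query in the ES query DSL (index and field names are illustrative); fuzziness: "AUTO" lets a misspelling like "elefant" still match documents containing "elephant":

GET /posts/_search
{
  "query": {
    "match": {
      "content": {
        "query": "elefant",
        "fuzziness": "AUTO"
      }
    }
  }
}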

Other use cases:

  • Add a search box to an app or website
  • Store and analyze logs, metrics, and security event data
  • Use machine learning to automatically model the behavior of your data in real time
  • Automate business workflows using Elasticsearch as a storage engine
  • Manage, integrate, and analyze spatial information using Elasticsearch as a geographic information system (GIS)
  • Store and process genetic data using Elasticsearch as a bioinformatics research tool

Elasticsearch scales horizontally with your requirements.

Forms part of the ELK stack (along with Logstash and Kibana), giving us log analysis, monitoring, and visualization in the context of application and server logs.

As part of the ElasticStack (ELK)

ELK consists of ElasticSearch, Kibana, Beats and Logstash

  • Logstash and Beats facilitate collecting, aggregating, and enriching your data and storing it in Elasticsearch
  • Kibana enables you to interactively explore, visualize, and share insights into your data and manage and monitor the stack.
  • Elasticsearch is where the indexing, search, and analysis magic happens.

How does it work?

When you're searching for text, ES ranks search results based on how close together the phrase or words are. SQL doesn't do this nearly as well.

  • ES starts to shine when you start to do a lot of filtering

Elasticsearch chooses the best underlying data structure to use for a particular field type.

  • Text is tokenized and stored in an inverted index, which supports very fast full-text searches.
    • an inverted index lists every unique word that appears in any document and identifies all of the documents each word occurs in.
    • ex. if we search for the string London, it is the inverted index that allows us to quickly know that the string occurs in 6 different documents in the index.
  • Numeric and geolocational data is stored in BKD trees
    • this allows for fast-range searches and nearest-neighbor queries in large data sets
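
Conceptually, an inverted index is just a map from each term to the list of document IDs it occurs in (a toy sketch):

const invertedIndex = {
  london:   [1, 3, 6],   // "london" occurs in docs 1, 3 and 6
  weather:  [1, 2],
  elephant: [6],
};
// a query for "london weather" walks these postings lists and
// intersects (AND) or unions (OR) them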

Secondary indexes are the raison d’être of search servers such as Elasticsearch.

Mapping is the process by which ES determines how a document is stored and indexed.

How data is retrieved

Based on the query terms passed, each document retrieved will be assigned a score. The documents are then returned to the client sorted by that score.

  • this is the BM25 algorithm
  • note: if I pass "prescription refill", then ES recognizes that there are 2 terms: prescription and refill

Some factors that determine the document's score:

  • rarity - rarer terms (amongst all documents) get a higher multiplier, meaning they contribute more to the final score
    • ex. the word "the" is likely to be very common amongst all matching documents, while the word "elephant" is likely to be rare. As a result, ES recognizes that the word "elephant" is more important, and makes its contribution to the final document's score higher.
    • this is known as Inverse Document Frequency (IDF)
  • density - documents that are longer than average will have the score penalized.
    • That is, the more terms in the document (ones that don't match the query), the lower the score for the document.
    • expl: this makes intuitive sense: if a document is 300 pages long and mentions the word elephant once, the document is more likely to have said something like "elephant in the room", rather than it actually being a document about elephants. On the other hand, if the document is a tweet of 140 characters, then the word Elephant is much more likely to have actually been about Elephants.
    • this is known as Term Frequency (TF)
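
The two factors combine in (roughly) the BM25 formula; a sketch in LaTeX notation, where f(q_i, D) is how often term q_i occurs in document D, |D| is the document length, avgdl is the average document length, and k_1, b are tunable constants (defaults around 1.2 and 0.75):

\mathrm{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}

The IDF multiplier is the rarity factor; the |D|/avgdl term is the length (density) penalty.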

In the absence of replicas, a given query and set of documents will produce a more-or-less deterministic result

  • the non-determinism introduced by replicas happens because ES picks which copy of a shard the query should go to in a round-robin fashion, so the same query run twice in a row will likely go to different copies of the same shard.

How to use it?

Searching data

The Elasticsearch REST APIs support structured queries, full text queries, and complex queries that combine the two.

  • Structured queries are similar to the types of queries you can construct in SQL.
    • ex. you could search the gender and age fields in your employee index and sort the matches by the hire_date field.
    • Query DSL, Elasticsearch SQL
  • Full-text queries find all documents that match the query string and return them sorted by relevance—how good a match they are for your search terms (see the combined sketch below).
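
A sketch combining both in a single bool query (index and field names are illustrative): the match clause is full-text and scored, while the filter clause is structured and unscored:

GET /employees/_search
{
  "query": {
    "bool": {
      "must":   { "match": { "title": "senior engineer" } },
      "filter": { "range": { "hire_date": { "gte": "2020-01-01" } } }
    }
  }
}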

Performing aggregations

Aggregations enable you to build complex summaries of your data and gain insight into key metrics, patterns, and trends.

Instead of just finding the proverbial “needle in a haystack”, aggregations enable you to answer questions like:

  • How many needles are in the haystack?
  • What is the average length of the needles?
  • What is the median length of the needles, broken down by manufacturer?
  • How many needles were added to the haystack in each of the last six months?
  • What are your most popular needle manufacturers?
  • Are there any unusual or anomalous clumps of needles?

Because aggregations leverage the same data-structures used for search, they are also very fast.
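
For instance, the "broken down by manufacturer" style of question maps onto a terms aggregation with a sub-aggregation (index and field names are illustrative; "size": 0 returns only the aggregation, no hits):

GET /needles/_search
{
  "size": 0,
  "aggs": {
    "by_manufacturer": {
      "terms": { "field": "manufacturer" },
      "aggs": {
        "avg_length": { "avg": { "field": "length_mm" } }
      }
    }
  }
}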


ElasticSearch Primitives

Comparison to RDBMS

  • RDBMS => Databases => Tables => Columns/Rows
  • Elasticsearch => Clusters => Indices => Shards => Documents

Index

An Elasticsearch index is a logical namespace that holds a collection of documents

  • That is, an ES Index has nothing to do with database indexes; it is more comparable to a table in SQL
  • "indexing a document" means "inserting a document into the index"

Index Mapping

essentially a schema for how data will be structured in the index

Each field in a mapping has an analyzer associated with it
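
A sketch of an explicit mapping for the Reddit-post document from earlier (the field types here are illustrative choices):

PUT /posts
{
  "mappings": {
    "properties": {
      "title":      { "type": "text", "analyzer": "english" },
      "subreddit":  { "type": "keyword" },
      "created_at": { "type": "date" },
      "upvotes":    { "type": "integer" }
    }
  }
}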

Analyzer

Each analyzer contains:

  • character filters
  • a tokenizer
  • token filters

ES has built-in analyzers, but we can define custom ones, where we define our own tokenizer
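
A sketch of a custom analyzer defined in the index settings (the analyzer name is illustrative; standard, lowercase, and stop are ES built-ins):

PUT /posts
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}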

Tokenizer

  • converts text into tokens
    • ex. converts "a quick brown fox jumps over the lazy dog" into terms ["a", "quick", "brown"] etc.

Tokenizer types:

  • word-oriented
  • partial-word
  • structured text

N-gram tokenizer

can break a word up into a sliding window of contiguous letters

  • ex. "quick" -> ["qu", "ui", "ic", "ck"]

Edge N-gram tokenizer

  • ex. "quick" -> ["q", "qu", "qui", "quic", "quick"]

Filter

might do things like removing articles from the terms (e.g. a, the), including derived words in the search (e.g. cleaner -> ["cleaning", "cleaned", "cleans"]), or adding matches for synonyms that may appear (a synonym filter).

Normalizer

A special type of analyzer

  • emits a single token for a given input, instead of an array of tokens

Queries

Compound vs Leaf queries

  • a leaf query matches against a specific field
  • a compound query combines leaf queries in various ways

Type of compound queries

  • bool (ex. must, should, must_not, filter)
  • boosting
  • constant_score
  • dis_max - only the highest score of any leaf query within a compound query will be considered
  • function_score - allows us to use more complex functions to determine the score

Leaf queries can have their scores boosted with multipliers
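
For example, a title match can be made to count twice as much as a content match (illustrative field names):

{
  "query": {
    "bool": {
      "should": [
        { "match": { "title":   { "query": "elephant", "boost": 2 } } },
        { "match": { "content": { "query": "elephant" } } }
      ]
    }
  }
}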

Full-text Query

A type of leaf query

Matches against text in a specific field

match is the most common type of full-text query

Tools

CouchDB

Like MongoDB, Couch is a document-oriented NoSQL database, but Mongo and Couch diverge significantly in their implementations.

  • CouchDB uses the semi-structured JSON format for storing data. Queries to a CouchDB database are made via a RESTful HTTP API, using HTTP or JavaScript (see the sketch below).
  • MongoDB uses BSON, a JSON variant that stores data in a binary format. MongoDB uses its own query language that is distinct from SQL, although they have some similarities.
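
A sketch of fetching a document over CouchDB's HTTP API from JS (database and document IDs are illustrative):

fetch('http://localhost:5984/mydb/some-doc-id')
  .then(res => res.json())
  .then(doc => console.log(doc));  // the stored JSON document, plus _id and _rev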

Like Mongo, Couch is schemaless.

CouchDB and MongoDB differ in their approach to the CAP theorem

  • CouchDB favors availability and partition tolerance
    • CouchDB uses eventual consistency. Clients can write to a single database node, and this information is guaranteed to eventually propagate to the rest of the database.
  • MongoDB prefers consistency and partition tolerance.
    • MongoDB uses strict consistency. The database uses a replica set to provide redundancy but at the cost of availability.

As of this writing, Google projects the cost of deploying CouchDB on GCP at $34.72 per month. This estimate is based on a 30 day, 24 hours per day usage in the Central US region, a VM instance with 2 vCPUs and 8 GB of memory, and 10GB of a standard persistent disk.

Couchbase

Every Couchbase node consists of a data service, index service, query service, and cluster manager component. Starting with the 4.0 release, the three services can be distributed to run on separate nodes of the cluster if needed.

In the parlance of CAP Theorem, Couchbase is typically run as a CP system (consistency & partition tolerant)

Provides a SQL-like query language called N1QL for manipulating JSON data stored in Couchbase.
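
A sketch of a parameterized N1QL query from the Couchbase Node SDK (connection string, credentials, bucket, and field names are all illustrative):

const couchbase = require('couchbase');

async function run() {
  const cluster = await couchbase.connect('couchbase://localhost', {
    username: 'Administrator',
    password: 'password',
  });
  // SQL-like syntax over JSON documents
  const result = await cluster.query(
    'SELECT u.name, u.email FROM `users` u WHERE u.country = $country LIMIT 10',
    { parameters: { country: 'US' } }
  );
  console.log(result.rows);
}

run();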

PouchDB

PouchDB was created to help web developers build applications that work as well offline as they do online. It enables applications to store data locally while offline, then synchronize it with CouchDB and compatible servers when the application is back online, keeping the user's data in sync no matter where they next log in. Inspired by CouchDB.

