Apache Hadoop

What is it and what is it for?

Some computations can be sped up significantly by splitting the work evenly among many computers.

  • ex. if we have 100 million numbers and need to find the largest one, we can either give the whole list to one computer, or break it into parts (say 100 groups of 1 million each) and give each group to a different computer. Each computer finds the largest number among its 1 million, and then from the resulting set of 100 numbers we take the largest one. Doing it this way (ie. in parallel) is roughly 100x faster, minus the small cost of combining the 100 partial results (see the single-machine sketch after this list).
  • problems such as these are called Embarrassingly Parallel
    • the method of breaking it down (mapping) into pieces and then joining the individual results to form a global result (reducing) is called MapReduce
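A minimal single-machine sketch of this idea in Java, assuming the numbers fit in memory (the ParallelMax class below is hypothetical and not part of Hadoop): the "map" step finds the max of each chunk independently, and the "reduce" step combines the partial maxes into the global max.

    import java.util.Arrays;
    import java.util.Random;
    import java.util.stream.IntStream;

    // Hypothetical single-machine sketch of the map/reduce idea (not Hadoop).
    // "Map" = find the max of each chunk independently; "reduce" = take the max of the partial maxes.
    public class ParallelMax {
        public static void main(String[] args) {
            // 1 million random numbers for illustration (the same idea scales to 100 million).
            long[] numbers = new Random(42).longs(1_000_000).toArray();
            int chunks = 100;
            int chunkSize = numbers.length / chunks;   // assumes the length divides evenly

            // "Map": each chunk is handled independently; on a cluster each chunk
            // would go to a different machine, here it is just a parallel loop.
            long[] partialMaxes = new long[chunks];
            IntStream.range(0, chunks).parallel().forEach(i -> {
                long max = Long.MIN_VALUE;
                for (int j = i * chunkSize; j < (i + 1) * chunkSize; j++) {
                    max = Math.max(max, numbers[j]);
                }
                partialMaxes[i] = max;
            });

            // "Reduce": combine the 100 partial results into one global result.
            long globalMax = Arrays.stream(partialMaxes).max().getAsLong();
            System.out.println("largest number: " + globalMax);
        }
    }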

Hadoop is open-source software that makes MapReduce-style programming easier. You don't have to worry about installing your program on the 100 machines, breaking your initial data into pieces, copying it to all 100 machines, copying results back from all 100 machines, etc. All of that housekeeping is managed by Hadoop. Once you set up a Hadoop cluster over the 100 machines, you can give it any program and data, and it takes care of all the behind-the-scenes work and gives you back the result.
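As a rough sketch of what such a program looks like, here is a hypothetical Hadoop MapReduce job for the largest-number example above (the class names MaxNumber, MaxMapper, and MaxReducer are made up), assuming the numbers sit in HDFS as text files with one number per line: each mapper emits the max of its input split, and a single reducer takes the max of those partial results.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Hypothetical Hadoop job for the "find the largest number" example.
    // Input: text files in HDFS with one number per line (an assumption).
    public class MaxNumber {

        // Map: each mapper scans one split of the input and remembers the largest value it sees.
        public static class MaxMapper
                extends Mapper<LongWritable, Text, NullWritable, LongWritable> {
            private long max = Long.MIN_VALUE;

            @Override
            protected void map(LongWritable key, Text value, Context context) {
                max = Math.max(max, Long.parseLong(value.toString().trim()));
            }

            @Override
            protected void cleanup(Context context)
                    throws IOException, InterruptedException {
                // Emit one partial result per mapper once its split is done.
                context.write(NullWritable.get(), new LongWritable(max));
            }
        }

        // Reduce: a single reducer combines the partial maxima into the global maximum.
        public static class MaxReducer
                extends Reducer<NullWritable, LongWritable, NullWritable, LongWritable> {
            @Override
            protected void reduce(NullWritable key, Iterable<LongWritable> values,
                                  Context context) throws IOException, InterruptedException {
                long max = Long.MIN_VALUE;
                for (LongWritable v : values) {
                    max = Math.max(max, v.get());
                }
                context.write(NullWritable.get(), new LongWritable(max));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "max number");
            job.setJarByClass(MaxNumber.class);
            job.setMapperClass(MaxMapper.class);
            job.setReducerClass(MaxReducer.class);
            job.setNumReduceTasks(1); // one reducer produces the single global max
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Hadoop handles scheduling the mappers near the data, shuffling the partial results to the reducer, and rerunning tasks on machines that fail.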

Hadoop has often been used for implementing ETL (extract-transform-load) processes

  • data from transaction processing systems is dumped into the distributed filesystem in some raw form, and then MapReduce jobs are written to clean up that data, transform it into a relational form, and import it into an MPP data warehouse for analytic purposes
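As an illustration of the cleanup step, here is a hypothetical map-only mapper (the LogCleanupMapper class and its input layout are assumptions) that drops malformed raw records and rewrites the rest as tab-separated rows ready to bulk-load into a warehouse table:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical map-only ETL cleanup step: parse raw log lines, drop malformed
    // records, and emit tab-separated rows for loading into a data warehouse.
    public class LogCleanupMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            // Assumed raw format: "timestamp userId action amountCents"
            String[] fields = value.toString().trim().split("\\s+");
            if (fields.length != 4) {
                return; // skip malformed lines
            }
            String row = String.join("\t", fields[0], fields[1],
                    fields[2].toLowerCase(), fields[3]);
            context.write(NullWritable.get(), new Text(row));
        }
    }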

The biggest limitation of Unix tools is that they run only on a single machine—and that’s where tools like Hadoop come in.

