Data Structures and Algorithms Fundamentals
Data types, memory management, and algorithmic complexity: the foundational concepts that determine your code's performance
Imagine you're organizing your personal library. You could simply stack all the books in a corner - this is extremely fast: as soon as you receive a new book, you toss it on the pile in seconds without thinking. But when you need to find a specific title, you'd have to check each book one by one from top to bottom, potentially examining hundreds of books. Alternatively, if you decide to organize them alphabetically on shelves, by genre, or by author, the initial process of placing each book requires more time: you must find the correct location, perhaps move other books to make space. However, once organized, locating any book becomes almost instantaneous - you know exactly where to look.
Knowing when to use one approach or the other is crucial for designing efficient processes. If you're packing books for a move, quickly throwing them into boxes makes perfect sense: you only need to transport them from one place to another, nobody will be searching for a specific title during transit. But if you're organizing a public library where thousands of people will consult books daily, investing time in meticulously cataloging and sorting every volume is indispensable. To make the right decision, we need two things: understand the use case of the process we're going to design and know the different structures available for organizing information.
In the software world, exactly the same thing happens. When we implement an algorithm or build a system, we first need to understand the use case: are we going to insert data constantly and read it occasionally? Or are we going to make thousands of searches per second? But we also need to master the data structures available for organizing information in a computer's memory. There are many types of structures, each optimized for specific tasks: some are exceptional for fast searches but slow for insertions, others are the opposite. However, before diving into linked lists, binary trees, or hash tables, we need to understand three fundamental pillars: what types of data exist, where they live in memory, and why the way we organize them is crucial.
1. Primitive and Reference Data Types
What exactly is a data type? It's the fundamental classification that tells the computer two critical things: how much memory to reserve and what operations are valid. Declaring a variable as int instead of string isn't just syntactic formality; it defines how much memory the value occupies and whether you can add to it or concatenate it.
What is a byte? A byte is the basic unit of memory storage, composed of 8 bits (where each bit can be 0 or 1). With one byte you can represent 256 different values (2^8 = 256). When we say that an int occupies 4 bytes, it means it uses 32 bits (4 × 8 = 32) to store values, which allows representing approximately 4.3 billion different numbers (2^32 = 4,294,967,296). More bytes = more range of possible values.
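These ranges are easy to verify in any JavaScript console:

// Values representable with 8 bits (1 byte) and with 32 bits (4 bytes)
console.log(2 ** 8); // 256
console.log(2 ** 32); // 4294967296 (~4.3 billion)
// Signed 32-bit integers split that range around zero
console.log(-(2 ** 31)); // -2147483648 (minimum int)
console.log(2 ** 31 - 1); // 2147483647 (maximum int)
Byte and Bit Range Examples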
All programming languages are built on primitive data types (the atomic blocks) and non-primitive types (complex structures built from primitives).
Primitive Data Types
Primitives are the most basic types: numbers, characters, booleans. They're the raw material with which everything else is built. They have fixed size in memory and represent a single value.
// JavaScript has dynamic primitive types
// Integers and decimals (Number)
let age = 25; // integer
let price = 19.99; // decimal
// Booleans
let isActive = true;
let hasPermission = false;
// Character and Strings (String)
let initial = 'A';
let name = "John";
// Undefined and Null
let notAssigned; // undefined
let nullValue = null; // null
console.log(typeof age); // "number"
console.log(typeof isActive); // "boolean"
console.log(typeof name); // "string"
Examples of Primitive Data Types
The difference between int (32 bits, range ±2 billion) and long (64 bits, range ±9 quintillion) isn't arbitrary. If you're counting users of an app, an int suffices. If you're calculating national debt, you need long. Choosing the right type not only prevents overflow errors but also optimizes memory usage.
What is an overflow error? It occurs when you try to store a value that exceeds the maximum range of the data type. For example, if an int (32 bits) has a maximum value of 2,147,483,647 and you add 1 to it, you don't get 2,147,483,648 as you'd expect. Instead, the value "overflows" and wraps around to the minimum negative value: -2,147,483,648. It's like a car odometer that goes from 999,999 to 000,000. This type of bug is silent and dangerous: the program doesn't crash, it simply produces completely incorrect results. In critical cases (financial, aerospace, medical systems), an undetected overflow can be catastrophic.
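JavaScript's Number type doesn't overflow the way a fixed 32-bit integer does, but a typed array does store true 32-bit values, so you can reproduce the wrap-around. A minimal sketch:

// Int32Array stores real 32-bit signed integers
const counter = new Int32Array(1);
counter[0] = 2147483647; // maximum value of a 32-bit int
counter[0] = counter[0] + 1; // "add 1"... and the value wraps around
console.log(counter[0]); // -2147483648 (silent overflow)
Overflow Example with a 32-bit Typed Array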
Similarly, float vs double is a decision between precision and memory. A float (32 bits) gives you ~7 digits of precision, sufficient for 3D graphics. A double (64 bits) gives you ~15 digits, necessary for scientific calculations. The memory difference might seem insignificant for one variable, but multiplied by millions in an array, it becomes critical.
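JavaScript only has 64-bit numbers (double), but Math.fround shows what a value looks like when rounded to 32-bit float precision, which makes the trade-off visible:

// 64-bit double: ~15-16 significant digits
const asDouble = 0.1;
console.log(asDouble); // 0.1
// Same value rounded to 32-bit float precision (~7 digits)
const asFloat = Math.fround(0.1);
console.log(asFloat); // 0.10000000149011612
// Multiplied across millions of array elements, this tiny error accumulates
console.log(asFloat - asDouble); // ~1.49e-9
Float vs Double Precision Example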
Non-Primitive (or Reference) Data Types
Unlike primitives that store a direct value, reference types store a memory address that points to where the data actually lives. This allows complex and variable-size structures: arrays, strings, custom objects.
// Strings (String) - although they behave like primitives,
// internally they are objects
let message = "Hello World";
let length = message.length; // object property
// Arrays (Array)
let numbers = [1, 2, 3, 4, 5];
let names = ["Ana", "Luis", "Maria"];
let mixed = [1, "text", true, null];
// Access and modification
console.log(numbers[0]); // 1
numbers.push(6); // adds to the end
console.log(numbers.length); // 6
// Objects
let person = {
  name: "John",
  age: 30,
  active: true
};
console.log(person.name); // "John"
Examples of Non-Primitive Data Types
Here's a crucial detail: when you pass a primitive to a function, it's passed by value (a copy). When you pass an object or array, it's passed by reference (the memory address). This means modifying an array inside a function affects the original, but modifying an int doesn't. Understanding this difference is fundamental to avoiding subtle bugs.
Why does this matter? Because it determines how your code behaves in memory and directly affects performance. Copying an int costs 4 bytes. Copying an array of 1 million elements costs 4 MB and CPU time. Passing a reference to the array costs only 8 bytes (the pointer size).
// PASS BY VALUE (primitives)
function increment(number) {
  number = number + 10; // Modifies the local COPY
  console.log("Inside function:", number); // 15
}

let x = 5;
increment(x); // A COPY of x is passed (4 bytes copied)
console.log("Outside function:", x); // 5 (no changes)

// PASS BY REFERENCE (arrays/objects)
function modifyArray(arr) {
  arr[0] = 999; // Modifies the ORIGINAL array
  console.log("Inside function:", arr); // [999, 2, 3]
}

let numbers = [1, 2, 3];
modifyArray(numbers); // The ADDRESS is passed (8 bytes copied)
console.log("Outside function:", numbers); // [999, 2, 3] (changed!)

// Even if the array has 1 million elements,
// only 8 bytes are copied (the memory address)
Difference Between Pass by Value and Pass by Reference
The Fundamental Operations
Now that we understand what types of data exist and how they behave in memory (value vs reference), we need to establish the fundamental vocabulary: what operations can we perform on any data collection? These operations are universal for all structures. Later we'll have dedicated posts exploring each structure in depth (arrays, linked lists, trees, hash tables, etc.) with all their complexities and use cases. For now, let's see an introduction to the five essential operations:
1. Access: Get an element at a specific position.
- Array: O(1) - Direct memory address calculation: base_address + (index × element_size). Accessing arr[0] or arr[999999] takes the same time.
- Linked List: O(N) - You must follow pointers node by node from the start. To reach element 1000, you make 1000 jumps.
2. Search: Find an element with a specific value.
- Unsorted Array: O(N) - Without order, you must check element by element. In the worst case, you check the entire collection.
- Binary Search Tree: O(log N) - Each comparison discards half of the search space. 1 million elements requires only ~20 comparisons (log₂(1,000,000) ≈ 20).
3. Insertion: Add a new element.
- Array (in the middle): O(N) - Contiguous memory means shifting all following elements to make space. Inserting 99 into [10,20,30,40,50] → [10,20,99,30,40,50] requires moving 30, 40, 50.
- Linked List (if you already have the position): O(1) - Just change two pointers. Size doesn't matter.
4. Deletion: Remove an element.
- Array (from the middle): O(N) - You leave a "hole" that must be filled by shifting elements backward. [10,20,30,40,50] → [10,20,40,50] requires moving 40 and 50.
- Linked List (if you already have the position): O(1) - Change one pointer to "skip" the deleted node.
5. Traversal: Visit all elements sequentially.
- Array: O(N) - Very efficient due to cache locality. Contiguous elements are loaded together in cache.
- Linked List: O(N) - Less efficient. Each node can be scattered in memory, causing frequent cache misses. In practice can be 10-20x slower than an array even though both are O(N).
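As a quick, hedged illustration, here is how those five operations map onto a plain JavaScript array (only the array column from the list above; linked lists get their own post later):

const items = [10, 20, 30, 40, 50];

// 1. Access: O(1) - direct index calculation
console.log(items[2]); // 30

// 2. Search: O(N) - indexOf checks element by element
console.log(items.indexOf(40)); // 3

// 3. Insertion in the middle: O(N) - splice shifts the following elements
items.splice(2, 0, 99); // [10, 20, 99, 30, 40, 50]

// 4. Deletion from the middle: O(N) - splice shifts elements back
items.splice(2, 1); // [10, 20, 30, 40, 50]

// 5. Traversal: O(N) - visits every element, cache-friendly
let sum = 0;
for (const value of items) sum += value;
console.log(sum); // 150
The Five Operations on an Array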
With this vocabulary established, we can now dive into where this data lives in memory and why that matters so much for the performance of these operations.
2. Memory Management: Stack and Heap
We already know what types of data exist, but where do they exactly live when you run your program? The answer is more complex than it seems. Your computer's memory isn't a single, homogeneous space; it's divided into specialized regions with different characteristics of speed, size, and management.
The Memory Hierarchy: From Fastest to Slowest
Before talking about Stack and Heap, you need to understand that your computer has multiple memory levels, each with a distinct balance between speed, size, and cost. It's like having different types of storage in a kitchen: some spaces are small but within arm's reach (fast), others are spacious but require walking to them (slow).
1. CPU Registers
Registers are the variables that live inside the processor. They're extremely fast (1 CPU cycle, ~0.3 nanoseconds) but incredibly limited: a modern processor has only about 16-32 general-purpose registers. When you write int x = 5;, the compiler tries to keep x in a register whenever possible. You don't have direct control over this; the compiler decides.
2. Cache Memory (L1, L2, L3)
The cache is ultrafast memory that lives next to the processor and acts as an intermediary between the CPU and RAM. It's like a desk near a library: you keep there the books you're constantly using so you don't have to go to the shelves every time.
- L1: The fastest (~1 nanosecond), smallest (~32-64 KB), integrated in each CPU core
- L2: Intermediate (~3-10 nanoseconds), medium size (~256 KB - 1 MB), also per core
- L3: Slower (~10-20 nanoseconds), larger (~8-32 MB), shared among all cores
When your program accesses array[0], the processor loads not just that element but a complete "block" of nearby memory (typically 64 bytes) into cache, betting that you'll soon access array[1], array[2], etc.
Why does data organization matter? Cache Locality
This cache prefetching behavior explains why the physical organization of data in memory has a massive impact on performance. There are two fundamental ways to organize data:
Contiguous data (like arrays): Stored in consecutive memory blocks. When you access the first element, the cache automatically loads nearby elements. If your next operation needs array[1] or array[2], they're already in cache (access in ~1 nanosecond instead of ~100 nanoseconds from RAM). This access pattern is called cache locality and is extremely efficient.
Fragmented data (like linked lists): Each element can be anywhere in RAM. Element 0 might be at address 0x1000, element 1 at 0x9AF3C8, and element 2 at 0x2B14. Each node has a pointer to the next. When you access the first element, the cache loads memory near 0x1000, but the next element is at 0x9AF3C8, completely far away. Cache miss. The processor must wait ~100 nanoseconds to bring that data from RAM. This repeats on each access.
The difference is dramatic. Processing 1 million integers in an array can be 10-20x faster than processing them in a linked list, even if both algorithms are O(N). Big O complexity doesn't capture this hardware detail, but in real systems with millions of operations per second, it can mean the difference between a server responding in 50 ms or in 5 ms. This is why arrays are usually the first choice for storing data collections, unless you need the specific advantages of other structures (like fast insertions at the beginning with linked lists).
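A rough way to feel this from JavaScript is to sum one million numbers stored contiguously (a typed array) versus scattered across individually allocated linked-list nodes. Treat it as a sketch, not a rigorous benchmark: the exact ratio depends on your engine, allocator, and hardware:

const N = 1_000_000;

// Contiguous storage: one 4 MB block
const arr = new Int32Array(N);
for (let i = 0; i < N; i++) arr[i] = 1;

// Fragmented storage: one small object per element, linked by references
let head = null;
for (let i = 0; i < N; i++) head = { value: 1, next: head };

console.time("array sum");
let a = 0;
for (let i = 0; i < N; i++) a += arr[i];
console.timeEnd("array sum");

console.time("linked list sum");
let b = 0;
for (let node = head; node !== null; node = node.next) b += node.value;
console.timeEnd("linked list sum");

console.log(a, b); // 1000000 1000000 - same result, very different traversal cost
Array vs Linked List Traversal Example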
3. RAM (Random Access Memory)
RAM is the main memory where your running programs live. It's much larger (8 GB - 64 GB typically) but significantly slower than cache (~100 nanoseconds). When you declare int arr[1000000];, those 4 MB are allocated in RAM.
Everything we'll discuss about Stack and Heap happens in RAM. RAM is volatile: when you turn off the computer, everything is lost. That's why you must save files to disk.
4. Disk Storage (SSD/HDD)
Disks are massive (256 GB - 4 TB) but extremely slow compared to RAM (~100,000 nanoseconds for SSD, ~10,000,000 nanoseconds for traditional HDD). They're not part of a program's working memory, but there are techniques like "virtual memory" where the operating system can use disk as an extension of RAM when it runs out (this is called "swap" and is devastating for performance).
The speed difference is brutal: If a register access took 1 second, a cache access would take 3 seconds, RAM would take 5 minutes, and disk would take 1 week. That's why code that keeps data close in memory (cache-friendly) can be orders of magnitude faster.
Memory Addresses and Pointers
Now that we know where memory lives, let's talk about how we identify it.
What is a Memory Address?
RAM is like a giant hotel with millions of rooms. Each byte in RAM has a unique address, typically represented in hexadecimal: 0x7fff5fbff5c0. This address is simply a number that identifies where that byte is in physical memory.
When you declare int x = 42;, the operating system assigns it 4 consecutive bytes in RAM, let's say at addresses 0x1000 to 0x1003. The value 42 is stored in those 4 bytes.
What is a Pointer?
A pointer is simply a variable that stores a memory address. Following our hotel analogy, a pointer is like having the room number written on a note: the note isn't the room itself, but it tells you exactly where to find it.
int x = 42; // Normal variable at address 0x1000
int* ptr = &x; // Pointer that stores 0x1000
In high-level languages (JavaScript, Python, Java), pointers are "hidden". When you do let obj = {name: "Ana"};, internally obj is a reference (basically a pointer) to the address where the object lives in memory. The language doesn't let you manipulate the address directly for safety, but it exists.
Why are pointers important? Because they allow sharing data without copying it. As we saw earlier in the pass by value vs pass by reference example, if a function receives a pointer to an array of 1 million elements, it only copies 8 bytes (the address), not the full 4 MB of the array. This is the fundamental reason why passing objects and arrays by reference is so efficient.
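You can observe these hidden references directly in JavaScript: copying a variable that holds an object copies the address (the note with the room number), not the object itself. A minimal sketch:

let original = { name: "Ana" };
let alias = original; // copies the reference, not the object

alias.name = "Luis"; // modifies the single object both variables point to
console.log(original.name); // "Luis"

alias = null; // discards one reference; the object itself still exists
console.log(original.name); // "Luis"
Shared References Example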
Two Memory Regions with Distinct Characteristics
With this context, we can now talk about how the RAM your program uses is organized. Your program doesn't have access to all RAM, but to a virtual space assigned by the operating system, divided into specialized regions. The two most important are Stack and Heap:
1. Stack Memory
The Stack is a memory region that functions exactly like a stack of plates in your kitchen. You can only interact with the top plate: add a new plate on top (push), or remove the top plate (pop). You can't remove the middle plate without toppling everything. This behavior is called LIFO (Last-In-First-Out): the last one in is the first one out.
How does it work in your program?
Every time you call a function, the system creates a "frame" or stack frame on the Stack. Imagine each frame is a box containing:
- The parameters you passed to the function
- The local variables the function declares
- The memory address where it should return when finished
When the function ends, its frame is automatically removed from the Stack, freeing all the memory it used. It's like removing the top plate: fast, simple, automatic.
Visual example:
main() function calls calculateArea(10, 5)
┌─────────────────────┐
│ calculateArea │ ← Top of Stack
│ width: 10 │
│ height: 5 │
│ area: 50 │
├─────────────────────┤
│ main() │
│ result: ? │
└─────────────────────┘
calculateArea ends and returns
┌─────────────────────┐
│ main() │ ← Top of Stack
│ result: 50 │
└─────────────────────┘
Key characteristics:
- Fast: Allocating memory on the Stack is instantaneous. The system only needs to "move a pointer" upward (like stacking a plate). There's no search for free space or complex management. Allocating 100 local variables takes the same time as allocating 1.
- Automatic: You don't need to manually "free" or "delete" anything. When a function ends, its frame automatically disappears from the Stack. This prevents a common type of bug called a "memory leak".
- Limited: The Stack is small, typically 1-8 MB depending on your operating system. Why? Because it needs to be fast and predictable. If you declare a giant array as a local variable (int arr[10000000];), you could exhaust the Stack and cause a stack overflow, which crashes your program. This is why very deep recursion (a function calling itself thousands of times) can cause a stack overflow: each call stacks a new frame.
- Predictable: All variables on the Stack must have a known size before the program runs (at compile time). When you write int x = 5;, the compiler knows that x will occupy exactly 4 bytes. This allows the system to reserve space instantly. You can't create an array whose size is decided during execution: int arr[n]; where n is user input only works if your compiler uses special extensions, because it breaks the predictability rule.
// Local variables are stored on the Stack
function calculateArea(width, height) {
  // 'width', 'height', and 'area' are on the Stack
  let area = width * height;
  return area;
  // When exiting the function, these variables are automatically deleted
}

function processRectangle() {
  let result = calculateArea(10, 5); // 'result' on the Stack
  console.log(result);
  // 'result' is deleted when the function ends
}
Stack Usage Example
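To see the "Limited" point in practice, here is a minimal sketch of a stack overflow caused by deep recursion (the exact depth at which it fails depends on your engine and its stack size):

function countDown(n) {
  // Each call pushes a new frame with its own copy of 'n' onto the Stack
  if (n === 0) return 0;
  return 1 + countDown(n - 1);
}

console.log(countDown(1000)); // fine: 1000 frames fit easily

try {
  countDown(10_000_000); // far too many frames for the Stack
} catch (error) {
  console.log(error.message); // "Maximum call stack size exceeded" (wording varies by engine)
}
Stack Overflow Example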
2. Heap Memory
If the Stack is an orderly and predictable stack of plates, the Heap is a giant, flexible warehouse. Unlike the Stack where everything follows strict entry and exit rules, in the Heap you can request space of any size, at any time, and that space will remain available until you explicitly free it (or until the garbage collector does it for you in modern languages).
How does it work in your program?
When you need to create something whose size you don't know until the program runs, or something that must survive beyond a function, you use the Heap. In languages like JavaScript or Python, you barely think about this; it happens automatically. In C/C++, you must request it explicitly with keywords like new or malloc.
Imagine the Heap as a huge parking lot. When you arrive (need memory), the attendant searches for an available space large enough for your car (your data). They give you the space number (the memory address) and there you park. When you leave (free the memory), that space becomes available again. But unlike the Stack, spaces aren't freed automatically when you exit a function, you decide when.
Visual example:
Heap memory (simplified representation)
┌──────────────────────────────────────┐
│ [free] │
├──────────────────────────────────────┤
│ Array of 1000 integers (4KB) │ ← new int[1000]
│ Address: 0x10000 │
├──────────────────────────────────────┤
│ [free] │
├──────────────────────────────────────┤
│ User object {name, age} │ ← new User("Ana", 25)
│ Address: 0x20000 │
├──────────────────────────────────────┤
│ [free] │
├──────────────────────────────────────┤
│ Variable length string (50B) │ ← "Hello World..."
│ Address: 0x30000 │
├──────────────────────────────────────┤
│ [free] │
└──────────────────────────────────────┘
Each allocated block can be a different size, they can be scattered, and they persist independently of which function created them.
Key characteristics:
- Large: The Heap is huge compared to the Stack. It's limited only by your available RAM (typically gigabytes). You can create an array of 100 million elements in the Heap without problem. The Stack only has 1-8 MB, but the Heap can use 8 GB, 16 GB, or whatever your system has available. This makes it ideal for large data structures: in-memory databases, giant caches, high-resolution image processing.
- Flexible: Here's the magic of the Heap: you can create structures whose size you don't know until the program runs. Does the user want to upload an image? You don't know if it'll be 1 KB or 10 MB until they do. In the Heap, you simply request the exact amount of memory you need at that moment: new byte[imageSize]. On the Stack this is impossible because everything must have a known size at compile time.
- Slower: Requesting memory in the Heap is more expensive than on the Stack. Why? Because the operating system must search for free space that's large enough. Imagine an almost full parking lot: the attendant must walk through spaces until finding one that fits you. Worse yet, over time the Heap becomes fragmented: you have free memory scattered in bits between occupied blocks. You request 1 MB contiguous, but only have 500 KB here and 600 KB there. The system must move things or search more. This is called fragmentation management and consumes CPU time. Allocating in the Heap can be 100-1000x slower than on the Stack.
- Manual or Automatic (depending on the language):
  - In C/C++: You're responsible for requesting and freeing memory. You use new/malloc to request and must use delete/free to free. If you forget to free, you cause a memory leak: memory you're no longer using but remains occupied. With enough leaks, your program exhausts RAM and the operating system kills it.
  - In Java, Python, JavaScript: These languages have a Garbage Collector. It's a process that runs in the background, detecting which objects no longer have active references, and automatically frees that memory. You don't need to think about delete. But the GC isn't free: it must briefly pause your program to do cleanup. In normal applications this is imperceptible (~milliseconds). In high-frequency systems (video games, trading), these pauses can be problematic.
- Persistent: This is a critical difference from the Stack. Data in the Heap is not automatically deleted when a function ends. If you create an object in the Heap inside a function and return a pointer/reference to it, the object survives after the function ends. This allows returning complex structures without copying them. It's extremely useful but also dangerous in C/C++: if you lose all references to an object in the Heap without freeing it first, that memory becomes inaccessible but occupied (memory leak).
// Objects and arrays are stored in the Heap
function createUser(name, age) {
  // The object is created in the Heap
  let user = {
    name: name,
    age: age,
    hobbies: [] // the array is also in the Heap
  };
  // 'user' (the reference) is on the Stack,
  // but the object it points to is in the Heap
  return user;
}

let person = createUser("Ana", 25);

// The object persists in the Heap even after
// the function ends
console.log(person.name); // "Ana"

// JavaScript's Garbage Collector will free
// the memory when there are no more references to the object
Heap Usage Example
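The Garbage Collector can only reclaim Heap memory that nothing references anymore. The sketch below shows a classic leak pattern in GC languages: a long-lived cache that keeps growing because entries are never removed (the names cache and handleRequest are just for illustration):

// A module-level cache lives as long as the program does
const cache = [];

function handleRequest(id) {
  // Each call allocates a sizable object in the Heap...
  const result = { id: id, payload: new Array(10_000).fill(0) };

  // ...and the cache keeps a reference to it forever,
  // so the Garbage Collector can never free it
  cache.push(result);

  return result.id;
}

// After thousands of requests, the Heap keeps growing
for (let i = 0; i < 1000; i++) handleRequest(i);
console.log(cache.length); // 1000 objects still reachable, still occupying memory
Memory Leak Pattern Example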
3. The Importance of Data Structure
We now understand what types of data exist and where they live in memory. Now the critical question: why does how we organize them matter?
Hardware cannot save poorly structured code. You can have the world's fastest processor, but if you organize data incorrectly, your application will be slow. The right data structure can make your code 50,000x faster than the wrong one, without changing a single line of business logic.
Real performance comes from organization
Imagine a product catalog with 1 million items. A user searches for a specific one:
- Unsorted array with linear search (O(N)): Worst case = check 1,000,000 elements → ~50 ms
- Binary search tree (O(log N)): Worst case = check ~20 levels → ~0.001 ms
The difference: 50,000x faster. Not from better hardware, but from organizing data intelligently. Multiply this by thousands of simultaneous users and you understand why companies like Amazon invest so much in efficient data structures.
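To make the catalog numbers tangible, here is a hedged sketch comparing both approaches on one million sorted IDs; absolute times vary by machine, but the number of comparisons is what produces the gap (linearSearch and binarySearch are illustrative helpers, not library functions):

const N = 1_000_000;
const catalog = Array.from({ length: N }, (_, i) => i * 2); // sorted IDs
const target = 1_999_998; // worst case: the last element

// Linear search: O(N) comparisons
function linearSearch(arr, value) {
  for (let i = 0; i < arr.length; i++) {
    if (arr[i] === value) return i;
  }
  return -1;
}

// Binary search: O(log N) comparisons (requires a sorted array)
function binarySearch(arr, value) {
  let low = 0, high = arr.length - 1;
  while (low <= high) {
    const mid = Math.floor((low + high) / 2);
    if (arr[mid] === value) return mid;
    if (arr[mid] < value) low = mid + 1;
    else high = mid - 1;
  }
  return -1;
}

console.log(linearSearch(catalog, target)); // 999999, after ~1,000,000 comparisons
console.log(binarySearch(catalog, target)); // 999999, after ~20 comparisons
Linear Search vs Binary Search Example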
There's no perfect universal structure
There's no structure that's O(1) for all operations. Each structure sacrifices efficiency in some operations to gain it in others:
- Arrays: Access O(1), but insertion in the middle O(N)
- Linked Lists: Insertion O(1) (if you have the position), but access O(N)
- Hash Tables: Search O(1), but no guaranteed order
- Binary Search Trees: Balance between search, insertion, and deletion, all O(log N)
The right question isn't "what's the best data structure?", but "what's the best for this specific problem?" If your application does 90% reads, optimize for reads (arrays, trees). If it does constant insertions at the beginning, optimize for that (linked lists). These decisions are based on the five fundamental operations we saw in the first part: access, search, insertion, deletion, and traversal.
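JavaScript itself reflects these trade-offs in its built-in structures: an Array keeps order and gives O(1) access by index, while a Set (a hash-based structure) gives near-constant membership checks but no index access. A rough sketch of how the choice shows up (timings vary by machine):

const ids = Array.from({ length: 1_000_000 }, (_, i) => i);
const idSet = new Set(ids);

console.time("Array.includes - O(N) search");
ids.includes(999_999); // scans element by element
console.timeEnd("Array.includes - O(N) search");

console.time("Set.has - O(1) average search");
idSet.has(999_999); // hash lookup, no guaranteed order
console.timeEnd("Set.has - O(1) average search");
Array vs Set Search Example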
Scale magnifies decisions
With 100 elements, almost any structure works. The difference between O(N) and O(log N) is imperceptible (microseconds). But with 10 million elements, that difference can be seconds vs milliseconds. In production with thousands of concurrent users, it's the difference between a system that responds and one that collapses.
This is why understanding data structures isn't academic, it's existential for real systems. A streaming system with millions of users can't afford inefficient structures. A search engine processing trillions of documents can't use linear search. A database with terabytes of information must organize data for efficient access.
Pragmatism also matters
Don't optimize prematurely. The simplest structure that solves your problem is often the right one. An unsorted array with linear search is perfectly valid if you process 50 elements once a day. Readable and maintainable code can save you more time than ultra-optimized code.
First make it work correctly, then measure where the real bottlenecks are (with profilers, production metrics), and only then optimize. As Donald Knuth says: "Premature optimization is the root of all evil".
Conclusion
We've covered the three fundamental pillars that determine how your code behaves in production:
1. Data types: They're not just syntactic labels. They determine memory consumption, performance, and behavior (pass by value vs reference). The difference between int and long, or between primitives and objects, has real consequences.
2. Memory management: Stack and Heap aren't academic abstractions. They're physical regions with radically different characteristics. Understanding the complete hierarchy (registers → cache → RAM → disk) and concepts like cache locality explains why code that's O(N) in theory can be 10-20x faster in practice.
3. Data structures: Hardware can't compensate for poor organization. The right structure can make your code 50,000x faster. There's no universal solution, each structure is optimized for specific operations.
Most importantly: these three pillars are deeply connected. The data types you choose determine where they live in memory. Where they live in memory affects performance. And how you organize them (the data structure) amplifies or destroys that performance.
These concepts aren't optional for professional software. They're the common language among competent developers. When a senior engineer talks about "avoiding allocations in the Heap inside the hot path", they're applying exactly these principles.
Now you have the conceptual map. In upcoming posts we'll dive into concrete structures: arrays, linked lists, stacks, queues, trees, and hash tables. We'll see implementations with code, specific trade-offs, and real production use cases. The theoretical knowledge you have now will become practical tools you'll use daily.