
* NVIDIA remains a cornerstone of the AI and GPU markets, but it faces headwinds from export controls, competition, and stock volatility. Its innovation in AI, quantum computing, and automotive tech, coupled with strong institutional backing, positions it for long-term growth, though short-term risks persist.

| Linear Regression | Logistic Regression |
|---|---|
| No activation function is used. | An activation function (the sigmoid) is used to convert the linear regression equation into the logistic regression equation. |
| No threshold value is needed. | A threshold value is needed to turn the predicted probability into a class label. |
| Root Mean Square Error (RMSE) is calculated to guide the next weight update. | Log loss (cross-entropy) is calculated to guide the next weight update. |
| The dependent (response) variable must be numeric and continuous. | The dependent (response) variable is categorical, typically with two classes. |
| It is based on least squares estimation. | It is based on maximum likelihood estimation. |
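
The activation and threshold rows of the table can be illustrated with a minimal Python sketch. The function names and the weights `w`, `b` are hypothetical and not fitted to any data; the sketch only shows how the same linear term is used directly in linear regression and passed through a sigmoid plus a threshold in logistic regression.

```python
import math

def linear_predict(x, w, b):
    """Linear regression: the prediction is the raw linear term (no activation)."""
    return w * x + b

def sigmoid(z):
    """Logistic activation: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_predict_proba(x, w, b):
    """Logistic regression: the linear term is passed through the sigmoid."""
    return sigmoid(linear_predict(x, w, b))

def logistic_classify(x, w, b, threshold=0.5):
    """A threshold turns the probability into a class label (0 or 1)."""
    return 1 if logistic_predict_proba(x, w, b) >= threshold else 0
```

With the assumed values `w = 3.0` and `b = 1.0`, `linear_predict(2.0, 3.0, 1.0)` returns a continuous value, while `logistic_classify(2.0, 3.0, 1.0)` returns a class label.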

static data_type var_name = var_value;

| Stacks | Queues |
|---|---|
| A stack is a data structure that stores a collection of elements, with operations to push (add) and pop (remove) elements from the top of the stack. | A queue is a data structure that stores a collection of elements, with operations to enqueue (add) elements at the back of the queue, and dequeue (remove) elements from the front of the queue. |
| Stacks follow the LIFO (last in, first out) principle: the element inserted last is the first to come out. | Queues follow the FIFO (first in, first out) principle: the element inserted first is the first to come out. |
| Stacks are often used for tasks that require backtracking, such as parsing expressions or implementing undo functionality. | Queues are often used for tasks that involve processing elements in a specific order, such as handling requests or scheduling tasks. |
| Insertion and deletion in stacks take place at only one end of the list, called the top. | Insertion and deletion in queues take place at opposite ends of the list: insertion at the rear and deletion at the front. |
| Insert operation is called push operation. | Insert operation is called enqueue operation. |
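
The LIFO and FIFO behaviour in the table can be sketched in Python. This is one possible illustration, not the only way to build these structures: a plain `list` stands in for the stack and `collections.deque` for the queue, and the helper names `stack_order` and `queue_order` are hypothetical.

```python
from collections import deque

def stack_order(items):
    """Push every item, then pop until empty: items come out in reverse (LIFO)."""
    stack = []
    for item in items:
        stack.append(item)       # push onto the top
    out = []
    while stack:
        out.append(stack.pop())  # pop from the top
    return out

def queue_order(items):
    """Enqueue every item, then dequeue until empty: order is preserved (FIFO)."""
    queue = deque()
    for item in items:
        queue.append(item)           # enqueue at the rear
    out = []
    while queue:
        out.append(queue.popleft())  # dequeue from the front
    return out
```

A `deque` is used for the queue because removing from the front of a plain list takes time proportional to its length.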

| Parameters | Triggers | Procedures |
|---|---|---|
| Basics | A trigger is implicitly invoked whenever an event such as INSERT, DELETE, or UPDATE occurs on a table. | A procedure is explicitly called by the user or application using statements or commands such as exec, EXECUTE, or simply the procedure name. |
| Action | When an event occurs, a trigger helps to execute an action automatically. | A procedure helps to perform a specified task when it is invoked. |
| Define/call | Only nesting of triggers can be achieved in a table. We cannot define/call a trigger inside another trigger. | We can define/call procedures inside another procedure. |
| Syntax | In a database, the syntax to define a trigger: CREATE TRIGGER TRIGGER_NAME | In a database, the syntax to define a procedure: CREATE PROCEDURE PROCEDURE_NAME |
| Transaction statements | Transaction statements such as COMMIT, ROLLBACK, and SAVEPOINT are not allowed in triggers. | All transaction statements such as COMMIT and ROLLBACK are allowed in procedures. |
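
The implicit versus explicit invocation in the table can be demonstrated with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the table names `orders` and `audit_log`, the trigger name `log_order`, and the helper `add_order` are all hypothetical. SQLite has no stored procedures, so an ordinary Python function stands in for the explicitly called procedure here.

```python
import sqlite3

def run_demo():
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
    cur.execute("CREATE TABLE audit_log (order_id INTEGER, note TEXT)")
    # The trigger fires implicitly on every INSERT into orders.
    cur.execute("""
        CREATE TRIGGER log_order AFTER INSERT ON orders
        BEGIN
            INSERT INTO audit_log (order_id, note) VALUES (NEW.id, 'inserted');
        END
    """)

    def add_order(amount):
        """Stand-in for a procedure: runs only when called explicitly."""
        cur.execute("INSERT INTO orders (amount) VALUES (?)", (amount,))
        conn.commit()  # transaction control happens here, not in the trigger

    add_order(19.99)
    add_order(5.00)
    rows = cur.execute(
        "SELECT order_id, note FROM audit_log ORDER BY order_id"
    ).fetchall()
    conn.close()
    return rows
```

Nothing in `add_order` writes to `audit_log`; the rows returned by `run_demo()` come entirely from the trigger.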

| Bluetooth | Wi-Fi |
|---|---|
| Bluetooth has no full form. | Wi-Fi is commonly said to stand for Wireless Fidelity. |
| It requires a Bluetooth adapter on all devices for connectivity. | It requires a wireless adapter on all devices and a wireless router for connectivity. |
| Bluetooth consumes little power. | Wi-Fi consumes more power. |
| Bluetooth is less secure than Wi-Fi. | Wi-Fi provides better security than Bluetooth. |
| Bluetooth is less flexible: it supports only a limited number of users. | Wi-Fi supports a large number of users. |
| The radio signal range of Bluetooth is about ten meters. | The radio signal range of Wi-Fi is about a hundred meters. |

A stack overflow is a runtime error that occurs when the call stack in a program exceeds its maximum size, usually because of excessive or infinite recursion.

The call stack is a region of memory that stores:

* Information about active subroutines (function calls)
* Local variables and return addresses

Each time a function is called, a stack frame is pushed onto the call stack. When the function returns, that frame is popped off.

Common causes:

* Infinite or deep recursion:

      def recurse():
          recurse()  # No base case → infinite recursion → stack overflow

      recurse()

* Large stack-allocated structures: if a function allocates a huge array on the stack (e.g., a large local variable), it can overflow the stack.
* Mutual recursion: functions calling each other recursively without an exit condition.

Typical symptoms:

* A crash with an error such as:
  * Segmentation fault (Linux)
  * StackOverflowError (Java)
  * Stack overflow message (Windows)
* The program freezes or crashes unexpectedly during execution.

How to prevent it:

* Add base cases in recursive functions.
* Use iteration instead of recursion if possible.
* Allocate large data on the heap, not the stack.
* Increase stack size (only if absolutely needed and you understand the risks).
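
The first two prevention tips can be sketched in Python. The function names are hypothetical. Note that CPython guards its call stack with a recursion limit and raises `RecursionError` instead of crashing, so that exception is what the sketch observes: a base case stops infinite recursion, but a loop is still needed when the depth grows with the input.

```python
import sys

def depth_recursive(n):
    """Counts down recursively. Has a base case, but stack depth grows with n."""
    if n == 0:  # base case: stops the recursion
        return 0
    return 1 + depth_recursive(n - 1)

def depth_iterative(n):
    """Same result with a loop: the call stack stays at constant depth."""
    count = 0
    while n > 0:
        count += 1
        n -= 1
    return count

def recursion_fails(n):
    """Returns True if the recursive version exceeds the interpreter's limit."""
    try:
        depth_recursive(n)
        return False
    except RecursionError:
        return True
```

For small inputs both versions agree. For an input deeper than `sys.getrecursionlimit()`, only the iterative version completes.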

The CUDA thread hierarchy is a fundamental concept in GPU programming, designed to organize and manage massive parallelism efficiently on NVIDIA GPUs.

* Threads: The smallest units of work. Each thread runs a single instance of the kernel function and is mapped to a CUDA core on the GPU.
* Blocks: Threads are grouped into blocks. Threads within a block can communicate and synchronize using shared memory and synchronization primitives like __syncthreads().
* Grids: Blocks are organized into a grid. A grid can be one-, two-, or three-dimensional, and it contains all the blocks launched by a kernel.
* Kernel Launch: When you launch a CUDA kernel, you specify the number of blocks and threads per block using the syntax <<<numBlocks, threadsPerBlock>>>.
* Thread Identification: Each thread gets a unique identifier within its block (threadIdx), and each block gets a unique identifier within the grid (blockIdx).
* Dimensions: Both blocks and grids can be organized in 1D, 2D, or 3D layouts, which is useful for processing data structures like vectors, matrices, or volumes.
* Parallelism: The total number of threads is the number of blocks multiplied by the number of threads per block. This structure allows GPUs to efficiently scale to thousands or millions of parallel threads.
* Scalability: The hierarchy allows programs to scale across GPUs with different numbers of cores and capabilities.
* Collaboration: Threads within a block can cooperate and share data via shared memory, which is faster than global memory.
* Flexibility: The multi-dimensional structure makes it easy to map computational problems to the GPU, especially for image, matrix, and scientific computations.

| Level | Purpose | Example Variables |
|---|---|---|
| Thread | Smallest unit, executes kernel code | threadIdx.x, .y, .z |
| Block | Groups threads, enables synchronization | blockDim.x, .y, .z |
| Grid | Groups blocks, manages large-scale execution | gridDim.x, .y, .z |
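
The way `blockIdx`, `blockDim`, and `threadIdx` combine into a global index can be simulated on the CPU. The Python sketch below is an illustration only and is not CUDA code. The names `launch_config` and `simulate_kernel` are hypothetical, and the nested loops stand in for threads that a GPU would run in parallel.

```python
def launch_config(n, threads_per_block):
    """Blocks needed so that num_blocks * threads_per_block >= n (ceiling division)."""
    return (n + threads_per_block - 1) // threads_per_block

def simulate_kernel(n, threads_per_block):
    """Visits every (block, thread) pair of a 1D launch and marks the element it owns."""
    num_blocks = launch_config(n, threads_per_block)
    visits = [0] * n
    for block_idx in range(num_blocks):                     # plays the role of blockIdx.x
        for thread_idx in range(threads_per_block):         # plays the role of threadIdx.x
            i = block_idx * threads_per_block + thread_idx  # global index
            if i < n:  # guard: the last block may have surplus threads
                visits[i] += 1
    return num_blocks, visits
```

For 10 elements and 4 threads per block, 3 blocks are launched (12 threads in total), and the guard keeps the 2 surplus threads from touching memory out of range.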

| Memory Type | Scope | Location | Speed | Lifetime | Typical Use Case |
|---|---|---|---|---|---|
| Global | All threads in all blocks (grid-wide) | Device DRAM (off-chip) | Slow | Application/kernel | Main data exchange, large datasets, host-device transfer |
| Shared | Threads within a block | On-chip (within GPU) | Very fast | Block lifetime | Inter-thread communication, caching, scratchpad |
| Local | Single thread | Off-chip (DRAM) or on-chip (cached) | Slow (similar to global) | Thread lifetime | Variables too large for registers, thread-private data |

Global memory:

* Description: The main memory space on the GPU, accessible by all threads in all blocks. It is allocated using cudaMalloc and is used for large datasets and data transfer between the host and device.
* Performance: Slower than shared memory and registers, as it resides in DRAM. Access can be optimized via coalescing, but uncoalesced access is much slower.
* Lifetime: Persists across kernel launches until freed.

Shared memory:

* Description: On-chip memory shared by all threads within a block. It is declared using the __shared__ qualifier.
* Performance: Much faster than global and local memory, as it is located on the GPU chip. Used for efficient communication and data reuse within a block.
* Lifetime: Exists only for the duration of the block's execution.

Local memory:

* Description: Private memory for each thread, used when a thread's variables do not fit in registers. It is managed automatically by the compiler.
* Performance: Similar to global memory in speed, as it is typically stored in DRAM. Minimizing local memory use is advised for performance.
* Lifetime: Exists only for the lifetime of the thread.

Key optimizations for matrix multiplication in CUDA:

* Tiling with Shared Memory
  * Divide the input matrices into smaller tiles (blocks) that fit into shared memory.
  * Each thread block loads a tile of both input matrices into shared memory, reducing global memory accesses.
  * Threads within a block cooperate to compute a tile of the output matrix, reusing data from shared memory.
* Coalesced Global Memory Access
  * Ensure that threads in a warp access contiguous memory locations to maximize memory bandwidth utilization.
* Avoiding Bank Conflicts
  * Structure shared memory usage so that threads access different banks, preventing serialization and maximizing throughput.
* Block/Grid Configuration
  * Choose block and grid dimensions that maximize occupancy and align well with matrix dimensions.
* Arithmetic Intensity
  * Maximize the number of arithmetic operations per memory load to hide memory latency.
* Handling Non-Multiples of Block Size
  * Add bounds checks (or pad the matrices) so that tiles extending past the matrix edges are handled correctly.

Implementation steps:

* Allocate and Transfer Data
  * Allocate device memory for input and output matrices.
  * Transfer data from host to device.
* Kernel Launch
  * Launch a kernel with block and grid dimensions that match your matrix sizes and tile sizes.
  * Use shared memory to cache tiles of input matrices.
* Tile Loading and Computation
  * Each thread block loads a tile of both input matrices into shared memory.
  * Synchronize threads (__syncthreads()) to ensure all tiles are loaded before computation.
  * Each thread computes a partial result for its output position, accumulating across all relevant tiles.
* Write Back Results
  * Write the final result to global memory.
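
Kernel sketch. It assumes a row-major Matrix struct with fields width, height, and elements, and a launch with BLOCK_SIZE x BLOCK_SIZE threads per block; neither is defined in the original outline:

    __global__ void MatMulKernel(Matrix A, Matrix B, Matrix C) {  // Matrix layout is assumed (see note above)
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        int tx = threadIdx.x, ty = threadIdx.y;
        int row = blockIdx.y * BLOCK_SIZE + ty, col = blockIdx.x * BLOCK_SIZE + tx;
        float acc = 0.0f;
        for (int k0 = 0; k0 < A.width; k0 += BLOCK_SIZE) {
            // Load tiles into shared memory, zero-filling elements outside the matrices
            As[ty][tx] = (row < A.height && k0 + tx < A.width) ? A.elements[row * A.width + k0 + tx] : 0.0f;
            Bs[ty][tx] = (k0 + ty < B.height && col < B.width) ? B.elements[(k0 + ty) * B.width + col] : 0.0f;
            __syncthreads();  // Synchronize: both tiles must be fully loaded before use
            for (int k = 0; k < BLOCK_SIZE; ++k) acc += As[ty][k] * Bs[k][tx];  // Compute and accumulate partial results
            __syncthreads();  // Finish reading before the next iteration overwrites the tiles
        }
        if (row < C.height && col < C.width) C.elements[row * C.width + col] = acc;  // Write to C
    }
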
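
The tiling idea can also be checked on the CPU. The Python sketch below is an illustration of the loop structure only and gives no GPU speed-up. The names `matmul_naive` and `matmul_tiled` and the `tile` parameter are hypothetical, and the local lists `a_tile` and `b_tile` merely stand in for shared memory. It also shows that matrix sizes that are not multiples of the tile size are handled, because Python slices clamp at the matrix edge.

```python
def matmul_naive(A, B):
    """Reference result: plain triple loop over rows, columns, and the shared dimension."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def matmul_tiled(A, B, tile=2):
    """Same product, computed one output tile at a time from small input tiles."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # each (i0, j0) pair plays the role of a thread block
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):  # walk the tiles along the shared dimension
                # "Load" one tile of A and one of B; slices clamp at the edges
                a_tile = [row[k0:k0 + tile] for row in A[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in B[k0:k0 + tile]]
                for ti, a_row in enumerate(a_tile):
                    for tj in range(len(b_tile[0])):
                        # Accumulate the partial dot product contributed by this tile pair
                        C[i0 + ti][j0 + tj] += sum(
                            a_row[tk] * b_tile[tk][tj] for tk in range(len(b_tile))
                        )
    return C
```

Each element of a loaded tile is reused by every output element in the tile. That reuse is what reduces global memory traffic in the GPU version.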
| Technique | Benefit |
|---|---|
| Tiling with Shared Memory | Reduces global memory access |
| Coalesced Access | Maximizes memory bandwidth |
| Bank Conflict Avoidance | Prevents shared memory bottlenecks |
| Proper Block/Grid Sizing | Maximizes GPU occupancy |

| Feature | BatchNorm (Batch Normalization) | LayerNorm (Layer Normalization) |
|---|---|---|
| Normalization Axis | Normalizes each feature (channel) across the mini-batch (i.e., across instances in a batch) | Normalizes each instance (sample) across all features (channels) |
| Batch Size Dependence | Performs best with larger, consistent batch sizes; less effective with small or variable batches | Independent of batch size; works with any batch size, including single samples |
| Inference/Training | Requires separate handling at inference (uses moving averages of batch stats) | Same operation at training and inference; no moving averages needed |
| Typical Use Cases | Widely used in CNNs and feedforward networks with large, stable batches | Common in RNNs, Transformers, and models with variable or small batch sizes |
| Computational Overhead | Higher due to computing batch statistics and normalizing per feature | Lower, as it computes statistics per instance and applies to features |

BatchNorm normalizes each feature (e.g., a channel in a CNN) by computing the mean and variance over all instances in the batch. This makes the activations for each feature have zero mean and unit variance across the batch, helping stabilize training and allowing higher learning rates. However, its effectiveness drops with small batch sizes because the statistics become noisy.

LayerNorm normalizes each instance (sample) by computing the mean and variance across all features (channels) for that instance. This makes the activations for each sample have zero mean and unit variance across its features, making it robust to batch size changes and suitable for variable-length inputs (like in NLP or RNNs).

| Aspect | BatchNorm | LayerNorm |
|---|---|---|
| Normalization Axis | Feature (channel) across batch | Instance across features |
| Batch Size Sensitivity | Yes (needs large, stable batches) | No (works with any batch size) |
| Inference Handling | Needs moving averages | Same as training |
| Use Cases | CNNs, large batch settings | RNNs, Transformers, small/variable batches |
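
The difference in normalization axis can be made concrete with a small pure-Python sketch. It is a simplification under stated assumptions: inputs are a list of samples, each a list of features; the learnable scale and shift parameters and BatchNorm's moving averages are omitted; and the names `batch_norm`, `layer_norm`, and the `eps` value are illustrative rather than taken from any framework.

```python
import math

def _normalize(values, eps=1e-5):
    """Shift to zero mean and scale to (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def batch_norm(x):
    """Normalize each feature (column) across all samples in the batch."""
    columns = [_normalize(list(col)) for col in zip(*x)]
    return [list(row) for row in zip(*columns)]

def layer_norm(x):
    """Normalize each sample (row) across its own features."""
    return [_normalize(row) for row in x]
```

With a batch of one sample, `batch_norm` collapses every feature to zero because each column has a single value, while `layer_norm` still produces a meaningful result. This is the batch-size dependence described above.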

ReLU (Rectified Linear Unit), while widely used and effective in many deep learning tasks, has several limitations:

* Dying ReLU Problem:
  * If a neuron's input becomes negative, ReLU outputs zero. If this happens consistently during training, the weights may never update again (i.e., the neuron "dies").
  * This leads to parts of the network not learning at all.
* Not Zero-Centered:
  * ReLU outputs range from 0 to ∞, which means activations are always positive or zero.
  * This can cause issues with gradient updates, leading to inefficient training because the gradients can consistently have the same sign.
* Unbounded Output:
  * Since ReLU doesn't cap its output, large inputs can result in very large activations, which can lead to instability or exploding activations in some networks.
* Gradient Saturation for Negative Inputs:
  * The gradient is zero for inputs less than 0, which can slow down or completely halt learning for those neurons.
* Poor Performance with Noisy Data:
  * ReLU is less robust to noise in input data, especially when the data fluctuates around zero, leading to oscillations in learning or unstable gradients.

To address these limitations, variations like Leaky ReLU, Parametric ReLU (PReLU), ELU (Exponential Linear Unit), and GELU are often used.
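
The zero-gradient issue and the Leaky ReLU remedy can be seen in a few lines of Python. This is a scalar sketch for illustration. The slope `alpha=0.01` is a commonly used default but is an assumption here, and the gradient at exactly zero is taken to be the negative-side value by convention.

```python
def relu(x):
    """ReLU: passes positive inputs through, outputs zero otherwise."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU: zero for non-positive inputs, so no learning signal there."""
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: keeps a small slope (alpha) for negative inputs."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Gradient of Leaky ReLU: never exactly zero, so the neuron can recover."""
    return 1.0 if x > 0 else alpha
```

For a negative input such as `-2.0`, `relu_grad` returns zero while `leaky_relu_grad` returns `alpha`, so a weight update is still possible.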