Why do we need random number generation in statistics and data science?
Data scientists and statisticians rely on random number generation for several important purposes. Here’s what we’ll explore:
- They can be used to create data samples, which serve as a foundation for advanced statistical techniques. This includes bootstrapping methods, which resample existing data to create new samples, and Monte Carlo simulations, which generate synthetic data points from probability distributions. These techniques are particularly valuable when researchers need to expand their sample sizes, validate statistical models, estimate uncertainty in their analyses, or run complex simulations of system behavior under various conditions. They are especially useful when real data is scarce for testing algorithms: in medicine, for example, researchers can generate simulated patient data to test predictive models (a minimal bootstrap sketch follows this list).
- When designing neural networks, the initial weights are generally set randomly to avoid symmetries and achieve good learning outcomes. This randomization is crucial because it prevents all neurons from learning the same features during training: if every neuron in a layer started with identical weights, each would receive identical gradients and they would never differentiate. By drawing random initial weights within a carefully chosen range, neural networks start from different positions in weight space, which promotes diverse feature detection, lets each neuron specialize in different patterns in the input data, and helps avoid neurons becoming “stuck” in suboptimal configurations.
- In tree-based models such as random forests, random numbers play a crucial role in feature selection. During tree construction, a random subset of features is considered at each split point, which introduces diversity and prevents the model from becoming overly dependent on any single feature, helping to mitigate overfitting. This randomness also enables many different decision trees to be built from the same dataset, which can then be combined into a more accurate ensemble model.
- In machine learning model training, random numbers play a vital role in dataset partitioning. Practitioners typically divide their original dataset into separate training and testing sets using random sampling when preparing data for model training and evaluation. This randomization ensures an unbiased distribution of data points across the sets, which is crucial for accurately assessing model performance: it prevents the systematic bias that sequential or ordered selection could introduce and keeps both sets representative of the overall dataset distribution. This random division is often extended to include a validation set as well, creating a three-way split that enables more robust model evaluation and hyperparameter tuning (see the splitting sketch after the bootstrap example below).
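To make the resampling idea concrete, here is a minimal bootstrap sketch using NumPy (covered in more detail below). The data values are made up for illustration; the sketch estimates the uncertainty of the sample mean by resampling with replacement:

import numpy as np

rng = np.random.default_rng(0)
data = np.array([4.2, 5.1, 6.3, 5.8, 4.9, 5.5])  # hypothetical measurements

# Draw 1,000 bootstrap samples (same size as the data, with replacement)
# and record the mean of each one
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(1000)
])

print(boot_means.mean())  # bootstrap estimate of the mean
print(boot_means.std())   # bootstrap estimate of the mean's standard error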
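Similarly, a random train/test split can be done with nothing but NumPy. This is an 80/20 split on a made-up dataset; in practice many projects use a helper such as scikit-learn's train_test_split instead:

import numpy as np

rng = np.random.default_rng(42)
X = np.arange(100).reshape(50, 2)  # 50 hypothetical samples with 2 features each

# Shuffle the row indices, then cut them 80/20
indices = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = indices[:split], indices[split:]

X_train, X_test = X[train_idx], X[test_idx]
print(X_train.shape, X_test.shape)  # (40, 2) (10, 2)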
Generating Random Numbers in Python
Python provides several ways to generate random numbers through different libraries: random (part of the standard library), NumPy, PyTorch, secrets, and os.
random
import random
print(random.random()) # Random number between 0 and 1
print(random.randint(1, 100)) # Integer between 1 and 100 (inclusive)
print(random.randrange(0, 100, 5)) # Multiple of 5 between 0 and 95 (100 excluded)
The random module also allows you to randomly select elements from a list or shuffle a list’s contents:
items = ["apple", "banana", "cherry"]
print(random.choice(items)) # Select a random element
print(random.choices(items, k=2)) # Select 2 elements with replacement
print(random.sample(items, 2)) # Select 2 elements without replacement
numbers = [1, 2, 3, 4, 5]
random.shuffle(numbers) # Shuffle the list elements
print(numbers)
numpy.random
NumPy’s random module provides multiple random number generation capabilities: generating floats between 0 and 1 (random.rand), integers (random.randint), and manipulating arrays through random selection (random.choice) or shuffling (random.shuffle). It also supports drawing samples from common statistical distributions, including the normal (random.normal) and uniform (random.uniform) distributions.
import numpy as np
print(np.random.rand()) # Float between 0 and 1
print(np.random.rand(3)) # Array with 3 floats
print(np.random.rand(2, 3)) # 2x3 matrix of floats
print(np.random.randint(1, 100)) # An integer between 1 and 99 (upper bound excluded)
print(np.random.randint(1, 100, 5)) # Array with 5 integers
arr = np.array([10, 20, 30, 40])
print(np.random.choice(arr)) # Random element
np.random.shuffle(arr) # Shuffle the array
print(arr)
print(np.random.normal(0, 1, 5)) # 5 numbers from normal distribution (mean=0, std.dev=1)
print(np.random.uniform(0, 10, 5)) # 5 numbers from uniform distribution over [0, 10)
torch
The PyTorch library provides similar functions and supports random number generation on both the CPU and the GPU:
import torch
print(torch.rand(1)) # Float between 0 and 1
print(torch.rand(3, 3)) # 3x3 Matrix
print(torch.randint(0, 100, (5,))) # Tensor with 5 integers
print(torch.randn(5)) # Standard normal distribution
print(torch.normal(mean=0, std=1, size=(3,))) # Normal distribution with mean=0, std=1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(torch.rand(3, device=device)) # Random tensor on the GPU (or CPU if CUDA is unavailable)
secrets
The secrets library generates cryptographically secure random numbers, unlike the pseudo-random numbers provided by the previous libraries. This makes it the ideal choice for generating passwords, security tokens, and cryptographic keys.
import secrets
print(secrets.randbelow(100)) # Number between 0 and 99
print(secrets.token_bytes(16)) # 16 random bytes
print(secrets.token_hex(16)) # 16 random bytes as a hexadecimal string
print(secrets.token_urlsafe(16)) # Secure URL token
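As an example, a simple password generator can combine secrets.choice with the string module (the 12-character length and the alphabet here are arbitrary choices for illustration):

import secrets
import string

alphabet = string.ascii_letters + string.digits  # a-z, A-Z, 0-9
password = "".join(secrets.choice(alphabet) for _ in range(12))
print(password)  # a random 12-character password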
os
The os library also provides random numbers, drawing them directly from the operating system’s kernel entropy sources.
import os
print(os.urandom(8)) # 8 random bytes
Summary
Library | Main Function | Seedable? | Main Purpose |
---|---|---|---|
random | random.random() | Yes | Simulations, games |
numpy | np.random.rand() | Yes | Machine learning, statistics |
torch | torch.rand() | Yes | Deep learning (CPU/GPU) |
secrets | secrets.randbelow() | No | Cryptography, passwords |
os | os.urandom() | No | Secure system random numbers |
Differences Between Pseudo-Random and True Random Numbers
Random number generators can be classified into two distinct categories: true random number generators (TRNGs) and pseudo-random number generators (PRNGs). True random number generators derive their randomness from physical processes or phenomena that are inherently unpredictable, such as atmospheric noise, radioactive decay, or thermal fluctuations. In contrast, pseudo-random number generators use mathematical algorithms to generate sequences of numbers that appear random but are fully determined by their initial conditions, or seed.
Within the Python ecosystem, this distinction is reflected in the implementation of various libraries: the random, numpy, and torch libraries implement pseudo-random number generators for their efficiency and reproducibility in scientific computing and machine learning applications, while the secrets and os libraries utilize system-level sources of entropy to provide true random numbers suitable for cryptographic purposes.
For pseudo-random number generation, NumPy uses the Mersenne Twister (MT19937) algorithm in its legacy interface and the newer Permuted Congruential Generator (PCG64) in its Generator API, while PyTorch uses MT19937 on the CPU and the Philox algorithm on CUDA devices.
The MT19937 algorithm, for example, keeps an internal state of 624 32-bit integers. Each time a random number is requested, it takes the next value from this table and “tempers” it with shift and XOR operations; when all 624 values have been used, it performs a “twist” that regenerates the table.
The PCG64 algorithm advances a 128-bit linear congruential state and then permutes that state with XOR, shift, and rotation operations to produce the output.
The Philox algorithm is counter-based: it repeatedly multiplies an incrementing counter by large constants and mixes the resulting halves together with XOR operations over several rounds.
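In NumPy you can choose among these algorithms explicitly through the Generator API: np.random.default_rng() returns a generator backed by PCG64, while MT19937 and Philox can be plugged in by hand:

import numpy as np

rng_pcg = np.random.default_rng(42)  # Generator backed by PCG64
rng_mt = np.random.Generator(np.random.MT19937(42))  # Mersenne Twister
rng_philox = np.random.Generator(np.random.Philox(42))  # Philox

for rng in (rng_pcg, rng_mt, rng_philox):
    print(rng.random(3))  # three floats in [0, 1) from each algorithm

Unlike the legacy np.random functions, each Generator keeps its own seeded state, so different parts of a program can be made reproducible independently.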
A random number generator’s period is the number of values it outputs before the sequence starts repeating. For instance, the sequence 3, 2, 8, 5, 6, 3, 2, 8, 5, 6, 3, 2, 8 has a period of 5, since the same five numbers repeat over and over.
The periods of these random number generators are compared in the table below. While MT19937 has an extraordinarily long period, PCG64 and Philox offer faster performance despite their shorter periods.
Algorithm | Period |
---|---|
MT19937 | 2^19937 - 1 |
PCG64 | 2^128 |
Philox | 2^256 |
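To see a period in action, here is a toy linear congruential generator; its parameters are chosen purely for illustration and bear no resemblance to a production algorithm. With modulus 16 it cycles through 16 values and then repeats:

def toy_lcg(seed, a=5, c=3, m=16):
    # x_(n+1) = (a * x_n + c) mod m
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

gen = toy_lcg(seed=1)
print([next(gen) for _ in range(20)])  # the first 16 values repeat from the 17th onward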
True random number generators don’t rely on algorithms—instead, they harness system entropy. In computing, entropy refers to the degree of unpredictability and disorder within a system.
Sources of entropy include:
- Mouse movements: timing, position, and motion patterns provide unpredictable yet measurable data
- Keyboard input: the timing and patterns of keystrokes serve as unpredictable events
- Voltage fluctuations in electronic circuits
- Network activity: the timing of incoming data packets on internet and network connections
- Storage performance: variations in disk read latency and speed
The computer collects entropy data from various sources and continuously feeds it into the system kernel. Specifically, Linux exposes it through /dev/random and /dev/urandom, Windows through CryptGenRandom(), and macOS and iOS through SecRandomCopyBytes().
The secrets and os libraries draw from these system sources to generate truly random numbers.
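On Linux, you can read these kernel interfaces directly, although os.urandom (shown earlier) is the portable way to do the same thing:

import os

print(os.urandom(8))  # portable: 8 random bytes from the OS

# Linux-only equivalent: read the kernel's entropy device directly
with open("/dev/urandom", "rb") as f:
    print(f.read(8))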
Setting Seeds to Control Random Number Generation
Libraries that use pseudo-random number generation algorithms (random, NumPy, and PyTorch) let you “seed” the random number generator to produce consistent results across different runs.
There are several reasons why developers and data scientists may need to “fix” or control random number generation in their applications. During debugging, consistent and predictable values make it much easier to track down and identify errors in the code. In scientific experiments or research that involves generating data samples, a fixed seed ensures that the experiments are reproducible by other researchers. Additionally, when evaluating and comparing the performance of different algorithms or machine learning models, consistent random numbers across all tests improve the validity of the comparisons by eliminating random variation as a confounding factor, allowing more accurate and meaningful assessments of performance.
This reproducibility is achieved through each library’s seeding function.
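In the standard library’s random module, the function is random.seed(x):

import random

random.seed(42)
print(random.random()) # Generates a fixed value

random.seed(42)
print(random.random()) # Reproduces the exact same value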
In NumPy, you can set the seed using np.random.seed(x), where x is any integer of your choice.
import numpy as np
np.random.seed(42)
print(np.random.rand(3)) # Generates a fixed sequence of numbers
np.random.seed(42)
print(np.random.rand(3)) # Reproduces the exact same sequence
In PyTorch, the seeding function is torch.manual_seed(x):
import torch
torch.manual_seed(42)
print(torch.rand(3)) # Always generates the same numbers
torch.manual_seed(42)
print(torch.rand(3)) # Reproduces the same sequence
Once a seed is set, all subsequent random numbers generated by the program follow the same deterministic sequence.
You can reset this sequence by changing the seed value to a different number:
np.random.seed(42)
print(np.random.randint(0, 100)) # Generate first number in sequence
np.random.seed(99) # Set new seed
print(np.random.randint(0, 100)) # Generate number from new sequence
Using 42 as a seed value is a common convention in the developer community. While any number can serve as a seed value, 42 has become particularly widespread.
This popularity originates from practical reasons: it’s easy to remember, and its widespread use makes it simpler to compare results between different developers.
Additionally, the number has cultural significance: it famously appears in Douglas Adams’ “The Hitchhiker’s Guide to the Galaxy” as the Answer to the Ultimate Question of Life, the Universe, and Everything. Using 42 has thus become a playful reference that developers often share.
Summary and Conclusions
In the fields of statistics, machine learning, and scientific research, random numbers play a crucial role in various applications. Python offers a comprehensive ecosystem for random number generation through two main approaches:
- Pseudo-random number generators (PRNG):
  - Implemented in the random, numpy, and torch libraries
  - Ideal for scientific computing and reproducible research
  - Can be controlled through seed functions for consistency
- True random number generators (TRNG):
  - Available through the secrets and os libraries
  - Leverage system entropy for cryptographic security
  - Essential for security-critical applications
The choice between PRNG and TRNG depends on your specific use case – use PRNGs when reproducibility is important, and TRNGs when true randomness is required for security purposes.