
A scalable email system requires distributed storage, load balancing, and efficient search. Components:
Storage: Use distributed databases (e.g., Apache Cassandra) to shard user data across clusters, ensuring high availability.
Load Balancers: Distribute incoming traffic (e.g., SMTP, HTTP) across servers to handle peak loads during events like Yahoo News alerts.
Search Indexing: Implement inverted indexes (e.g., Elasticsearch) for quick email retrieval.
Yahoo Mail leverages Hadoop for big data analytics to detect spam patterns and optimize storage. Caching (Redis/Memcached) reduces latency for frequently accessed emails, while Kafka handles real-time notifications for new messages.
//python
def is_palindrome(s):
s = ''.join(filter(str.isalnum, s)).lower()
return s == s[::-1]
Explanation: This function removes non-alphanumeric characters and checks equality with the reversed string. At Yahoo, such algorithms validate data integrity (e.g., ensuring user-generated content like Yahoo News comments adheres to formatting rules). Optimized string manipulation is critical for processing large datasets in Hadoop pipelines.
Indexing: Add indexes on frequently queried columns (e.g., user_id in Yahoo Finance’s stock tracking).
Query Refactoring: Avoid SELECT *; use LIMIT for pagination.
Caching: Cache results of repetitive queries (e.g., trending news on Yahoo Homepage).
Sharding: Distribute data across databases by region or user hash.
Yahoo uses MySQL with Vitess for sharding and scaling, ensuring low latency for services like Yahoo Sports during live events.
TCP: Connection-oriented, reliable (e.g., Yahoo Mail’s email delivery). Ensures packets arrive intact.
UDP: Connectionless, low-latency (e.g., Yahoo Livestream for real-time video). Drops packets to prioritize speed.
Yahoo’s ad bidding platform uses UDP for real-time auctions, while TCP underpins secure user authentication via OAuth.
Data Collection: Track user clicks, reading time, and shares.
Collaborative Filtering: Recommend articles liked by similar users.
Content-Based Filtering: Use NLP to analyze article keywords and match user interests.
Hybrid Model: Combine both methods using Apache Spark.
Yahoo uses TensorFlow for deep learning models that predict user preferences, while Hadoop processes historical data to refine recommendations.
//Java
// Java example using synchronized
public class Counter {
private int count = 0;
public synchronized void increment() {
count++;
}
}
Explanation: The synchronized keyword ensures only one thread accesses increment() at a time. Yahoo applies similar thread safety in ad revenue calculation systems to prevent data corruption. Distributed locks (e.g., Redis) are used in microservices for global consistency.
GET https://api.yahoo.com/news/v1/articles?category=sports
Stateless: Each request includes authentication tokens.
Cache-Control: Headers like max-age=3600 cache sports articles.
HATEOAS: Links to related endpoints (e.g., next_page).
Yahoo’s Weather API uses REST to return JSON data with city-specific forecasts, scaled via API gateways like Kong.
MapReduce processes large datasets by mapping data to key-value pairs and reducing them to aggregates. Yahoo used MapReduce for:
Search Indexing: Mapping web pages to keywords, reducing to ranked results.
Ad Analytics: Counting clicks per campaign.
Yahoo’s Hadoop clusters ran MapReduce jobs to analyze log files from Yahoo Mail, optimizing storage and spam detection.
OAuth 2.0 enables third-party apps to access user data without exposing passwords. Flow:
User redirects to Yahoo’s authorization server.
App receives an access token after user consent.
Token grants limited access (e.g., read Yahoo Contacts).
Yahoo’s OAuth 2.0 integrates with Yahoo Mail APIs, using scopes like email.read and token expiration for security.
users_east, users_west) for Yahoo Mail. Benefits include reduced latency and parallel query execution. Challenges include cross-shard transactions, resolved via distributed SQL engines like Apache Calcite. Consistent hashing maps data to nodes in a way that minimizes rebalancing when nodes join/leave. Yahoo uses it in:
CDNs: Cache content across edge servers.
Distributed Databases: Like Yahoo’s Sherpa (key-value store).
It ensures minimal data movement during scaling, critical for Yahoo’s real-time analytics platforms.
Tools: Use jvisualvm or Eclipse MAT to analyze heap dumps.
Code Practices: Avoid static collections, close resources in finally blocks.
Yahoo’s monitoring systems (e.g., Prometheus) track JVM metrics for services like Yahoo Finance, alerting engineers to spikes in memory usage.
SQL: Structured, ACID transactions (e.g., MySQL for Yahoo Mail metadata).
NoSQL: Flexible schema, horizontal scaling (e.g., Cassandra for Yahoo Sports analytics).
Yahoo employs both: MySQL for relational data and Hadoop/HBase for big data workloads like ad targeting.
Atomicity: Transactions succeed or fail entirely (e.g., Yahoo Wallet payments).
Consistency: Data meets predefined rules.
Isolation: Concurrent transactions don’t interfere.
Durability: Committed data survives crashes.
Yahoo uses MySQL with InnoDB (ACID-compliant) for billing systems, ensuring financial data integrity.
Hashing: Convert long URLs to short strings (e.g., Base62 encoding).
Storage: Use Redis for fast lookups.
Scalability: Shard databases by hash prefix.
Cache: CDN caching for high-traffic links.
Yahoo’s service includes analytics to track click rates, stored in Hadoop for reporting.
Apache ZooKeeper coordinates distributed systems via:
Leader Election: Critical for Hadoop NameNode failover.
Configuration Management: Sync settings across Yahoo’s microservices.
Distributed Locks: Prevent race conditions in ad bidding platforms.
Yahoo contributed to ZooKeeper’s development, using it in Hadoop and Kafka clusters.
-XX:+UseG1GC) for low-latency services like Yahoo Mail. Excessive GC pauses are mitigated via object pooling and off-heap memory, ensuring real-time ad auctions run smoothly. HTTPS encrypts data via TLS/SSL. Steps:
Yahoo’s server sends a certificate signed by a CA (e.g., DigiCert).
Client verifies the certificate and negotiates a symmetric key.
Data is encrypted via AES.
Yahoo enforces HTTPS for all services (Mail, Finance) using HSTS and TLS 1.3, protecting user sessions from MITM attacks.
A Bloom filter tests whether an element is in a set, with possible false positives. Yahoo uses it for:
Spam Detection: Quickly check if an email hash is in a spam list.
Cache Lookups: Avoid expensive DB queries for non-existent keys.
It reduces storage overhead in systems like Yahoo News’ duplicate content checker.
Eventual consistency guarantees that, given no new updates, all replicas will converge to the same state. Yahoo uses it in:
Distributed Databases: Apache Cassandra for Yahoo Mail’s global user base.
CDNs: Propagate content updates across edge servers.
Trade-offs include temporary mismatches, resolved via anti-entropy protocols.
Monitor Metrics: Check latency in Grafana dashboards.
Profile Code: Use APM tools like New Relic to identify slow functions.
Database Optimization: Analyze slow queries with EXPLAIN.
Cache: Add Redis caching for stock price data.
Yahoo’s SRE teams use chaos engineering to simulate failures and preemptively optimize endpoints.
Machine learning (ML) trains models to make predictions from data. Yahoo applications:
Personalization: Recommending news articles via collaborative filtering.
Ad Targeting: Predicting click-through rates with logistic regression.
Fraud Detection: Anomaly detection in Yahoo Wallet transactions.
Yahoo’s ML pipelines use Apache Spark and TensorFlow, processing data stored in Hadoop clusters.