This is an advanced version of the Web Crawler problem that introduces Concurrency. Instead of a single-threaded crawler, you must implement a solution that uses multiple threads to fetch and parse URLs in parallel. The goal is to speed up the process while maintaining the same rules (same hostname, no duplicate visits).
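The "same hostname" rule can be checked with the standard library. A minimal sketch, assuming hostnames compare equal when the part between the scheme and the first `/` matches (the helper name `same_host` is illustrative, not part of the problem statement):

```python
from urllib.parse import urlparse

def same_host(url_a, url_b):
    # urlparse extracts the hostname (lowercased, port stripped).
    return urlparse(url_a).hostname == urlparse(url_b).hostname

same_host("http://example.org/a", "http://example.org/b")  # True
same_host("http://example.org/a", "http://other.org/b")    # False
```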
This is a classic Systems Design and Concurrency question asked by high-performance engineering teams at Apple, Databricks, and Google. It tests your knowledge of thread-safe data structures, synchronization (like Mutexes or Semaphores), and the "Producer-Consumer" pattern. It evaluates how you handle shared state (the "visited" set) in a multi-threaded environment.
The pattern is Concurrent BFS. You use a thread-safe Queue (or a standard queue with a lock) and a thread-safe Set. Multiple worker threads pick a URL from the queue, crawl it, and add new URLs back to the queue. You must also handle the termination condition: how do the threads know when there are no more URLs left to crawl?
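One way to sketch this in Python uses a lock-protected `visited` set, a `queue.Queue` for work distribution, and `Queue.join()` plus sentinel values for termination. The `fetch` callable is a hypothetical helper (it stands in for the HTML-parser the real problem provides) that returns the URLs found on a page:

```python
import threading
from queue import Queue
from urllib.parse import urlparse

def crawl(start_url, fetch, num_workers=4):
    hostname = urlparse(start_url).hostname
    work = Queue()           # thread-safe queue of URLs to crawl
    visited = {start_url}    # shared state, guarded by `lock`
    lock = threading.Lock()
    work.put(start_url)

    def worker():
        while True:
            url = work.get()
            if url is None:              # sentinel: shut this worker down
                work.task_done()
                return
            for nxt in fetch(url):
                # Hostname filter and duplicate check done atomically,
                # so no two threads ever enqueue the same URL.
                with lock:
                    if urlparse(nxt).hostname == hostname and nxt not in visited:
                        visited.add(nxt)
                        work.put(nxt)
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    # Termination: join() unblocks only once every enqueued URL has been
    # fully processed (children are enqueued before task_done is called,
    # so the unfinished-task count cannot hit zero early).
    work.join()
    for _ in threads:
        work.put(None)                   # one sentinel per worker
    for t in threads:
        t.join()
    return visited
```

The sentinel trick answers the termination question from above: workers block on `get()` indefinitely, so the main thread wakes each one with a `None` once `join()` confirms the frontier is empty.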
Imagine a thread pool of 4 workers.
- Worker 1 picks up URL_A, parses it, and finds URL_B and URL_C.
- Worker 1 adds URL_B and URL_C to the shared queue.
- Workers 2 and 3 pick up URL_B and URL_C respectively to crawl them in parallel.
- Each worker checks the visited set before processing to ensure no two threads work on the same URL.

Learn about ThreadPoolExecutor in Python, java.util.concurrent in Java, or Worker Threads in JavaScript. Understanding how to "join" threads or wait for a group of tasks to complete is crucial for this interview pattern.
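The "wait for a group of tasks" idea can also be sketched with `concurrent.futures.ThreadPoolExecutor`, submitting new fetches as results come back. As before, `fetch` is a hypothetical page-parsing helper, not part of the stated problem. Because only the main thread touches `visited` here, no lock is needed:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

def crawl(start_url, fetch, max_workers=4):
    host = urlparse(start_url).hostname
    visited = {start_url}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = [pool.submit(fetch, start_url)]
        while pending:
            batch, pending = pending, []
            for fut in batch:
                for nxt in fut.result():   # blocks until that fetch finishes
                    if urlparse(nxt).hostname == host and nxt not in visited:
                        visited.add(nxt)
                        # Newly discovered URLs become the next wave of tasks.
                        pending.append(pool.submit(fetch, nxt))
    return visited
```

Termination falls out naturally: the loop ends when a wave of fetches discovers nothing new, so no sentinels or manual joins are needed. The trade-off is that result collection is centralized in one thread, which can become a bottleneck compared with fully independent workers.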
| Title | Difficulty | Topics | LeetCode |
|---|---|---|---|
| Nested List Weight Sum | Medium | Solve | |
| Keys and Rooms | Medium | Solve | |
| Nested List Weight Sum II | Medium | Solve | |
| Maximize the Number of Target Nodes After Connecting Trees I | Medium | Solve | |
| Pyramid Transition Matrix | Medium | Solve | |