Magicsheet logo

Web Crawler Multithreaded

Medium
54.1%
Updated 6/1/2025

Web Crawler Multithreaded

What is this problem about?

This is an advanced version of the Web Crawler problem that introduces Concurrency. Instead of a single-threaded crawler, you must implement a solution that uses multiple threads to fetch and parse URLs in parallel. The goal is to speed up the process while maintaining the same rules (same hostname, no duplicate visits).

Why is this asked in interviews?

This is a classic Systems Design and Concurrency question asked by high-performance engineering teams at Apple, Databricks, and Google. It tests your knowledge of thread-safe data structures, synchronization (like Mutexes or Semaphores), and the "Producer-Consumer" pattern. It evaluates how you handle shared state (the "visited" set) in a multi-threaded environment.

Algorithmic pattern used

The pattern is Concurrent BFS. You use a thread-safe Queue (or a standard queue with a lock) and a thread-safe Set. Multiple worker threads pick a URL from the queue, crawl it, and add new URLs back to the queue. You must also handle the termination condition: how do the threads know when there are no more URLs left to crawl?

Example explanation

Imagine a thread pool of 4 workers.

  1. Thread 1 picks URL_A, parses it, finds URL_B and URL_C.
  2. Thread 1 adds URL_B and URL_C to the shared queue.
  3. Thread 2 and Thread 3 immediately pick URL_B and URL_C respectively to crawl them in parallel.
  4. All threads check the shared visited set before processing to ensure no two threads work on the same URL.

Common mistakes candidates make

  • Race Conditions: Updating the "visited" set or the queue without proper locking, leading to duplicate crawls.
  • Deadlocks: Threads waiting for each other or for a queue that will never be filled.
  • Busy Waiting: Threads looping constantly checking if the queue is empty instead of using condition variables to sleep until work is available.

Interview preparation tip

Learn about ThreadPoolExecutor in Python, java.util.concurrent in Java, or Worker Threads in JavaScript. Understanding how to "join" threads or wait for a group of tasks to complete is crucial for this interview pattern.

Similar Questions