Web Crawler Multithreaded

Medium

54.1%

Updated 6/1/2025

Asked by 10 Companies

Apple Databricks Anthropic Meta Snowflake Dropbox OpenAI Rubrik MongoDB Google

Topics

Breadth-First Search Depth-First Search Concurrency

Web Crawler Multithreaded

What is this problem about?

This is an advanced version of the Web Crawler problem that introduces Concurrency. Instead of a single-threaded crawler, you must implement a solution that uses multiple threads to fetch and parse URLs in parallel. The goal is to speed up the process while maintaining the same rules (same hostname, no duplicate visits).

Why is this asked in interviews?

This is a classic Systems Design and Concurrency question asked by high-performance engineering teams at Apple, Databricks, and Google. It tests your knowledge of thread-safe data structures, synchronization (like Mutexes or Semaphores), and the "Producer-Consumer" pattern. It evaluates how you handle shared state (the "visited" set) in a multi-threaded environment.

Algorithmic pattern used

The pattern is Concurrent BFS. You use a thread-safe Queue (or a standard queue with a lock) and a thread-safe Set. Multiple worker threads pick a URL from the queue, crawl it, and add new URLs back to the queue. You must also handle the termination condition: how do the threads know when there are no more URLs left to crawl?

Example explanation

Imagine a thread pool of 4 workers.

Thread 1 picks URL_A, parses it, finds URL_B and URL_C.
Thread 1 adds URL_B and URL_C to the shared queue.
Thread 2 and Thread 3 immediately pick URL_B and URL_C respectively to crawl them in parallel.
All threads check the shared visited set before processing to ensure no two threads work on the same URL.

Common mistakes candidates make

Race Conditions: Updating the "visited" set or the queue without proper locking, leading to duplicate crawls.
Deadlocks: Threads waiting for each other or for a queue that will never be filled.
Busy Waiting: Threads looping constantly checking if the queue is empty instead of using condition variables to sleep until work is available.

Interview preparation tip

Learn about ThreadPoolExecutor in Python, java.util.concurrent in Java, or Worker Threads in JavaScript. Understanding how to "join" threads or wait for a group of tasks to complete is crucial for this interview pattern.

Title	Difficulty	Topics	LeetCode
Nested List Weight Sum	Medium	Breadth-First Search Depth-First Search	Solve
Keys and Rooms	Medium	Breadth-First Search Graph Depth-First Search	Solve
Nested List Weight Sum II	Medium	Breadth-First Search Depth-First Search Stack	Solve
Maximize the Number of Target Nodes After Connecting Trees I	Medium	Breadth-First Search Depth-First Search Tree	Solve
Pyramid Transition Matrix	Medium	Breadth-First Search Depth-First Search Bit Manipulation	Solve

Web Crawler Multithreaded

Asked by 10 Companies

Topics

Web Crawler Multithreaded

What is this problem about?

Why is this asked in interviews?

Algorithmic pattern used

Example explanation

Common mistakes candidates make

Interview preparation tip

Similar Questions