Web Crawler

Medium
92.7%
Updated 6/1/2025

What is this problem about?

The "Web Crawler" problem asks you to simulate a basic web crawler. Given a starting URL and an HtmlParser interface, you must find every URL reachable from the start URL that shares its hostname, exploring all links (recursively or iteratively) without visiting the same URL twice.

Why is this asked in interviews?

This is a fundamental Graph Traversal problem. It tests whether you can recognize that the web is essentially a directed graph where URLs are nodes and links are edges. Companies like Anthropic and Meta ask this to see if you can handle "Interactive" problems (where you call an API rather than receive the full input up front) and whether you can maintain a visited set to avoid infinite loops.

Algorithmic pattern used

The core pattern is Breadth-First Search (BFS) or Depth-First Search (DFS). BFS is generally preferred for crawlers as it explores links layer by layer. You use a Queue for BFS and a Set to keep track of visited URLs to prevent redundant work and cycles.
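The BFS approach can be sketched as below. This is a minimal sketch, not the official solution: the problem supplies HtmlParser as an interface, so the `get_urls` method name and the dict-backed stand-in used here are assumptions for illustration.

```python
from collections import deque

def hostname(url: str) -> str:
    # "http://example.com/page1".split('/') -> ['http:', '', 'example.com', 'page1']
    return url.split('/')[2]

def crawl(start_url: str, html_parser) -> list[str]:
    """BFS over same-host links. html_parser.get_urls(url) is assumed to
    return the links found on that page (a stand-in for the provided
    HtmlParser interface)."""
    host = hostname(start_url)
    visited = {start_url}          # doubles as the result set
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        for nxt in html_parser.get_urls(url):
            # Keep only same-host links we have not seen yet.
            if hostname(nxt) == host and nxt not in visited:
                visited.add(nxt)   # mark before enqueueing to avoid duplicates
                queue.append(nxt)
    return list(visited)
```

Marking a URL as visited at enqueue time (rather than at dequeue time) ensures the same URL is never added to the queue twice.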

Example explanation

Start URL: http://example.com/page1. Hostname: example.com.

  1. Parse page1. It contains:
    • http://example.com/page2 (Same host -> Add to queue)
    • http://google.com (Different host -> Ignore)
  2. Visit page2. It contains:
    • http://example.com/page1 (Already visited -> Ignore)
    • http://example.com/page3 (Same host -> Add to queue)

Result: [http://example.com/page1, http://example.com/page2, http://example.com/page3].

Common mistakes candidates make

  • Hostname Extraction: Failing to parse the hostname correctly (e.g., including part of the path because you didn't stop at the first / after http://).
  • Infinite Loops: Not using a "visited" set, causing the crawler to bounce back and forth between two pages that link to each other.
  • Memory Management: Forgetting that a real crawler could find millions of links, though in a coding test the visited set is usually small enough to hold in memory.

Interview preparation tip

Get comfortable with string manipulation to extract hostnames quickly. In Python, url.split('/')[2] is a quick trick, but it is better to be able to explain the underlying logic: skip past the protocol, then read up to the first path separator.
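Both approaches can be compared in a couple of lines. The stdlib urlparse call is shown alongside the split trick as a more robust alternative (it also handles ports, queries, and fragments):

```python
from urllib.parse import urlparse

url = "http://example.com/page1"

# Quick trick: split on '/' gives ['http:', '', 'example.com', 'page1'],
# so index 2 is the host.
host_split = url.split('/')[2]

# Stdlib parser: netloc is the host (including any port).
host_parsed = urlparse(url).netloc
```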

Similar Questions