Web Crawler

Medium
92.7%
Updated 6/1/2025

What is this problem about?

The "Web Crawler" problem asks you to simulate a basic web crawler. Given a starting URL and an HtmlParser interface, you must find every URL reachable from the start URL that shares its hostname, exploring all links (recursively or iteratively) without visiting the same URL twice.

Why is this asked in interviews?

This is a fundamental Graph Traversal problem. It tests whether you can recognize that the web is essentially a directed graph where URLs are nodes and links are edges. Companies like Anthropic and Meta ask this to see if you can handle "Interactive" problems (where you call an API rather than receive the full input up front) and whether you can maintain a visited set to avoid infinite loops.

Algorithmic pattern used

The core pattern is Breadth-First Search (BFS) or Depth-First Search (DFS). BFS is generally preferred for crawlers as it explores links layer by layer. You use a Queue for BFS and a Set to keep track of visited URLs to prevent redundant work and cycles.
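The BFS approach can be sketched as below. This is a minimal sketch, not the official solution: the problem supplies HtmlParser as an interface, so the `get_urls` method name and the dict-backed stand-in used here are assumptions for illustration.

```python
from collections import deque

def hostname(url: str) -> str:
    # "http://example.com/page1".split('/') -> ['http:', '', 'example.com', 'page1']
    return url.split('/')[2]

def crawl(start_url: str, html_parser) -> list[str]:
    """BFS over same-host links. html_parser.get_urls(url) is assumed to
    return the links found on that page (a stand-in for the provided
    HtmlParser interface)."""
    host = hostname(start_url)
    visited = {start_url}          # doubles as the result set
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        for nxt in html_parser.get_urls(url):
            # Keep only same-host links we have not seen yet.
            if hostname(nxt) == host and nxt not in visited:
                visited.add(nxt)   # mark before enqueueing to avoid duplicates
                queue.append(nxt)
    return list(visited)
```

Marking a URL as visited at enqueue time (rather than at dequeue time) ensures the same URL is never added to the queue twice.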

Example explanation

Start URL: http://example.com/page1. Hostname: example.com.

  1. Parse page1. It contains:
    • http://example.com/page2 (Same host -> Add to queue)
    • http://google.com (Different host -> Ignore)
  2. Visit page2. It contains:
    • http://example.com/page1 (Already visited -> Ignore)
    • http://example.com/page3 (Same host -> Add to queue)

Result: [http://example.com/page1, http://example.com/page2, http://example.com/page3].

Common mistakes candidates make

  • Hostname Extraction: Failing to parse the hostname correctly (e.g., including part of the path because you didn't stop at the first / after http://).
  • Infinite Loops: Not using a "visited" set, causing the crawler to bounce back and forth between two pages that link to each other.
  • Memory Management: Forgetting that a real crawler could find millions of links, though in a coding test the visited set is usually small enough to hold in memory.

Interview preparation tip

Get comfortable with string manipulation to extract hostnames quickly. In Python, url.split('/')[2] is a quick trick, but it is better to be able to explain the underlying logic: skip past the protocol, then read up to the first path separator.
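Both approaches can be compared in a couple of lines. The stdlib urlparse call is shown alongside the split trick as a more robust alternative (it also handles ports, queries, and fragments):

```python
from urllib.parse import urlparse

url = "http://example.com/page1"

# Quick trick: split on '/' gives ['http:', '', 'example.com', 'page1'],
# so index 2 is the host.
host_split = url.split('/')[2]

# Stdlib parser: netloc is the host (including any port).
host_parsed = urlparse(url).netloc
```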

Similar Questions