The "Web Crawler" problem asks you to simulate a basic web crawler. Given a starting URL and an HtmlParser interface, you must find all URLs under the same hostname as the start URL, exploring all reachable links (recursively or iteratively) without visiting the same URL twice.
This is a fundamental Graph Traversal problem. It tests whether you can identify that the Internet is essentially a directed graph where URLs are nodes and links are edges. Companies like Anthropic and Meta ask this to see if you can handle "Interactive" problems (where you call an API) and whether you can manage visited sets to avoid infinite loops.
The core pattern is Breadth-First Search (BFS) or Depth-First Search (DFS). BFS is generally preferred for crawlers as it explores links layer by layer. You use a Queue for BFS and a Set to keep track of visited URLs to prevent redundant work and cycles.
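A minimal BFS sketch of this pattern, assuming the usual interview signature where `HtmlParser` exposes a single `getUrls(url)` method (the stub class below is only for local testing and is not part of the problem's API):

```python
from collections import deque

class HtmlParser:
    """Test stub standing in for the interview's HtmlParser interface."""
    def __init__(self, graph):
        self.graph = graph

    def getUrls(self, url):
        return self.graph.get(url, [])

def crawl(start_url, html_parser):
    def hostname(url):
        # "http://example.com/page1" -> "example.com"
        return url.split('/')[2]

    host = hostname(start_url)
    visited = {start_url}       # the Set that prevents cycles and repeat fetches
    queue = deque([start_url])  # the Queue that drives layer-by-layer BFS
    while queue:
        url = queue.popleft()
        for nxt in html_parser.getUrls(url):
            if nxt not in visited and hostname(nxt) == host:
                visited.add(nxt)
                queue.append(nxt)
    return list(visited)
```

Swapping the deque for a stack (or recursion) turns this into DFS with no other changes; the visited set does the real work either way.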
Start URL: http://example.com/page1. Hostname: example.com.

1. Visit page1. It contains:
   - http://example.com/page2 (same host -> add to queue)
   - http://google.com (different host -> ignore)
2. Visit page2. It contains:
   - http://example.com/page1 (already visited -> ignore)
   - http://example.com/page3 (same host -> add to queue)
3. Visit page3. It yields no new same-host links.

Result: [http://example.com/page1, http://example.com/page2, http://example.com/page3].

Get comfortable with string manipulation to extract hostnames quickly (the hostname is everything before the first `/` after `http://`). In Python, `url.split('/')[2]` is a quick trick, but being able to explain the logic of finding the protocol and the first path separator is better.
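Both approaches side by side, as a sketch (function names are my own):

```python
def hostname_split(url):
    # "http://example.com/page1".split('/') -> ['http:', '', 'example.com', 'page1']
    return url.split('/')[2]

def hostname_manual(url):
    # Skip past the "//" protocol separator, then cut at the next '/', if any.
    start = url.find('//') + 2
    end = url.find('/', start)
    return url[start:] if end == -1 else url[start:end]
```

The manual version makes the logic explicit and also handles URLs with no path at all, such as `http://example.com`.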
| Title | Difficulty |
|---|---|
| Lexicographically Smallest String After Applying Operations | Medium |
| Serialize and Deserialize N-ary Tree | Hard |
| Freedom Trail | Hard |
| Nested List Weight Sum | Medium |
| Shortest Path in a Hidden Grid | Medium |