The Find Duplicate File in System interview question involves processing a list of directory paths and file information to identify files with identical content. You are given an array of strings where each string represents a directory path followed by one or more files and their contents. For instance, a string might look like "root/a 1.txt(abcd) 2.txt(efgh)". Your task is to group all file paths that share the exact same content and return these groups as a list of lists.
This Find Duplicate File in System coding problem is frequently asked by companies like Microsoft, Dropbox, and Google because it tests several core engineering skills. First, it evaluates your ability to parse and manipulate complex strings effectively. Second, it assesses your proficiency in using hash-based data structures to group data. Finally, it often leads to deeper discussions about file system architecture, hashing large files, and memory management—skills crucial for backend and system-level roles.
This problem follows the Array, Hash Table, String interview pattern. The most efficient approach involves using a hash map where the keys are the file contents (the text inside the parentheses) and the values are lists of the full file paths containing that content. By iterating through the input strings once, you can build this map and then filter out entries that contain only one path, as these are unique files.
Imagine you have the following directory information:
Step-by-step processing:
When solving string-parsing problems, always clarify the format. Ask the interviewer if filenames can contain parentheses or if paths are always valid. In a real-world scenario, you wouldn't hash the entire content if the files are huge; instead, you'd compare file sizes first or use a rolling hash.
| Title | Difficulty | Topics | LeetCode |
|---|---|---|---|
| Word Subsets | Medium | Solve | |
| Vowel Spellchecker | Medium | Solve | |
| Group Shifted Strings | Medium | Solve | |
| Evaluate the Bracket Pairs of a String | Medium | Solve | |
| Find and Replace Pattern | Medium | Solve |