Find Duplicate File in System

Medium
93.3%
Updated 6/1/2025

What is this problem about?

The Find Duplicate File in System interview question involves processing a list of directory paths and file information to identify files with identical content. You are given an array of strings where each string represents a directory path followed by one or more files and their contents. For instance, a string might look like "root/a 1.txt(abcd) 2.txt(efgh)". Your task is to group all file paths that share the exact same content and return these groups as a list of lists.
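As a first step, each input string has to be decomposed into (full path, content) pairs. A minimal parsing sketch (the helper name `parse_entry` is illustrative, not part of the problem statement):

```python
def parse_entry(entry):
    """Split one input string like "root/a 1.txt(abcd) 2.txt(efgh)"
    into a list of (full_path, content) pairs."""
    directory, *files = entry.split(" ")
    pairs = []
    for file_info in files:
        name, _, rest = file_info.partition("(")
        content = rest[:-1]  # drop the trailing ")"
        pairs.append((directory + "/" + name, content))
    return pairs
```

For example, `parse_entry("root/a 1.txt(abcd) 2.txt(efgh)")` yields `[("root/a/1.txt", "abcd"), ("root/a/2.txt", "efgh")]`.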

Why is this asked in interviews?

This Find Duplicate File in System coding problem is frequently asked by companies like Microsoft, Dropbox, and Google because it tests several core engineering skills. First, it evaluates your ability to parse and manipulate complex strings effectively. Second, it assesses your proficiency in using hash-based data structures to group data. Finally, it often leads to deeper discussions about file system architecture, hashing large files, and memory management—skills crucial for backend and system-level roles.

Algorithmic pattern used

This problem follows the Array, Hash Table, String interview pattern. The most efficient approach involves using a hash map where the keys are the file contents (the text inside the parentheses) and the values are lists of the full file paths containing that content. By iterating through the input strings once, you can build this map and then filter out entries that contain only one path, as these are unique files.
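The approach above can be sketched in a few lines of Python (a reference sketch; the function name `find_duplicate` is illustrative):

```python
from collections import defaultdict

def find_duplicate(paths):
    """Group file paths by identical content; keep groups of size >= 2."""
    content_to_paths = defaultdict(list)
    for entry in paths:
        directory, *files = entry.split(" ")
        for file_info in files:
            name, _, rest = file_info.partition("(")
            content = rest[:-1]  # strip the closing ")"
            content_to_paths[content].append(directory + "/" + name)
    # Files with unique content appear in groups of one and are filtered out.
    return [group for group in content_to_paths.values() if len(group) > 1]
```

This runs in O(total input length): each character of each string is visited a constant number of times while building the map.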

Example explanation

Imagine you have the following directory information:

  1. "root/docs a.txt(hello) b.txt(world)"
  2. "root/backup c.txt(hello)"

Step-by-step processing:

  • From the first string, we extract: path root/docs/a.txt with content "hello" and root/docs/b.txt with content "world".
  • From the second string, we extract: path root/backup/c.txt with content "hello".
  • In our hash map, the key "hello" will now point to ["root/docs/a.txt", "root/backup/c.txt"].
  • The key "world" points to ["root/docs/b.txt"].
  • The final result only includes groups with at least two files: [["root/docs/a.txt", "root/backup/c.txt"]].
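The steps above can be traced directly with an ordinary dictionary (a sketch; variable names are illustrative):

```python
# (path, content) pairs extracted from the two input strings
entries = [
    ("root/docs/a.txt", "hello"),
    ("root/docs/b.txt", "world"),
    ("root/backup/c.txt", "hello"),
]

content_map = {}
for path, content in entries:
    content_map.setdefault(content, []).append(path)

# Keep only contents shared by at least two files.
duplicates = [group for group in content_map.values() if len(group) >= 2]
# duplicates == [["root/docs/a.txt", "root/backup/c.txt"]]
```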

Common mistakes candidates make

  • Inefficient String Handling: Building paths with repeated string concatenation in a loop instead of using a builder or a single join, which can lead to O(N^2) complexity.
  • Missing Path Details: Forgetting to join the directory path with the filename correctly.
  • Ignoring Scalability: Failing to discuss how to handle files that are too large to fit in memory or how to optimize hashing for massive datasets (e.g., using file sizes or partial hashes first).

Interview preparation tip

When solving string-parsing problems, always clarify the format. Ask the interviewer if filenames can contain parentheses or if paths are always valid. In a real-world scenario, you wouldn't hash the entire content if the files are huge; instead, you'd compare file sizes first or use a rolling hash.
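For the follow-up about real files on disk, one common pattern is to group by file size first and hash only within size groups, so most unique files are never read at all. A hedged sketch (the function name `duplicate_files_on_disk` and the choice of SHA-256 are assumptions, not part of the problem):

```python
import hashlib
import os
from collections import defaultdict

def duplicate_files_on_disk(file_paths, chunk_size=1 << 20):
    """Find duplicate files by size first, then by streaming content hash.

    Files with a unique size cannot be duplicates, so they are skipped
    without ever being opened. Hashing reads in fixed-size chunks so
    arbitrarily large files never need to fit in memory.
    """
    by_size = defaultdict(list)
    for path in file_paths:
        by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # unique size => cannot have a duplicate
        by_hash = defaultdict(list)
        for path in same_size:
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                while chunk := f.read(chunk_size):
                    digest.update(chunk)
            by_hash[digest.hexdigest()].append(path)
        duplicates.extend(g for g in by_hash.values() if len(g) > 1)
    return duplicates
```

In practice you might add an intermediate pass that hashes only the first few kilobytes of each file, falling back to a full hash (or a byte-by-byte comparison) only when the prefixes collide.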

Similar Questions