Automating & Organising 170k Files With Python & ChatGPT

Recently, some hard drives came into our lives. They contained multiple backups of personal computers spanning a 10-year period: around 170,000 files, mostly media and document file types, but also a whole heap of random .ico, .ini, and .dll system/cache files.

Amidst the trash, however, were childhood memories that desperately needed organizing. This is where the wonders of AI come in, making a very long job very short (well, shorter…) and inspiring a personal tech project in me.

The personal tech project revolves around finding the fastest and hackiest way to accomplish this task and avoid spending weeks manually moving and deleting crap from these drives.

Using ChatGPT, I was able to create a Python script that moves media files from a source directory to a destination directory based on their creation year. This allowed me to extract the file types I wanted and organize them neatly in the destination.

The full script is on my GitHub (PhotoPy), but here’s an overview:

  1. It imports the necessary modules: os, shutil, and datetime.
  2. The move_media_files function is defined, which takes the source directory and destination directory as parameters.
  3. A list of media and document file extensions is defined (e.g., .docx, .rtf, .doc, .txt, .DOC).
  4. Two counters, total_files and processed_files, are initialized to keep track of the number of files.
  5. The script traverses the source directory recursively using os.walk to count the total number of media files to be processed. It checks if each file’s extension matches the media file extensions list and increments the total_files counter accordingly.
  6. The script traverses the source directory again using os.walk to process the media files. It checks if each file’s extension matches the media file extensions list.
  7. For each media file, it determines the file’s creation year by calling the get_file_creation_year function.
  8. If the creation year cannot be determined, it prints a message and skips the file.
  9. It creates a destination subdirectory in the destination directory based on the file’s creation year, using os.makedirs with exist_ok=True to ensure the directory is created if it doesn’t exist.
  10. It constructs the destination path for the file by joining the destination subdirectory path with the original file name.
  11. It checks if the file already exists in the destination directory. If it does, it prints a message, deletes the file from the source directory, and continues to the next file.
  12. If the file doesn’t exist in the destination directory, it moves the file from the source path to the destination path using shutil.move.
  13. The processed_files counter is incremented, and the progress is printed using the print_progress function.
  14. The get_file_creation_year function is defined to retrieve the creation year of a file using os.stat and datetime.fromtimestamp.
  15. The print_progress function is defined to print the progress of the file processing, including the number of processed files, total files, and estimated remaining time.
  16. The source directory and destination directory paths are specified.
  17. The move_media_files function is called with the source and destination directory paths to initiate the media file moving process.
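
For reference, here’s a condensed sketch of that flow rather than the exact file from the repo: the paths and extension list are placeholders, I match extensions case-insensitively instead of listing .DOC separately, and the progress estimate is a rough reconstruction of what print_progress does.

```python
import os
import shutil
import time
from datetime import datetime

# Placeholder extension list; the real one in PhotoPy is longer.
MEDIA_EXTENSIONS = (".jpg", ".jpeg", ".png", ".mp4", ".mov", ".docx", ".rtf", ".doc", ".txt")


def get_file_creation_year(path):
    """Return the file's creation year from os.stat, or None if it can't be read."""
    try:
        stats = os.stat(path)
        return datetime.fromtimestamp(stats.st_ctime).year
    except OSError:
        return None


def print_progress(processed, total, start_time):
    """Print progress and a rough estimate of the remaining time."""
    elapsed = time.time() - start_time
    remaining = (elapsed / processed) * (total - processed) if processed else 0
    print(f"{processed}/{total} files processed, ~{remaining / 60:.1f} minutes remaining")


def move_media_files(source_dir, destination_dir):
    start_time = time.time()

    # First pass: count matching files so progress can be reported.
    total_files = sum(
        1
        for _, _, files in os.walk(source_dir)
        for name in files
        if name.lower().endswith(MEDIA_EXTENSIONS)
    )

    processed_files = 0

    # Second pass: move each matching file into a year-named subfolder.
    for root, _, files in os.walk(source_dir):
        for name in files:
            if not name.lower().endswith(MEDIA_EXTENSIONS):
                continue

            source_path = os.path.join(root, name)
            year = get_file_creation_year(source_path)
            if year is None:
                print(f"Could not determine creation year for {source_path}, skipping")
                continue

            year_dir = os.path.join(destination_dir, str(year))
            os.makedirs(year_dir, exist_ok=True)
            destination_path = os.path.join(year_dir, name)

            if os.path.exists(destination_path):
                # Duplicate: drop the source copy and move on.
                print(f"{name} already exists in {year_dir}, deleting source copy")
                os.remove(source_path)
                continue

            shutil.move(source_path, destination_path)
            processed_files += 1
            print_progress(processed_files, total_files, start_time)


if __name__ == "__main__":
    move_media_files("/path/to/source", "/path/to/photosOrganized")
```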

Before organization, I had long-winded paths and directory structures similar to the one below, but even more obtuse.

pythonFiles (ParentDir)
├── .vscode (SubDir)
├── lib
│   ├── jedilsp
│   │   ├── bin (SubSubDir)
│   │   ├── docstring_to_markdown
│   │   ├── importlib_metadata
│   │   └── files?
│   │       ├── someFolder
│   │       └── files?

Then, after organizing, everything was much more manageable and easy to navigate.

photosOrganized~ (ParentDir)
├── 2006
│   └── media files
├── 2007
│   └── media files
├── 2008
│   └── media files
├── 2009
│   └── media files
├── 2010
│   └── media files

So, this script moved all the data to my newly organized location. However, it took hours to fully complete; honestly, closer to two days. But I didn’t have to do anything except sit and wait.

This delay can be attributed to two things:

  1. File I/O Operations: The script performs file I/O operations, such as checking file existence, deleting files, and printing information. These operations can be relatively slow when repeated for a large number of files. Each file deletion and printing operation contributes to the overall execution time.

  2. Traversing Subdirectories: The script utilizes os.walk to traverse the specified directory and its subdirectories. When there are a substantial number of subdirectories and files within them, it can take a longer time to iterate over each file and perform the desired operations.

To address this issue, I could have implemented the following measures:

  1. Minimize I/O Operations: Instead of deleting files immediately, you can store the paths of the files that need to be deleted in a list and perform the deletion in batches (see the sketch after this list). This reduces the number of individual I/O operations, improving overall performance.

  2. Parallelize the Processing: If the system has multiple processors or cores, you can parallelize the file processing operations to make use of the available resources more efficiently. This can be achieved using multiprocessing or multithreading techniques to process multiple files simultaneously.

  3. Implement Caching: If you need to perform repeated operations, such as checking file existence, you can implement caching mechanisms to avoid redundant I/O operations. Caching can help improve performance by reducing the number of disk accesses.

  4. Optimize Algorithmic Complexity: If possible, consider optimizing the algorithm used for file processing. Depending on the specific requirements, there might be more efficient algorithms or data structures that can be employed to achieve the desired outcome.
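
None of this is implemented yet, but here’s a rough sketch of what the first two measures could look like, with delete_in_batches and move_in_parallel as hypothetical helpers rather than anything currently in PhotoPy:

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def delete_in_batches(paths, batch_size=500):
    """Collect duplicate paths first, then delete them in batches instead of one at a time."""
    for start in range(0, len(paths), batch_size):
        for path in paths[start:start + batch_size]:
            try:
                os.remove(path)
            except OSError as error:
                print(f"Could not delete {path}: {error}")


def move_in_parallel(move_jobs, workers=4):
    """Move (source, destination) pairs across a thread pool; the moves are I/O-bound, so they overlap."""
    def _move(job):
        source_path, destination_path = job
        os.makedirs(os.path.dirname(destination_path), exist_ok=True)
        shutil.move(source_path, destination_path)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces the map to run and re-raises any exception from a worker.
        list(pool.map(_move, move_jobs))
```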

But, those considerations are for the next run when I tackle my grandad’s archive, which is probably 2-3 times the size.

Now the files are ready for ‘manual verification’: since it’s old data, I don’t fully trust automation to know exactly what to keep vs. nuke. So begins the process of going through each folder, removing the crap and keeping the gold.

Except, I can also hack/speed up that process a bit by deleting common file types I know are useless, like .ico, .dll, .html, .comf, .part, etc. So I was able to get another script that cleans up all directories based on an array of file types (GitHub: Cleanup.py).
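
The actual extension array lives in Cleanup.py; the sketch below just shows the shape of the idea, with a placeholder junk list and path:

```python
import os

# Placeholder junk list; the real array in Cleanup.py is longer.
JUNK_EXTENSIONS = (".ico", ".dll", ".html", ".part")


def cleanup(directory):
    """Walk the tree and delete any file whose extension is on the junk list."""
    removed = 0
    for root, _, files in os.walk(directory):
        for name in files:
            if name.lower().endswith(JUNK_EXTENSIONS):
                os.remove(os.path.join(root, name))
                removed += 1
    print(f"Removed {removed} junk files")


if __name__ == "__main__":
    cleanup("/path/to/photosOrganized")
```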

In total, when I started this project, I had over 170,000 assorted files. After organization and deletion, the folders now hold just over 70,000 media files and documents: data that’s valuable to us, with far less clutter.

That is all for this project. I will be sure to update here once I tackle my next mundane problem in the most technical way I can think of.

Sláinte!