Reading Google Photos From a Takeout Export


For the Maps Timeline project, I want to be able to import all of my Google Photos and show them

File Structure Overview

Google Takeout exports photos in a .tgz file with the following structure:

- Takeout
   - Google Photos
     - Photos from 2015
     - Photos from 2016
     - Photos from 2017
     ...
     - <Album 1>
     - <Album 2>
     - <Album 3>
     ...

Here, Photos from <year> is a literal folder name (one per year), and <Album 1>, <Album 2>, and so on stand for the names of your Google Photos albums. Photos live canonically in the Photos from <year> folders and are duplicated in the album folders.
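The canonical-versus-album distinction can be checked from the path alone. Here is a minimal sketch, assuming the folder layout shown above; the function name is_canonical is my own, not something Takeout defines:

```python
import re
from pathlib import PurePosixPath

# Assumption: canonical folders are named exactly "Photos from <4-digit year>".
CANONICAL_DIR = re.compile(r"Photos from \d{4}")


def is_canonical(path: str) -> bool:
    """Return True if the file sits directly inside a 'Photos from <year>' folder."""
    parent = PurePosixPath(path).parent.name
    return CANONICAL_DIR.fullmatch(parent) is not None


print(is_canonical("Takeout/Google Photos/Photos from 2015/IMG_0001.jpg"))
print(is_canonical("Takeout/Google Photos/Summer Trip/IMG_0001.jpg"))
```

An album that happens to be named like a year folder would be misclassified by this check, so treat it as a starting point.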

Each folder contains media files (e.g., .jpg and .mp4 files), and each media file has a corresponding .json file with metadata. The metadata file is named after the media file: it is the media file's name with a variation of the string .supplemental-metadata.json appended. The variation is that the substring 'supplemental-metadata' is shortened, if necessary, to keep the entire metadata file name to a maximum of 51 characters.

:question: What happens if the string is long enough that even the substring 'supplemental-metadata' doesn’t fit?
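The naming rule can be sketched as a small function. This is my reading of the rule described above, not a documented Takeout algorithm: the name metadata_name is hypothetical, and I am assuming the shortening simply cuts characters off the end of '.supplemental-metadata' (leading dot included). Because the too-long case is still an open question, the sketch raises rather than guessing:

```python
SUFFIX = ".supplemental-metadata"
EXTENSION = ".json"
MAX_LEN = 51  # maximum length of the whole metadata file name, per the notes above


def metadata_name(media_name: str, max_len: int = MAX_LEN) -> str:
    """Build the expected metadata file name for a media file name."""
    # Characters left over for the (possibly shortened) suffix.
    room = max_len - len(media_name) - len(EXTENSION)
    if room < 0:
        # Open question: behavior when the media name itself is too long.
        raise ValueError(f"media name too long for a {max_len}-character limit")
    return media_name + SUFFIX[:room] + EXTENSION


print(metadata_name("IMG_0001.jpg"))
print(metadata_name("a" * 26 + ".jpg"))
```

A short name gets the full suffix, while the 30-character name in the second call leaves room for only part of it.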

Folders can also contain edited versions of media files, which are stored as, e.g., <mediafilename>-edited.jpg for <mediafilename>.jpg. We’re going to ignore those for now.
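Skipping the edited copies comes down to a check on the file name. A minimal sketch, assuming the -edited marker always sits right before the extension as in the example above (is_edited is a hypothetical helper name):

```python
from pathlib import PurePosixPath


def is_edited(path: str) -> bool:
    """Return True for names like '<mediafilename>-edited.jpg'."""
    return PurePosixPath(path).stem.endswith("-edited")


paths = [
    "Takeout/Google Photos/Photos from 2015/IMG_0001.jpg",
    "Takeout/Google Photos/Photos from 2015/IMG_0001-edited.jpg",
]
originals = [p for p in paths if not is_edited(p)]
print(originals)
```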

Even though photos in album folders duplicate the canonical ones in the Photos from <year> folders, their metadata files are not duplicates.

Reading

Here’s pseudocode to read a Google Photos Takeout .tgz file:

read(tarfile):
    media_paths = (
        pathname in tarfile.pathnames where (
            pathname is not a json file
            and pathname doesn't contain '-edited.'
        )
    )
    # first pass: hash every media file, store only the canonical copies
    for media_path in media_paths:
        capture hash into path->hash map
        if is canonical:
            store media + hash
    # second pass: attach metadata once all canonical media is stored
    for media_path in media_paths:
        metadata_path = media_path + (possibly shortened) '.supplemental-metadata' + '.json'
        look up hash in path->hash map
        query existing media by hash -> raise if not found
        store metadata (as either canonical or non-canonical)
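For reference, here is one way the pseudocode might look in Python using the standard tarfile and hashlib modules. It is a sketch under several assumptions of mine: SHA-256 as the hash, plain in-memory collections standing in for "store", metadata files that are missing from the archive silently skipped, and helper names (read_takeout, find_metadata, and so on) that are purely illustrative.

```python
import hashlib
import re
import tarfile
from pathlib import PurePosixPath

CANONICAL_DIR = re.compile(r"Photos from \d{4}")
SUFFIX = ".supplemental-metadata"


def is_media(path):
    # Media files are everything except metadata and edited copies.
    return not path.endswith(".json") and "-edited." not in path


def is_canonical(path):
    return CANONICAL_DIR.fullmatch(PurePosixPath(path).parent.name) is not None


def find_metadata(media_path, names, max_len=51):
    """Return the metadata path for a media path, or None if it is absent."""
    p = PurePosixPath(media_path)
    room = max_len - len(p.name) - len(".json")
    if room < 0:
        return None  # open question in the notes; skipped here
    candidate = str(p.parent / (p.name + SUFFIX[:room] + ".json"))
    return candidate if candidate in names else None


def sha256_of(fileobj, chunk_size=1 << 20):
    # Hash in chunks so large videos are not read into memory at once.
    digest = hashlib.sha256()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def read_takeout(tar_path):
    media = {}     # hash -> canonical media path
    metadata = []  # (hash, is_canonical, metadata path)
    hashes = {}    # media path -> hash
    with tarfile.open(tar_path, "r:*") as tar:
        files = {m.name: m for m in tar.getmembers() if m.isfile()}
        media_paths = [name for name in files if is_media(name)]
        # First pass: hash everything, store only canonical copies.
        for path in media_paths:
            with tar.extractfile(files[path]) as fh:
                hashes[path] = sha256_of(fh)
            if is_canonical(path):
                media[hashes[path]] = path
        # Second pass: attach metadata to media that is already stored.
        for path in media_paths:
            if hashes[path] not in media:
                raise LookupError(f"no canonical copy found for {path}")
            meta_path = find_metadata(path, files)
            if meta_path is not None:
                metadata.append((hashes[path], is_canonical(path), meta_path))
    return media, metadata
```

Calling read_takeout("takeout.tgz") on an export returns the canonical media keyed by hash, plus one metadata entry per copy, so album metadata stays linked to the canonical photo. The two passes matter because an album copy can appear in the archive before its canonical counterpart.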