For the Maps Timeline project, I want to be able to import all of my Google Photos and show them.
File Structure Overview
Google Takeout exports photos in a .tgz file with the following structure:
- Takeout
  - Google Photos
    - Photos from 2015
    - Photos from 2016
    - Photos from 2017
    - ...
    - <Album 1>
    - <Album 2>
    - <Album 3>
    - ...
Here, Photos from <year> are literal folder names, while the <Album> folders
take whatever names your Google Photos albums have. Photos live canonically
in the Photos from <year> folders and are duplicated into the album folders.
Each folder contains media files (e.g., .jpg and .mp4 files), and each media
file has a corresponding .json file with metadata. The metadata file is named
after its media file: it is the media file's name with a variation of the
string .supplemental-metadata.json appended. The variation is that the
substring 'supplemental-metadata' is shortened, if necessary, to cap the
entire metadata file name at a maximum of 51 characters.
:question: What happens if the string is long enough that even the substring
'supplemental-metadata' doesn’t fit?
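My reading of the naming rule can be sketched like this (this is an assumption from observed file names, not verified against Takeout's actual implementation, and the behavior in the open-question case above is a guess):

```python
def metadata_name(media_name: str, max_len: int = 51) -> str:
    """Sketch: derive the sidecar metadata file name for a media file."""
    suffix = ".supplemental-metadata"
    ext = ".json"
    full = media_name + suffix + ext
    if len(full) <= max_len:
        return full
    # Shorten 'supplemental-metadata' so the whole name fits in max_len.
    budget = max_len - len(media_name) - len(ext) - 1  # -1 for the dot
    if budget <= 0:
        # Open question above: unclear what Takeout actually does here.
        # Guessing it drops the substring entirely.
        return media_name + ext
    return media_name + "." + suffix[1:][:budget] + ext
```

For example, `metadata_name("IMG_0001.jpg")` fits untruncated, while a 30-character media name forces the substring down to `supplemental-me` to land exactly at 51 characters.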
Folders can also contain edited versions of media files, which are stored as,
e.g., <mediafilename>-edited.jpg for <mediafilename>.jpg. We’re going to
ignore those for now.
Even though photos in album folders are duplicates of the canonical ones in
the Photos from <year> folders, the metadata files are not.
Reading
Here’s pseudocode to read a Google Photos Takeout .tgz file:
read(tarfile):
    media_paths = (
        pathname in tarfile.pathnames where (
            pathname is not a json file
            pathname doesn't contain '-edited.'
        )
    )
    for media_path in media_paths:
        capture hash into path->hash map
        if is canonical:
            store media + hash
    for media_path in media_paths:
        metadata_path = media_path + part of '.supplemental-metadata.json'
        query existing media by hash -> raise if not found
        store metadata (as either canonical or non-canonical)
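A minimal Python sketch of that pseudocode, assuming the .tgz layout described earlier. The helper `is_canonical` and the in-memory dicts are my own names, and sidecar matching is done by prefix so truncated `.supplemental-me…json` names still pair up; a real importer would persist the stores and distinguish canonical from album-copy metadata rather than keeping the first sidecar seen:

```python
import hashlib
import json
import tarfile


def is_canonical(path: str) -> bool:
    # Canonical photos live under the "Photos from <year>" folders.
    return "/Photos from " in path


def read(tgz_path: str):
    media_by_hash = {}  # content hash -> canonical media path
    hash_by_path = {}   # media path -> content hash
    metadata = {}       # content hash -> parsed sidecar metadata

    with tarfile.open(tgz_path, "r:gz") as tar:
        members = {m.name: m for m in tar.getmembers() if m.isfile()}
        media_paths = [
            name for name in members
            if not name.endswith(".json") and "-edited." not in name
        ]

        # Pass 1: hash every media file; store the canonical copies.
        for path in media_paths:
            data = tar.extractfile(members[path]).read()
            digest = hashlib.sha256(data).hexdigest()
            hash_by_path[path] = digest
            if is_canonical(path):
                media_by_hash[digest] = path

        # Pass 2: pair each media file with its sidecar metadata JSON.
        # Prefix match tolerates the shortened 'supplemental-metadata'.
        for path in media_paths:
            meta_name = next(
                (n for n in members
                 if n.startswith(path + ".") and n.endswith(".json")),
                None,
            )
            if meta_name is None:
                continue  # no sidecar found for this file
            digest = hash_by_path[path]
            if digest not in media_by_hash:
                raise ValueError(f"no canonical media for {path}")
            meta = json.load(tar.extractfile(members[meta_name]))
            metadata.setdefault(digest, meta)

    return media_by_hash, metadata
```

Hashing by content is what lets the album duplicates collapse onto the canonical copy: an album copy hashes to the same digest, so its sidecar metadata attaches to media that was already stored from a Photos from <year> folder.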