My eight-year quest to digitize 45 videocassettes. Part 2

Part 1 described the difficult quest to digitize old family videos and split them into separate scenes. After processing all the clips, I wanted to make watching them online as convenient as on YouTube. Since these are personal family memories, they can't be posted on YouTube itself. I needed hosting that was more private, yet still convenient and secure.

Step 3. Publish

ClipBucket, an open source YouTube clone that you can install on your own server

First of all I tried ClipBucket, which calls itself an open source YouTube clone that you can install on your server.

Surprisingly, ClipBucket doesn't have any installation instructions. With the help of an outside guide, I automated the installation process with Ansible, a server configuration management tool.

Part of the difficulty was that the ClipBucket installation scripts were completely broken. At the time I worked at Google, and under the terms of my contract I wasn't allowed to contribute to an open source YouTube clone, but I filed a bug report from which it was easy to make the necessary fixes. Months passed, and they still hadn't figured out what the problem was. Instead, they added more bugs with every release.

ClipBucket operated on a consulting model: they released their code for free and charged for help with deployment. Gradually it dawned on me that a company that makes money from paid support is probably not very interested in having customers install the product themselves.

MediaGoblin, a more modern alternative

After a few months of frustration with ClipBucket, I reviewed the available options and found MediaGoblin.

MediaGoblin is a standalone media sharing platform

MediaGoblin has a lot going for it. Unlike ClipBucket, which is written in unsightly PHP, MediaGoblin is written in Python, a language I have a lot of experience with. It has a command-line interface, which makes it easy to automate video uploads. Most importantly, MediaGoblin ships as a Docker image, which eliminates any installation problems.

Docker is a technology that creates a self-contained environment for an application so that it runs anywhere. I use Docker in many of my projects.

The Surprising Difficulty of Re-Dockerizing MediaGoblin

I assumed that deploying a MediaGoblin Docker image would be a trivial task. Well, it didn't quite work out that way.

The existing image lacked two features I needed:

  • Authentication
    • MediaGoblin creates a public media portal by default, and I needed a way to restrict outsiders' access.
  • Transcoding
    • Every time you upload a video, MediaGoblin tries to re-encode it for optimal streaming. If the video is already streaming-ready, transcoding only degrades its quality.
    • MediaGoblin lets you disable transcoding through configuration options, but it's not possible to do this in the existing Docker image.

Well, no problem. The Docker image is open source, so you can rebuild it yourself.

Unfortunately, the Docker image no longer builds from the current MediaGoblin repository. I tried to sync it to the version from the last successful build, but that didn't work either. Even though I used exactly the same code, MediaGoblin's external dependencies had changed, breaking the build. I spent dozens of hours running the 10-15 minute MediaGoblin build process over and over until it finally worked.

The same thing happened again a few months later. In total, over the past couple of years, the MediaGoblin dependency chain has broken my build several times, most recently while I was writing this article. I ended up publishing my own fork of MediaGoblin with hard-coded dependencies and explicitly specified library versions. In other words, instead of the dubious claim that MediaGoblin works with any version of celery >= 3.0, I pinned the dependency to celery 4.2.1, because I had tested MediaGoblin with that version. It looks like the product needs a reproducible build mechanism, but I haven't built one yet.
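
To illustrate the difference between a loose version range and a pin, here is a minimal Python sketch. It is not MediaGoblin's or pip's actual resolution logic; the helper names and the short list of celery releases are chosen only for illustration.

```python
# Sketch: why "celery >= 3.0" is riskier than "celery == 4.2.1".
# The release list is illustrative, not a complete history of celery.

def parse(version: str) -> tuple:
    """Turn '4.2.1' into (4, 2, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))

def satisfies(version: str, operator: str, target: str) -> bool:
    """Support only the two operators discussed in the text."""
    if operator == ">=":
        return parse(version) >= parse(target)
    if operator == "==":
        return parse(version) == parse(target)
    raise ValueError(f"unsupported operator: {operator}")

releases = ["3.0.0", "3.1.25", "4.2.1", "4.4.7", "5.0.0"]

# A range accepts every future release, tested or not.
loose = [v for v in releases if satisfies(v, ">=", "3.0")]
# A pin accepts exactly the version that was tested.
pinned = [v for v in releases if satisfies(v, "==", "4.2.1")]

print("celery >= 3.0   allows:", loose)
print("celery == 4.2.1 allows:", pinned)
```

With the range, whichever release is newest on the day of the build gets installed, so the same source code can produce different images over time. With the pin, the build installs the same version every time.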

Anyway, after many hours of struggle, I was finally able to build and configure MediaGoblin in a Docker image. From there it was easy to skip unnecessary transcoding and install Nginx for authentication.

Step 4. Hosting

Since MediaGoblin was now running in Docker on my local machine, the next step was to deploy it to a cloud server so the family could watch the videos.

MediaGoblin and the video storage problem

There are many platforms that take a Docker image and host it at a public URL. The catch is that, in addition to the app itself, I had to publish 33 GB of video files. It was possible to bake them into the Docker image, but that would have been cumbersome and ugly. Changing one line of configuration would require redeploying 33 GB of data.

When I used ClipBucket, I solved the problem with gcsfuse, a utility that lets the operating system mount Google Cloud Storage buckets as regular file system paths. I hosted the video files on Google Cloud and used gcsfuse to present them to ClipBucket as local files.

The difference was that ClipBucket ran in a real virtual machine, while MediaGoblin ran in a Docker container. Here, mounting files from cloud storage turned out to be much more difficult. I spent dozens of hours solving all the problems and wrote a whole blog post about it.

The initial integration of MediaGoblin with Google Cloud Storage, which I wrote about in 2018

After several weeks of adjusting all the components, everything worked. Without making any changes to the MediaGoblin code, I tricked it into reading and writing media files in Google Cloud Storage.

The only problem was that MediaGoblin became painfully slow. It took a whopping 20 seconds to load video thumbnails on the home page. If you jumped forward while watching a video, MediaGoblin paused for an endless 10 seconds before resuming playback.

The main problem was that videos and images took a long, roundabout route to the user. They had to travel from Google Cloud Storage through gcsfuse to MediaGoblin, then through Nginx, and only then did they reach the user's browser. The main bottleneck was gcsfuse, which is not optimized for speed. The developers warn about high latency right on the project's main page:

Warnings about poor performance in the gcsfuse documentation

Ideally, the browser should pull files directly from Google Cloud, bypassing all intermediate layers. How do I do this without going deep into the MediaGoblin codebase and adding complex Google Cloud integration logic?

sub_filter trick in nginx

Luckily, I found an easy, if slightly ugly, solution. I added this filter to Nginx's default.conf configuration:

sub_filter "/mgoblin_media/media_entries/" "https://storage.googleapis.com/MY-GCS-BUCKET/media_entries/";
sub_filter_once off;

In my setup, Nginx acted as a proxy between MediaGoblin and the end user. The directive above tells Nginx to run a search-and-replace on every MediaGoblin HTML response before serving it to the end user. Nginx replaces all relative paths to MediaGoblin media files with URLs from Google Cloud Storage.

For example, MediaGoblin generates this HTML:

<video width="720" height="480" controls autoplay>
  <source
    src="/mgoblin_media/media_entries/16/Michael-riding-a-bike.mp4"
    type="video/mp4">
</video>

Nginx changes the response:

<video width="720" height="480" controls autoplay>
  <source
    src="https://storage.googleapis.com/MY-GCS-BUCKET/media_entries/16/Michael-riding-a-bike.mp4"
    type="video/mp4">
</video>
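
The effect of the two directives can be imitated in a few lines of Python. This is only a stand-in for illustration: Nginx performs the substitution itself, and the function name here is hypothetical.

```python
# Python stand-in for the Nginx sub_filter directive; not Nginx itself.
LOCAL_PREFIX = "/mgoblin_media/media_entries/"
GCS_PREFIX = "https://storage.googleapis.com/MY-GCS-BUCKET/media_entries/"

def rewrite_media_urls(html: str) -> str:
    """Replace every occurrence of the local prefix.

    str.replace substitutes all matches, which mirrors
    `sub_filter_once off` (without it, Nginx replaces only the first).
    """
    return html.replace(LOCAL_PREFIX, GCS_PREFIX)

original = """<video width="720" height="480" controls autoplay>
  <source
    src="/mgoblin_media/media_entries/16/Michael-riding-a-bike.mp4"
    type="video/mp4">
</video>"""

rewritten = rewrite_media_urls(original)
print(rewritten)
```

Because the substitution is purely textual, it works on any page MediaGoblin emits, as long as media paths keep the same prefix.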

Now everything is working as it should:

Nginx rewrites responses from MediaGoblin so that clients can request media files directly from Google cloud storage

The best part about my solution is that it doesn't require any changes to the MediaGoblin code. The two-line Nginx directive seamlessly integrates MediaGoblin and Google Cloud, even though the two services don't know anything about each other.

Note: This solution requires the files in Google Cloud Storage to be publicly readable. To reduce the risk of unauthorized access, I use a long random bucket name (for example, mediagoblin-39dpduhfz1wstbprmyk5ak29) and verify that the bucket's access control policy does not allow unauthorized users to list the bucket's contents.
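
A name of that shape can be generated with Python's standard secrets module. This is a sketch under assumptions: the function name, the "mediagoblin" prefix default, and the 24-character suffix length are chosen to match the example above, not taken from any tooling in the article.

```python
import secrets
import string

# Lowercase letters and digits are both valid in bucket names.
ALPHABET = string.ascii_lowercase + string.digits

def random_bucket_name(prefix: str = "mediagoblin", suffix_length: int = 24) -> str:
    """Build a hard-to-guess bucket name like 'mediagoblin-<random suffix>'."""
    suffix = "".join(secrets.choice(ALPHABET) for _ in range(suffix_length))
    return f"{prefix}-{suffix}"

name = random_bucket_name()
print(name)
```

With 36 possible characters in each of 24 positions, the suffix carries roughly 124 bits of randomness, which makes guessing the name impractical.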

Final product

At this point, I had a complete, working solution. MediaGoblin happily ran in its own container on the Google Cloud Platform, so it didn't need to be patched or updated frequently. Everything in my process was automated and reproducible, allowing simple edits or rollbacks to previous versions.

My family really liked how easy it was to watch the videos. With the Nginx hack described above, video playback became as fast as on YouTube.

The view screen looks like this:

The family video catalog, filtered by the "Best" tag

Clicking on the thumbnail brings up the following screen:

Viewing an individual clip on a media server

After many years of work, I was incredibly pleased to give my relatives the ability to watch our videos in an interface as convenient as YouTube's, which is what I had wanted from the start.

Bonus: Cost reduction to less than $1 per month

People watch home videos infrequently, only every few months. My family collectively generated about 20 hours of traffic per year, but the server was running 24/7. I was paying $15 a month for a server that sat idle 99.7% of the time.

At the end of 2018, Google released a product called Cloud Run. Its killer feature was starting Docker containers so quickly that an application could be launched in response to an HTTP request. That is, the server could stay in standby mode and start only when someone wanted to visit it. For infrequently used apps like mine, costs drop from $15 a month to a few cents a year.

For reasons I don't remember, Cloud Run didn't work with my MediaGoblin image. But with the advent of Cloud Run, I remembered that Heroku offers a similar service for free, and their tools are much more convenient than Google's.

With a free application server, the only expense is data storage. Google's standard regional storage costs 2.3 cents per GB per month. The video archive is 33 GB, so I pay only 77 cents a month.
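
The figures in this section can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the article and assumes a 365-day year and an archive of exactly 33 GB.

```python
# Idle time: about 20 hours of viewing per year on an always-on server.
HOURS_PER_YEAR = 24 * 365            # 8,760 hours (assumes a 365-day year)
hours_watched_per_year = 20          # from the article
idle_fraction = 1 - hours_watched_per_year / HOURS_PER_YEAR

# Storage: standard regional storage at 2.3 cents per GB per month.
storage_price_per_gb = 0.023         # USD, from the article
archive_size_gb = 33                 # from the article; assumed exact here
monthly_storage_cost = storage_price_per_gb * archive_size_gb

print(f"Server idle: {idle_fraction:.2%} of the year")
print(f"Storage:     ${monthly_storage_cost:.2f} per month")
```

The idle share comes out just under 99.8%, in line with the figure above. For exactly 33 GB the storage cost rounds to about 76 cents, a cent below the quoted 77 cents; presumably the archive is slightly larger than 33 GB.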

This solution costs only $0.77 per month

Tips for those who are going to try

Obviously, the process took me a long time. But I hope this article helps you save 80-90% of the effort of digitizing and publishing your home videos. A separate section has a detailed step-by-step guide to the whole process, but here are some general tips:

  • Save as much metadata as possible during the digitizing and editing phase.
    • Valuable information is often recorded on video cassette labels.
    • Record which clip was taken from which cassette and in what order.
    • Write down the date of shooting, which may be indicated on the video.
  • Consider paying for professional digitizing services.
    • You will find it extremely difficult and expensive to match their digitization quality.
    • But stay away from a company called EverPresent (let me know if you need more details).
  • If you do the digitization yourself, purchase an HDD.
    • Uncompressed standard definition video takes 100-200 MB per minute.
    • I kept everything on my Synology DS412+ (10 TB).
  • Write metadata in some common format that is not tied to a specific application.
    • Clip descriptions, time codes, dates, etc.
    • If you save metadata in an application-specific format (or worse, don't save at all), you won't be able to redo the work if you decide to use another solution.
    • While editing, you see a lot of useful metadata in the video. You will lose it if you don't write it down.
      • What's happening on the video?
      • Who appears in it?
      • When was it recorded?
  • Tag your favorite videos.
    • To be honest, most home video content is pretty boring.
    • I apply the "best of" tag to my favorite clips and open them when I want to watch funny videos.
  • Get an end-to-end pipeline working as early as possible, so that a clip can go through the whole process from start to finish.
    • I tried to digitize all the cassettes first, then edit all the cassettes, etc.
    • I wish I had started with one cassette and taken it through the entire process. Then I would have understood which decisions at which stages affect the final result.
  • Minimize recoding.
    • Every time you edit or re-encode a clip, you degrade its quality.
    • Digitize raw footage at maximum quality, then transcode each clip exactly once into the format that browsers natively play.
  • Use the simplest possible solution for posting video clips.
    • In hindsight, MediaGoblin seems like an overly complex tool for a fairly simple scenario of generating web pages with a static set of video files.
    • If I were to start over, I would use a static site generator such as Hugo, Jekyll or Gridsome.
  • Make a montage.
    • Video editing is a fun way to combine the best moments from multiple videos.
    • Music is the most important part of a montage. For example, "Slow Show" by The National is an amazing track, and a personal discovery of mine.
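
Two of the tips above, keeping metadata in an application-independent format and publishing with the simplest possible tool, can be sketched together in Python. This is not the workflow used in the article: the field names, the example values (tape number, date, people), and the render_index function are all hypothetical, and only the file name comes from the earlier HTML example.

```python
import html
import json

# Hypothetical metadata for one clip; every field name is an assumption.
clips = [
    {
        "file": "Michael-riding-a-bike.mp4",
        "title": "Michael riding a bike",
        "tape": 16,                  # which cassette the clip came from
        "position_on_tape": 3,       # order of the clip on that cassette
        "recorded": "1994-06-12",    # made-up date for illustration
        "people": ["Michael"],
        "tags": ["best of"],
    },
]

# JSON is plain text that any tool can read, so the metadata
# survives a move to a different publishing application.
metadata_json = json.dumps(clips, indent=2)
restored = json.loads(metadata_json)

def render_index(clips: list) -> str:
    """Render a bare-bones static page with one player per clip."""
    items = []
    for clip in clips:
        title = html.escape(clip["title"])
        src = html.escape(clip["file"], quote=True)
        items.append(f'<h2>{title}</h2>\n<video controls src="{src}"></video>')
    body = "\n".join(items)
    return f"<!DOCTYPE html>\n<html><body>\n{body}\n</body></html>"

page = render_index(restored)
print(page)
```

A static site generator does essentially the same thing with templates: it reads structured data and writes HTML files that any web server or storage bucket can serve.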

Source: habr.com