Google Code-in 2017

Google Code-in Logo

This year, I signed up for Google Code-in after finding out about it from a friend. It’s the holidays in Singapore, so I thought this would be a fun way to fill the time. At this point, I’ve completed 2 tasks. It’s been a pretty good experience so far: the mentors are incredibly friendly and helpful, and I found some fun tasks to do. In this blog post, I’ll be writing about my experience completing my second task.

The Task

Simply put, my task was to make a program/script to translate .srt files into different languages, via a website called DeepL.

.srt files

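.srt files are subtitle files, used to display subtitles in video players. They contain subtitles in groups. Each group has an identifier (a simple integer that starts from 1 and increments by one for each group), the time range during which the subtitle is shown, and the subtitle text itself.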

DeepL

DeepL logo from deepl.com

DeepL is a translation tool that I find smarter than Google Translate. It translates each word within the context of the entire sentence.

What I built

Code sample

In the end, I managed to complete the task by building a simple command-line tool using Go. My submission was accepted after two reviews by my mentors and a demonstration. The tool takes an .srt file as input, along with the language to translate from and the language to translate to, and writes the result to another .srt file. It uses the undocumented DeepL API (the same one the website uses) to perform the translations.

Building the tool

In this section, I will attempt to explain how I built my solution and some of the challenges I faced.

Looking for libraries

Writing the code to parse .srt files and connect to the DeepL API from scratch would have been impractical within the 7 days I was given, so I looked for existing libraries. I found and used go-astisub for parsing the .srt files. However, I failed to find any libraries for the DeepL API. After digging around a little more, I found this repository with the code I needed. I copied the code into a separate file and stripped it down to only the components I needed.

The initial challenge

One of the main challenges I faced is that text must be sent to DeepL in complete sentences. However, a single sentence can be split over multiple chunks of an .srt file.

Two sections, but only one sentence

This meant that my program needed a clever way to turn the sections of text into sentences, send them one by one to DeepL, and then turn the resulting sentences back into sections. Another problem was that the translated sentence often doesn’t have the same number of words as the original one. Take this example (from English to Polish):

Do or do not there is no try.

Nie próbuj lub nie.

If the English sentence is split over multiple sections, how am I to split the Polish sentence back into those sections?

The solution

I solved this problem by using proportions. Taking the example above, you can see that the English sentence contains 8 words while the Polish sentence contains 4. This means that the Polish sentence is half as long as the English one. If the English sentence was originally split into a 2-word section and a 6-word section, the Polish sentence would be split into a 1-word section and a 3-word section. So,

Do or

do not there is no try.

would translate to

Nie

próbuj lub nie.

Of course, this is a rudimentary solution, as the meaning of the sentence may be broken up differently in the translation. However, given the time and knowledge I had, it seemed to work fairly well.

Pipeline/Data flow

After I came up with this solution, building the rest of the tool was pretty straightforward. First, the .srt file is loaded. Then, only the text of the .srt file is extracted and turned into sentences. After that, the sentences are sent to DeepL and translated. The most confident (best) translation of each sentence is taken and split back into sections, via the proportion method discussed earlier. Finally, the resulting text is added back into the original .srt structure and written to the output file.

DeepL rate limiting

Of course, the DeepL API was rate limited. At first, I kept hitting the rate limit, as I was sending the whole 250+ sentences of the test file up at one time. To solve this, I added a ‘rate’ flag to the command-line tool, which allows the user to control the rate at which sentences are sent to DeepL.

Submitting the task

To submit this task, I created a GitHub repository to store my code and submitted the link to my mentors. I also built several releases for different operating systems (Go supports cross-compilation). When I first submitted the task, the example I gave only contained 4 sections, so the task was marked as “More work needed” and returned to me. I then downloaded a much larger .srt file, which took several minutes to translate (due to rate limiting), and submitted that as an example. Only then was the task accepted and that glorious white tick appeared.

Glorious white tick

Conclusion

Overall, this was a very enjoyable task to build. It was even more satisfying to see my work accepted by the wonderful mentors. Unfortunately, my first Google Code-in will also be my last, as I have reached the age limit for participants. Despite that, I have really enjoyed Google Code-in 2017.