Google Code-in 2017
This year, I signed up for Google Code-in after finding out about it from a friend. It’s the holidays in Singapore, so I thought this would be something fun to fill the time. At this point, I’ve completed 2 tasks. It’s been a pretty good experience so far: the mentors are incredibly friendly and helpful, and I found some fun tasks to do. In this blog post, I’ll be writing about my experience completing my second task.
Simply put, my task was to make a program/script to translate .srt files into different languages, via a website called DeepL.
.srt files are subtitle files, used to display subtitles in video players. They contain subtitles in groups. Each group has an identifier (a simple integer that starts from 1 and increments
DeepL is a translation tool that is smarter than Google Translate. It has the ability to translate within the context of the entire sentence.
What I built
In the end, I managed to complete the task by building a simple command-line tool using Go. My submission was accepted after two reviews by my mentors and a demonstration. The tool takes an .srt file as input, along with the language to translate from and the language to translate to, and writes the result to another .srt file. It uses the undocumented DeepL API (the same one the website uses) to perform the translations.
Building the tool
In this section, I will attempt to explain how I built my solution and some of the challenges I faced.
Looking for libraries
Writing the code to parse .srt files and connect to the DeepL API from scratch would have been impractical within the 7 days I was given, so I looked for existing libraries. I found and used go-astisub for parsing the .srt files. However, I failed to find any libraries for the DeepL API. After digging around a little more, I found this repository with the code I needed. I copied the code into a separate file and stripped it down to only the components I needed.
The initial challenge
One of the main challenges I faced was that text must be sent to DeepL in complete sentences. However, a single sentence can be split over multiple sections of an .srt file.
This meant that my program needed a way to join the sections of text into sentences, send them one by one to DeepL, and then split the translated sentences back into sections. Another problem is that the translated sentence often doesn’t have the same number of words as the original one. Take this example (from English to Polish):
Do or do not there is no try.
Nie próbuj lub nie.
If the English sentence is split over multiple sections, how am I to split the Polish sentence back into those sections?
I solved this problem by using proportions. Taking the example above, you can see that the English sentence contains 8 words while the Polish sentence contains 4. This means that the Polish sentence is half as long as the English one. If the English sentence was originally split into sections of 2 and 6 words, the Polish sentence would be split into sections of 1 and 3 words. So,
do not there is no try.
would translate to
próbuj lub nie.
Of course, this is a rudimentary solution, as the meaning of the sentence may be broken up differently in the translation. However, with the time and knowledge that I had, it seemed to work fairly well.
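As a rough illustration, the proportion method could be sketched in Go as follows. This is a simplified sketch rather than the tool's actual code: the function name splitByProportion and the choice of cumulative rounding for the section boundaries are assumptions made for the example.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// splitByProportion (hypothetical name) distributes the words of a translated
// sentence across len(origCounts) sections. origCounts[i] is the number of
// words that section i held in the original sentence.
func splitByProportion(translated string, origCounts []int) []string {
	words := strings.Fields(translated)
	total := 0
	for _, c := range origCounts {
		total += c
	}
	sections := make([]string, len(origCounts))
	if total == 0 {
		return sections
	}
	start, seen := 0, 0
	for i, c := range origCounts {
		seen += c
		// Round the cumulative share so that boundaries never move backwards
		// and the last section always ends at the final word.
		end := int(math.Round(float64(seen) / float64(total) * float64(len(words))))
		sections[i] = strings.Join(words[start:end], " ")
		start = end
	}
	return sections
}

func main() {
	// Original split: "Do or" (2 words) + "do not there is no try." (6 words).
	for _, s := range splitByProportion("Nie próbuj lub nie.", []int{2, 6}) {
		fmt.Println(s)
	}
}
```

With the example sentence, the 4 Polish words are split 1 and 3, giving "Nie" and "próbuj lub nie.". When the proportions do not divide evenly, some sections may end up empty, which is one of the rough edges of this approach.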
After I came up with this method, building the rest of the tool was pretty straightforward. First, the .srt file is loaded. Then, only the text of the .srt file is extracted and joined into sentences. After that, the sentences are sent to DeepL and translated. The most confident (best) translation of each sentence is taken and split back into sections using the proportion method discussed earlier. Finally, the resulting text is added back into the original .srt file and written to the output file.
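The step of turning section text into sentences might be sketched like this. It is only an illustration under some assumptions: the type names part and sentence and the function groupSentences are hypothetical, and a sentence is assumed to end at any word that finishes with '.', '!' or '?', which will misfire on abbreviations. The sketch also records how many words each section contributed, since those counts are what the proportion method needs later.

```go
package main

import (
	"fmt"
	"strings"
)

// part records how many words of a sentence came from one subtitle section.
type part struct {
	Section int // index of the subtitle section
	Words   int // number of words that section contributed
}

// sentence is one complete sentence plus a record of where its words came from.
type sentence struct {
	Text  string
	Parts []part
}

// endsSentence applies a naive rule: terminal punctuation ends a sentence.
func endsSentence(word string) bool {
	return strings.HasSuffix(word, ".") ||
		strings.HasSuffix(word, "!") ||
		strings.HasSuffix(word, "?")
}

// groupSentences walks the sections word by word and closes a sentence
// whenever endsSentence reports true.
func groupSentences(sections []string) []sentence {
	var out []sentence
	var words []string
	var parts []part
	for i, sec := range sections {
		for _, w := range strings.Fields(sec) {
			words = append(words, w)
			if n := len(parts); n > 0 && parts[n-1].Section == i {
				parts[n-1].Words++
			} else {
				parts = append(parts, part{Section: i, Words: 1})
			}
			if endsSentence(w) {
				out = append(out, sentence{strings.Join(words, " "), parts})
				words, parts = nil, nil
			}
		}
	}
	if len(words) > 0 { // trailing text without terminal punctuation
		out = append(out, sentence{strings.Join(words, " "), parts})
	}
	return out
}

func main() {
	sections := []string{"Do or", "do not there is no try. I am", "ready."}
	for _, s := range groupSentences(sections) {
		fmt.Println(s.Text, s.Parts)
	}
}
```

Each translated sentence can then be split according to the word counts in Parts and the pieces appended to the sections they came from.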
DeepL rate limiting
Of course, the DeepL API was rate limited. At first, I kept hitting the rate limit, as I was sending all 250+ sentences of the test file at once. To solve this, I added a ‘rate’ flag to the command-line tool, which allows the user to control the rate at which sentences are sent to DeepL.
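One simple way to pace requests like this in Go is a time.Ticker. The sketch below is an assumption-laden illustration, not the tool's real code: it assumes the flag means "sentences per second", the function name translateAll is hypothetical, and the DeepL request is replaced by a stand-in function.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// translateAll (hypothetical name) calls translate for each sentence in order,
// starting at most one call per interval.
func translateAll(sentences []string, interval time.Duration, translate func(string) string) []string {
	out := make([]string, len(sentences))
	if interval <= 0 {
		interval = time.Nanosecond // time.NewTicker panics on non-positive intervals
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for i, s := range sentences {
		if i > 0 {
			<-ticker.C // wait for the next slot before sending another sentence
		}
		out[i] = translate(s)
	}
	return out
}

func main() {
	// Assumption: -rate means "sentences per second"; the real flag may differ.
	rate := flag.Float64("rate", 2, "sentences to send per second")
	flag.Parse()
	if *rate <= 0 {
		fmt.Println("rate must be positive")
		return
	}
	interval := time.Duration(float64(time.Second) / *rate)

	// fakeTranslate stands in for the real DeepL request.
	fakeTranslate := func(s string) string { return "[" + s + "]" }

	for _, t := range translateAll([]string{"One.", "Two.", "Three."}, interval, fakeTranslate) {
		fmt.Println(t)
	}
}
```

Sending sentences one at a time keeps the logic simple, at the cost of long files taking several minutes to translate.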
Submitting the task
To submit this task, I created a GitHub repository to store my code and submitted the link to my mentors. I also built several releases for different operating systems (Go supports cross-compilation). When I first submitted the task, the example I gave only contained 4 sections, so the task was marked as “More work needed” and returned to me. I then downloaded a much larger .srt file, which took several minutes to translate (due to rate limiting), and submitted that as an example. Only then was the task accepted and that glorious white tick appeared.
Overall, this was a very enjoyable task to build. It was even more satisfying to see my work accepted by the wonderful mentors. Unfortunately, my first Google Code-in will also be my last, as I have reached the age limit for participants. Despite that, I have really enjoyed Google Code-in 2017.