A Way to Edit YouTube Auto-Generated Captions

YouTube can generate subtitles automatically. The timing of these captions is excellent, but the text sometimes contains errors. If the video is our own, we can correct the transcript directly in YouTube. If it belongs to someone else, YouTube does not allow us to edit it. Amara is a great website that makes it possible to edit the subtitles of YouTube videos: it creates a new set of subtitles on top of the original video. All you need to do is copy the embed code from Amara and insert it into your website. This seems like a great way to edit YouTube captions. However, we run into the following problems if we take this approach:

  • Amara needs an SRT file to start with, so we have to download one from YouTube. We can use www.ccsubs.com to get the SRT file and then upload it to Amara.
  • But we end up with messy timing sync in Amara, so we have to fix the timing problem on top of editing the transcript.

I have come up with my own solution to the problems above. It takes the following steps:

  1. Download the timedtext XML file from YouTube.
  2. Convert the XML into an SRT file.
  3. Solve the timing problem.

Download the timedtext XML file from YouTube

Open the YouTube video in Chrome. Right-click on the page and select ‘Inspect’.

Click the ‘Network’ tab in the panel that opens.

Then click the ‘CC’ button in the YouTube player.

New entries will appear in the Network tab. Find one whose name starts with ‘timedtext?’, right-click it, and choose ‘Open link in new tab’.

We can now see the XML file. We just need to use ‘Save as’ to save this page to our local computer. This is the XML file we will use in the next step.
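
Before moving on, it can be worth checking that the saved file really is caption XML. The following is only a minimal sketch and is not taken from the repository. It assumes the file uses the layout in which each caption is a <text> element with start and dur attributes; other timedtext layouts exist and would need different element names. The function name count_cues and the sample string are made up for illustration.

```python
import xml.etree.ElementTree as ET

def count_cues(root):
    """Count caption cues, assuming each is a <text> element with a start attribute."""
    return sum(1 for el in root.iter("text") if "start" in el.attrib)

# Hypothetical sample in the assumed layout; a real file is much longer.
sample = (
    '<transcript>'
    '<text start="0.5" dur="2.1">hello</text>'
    '<text start="2.6" dur="1.9">world</text>'
    '</transcript>'
)
root = ET.fromstring(sample)
print(root.tag, count_cues(root))  # prints: transcript 2
```

For a saved file, ET.parse("wk3-1.xml").getroot() gives the root element to pass in. If the count comes back as 0, the file probably uses a different layout than the one assumed here.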

Convert the XML into an SRT file

I wrote a Python script that converts a timedtext XML file into an SRT file. For the full code, please see xml2srt.py in my GitHub repository. The script is run with the following syntax:

    python xml2srt.py [xml-file1] [xml-file2] [xml-file3] …

For example, if we have the following two XML files, the script can be run with:

    python xml2srt.py wk3-1.xml wk3-2.xml

After this, SRT files are generated with the same base filenames. In the example above, wk3-1.srt and wk3-2.srt are generated.
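
To give an idea of what such a conversion involves, here is a simplified sketch. It is not the actual xml2srt.py, and it rests on assumptions: each caption is a <text> element whose start and dur attributes are in seconds, and the caption text may contain leftover HTML entities. The names to_srt_time and timedtext_to_srt are made up for illustration.

```python
import html
import xml.etree.ElementTree as ET

def to_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3600000)
    m, ms = divmod(ms, 60000)
    s, ms = divmod(ms, 1000)
    return "%02d:%02d:%02d,%03d" % (h, m, s, ms)

def timedtext_to_srt(xml_data):
    """Convert timedtext XML (assumed <text start= dur=> layout) to SRT text."""
    root = ET.fromstring(xml_data)
    blocks = []
    for el in root.iter("text"):
        start = float(el.attrib["start"])
        dur = float(el.attrib.get("dur", 0))
        # The parser already decodes one level of entities; unescape again
        # in case the text was double-escaped (e.g. &amp;#39; for an apostrophe).
        text = html.unescape(el.text or "").replace("\n", " ").strip()
        if not text:
            continue  # skip empty cues
        blocks.append("%d\n%s --> %s\n%s\n" % (
            len(blocks) + 1, to_srt_time(start), to_srt_time(start + dur), text))
    return "\n".join(blocks)

# Hypothetical two-cue sample in the assumed layout.
sample = (
    '<transcript>'
    '<text start="0.5" dur="2.1">hello &amp;#39;world&amp;#39;</text>'
    '<text start="2.6" dur="1.9">again</text>'
    '</transcript>'
)
print(timedtext_to_srt(sample))
```

Each SRT block is a running number, a "start --> end" line, and the caption text, with a blank line between blocks. The end time is simply start plus dur, which is why cues can overlap when YouTube gives a caption a longer duration than the gap to the next one.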

Solving the timing problem

I wrote another Python script to solve the timing sync problem. For the full code, please refer to srt2std.py in my GitHub repository. The script is run with the following syntax:

    python srt2std.py [srt-file1] [srt-file2] [srt-file3] …

For example, if we have the two SRT files generated in the previous step, the script can be run with:

    python srt2std.py wk3-1.srt wk3-2.srt

Afterward, new SRT files are generated. In the example above, wk3-1-2.srt and wk3-2-2.srt are generated.
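
As a rough illustration of one kind of timing cleanup, the sketch below clips each cue so that it ends no later than the next cue starts. This is an assumption about what causes the messy sync (overlapping cues), not a description of what srt2std.py actually does, and the real script may handle more cases. The names to_ms, to_stamp and remove_overlaps are made up for illustration, and the input is assumed to use plain "\n" line endings.

```python
import re

TIME = re.compile(r"(\d+):(\d\d):(\d\d),(\d\d\d)")

def to_ms(stamp):
    """Parse an SRT timestamp (HH:MM:SS,mmm) into milliseconds."""
    h, m, s, ms = map(int, TIME.match(stamp.strip()).groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms

def to_stamp(ms):
    """Format milliseconds as an SRT timestamp."""
    h, ms = divmod(ms, 3600000)
    m, ms = divmod(ms, 60000)
    s, ms = divmod(ms, 1000)
    return "%02d:%02d:%02d,%03d" % (h, m, s, ms)

def remove_overlaps(srt_text):
    """Return SRT text in which no cue ends after the following cue starts."""
    cues = []
    for block in srt_text.strip().split("\n\n"):
        lines = block.split("\n")
        start, end = lines[1].split(" --> ")
        cues.append([to_ms(start), to_ms(end), "\n".join(lines[2:])])
    # Clip the end of each cue to the start of the next one.
    for cur, nxt in zip(cues, cues[1:]):
        cur[1] = min(cur[1], nxt[0])
    return "\n".join(
        "%d\n%s --> %s\n%s\n" % (i, to_stamp(s), to_stamp(e), t)
        for i, (s, e, t) in enumerate(cues, 1)
    )

# Hypothetical sample: the first cue runs past the start of the second.
sample = (
    "1\n00:00:00,500 --> 00:00:04,000\nhello\n\n"
    "2\n00:00:02,600 --> 00:00:06,000\nworld\n"
)
print(remove_overlaps(sample))
```

In the sample, the first cue originally ends at 00:00:04,000 and is clipped to 00:00:02,600, where the second cue begins. Start times are left untouched, so the good timing from YouTube is preserved.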

Now we can upload these SRT files to Amara and start editing the transcripts. The timing sync is now correct, so you only need to fix the words that the YouTube auto-captioning got wrong. After editing the transcript, copy the embed code from Amara and insert it into the HTML of the website you prefer.

   

That’s all. Feel free to modify the code to suit your own needs. It should be possible to merge steps 2 and 3. It would also be great if the timedtext XML could be downloaded given just a YouTube video ID. I hope this is helpful.
