Scene change detection during encoding and key frame extraction code
In a previous post we explained how to generate thumbnails for each scene change using Edit Decision Lists from your non-linear editor of choice. Sometimes, though, you won't have an EDL along with your video assets (for instance, when you are using off-airs or videos edited by somebody else). In those situations we can still create a thumbnail for each scene change thanks to how frames are structured in digital compression.
According to Wikipedia, a GOP structure specifies the order in which intra- and inter-frames are arranged. A Group Of Pictures can contain the following frame types:
- I-frame (intra coded picture) - reference picture, which represents a full image and which is independent of other picture types. Each GOP begins with this type of picture.
- P-frame (predictive coded picture) - contains motion-compensated difference information from the preceding I- or P-frame.
- B-frame (bidirectionally predictive coded picture) - contains difference information from the preceding and following I- or P-frame within a GOP.
The GOP structure is often referred to by two numbers, for example M=3, N=12, which corresponds to IBBPBBPBBPBBI. The first gives the distance between two anchor frames (I or P) and the second the distance between two I-frames (the GOP length).
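To make the M/N notation concrete, here is a minimal Python sketch that expands the two numbers into a frame-type string. The function name gop_pattern is made up for this illustration, and it assumes a simple fixed pattern in display order; real encoders with flexible GOPs will deviate from it.

```python
def gop_pattern(m: int, n: int) -> str:
    """Return the frame types of one GOP plus the I-frame that opens the next one.

    m: distance between two anchor frames (I or P)
    n: distance between two I-frames (GOP length)
    """
    if m < 1 or n < 1:
        raise ValueError("m and n must be positive")
    frames = []
    for i in range(n):
        if i == 0:
            frames.append("I")  # every GOP begins with an I-frame
        elif i % m == 0:
            frames.append("P")  # an anchor frame every m positions
        else:
            frames.append("B")  # frames between two anchors
    return "".join(frames) + "I"  # the I-frame that starts the next GOP


print(gop_pattern(3, 12))  # IBBPBBPBBPBBI
```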
In order to use key frames as a scene change detection method we need a flexible GOP structure, with minimum (min-keyint) and maximum (keyint) key frame intervals set when encoding our video.
For example, a minimum setting equal to the frame rate of the video will prevent the encoded video from having two consecutive key frames within a second of each other. Please note that if your video has scenes shorter than a second you won't be able to detect all scene changes unless you reduce this setting.
Similarly, a maximum setting ensures that a key frame is inserted at least every X frames. A recommended setting is 10 times the frame rate, which equates to at most 10 seconds of video between key frames. We can set this to infinite so that no non-scenecut key frames are ever inserted, although this might cause problems when seeking (if you try to skip to a part of the video without a key frame, there won't be any video until the next key frame is reached).
In addition, we need to define the threshold for scenecut detection (scenecut). The encoder calculates a metric for every frame to estimate how different it is from the previous frame. If the value is lower than the threshold, a scenecut is detected, so a higher threshold flags more scene changes.
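As a rough illustration of the settings above, the following Python sketch derives the two intervals from the frame rate and assembles an encoding command. Several things here are assumptions rather than part of the original workflow: the helper names, the file name source.mov, the use of a recent ffmpeg build with libx264 and its -x264-params option (the 0.8 build mentioned in this post may need a different syntax), and a scenecut value of 40, which is x264's default.

```python
def x264_keyframe_params(fps: int, max_seconds: int = 10, scenecut: int = 40) -> str:
    """Build an x264 option string for a flexible GOP.

    min-keyint = one second of video, keyint = max_seconds of video.
    """
    min_keyint = fps              # no two key frames within one second
    keyint = fps * max_seconds    # at least one key frame every max_seconds
    return f"keyint={keyint}:min-keyint={min_keyint}:scenecut={scenecut}"


def encode_command(src: str, dst: str, fps: int) -> list:
    """Return an ffmpeg argument list (assumes a build with libx264)."""
    return [
        "ffmpeg", "-i", src,
        "-c:v", "libx264",
        "-x264-params", x264_keyframe_params(fps),
        dst,
    ]


# Hypothetical input file name, 25 fps source.
print(" ".join(encode_command("source.mov", "yourvideo.mp4", 25)))
# ffmpeg -i source.mov -c:v libx264 -x264-params keyint=250:min-keyint=25:scenecut=40 yourvideo.mp4
```

For fractional frame rates such as 29.97, rounding to the nearest integer before calling the helper is probably the simplest choice.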
Once we have encoded the video we can run ffmpeg (I've used 0.8-win64-static) with the following parameters:
ffmpeg -vf select="eq(pict_type\,PICT_TYPE_I)" -i yourvideo.mp4 -vsync 2 -s 73x41 -f image2 thumbnails-%02d.jpeg -loglevel debug 2>&1 | grep "pict_type:I -> select:1" | cut -d " " -f 6 - > keyframe-timecodes.txt
What follows -vf in an ffmpeg command line is a filtergraph description. The select filter chooses which frames to pass to the output. The expression compares the variable “pict_type” with the constant “PICT_TYPE_I”. In short, we are only passing key frames to the output.
-vsync 2 prevents ffmpeg from generating more than one copy of each key frame.
-f image2 writes video frames to image files. The output filenames are specified by a pattern, which can be used to produce sequentially numbered series of files. The pattern may contain the string "%d" or "%0Nd".
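-loglevel debug 2>&1 sends ffmpeg's debug log through the pipe. grep keeps only the lines for selected I-frames, and cut extracts the sixth space-separated field (the t: timestamp) into keyframe-timecodes.txt. Before filtering, the debug output looks like this: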
[select @ 0000000001A88BE0] n:0 pts:0 t:0.000000 pos:1953 interlace_type:P key:0 pict_type:I -> select:1.000000
[select @ 0000000001A88BE0] n:1 pts:40000 t:0.040000 pos:4202 interlace_type:P key:0 pict_type:P -> select:0.000000
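If grep and cut are not available (on Windows, for instance), the same filtering can be sketched in Python. This assumes the debug log has been saved as text in the format shown above; the function name keyframe_times is invented for this example, and the sample log consists of the two lines from this post.

```python
import re
from typing import List

# Matches the t: timestamp on lines where an I-frame was selected.
SELECTED_I_FRAME = re.compile(r"\bt:(\d+(?:\.\d+)?).*pict_type:I -> select:1")


def keyframe_times(log_text: str) -> List[float]:
    """Return the timestamps (in seconds) of the selected I-frames."""
    times = []
    for line in log_text.splitlines():
        match = SELECTED_I_FRAME.search(line)
        if match:
            times.append(float(match.group(1)))
    return times


SAMPLE_LOG = """\
[select @ 0000000001A88BE0] n:0 pts:0 t:0.000000 pos:1953 interlace_type:P key:0 pict_type:I -> select:1.000000
[select @ 0000000001A88BE0] n:1 pts:40000 t:0.040000 pos:4202 interlace_type:P key:0 pict_type:P -> select:0.000000
"""

print(keyframe_times(SAMPLE_LOG))  # [0.0]
```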
Finally, I can convert “keyframe-timecodes.txt” into a chapter navigation list and use the thumbnails to navigate the video:
<a onclick="jwplayer().seek(0); return false" href="#"><img src="thumbnails-01.jpeg"></a>
<a onclick="jwplayer().seek(1.36); return false" href="#"><img src="thumbnails-02.jpeg"></a>
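The conversion can be automated with a short script. The Python sketch below assumes that keyframe-timecodes.txt holds one entry per line in the form t:0.000000 (the sixth field extracted by cut) and that thumbnails are numbered from 01 in the same order. The function names are made up for this example.

```python
def read_timecodes(lines):
    """Parse lines such as 't:1.360000' into seconds, skipping blank lines."""
    return [float(line.split(":", 1)[1]) for line in lines if line.strip()]


def format_seconds(seconds: float) -> str:
    """Format with millisecond precision, dropping trailing zeros (1.360 -> 1.36)."""
    return format(seconds, ".3f").rstrip("0").rstrip(".")


def chapter_links(times, pattern="thumbnails-%02d.jpeg"):
    """Build one JW Player seek link per key frame, pairing it with its thumbnail."""
    links = []
    for index, seconds in enumerate(times, start=1):  # thumbnails start at 01
        links.append(
            f'<a onclick="jwplayer().seek({format_seconds(seconds)}); return false" href="#">'
            f'<img src="{pattern % index}"></a>'
        )
    return "\n".join(links)


times = read_timecodes(["t:0.000000", "t:1.360000"])
print(chapter_links(times))
```

In practice the list passed to read_timecodes would come from open("keyframe-timecodes.txt"), iterated line by line.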