How to Optimize LL-HLS for Low Latency Streaming

In our Comprehensive Guide to Low Latency, we covered how LL-HLS works, what the end-to-end solution should look like, suitable use cases, and THEO’s recommendations for LL-HLS implementations. In this guide, we focus on how to tune LL-HLS for low latency streaming. We also introduce HESP and show where it improves on LL-HLS.

How to optimize LL-HLS for low latency streaming


LL-HLS builds on the successful HLS method for streaming video, originally to Apple devices. Whereas HLS, much like its DASH counterpart, adopts segments (typically a few seconds up to 10 seconds long) as the basic unit for fetching video content, LL-HLS allows fractions of a segment, called parts, to be individually addressed and fetched.

This has direct implications for latency and zapping times. The latency is not defined by the segment size but by the part size, since each part can be fetched as soon as it is available instead of waiting for the whole segment. This makes LL-HLS suited for low latency applications where end-to-end latencies of a few seconds are required and playback closely follows the live event. The smaller parts also allow playback to start more rapidly while keeping the latency small, because the player can start before the live segment is completely available. Moreover, as we will explain in this guide, in the right conditions playback can start at a part and not only at segment boundaries.


Which low latency streaming protocol is the best option depends on the use case and on the desired latency, bandwidth consumption, and scalability. In this first section, we go through the most important encoding and packaging parameters, as well as the buffer size, and discuss their impact on latency, video quality, bandwidth consumption, and resilience to network variations.


The GOP size, or the size of your Group of Pictures, is one of the main encoding parameters. It has a direct impact on video bitrate and video quality and an indirect impact on end-to-end latency. It determines how often a keyframe (or IDR frame) is available. In LL-HLS, the player requires a keyframe to start decoding, meaning it can start playback only at GOP boundaries. Longer GOPs cause a higher start-up delay and higher latency.


Apple’s recommended GOP size is 2 seconds. Typical LL-HLS implementations achieve a 3-second end-to-end latency when the GOP is set to 1 second. However, small GOP sizes come at the cost of higher bandwidth consumption. The smaller the GOP size, the more frequent the keyframes. Depending on the video, keyframes can be 10 times larger than P frames, so small keyframe intervals increase the video bitrate and hence the bandwidth consumption.
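To get a feel for why frequent keyframes cost bitrate, here is a rough Python sketch. It is a toy model and not the measurement behind the table: it assumes a GOP made of one keyframe followed only by equally sized P frames, a hypothetical frame rate of 30 fps, and the "10 times larger" keyframe ratio mentioned above. The function name and defaults are illustrative.

```python
def relative_bitrate(gop_seconds: float, fps: float = 30.0,
                     keyframe_ratio: float = 10.0) -> float:
    """Average frame size in units of one P frame.

    Toy model: each GOP is one keyframe (keyframe_ratio times a P frame)
    followed by P frames of equal size. B frames, scene changes and rate
    control are ignored.
    """
    frames_per_gop = max(1, round(gop_seconds * fps))
    return (keyframe_ratio + (frames_per_gop - 1)) / frames_per_gop


if __name__ == "__main__":
    for gop in (0.5, 1, 2, 6, 10):
        factor = relative_bitrate(gop)
        print(f"GOP {gop:>4} s -> {factor:.2f}x the bitrate of a P-frame-only stream")
```

Under these assumptions a 1-second GOP costs about 1.3 times the P-frame-only bitrate, while a 10-second GOP costs about 1.03 times. Real content behaves differently depending on motion and detail, which is why the measured values vary per video type.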

In this table, you can see how the video bitrate changes with different GOP sizes. To get a comprehensive view, four different types of video were tested:

  • A movie (Tears of Steel)
  • An animation (Big Buck Bunny)
  • A bike race TV program
  • A static screen streaming

For this calculation, the CRF (Constant Rate Factor) parameter is kept the same for all GOP sizes, forcing the encoder to keep the same video quality at every GOP size. As we can see, with larger GOP sizes we can keep the same video quality while using less bandwidth.

The bitrate reduction at large GOP sizes depends on the type of video. For example, for the static video type (e.g. screen streaming), we see up to a 70% reduction in bandwidth consumption when going from a GOP of 0.5 seconds to a GOP of 10 seconds. For the other videos, we still see up to a 20% reduction in video bitrate.


GOP size also has an impact on video quality. The larger the GOP size, the higher the video quality, because for the same bitrate we can put more detail in the P frames when the GOP is larger. We studied how the GOP size affects the video quality. To measure the video quality, we use the VMAF metric. Below is a brief explanation of VMAF.

What is VMAF?

Video Multimethod Assessment Fusion (VMAF) is a video quality metric designed by Netflix consolidating four different metrics:

  • Visual Information Fidelity (VIF): considers fidelity loss at four
    different spatial scales
  • Detail Loss Metric (DLM): measures detail loss and impairments
    which distract viewer attention
  • Mean Co-Located Pixel Difference (MCPD): measures the
    temporal difference between frames on the luminance component
  • Anti-noise signal-to-noise ratio (AN-SNR)

The VMAF score ranges from 0 to 100 (100 being identical to the reference video). A difference of 6 VMAF points represents a noticeable difference. The default VMAF model is used in this test.

The table below shows how GOP size affects the video quality for the different video types. For each encoded video, a VMAF score relative to the reference video has been calculated. The VMAF drop at smaller GOP sizes differs per video type. Except for the static screen stream, all videos show a significant VMAF drop between a GOP of 10 seconds and a GOP of 0.5 seconds. Big Buck Bunny drops by 8 points between a GOP of 10 seconds and a GOP of 1 second, which is a noticeable quality degradation.

Please note that the aim of this test was to see the impact of different GOP sizes on the VMAF scores. All videos are encoded at a maximum bitrate of 4 Mbps. The chosen 4 Mbps may not be the bitrate with the highest VMAF score for a given resolution, but finding the highest VMAF score for each resolution in the different videos is out of the scope of this test.

Based on the VMAF points, we see that for some types of video, such as static screen streaming, the quality does not improve much with a large GOP size, while you still gain a huge reduction in bandwidth consumption (Table 1). For other types of video, such as Big Buck Bunny, the video quality improves by up to 15 VMAF points (GOP of 10 seconds compared to GOP of 0.5 seconds), which is considerable since every 6 VMAF points is a visually noticeable difference. The Tears of Steel video shows yet another pattern: the VMAF improvement stays below 6 points (between a GOP of 1 second and a GOP of 10 seconds), but you still get a ~20% bitrate reduction at the largest GOP size (Table 1).


In LL-HLS live streaming, if we increase the GOP size to decrease the bandwidth consumption and increase the video quality, we need to sacrifice the short zapping time and/or the latency.

The player requires a keyframe to start decoding, meaning that a large GOP will impact the zapping time and latency of the stream. The player can either wait for the following GOP, implying a long startup time and low latency, or start playback of the current GOP, implying a short startup time but a potential latency of up to the GOP size. Having large GOPs with only one keyframe every 6 seconds, for example, means that the player can start playback at only one position every six seconds.

This doesn’t mean your zapping time will be six seconds, but it might require your player to start at a higher latency. With the 6 seconds example, starting playback immediately implies that the average additional latency at the start will be 3 seconds, and in the worst case it can reach up to 6 seconds.
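The arithmetic behind the 6-second example can be written down as a small Python sketch. It assumes the viewer joins at a moment that is uniformly distributed within the GOP and that the player starts immediately at the most recent keyframe; the function name is illustrative.

```python
def extra_startup_latency(gop_seconds: float) -> tuple[float, float]:
    """Return (average, worst-case) extra latency in seconds.

    Assumes playback starts at the most recent keyframe and that the
    join time is uniformly distributed within the GOP.
    """
    return gop_seconds / 2, gop_seconds


if __name__ == "__main__":
    for gop in (1, 2, 3, 6):
        average, worst = extra_startup_latency(gop)
        print(f"GOP {gop} s: average +{average} s, worst case +{worst} s")
```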


Based on this explanation, small GOP sizes seem extremely attractive. However, if you have a lot of keyframes, it increases inefficiency in compression, which means you will use more bandwidth and streaming quality will go down at the same bitrate. This effect becomes large when GOP sizes fall below 2s. In case you are interested in lower bandwidth consumption and reasonable start-up time, the recommendation from THEO’s side would be to set your keyframe interval to 2 to 3 seconds. On the other hand, if your priority is to have small start-up delays and low latency, the GOP size should be smaller and should be set in a way that all parts start with a keyframe. We go over part size in the next section.


In LL-HLS, the player is not limited to start the playback at segment boundaries and can start the playback at every independent part (the parts that start with a keyframe).

"The part size has a direct influence on the end-to-end Latency in LL-HLS. The smaller the part size is, the lower the latency will be. But it is not that simple."

Apple says that parts can be as short as 200 msec. But we need to keep in mind that in LL-HLS, the player must start playback with a keyframe. If a part does not start with a keyframe (which is the case when the part size is smaller than the GOP size), the player should either seek back to a part that starts with a keyframe or wait for the next keyframe before starting playback. For example, consider a GOP size of 2 seconds, a part size of 500 msec, and a playback request sent in the middle of a 6-second segment. The player needs a keyframe to start playback. It must either wait for the next keyframe, which arrives in the third upcoming part and means a zapping time of at least 1.5 seconds, or seek back two parts, which adds 1 second to the end-to-end latency.
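The example above can be reproduced with a short Python sketch. It is a simplified timeline model, not player code: it assumes the stream starts at time 0 with a keyframe, that keyframes fall on multiples of the GOP size, that the GOP is a whole number of parts, and that a part can only be fetched once it is complete. All names are illustrative and times are in milliseconds.

```python
def join_options_ms(gop_ms: int, part_ms: int, join_ms: int) -> tuple[int, int]:
    """Return (wait_ms, seek_back_ms) for a viewer joining at join_ms.

    wait_ms:      time until the next part that starts with a keyframe is complete.
    seek_back_ms: extra latency when starting at the newest complete part
                  that starts with a keyframe.
    """
    if gop_ms % part_ms != 0:
        raise ValueError("this sketch assumes the GOP is a whole number of parts")
    if join_ms < part_ms:
        raise ValueError("no complete part has been published yet")

    # End of the newest complete part at the moment of joining.
    live_edge = (join_ms // part_ms) * part_ms

    # Newest keyframe-aligned part that is already complete.
    last_independent = ((live_edge - part_ms) // gop_ms) * gop_ms
    seek_back = live_edge - last_independent

    # Next keyframe at or after the join time (ceiling division);
    # its part is complete one part duration later.
    next_keyframe = -(-join_ms // gop_ms) * gop_ms
    wait = next_keyframe + part_ms - join_ms
    return wait, seek_back


if __name__ == "__main__":
    # GOP 2 s, parts of 500 ms, joining 3 s into a 6 s segment.
    print(join_options_ms(2000, 500, 3000))  # (1500, 1000)
```

With these inputs the player either waits 1.5 seconds or starts 1 second behind the live edge, matching the two options described in the text.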


Ideally, the part size and the GOP size should be equal to get the lowest zapping time, because in that case all parts are marked as “independent” and the player can start playback at any part boundary. A smaller part size, however, leads to a lower minimum buffer size and therefore lower latency. On the other hand, a part size that is too small causes overhead because of the large number of HTTP requests that have to be handled.

If you can guarantee perfect network conditions and your main focus is the lowest end-to-end latency, we recommend a part size of 400 msec. If instead the network conditions are variable and you need smooth playback during network ups and downs as well as an extra-low zapping time, we recommend setting both your keyframe interval and part size to 1 second, as this strikes a balance between latency and viewer experience at start-up.


The segment size in LL-HLS does not directly impact the latency as it does in traditional HLS. In general, longer segments are attractive because they allow a larger GOP size, which means higher video quality and lower bandwidth consumption. On the other hand, in LL-HLS the segment size determines the number of parts you need to list in your playlist. As a result, it affects the size of the playlist (and how much data must be loaded in parallel with the media data). Long segments can therefore significantly increase the size of the playlist, causing overhead on the network and impacting streaming quality. Segments can’t be too small either, since that imposes a smaller GOP size and therefore lower video quality and higher bandwidth consumption.
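To illustrate how the part size drives the number of playlist entries per segment, the Python sketch below generates the EXT-X-PART lines for a single segment and marks the parts that start with a keyframe as independent. It assumes the segment starts with a keyframe and that keyframes fall on multiples of the GOP size; the URI naming scheme (seg7.part0.m4s and so on) is purely hypothetical, and a real packager emits further tags that are omitted here.

```python
def part_tags(segment_index: int, segment_ms: int, part_ms: int, gop_ms: int) -> list[str]:
    """Build the EXT-X-PART lines for one segment (times in milliseconds)."""
    tags = []
    for i, start in enumerate(range(0, segment_ms, part_ms)):
        duration = min(part_ms, segment_ms - start)  # last part may be shorter
        line = (f"#EXT-X-PART:DURATION={duration / 1000:.3f},"
                f'URI="seg{segment_index}.part{i}.m4s"')  # hypothetical URI scheme
        if start % gop_ms == 0:
            # This part starts with a keyframe, so playback can start here.
            line += ",INDEPENDENT=YES"
        tags.append(line)
    return tags


if __name__ == "__main__":
    tags = part_tags(segment_index=7, segment_ms=6000, part_ms=400, gop_ms=2000)
    print("\n".join(tags))
    print(f"{len(tags)} playlist entries for one 6-second segment")
```

With 400 msec parts, a 6-second segment needs 15 part entries, of which only 3 are independent at a 2-second GOP. With 1-second parts and a 1-second GOP, the same segment needs 6 entries and all of them are independent.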


Segment size should be equal to or larger than your GOP size. It cannot be too small due to consequent poor video quality and it cannot be too large because of the LL-HLS limitations mentioned above. Apple’s recommendation for segment size is 6 seconds for LL-HLS which is a good balance between video quality and overhead in the network. In HESP you won’t have such limitations for large GOP size and long segments which leads to better video quality and lower bandwidth consumption.


There is always a trade-off between a secured smooth playback in all (network) conditions and achieving the lowest possible latency. To cope with network and other variations, LL-HLS maintains a buffer to handle the jitter and unforeseen hiccups in the video transmission. The larger the buffer, the higher the tolerance for network issues, but also the higher the latency. In LL-HLS we have a default of 3 part durations in the buffer.

For example, with parts of 400 ms, your buffer will target a size of 1.2 s. Based on our tests, with correct settings for the part and GOP size and a slightly larger part size (for example, around 1 second), the buffer size can be slightly decreased without impact on the user experience. As a baseline, however, the buffer should never hold fewer than 2 parts.
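As a quick check of these numbers, here is a minimal Python sketch of the buffer target. It only encodes the two rules stated above (a default of 3 part durations and a floor of 2 parts); the function name is illustrative and times are in milliseconds to avoid floating-point rounding.

```python
def target_buffer_ms(part_ms: int, parts_in_buffer: int = 3) -> int:
    """Buffer target as a number of part durations (default: 3 parts)."""
    if parts_in_buffer < 2:
        raise ValueError("the buffer should never hold fewer than 2 parts")
    return parts_in_buffer * part_ms


if __name__ == "__main__":
    print(target_buffer_ms(400))      # 1200 ms with the default of 3 parts
    print(target_buffer_ms(1000, 2))  # 2000 ms with 1-second parts and 2 parts
```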

However, network conditions are not always perfect. Besides jitter, we also encounter drops and variations in network capacity. To cope with this varying bandwidth, ABR is needed. For ABR to work effectively, the buffer must be long enough to accommodate the quality switch before any glitch or rebuffering occurs in playback. Consider the worst case: the buffer size is 2 seconds, the segment is 6 seconds, the GOP size is 3 seconds, and the network bandwidth drops to half of the video bitrate near the end of the segment. The player needs to download a new part from a lower quality that starts with a keyframe. Because we are near the end of the segment and the GOP size is 3 seconds, neither the current part nor the previous part contains a keyframe, and the player has to download the third prior part to be able to switch the quality down. So you would need to download 3 seconds of data while you have only 2 seconds of buffer. If you reduce the GOP size to 2 seconds, you may still get stalls during the ABR switch.

Therefore, you need to increase the buffer size to make sure you can have a smooth quality switch. A larger buffer size means higher latency. You might consider reducing the GOP size to allow a proper ABR switch down without stalling, but as discussed earlier, a smaller GOP size comes with lower video quality and higher bandwidth consumption, which brings an extra challenge to the ABR itself.
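The worst case can be put into numbers with a small Python sketch. It is a back-of-the-envelope model, not an ABR algorithm: it assumes the player has to re-download all media from the last keyframe of the lower rendition, and that the lower rendition's bitrate happens to equal the available bandwidth. The 4 Mbps and 2 Mbps figures are hypothetical values chosen to match "bandwidth drops to half of the video bitrate".

```python
def switch_margin_s(buffer_s: float, refetch_media_s: float,
                    rendition_kbps: float, bandwidth_kbps: float) -> float:
    """Seconds of buffer left once the switch-down download completes.

    A negative result means the buffer runs dry (a stall) before the
    lower-quality data has arrived.
    """
    download_time_s = refetch_media_s * rendition_kbps / bandwidth_kbps
    return buffer_s - download_time_s


if __name__ == "__main__":
    # Hypothetical: top rendition 4000 kbps, bandwidth drops to 2000 kbps,
    # and the lower rendition is encoded at 2000 kbps.
    for gop_s in (3, 2, 1):
        margin = switch_margin_s(buffer_s=2, refetch_media_s=gop_s,
                                 rendition_kbps=2000, bandwidth_kbps=2000)
        print(f"GOP {gop_s} s: margin {margin:+.1f} s")
```

Under these assumptions a 3-second GOP leaves a margin of -1 second (a stall), and a 2-second GOP leaves no margin at all, so any further hiccup still causes a stall. Only a smaller GOP or a larger buffer gives real headroom.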


For low latency and fast start-up streaming with LL-HLS, it is important to have a clear understanding of the impact of each parameter on the final result. End-to-end latency depends directly on the part size. The zapping time, on the other hand, depends directly on the GOP size and cannot go lower than that, even with a smaller part size. So you get the lowest latency from the smallest part size, but that does not necessarily bring the shortest zapping time (for example, when the part is shorter than the GOP and is not one of its divisors, e.g. 1/2 or 1/3 of the GOP). Part sizes smaller than the GOP are not really helpful during an ABR quality switch either, as the switch can happen only at independent parts, which correspond to the GOP boundaries.

Therefore, the ideal situation to have the lowest zapping time and latency is to have the part and GOP size equal and as small as possible. A GOP size lower than 1 sec does not really make sense because of the poor video quality and high bandwidth needs, therefore a good value would be 1 second in order to achieve the lowest zapping time, latency and smooth ABR switches with 2 seconds buffer (2 parts). However, the GOP size of 1 second could be demanding for the bandwidth consumption. THEO's recommendation would be a GOP size of 2 seconds with the part size of 1 second and buffer size of 3 seconds which is a good combination for reasonable video quality, bandwidth consumption, latency and zapping time.
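The rules of thumb from this guide can be collected in a small Python checker. This is a sketch of the guidance given in the text, not a THEO tool: the class, field names and messages are illustrative, and the 6-second segment in the example is Apple's recommendation mentioned earlier, combined here with THEO's recommended GOP, part and buffer sizes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LlHlsSettings:
    """Illustrative container for the four parameters discussed (milliseconds)."""
    segment_ms: int
    gop_ms: int
    part_ms: int
    buffer_ms: int


def check(settings: LlHlsSettings) -> list[str]:
    """Return a list of warnings based on the rules of thumb in this guide."""
    problems = []
    if settings.segment_ms < settings.gop_ms:
        problems.append("segment size should be equal to or larger than the GOP size")
    if settings.gop_ms % settings.part_ms != 0:
        problems.append("part size is not a divisor of the GOP size")
    if settings.buffer_ms < 2 * settings.part_ms:
        problems.append("buffer holds fewer than 2 parts")
    if settings.gop_ms < 1000:
        problems.append("GOP below 1 second costs video quality and bandwidth")
    return problems


if __name__ == "__main__":
    recommended = LlHlsSettings(segment_ms=6000, gop_ms=2000, part_ms=1000, buffer_ms=3000)
    print(check(recommended))  # [] (no warnings)
    print(check(LlHlsSettings(segment_ms=6000, gop_ms=2000, part_ms=750, buffer_ms=1000)))
```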


HESP is a next-generation online video delivery technology that outperforms the current generation of protocols for low latency streaming at scale. It is an ultra-low latency streaming protocol delivered over HTTP/1.1 with Chunked Transfer-Encoding and Range requests, using a minimalistic manifest that requires only infrequent updates. Two complementary streams are required:

  • The Initialization stream, which contains keyframes that make it
    possible to start playback at any given moment and not necessarily
    at the beginning of a segment or at a keyframe interval, and
  • The Continuation stream, which contains the IPB frames and can be
    played right after the keyframe from the Initialization stream.

HESP offers a broadcast-like experience with sub-second latency and zapping time on any device or platform. It also delivers very low bandwidth consumption compared to other ultra-low latency streaming protocols such as WebRTC. Being delivered over HTTP, it is compliant with standard CDNs and offers low-cost scaling.


As mentioned above, HESP is based on using two streams for each quality/track: 1.) Initialization stream to rapidly start new streams; 2.) Continuation stream for use in normal operation.

What is Initialization Stream?

The initialization stream consists of initialization packets, one for each frame position. The initialization packets are individually addressable. Each contains an IDR frame for its frame position, making it possible to start playback at any given frame, and is packaged in the ISOBMFF format.

What is Continuation Stream?

The continuation stream is packaged in CMAF-CTE, albeit with specific configurations for low latency, and can start playback immediately after an initialization frame, allowing for very fast channel start and switch times. It is addressed using byte-range requests and is served using Chunked Transfer Encoding for low latency.

The segments in the continuation stream can be lengthy without any limitation for the low latency and fast zapping time, making it possible to have a large GOP size and hence lower bandwidth consumption and higher video quality.


In order to implement HESP, only two components of the video value chain need tailoring: the packager and the player. HESP works with regular encoders and also regular CDNs, as long as these support CTE and byte ranges.


HESP provides sub-second end-to-end latency together with large GOP sizes (10-12 seconds). Thanks to the initialization stream, the quality switch in ABR is not limited to the GOP boundaries and it can happen at any given moment. This means HESP is not limited to a small GOP size. Thus, the GOP size can be kept large while having a small buffer size (HESP has sub-second target buffer) and so it is possible to have low latency and smooth quality switch at any time without risk of rebuffering.

By setting the same latency target as LL-HLS in HESP (~3sec) you would have more margin to encode the video more efficiently resulting in a lower video bitrate for the same video quality and so you could save bandwidth consumption.

As described earlier, LL-HLS cannot really exploit the small part size as there are also other consequences to be taken into account; no matter how small the part size is in LL-HLS, you are limited to the keyframe interval to be able to switch the quality in bad network conditions. In HESP, on the other hand, starting the playback is not limited to the GOP boundaries. Therefore, you do not need to sacrifice video quality (smaller GOP) to have the lowest end-to-end latency.

While LL-HLS cannot really exploit the small part size to have low latency in bad network conditions, HESP offers a small buffer size, low latency, large GOP size, and higher video quality all at the same time.
