THEO Blog

Operating HESP with low encoding costs

The High Efficiency Streaming Protocol (HESP) comes with a lot of advantages. It allows for sub-second latency over standard HTTP CDNs (and the cost-to-scale benefit they bring), with unrivalled channel change times. The high QoE HESP delivers is the result of its main difference from alternative protocols: the use of Initialization Streams, which are streams containing IDR frames at a very high rate. By using an IDR from these Initialization Streams to kick-start the so-called Continuation Streams (which are similar to your run-of-the-mill HLS or MPEG-DASH segments), playback of a new track can start within a few hundred milliseconds, which in turn enables fast channel changes and allows buffer sizes and latency to be reduced dramatically. Initialization Streams do come at a cost: even though they are not streamed to end viewers, they need to be encoded. In this article, we'll dive into the actual encoding impact of HESP compared to normal (LL-)HLS and (low latency) MPEG-DASH.

Comparing HESP from a cost perspective

The HESP specification was published through the IETF earlier this summer, and ever since we've seen a significant increase in interest from the industry. The arguments most often heard are the need to increase the overall quality of experience and ever-growing viewer expectations. It's an arms race in which services strive for snappy startup times, high quality viewing and low latency delivery (we all want to see it first) to grow and maintain their footprint amid the streaming services popping up left and right. As in every arms race, keeping your cost in check is crucial in order to keep up. Even though we would all want it, budgets are not infinite.

When comparing HESP with other streaming protocols, we see several advantages from a cost perspective. Streaming protocols which usually target sub-second latencies, such as WebRTC, are often session based. This mostly means stateful streaming servers are needed. As many of us experienced when scaling RTMP servers in the past, this can be very painful and costly. This pain and cost (and the fact that RTMP's non-HTTP traffic is often blocked by firewalls) is a large part of why HTTP-based streaming protocols such as HLS and MPEG-DASH took over: they are stateless, which allows them to scale over standard HTTP CDNs. HESP, with its HTTP-based transport, can scale over those same CDNs. So far so good.

Most HTTP-based low latency protocols however have another disadvantage: GOP size dramatically impacts the end-to-end latency which can be achieved with low latency HLS and MPEG-DASH profiles, and reducing the GOP size hurts compression efficiency. While the type of content greatly influences the size of this loss, the huge volumes of viewing minutes popular services see and the associated bandwidth cost force us to watch even small percentages. Here HESP brings another advantage, as its latency is not linked to its GOP size: the (often large) independent frames can be placed where it makes sense for the content, with no set interval needed, allowing for a highly optimal quality-to-bitrate ratio. In some tests, bandwidth savings of about 20% become possible compared to HLS or MPEG-DASH. A percentage which is very significant.
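To get a feel for what a 20% bitrate saving means at scale, here is a back-of-the-envelope sketch. The viewing minutes, average bitrate and CDN price below are purely hypothetical numbers chosen to illustrate the arithmetic, not figures from our tests.

```python
# Back-of-the-envelope egress cost for a month of streaming.
# All inputs are hypothetical and only illustrate the arithmetic.

def monthly_egress_gb(viewing_minutes: float, avg_bitrate_mbps: float) -> float:
    """Egress volume in GB: minutes -> seconds -> megabits -> gigabytes."""
    return viewing_minutes * 60 * avg_bitrate_mbps / 8 / 1000

viewing_minutes = 100_000_000   # hypothetical: 100M viewing minutes per month
avg_bitrate = 3.0               # hypothetical average delivered bitrate (Mbps)
price_per_gb = 0.02             # hypothetical CDN price (USD per GB)

baseline_cost = monthly_egress_gb(viewing_minutes, avg_bitrate) * price_per_gb
hesp_cost = monthly_egress_gb(viewing_minutes, avg_bitrate * 0.8) * price_per_gb

# -> baseline: $45,000 / month, with 20% savings: $36,000
print(f"baseline: ${baseline_cost:,.0f} / month, with 20% savings: ${hesp_cost:,.0f}")
```

At these (made-up) volumes the saving is $9,000 per month; the point is simply that a fixed percentage compounds with every viewing minute delivered.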

There is however another side to that medal. The reason why HESP is able to decouple its GOP size from its latency is the existence of Initialization Streams. For every quality at which the content is presented, two feeds need to be generated. In a worst case scenario, simple logic would dictate that this doubles the encoding cost in practice. With 4K encoders being expensive and plenty of services seeing massive usage on some live channels but only select audiences on others, the question is whether this additional encoding cost is worth it: will the extra encoding cost of using HESP balance out against the cost saved on delivery scaling and bandwidth, and be worth the increase in QoE?

Why are these Initialization Streams needed?

Encoding HESP Initialization Streams comes at a cost. The question of course is what the size of this cost is, and if we can influence it. In order to do this, it is important to understand why exactly these streams are needed.

Initialization Streams offer the ability to start decoding content fast. They contain independent frames, which in normal streams are available only at the start of a GOP, meaning once every 2 to 10 seconds depending on content and configuration. As a (video) decoder can only start playback at an independent frame, one such frame is needed before anything can be shown. This matters for stream startup, but also when seeking or when switching between qualities using ABR. With HLS and MPEG-DASH (and the same can be said for other protocols such as WebRTC), a client has two choices: either

  1. wait for the next independent frame to appear in the stream, or 
  2. download the last known independent frame and start playback from there. 

In the first scenario, startup time will be impacted. Even for a small GOP size of two seconds, this means an average increase in startup time of one second, and a worst case where two additional seconds are added. This is usually not an approach one wants to take given the importance of startup time in viewer churn, especially not if your GOP size is larger than two seconds (just imagine the churn if all viewers had to wait five additional seconds before playback starts!).

In the second scenario, you increase your latency. Again, for a small GOP of two seconds you will see an average latency increase of one second, and a worst case where the latency increases by another two seconds. While that does not seem bad at all in a world where average latencies of HLS and MPEG-DASH streams are north of 20 seconds, it is bad when you want to reduce latency: every second counts. For HESP, which offers sub-second latencies, this would mean the latency increases by 250-500%.
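Both scenarios follow from the same arithmetic: a viewer joins (or switches) at a random point within a GOP, so the expected distance to the nearest independent frame is half the GOP size, and the worst case is a full GOP. A minimal sketch of that calculation:

```python
def extra_wait_seconds(gop_size_s: float) -> tuple[float, float]:
    """Average and worst-case extra wait (startup delay in scenario 1,
    added latency in scenario 2) when playback can only begin at an
    independent frame placed once per GOP."""
    # The join point is uniformly distributed within the GOP, so the
    # expected distance to an IDR is half a GOP; worst case, a full GOP.
    return gop_size_s / 2, gop_size_s

print(extra_wait_seconds(2.0))   # two-second GOP: (1.0, 2.0)
print(extra_wait_seconds(5.0))   # five-second GOP: (2.5, 5.0)
```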

Thanks to the Initialization Streams however, a player can at any point in time retrieve an independent frame, inject this in the decoder, and continue playback from the “normal” Continuation Stream. The advantage of this is not to be underestimated:

  1. Upon the start of playback, we can start at any point in time, at the right latency.
  2. When facing network issues, we can rapidly switch to a lower bitrate stream, reducing the need of large buffers which need to account for startup time of a new stream. This in turn has an impact on the latency at which we can stream.
  3. In case of buffer underruns, we can recover quickly by restarting playback in a fraction of a second, reducing possible stall durations.
  4. For viewers wanting to select a viewing angle, we can instantly switch to the other feed and provide an optimal viewer experience, without any delay or black frame insertion.
  5. Upon server issues, failover to alternative CDNs can be executed quickly and efficiently, minimising (or even eliminating altogether) downtime from a viewer perspective.
  6. When the network quality increases and throughput goes up, we can very quickly switch towards a higher bitrate stream and provide a better quality towards our viewers.

For the last scenario, one could argue that switching up does not have to be instantaneous (but it should not take extremely long either). As a result, when we look at commonalities between these cases, we see that in an ideal scenario we can swiftly switch towards:

  • Our default quality at which we want the player to start playback, covering cases 1 and 4.
  • A bitrate lower than the current playback bitrate (but preferably not just the lowest) which allows us to cover cases 2, 3 and 5.

One could argue that as a result, we need Initialization Streams with the full frame rate for only a select number of bit rates offered in the ladder, and that for others no Initialization Streams, or Initialization Streams with less independent frames, would suffice. In the next section, we’ll explore exactly this, as well as the impact on user experience and cost.

Tweaking your Initialization Streams

It is not a requirement that every quality in your bitrate ladder contains an Initialization Stream at the full frame rate of the Continuation Stream. A frame of the Initialization Stream is needed to do fast startup. If only a sparse initialization feed is available, startup will be slower. As a result one could argue that in an ideal scenario the full frame rate is required. However, not every stream/quality necessarily needs to start up fast. For example:

  • after a drop in network quality, it can be acceptable to wait a few moments before switching back up to a higher bandwidth,
  • when a viewer initiates a viewing session, you do not know the available network bandwidth and hence do not need every quality, but might want to pick a solid default where you want to start.

It seems that most often, we need to be able to start fast on the lower bandwidth streams. This contrasts nicely with the cost of encoding these Initialization Streams: higher bitrate encodes are often more expensive than lower bandwidth encodes. As a result, choosing not to encode the Initialization Stream for the top quality (or encoding it at only a fraction of the frame rate) can have a big impact on the encoding cost. The top quality stream is usually not required for fast startup anyway: at startup the available bandwidth is unknown, and starting at too high a bandwidth will take unnecessarily long and be followed by a drop in quality. As a result, excluding it from the encoding setup (and reusing IDR frames from the Continuation Stream to create a sparse Initialization Stream) makes a lot of sense from a cost perspective.

There are a number of options available:

  1. Generate a full ladder of Initialization Streams.
  2. Generate only the lowest bitrate Initialization Stream, to allow for rapid switching down and low latency playback.
  3. Generate the lowest bitrate Initialization Stream and (one or more) sensible default bandwidth Initialization Streams, to allow for low latency playback and fast startup on those default bandwidths.
  4. Generate the Initialization Streams of (2) or (3), plus sparse Initialization Streams with a reduced number of frames per second for the other bandwidths, to allow for faster switching towards those bandwidths.
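The four options can be expressed as a small configuration helper. This is a sketch only: the ladder, the strategy names, the 1 fps sparse rate and the choice of 540p as the "sensible default" are illustrative assumptions, not part of the HESP specification.

```python
# Map each rendition to the frame rate of its Initialization Stream
# (None = no dedicated Initialization Stream for that rendition).
LADDER = ["1080p", "720p", "540p", "360p"]   # highest to lowest bitrate

def init_stream_plan(strategy: str, sparse_fps: int = 1) -> dict:
    lowest = LADDER[-1]
    if strategy == "full":             # option 1: full ladder
        return {q: 30 for q in LADDER}
    if strategy == "lowest-only":      # option 2: lowest bitrate only
        return {q: 30 if q == lowest else None for q in LADDER}
    if strategy == "lowest+default":   # option 3: lowest plus a sensible default
        return {q: 30 if q in ("540p", lowest) else None for q in LADDER}
    if strategy == "sparse-rest":      # option 4: sparse streams for the rest
        return {q: 30 if q == lowest else sparse_fps for q in LADDER}
    raise ValueError(f"unknown strategy: {strategy}")

print(init_stream_plan("sparse-rest"))
# -> {'1080p': 1, '720p': 1, '540p': 1, '360p': 30}
```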

Comparing playback with full frame rate Initialization Streams and sparse Initialization Streams

Based on our testing, we can see that the difference in terms of encoding capacity needed between these options is rather significant. In order to validate this, our team set up a test ABR ladder ranging from 360p to 1080p at 30fps, with:

  • 1080p30@4000kbps
  • 720p30@2500kbps
  • 540p30@1000kbps
  • 360p30@400kbps

We measured overall CPU load across a number of different scenarios in which our encoder would take in one live RTMP feed and produce new RTMP feeds for the Initialization and Continuation Streams towards an HESP packager. As a baseline, we also measured a setup in which no Initialization Streams were generated, which matches what other streaming protocols like HLS and MPEG-DASH require.

What we see is that generating all Initialization Streams increases the encoding needs by almost 70%. In contrast, generating just the lowest bandwidth Initialization Stream reduces this to an increase of only about 3%.

The first four columns show the Initialization Stream setup: the frame rate of the Initialization Stream per rendition, with "/" meaning no dedicated Initialization Stream for that rendition.

| 1080p | 720p | 540p | 360p | Encoding CPU usage | Impact |
| --- | --- | --- | --- | --- | --- |
| / | / | / | / | 100.0% | High latency, high startup times (HLS & MPEG-DASH reference) |
| 30fps | 30fps | 30fps | 30fps | 168.9% | Low latency, low startup times, maximal additional encoding |
| / | 30fps | 30fps | 30fps | 130.7% | Low latency, higher ABR up switching time to 1080p (one GOP size) |
| / | / | 30fps | 30fps | 115.2% | Low latency, higher ABR up switching time to 1080p and 720p, ABR must switch down to 540p or 360p |
| / | / | / | 30fps | 103.1% | Low latency, higher ABR up switching time, ABR must switch down to 360p |
| 10fps | 10fps | 10fps | 30fps | 127.8% | Low latency, +100ms ABR up switching time |
| 5fps | 5fps | 5fps | 30fps | 115.2% | Low latency, +200ms ABR up switching time |
| 2fps | 2fps | 2fps | 30fps | 106.1% | Low latency, +500ms ABR up switching time |
| 1fps | 1fps | 1fps | 30fps | 103.4% | Low latency, +1000ms ABR up switching time, ABR must likely switch down to 360p |
| 1fps | 30fps | 1fps | 30fps | 123.4% | Low latency, fast up switching & startup for 720p, +1000ms for other qualities |
| 1fps | 1fps | 30fps | 30fps | 114.8% | Low latency, fast up switching & startup for 540p, +1000ms for other qualities |

Table 1. Comparison of encoding usage across different Initialization Stream configurations
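The up switching penalties in the table follow directly from the sparse Initialization Stream rate: at n fps, independent frames are 1000/n ms apart, so a switch towards that rendition waits half that interval on average and the full interval in the worst case (the +100/+200/+500/+1000ms figures above correspond to the worst case). A quick sketch:

```python
def up_switch_delay_ms(init_stream_fps: float) -> tuple[float, float]:
    """Average and worst-case extra delay before switching up to a
    rendition whose Initialization Stream runs at init_stream_fps."""
    interval_ms = 1000.0 / init_stream_fps  # spacing between IDR frames
    return interval_ms / 2, interval_ms

for fps in (10, 5, 2, 1):
    print(fps, up_switch_delay_ms(fps))   # worst cases: 100, 200, 500, 1000 ms
```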

Especially interesting are the scenarios where full Initialization Streams are present for some renditions and sparse Initialization Streams are generated for the others: here we see an increase of only 15 to 23%. Based on our tests, we would recommend profiles such as 1/1/30/30, or combinations such as 1/5/30/30, where we expect the total increase in encoding cost to be around 18-20%. Generating sparse frames for all but the lowest bandwidth brings the increase down to as little as about 3%, while the impact on switching up remains limited. While in these scenarios the cost for encoding would still go up, this is an increase which should be easily compensated by HESP's benefits in cost to scale, improvements in GOP size (and the resulting reduction in egress traffic), and rise in QoE.

As a conclusion, we can see that while generating a full set of Initialization Streams delivers the best experience, it is worth looking at your specific requirements to trim down the number of Initialization Streams, or to generate sparse Initialization Streams. Tweaking these parameters can be crucial for services which operate large numbers of streams with limited numbers of viewers per stream, and will allow you to easily keep your costs in check. If you have any questions on this, don't hesitate to reach out to our team!

Questions about HESP? Contact us today!
