Intro

Porn is a big industry. There aren’t many sites on the Internet that can rival the traffic of its biggest players.

And juggling this immense traffic is tough. To make things even harder, much of the content served from porn sites is made up of low latency live streams rather than simple static video content. But for all of the challenges involved, rarely have I read about the developers who take them on. So I decided to write about my own experience on the job.

What’s the problem?

A few years ago, I was working for what was then the 26th most visited website in the world—not just in the porn industry: in the world.

At the time, the site served up porn streaming requests with the Real Time Messaging protocol (RTMP). More specifically, it used a Flash Media Server (FMS) solution, built by Adobe, to provide users with live streams. The basic process was as follows:

  1. The user requests access to some live stream
  2. The server replies with an RTMP session playing the desired footage

For a couple of reasons, FMS wasn’t a good choice for us, starting with its costs, which included purchasing both:

  1. Windows licenses for every machine on which we ran FMS.
  2. ~$4k FMS-specific licenses, of which we had to purchase several hundred (and more every day) due to our scale.

All of these fees began to rack up. And costs aside, FMS was a lacking product, especially in its functionality (more on this in a bit). So I decided to scrap FMS and write my own RTMP parser from scratch.

In the end, I managed to make our service roughly 20x more efficient.

Getting started

There were two core problems involved. Firstly, RTMP and other Adobe protocols and formats were not open (i.e., publicly available), which made them hard to work with. How can you reverse-engineer or parse files in a format about which you know nothing? Luckily, some reversing efforts were available in the public sphere (not produced by Adobe, but by osflash.org, who have since taken them down), and we based our work on those.

Note: Adobe later released “specifications” which contained no more information than what was already disclosed in the non-Adobe-produced reversing wiki and documents. Their (Adobe’s) specifications were of an absurdly low quality and made it near impossible to actually use their libraries. Moreover, the protocol itself seemed intentionally misleading at times. For example:

  1. They used 29-bit integers.
  2. They included protocol headers with big endian formatting everywhere—except for a specific (yet unmarked) field, which was little endian.
  3. They squeezed data into less space at the cost of computational power when transporting 9k video frames, which made little to no sense, because they were earning back bits or bytes at a time—insignificant gains for such a file size.
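
The 29-bit integers, for example, are AMF3’s variable-length “U29” encoding: a value packed into one to four bytes, with the high bit of each of the first three bytes acting as a continuation flag. A minimal decoder sketch (my own illustration, not code from the project):

```python
def read_u29(data: bytes) -> tuple:
    """Decode an AMF3 variable-length 29-bit integer.

    Each of the first three bytes carries 7 value bits; the high bit
    signals that another byte follows. A fourth byte, if present,
    contributes a full 8 bits, for 29 bits total.
    Returns (value, bytes_consumed).
    """
    value = 0
    for i in range(3):
        byte = data[i]
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:          # continuation bit clear: done
            return value, i + 1
    # fourth byte contributes all 8 of its bits
    value = (value << 8) | data[3]
    return value, 4
```

Saving a byte or two per field this way is exactly the kind of micro-optimization that makes no difference next to a video payload.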

And secondly: RTMP is highly session oriented, which made it virtually impossible to multicast an incoming stream. Ideally, if multiple users wanted to watch the same live stream, we could just pass them back pointers to a single session in which that stream is being aired (this would be multicasting). But with RTMP, we had to create an entirely new instance of the stream for every user that wanted access. This was a complete waste.


My solution

With that in mind, I decided to re-package/parse the typical response stream into FLV ‘tags’ (where a ‘tag’ is just some video, audio, or metadata). These FLV tags could travel within RTMP with little issue.

The benefits of such an approach:

  1. We only needed to repackage a stream once (repackaging was a nightmare due to the lack of specifications and protocol quirks outlined above).
  2. We could re-use any stream between clients with very few problems: each client simply received an FLV header, while an internal pointer to the shared FLV tags (along with an offset indicating the client’s position in the stream) provided access to the content.
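
Unlike RTMP, an FLV tag has a fixed, documented layout: an 11-byte header, then the payload. A hypothetical parser sketch (note the extended-timestamp byte, which holds the *upper* 8 bits despite coming after the lower 24—another of the format’s quirks):

```python
def parse_flv_tag_header(buf: bytes) -> dict:
    """Parse the 11-byte header of a single FLV tag.

    Layout: 1-byte type (8 = audio, 9 = video, 18 = script data),
    a 3-byte big-endian payload size, a 3-byte timestamp whose
    upper 8 bits live in a separate fourth byte, and a 3-byte
    stream ID that is always 0.
    """
    tag_type = buf[0]
    data_size = int.from_bytes(buf[1:4], "big")
    timestamp = int.from_bytes(buf[4:7], "big") | (buf[7] << 24)
    stream_id = int.from_bytes(buf[8:11], "big")
    return {"type": tag_type, "size": data_size,
            "timestamp": timestamp, "stream_id": stream_id}
```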

I began development in the language I knew best at the time: C. Over time, this choice became cumbersome; so I started learning Python while porting over my C code. The development process sped up, but after a few demos, I quickly ran into the problem of exhausting resources. Python’s socket handling was not meant to handle these types of situations: specifically, in Python we found ourselves making multiple system calls and context switches per action, adding a huge amount of overhead.

Improving performance: mixing Python and C

After profiling the code, I chose to move the performance-critical functions into a Python module written entirely in C. This was fairly low-level stuff: specifically, it made use of the kernel’s epoll mechanism to provide a logarithmic order-of-growth.

In asynchronous socket programming, there are facilities that can tell you whether a given socket is readable, writable, or in an error state. In the past, developers used the select() system call to get this information, but it scales badly. poll() is a better version of select(), but it’s still not great, as you have to pass in the whole set of socket descriptors on every call.

Epoll is amazing: all you have to do is register a socket once, and the kernel remembers it, handling all the gritty details internally. So there’s no argument-passing overhead on each call. It also scales far better and returns only the sockets you care about, which beats running through a list of 100k socket descriptors and checking bitmasks for events, as you must with the other solutions.
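
Python exposes epoll directly as select.epoll (Linux only), and more portably through the selectors module, whose DefaultSelector picks epoll on Linux. A toy sketch of the register-once, ask-what’s-ready pattern:

```python
import selectors
import socket

def wait_readable(socks, timeout=1.0):
    """Register each socket once; the selector (epoll on Linux)
    remembers the set and returns only the ready descriptors,
    with no per-call descriptor list to pass in."""
    sel = selectors.DefaultSelector()
    for s in socks:
        sel.register(s, selectors.EVENT_READ)
    try:
        return [key.fileobj for key, _events in sel.select(timeout)]
    finally:
        sel.close()
```

In a real server you would keep the selector alive for the process lifetime instead of rebuilding it per call; this sketch just shows the API shape.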

But for the increase in performance, we paid a price: this approach followed a completely different design pattern than before. The site’s previous approach was (if I recall correctly) one monolithic process which blocked on receiving and sending; I was developing an event-driven solution, so I had to refactor the rest of the code as well to fit this new model.

Specifically, in our new approach, we had a main loop, which handled receiving and sending as follows:


  1. The received data was passed (as messages) up to the RTMP layer.
  2. The RTMP messages were dissected and FLV tags were extracted.
  3. The FLV data was sent to the buffering and multicasting layer, which organized the streams and filled the low-level buffers of the sender.
  4. The sender kept a struct for every client, with a last-sent index, and tried to send as much data as possible to the client.

This was a rolling window of data, and included some heuristics to drop frames when the client was too slow to receive. Things worked pretty well.
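
The buffering and multicasting layer can be sketched as a shared rolling window of FLV tags plus a per-client read index (a hypothetical reconstruction of the design above, not the original code):

```python
class StreamBuffer:
    """Rolling window of FLV tags for one broadcast, shared by all
    viewers. Each client stores only an index into the window, so the
    stream is parsed and buffered exactly once."""

    def __init__(self, window=512):
        self.window = window
        self.tags = []        # FLV tags in arrival order
        self.base = 0         # absolute index of tags[0]

    def append(self, tag):
        self.tags.append(tag)
        if len(self.tags) > self.window:      # slide the window forward
            dropped = len(self.tags) - self.window
            del self.tags[:dropped]
            self.base += dropped

    def fetch(self, client_index):
        """Return (unsent tags, new client index). A client that fell
        behind the window is skipped forward: the frame-drop heuristic
        for slow receivers."""
        start = max(client_index, self.base)
        return self.tags[start - self.base:], self.base + len(self.tags)
```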

Systems-level, architectural, and hardware issues

But we ran into another problem: the kernel’s context switches were becoming a burden. As a result, we chose to write only every 100 milliseconds, rather than instantaneously. This aggregated the smaller packets and prevented a burst of context switches.
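
The idea can be sketched as a small write-coalescing buffer; send_fn and the injectable clock are illustrative stand-ins, not the project’s actual API:

```python
import time

class CoalescingSender:
    """Buffer small writes and flush at most every `interval` seconds,
    trading a little latency for far fewer syscalls and context
    switches (a sketch of the 100 ms batching described above)."""

    def __init__(self, send_fn, interval=0.1, clock=time.monotonic):
        self.send_fn = send_fn
        self.interval = interval
        self.clock = clock
        self.buf = bytearray()
        self.last_flush = clock()

    def write(self, data: bytes):
        self.buf += data
        if self.clock() - self.last_flush >= self.interval:
            self.flush()

    def flush(self):
        if self.buf:
            self.send_fn(bytes(self.buf))   # one syscall for many small packets
            self.buf.clear()
        self.last_flush = self.clock()
```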

Perhaps a larger problem lay in the realm of server architecture: we needed a load-balancing and failover-capable cluster—losing users to server malfunctions is no fun. At first, we went with a separate-director approach, in which a designated ‘director’ would try to create and destroy broadcaster feeds by predicting demand. This failed spectacularly. In fact, everything we tried failed pretty substantially. In the end, we opted for a relatively brute-force approach: sharing broadcasters among the cluster’s nodes randomly, which evened out the traffic.

This worked, but with one drawback: although the general case was handled pretty well, we saw terrible performance when everyone on the site (or a disproportionate number of users) watched a single broadcaster. The good news: this never happens outside a marketing campaign. We implemented a separate cluster to handle this scenario, but in truth we reasoned that jeopardizing the paying user’s experience for a marketing effort was senseless—in fact, this wasn’t really a genuine scenario (although it would have been nice to handle every imaginable case).

Conclusion

Some statistics from the end result: daily traffic on the cluster was about 100k users at peak (60% load), ~50k on average. I managed two clusters (HUN and US); each comprised about 40 machines sharing the load. The aggregated bandwidth of the clusters was around 50 Gbps, of which they used around 10 Gbps at peak load. In the end, I managed to push out 10 Gbps/machine easily; theoretically1, this number could’ve gone as high as 30 Gbps/machine, which translates to about 300k users watching streams concurrently from one server.

The existing FMS cluster contained more than 200 machines, which could’ve been replaced by my 15—only 10 of which would do any real work. This gave us roughly a 200/10 = 20x improvement.

Probably my greatest take-away from the project was that I shouldn’t let myself be stopped by the prospect of having to learn a new skill set. In particular, Python, transcoding, and object-oriented programming were all areas in which I had little professional experience before taking on this project.

That, and that rolling your own solution can pay big.

1 Later, when we put the code into production, we ran into hardware issues, as our older Intel sr2500 servers could not handle 10 Gbit Ethernet cards because of their low PCI bandwidth. Instead, we used them with bonds of one to four 1 Gbit Ethernet cards (aggregating the performance of several network interface cards into one virtual interface). Eventually, we got some of the newer Intel sr2600 i7 machines, which served 10 Gbps over optics without any performance kinks. All the projected calculations refer to this hardware.


Comments

Jordan Fried
Nice post Gergely!
Vlad Shvedov
Interesting, and very close to problems I had when I was involved in the VoD business. Of course we had smaller traffic, but one of the good decisions we made on the route to better performance was switching to Wowza's reverse-engineered RTMPE implementation.
Michael Sparks
You're everywhere I look Jordan. Quit it!
Gergely Kalman
Thanks buddy!
Bradley Barrows
I feel your pain. I was just working on a P2P live streaming solution which takes one RTMP stream as input and broadcasts it to all the other peers watching that stream. We actually had to deal with exactly the same issue: saving the initial header packets and re-sending them to new peers was our solution as well. There were also a number of other technical issues we ran into, related to using this format in ways it wasn't meant to be used and the lack of technical documentation, but with the help of one of the creators of an awesome Python RTMP server we figured most everything out.
obeattie
Very interesting post. As an aside, what tool did you use to make those nice diagrams? Photoshop?
Norman Timmler
I'm wondering if you used FLVTool2 as a reference for the crappy FLV format? The time I wrote FLVTool2, I was probably the first who reverse engineered the FLV protocol. I remember all the hours railing against the Adobe engineers ;)
Luboš Volkov
Very glad you like those diagrams, all of them are done in the Photoshop.
João Magalhães
Loved the post. Couldn't agree more... reinventing the wheel is not always a bad idea.
juarezbochi
Nice post! Are you planning to open source your solution?
Chris
What tool you've used to make the graphs of the post ? Thanks !
Luboš Volkov
We used Adobe Photoshop. You're welcome.
Luis Martinho
Top notch post. Thanks for sharing.
Luke™
Excellent writeup, thanks for sharing. I worked on reverse engineering RTMP back in the day and can understand your frustration with the protocol. It makes absolutely no sense for video streaming and was quite frustrating to work with. I think the idiosyncrasies stem from its roots. The story I heard is that one of the Macromedia employees built it while on holiday as a side project. I imagine the guy got a perverse kick out of using 29-bit ints. It was never intended for video; it was initially designed for shared objects. Video came later and was clearly shoehorned into the protocol. If you wanted an example of how not to design a video streaming protocol, RTMP would be a good candidate. That said, I guess its design unintentionally made Adobe more money. :-)
Martin Woo
Read the title again "How I made Porn 20x more efficient with Python" kinky stuff, huhh :D Nice post, btw :)
Nicolas Noirbent
Since you switched to an entirely home-made solution, any reason you kept using RTMP as the transport protocol, instead of switching to HTTP ? I know RTMP offers more options since it is a 2-way protocol compared to HTTP request/response model, but for massive broadcasting of live video content I'm not sure you need anything more. Not to mention using HTTP frees you of the eldritch abomination that is RTMP, and fits much more nicely into any conventional website stack.
Tom
It's 2013 and I still can't watch porn with HTML5 on Firefox, please fix. I don't use Flash, and getting off by watching pictures instead of movies is too much of a task. Thanks, a customer.
Gergely Kalman
Unfortunately no, it's the IP of the company I worked for. The software would be a bit dated nowadays anyway.
Gergely Kalman
Thank you, I did actually use it to look up the format along with other implementations like ffmpeg. Adobe engineers are a peach. One day a guy from them came to fix stuff and we were chatting away and I mentioned that swf verification did not mean squat. He said that it's impossible to bypass it. I did it in front of his eyes in 5 minutes (not that difficult to copy a few bytes actually). He was in awe of what I did. That was an eye-opening experience of the talent involved.
Gergely Kalman
I'd rephrase that as: Throwing out proprietary protocols in favour of open and more flexible solutions. But reinventing is a good idea sometimes as well.
Gergely Kalman
Unfortunately, Flash was only capable of broadcasting the stream using RTMP, so we had to keep it on the broadcaster side. Also, a lot of RPCing was involved, and we wanted our business logic to stay on the FMSs. Since streaming was the most resource-consuming task, we could make this step with little to no hassle as far as business logic is concerned.
Gergely Kalman
I think in part the protocol was obfuscated by design. While your explanation makes perfect sense, they could've just as easily introduced a new type of packet in RTMP and gotten rid of the byte-shifting craziness when transmitting fat packets.
neurofuzzy
Your icon for users in these graphs made me giggle, considering the kind of video they would be watching.
Jordan Fried
Haha good to hear!
杨 一航
Your post surprised me. I've studied part of the code of "rtmpd" and understand how difficult it is to handle 10 Gbps per machine. As you know, video quality is very important. I've found that some online chat websites use a plugin to enhance their video quality. It seems that they change the encoding of video and audio while still using Flash, and the plugin is IE-only. Can you guess what technique they use?
marconijr
Great post, what book/resources would you recommend when learning transcoding or the overall process of how video streaming works?
Enrique Conci
The wheel seems to be the same.. but flexible, open and faster. Great work!
Gergely Kalman
Well, I analyzed ffmpeg's source and read Wikipedia. At the time I knew almost nothing about that, even though I had the advantage of having seen lots of network protocols and file formats. Generally though, it's not really hard if you're not going extremely deep. Wikipedia can be inaccurate, but looking at the code always helps.
Gergely Kalman
I'm not sure. In house we used flash to try and measure the bandwidth and cpu availability and choose the format and encoding accordingly.