How did Facebook pull off its Friends Day videos that sparked so much conversation last week?
Software engineer Peter Lai and technical program manager Nancy Lu explained in a blog post how Facebook’s Friends Day videos were created and deployed. They named three challenges: curating photos for each user, rendering them into videos and making sure those videos appeared atop users’ News Feeds on the social network’s 12th birthday, Feb. 4.
Lai and Lu detailed Facebook’s efforts in areas including load testing, efficiency improvement, capacity planning, keeping the social network’s infrastructure safe, ranking the photos to be included and the rollout process. Highlights follow.
On the infrastructure challenges, they wrote:
Thanks to some thoughtful planning on product design, product development and infrastructure preparation, we were able to pull this off in a pretty efficient, performant and stable way without disrupting the normal Facebook traffic.
To help with capacity allocation and distribution, we needed to take into consideration resource usage per video generated (CPU cores, memory and time), storage IOPS (input/output operations per second) for writing and reading videos, network bandwidth limitations at the cluster and backbone level and power availability in each region. Thorough load testing allowed us to guide peak loads and identify and work around constraints to generate these videos without affecting the reliability of our production infrastructure. In the end, the load tests’ results matched closely to predictions. Along the way, we discovered and eliminated bottlenecks to achieve the maximum throughput from allocated capacity.
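To make the capacity arithmetic described above concrete, here is a rough Python sketch. It is an illustration only, not code from Facebook's post: the `RenderCost` fields, the function names, the utilization factor and every number in the usage note are assumptions chosen to show the shape of the calculation.

```python
import math
from dataclasses import dataclass


@dataclass
class RenderCost:
    """Hypothetical per-video resource usage, as a load test might measure it."""
    cpu_core_seconds: float  # CPU time needed to render one video
    video_size_mb: float     # size of the finished video in megabytes


def machines_needed(videos, window_hours, cost, cores_per_machine, utilization=0.7):
    """Estimate machines required to render `videos` within `window_hours`.

    `utilization` is the assumed fraction of each machine's cores that can be
    used without disturbing other services running on the same hardware.
    """
    total_core_seconds = videos * cost.cpu_core_seconds
    usable_per_machine = cores_per_machine * utilization * window_hours * 3600
    return math.ceil(total_core_seconds / usable_per_machine)


def write_bandwidth_gbps(videos, window_hours, cost):
    """Average network rate (gigabits per second) needed to store the output."""
    total_megabits = videos * cost.video_size_mb * 8
    return total_megabits / (window_hours * 3600) / 1000
```

With made-up inputs of 1 million videos, a 10-hour window, 36 core-seconds and 5 MB per video, and 10-core machines at full utilization, `machines_needed` returns 100 and the average write rate comes to roughly 1.1 Gbps. A real plan would also have to account for storage IOPS and regional power, which the post mentions but this sketch leaves out.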
Friends Day videos were created and delivered through our shared infrastructure. We used the same production systems that are used to run many services on our site, which include compute tiers, storage tiers and the same network used internally and externally to serve Facebook. Instead of dedicating permanent capacity for Friends Day, we used an automated system that allowed us to keep production services safe. Based on the capacity requirement calculated from the load tests, the system automatically chose machines that wouldn’t disrupt other existing services. By continuously monitoring and detecting abnormalities in power consumption, CPU usage and network bandwidth, the system automatically blacklisted offending machines and redirected jobs to another available machine somewhere else. This flexible capacity allowed us to operate silently at this scale. On top of that, a throttling mechanism on video creation and delivery was put in place to allow us to pull the product back as needed if issues arose. All of these automated systems and close eyes on our dashboards during launch day kept our infrastructure safe.
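The post does not include code, but the two safety mechanisms it describes, blacklisting machines that show abnormal readings and throttling video creation, can be sketched in a few lines of Python. Everything here is hypothetical: the `Machine`, `Limits` and `Scheduler` names, the threshold values and the percentage-based throttle are assumptions for illustration, not Facebook's actual system.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    cpu_usage: float      # fraction of CPU in use, 0.0 to 1.0
    power_watts: float
    network_mbps: float


@dataclass
class Limits:
    # Hypothetical thresholds; real values would come from load testing.
    max_cpu: float = 0.85
    max_power_watts: float = 400.0
    max_network_mbps: float = 800.0


def is_abnormal(machine, limits):
    """True if any monitored reading exceeds its limit."""
    return (machine.cpu_usage > limits.max_cpu
            or machine.power_watts > limits.max_power_watts
            or machine.network_mbps > limits.max_network_mbps)


class Scheduler:
    def __init__(self, machines, limits, throttle=1.0):
        self.machines = list(machines)
        self.limits = limits
        self.throttle = throttle  # fraction of jobs admitted; 0.0 pauses everything
        self.blacklist = set()

    def refresh_blacklist(self):
        """Re-check every machine and blacklist the offending ones."""
        for machine in self.machines:
            if is_abnormal(machine, self.limits):
                self.blacklist.add(machine.name)

    def assign(self, job_id):
        """Return the machine name for an integer job id, or None if held back."""
        # Throttle: admit only the configured percentage of job ids.
        if job_id % 100 >= self.throttle * 100:
            return None
        healthy = [m for m in self.machines if m.name not in self.blacklist]
        if not healthy:
            return None
        # Spread admitted jobs across the remaining healthy machines.
        return healthy[job_id % len(healthy)].name
```

In this sketch, lowering `throttle` toward 0.0 plays the role of pulling the product back, while `refresh_blacklist` stands in for the continuous monitoring the engineers describe: jobs simply stop landing on a machine once it is blacklisted.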
Lai and Lu also described the challenges of curating the most appropriate content for users:
While one team was figuring out how to serve this many videos, another was working on how to generate the content of the videos. We learned in years past that, despite our best intentions, this is tricky territory. We think nothing is more important than making sure our users have a good experience, so we worked really hard, using signals available to us, to make sure that we didn’t show something unpleasant to the people receiving these videos.
Those signals came from a variety of places. If someone had used breakup checkup, the ex got nixed from consideration for that person’s video. Same with anyone people had blocked or marked as “I don’t want to see anything from this person.” We also factored in likes, comments and tagged people.
At this scale, we know we might not have picked the exact right images for everyone, but we hope the video was easy to edit into something that celebrated your friendships. And, overall, we’re really excited that an effort at this scale rolled out as reliably as it did. We’re always honored and humbled when we can help people celebrate their friends.
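The curation logic Lai and Lu describe, dropping photos that involve an ex or a blocked person and then weighing likes, comments and tags, might look something like the Python sketch below. The post gives no formula, so the `Photo` fields, the `score` weights and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Photo:
    photo_id: str
    tagged: frozenset  # ids of the people tagged in the photo
    likes: int
    comments: int


def score(photo):
    """Hypothetical ranking: the weights are invented, not taken from the post."""
    return photo.likes + 2 * photo.comments + len(photo.tagged)


def pick_photos(photos, excluded_people, limit=10):
    """Drop photos tagging anyone in `excluded_people`, then keep the top `limit`.

    `excluded_people` would hold exes flagged through breakup tools plus anyone
    the user has blocked or asked not to see.
    """
    eligible = [p for p in photos if not (p.tagged & excluded_people)]
    eligible.sort(key=score, reverse=True)
    return eligible[:limit]
```

The key design choice is that exclusion happens before ranking: a heavily liked photo of an ex never competes for a slot, no matter how high it would have scored.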
Readers: What did you think of your Friends Day videos?