Do types actually make developers more productive? Or is it just more typing on the keyboard? To answer that question we’re revisiting Diff Authoring Time (DAT) – how Meta measures how long it takes to submit changes to a codebase. DAT is just one of the ways e measure developer … | Continue reading
Today’s rapidly evolving landscape of use cases that demand highly performant and efficient network infrastructure is placing new emphasis on how in-line amplifiers (ILAs) are designed and deployed. Meta’s ILA Evo effort seeks to reimagine how an ILA site could be deployed to imp … | Continue reading
We’re sharing details about Glean, Meta’s open source system for collecting, deriving and working with facts about source code. In this blog post we’ll talk about why a system like Glean is important, explain the rationale for Glean’s design, and run through some of the ways we’r … | Continue reading
Meta has been on a years-long undertaking to translate our entire Android codebase from Java to Kotlin. Today, despite having one of the largest Android codebases in the world, we’re well past the halfway point and still going. We’re sharing some of the tradeoffs we’ve made to su … | Continue reading
How did the Threads iOS team maintain the app’s performance during its incredible growth? Here’s how Meta’s Threads team thinks about performance, including the key metrics we monitor to keep the app healthy. We’re also diving into some case studies that impact publish reliabilit … | Continue reading
How do you take a mixed reality (MR) headset from idea to finished product? Alfred Jones, VP of hardware engineering at Meta Reality Labs, joins Pascal Hartig (@passy) on the latest episode of the Meta Tech Podcast for a discussion on the realities (no pun intended) of building M … | Continue reading
We’re explaining the end-to-end systems the Facebook app leverages to deliver relevant content to people. Learn about our video-unification efforts that have simplified our product experience and infrastructure, in-depth details around mobile delivery, and new features we are wor … | Continue reading
Ten years after the introduction of PEP 484, we surveyed the current state of the Python type system and the tools developers are using. [...] Read More... The post Typed Python in 2024: Well adopted, yet usability challenges persist appeared first on Engineering at Meta. | Continue reading
Andromeda is Meta’s proprietary machine learning (ML) system design for retrieval in ad recommendation focused on delivering a step-function improvement in value to our advertisers and people. This system pushes the boundary of cutting edge AI for retrieval with NVIDIA Grace Hopp … | Continue reading
AI plays a fundamental role in creating valuable connections between people and advertisers within Meta’s family of apps. Meta’s ad recommendation engine, powered by deep learning recommendation models (DLRMs), has been instrumental in delivering personalized ads to people. Key t … | Continue reading
Cryptographic monitoring at scale has been instrumental in helping our engineers understand how cryptography is used at Meta. Monitoring has given us a distinct advantage in our efforts to proactively detect and remove weak cryptographic algorithms and has assisted with our gener … | Continue reading
At Meta, we’re always looking for ways to enhance the productivity of our engineers and developers. But how exactly do you measure developer productivity? On this episode of the Meta Tech Podcast Pascal Hartig (@passy) sits down with Sarita and Moritz, two engineers at Meta who h … | Continue reading
Your contact list is fundamental to the experiences you love and enjoy on WhatsApp. With contacts, you know which of your friends and family are on WhatsApp, you can easily message or call them, and it helps give you context on who is in your groups. But losing your phone could m … | Continue reading
At Open Compute Project Summit (OCP) 2024, we’re sharing details about our next-generation network fabric for our AI training clusters. We’ve expanded our network hardware portfolio and are contributing two new disaggregated network fabrics and a new NIC to OCP. We look forward t … | Continue reading
At the Open Compute Project (OCP) Global Summit 2024, we’re showcasing our latest open AI hardware designs with the OCP community. These innovations include a new AI platform, cutting-edge open rack designs, and advanced network fabrics and components. By sharing our designs, we … | Continue reading
Data for Good at Meta is open-sourcing the data used to train our AI-powered population maps. We’re hoping that researchers and other organizations around the world will be able to leverage these tools to assist with a wide range of projects including those on climate adaptation, … | Continue reading
At Meta, React and React Native are more than just tools; they are integral to our product development and innovation. With over five thousand people at Meta building products and experiences with React every month, these technologies are fundamental to our engineering culture an … | Continue reading
This episode of the Meta Tech Podcast is all about Bento, Meta’s internal distribution of Jupyter Notebooks, an open-source web-based computing platform. Bento allows our engineers to mix code, text, and multimedia in a single document and serves a wide range of use cases at Meta … | Continue reading
We’re sharing more about the role that reinforcement learning plays in helping us optimize our data centers’ environmental controls. Our reinforcement learning-based approach has helped us reduce energy consumption and water usage across various weather conditions. Meta is revamp … | Continue reading
The Quantum Apocalypse is coming. The advent of quantum computers has raised real questions about the future of data privacy over the internet. Someday, advances in quantum computing will make it possible to decrypt sensitive data that was encrypted using today’s complex cryptogr … | Continue reading
At Meta, we’ve been diligently working to incorporate privacy into different systems of our software stack over the past few years. Today, we’re excited to share some cutting-edge technologies that are part of our Privacy Aware Infrastructure (PAI) initiative. These innovations m … | Continue reading
We are introducing a new metric— real-time server fleet utilization effectiveness —as part of the RETINAS initiative to help reduce emissions and achieve net zero emissions across our value chain in 2030. This new metric allows us to measure server resource usage (e.g., compute, … | Continue reading
How PyTorch powers AI training and inference Learn about new PyTorch advancements for LLMs and how PyTorch is enhancing every aspect of the LLM lifecycle. In this talk from AI Infra @ Scale 2024, software engineers Wanchao Liang and Evan Smothers are joined by Meta research scien … | Continue reading
In this talk from AI Infra @ Scale 2024, Joel Colburn, a software engineer at Meta, technical lead Junqiang Lan, and software engineer Jack Montgomery discuss the second generation of MTIA, Meta’s in-house training and inference accelerator. They cover the co-design process behin … | Continue reading
Llama 3 is Meta’s most capable openly-available LLM to date and the recently-released Llama 3.1 will enable new workflows, such as synthetic data generation and model distillation with unmatched flexibility, control, and state-of-the-art capabilities that rival the best closed so … | Continue reading
Delivering new AI technologies at scale also means rethinking every layer of our infrastructure – from silicon and software systems and even our data center designs. For the second year in a row, Meta’s engineering and infrastructure teams returned for the AI Infra @ Scale confer … | Continue reading
We launched Meta AI with the goal of giving people new ways to be more productive and unlock their creativity with generative AI (GenAI). But GenAI also comes with challenges of scale. As we deploy new GenAI technologies at Meta, we also focus on delivering these services to peop … | Continue reading
AI networks play an important role in interconnecting tens of thousands of GPUs together, forming the foundational infrastructure for training, enabling large models with hundreds of billions of parameters such as LLAMA 3.1 405B. This week at ACM SIGCOMM 2024 in Sydney, Australia … | Continue reading
We are open-sourcing DCPerf, a collection of benchmarks that represents the diverse categories of workloads that run in data center cloud deployments. We hope that DCperf can be used more broadly by academia, the hardware industry, and internet companies to design and evaluate fu … | Continue reading
What happens when a team of mechanical engineers get tired of looking at flat images of 3D models over Zoom? Meet the team behind Caddy, a new CAD app for mixed reality. They join Pascal Hartig (@passy) on the Meta Tech Podcast to talk about teaching themselves to code, disruptin … | Continue reading
The key to developer velocity across AI lies in minimizing time to first batch (TTFB) for machine learning (ML) engineers. AI Lab is a pre-production framework used internally at Meta. It allows us to continuously A/B test common ML workflows – enabling proactive improvements and … | Continue reading
Tail utilization is a significant system issue and a major factor in overload-related failures and low compute utilization. The tail utilization optimizations at Meta have had a profound impact on model serving capacity footprint and reliability. Failure rates, which are mostly t … | Continue reading
Meta’s advertising business leverages large-scale machine learning (ML) recommendation models that power millions of ads recommendations per second across Meta’s family of apps. Maintaining reliability of these ML systems helps ensure the highest level of service and uninterrupte … | Continue reading
The history of Rust at Meta goes all the way back to 2016, when we first started using it for source control. Today, it has been widely embraced at Meta and is one of our primary supported server-side languages (along with C++, Python, and Hack). But that doesn’t mean there weren … | Continue reading
We’re sharing how we streamline system reliability investigations using a new AI-assisted root cause analysis system. The system uses a combination of heuristic-based retrieval and large language model-based ranking to speed up root cause identification during investigations. Our … | Continue reading
We’re introducing parameter vulnerability factor (PVF), a novel metric for understanding and measuring AI systems’ vulnerability against silent data corruptions (SDCs) in model parameters. PVF can be tailored to different AI models and tasks, adapted to different hardware faults, … | Continue reading
At Meta, we support real-time communication (RTC) for billions of people through our apps, including WhatsApp, Instagram, and Messenger. We are working to make RTC accessible by providing a high-quality experience for everyone – even those who might not have the fastest connectio … | Continue reading
As we continue to focus our AI research and development on solving increasingly complex problems, one of the most significant and challenging shifts we’ve experienced is the sheer scale of computation required to train large language models (LLMs). Traditionally, our AI model tra … | Continue reading
Meta is currently operating many data centers with GPU training clusters across the world. Our data centers are the backbone of our operations, meticulously designed to support the scaling demands of compute and storage. A year ago, however, as the industry reached a critical inf … | Continue reading
MobileConfig enables developers to centrally manage a mobile app’s configuration parameters in our data centers. Once a parameter value is changed on our central server, billions of app devices automatically fetch and apply the new value without app updates. These remotely manage … | Continue reading
At Meta, Bento, our internal Jupyter notebooks platform, is a popular tool that allows our engineers to mix code, text, and multimedia in a single document. Use cases run the entire spectrum from what we call “lite” workloads that involve simple prototyping to heavier and more co … | Continue reading
In recent years, Meta’s data management systems have evolved into a composable architecture that creates interoperability, promotes reusability, and improves engineering efficiency. We’re sharing how we’ve achieved this, in part, by leveraging Velox, Meta’s open source execution … | Continue reading
Today, the internet (like most digital infrastructure in general) relies heavily on the security offered by public-key cryptosystems such as RSA, Diffie-Hellman (DH), and elliptic curve cryptography (ECC). But the advent of quantum computers has raised real questions about the lo … | Continue reading
When Threads first launched one of the top feature requests was for a web client. In this episode of the Meta Tech Podcast, Pascal Hartig (@passy) sits down with Ally C. and Kevin C., two engineers on the Threads Web Team that delivered the basic version of Threads for web in jus … | Continue reading
[...] Read More... The post Building new custom silicon for Meta’s AI workloads appeared first on Engineering at Meta. | Continue reading
Meta’s family of apps serves trillions of image download requests every day. And if you’re into high-quality images, you’ve probably noticed that Instagram and Threads have added support for high dynamic range (HDR) photos. Now people on Threads and Instagram can upload and share … | Continue reading
Threads has entered the fediverse! As part of our beta experience, Threads users aged 18+ with public profiles can now choose to share their Threads posts to other ActivityPub-compliant servers. People on those servers can now follow federated Threads profiles and see, like, repl … | Continue reading
Bandwidth estimation (BWE) and congestion control play an important role in delivering high-quality real-time communication (RTC) across Meta’s family of apps. We’ve adopted a machine learning (ML)-based approach that allows us to solve networking problems holistically across cro … | Continue reading