Why Google threw out FP and FCP in favor of LCP (Correlation study)

David Sommers
Published in Code for RentPath
May 25, 2021 · 5 min read


If you already know what Web Vitals are and the difference between LCP and FID, then skip ahead to the next section.

Acronym warning!

  • CWV = Core Web Vitals => Subset of Web Vitals, which Google created to provide unified guidance on quality signals. The core ones matter enough to Google that they’ll soon adjust your search rankings based on them.
  • RUM = Real User Measurement => Measurements from the browsers of end-users on how your site performs.
  • FP = First Paint => Point at which the first pixel renders on a screen.
  • FCP = First Contentful Paint => Point when the browser first renders any content from the document object model (DOM), such as text or images. That content requirement is what distinguishes FCP from FP, which counts any pixel at all.
  • LCP = Largest Contentful Paint => Point at which the largest image or text block is visible within the viewport.

Gathering data with 3 independent systems

RentPath has a number of systems to ensure we are measuring traffic appropriately. If you only have one system telling you there’s a problem, how do you know it’s right?

Whether by choice or through simply not retiring monitoring systems, we have three synthetic testing mechanisms and three-ish RUM platforms.

Synthetic:

RUM:

  • SpeedCurve LUX
  • Perfume.js
  • Chrome User Experience Report (CrUX)

If you’re thinking, wow, that’s a lot of stuff running on your site: you’re right! Luckily, none of the synthetic tools impact users.

As for RUM, code does run within users’ browsers, but you can’t fix what you don’t measure, so it’s a necessary evil. Additionally, SpeedCurve LUX and Perfume.js are sampled, so not every user is impacted. Separately, CrUX data is already gathered directly by the Chrome browser for all users and reported straight to Google.

We ultimately need multiple systems though because it’s similar to triple modular redundancy in an aircraft’s autopilot system. If one system says we have a problem but the other two disagree, we’ll wait for more data points to determine if the problem is with the monitoring system or the actual aircraft. If two or three systems throw warnings, we’ve got a problem.

Perfume.js

This blog post covers RUM data captured using Perfume.js.

Perfume.js gathers a ton of data for only an additional 2KB of JavaScript in your bundle. It leverages the browser Performance APIs to gather timing data and all the metric acronyms listed above. Additionally, it records several metrics at intervals so you can troubleshoot whether the problem occurs toward the beginning of the user’s page load or later.

In our case, we’re sending all the data as events to Google Analytics 360 which then syncs the data to BigQuery.

Turning data into insights

Using Google Colaboratory, I created a Python Jupyter notebook that pulls the data out of BigQuery by our page types (home page, search page, etc).

Something like this:
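(The snippet below is a rough reconstruction, not the original notebook cell; the project, table, and column names are hypothetical stand-ins for our actual GA 360 export schema.)

```python
# Rough sketch of the BigQuery pull from the Colab notebook.
# `our-project`, the table name, and the column names are hypothetical
# stand-ins for the real GA 360 export schema.
from google.colab import auth
import pandas as pd

auth.authenticate_user()  # grant the notebook access to BigQuery

QUERY = """
SELECT
  session_id,
  device_category,   -- desktop / mobile
  page_type,         -- home page, search page, map view, ...
  metric_name,       -- perf_fp, perf_fcp, perf_lcp, perf_cls, perf_tbt, ...
  metric_value
FROM `our-project.analytics.perfume_events`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2021-04-01') AND TIMESTAMP('2021-05-25')
"""

df = pd.read_gbq(QUERY, project_id='our-project', dialect='standard')
```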

Now let’s see how many events we have:
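Sketching again with the hypothetical columns from above:

```python
# Total rows pulled vs. rows that actually carry a metric value
print(len(df))                      # ~40 million recorded sessions
print(df['metric_value'].count())   # ~38 million with a non-null value
```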

Hmm, we have 40 million recorded sessions but only 38 million have values. Spot checking the data, it looks as though some values didn’t record. So let’s just chop off those missing values with a quick dropna().
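With the hypothetical column names above, that cleanup is a one-liner:

```python
# Keep only the rows where a metric value was actually recorded
df = df.dropna(subset=['metric_value'])
```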

Let’s pivot the data by metric using:
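Something along these lines, still using the hypothetical schema from the earlier sketch:

```python
def pivot_metrics(frame):
    """One row per session, one column per metric (perf_fp, perf_fcp, ...)."""
    return frame.pivot_table(index='session_id',
                             columns='metric_name',
                             values='metric_value',
                             aggfunc='first')

pivoted = pivot_metrics(df)
```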

Correlation time

Let’s set up our graphing:
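The original plotting cell isn’t reproduced here, but a typical seaborn setup for a correlation heatmap looks roughly like this:

```python
import matplotlib.pyplot as plt
import seaborn as sns

def plot_correlations(frame, title):
    """Heatmap of pairwise Pearson correlations between the metric columns."""
    corr = frame.corr()
    plt.figure(figsize=(10, 8))
    sns.heatmap(corr, vmin=0, vmax=1, annot=True, fmt='.2f', cmap='viridis')
    plt.title(title)
    plt.show()
```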

Then run through a few scenarios. Let’s try the map view on desktop:
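Roughly: filter the raw events, pivot, then plot (the filter values below are hypothetical labels for our page types):

```python
# Correlations for the map view on desktop
desktop_map = pivot_metrics(df[(df['device_category'] == 'desktop') &
                               (df['page_type'] == 'map view')])
plot_correlations(desktop_map, 'Map view, desktop')
```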

And the search page on mobile:
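Same idea, different filters:

```python
# Correlations for the search page on mobile
mobile_search = pivot_metrics(df[(df['device_category'] == 'mobile') &
                                 (df['page_type'] == 'search page')])
plot_correlations(mobile_search, 'Search page, mobile')
```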

Cool colors but what does it all mean!?!

A score of 1.0 is a perfect positive correlation and 0.0 means no correlation at all (scores can also run negative, down to -1.0, for inverse relationships). When you look at the graph, you’ll see the X and Y axes list the same metrics, so they match diagonally from the top left to the bottom right. In the image below, “perf_tbt” on the Y-axis is a perfect 1.0 match to “perf_tbt” on the X-axis because it’s the same metric.

Besides metrics perfectly matching up to themselves, what else is strongly correlated?

Correlated metrics

  • FP and FCP are effectively identical, with 1.0 correlation scores on both desktop and mobile (very strongly correlated). This supports Google no longer referring to FP and talking only about FCP.
  • LCP and FP/FCP are strongly correlated for us on Mobile (0.82) and Desktop (0.97).
  • Cumulative Layout Shift (CLS), First Input Delay (FID) and Total Blocking Time (TBT) didn’t correlate strongly with anything else.
  • Perfume.js measures TBT at 5 seconds, 10 seconds and at a final measurement. These were all strongly correlated with each other, so if you have limits on the number of events you can record, I wouldn’t record all of them. The same goes for LCP. However, CLS was not strongly correlated between its initial value and its final score. That tracks: Google is changing how CLS is calculated because the measurement fluctuates over time after the page has loaded, and our pages are single page apps (SPAs), which also affects CLS over time.

Conclusion

The Core Web Vitals of LCP, FID and CLS are not correlated with one another, which makes them three great independent metrics to strive for. At this point, I can say that, mathematically, it makes sense for Google to shift our focus from the first thing a user sees (FP/FCP) to the largest piece of content a user sees (LCP). I’d rather track three independent core metrics targeting user happiness than lose track of six browser metrics that already relate to one another.

On the flip side, since LCP and FCP are strongly correlated, why do we need LCP at all? We had an incident where LCP was triggered on a Google Map tile as the “largest piece of content” and it penalized our Lighthouse scores for months, because the Google Map is really slow to load. We hacked something in to “trick” LCP into avoiding the map tile until a page redesign could shift the LCP away from the map. If the metric were still FCP, we wouldn’t have had a problem. Look for a future blog post on that.
