As with many economic statistics in the UK, it is hard to find accurate data on how many people are employed in different industries. That’s because it is difficult to find accurate data on the number of people employed by individual companies.
This is despite the fact that every company is required to report average employee numbers under the Companies Act 2006. We recently found that only 55.45% of companies in one of our largest datasets (Net Zero) had reported their employee count. This presents a major challenge to anyone who is trying to analyse the regional distribution of jobs in the sector.
We’re always playing with data and creating new models, and in this post we talk about one recent experiment and what it told us about employment data.
The Office for National Statistics (ONS) publishes employment data for different sectors. These are used to benchmark other datasets and are often used in the ‘human in the loop’ element of our work. We recently used the Low Carbon and Renewable Energy Economy (LCREE) Survey QMI as part of our quality assurance process for our Net Zero database.
About the Low Carbon and Renewable Energy Economy Survey
The ONS gathers its low carbon and renewable energy economy employment data through a sample-based survey of 24,000 businesses. They use the Inter-Departmental Business Register (IDBR) as the sampling frame. The design is a stratified single-stage random sample with the target population being stratified by industry, employment size and UK country.
The survey consistently achieves a response rate of above 80%. The stratified sampling enables a well-balanced sample representing the entirety of the UK economy as closely as possible.
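To make the sampling design more concrete, here is a minimal sketch of a stratified single-stage random sample. It is not the ONS's implementation: the toy register, the field names (`industry`, `size_band`, `country`) and the choice of drawing the same fraction from every stratum are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(businesses, strata_keys, fraction, seed=0):
    """Draw a simple random sample of the same fraction within each stratum.

    Assumption: an equal sampling fraction per stratum. A real survey
    design may sample some strata more heavily than others.
    """
    rng = random.Random(seed)

    # Group businesses by their combination of stratification variables.
    strata = defaultdict(list)
    for business in businesses:
        key = tuple(business[k] for k in strata_keys)
        strata[key].append(business)

    # Sample without replacement inside every stratum.
    sample = []
    for key in sorted(strata):
        members = strata[key]
        n = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, n))
    return sample


# Hypothetical register: 2 industries x 2 size bands x 2 countries,
# with 10 businesses in each of the 8 strata.
combos = [
    (industry, size_band, country)
    for industry in ("energy", "construction")
    for size_band in ("small", "large")
    for country in ("England", "Scotland")
    for _ in range(10)
]
register = [
    {"id": i, "industry": ind, "size_band": size, "country": country}
    for i, (ind, size, country) in enumerate(combos)
]

sample = stratified_sample(
    register, ("industry", "size_band", "country"), fraction=0.2
)
print(len(sample))
```

Because every stratum contributes to the sample, no combination of industry, size and country is left out by chance, which is the property the survey design relies on.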
However, there are limitations. The main one, as with all large-scale surveys, is the time lag between collecting and publishing the results. It takes the ONS 12 months to gather data, prepare the report and publish it. This means that the employment data is always out of date by the time it appears, which makes it hard to see how many people are currently employed in different industries.
The data is also weighted to represent businesses that aren’t part of the sample. Extreme high and low values are removed to avoid skewing measures of central tendency, like the mean and median.
The ONS uses “imputation techniques to estimate the values of missing data caused by non-responses”. They use item non-response imputation (where estimates are based on other available values) or unit non-response imputation (where estimates are based on the growth over time).
The statistics behind the ONS figures
Surveying the entire Net Zero industry isn’t feasible, so the ONS uses a sample to infer the numbers. Once they’ve collected and prepared the data they choose a confidence interval and estimate the missing figures for the rest of the sector. This process is simple, established and common.
Despite the logic of using confidence intervals, we couldn’t help being a little dissatisfied with the inaccuracy of this inference. We began to experiment with a new method to see if we could create a more complete dataset.
A new method: MICE, KNN and Maximum Likelihood
We attempted to build a more robust, real-time dataset by trialling four different statistical imputation methods:
1. Multiple Imputation by Chained Equations (MICE)
2. K-Nearest Neighbours (KNN)
3. Maximum Likelihood Estimation (MLE)
4. Median Imputation (MI)
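As a rough illustration of the two simplest methods in this list, the sketch below implements median imputation and a basic KNN imputation using only the Python standard library. It is not the code we ran: the company records, the `turnover` feature and the choice of `k` are hypothetical, and MICE and MLE are left out because they need a statistical library to do properly.

```python
import statistics


def median_impute(values):
    """Replace every missing value (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def knn_impute(rows, target, features, k=3):
    """Fill a missing `target` with the mean target of the k nearest complete rows.

    Distance is squared Euclidean over `features`. Assumption: features are
    on comparable scales; real data would need scaling first.
    """
    complete = [r for r in rows if r[target] is not None]
    filled = []
    for row in rows:
        if row[target] is not None:
            filled.append(dict(row))
            continue
        nearest = sorted(
            complete,
            key=lambda c: sum((c[f] - row[f]) ** 2 for f in features),
        )[:k]
        estimate = statistics.mean([c[target] for c in nearest])
        filled.append({**row, target: estimate})
    return filled


# Hypothetical companies; turnover is in millions of pounds.
companies = [
    {"directors": 1, "turnover": 0.1, "employees": 2},
    {"directors": 2, "turnover": 0.3, "employees": 6},
    {"directors": 3, "turnover": 1.0, "employees": 21},
    {"directors": 8, "turnover": 50.0, "employees": 400},
    {"directors": 2, "turnover": 0.2, "employees": None},
]

by_median = median_impute([c["employees"] for c in companies])
by_knn = knn_impute(companies, "employees", ("directors", "turnover"), k=2)

print(by_median[-1])            # median of the four reported values
print(by_knn[-1]["employees"])  # mean of the two most similar companies
```

Even on this toy data the two methods disagree: the median is pulled towards the middle of all reported values, while KNN only looks at similar companies. That gives some intuition for why different imputation methods can produce very different totals.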
Our objective was to tell a regional story about employment in the low carbon and renewable energy industry in the UK.
Out of the four approaches, MICE and MLE produced the highest employee figures. MICE suggested that the UK’s low carbon and renewable energy economy employs 1.69 million people full time. KNN suggested 641,000 and MI 473,000. Not a great result: we couldn’t use any of the approaches to create our regional view. But it was by no means a disaster.
We had found out two things.
Smaller companies tend to be less likely than large companies to report their employee data
Exploring our data, we found that the first quartile of reported employee numbers is 2, the third quartile is 21 (giving an interquartile range of 19) and the median is 6. Further, the three most frequently reported employee values are 2 FTE (8.2%), 1 FTE (7.3%) and 3 FTE (4.8%).
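For readers who want to run the same kind of summary on their own data, here is a minimal sketch using Python's `statistics` module. The employee counts are made up, chosen only so that the quartiles echo the figures above; they are not our dataset.

```python
import statistics
from collections import Counter

# Hypothetical reported employee counts (FTE), one value per company.
reported_fte = [1, 1, 2, 2, 6, 10, 21, 30, 250]

# quantiles(..., n=4) returns the three cut points Q1, median and Q3.
# The "inclusive" method treats the data as the whole population of interest.
q1, median, q3 = statistics.quantiles(reported_fte, n=4, method="inclusive")
iqr = q3 - q1

# Most frequently reported values and their share of all reporting companies.
counts = Counter(reported_fte)
most_common = [
    (value, count / len(reported_fte)) for value, count in counts.most_common(3)
]

print(q1, median, q3, iqr)
print(most_common)
```

The single very large company in the toy data barely moves the quartiles, which is why quartiles describe a skewed distribution like this one better than the mean does.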
Smaller companies mainly consist of active directors
Let’s have a look at Blue Tidal Energy. The company has not reported any employees, but its website says that it employs three people. All three are listed as active directors at Companies House. To test the second assumption, we calculated the sum of all full time employees in our list (approximately 446,000) as well as the sum of all active directors (26,011).
We then excluded the top 20% of companies based on FTE numbers and calculated the exact same values for this subset of all companies.
There were 31,000 full time employees and 10,800 active directors reported. Interestingly, the misrepresented FTE number decreased from a staggering 96.44% to only 34.80% in the second trial.
This simple check suggests that missing employee numbers for smaller companies can be approximated by the number of active directors. On that basis, we expect the estimated employee numbers to be around 65.2% accurate.
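The fallback described above can be sketched in a few lines. This is an illustration under assumptions rather than our production code: the record layout, the `imputed` flag and every company apart from Blue Tidal Energy's three directors are hypothetical.

```python
def fill_with_directors(companies):
    """Use the active director count where no employee number was reported."""
    filled = []
    for company in companies:
        if company["employees"] is None:
            # Fallback: assume the active directors are the workforce.
            filled.append(
                {**company, "employees": company["active_directors"], "imputed": True}
            )
        else:
            filled.append({**company, "imputed": False})
    return filled


def director_share(companies):
    """Active directors as a share of reported FTE, over reporting companies only."""
    reported = [c for c in companies if c["employees"] is not None]
    directors = sum(c["active_directors"] for c in reported)
    employees = sum(c["employees"] for c in reported)
    return directors / employees


# Hypothetical records, except that Blue Tidal Energy has three active directors.
companies = [
    {"name": "Blue Tidal Energy", "active_directors": 3, "employees": None},
    {"name": "Example Solar Ltd", "active_directors": 2, "employees": 4},
    {"name": "Example Wind plc", "active_directors": 8, "employees": 400},
]

filled = fill_with_directors(companies)
print(filled[0]["employees"], filled[0]["imputed"])
print(round(director_share(companies), 4))
```

Keeping an `imputed` flag alongside the filled value means estimated figures can always be separated from reported ones later, for example when benchmarking against official statistics.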
The question of benchmarking
There are clearly issues with official employment data in the UK. They make it difficult to assess how many people are employed in different industries. The lag between sampling and publishing, the modelling techniques used, and our finding that smaller companies are less likely to report their figures all mean that the data should be used cautiously. While imputation may work on a national level, a regional view of employment is more difficult to get right.
The Data City creates real-time datasets, and we benchmark our data against a range of external sources to ensure high quality and to give our clients confidence in using it. Like everyone in the field of industrial analysis, we have to work around out-of-date data, and we’re continually modelling and testing approaches to develop a new industry standard.