We have been working internally to refine how we communicate the quality of our data.
We’re proud of our data. We’ve worked hard to deeply understand the intricacies and make sure that it is accurate, correct, and reflective of reality. Now, we want to help users to understand both the decisions that we’ve made, and how these have led to the high-quality data that underpins everything that we do. By doing so, we hope to instil the same trust and confidence in our data that drives our work every day.
Internally, we have focused heavily on assessing our data based on these measures of quality: precision, accuracy, and coverage.
Let’s start this series of discussions by focusing on the high quality of our company-to-website matches.
Why do high quality company-to-website matches matter?
At The Data City, we tell you what companies do. We do this by matching companies to their websites and analysing the content using Machine Learning (ML). This allows us to build our Real-Time Industrial Classifications (RTICs). We feed the text from each website into our proprietary ML technology to reliably classify companies. Matching the correct website to the correct company is therefore fundamental. You can read more about our ML on our knowledge base.
We’re fortunate in the UK to have an open business register in Companies House, which gives us access to information on each company, e.g. company name, registered address and directors. We start with Companies House and enhance the data by matching companies to websites. Since Companies House does not mandate the submission of company websites, we are responsible for managing this ourselves.
Accurate matching and data extraction from company websites enhance several of our data points. For example, our innovation score and location data rely on website text. To maintain reliability in subsequent analysis of these data points, it is essential to achieve comprehensive coverage and high precision.
Match accuracy and QA Results
Taking 1.58m to be the estimated number of operationally active UK companies with a possible website match (the origin of this value will be discussed in detail in the next section), the graphs below help us to understand the coverage and quality of our current website matches.
Of these 1.58m companies, we have matched 1.47m (93%) to a website (illustrated in the top bar).
Manual QA of almost 1,000 of these 1.47m matches has indicated that 92% are correctly matched.
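As a rough sanity check on these two figures, the sketch below recomputes the coverage from the rounded values quoted above and puts an approximate 95% interval around the QA result. The sample size of 1,000 is an assumption (the post only says "almost 1,000"), the interval assumes the QA sample was drawn at random, and the function names are illustrative rather than part of any Data City tooling.

```python
import math

# Figures quoted in the post (rounded).
ESTIMATED_WITH_WEBSITE = 1_580_000  # estimated companies with a possible website match
MATCHED = 1_470_000                 # companies currently matched to a website
QA_CORRECT_RATE = 0.92              # share of sampled matches judged correct

# Assumption: the post says "almost 1,000", so 1,000 is only illustrative.
QA_SAMPLE_SIZE = 1_000


def coverage(matched: int, total: int) -> float:
    """Share of the estimated population that has a website match."""
    return matched / total


def wilson_interval(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a sample proportion."""
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


cov = coverage(MATCHED, ESTIMATED_WITH_WEBSITE)
low, high = wilson_interval(QA_CORRECT_RATE, QA_SAMPLE_SIZE)
print(f"Coverage: {cov:.0%}")
print(f"QA accuracy: {QA_CORRECT_RATE:.0%} (approx. 95% interval {low:.1%} to {high:.1%})")
```

The interval is only as good as the random-sample assumption; if the QA sample was stratified or hand-picked, a different calculation would be needed.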
For our company-website matches, we also produce automated metrics of our confidence in each match. The confidence ranges from Very High to Low. As you can see in the bottom bar chart, a large proportion of matches have Very High confidence, which is a positive sign.
We have several improvements on the way, such as improving the quality of our underlying web text, approaching companies in a group structure more strategically, improving our blacklist, and experimenting with an ML approach.
What do these results really show?
At The Data City, we aspire to match every company to their website with 100% accuracy.
To do this, we need to know…
(1) “How many companies are there in the UK, that are actually doing something?” (i.e. that we really are interested in)
(2) “How many of these might have a website?”
In the analysis below, we compare our data to the Business Population Estimates (BPE) to verify that it holds up against official statistics. Our starting point is the latest monthly copy of all companies on Companies House. Before applying any filters, there are over 5.64m companies. The BPE reports a similar value: 5.5m.
Next, we determine which companies qualify as operationally active employers, meaning they are actively conducting business and employing individuals to support those operations.
The ONS recorded 2.6 million private sector businesses, and we can replicate this in our data. We use a stricter definition, applying several filters to our companies:
- Removing those in liquidation
- Removing those in administration
- Removing those filing dormant accounts (based on their most recent filing)
- Keeping only those with at least one employee
- Keeping only those with a Companies House company status set only to ‘Active’
This results in 2.31m operationally active employers.
We can match the ONS’ 2.6m value if we relax our filters to also include companies that are in liquidation, in administration and/or filing dormant accounts. This suggests our analysis is representative of the business population.
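The strict and relaxed filter sets described above can be sketched as a single predicate. This is a minimal illustration and not our production pipeline: the `Company` fields, the `strict` flag and the sample records are all hypothetical, and the sketch assumes the liquidation, administration and dormant flags are held separately from the company status field.

```python
from dataclasses import dataclass


@dataclass
class Company:
    # All field names are hypothetical, chosen for this sketch only.
    name: str
    company_status: str            # Companies House status, e.g. "Active"
    in_liquidation: bool
    in_administration: bool
    latest_accounts_dormant: bool  # based on the most recent filing
    employees: int


def is_operationally_active_employer(
    c: Company, strict: bool = True, min_employees: int = 1
) -> bool:
    """Apply the filters listed above.

    With strict=False, companies in liquidation, in administration or
    filing dormant accounts are kept (the relaxed variant).
    """
    if c.employees < min_employees:
        return False
    if c.company_status != "Active":
        return False
    if strict and (c.in_liquidation or c.in_administration or c.latest_accounts_dormant):
        return False
    return True


# Made-up records, purely to exercise the filter.
companies = [
    Company("A Ltd", "Active", False, False, False, 5),
    Company("B Ltd", "Active", True, False, False, 3),
    Company("C Ltd", "Active", False, False, True, 1),
    Company("D Ltd", "Active", False, False, False, 0),
    Company("E Ltd", "Active - Proposal to Strike off", False, False, False, 2),
]

strict_names = [c.name for c in companies if is_operationally_active_employer(c)]
relaxed_names = [
    c.name for c in companies if is_operationally_active_employer(c, strict=False)
]
print(strict_names)
print(relaxed_names)
```

The `min_employees` parameter is exposed so the employee threshold can be varied without touching the other filters.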
When we compare our 2.31m figure to Table C in the Business Population Estimates, which lists 1.43m businesses in the UK private sector, there’s a substantial discrepancy.
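When we restrict our operationally active employers to those with at least two employees, we get a similar value: 1.35m. This is consistent with the BPE’s own definitions, which importantly state that the:
“with no employees” category comprises sole proprietorships and partnerships with only a self-employed owner-manager(s) and companies with one employee, assumed to be a working proprietor.
In other words, the BPE does not count a company with a single employee as an employer, which would account for much of the gap between our 2.31m and its 1.43m.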
Having established which companies are both employers and operationally active, we should now consider the likelihood of these companies having a website.
Results published by the UK government from the UK Business Data Survey indicate that 68% of businesses have a website. Applying this percentage to our 2.31m operationally active employers produces an estimate of around 1.58m, which aligns with our existing coverage and domain expertise as a likely maximum number of website matches.
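The arithmetic behind this estimate is a single multiplication, sketched below using the rounded figures quoted in this post. With these rounded inputs the product comes out at roughly 1.57m; the 1.58m quoted above presumably reflects unrounded inputs, which is an assumption on our part here. The function name is illustrative.

```python
# Rounded figures quoted in the post.
OPERATIONALLY_ACTIVE_EMPLOYERS = 2_310_000
SHARE_WITH_WEBSITE = 0.68  # share of businesses with a website, per the survey


def estimate_with_website(employers: int, share: float) -> float:
    """Estimated number of employers that could have a website to match."""
    return employers * share


estimate = estimate_with_website(OPERATIONALLY_ACTIVE_EMPLOYERS, SHARE_WITH_WEBSITE)

# Rounded inputs give roughly 1.57m; the post's 1.58m is assumed to come
# from the unrounded underlying figures.
print(f"Estimated companies with a website: {estimate / 1e6:.2f}m")
```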
If you’re still with us, hopefully you can start to understand some of the challenges of working with company data and, more specifically, of understanding the quality of this kind of data. We hope you can also begin to see how seriously we take understanding the quality of our own data.
We are continually aiming to improve. Completing this blog post and shifting our focus to quality has pushed our development team to adopt more measurable KPIs for improving our website matching. By working in the open, we can hold ourselves accountable to you, our users, for improving our data.