Assigning websites to companies is a foundation of what we do at The Data City. The more accurate and precise our website matching is, the better our Real-Time Industrial Classifications (RTICs), Real-Time Standard Industrial Classifications (RSICs), and your Smart Lists are.
V6 of our Industry Engine represents our biggest ever step forward in website matching. For nearly a decade we and our customers have been manually checking website matches and we’ve written new algorithms and trained new models on top of this data. You should expect many fewer wrong matches, more correct matches, and less time between reporting the correct website for a company and seeing that reflected in the Industry Engine.
We are scraping the web more often and finding websites for new companies more quickly. In most cases a website that you report before 5pm should be fixed by 8am the next morning. We will also reflect the latest content on websites we’ve already correctly matched much more quickly. For the first time you’ll see that we can move beyond the idea that one company has one website. And since our new website matching works for the whole world, you should expect the same in our global products.
How does this intersect with group structure?
Companies all around the world have complicated group structures. One such example is Lannis Limited, the 300th largest employer in Britain and a company you’ve never heard of.
But you probably have heard of their subsidiaries which include Iceland Foods, The Food Warehouse, and Piccolino Restaurants.
In V5 of the Industry Engine we matched Lannis Limited to the website of its largest subsidiary, Iceland Foods. In V6 we remove this match – Lannis Limited deliberately has no website – and only assign iceland.co.uk to the supermarket chain.

We have done two things to help navigate this change. First, we list websites of subsidiary companies.
Second, we show the company’s position within the group of companies it is part of in a choice of simplified, brand-centred, and detailed views.
In the simplified view, we can see that Lannis Limited has a parent company, WD FF Limited, which also doesn’t have a website.

In our brand view, we can see that there are three distinct customer-facing brands and five websites within the broader group.

And in our detailed group view, all 45 companies and their associated brands and websites spread across 9 levels of corporate structure are reported.
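As a rough illustration of what a group view has to compute, here is a minimal sketch that walks a group tree and reports the number of companies, the number of levels, and the websites found below a given company. The `Company` class, the function names, and the example group are all hypothetical; this is not how the Industry Engine stores or processes group structure.

```python
from dataclasses import dataclass, field


@dataclass
class Company:
    """Hypothetical node in a group structure tree."""
    name: str
    websites: list[str] = field(default_factory=list)
    subsidiaries: list["Company"] = field(default_factory=list)


def walk(company, level=1):
    """Yield (level, company) for a company and everything below it."""
    yield level, company
    for child in company.subsidiaries:
        yield from walk(child, level + 1)


def summarise(top):
    """Count companies, levels, and distinct websites in a group."""
    nodes = list(walk(top))
    return {
        "companies": len(nodes),
        "levels": max(level for level, _ in nodes),
        "websites": sorted({site for _, c in nodes for site in c.websites}),
    }


def subsidiary_websites(company):
    """Websites matched to companies below `company`, not to `company` itself."""
    return sorted(
        {site for level, c in walk(company) if level > 1 for site in c.websites}
    )


# Invented example: a holding company with no website of its own.
group = Company("Holding Co", subsidiaries=[
    Company("Trading Co", subsidiaries=[
        Company("Shop Brand Ltd", websites=["shop.example"]),
        Company("Cafe Brand Ltd", websites=["cafe.example"]),
    ]),
])

summary = summarise(group)
```

In this sketch the holding company keeps no website of its own, but `subsidiary_websites(group)` still surfaces the websites of the brands beneath it.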

Improving how we understand and share group structure has long been one of the top feature requests we’ve received. The complexity of group structure, the associated websites and brands, and accounting conventions make this difficult.
V6 represents a big step forward. We’re already finding it useful in our own analysis and in our website matching and we’d love to hear what you think. We’re already working on considering linked companies and ensuring that financial accounts are apportioned as well as possible across group structures.
How we measure our quality
With ten years of accumulated work on website matching quality, including thousands of user reports, we are now able to use more advanced processes to automatically estimate and track the quality of our website matching.
By holding back 10% of our known correct and incorrect website matches, we test how well our models select the correct websites for companies they never saw during training. That lets us produce quality control summaries every night.
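The idea of holding back a slice of labelled matches can be sketched as follows. This is a minimal illustration only: the function name, the fixed random seed, and the shape of the records are assumptions, and the real pipeline may split its data quite differently.

```python
import random


def split_holdout(labelled_matches, holdout_fraction=0.10, seed=0):
    """Shuffle labelled matches and hold back a fraction for testing.

    Returns (training, holdout). The holdout records are never shown to
    the model during training, so they give a fair estimate of quality.
    """
    shuffled = list(labelled_matches)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Invented records: (company, website, is_correct_match)
labelled = [
    (f"company-{i}", f"site-{i}.example", i % 4 != 0) for i in range(200)
]
train, holdout = split_holdout(labelled)
```

Fixing the seed keeps the split reproducible from one night to the next, which makes changes in the quality summaries easier to attribute to the model and not to the sample.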

Two headline statistics in our daily model reports are accuracy and precision.
Precision tells us how likely a website match that you see in our product is to be correct. In our latest releases we regularly beat 97% precision.
Accuracy is a broader measure. It also counts the cases where we fail to match a company to its website, which sometimes happens for small companies with basic websites.
Our model is trained to prioritise accuracy and precision with equal weighting, balancing coverage of all companies (accuracy) against the frustration we all feel when we see a mismatch (precision).
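To make the difference between the two measures concrete, here is a small sketch that computes both from held-back test results. The definitions used are assumptions made for illustration: precision is the share of shown matches that are correct, and accuracy is the share of all companies where our answer (a website or a blank) is right. The function name and sample data are hypothetical.

```python
def precision_and_accuracy(pairs):
    """Compute (precision, accuracy) from (predicted, actual) website pairs.

    None means "no website". Precision only looks at matches we would
    show; accuracy looks at every company, including unmatched ones.
    """
    pairs = list(pairs)
    shown = [(p, a) for p, a in pairs if p is not None]
    correct_shown = sum(p == a for p, a in shown)
    correct_all = sum(p == a for p, a in pairs)
    precision = correct_shown / len(shown) if shown else 0.0
    accuracy = correct_all / len(pairs) if pairs else 0.0
    return precision, accuracy


# Invented test results
results = [
    ("a.example", "a.example"),  # correct match
    ("b.example", "b.example"),  # correct match
    ("c.example", "x.example"),  # wrong match: hurts precision and accuracy
    (None, "d.example"),         # incorrectly unmatched: hurts accuracy only
    (None, None),                # correctly left blank
]
precision, accuracy = precision_and_accuracy(results)
```

Under these assumed definitions, leaving a company blank never lowers precision, which is why a measure like accuracy is needed alongside it to reward coverage.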
Where we incorrectly leave a company without a website, it will almost always be a very small company with a poor-quality website. These are unlikely to be key companies within a sector or those that contribute substantially to aggregate economic statistics for the sector.
Our updated techniques allow these company to website matches to be added back in when reported and approved, usually overnight. Making reports like these always improves the quality of our website matching algorithms.
How confident in the new matches are we?
Our updated techniques still produce a confidence rating for every website match, which we label low, medium, high, or very high. These estimates are now based on much more data and a better calibrated technique. Over 75% of our matches are high or very high confidence in this new method. Fewer than 3% of matches are in the low confidence band.

We suggest manually checking website matches in your lists where they have a low confidence.
We think you’ll barely ever find a very high or high confidence match that is wrong. If you do, it’s likely that the second best match we found will be the right one and likely that the two websites were so similar that the wrongly matched website didn’t change our understanding of what the company does.
In both cases if you report the correct websites we will probably have fixed them by the next morning, and within a week at worst. And because our model learns from every report, your report will fix other website matches too.
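The manual check suggested above could look something like this if you have exported a list with a confidence label per match. The record layout, field names, and sample data are assumptions for illustration, not the format of any Data City export.

```python
from collections import Counter

BANDS = ("low", "medium", "high", "very high")


def band_shares(matches):
    """Share of matches in each confidence band."""
    counts = Counter(m["confidence"] for m in matches)
    total = sum(counts.values())
    return {band: counts[band] / total for band in BANDS} if total else {}


def needs_manual_check(matches):
    """Matches worth checking by hand: those in the low confidence band."""
    return [m for m in matches if m["confidence"] == "low"]


# Invented list of matches
my_list = [
    {"company": "Alpha Ltd", "website": "alpha.example", "confidence": "very high"},
    {"company": "Beta Ltd", "website": "beta.example", "confidence": "high"},
    {"company": "Gamma Ltd", "website": "gamma.example", "confidence": "medium"},
    {"company": "Delta Ltd", "website": "delta.example", "confidence": "low"},
]

shares = band_shares(my_list)
to_check = needs_manual_check(my_list)
```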
How many companies are matched to how many websites?
We know that previously a high number of websites were each matched to many companies.
Our 1,000 most frequently matched domains were matched to 56,000 companies. Now that figure is 36,000 companies, and the matches are cleaner. A big part of this is that we are now more accurately matching websites to just the right company in a group and not to every company in the group.
We have eliminated a small but previously persistent amount of over-matching to news websites like liverpoolecho.co.uk and birminghammail.co.uk, which had 193 and 188 companies matched respectively. These two websites now correctly have no matched companies, as both newspapers are produced by Reach PLC (00082548), whose corporate website reachplc.com is correctly identified.
The comparison of number of companies matched to every website is shown in the following set of figures.
We see a small reduction in the number of websites matched to a single company, reflecting a greater focus on precision in our new methods.

Reductions are bigger for websites matched to between 2 and 99 companies. This reflects our greater success at matching websites to just a few companies within a group structure, instead of the majority of a group structure as in the past.

Our biggest reduction in website matches is the case of a single website being matched to over 100 companies. In addition to a better understanding of group structure, this reflects a focus on removing business directory aggregation sites as aggressively as possible.

Where examples of a single website matched to over 100 companies remain, they are merited. For example, our most matched website is specsavers.co.uk. We match it to 1,260 companies.
This is correct as there are over 1,500 legal entities within the Specsavers group structure and roughly 940 stores, which are all separately registered.
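The comparison described in this section boils down to counting how many companies each website is matched to and grouping those counts into bands. A minimal sketch, with hypothetical function names and invented data, and with the assumption that a website matched to exactly 100 companies falls in the top band:

```python
from collections import Counter


def companies_per_domain(matches):
    """Count distinct companies matched to each domain.

    `matches` is an iterable of (company_number, domain) pairs.
    """
    return Counter(domain for _, domain in set(matches))


def bucket(count):
    """Band a per-domain company count: 1, 2-99, or 100+."""
    if count == 1:
        return "1"
    if count < 100:
        return "2-99"
    return "100+"


def bucket_summary(matches):
    """Number of domains falling in each band."""
    per_domain = companies_per_domain(matches)
    return dict(Counter(bucket(n) for n in per_domain.values()))


# Invented matches: one single-company site, one pair, one large chain.
matches = [
    ("00000001", "solo.example"),
    ("00000002", "pair.example"),
    ("00000003", "pair.example"),
]
matches += [(f"1{i:07d}", "chain.example") for i in range(120)]

summary_by_band = bucket_summary(matches)
```

Running the same summary on two versions of the matches gives a like-for-like view of how the bands have shrunk or grown.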
How this affects RTICs
We add companies to RTICs every month. When we assign a website to a company for the first time we calculate the RTICs for that website and add the company to those RTICs following internal checks. When a company is assigned a new website, the company’s old RTICs are removed and new RTICs are calculated for the new website, again subject to internal checks.
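The update rule described above can be sketched in a few lines. Everything here is hypothetical: `classify` stands in for whatever calculates RTICs from a website, `passes_checks` stands in for the internal checks, and the company and RTIC names are invented.

```python
def reassign_website(rtics_by_company, company, new_website, classify, passes_checks):
    """Replace a company's RTICs with those calculated for its new website.

    Old RTICs are dropped, new ones are calculated from the website, and
    only those that pass the checks are kept. Returns what changed.
    """
    old = rtics_by_company.get(company, set())
    new = {rtic for rtic in classify(new_website) if passes_checks(company, rtic)}
    rtics_by_company[company] = new
    return {"removed": old - new, "added": new - old}


# Invented classifier output and a permissive check, for illustration only.
rtics_for_site = {
    "parent.example": {"RTIC A", "RTIC B"},
    "subsidiary.example": {"RTIC B", "RTIC C"},
}


def classify(website):
    return rtics_for_site.get(website, set())


def passes_checks(company, rtic):
    return True


memberships = {"Example Ltd": {"RTIC A", "RTIC B"}}
change = reassign_website(
    memberships, "Example Ltd", "subsidiary.example", classify, passes_checks
)
```

The same function covers a first-time match: a company with no previous RTICs simply has nothing removed.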
With V6, especially due to the improvements we’ve made to group structure, we are fixing and adding more website matches than ever and RTICs will change more than usual. But this won’t happen straight away. We’ll be working through our RTICs in the coming months to add new companies and remove those whose website match we’ve corrected, that have stopped operating in that sector, or that have stopped operating completely.
You’ll notice new companies being added first so you can track new entries to your RTICs in real time, as always. It will take us a bit longer to remove companies whose website matches we fix. To explain this, let’s work through the example of the Net Zero RTIC and its Carbon Capture vertical.
Drax is the UK’s largest power station. Formerly it burned coal, but today it burns wood pellets. It has long been at the forefront of British experiments and trials on carbon capture and storage.
DRAX GROUP PLC, company number 05562053, is the top company of a group of 43 companies in a tree up to six levels deep. We do a better job than ever of capturing group structure in V6 and that means that fewer websites are assigned to companies in the structure. Where an RTIC contains Drax or any of the companies in the group, we may need to change which members of the Drax group we include in the RTIC. In the case of the Carbon Capture vertical this will mean focusing on the Drax Research and Innovation company group and its Drax CCS Limited subsidiary.

Behind the scenes we’ve introduced full RTIC versioning. That means that as we roll out these updates you will be able to compare the RTIC summary statistics between versions. We are finalising the UI to make this seamless and will release this update in a few weeks.
That means that as we update our RTICs you’ll be able to stick with the old version of the RTIC if you’re running long-term longitudinal studies. Or you can flip to the newest version of our RTICs to make sure you’re getting the most up to date and accurate definition of our rapidly evolving sectors.
What else and what’s next?
V6 of the Industry Engine is our biggest release since we introduced our instant classifier technology. There are even more features than we’ve outlined here in our release notes, such as the new and much faster process for building Smart Lists (previously ML Lists).
The first change you’ll notice is the new sign on system. Many of our customers need more security and we’ve enabled two factor authentication for all users. You can also use Google or Microsoft identity providers which are standard for many users and make deeper security even easier.
In addition to visible changes like this, we’ve made big improvements behind the scenes. These will enable more of the features you’ve asked for to be released in the coming months.
- We’re working on a new map view that’s perfect for showing a business ecosystem at a glance. In fact, we’ve already rolled this out in our US product.
- We’re building on the company and RTIC timeline features and finalising a new Watchlist feature so you can track changes to companies, places, and sectors more quickly and easily.
- We’re upgrading our search and filtering interface so you can explore, find, and select companies, places, and sectors even faster than today.
- We’re simplifying lists and making it easier to put whatever companies you like into collections for easy tracking and analysis.
- We’re making it easier to manage organisations and API usage.
But while you wait for that we want to hear how you find our improvements in V6. We think you’ll love the improvements in data quality and the new features.
Not a customer yet? You can sign up for a free trial and experience V6 of the Industry Engine yourself.