URL Matching

The Data City is overcoming the limitations of SIC codes with our RTIC methodology. Companies registered at Companies House choose their own classification, but the available options are often limiting, and companies that do many things find it difficult to choose just one or two SIC codes. TDC has improved on the data provided by Companies House, and a significant part of that work involves matching each company to its website. Once the website is matched, we scrape its text and extract key information such as location details. We use the website text in the classification stage of building a list to produce accurate and reliable RTICs.

Method

For each company in Companies House, we pool a list of potential websites from a range of sources. We scrape the potential websites and then run an algorithmic matching process that combines all our sources of data into a best-guess website for every company. The best-guess website is decided through a logical scoring process which looks for company information within the scraped text. The score must exceed a specific threshold for a website to be matched, so the best guess for a company may be no match at all. Our custom-built web scraper handles redirects elegantly and can crawl a single website in 0.2s.
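The scoring step can be illustrated with a minimal sketch. It is not our production algorithm: the factor names, weights, threshold and the `Candidate`, `score` and `best_guess` names are all hypothetical, chosen only to show how weighted evidence and a threshold can produce either a best-guess website or no match.

```python
from dataclasses import dataclass

# Hypothetical weights and threshold, for illustration only.
WEIGHTS = {"company_name": 5.0, "postcode": 3.0, "company_number": 8.0}
THRESHOLD = 8.0


@dataclass
class Candidate:
    url: str
    text: str  # text scraped from the candidate website


def score(candidate, company):
    """Sum the weights of every company field found in the scraped text."""
    text = candidate.text.lower()
    return sum(
        weight
        for field, weight in WEIGHTS.items()
        if company.get(field) and company[field].lower() in text
    )


def best_guess(candidates, company):
    """Return the top-scoring URL, or None if no score exceeds the threshold."""
    scored = [(score(c, company), c.url) for c in candidates]
    if not scored:
        return None
    top_score, top_url = max(scored)
    return top_url if top_score > THRESHOLD else None


company = {
    "company_name": "Acme Widgets Ltd",
    "postcode": "LS1 4AP",
    "company_number": "01234567",
}
good = Candidate(
    "https://acme.example",
    "Acme Widgets Ltd, Leeds LS1 4AP. Company number 01234567",
)
weak = Candidate("https://other.example", "Widgets for sale near LS1 4AP")

print(best_guess([good, weak], company))
print(best_guess([weak], company))
```

In this toy example the first call returns the Acme URL because three factors fire, while the second returns `None` because a postcode alone does not exceed the threshold.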

Back in 2018, our scraper crawled a maximum of 7 pages per domain. This limited our ability not only to match companies accurately but also to classify them correctly. Today, our scraper crawls up to 75 pages per domain. In 2019, we improved the information we collect on companies to support our matching and therefore increase our confidence in each match. Today, we have over 13 reasons why a company may be matched to a website, each with its own weighted score depending on the field.

Throughout the years, we’ve been:

  • Improving the quality of the text scraped
  • Adding input sources for potential matches
  • Improving the methodology
  • Increasing the speed for matching
  • Increasing the speed for scraping
  • Harnessing more information on incorrect matches and/or missing matches
  • Adding websites to a blacklist

Quality

We constantly check URLs in the product, report fixes, and make improvements. We have performed formal QA at various scales.

In 2018, we assessed the false positive rate for a large number of companies during list creation for a specific project. This used a pre-release (v. < 1) version of The Data City platform.

In 2020, we performed a rigorous analysis of 99 companies and their URL matching quality for a gold-standard AI sector list. In v1.7 we achieved an 85% true positive rate. This increased to 91% in v2.0.

These two earlier evaluations differ in important ways.

  • Evaluation 1 assessed the quality of URL matching in a list produced by The Data City.
  • Evaluation 2 assessed the accuracy and coverage of The Data City platform to report URLs for known companies.

They were also performed on versions of The Data City database with very different numbers of companies matched to URLs.

In the latest version of the product, we carried out our largest ever QA process on URL matching quality, using a list produced by The Data City platform. We investigated roughly 7,000 companies in 24 industry verticals of the FinTech sector. We chose this sector as it is particularly volatile.

For ~5,000 companies (72%), we are >99% confident they are correctly matched.

For ~1,100 companies (16%), we are >95% confident they are correctly matched.

For ~900 companies (12%), we are >90% confident they are correctly matched.

Accuracy:

We estimate our website matching to be 95% correct.
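As a rough consistency check, and not a description of how the estimate above was produced, the confidence tiers can be combined into a floor on the expected share of correct matches. The sketch below assumes each tier's confidence can be read as a minimum probability that a company in that tier is correctly matched; `TIERS` and `accuracy_floor` are hypothetical names.

```python
# Tier shares and minimum confidences as quoted in the QA results above.
TIERS = [
    (0.72, 0.99),  # ~5,000 companies, >99% confident
    (0.16, 0.95),  # ~1,100 companies, >95% confident
    (0.12, 0.90),  # ~900 companies, >90% confident
]


def accuracy_floor(tiers):
    """Weight each tier's minimum confidence by its share of companies."""
    return sum(share * min_conf for share, min_conf in tiers)


print(f"{accuracy_floor(TIERS):.1%}")
```

Under that reading the floor works out at about 97%, so the 95% figure sits on the cautious side of these tiers.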

Aspects to further explore

Some of the elements of the algorithm could do with a little further consideration.

  • The scores are somewhat arbitrarily assigned to the reasons (“factors”) for matching. Could/should the scores be adjusted to better represent the factors which are more likely to drive a good match?
  • Is the threshold appropriate, or is there a more appropriate or optimum threshold?
  • How many companies have an overridden score? Is this the right thing to do?

This could be achieved by exploring whether the data generated as part of the above processes could be employed as training sets to:

  • Identify which feature points are the most useful in determining a match (or alternatively in determining a false match), to support a weighting system.
  • Explore whether updated matches appeared further down in the list of candidate URL matches, and whether an improved weighting system might have selected them as the top match.
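A minimal sketch of how QA results could feed these questions, assuming each reviewed match is stored as a hypothetical record of (factors that fired, total score, whether the match was judged correct). The record layout and the `factor_precision` and `sweep_thresholds` names are illustrative assumptions, not part of the platform.

```python
from collections import defaultdict


def factor_precision(records):
    """Share of reviewed matches judged correct, per factor that fired."""
    fired = defaultdict(int)
    correct = defaultdict(int)
    for factors, _score, ok in records:
        for factor in factors:
            fired[factor] += 1
            correct[factor] += int(ok)
    return {factor: correct[factor] / fired[factor] for factor in fired}


def sweep_thresholds(records, thresholds):
    """For each threshold, return (threshold, precision, share matched)."""
    rows = []
    for t in thresholds:
        accepted = [ok for _factors, score, ok in records if score > t]
        precision = sum(accepted) / len(accepted) if accepted else None
        rows.append((t, precision, len(accepted) / len(records)))
    return rows


# Toy QA records: (factors fired, total score, judged correct?)
records = [
    ({"name", "postcode"}, 12, True),
    ({"name"}, 7, False),
    ({"postcode", "phone"}, 9, True),
    ({"phone"}, 6, True),
    ({"name"}, 3, False),
]

print(factor_precision(records))
print(sweep_thresholds(records, [5, 8]))
```

Factors with low precision are candidates for a lower weight, and the sweep makes the trade-off between precision and the share of companies matched explicit for each candidate threshold.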

Finally, there is also interest in dropping some of the matching sources, if possible. The scope for this will also be explored as part of this research, specifically:

  • Quantifying the value of each URL match source
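One simple way to quantify a source's value, sketched under the assumption that we hold a set of verified company-to-URL matches and know which source proposed which candidates. It counts how often each source proposed the correct URL and how often it was the only source to do so. The data layout and the `source_value` name are hypothetical.

```python
def source_value(correct_url, candidates_by_source):
    """
    correct_url: {company_id: verified URL}
    candidates_by_source: {source: {company_id: set of candidate URLs}}
    Returns, per source, how many correct URLs it proposed ("hits")
    and how many of those no other source proposed ("unique").
    """
    found_by = {
        company_id: {
            source
            for source, candidates in candidates_by_source.items()
            if url in candidates.get(company_id, set())
        }
        for company_id, url in correct_url.items()
    }
    report = {}
    for source in candidates_by_source:
        hits = sum(1 for sources in found_by.values() if source in sources)
        unique = sum(1 for sources in found_by.values() if sources == {source})
        report[source] = {"hits": hits, "unique": unique}
    return report


correct_url = {"A": "a.example", "B": "b.example", "C": "c.example"}
candidates_by_source = {
    "search": {"A": {"a.example"}, "B": {"b.example", "x.example"}},
    "registry": {"A": {"a.example"}, "C": {"c.example"}},
}

print(source_value(correct_url, candidates_by_source))
```

A source with many hits but few unique hits is largely duplicated by the others, which makes it a candidate for dropping.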

Estimating coverage

The UK government releases some key statistics which help us estimate how many businesses it is actually possible to match to a website.

Using the business population estimates,

“the UK private sector business population comprised 3.1 million sole proprietorships (56% of the total), 2.1 million actively trading companies (37%) and 353,000 ordinary partnerships (6%)”.

The Data City’s database is mainly focused on the 2.1m actively trading companies. We do not know how many of these businesses have no website, but we can assume that a proportion do not, and those can never be matched. Our current database contains 1.7m matches, which is around 81% of the 2.1m. Because companies without a website are included in that denominator, our coverage of the companies that can actually be matched is higher than 81%.

Using the Value Added Tax (VAT) annual statistics,

The Annual UK VAT Statistics 2021 to 2022 include a National Statistics table of the VAT population by trader status. It shows around 1.85m incorporated companies, so our matching coverage is around 92%.

Using Companies House management information tables,

In the Companies House management information 2022 to 2023 Excel file, Table 11 (Annual accounts registered at Companies House by accounts type for the year 2022-23) lists:

  • 82,806 companies submitting full accounts.
  • 65,442 companies submitting small accounts.
  • 54 companies submitting medium accounts.
  • 23,466 companies submitting group accounts.
  • 1,566,095 companies submitting micro-entity accounts.

These total 1,737,863, or roughly 1.74m companies. These are the companies we believe to be actively trading.

The real answer is probably somewhere in between the three estimates.
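The arithmetic behind the three estimates can be reproduced with a short sketch. All figures are the ones quoted above; the variable names and the `coverage` helper are just for illustration.

```python
MATCHED = 1_700_000  # matches in the current database

# Table 11 figures quoted above, by accounts type.
accounts = {
    "full": 82_806,
    "small": 65_442,
    "medium": 54,
    "group": 23_466,
    "micro_entity": 1_566_095,
}

denominators = {
    "business population estimates": 2_100_000,
    "VAT statistics": 1_850_000,
    "Companies House accounts": sum(accounts.values()),
}


def coverage(matched, total):
    """Matched companies as a percentage of the estimated population."""
    return round(100 * matched / total, 1)


estimates = {
    name: coverage(MATCHED, total) for name, total in denominators.items()
}

for name, pct in estimates.items():
    print(f"{name}: {pct}%")
```

This gives roughly 81%, 92% and 98% respectively, which shows how sensitive the coverage figure is to the choice of denominator.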
