A billion domains in the database: marketing or real protection?

Today it is difficult to imagine life without the Internet. Every day, people open their browser to read the news, find out the weather forecast, listen to music, watch movies, and chat with friends. Surfing the Internet can be either a purposeful search for the necessary information, or a random "wandering" through links and sites.

The number of websites keeps growing: according to the Siteefy portal, there are more than 1.1 billion websites on the World Wide Web. According to Mediascope, 86% of Russians use the Internet, spending an average of about 4.5 hours a day online, with 51% of that time on social media. Meanwhile, in the corporate environment, exchanging information through web applications and corporate messengers and using business applications for work have become daily routine. Users' "personal" habits often carry over into the corporate environment, and aimless surfing or social media use affects employee productivity as well as the security of corporate information systems.

To protect against malicious resources and keep working time productive, cybersecurity solutions of the SWG (Secure Web Gateway) class use a mechanism for categorizing web resources. These solutions provide access control to web resources, content filtering by category, threat blocking, and control over data downloads and transfers. They take into account employees' roles, work schedules, and tasks, without interfering with business processes or creating an excessive burden on administrators.

Categorization of web resources is one of the basic mechanisms for profiling access. Using the domestic SWG system Solar WebProxy as an example, let's consider approaches to building a categorization database and assess how high-quality filtering and categorization of Internet resources is achieved.

The modern high-performance SWG solution Solar WebProxy uses the webCAT module, which provides continuous updating of site categories, including obtaining up-to-date information on compromised and malicious resources.

However, should the database of categorized resources really contain all the resources available on the Internet to ensure security and access control? Let's try to figure it out together with Olga Sharapatova, senior analyst at Solar WebProxy, Solar Group.

Website development
Hundreds of thousands of websites appear every day, but not all of them stay online for long. Every website starts as an idea, is then developed, filled with content, promoted and maintained, and may eventually be closed.

At any stage of the site's development, a failure may occur - for example, the owner may forget to pay for hosting, and then the site will be put up for sale. In this case, the fate of the site is in the hands of its new owner, who bought the domain name.

The site can also be hacked, which threatens the loss of personal data, the introduction of malicious code, manipulation of content and reputational risks.

At the end of the lifecycle, the site first becomes unavailable, then it is deleted, and after that the domain name is put up for sale and can be purchased by a new owner. 

The volume of the database — what should it be?
It is widely believed that the more resources are contained in the database, the better, because this ensures maximum coverage and, as a result, security. Yes, but there are nuances...

1. Dead domains.

Once a resource's lifecycle ends, it becomes completely unavailable, and users who try to open it get an access error. The number of "dead sites" on the Internet is huge, and storing data about them is simply pointless: it inflates the database and demands more of the client's infrastructure capacity without creating real value for the business task.

2. Resources without content.

Newly registered resources initially have no content, and sometimes this situation persists for quite a long time. A resource without content has no business value for users, so keeping it in the database makes no practical sense.
Technical domains deserve a separate mention: they may also have very little content, but for them the presence of a correctly assigned category is essential, because without it the operation of important Internet resources may be disrupted. Where access to resources of a certain category must be provided, the technical resources associated with them must be available as well.

3. Subdomains of the main domain.
Each site has a second-level domain name, and this name can have many subdomains:

Figure 1: An example of a domain name.
Examples include forums where each topic lives on a separate subdomain, or a commercial site with offices in different cities. For instance, the domain t2.ru has the subdomains msk.t2.ru and spb.t2.ru. Technically these are different domain names, but in practice they share the same subject matter. Assigning categories to each such subdomain only "breeds" duplicate categories and "inflates" the database with meaningless entries. It is more efficient to inherit the categories of the parent domain, but only if the categories really match.
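The inheritance idea can be illustrated with a minimal Python sketch. It is not the webCAT implementation: the `CATEGORY_DB` dictionary, the category names, and the `lookup_categories` function are hypothetical, and the lookup simply walks from the requested host up to its second-level domain without consulting a public suffix list.

```python
from __future__ import annotations

# Hypothetical category database: a subdomain gets its own entry only
# when its categories differ from those of its parent domain.
CATEGORY_DB = {
    "t2.ru": {"telecom"},
    "example.org": {"business"},
    "forum.example.org": {"forums"},
}


def lookup_categories(host: str, db: dict[str, set[str]]) -> set[str] | None:
    """Return the categories of the closest known ancestor of `host`."""
    labels = host.lower().rstrip(".").split(".")
    # Try the full name first, then drop the leftmost label each time.
    # Stop at two labels so a bare top-level domain is never matched.
    for start in range(max(len(labels) - 1, 1)):
        candidate = ".".join(labels[start:])
        if candidate in db:
            return db[candidate]
    return None
```

With this layout, msk.t2.ru and spb.t2.ru need no entries of their own: both resolve to the categories stored for t2.ru, while forum.example.org keeps a separate entry because its categories differ from those of example.org.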

What determines the quality of the database of categorized resources?
A distinctive feature of Internet content in the broad sense is its variability. The content of websites is diverse and heterogeneous, and sometimes it is missing altogether: a site may be newly created and not yet filled with content, or it may be parked for later sale. For users, regulators, vendors, and everyone else who relies on web categorization in one way or another, the content of such sites may change unpredictably. It is impossible to predict what information will appear on a web resource that is currently, for example, under development. For this reason, it is important to check regularly that websites still have correctly assigned categories.
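One way to picture regular rechecking is a schedule in which volatile resources are verified more often than stable ones. The sketch below is only an assumption about how such a schedule might look: the state names, the intervals in `RECHECK_INTERVALS`, and the `due_for_recheck` function are hypothetical and are not taken from Solar WebProxy.

```python
from datetime import datetime, timedelta

# Hypothetical recheck intervals: sites whose content is likely to change
# soon (parked or under construction) are verified more often.
RECHECK_INTERVALS = {
    "parked": timedelta(days=1),
    "under_construction": timedelta(days=1),
    "default": timedelta(days=30),
}


def due_for_recheck(entries, now):
    """Return the domains whose last check is older than their interval.

    `entries` is an iterable of (domain, state, last_checked) tuples.
    """
    due = []
    for domain, state, last_checked in entries:
        interval = RECHECK_INTERVALS.get(state, RECHECK_INTERVALS["default"])
        if now - last_checked >= interval:
            due.append(domain)
    return due
```

The exact intervals would depend on how quickly content changes in practice and on the capacity available for verification.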

A critical criterion for the quality of a database of categorized resources is the relevance of the categories. Web resources may become outdated, mutate, become vulnerable, or change ownership and thematic focus. For example, a news portal may turn into an advertising platform, and an online store may be attacked by hackers and begin distributing malicious content.

As the points above show, a formally large database does not provide real threat coverage. To protect and control Internet access, it is important to "audit" and verify the categories in the categorizer database regularly and promptly, refining the categories and protecting users from sudden threats caused by changed content or malicious activity.

The categorization database also needs to be updated frequently, and each update should clear it of "garbage": "dead domains", resources without content, and subdomains of the same type.
At the same time, an unreasonably large database can complicate and slow down the process of updating categories, which puts the correct operation of content filtering policies at risk.
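The cleanup described above can be sketched as a simple filter. This is an illustration under stated assumptions, not the vendor's procedure: the `Record` fields (`reachable`, `has_content`, `technical`) and the `prune` function are hypothetical, and only the immediate parent domain is compared when looking for duplicate subdomains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical database entry; the flags come from earlier checks."""
    domain: str
    categories: frozenset
    reachable: bool = True
    has_content: bool = True
    technical: bool = False


def parent_of(domain):
    """Return the immediate parent domain, or None for a second-level name."""
    _, _, tail = domain.partition(".")
    return tail if "." in tail else None


def prune(records):
    """Drop dead domains, empty non-technical sites, and duplicate subdomains."""
    # Pass 1: keep reachable records that have content or are technical.
    live = [r for r in records if r.reachable and (r.has_content or r.technical)]
    by_domain = {r.domain: r for r in live}
    # Pass 2: drop subdomains whose categories match those of a live parent.
    kept = []
    for r in live:
        parent = by_domain.get(parent_of(r.domain))
        if parent is not None and parent.categories == r.categories:
            continue
        kept.append(r)
    return kept
```

Technical domains survive the filter even without content, and a subdomain is kept whenever its categories differ from those of its parent or the parent itself has been removed.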

The database of categorized resources should be optimal rather than maximal in volume, and judging the quality of categorization simply by the formal number of categorized resources does not meet the real needs of customers.

Optimality and relevance are the key to success
When choosing a content filtering solution, attention often goes to one of the most striking indicators: the size of the database. Many companies highlight "billions of categorized resources" in their marketing materials to demonstrate scale. However, it is important to understand that a large amount of data is a demonstration of coverage, not a guarantee of quality. The key factor for reliable protection is not the number of resources in the database, but their optimal, up-to-date, and accurate categorization.

Focusing solely on the "big number" can lead to the opposite effect. In fact, customers risk getting a slow system, an illusion of security, unnecessary infrastructure costs and, as a result, a deterioration in the quality of analytics and threat response.

Thus, the determining factors for effective content filtering are not abstract billions, but:

user traffic coverage

relevance of categories

speed of updates.

The higher the density of useful data in the database, the faster, more accurate and more effective the protection works.
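These factors can be turned into rough numbers. The sketch below assumes two illustrative definitions that are not given in the article: coverage is the share of user requests that resolve to a categorized domain, and density is the share of database entries that user traffic actually hits. The function names and sample data are hypothetical.

```python
def closest_known(host, known):
    """Return the closest categorized ancestor of `host`, or None."""
    labels = host.lower().rstrip(".").split(".")
    # Walk from the full host name up to its second-level domain.
    for start in range(max(len(labels) - 1, 1)):
        candidate = ".".join(labels[start:])
        if candidate in known:
            return candidate
    return None


def coverage_and_density(requests, known):
    """Compute (traffic coverage, database density) for a request log."""
    matched = [m for m in (closest_known(h, known) for h in requests) if m]
    coverage = len(matched) / len(requests) if requests else 0.0
    density = len(set(matched)) / len(known) if known else 0.0
    return coverage, density


# Hypothetical sample: four requests against a four-entry database.
requests = ["msk.t2.ru", "t2.ru", "news.example", "unknown.test"]
known = {"t2.ru", "news.example", "old.example", "parked.example"}
print(coverage_and_density(requests, known))  # (0.75, 0.5)
```

In this sample, three of the four requests are covered, while only half of the database entries are ever used, which is the kind of gap that a raw entry count hides.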
