September 11, 2023
The randomest domain names: entropy as an indicator of TLD threat level

This article was co-written by David Barnett and Richard Ferguson.

 

Introduction

Domain registrations and abuse have had something of a renaissance in recent years, with increases in the numbers of people working from home and shopping online giving rise to countless opportunities for scammers. However, with almost 1,600[1] different top-level domains (TLDs, or domain extensions) to choose from, it can be difficult for brand owners to identify which TLDs to register across – indeed, the annual cost of owning a domain portfolio can soon spiral. Beyond the simple consideration of which TLDs are the ‘best fit’ for a brand’s area of interest based on name alone (e.g. .shop for an online retailer), a statistical analysis of the most extensively abused TLDs can also provide further insights.

This post analyses a wide set of TLDs to assess whether patterns in the length and randomness of domain names show any correlation with other independent estimates of the level of threat associated with different domain extensions.

 

Primer

The universe of registered domains includes large numbers in which the domain name consists just of long, apparently random strings of characters. Several previous studies have suggested that these types of domains are often associated with fraudulent or malicious activity, such as phishing (where the domains can be used in the generation of deceptive URLs) or the distribution of malware. In many cases, bad actors create these domain names using automated domain name generation algorithms and associated automatic registrations[2],[3].

The existence of domains potentially set up for underhand purposes can be analysed through consideration of a parameter known as Shannon entropy, which provides a measure of the amount of information stored in a string of characters – broadly, long domain names, and/or those containing large numbers of distinct characters (such as the random domain names discussed here), will have high entropy[4].
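As a minimal sketch of the measure described above, the following Python function calculates the Shannon entropy of a string from its character frequencies. The article does not state the exact formula used, so the per-character form in bits (logarithm base 2) is an assumption here, and the function name shannon_entropy is purely illustrative.

```python
import math
from collections import Counter


def shannon_entropy(name: str) -> float:
    """Return the per-character Shannon entropy of `name`, in bits.

    Assumption: H = sum over distinct characters of p * log2(1 / p),
    where p is the relative frequency of the character in the string.
    """
    if not name:
        return 0.0
    total = len(name)
    counts = Counter(name)
    # Each distinct character contributes p * log2(1 / p) to the total.
    return sum((c / total) * math.log2(total / c) for c in counts.values())
```

Under this definition, a name such as "000" scores 0 bits because only one character occurs, whereas a name of four distinct characters such as "abcd" scores 2 bits. Long names drawn from a wide character set, like the random strings discussed here, score higher still.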

The entropy of domains differs between TLDs, with some showing a markedly greater frequency of long, random domain names than others. For example, in a previous blog post[5], we discussed how the set of new .zip domains contains many more high-entropy (long, random) names than other TLDs. All other factors being equal, this might suggest that TLDs such as .zip are more prone to abuse by online bad actors.

 

Analysis

For the study, we consider the set of domain zone files published by ICANN[6], which covers gTLDs (.com, .net, etc.) and new-gTLDs (.top, .xyz, .online, etc.). In total, the dataset covers approximately 1,050 TLDs. For each TLD, we calculate the mean domain name entropy across all domains registered with that extension. Small TLDs, with fewer than 100 registered domains, have been excluded from the analysis, as their results are deemed to be of lower significance; this leaves a dataset of 576 TLDs. The results are shown in Table 1 and Figures 1 and 2.
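The per-TLD calculation described above might be sketched as follows. This is an illustration under stated assumptions rather than the code used for the study: the input is taken to be a list of distinct registered domain names of the form sld.tld (zone file parsing is out of scope), entropy is calculated on the SLD only, and the names mean_entropy_by_tld and MIN_DOMAINS are hypothetical.

```python
import math
from collections import Counter, defaultdict

MIN_DOMAINS = 100  # exclusion threshold stated in the article


def shannon_entropy(name: str) -> float:
    """Per-character Shannon entropy in bits (assumed definition)."""
    if not name:
        return 0.0
    total = len(name)
    return sum((c / total) * math.log2(total / c)
               for c in Counter(name).values())


def mean_entropy_by_tld(domains, min_domains=MIN_DOMAINS):
    """Map each TLD to (mean SLD entropy, number of domains).

    Assumption: each input is a distinct registered domain such as
    'example.com'; TLDs with fewer than `min_domains` names are dropped.
    """
    per_tld = defaultdict(list)
    for domain in domains:
        # Split on the last dot: everything before it is treated as the SLD.
        sld, sep, tld = domain.strip().lower().rstrip(".").rpartition(".")
        if not sep or not sld or not tld:
            continue  # skip entries that are not of the form sld.tld
        per_tld[tld].append(shannon_entropy(sld))
    return {
        tld: (sum(values) / len(values), len(values))
        for tld, values in per_tld.items()
        if len(values) >= min_domains
    }
```

Sorting the resulting dictionary by mean entropy in descending order would give a ranking of the kind shown in Table 1.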

 

TLD               Mean entropy            N
bayern            3.578820           60,318
crs               3.556059            1,144
man               3.548192              361
nrw               3.543092           36,313
xn--mgbca7dzdo    3.533396              117
gov               3.524858           19,542
goog              3.470524              543
med               3.461878           69,735
page              3.461800          102,978
eus               3.444771           27,950
mov               3.419044            6,724
esq               3.417947            3,565
amsterdam         3.416103           41,989
rsvp              3.415646            4,572
channel           3.408561              631
swiss             3.404208           37,801
dev               3.396982          769,971
app               3.394302        1,274,223
abudhabi          3.390945            2,060
zip               3.389665           30,223
google            3.380865              318
top               3.362711        4,512,204
komatsu           3.359931              133
day               3.353672           20,345
kyoto             3.326108            2,042
nexus             3.323493            2,250
how               3.320968            7,987
radio             3.319183            5,793
soy               3.317902            3,467
phd               3.312976            2,793

 

Table 1: Top 30 TLDs with greatest mean domain name entropy (N = no. of domains in dataset).

 

 

Figure 1: Top 30 TLDs with greatest mean domain name entropy.

 

 

Figure 2: Bottom 30 TLDs by mean domain name entropy.

 

Visual inspection confirms that the highest-entropy TLDs contain disproportionately high numbers of long, random domain names, with significant numbers of 32-character examples (Figure 3). The reason for this exact length (compared with the absolute maximum of 63 characters for an SLD[7]) is not clear; however, it was the greatest length historically considered to be ‘good practice’[8] for a domain name and can (depending on usage and provider) be a value beyond which functionality limitations may apply. The value may also be related to the type of algorithm(s) used to automatically generate the domain names, or the functionality available through the registrars utilised.

The alphabetical list of .bayern domains (the highest-entropy TLD in the dataset), for example, begins:

 

000.bayern

0008cp8d8h7jgqmddh0kciot4gousac0.bayern

002s0ldfq8l8uo0qr63fbtnjirgc2058.bayern

003v242nno6b91ppgtfr54rc820dvkqu.bayern

0057tcga35h7en9cro4vtbqr2sual0ju.bayern

0070fq4boldtihbvangusggq5r4jc8u7.bayern

0077bcqmb64p5odoa0pfhedmuv8nrdo9.bayern

007dqkp5jvh8qn7b8m5i3tlrgcm3t5cl.bayern

007dv5edpr3rgpam4lnlq6v6147hdbub.bayern

0081mlfvlec3qj5m508633l9sjvbsiph.bayern

00846bmbh82ovq0n1kr78jc97c3dhh7e.bayern

009a705ptm7dfi1uk37kfmkp5dqec1lo.bayern

00a71os7ja4mrjcg32hvs4tcgephthpr.bayern

00amv24rasudpcoj4ddniqujf4qd00ha.bayern

00b8jv3gs972inad2cipm20gqvohmn0v.bayern

00bu3lvu54afr3egplojrpamqu4onhck.bayern

00clcm817v8sra5aqpcru0u8t5lrcjti.bayern

00dfkkjfmhpqll6ladjs3tqlpaqhuijc.bayern

00espnkvp4ohdq7dm35o7v4po4rpm4bp.bayern

00f2n0s19mqn3s34ij3rpnju85arfth8.bayern

 

 

Figure 3: Numbers of .bayern domains, by domain name (SLD) length.
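A length distribution of the kind summarised in Figure 3 can be sketched in a few lines of Python. The function name sld_length_counts is hypothetical, and the input is again assumed to be a list of domain names of the form sld.tld; the sample below uses the first three .bayern names from the list above.

```python
from collections import Counter


def sld_length_counts(domains):
    """Count how many domains have an SLD of each length."""
    counts = Counter()
    for domain in domains:
        # Treat everything before the last dot as the SLD (assumption).
        sld, sep, _tld = domain.strip().lower().rstrip(".").rpartition(".")
        if sep and sld:
            counts[len(sld)] += 1
    # Return the counts ordered by SLD length for easier reading.
    return dict(sorted(counts.items()))


sample = [
    "000.bayern",
    "0008cp8d8h7jgqmddh0kciot4gousac0.bayern",
    "002s0ldfq8l8uo0qr63fbtnjirgc2058.bayern",
]
print(sld_length_counts(sample))  # {3: 1, 32: 2}
```

Applied to a full zone file, a pronounced spike at one length (here 32) would stand out immediately in the resulting counts.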

 

It is also instructive to compare the mean entropy for each TLD with previous estimates of the general level of risk associated with that TLD, considering factors such as the frequency of their use in phishing, spam, and malware. In one such study[9], TLDs were allocated a normalised ‘threat frequency’ score (between 0 and 1), based on threat statistics taken from a range of independent datasets. Figure 4 shows a comparison between the mean entropy of the domains for each TLD, and the threat score from this previous study, for all TLDs present in both datasets.

 

 

Figure 4: Comparison between mean domain name entropy (this study) and normalised threat frequency score (previous study) for each TLD.

 

The two datasets are not strongly correlated: the correlation coefficient is +0.07, indicating only a very weak positive relationship. There is, however, a suggestion that the highest-entropy TLDs (those with a mean entropy value of > 3.2) do tend to sit at the higher end of the risk spectrum (threat score > approx. 0.2). This is at least suggestive of some self-consistency in terms of the assertion that higher-entropy domain names (and the TLDs with which they are more frequently associated) tend to be more likely to be linked to a range of classes of fraudulent and malicious activity.
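For readers wishing to reproduce a comparison of this kind, the sketch below computes a correlation coefficient across the TLDs present in both datasets. The article does not say which coefficient was used, so the Pearson coefficient is an assumption, and the function names and the dictionary inputs (TLD mapped to mean entropy, and TLD mapped to threat score) are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)


def correlate_shared_tlds(entropy_by_tld, threat_by_tld):
    """Correlate the two measures over TLDs present in both mappings."""
    shared = sorted(set(entropy_by_tld) & set(threat_by_tld))
    xs = [entropy_by_tld[tld] for tld in shared]
    ys = [threat_by_tld[tld] for tld in shared]
    return pearson_r(xs, ys)
```

A value close to zero, such as the +0.07 reported here, indicates little linear relationship overall, even if the extremes of the distribution show a visible pattern.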

 

Conclusions

Previous research suggests that long, random (high-entropy) domain names are more likely to be associated with automated algorithmic registrations, and to be used for malicious activity. It is also noteworthy that many of the most suspicious domain names are (exactly) 32 characters in length.

Certain domain extensions are associated with greater proportions of high-entropy domains, and the top 30 TLDs (by mean entropy) include a number of popular extensions such as .top (4.5m domains), .app (1.3m) and .page (103k). The additional finding that many of these same TLDs are generally found more frequently to be associated with phishing, spam, and malware is suggestive of a correspondence between mean domain entropy and overall level of risk for a particular TLD.

Quantitative studies such as this can help inform and validate brand protection strategies, especially when overlaid with qualitative analysis (such as consideration of what string the domain extension itself actually is, in terms of a keyword or description). This assessment provides guidance not just on which domains to register, but also which domain extensions warrant attention when monitoring, and prioritisation when enforcing. The Internet isn’t getting any smaller, but combining metrics can help with zoning in on targets.

 

[1] https://www.iana.org/domains/root/db

[2] https://circleid.com/posts/20230703-an-overview-of-the-concept-and-use-of-domain-name-entropy

[3] https://www.splunk.com/en_us/blog/security/random-words-on-entropy-and-dns.html

[4] https://www.linkedin.com/pulse/investigating-use-domain-name-entropy-clustering-results-barnett/

[5] ‘Un-.zip-ping and un-.box-ing the risks associated with new TLDs’ 

[6] https://czds.icann.org/home

[7] The SLD (second-level domain name) is the part of the domain name before the dot

[8] https://docs.oracle.com/cd/E19683-01/806-4077/6jd6blbdi/index.html

[9] https://circleid.com/posts/20230117-the-highest-threat-tlds-part-2

Tags
Online Brand Enforcement /  Domains /  Tech
