87 million domains: pagerank and harmonic centrality
Searching for "Common Crawl" on Google brings up a knowledge graph panel that states: "Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public. Common Crawl's web archive consists of petabytes of data collected since 2011. It completes crawls generally every month. Common Crawl was founded by Gil Elbaz."
Their most recent blog post, at http://commoncrawl.org/2018/11/web-graphs-aug-sep-oct-2018/ , links to a publicly available file,
cc-main-2018-aug-sep-oct-domain-ranks.txt.gz (1.89 GB), which provides the most recent harmonic centrality and pagerank values for 87 million domains.
Preview of the Common Crawl August, September, October 2018 domain ranks file
Below is a preview of this file. The column names have been modified by removing the leading "#"; for example, #pr_val becomes pr_val.
# | harmonicc_pos | harmonicc_val | pr_pos | pr_val | host_rev | n_hosts |
---|---|---|---|---|---|---|
0 | 1 | 24993276.0 | 2 | 0.012750407517909759 | 'com.facebook' | 7348 |
1 | 2 | 24671056.0 | 1 | 0.01721003017179015 | 'com.googleapis' | 1904 |
2 | 3 | 23453366.0 | 3 | 0.010760718435453807 | 'com.google' | 3328 |
3 | 4 | 22371572.0 | 4 | 0.008252403881908257 | 'com.twitter' | 1138 |
4 | 5 | 22136836.0 | 5 | 0.006785530343703675 | 'com.youtube' | 2985 |
... | ... | ... | ... | ... | ... | ... |
87,160,627 | 87160628 | 0.0 | 87122763 | 4.481314067744052e-09 | 'zw.org.mdc' | 1 |
87,160,628 | 87160629 | 0.0 | 87122764 | 4.481314067744052e-09 | 'zw.org.partnersforlife' | 1 |
87,160,629 | 87160630 | 0.0 | 87122765 | 4.481314067744052e-09 | 'zw.org.yamurai' | 1 |
87,160,630 | 87160631 | 0.0 | 87122766 | 4.481314067744052e-09 | 'zw.org.yard' | 1 |
87,160,631 | 87160632 | 0.0 | 87122767 | 4.481314067744052e-09 | 'zw.org.youthalive' | 1 |
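A minimal sketch of how such a file could be parsed, assuming it is tab-separated with a single header line whose column names start with "#". The function names read_ranks and read_ranks_gz are illustrative, not the code used for this post, and the inline sample simply copies two rows from the preview above. At the full 87 million rows, a columnar library such as NumPy or pandas would likely be more practical than plain Python lists.

```python
import csv
import gzip
import io

def read_ranks(lines):
    """Parse tab-separated rank rows and strip the leading '#' from column names."""
    reader = csv.reader(lines, delimiter="\t")
    header = [name.lstrip("#") for name in next(reader)]
    rows = []
    for record in reader:
        row = dict(zip(header, record))
        # Convert the numeric columns; host_rev stays a string.
        row["harmonicc_pos"] = int(row["harmonicc_pos"])
        row["harmonicc_val"] = float(row["harmonicc_val"])
        row["pr_pos"] = int(row["pr_pos"])
        row["pr_val"] = float(row["pr_val"])
        row["n_hosts"] = int(row["n_hosts"])
        rows.append(row)
    return rows

def read_ranks_gz(path):
    """Open a gzip-compressed ranks file (path supplied by the caller) and parse it."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        return read_ranks(handle)

# Tiny inline sample in the assumed layout, using values from the preview.
SAMPLE = (
    "#harmonicc_pos\t#harmonicc_val\t#pr_pos\t#pr_val\t#host_rev\t#n_hosts\n"
    "1\t24993276.0\t2\t0.012750407517909759\tcom.facebook\t7348\n"
    "2\t24671056.0\t1\t0.01721003017179015\tcom.googleapis\t1904\n"
)

ranks = read_ranks(io.StringIO(SAMPLE))
print(ranks[0]["host_rev"], ranks[0]["pr_val"])
```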
Statistics of the Common Crawl August, September, October 2018 domain ranks data
- Pagerank
mean, min, max of pr_val = 1.1473075614418515e-08, 4.48131407e-09, 1.72100302e-02
- Harmonic centrality
mean, min, max of harmonicc_val = 9421776.2697027, 0. , 24993276.
- Number of hosts (subdomains)
mean, min, max of n_hosts = 10.363855461718083, 1.0000000e+00, 2.6061259e+07
- Correlations
correlation(pr_val, harmonicc_val) = 0.00432823
correlation(pr_val, n_hosts) = 0.02463352
correlation( harmonicc_val, n_hosts) = 0.00068831
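The post does not say which correlation measure was used; the sketch below assumes the Pearson coefficient and shows how the mean/min/max summaries and a correlation can be computed from plain lists. The helper names summarize and pearson are illustrative, and the five sample values are the first rows of the preview, so the printed numbers will not match the full-dataset figures above.

```python
import math

def summarize(values):
    """Return (mean, min, max) of a non-empty sequence of numbers."""
    return sum(values) / len(values), min(values), max(values)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First five rows of the preview (not the full 87 million rows).
pr_val = [0.012750407517909759, 0.01721003017179015, 0.010760718435453807,
          0.008252403881908257, 0.006785530343703675]
n_hosts = [7348, 1904, 3328, 1138, 2985]

print("mean, min, max of pr_val =", summarize(pr_val))
print("correlation(pr_val, n_hosts) =", pearson(pr_val, n_hosts))
```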
Data visualization
Distribution of pagerank
The graph below plots the counts of pagerank values. It shows that the distribution of pagerank over the 87 million domains is highly right-skewed, meaning that the majority of domains have very low pagerank.
Distribution of number of hosts
The graph below plots the counts of the n_hosts column values. It shows that the distribution of the number of hosts (subdomains) over the 87 million domains is highly right-skewed, meaning that the majority of domains have a low number of subdomains.
Taking a closer look at n_hosts, limited to domains with between 100 and 2000 hosts, we observe the same type of distribution.
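One way to check this kind of skew without a plot is to count values per order of magnitude. The sketch below is only an illustration: count_by_magnitude is a hypothetical helper, and the sample list is a handful of n_hosts values that appear elsewhere in this post, not the real column.

```python
import math
from collections import Counter

def count_by_magnitude(values):
    """Count positive values per power of ten: key k covers [10**k, 10**(k+1))."""
    counts = Counter()
    for value in values:
        if value <= 0:
            raise ValueError("values must be positive")
        counts[math.floor(math.log10(value))] += 1
    return dict(sorted(counts.items()))

# A few n_hosts values taken from this post, for illustration only.
sample_n_hosts = [1, 1, 5, 40, 76, 749, 1535, 7348, 26061259]

for exponent, count in count_by_magnitude(sample_n_hosts).items():
    print(f"10^{exponent} to 10^{exponent + 1}: {count}")
```

On a right-skewed column, the lowest buckets hold most of the rows and the counts fall off quickly as the exponent grows.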
Distribution of harmonic centrality
The graph below plots the counts of the harmonicc_val column values. The distribution of harmonicc_val over the 87 million domains is not highly right-skewed like the pagerank and number-of-hosts distributions. It is not a perfect Gaussian, but it is closer to one than those two distributions, and it is multimodal.
Scatter plot of pagerank and harmonic centrality
Because the majority of domains have low pagerank, the scatter plot of pagerank against harmonic centrality shows a vertical red line. Domains' pagerank values begin to detach from the mass when their harmonic centrality approaches 1e7, and the detachment accelerates above that value.
Scatter plot of pagerank and harmonic centrality by number of hosts
On this scatter plot of pagerank and harmonic centrality values, red points show domains with n_hosts less than 10, green points show domains with n_hosts greater than or equal to 10.
Querying domains
Top domains in US
amazon.com, ebay.com, reddit.com
{'harmonicc_pos': array([22]), 'harmonicc_val': array([17583026.]), 'pr_pos': array([37]), 'pr_val': array([0.00084049]), 'host_rev': array(['com.amazon'], dtype='<U83'), 'n_hosts': array([749]), 'index': array([21])}
{'harmonicc_pos': array([241]), 'harmonicc_val': array([16102642.]), 'pr_pos': array([206]), 'pr_val': array([0.00010764]), 'host_rev': array(['com.ebay'], dtype='<U83'), 'n_hosts': array([936]), 'index': array([240])}
{'harmonicc_pos': array([61]), 'harmonicc_val': array([16686224.]), 'pr_pos': array([105]), 'pr_val': array([0.00028783]), 'host_rev': array(['com.reddit'], dtype='<U83'), 'n_hosts': array([1535]), 'index': array([60])}
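The host_rev column stores each domain with its dot-separated labels reversed, so a lookup first has to convert the domain name. The sketch below assumes the rows are held as dictionaries; to_host_rev and build_index are illustrative names, and the two sample rows keep only a few fields from the outputs in this post.

```python
def to_host_rev(domain):
    """Reverse the dot-separated labels: 'amazon.co.uk' -> 'uk.co.amazon'."""
    return ".".join(reversed(domain.lower().split(".")))

def build_index(rows):
    """Map each host_rev to its row so lookups do not scan the whole list."""
    return {row["host_rev"]: row for row in rows}

# Two rows reduced to a few fields, values taken from this post.
rows = [
    {"host_rev": "com.amazon", "harmonicc_pos": 22, "pr_pos": 37, "n_hosts": 749},
    {"host_rev": "uk.co.bbc", "harmonicc_pos": 108, "pr_pos": 169, "n_hosts": 342},
]
index = build_index(rows)

for domain in ("amazon.com", "bbc.co.uk", "example.invalid"):
    # Domains missing from the data yield None.
    print(domain, index.get(to_host_rev(domain)))
```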
Top domains in UK
amazon.co.uk, ebay.co.uk, bbc.co.uk
{'harmonicc_pos': array([230]), 'harmonicc_val': array([16126449.]), 'pr_pos': array([201]), 'pr_val': array([0.00011029]), 'host_rev': array(['uk.co.amazon'], dtype='<U83'), 'n_hosts': array([76]), 'index': array([229])}
{'harmonicc_pos': array([1501]), 'harmonicc_val': array([15403167.]), 'pr_pos': array([1730]), 'pr_val': array([1.98171142e-05]), 'host_rev': array(['uk.co.ebay'], dtype='<U83'), 'n_hosts': array([330]), 'index': array([1500])}
{'harmonicc_pos': array([108]), 'harmonicc_val': array([16438657.]), 'pr_pos': array([169]), 'pr_val': array([0.00014236]), 'host_rev': array(['uk.co.bbc'], dtype='<U83'), 'n_hosts': array([342]), 'index': array([107])}
Top domains in France
leboncoin.fr, orange.fr, amazon.fr
{'harmonicc_pos': array([14503]), 'harmonicc_val': array([14825333.]), 'pr_pos': array([2288]), 'pr_val': array([1.37111161e-05]), 'host_rev': array(['fr.orange'], dtype='<U83'), 'n_hosts': array([2860]), 'index': array([14502])}
{'harmonicc_pos': array([907]), 'harmonicc_val': array([15575607.]), 'pr_pos': array([681]), 'pr_val': array([4.26634194e-05]), 'host_rev': array(['fr.amazon'], dtype='<U83'), 'n_hosts': array([40]), 'index': array([906])}
Top domains in Turkey
sahibinden.com, hurriyet.com.tr, n11.com.tr
{'harmonicc_pos': array([20895]), 'harmonicc_val': array([14758365.]), 'pr_pos': array([38627]), 'pr_val': array([8.94421713e-07]), 'host_rev': array(['com.sahibinden'], dtype='<U83'), 'n_hosts': array([1044]), 'index': array([20894])}
{'harmonicc_pos': array([8034921]), 'harmonicc_val': array([11943149.]), 'pr_pos': array([1077872]), 'pr_val': array([4.68284367e-08]), 'host_rev': array(['tr.com.n11'], dtype='<U83'), 'n_hosts': array([5]), 'index': array([8034920])}
My blog's domain
searchdatalogy.com
{'harmonicc_pos': array([17769587]), 'harmonicc_val': array([11030533.]), 'pr_pos': array([3314413]), 'pr_val': array([1.94330501e-08]), 'host_rev': array(['com.searchdatalogy'], dtype='<U83'), 'n_hosts': array([1]), 'index': array([17769586])}
The domain with maximum number of subdomains
{'harmonicc_pos': array([22768913]), 'harmonicc_val': array([10713943.]), 'pr_pos': array([54517530]), 'pr_val': array([4.58480231e-09]), 'host_rev': array(['domains.everyone'], dtype='<U83'), 'n_hosts': array([26061259]), 'index': array([22768912])}
Domains with 10 K or more subdomains
List of top 20 domains having n_hosts >= 10000
array(['com.wordpress', 'com.blogspot', 'com.tumblr', 'com.yahoo', 'com.github', 'com.gstatic', 'com.amazonaws', 'com.googleusercontent', 'com.weebly', 'net.cloudfront', 'io.github', 'net.doubleclick', 'com.appspot', 'com.squarespace', 'com.deviantart', 'net.sourceforge', 'com.googlecode', 'com.wix', 'com.live', 'com.list-manage'], dtype='<U83')
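A list like this can be produced by filtering on n_hosts and keeping the first 20 matches. The sketch below assumes the list is ordered by harmonic centrality position, which the post does not state; top_by_subdomains is an illustrative name and the sample rows are made up.

```python
def top_by_subdomains(rows, min_hosts=10_000, limit=20):
    """Return host_rev of rows with n_hosts >= min_hosts, best harmonicc_pos first."""
    selected = [row for row in rows if row["n_hosts"] >= min_hosts]
    # Assumption: "top" means the lowest harmonicc_pos (rank 1 is best).
    selected.sort(key=lambda row: row["harmonicc_pos"])
    return [row["host_rev"] for row in selected[:limit]]

# Made-up rows, for illustration only.
rows = [
    {"host_rev": "example.c", "harmonicc_pos": 30, "n_hosts": 50_000},
    {"host_rev": "example.a", "harmonicc_pos": 10, "n_hosts": 12_000},
    {"host_rev": "example.b", "harmonicc_pos": 20, "n_hosts": 9_999},
    {"host_rev": "example.d", "harmonicc_pos": 40, "n_hosts": 10_000},
]

print(top_by_subdomains(rows))
```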
Majestic million data
Majestic publishes open data on the top 1 million domains at this URL: http://downloads.majestic.com/majestic_million.csv
Preview of majestic million file
# | globalrank | domain | tld | refsubnets | refips |
---|---|---|---|---|---|
0 | 1 | 'google.com' | 'com' | 481744 | 3048605 |
1 | 2 | 'facebook.com' | 'com' | 467244 | 3085825 |
2 | 3 | 'youtube.com' | 'com' | 427535 | 2507495 |
3 | 4 | 'twitter.com' | 'com' | 417571 | 2494867 |
4 | 5 | 'microsoft.com' | 'com' | 313090 | 1188308 |
... | ... | ... | ... | ... | ... |
999,995 | 999996 | 'bauordnungen.de' | 'de' | 358 | 491 |
999,996 | 999997 | 'helios.eu' | 'eu' | 358 | 491 |
999,997 | 999998 | 'chinabi.net' | 'net' | 358 | 491 |
999,998 | 999999 | 'adammilstein.org' | 'org' | 358 | 491 |
999,999 | 1000000 | 'beckers.se' | 'se' | 358 | 491 |
Statistics of majestic million data
- Refsubnets
mean, min, max of refsubnets = 1068.226617, 3.58000e+02, 4.81744e+05
- Refips
mean, min, max of refips = 1440.785989, 3.640000e+02, 3.085825e+06
- Correlations
correlation(refsubnets, refips) = 0.87576101
Merging with majestic million data
After converting the domain column of the Majestic data to the host_rev format (stored as mhost_rev), I summed the refips and refsubnets values for each mhost_rev and removed the duplicates.
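The transformation can be sketched as a group-and-sum keyed on the reversed domain. The exact rule used to map a Majestic domain to mhost_rev is not spelled out here, so the default key below (plain label reversal, case-insensitive) is an assumption, as are the names to_host_rev and aggregate_majestic. The sample records are made up, and the field names follow the lowercase columns of the preview above.

```python
from collections import defaultdict

def to_host_rev(domain):
    """Reverse the dot-separated labels: 'google.com' -> 'com.google'."""
    return ".".join(reversed(domain.lower().split(".")))

def aggregate_majestic(records, key=to_host_rev):
    """Sum refips and refsubnets per key, leaving one entry per mhost_rev."""
    totals = defaultdict(lambda: {"refips_sum": 0, "refsubnets_sum": 0})
    for record in records:
        entry = totals[key(record["domain"])]
        entry["refips_sum"] += int(record["refips"])
        entry["refsubnets_sum"] += int(record["refsubnets"])
    return dict(totals)

# Made-up records: the first two map to the same key and are summed.
records = [
    {"domain": "example.com", "refsubnets": "10", "refips": "15"},
    {"domain": "EXAMPLE.com", "refsubnets": "7", "refips": "10"},
    {"domain": "example.org", "refsubnets": "3", "refips": "4"},
]

for mhost_rev, sums in aggregate_majestic(records).items():
    print(mhost_rev, sums)
```

Passing a different key function changes which records collapse into one row without touching the summing logic.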
Preview of majestic million data after transformation
Below is a preview of the Majestic Million data after this transformation.
# | mhost_rev | refips_sum | refsubnets_sum |
---|---|---|---|
0 | 'com.google' | 8821359 | 2337390 |
1 | 'com.facebook' | 3501915 | 651350 |
2 | 'com.youtube' | 2507495 | 427535 |
3 | 'com.twitter' | 2640942 | 496545 |
4 | 'com.microsoft' | 1560747 | 495154 |
... | ... | ... | ... |
998,411 | 'de.bauordnungen' | 491 | 358 |
998,412 | 'eu.helios' | 491 | 358 |
998,413 | 'net.chinabi' | 491 | 358 |
998,414 | 'org.adammilstein' | 491 | 358 |
998,415 | 'se.beckers' | 491 | 358 |
Preview of common crawl and majestic million data join
I then merged the two datasets, matching host_rev with mhost_rev. Below is a preview of this join; "--" marks domains with no match in the Majestic data.
# | harmonicc_pos | harmonicc_val | pr_pos | pr_val | host_rev | n_hosts | mhost_rev | refips_sum | refsubnets_sum |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 24993276.0 | 2 | 0.012750407517909759 | 'com.facebook' | 7348 | 'com.facebook' | 3501915 | 651350 |
1 | 2 | 24671056.0 | 1 | 0.01721003017179015 | 'com.googleapis' | 1904 | 'com.googleapis' | 59541 | 35878 |
2 | 3 | 23453366.0 | 3 | 0.010760718435453807 | 'com.google' | 3328 | 'com.google' | 8821359 | 2337390 |
3 | 4 | 22371572.0 | 4 | 0.008252403881908257 | 'com.twitter' | 1138 | 'com.twitter' | 2640942 | 496545 |
4 | 5 | 22136836.0 | 5 | 0.006785530343703675 | 'com.youtube' | 2985 | 'com.youtube' | 2507495 | 427535 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
87,160,627 | 87160628 | 0.0 | 87122763 | 4.481314067744052e-09 | 'zw.org.mdc' | 1 | -- | -- | -- |
87,160,628 | 87160629 | 0.0 | 87122764 | 4.481314067744052e-09 | 'zw.org.partnersforlife' | 1 | -- | -- | -- |
87,160,629 | 87160630 | 0.0 | 87122765 | 4.481314067744052e-09 | 'zw.org.yamurai' | 1 | -- | -- | -- |
87,160,630 | 87160631 | 0.0 | 87122766 | 4.481314067744052e-09 | 'zw.org.yard' | 1 | -- | -- | -- |
87,160,631 | 87160632 | 0.0 | 87122767 | 4.481314067744052e-09 | 'zw.org.youthalive' | 1 | -- | -- | -- |
Preview of final dataset
After dropping the rows that contain missing values, 953972 domains are left. Below is a preview of this final dataset.
# | harmonicc_pos | harmonicc_val | pr_pos | pr_val | host_rev | n_hosts | mhost_rev | refips_sum | refsubnets_sum |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 24993276.0 | 2 | 0.012750407517909759 | 'com.facebook' | 7348 | 'com.facebook' | 3501915 | 651350 |
1 | 2 | 24671056.0 | 1 | 0.01721003017179015 | 'com.googleapis' | 1904 | 'com.googleapis' | 59541 | 35878 |
2 | 3 | 23453366.0 | 3 | 0.010760718435453807 | 'com.google' | 3328 | 'com.google' | 8821359 | 2337390 |
3 | 4 | 22371572.0 | 4 | 0.008252403881908257 | 'com.twitter' | 1138 | 'com.twitter' | 2640942 | 496545 |
4 | 5 | 22136836.0 | 5 | 0.006785530343703675 | 'com.youtube' | 2985 | 'com.youtube' | 2507495 | 427535 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
953,967 | 87153912 | 0.0 | 87116072 | 4.481314067744052e-09 | 'za.co.landmate' | 1 | 'za.co.landmate' | 711 | 475 |
953,968 | 87153921 | 0.0 | 87116081 | 4.481314067744052e-09 | 'za.co.langebaancpf' | 1 | 'za.co.langebaancpf' | 856 | 595 |
953,969 | 87153940 | 0.0 | 87116100 | 4.481314067744052e-09 | 'za.co.lasercorp' | 1 | 'za.co.lasercorp' | 915 | 737 |
953,970 | 87154893 | 0.0 | 87117051 | 4.481314067744052e-09 | 'za.co.misternat' | 1 | 'za.co.misternat' | 441 | 369 |
953,971 | 87160530 | 0.0 | 87122665 | 4.481314067744052e-09 | 'zw.co.helpstarsmedicaltrust' | 1 | 'zw.co.helpstarsmedicaltrust' | 1119 | 925 |
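The join and the filtering step can be sketched as a left join on host_rev followed by dropping rows with missing values. This is an illustration with plain dictionaries, not the code used for this post; left_join and drop_missing are hypothetical names, None stands in for the "--" entries, and the sample values come from the previews above.

```python
def left_join(ranks, majestic):
    """Attach Majestic totals to every rank row; unmatched rows get None."""
    joined = []
    for row in ranks:
        match = majestic.get(row["host_rev"])
        merged = dict(row)
        found = match is not None
        merged["mhost_rev"] = row["host_rev"] if found else None
        merged["refips_sum"] = match["refips_sum"] if found else None
        merged["refsubnets_sum"] = match["refsubnets_sum"] if found else None
        joined.append(merged)
    return joined

def drop_missing(rows):
    """Keep only rows in which no field is None."""
    return [row for row in rows if all(value is not None for value in row.values())]

# Two rank rows (reduced to a few fields) and one Majestic entry from the previews.
ranks = [
    {"host_rev": "com.facebook", "harmonicc_pos": 1, "pr_pos": 2, "n_hosts": 7348},
    {"host_rev": "zw.org.mdc", "harmonicc_pos": 87160628, "pr_pos": 87122763, "n_hosts": 1},
]
majestic = {"com.facebook": {"refips_sum": 3501915, "refsubnets_sum": 651350}}

joined = left_join(ranks, majestic)
final = drop_missing(joined)
print(len(joined), len(final))
```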
Statistics on the final 954 K domains
- The correlations below show that a domain's refips and refsubnets are correlated with its pagerank value, refips more strongly than refsubnets, while the number of hosts, as seen at the beginning, is not directly correlated with pagerank.
correlation("pr_val", "refips_sum"),correlation("pr_val", "refsubnets_sum"),correlation("pr_val", "n_hosts")
0.60769659, 0.5162285, 0.06663268
- The following correlations show that the refsubnets, refips and number of hosts of domains are not strongly correlated with harmonic centrality values.
correlation("harmonicc_val", "refips_sum"),correlation("harmonicc_val", "refsubnets_sum"),correlation("harmonicc_val", "n_hosts")
0.06194632, 0.11714723, 0.01232035
Next
We can enrich this dataset with more data, such as the geographical location of the domains' hosting or their web performance. We can build ML classification or prediction models on the final data. A more detailed analysis at the TLD level could also reveal surprising insights.
Thanks for taking time to read this post. I offer consulting, architecture and hands-on development services in web/digital to clients in Europe & North America. If you'd like to discuss how my offerings can help your business please contact me via LinkedIn