There are things you don't care much about, until you age a few more years, get lovely kids and watch them grow.
And preventing them from seeing inappropriate content is one of these.
DNS-based control filters are appealing. Simple to setup, it's a configure-and-forget solution.
Unfortunately, they are not very effective. Most of them seem to be based on keywords, as well as curated blacklists.
Keywords are prone to blocking totally safe content, whereas blacklists don't scale and are bound to be constantly out of date.
While the popular DNS-based filters I tried did a decent job with (way too) popular websites, they all fell short with links found on these very websites.
Some also seem to blindly whitelist Alexa top-N domain names, considering them as perfectly safe for all public. This doesn't match the reality.
They also did a pretty poor job at blocking websites whose name is not made of English words.
Unfortunately, inappropriate websites exist in all languages, are very well referenced on search engines in these languages, and unless you live in an English-speaking country, this is what your kids are going to encounter first.
DNS-based systems also raised some privacy concerns, and didn't offer as much flexibility as I would have liked.
So I decided to build a parental control system from scratch.
Building the tools
Having something that would be compatible with all the devices in the house was a mandatory requirement.
As I was developing dnscrypt-proxy, first-class filters with parental control in mind thus became a priority.
Filters able to quickly match large lists with complex patterns were implemented.
In order to get Google, Bing and Youtube return only "safe" results, overwriting responses for some specific zones is required. I implemented a "cloaking" module in dnscrypt-proxy to achieve this.
Finally, I really wanted time-based rules, to block websites such as Youtube only during the week, and only a couple hours in the day where I'd rather see my little princesses do their homework instead of watching prank videos.
No existing DNS-based service seemed to offer it in spite of users having asked for that feature for years. So I implemented it.
My router is a (now very old but I still like it) Soekris Net4801 running OpenBSD. It's been running beautifully for all these years, but unfortunately, its very limited memory turned out to be showstopper to run dnscrypt-proxy with large lists on it.
While blocking everything client-side is something I would still highly recommend whenever possible, it remains okay to do it on the DNS resolver itself, especially if you also happen to operate that very resolver.
Existing filtering mechanisms in DNS resolvers are all very basic and can't match patterns.
So I implemented filters in dnscrypt-wrapper, a server-side implementation of the DNSCrypt protocol. This implementation uses qp-tries for prefix/suffix matching, can also match random susbtrings, and is both extremely fast and flexible. It uses the exact same format as what was implemented in dnscrypt-proxy, so blacklists can be freely moved from the client to the server and the other way round.
Now that we're done with the tools, let's move on to building the actual blacklists.
Explicit websites are explicit
Adult websites don't try to hide what they are all about. They are all trying to be as discoverable as possible. (Blackhat) Search Engine Optimization is a no breainer. Host names have to be combinations of all the possible keywords their target audience might type.
This also applies to VPNs, proxies and other trivial ways to work around filters. They are a dime a dozen, and also try to be as visible and obvious as possible, all fighting for the top search results on search engines. For this reason, I decided to build rule sets that would cover both simultaneously.
Looking at the host name is all it takes for a human to confidently label it as inappropriate. This is precisely what these websites have been optimized for. Network features are thus not mandatory in order to build decent filters.
If a human can do it, deep neural networks might also do a decent job. But not without a training set to start with.
An initial training set
But where do we start? Freely available lists are not great. Manually building them would be way too time consuming. Crawling is an impossible task.
I started by using
dnscap to store unique host names of non-empty/non-error responses from an open resolver. Doing so for 24 hours returned about 15 million host names, without any information about the original queries and who sent them, but with the certitude that all of these names had authoritative servers.
Now, some initial labels had to be applied. I simply used
grep with a dozen obvious keywords to build an initial set of potentially inappropriate names.
The resulting training set was a really terrible one. The positive examples certainly included many false positives, as any simple keyword-based approach would. The negative examples were obviously including a large amount of false negatives. By no means the handful keywords I used would be enough to correctly label the whole set.
But I had to start with something. So, I had a really awful initial training set, but a training set no matter what.
An initial model
A domain deep learning excels in is natural language processing. Recurrent neural networks have proven to be very effective for speech recognition, sentiment analysis, document classification, automated translation and more.
These models appeared like a natural fit for automated host names labeling.
Language processing models typically require dictionaries. However, host names are not only made of words, and don't respect any well-defined segmentation rules.
I thus considered each character as a word and represented each name as a one-hot vector. Host names were truncated to the last 64 characters, and I started with a reasonably large batch size (73) to quickly get some feedback on the model's performance.
These 73x64x42 matrices were the input to a initial network made of 500 LSTM units connected to a dense layer.
The overall performance was really poor. The model was clearly overfitting the data and was unable to generalize. A few obvious problems with this first attempt were:
- The highly skewed dataset. Giving more weight to the positive examples, either by duplicating them to balance the input, or by setting the weight properties in the Keras implementation, didn't help much.
- The natural tendency of RNNs to overfit and the absence of mitigations (e.g. dropout) prevented the model from generalizing
- The low-quality of the training set, especially the high number of false negatives
To address these:
- I removed a few false positives. Namely, adult-related patterns found in random host names generated by CDNs, load balancers, APIs and "ip leak" tests
- I built a dataset with a random selection of negative examples whose number matches the positive ones
- I added a dropout layer
LSTM units were also replaced by Gated Recurrent Units, whose accuracy was comparable, but training was faster, allowing me to iterate more quickly.
This vastly improved accuracy, even though the training set was way smaller. However, the number of false positives was still far from acceptable.
It turns out that the dropout layer was partly to blame. Dropout can hurt performance in recurrent networks (especially stacked ones), even though the dense layer may benefit from it.
I removed this layer and applied recurrent dropout instead, which made a massive difference to the number of false positives. This hyperparameter happened to be among the most important ones.
0.2 turned out to be a pretty good value for my set.
Intuitively, any subdomain/hostname within an inappropriate domain may inherit the same label, so the input vectors should be processed in reverse order. However, sequences resembling words should be processed in order.
In this context, a bidirectional RNN would make complete sense. However, this didn't make any significant improvement to the model's performance.
What did improve performance, though, is the addition of a bias to these units, as well as using the
softsign activation function instead of
We don't any use dictionaries at this point. "Words" are just unsegmented sequences of characters. With a number of gated units large enough, sequences commonly found for each label are still learned pretty well by the model.
However, reviewing the output strongly suggested that building some kind of dictionary would still make sense. Adult-related websites tend to use a domain-specific set of "words", with a few variations that the recurrent units can deal with.
I added an Embedding layer to the network. By learning embeddings, we can capture the domain-specific vocabulary, which in turn allows reducing the number of recurrent units without any significant loss in accuracy.
The downside is that training the model became much slower, even with half the number of recurrent units. Using Adagrad instead of RMSProp as an optimization method helped a bit to mitigate this.
I implemented early stopping monitoring variations of the accuracy going below 1E-4 with a patience of 4 epochs. Convergence was reached after about 150 epochs.
It's well know that RNNs are surprisingly effective, and this time wasn't an exception. Remember that the training set was just a substring search of a handful keywords.
The model had accurately learned many relevant words and idioms. What probably struck me more is the fact that it had learned domain-specific vocabulary in many different languages.
A significant amount of examples host names had mixed common English keywords with words in other languages for the model to learn about these new words.
Another shocking result is that, while reviewing the results for false positives, I stumbled upon a great amount of example that, to me, looked like obvious false positives, but turned out to be true positives. I discovered words and expressions I had no idea about.
The model was also exceptionnally good at capturing names. Many of the host names labeled as positives just looked like people's name to me, thus as obvious false positives. They were not. These were porn stars' name, that the model previous found in other, more explicit contexts.
The flipside is that anything within host names such as
tumblr.com also ended up being systematically labeled as positive. Even more embarassing is that words such as
gay were also considered as strong signals for a host name to be labeled as positive.
Refining the training set
Obviously the training set had to be refined:
- We need more positive examples. Remember that the initial set was highly unbalanced. We have plenty of available negative examples, so we can add many new positive examples and keep the set balanced.
- The negative examples still include many false negatives. If we add new examples to this class, they have to be manually reviewed.
- In order for the model to capture the fact that
tumblerare not enough to decide on a label, and give more importance to the context in this case, we need to add more counter examples with these sequences.
I built scripts to make a prediction about every single domain from the large initial set, and output the ones with high-confidence predictions that they should had been labeled positives while they initially weren't.
The output was about 150k domains and reviewing these manually was an impossible task.
I ended up extracting the ones with specific, obvious patterns, and did the same to add new examples of true negatives.
The process was repeated several times, each time manually reviewing some of the model's prediction to enrich the training set and build a better model.
Unfortunately, manual review is extremely time consuming, and I eventually got burned out before reaching the training set quality I was aiming for.
Being done with manual reviews, I implemented a simple script that retrieves the home page of a domain name, and looks for some keywords on it.
The domains set still comes from the model's predictions, with a fairly high cutoff threshold to keep the number of candidates low.
Crawling doesn't give very reliable results, missing many positives. It also introduces a different features set than what the model was originally trainined with.
This is especially visible on Tumblr pages. Some of the crawled pages were flagged because the "adult" disclaimer was found of them. These are very likely to be true positives. However, the host name may not convey any indications that this is the case. But we precisely want to train the model on the host name itself.
More manual reviews or more sophisticated analysis of the crawled pages would help. However, even without a perfect training set, the model's accuracy already got far beyond my expectations.
Training models require servers with GPUs. It costs money. Every single minute of computation costs money, and I can't do this any longer.
But even in its current state, the model's prediction are incredibly good. A manual review shows little false positives, but also shows the fact that the model properly captures most of the domain-specific vocabulary, idioms and patterns, in many different languages.
In order to get a rough idea about how it performs against long established parental control services, I sampled 50,000 names predicted as inappropriate. Due to how the initial set was built, the vast majority of these were all recent, existing websites. At least, they have IP addresses mapped to them, and users browsing them.
I then used
massresolver to resolve them using Cisco OpenDNS FamilyShield and counted how many occurrences of the Cisco IP addresses were returned instead of the actual website IP address.
Cisco blocked only 17% of them. I doubled, triple checked. 17%.
Please take it with a grain of salt. This is just a quick test. There are obviously false positives in my test sets. Perhaps many of them, since I didn't perform an extensively manual review of these lists. This doesn't imply that Cisco wouldn't block many positive host names that my model would flag as negative with high confidence.
But this still demonstrates that recurrent neural networks can be insanely useful for filtering this kind of domains.
Privacy-aware content filtering
Eventually, I'd like to publish my Keras/Tensorflow code, as well as the blacklists so that anybody can use them for non-commercial purposes.
However, publishing a huge list of questionnable websites may not be the best thing to do.
Ultimately, such a list should be only usable for filtering. Finding what the list actually contains should be hard, or at least, costly.
This has to be a solved problem, but I couldn't find much research on this.
Publishing hashes of the names instead of the names themselves doesn't solve the problem. The keyspace is very small, and dictionary attacks are very practical.
Password-stretching functions such as Argon2, Scrypt or Bcrypt seem like the answer: publish hashes of the names, as computing these hashes requires a significant amount of CPU cycles, drastically slowing down dictionary attacks.
Unfortunately, performing a very slow computation for every single DNS query would not be practical either. Even on a small home network, the rate of DNS queries sent by the various connected network devices is high, and for DNS, latency is critical.
One way to mitigate this is to use Bloom filters or Golomb-Coded Sets to avoid performing the computation for most host names not in the set. This is suboptimal. This is all-or-nothing. Some queries will get a fast response, some will be really slow, and unless the vector used in the probabilistic data structures is very large, even a significant amount of false positives will get a slow response. Updating these lists also involves complexity.
As an alternative, I am proposing a technique called progressive hashing, for the lack of a better name.
In a progressive hash function, the first bits of the output are computed sequentially. Computing a bit requires a higher work factor than the previously computed bit. Eventually, the remaining bits are all copied from the same state, after enough iterations to be comparable to a regular PBKDF2 output.
While computing a full hash still requires a significant amount of CPU cycles, verifying that a hash is not valid for a given input can be fast; the verification happens while the computation is being made, and if a computed bit doesn't match the expected one, the verification can return early instead of performing the full number of rounds.
In a content-control system, a reasonable assertion is that most queries will not be blocked. Therefore, checking for the presence of a host name (or a truncated version) in the hash set will be a fast operation. There will be exceptions, but the cost will be amortized without requiring bloom filters. The collisions from the first bits of a hash form an implicit probabilistic data structure.
While verification will usually be fast, finding domains matching a given hash requires the full hash, i.e. performing the full-round computation. This doesn't increase the key space, nor eliminates tradeoffs due to the targeted client platforms, but it significantly increases the cost over a plain list of host names.
Proof-of-concept code based on the BLAKE2B hash function has been published on github.
Exciting times ahead
Working on this was fun. It shows that many things can still be done to improve content filters.
I've been pleasantly surprised by the ridiculous accuracy of simple recurrent models, which demonstrates that deep learning can be a fantastic addition to traditional ways of building domain-specific blacklists.
And I finally got the parental control system I wanted.
What are the next steps? I really wish I could get some help manually reviewing training sets in order to keep improve the model.
But adding support for progressive-hashed blacklists to dnscrypt-proxy is the next step. Then, I'll be able to publish what I have, along with the Keras/Tensorflow models, and hopefully find some time to keep updating this.