What is crawling? And why is it so hard?
I can always just go to a website, can’t I? We often get these questions from our customers and from people trying to understand what we do. Let’s explain what crawling is and why it is so tough.
Crawling is grabbing information on the internet and making that information available. For a crawler to work, a few parts are critical: a definition of where to look, a way to fetch the data, and a way to store it in a meaningful form.
Let’s discuss these different parts in more detail. A definition is the starting point of any crawler. What is the location that we need to visit? When you navigate to Facebook or Twitter, you know where you are going, and just like you, our crawler needs to know where to look for the information.
But to a computer, websites that look alike to us are built very differently. Take a Facebook tile and a Twitter tile side by side: from a user’s perspective, it’s easy to think of both as text (and sometimes text with photos). But for a computer, these two elements are “named” completely differently. The Facebook tile is called <div class="m8h3af8h l7ghb35v kjdc1dyq kmwttqpk gh25dzvf n3t5jt4f"> and the Twitter tile <div class="css-901oao css-16my406 r-poiln3 r-bcqeeo r-qvutc0">.
As you can see, it’s important to tell the computer exactly where to look. And you can imagine how the challenges multiply when we have to navigate through multiple pages or categories. All of these steps need to be accurately defined (and, of course, updated if the underlying website changes something). And every website is different.
You can try this yourself by right-clicking on a website and choosing “Inspect Element” to get an idea of how complicated a website looks behind the scenes.
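To make this concrete, here is a minimal sketch in Python (standard library only) of what “telling the computer exactly where to look” amounts to: pulling the text out of a <div> with a specific class attribute. The HTML snippet and the class string are illustrative, not Facebook’s real markup.

```python
from html.parser import HTMLParser

class TileExtractor(HTMLParser):
    """Collect the text inside a <div> with a given class attribute."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0   # > 0 while we are inside the target <div>
        self.text = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":          # nested div inside the target
                self.depth += 1
        elif tag == "div" and dict(attrs).get("class") == self.target_class:
            self.depth = 1            # found the tile we were defined to find

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.text.append(data.strip())

# Illustrative markup; a real definition would use the site's actual classes.
parser = TileExtractor("m8h3af8h l7ghb35v")
parser.feed('<div class="m8h3af8h l7ghb35v"><p>Post text</p></div>')
print(" ".join(parser.text))  # → Post text
```

If the site renames that class tomorrow, the definition silently stops matching, which is exactly why definitions have to be maintained.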
Suppose we’ve got a good definition set up for a particular domain (website). The job is only just starting. The second part of the crawling process is fetching the data from the domain.
Here we are in a constant tug of war with the websites we are trying to crawl. These websites want to put their content out in public, yet also protect themselves against other companies taking their data and “rebuilding” their product.
You are probably familiar with many of the things used to stop bots (or crawlers) from crawling these websites. A Captcha is something we’ve all seen; it aims to differentiate between a real (human) user and a fake (computer) user. Captchas are notably hard for computers to solve and relatively easy (but annoying) for people. But that’s not all. A vital part of the puzzle is IP addresses.
Every computer has a unique “virtual address” called an Internet Protocol address, or IP address for short, which tells the website where a request is coming from. Your personal IP address is typically not blacklisted, because you behave like a human when visiting various websites.
What’s more, you have some cookies on your computer that can be “read” by the website, which also gives some clues about your humanness.
Now, for crawlers to work on an industrial scale, we use cloud services from AWS, Google, or Azure. But companies like Cloudflare know which IP addresses belong to these big cloud providers and can simply block them.
To overcome this, proxy services can be used. They hide the original IP address by sending the traffic through another location. The most advanced solution is residential proxies: routing traffic through people’s personal IP addresses (and, of course, paying them for the use of their address for this purpose).
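As a sketch of the mechanics, Python’s standard library lets you build an HTTP client whose traffic exits through a proxy, so the target site sees the proxy’s IP instead of yours. The proxy address below is a placeholder, not a real endpoint.

```python
import urllib.request

# Route all traffic through a proxy; the address is a placeholder.
proxy = urllib.request.ProxyHandler({
    "http": "http://proxy.example.com:8080",
    "https": "http://proxy.example.com:8080",
})
opener = urllib.request.build_opener(proxy)
# opener.open("https://example.com")  # requests would now exit via the proxy
```

A residential-proxy service works the same way conceptually, except the exit point is someone’s home connection rather than a data-center machine.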
And that’s not all. These websites also look at the actual behavior of users. A person who goes to Facebook scrolls around, clicks here and there, lingers a few moments on a post, and then moves on. A bot, however, could fire an almost unlimited number of requests at a particular domain (which is sometimes done maliciously, as in a Distributed Denial-of-Service (DDoS) attack). We always make sure we treat the website as politely as possible, sending one request per second rather than overloading the site we are trying to fetch.
What websites do, in turn, is check that visitors actually behave like people, for example by putting in time-outs. After 100 requests, say, the website blocks an IP address for an hour.
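The polite pacing and the reaction to time-outs described above can be sketched as a simple fetch loop. Here `fetch` is a hypothetical stand-in for the real HTTP call, and the one-second interval matches the rule of thumb mentioned earlier.

```python
import time

def polite_crawl(urls, fetch, min_interval=1.0):
    """Fetch each URL at most once per min_interval seconds,
    backing off exponentially when the site rate-limits us."""
    results = {}
    backoff = min_interval
    for url in urls:
        while True:
            status, body = fetch(url)
            if status == 429:                      # rate-limited: wait, retry
                time.sleep(backoff)
                backoff = min(backoff * 2, 3600)   # cap the wait at one hour
                continue
            backoff = min_interval                 # success resets the backoff
            results[url] = body
            break
        time.sleep(min_interval)                   # stay friendly between pages
    return results
```

Real fetching logic is more involved (retrying through different proxies, solving Captchas), but the principle of pacing plus back-off is the same.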
If only websites gave us one problem to solve at a time… but they don’t. They throw as much at us as possible: combinations of behavior checks, IP blocks, Captchas, and time-outs. And different problems require different solutions. Not only that, different solutions are more or less costly in terms of time and computing resources.
For Legentic to be truly leading in this category, we’ve invented a content fetching system (CFS). This system smartly identifies the most efficient way to fetch specific pages. A ton of little services can be freely combined to make an optimal choice of how to get information from the internet. It also ensures we don’t get back mumbo jumbo, but rather understandable text.
Finally, the collected data needs to make sense. Storing all of the HTML would be far too much. So we grab only the information we want and store it in a meaningful way, as fields like “body” and “title”, so our customers can look at the information in our Legentic App.
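Conceptually, the stored record looks something like this sketch. The field names mirror the “title” and “body” fields mentioned above; the helper function and the example ad are made up for illustration.

```python
import json

def to_record(title, body, url):
    """Keep only the named fields we care about, not the raw HTML."""
    return {"title": title.strip(), "body": body.strip(), "source": url}

record = to_record(" 2014 VW Golf ", "Minor damage, runs fine. ",
                   "https://example.com/ad/123")
print(json.dumps(record))
```

A compact record like this is what makes the data searchable in an app, instead of a pile of markup.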
This is also the stage at which we can do additional analysis of the data, applying our Machine Learning and AI to, for instance, extract license plates and determine whether a car has damage. Additionally, we extract the model, make, and year from the ad’s body and title.
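The real extraction is ML-driven, but a toy example gives the flavor: a regular expression that pulls a plausible model year out of an ad title. The function and the regex are illustrative only, not our production logic.

```python
import re

def extract_year(text):
    """Return the first plausible model year (1950-2049) found, else None."""
    m = re.search(r"\b(19[5-9]\d|20[0-4]\d)\b", text)
    return int(m.group(1)) if m else None

print(extract_year("2014 VW Golf, low mileage"))  # → 2014
```

Make and model are much harder, since they don’t follow a fixed pattern, which is where the Machine Learning comes in.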
So there you go. When you visit a website yourself, you know where to go and you visit only a few pages. When we want to fetch thousands of pages per day, the internet throws a slew of challenges at us. We happily tackle them, but they might not be as easy as you originally thought.