What scraping is and how to protect yourself from appearing in leaks like those of Facebook and LinkedIn
When we learn that personal data from a web service, a social network or any other platform has surfaced online, the first thing that comes to mind is that the affected portal has suffered a security breach. However, this is not always the case: sometimes the information has been obtained by other means, without compromising the security of the affected website.
Today, unless you have decided to avoid having an internet presence, there is a vast amount of information about you online. Anyone can see your full name on Facebook, your job title on LinkedIn, or learn your interests through Twitter. In addition, if you do not set your data to private, your contacts can access your email address or your phone number, among other confidential information.
Manually collecting the personal data that is displayed publicly would be a daunting task, but there are ways to automate the extraction of this information. This practice is called scraping, and we are going to explain what it consists of.
What is scraping
What is scraping exactly? Scraping is a set of techniques used to extract information from websites and store it in a structured way. This work is not done manually, but is carried out in an automated way using software specially created for this purpose.
Although in recent weeks we have seen scraping linked to illegitimate data collection, this activity is not necessarily malicious. It is the same technique that search engines, such as Google or Bing, use to index public information on web pages in an automated way.
With this technique it is possible to obtain structured data that can be stored in a database, a spreadsheet or another storage format. Apart from search engines, scraping is also widely used by price comparison sites, price history applications, portals offering sports scores, web archiving initiatives and many others.
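To make the idea concrete, here is a minimal sketch in Python of what a scraper does: it reads the HTML of a page, picks out the fields it is interested in and saves them as rows in a spreadsheet-style format (CSV). The sample markup, the class names "profile", "name" and "job", and the helper names are all hypothetical and chosen only for illustration; a real scraper would download the pages over HTTP and would have to adapt to each site's actual markup.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical markup standing in for a public profile listing; a real
# scraper would download pages like this over HTTP instead.
SAMPLE_HTML = """
<ul>
  <li class="profile"><span class="name">Ana Perez</span><span class="job">Engineer</span></li>
  <li class="profile"><span class="name">Luis Gomez</span><span class="job">Designer</span></li>
</ul>
"""


class ProfileParser(HTMLParser):
    """Collects the text of the "name" and "job" spans of each profile."""

    FIELDS = ("name", "job")

    def __init__(self):
        super().__init__()
        self.rows = []      # one dict per profile found
        self._field = None  # field whose text is currently being read

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "profile":
            self.rows.append({})  # a new profile starts here
        elif tag == "span" and css_class in self.FIELDS and self.rows:
            self._field = css_class

    def handle_data(self, data):
        # Text outside the spans of interest is ignored.
        if self._field is not None:
            row = self.rows[-1]
            row[self._field] = row.get(self._field, "") + data

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def scrape_profiles(html):
    """Returns a list of {"name": ..., "job": ...} dicts found in the HTML."""
    parser = ProfileParser()
    parser.feed(html)
    return parser.rows


def to_csv(rows):
    """Serializes the extracted rows as CSV text, header line included."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=ProfileParser.FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_csv(scrape_profiles(SAMPLE_HTML)))
```

Run in a loop over millions of profile pages, a program of this kind is enough to build databases like the ones described in this article, without ever touching anything the site does not already show publicly.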
Although it can be a completely legitimate practice, scraping can also violate intellectual property, amount to unfair competition, and breach the General Data Protection Regulation or the Data Protection Law.
The role of scraping in the latest Facebook, LinkedIn or ClubHouse leaks
In the last few weeks we have learned of three major data leaks: Facebook, LinkedIn, and ClubHouse. However, none of them was the result of a hack; the information gathered in the databases handled by cybercriminals comes from scraping.
When criminals hack into a website, they can access confidential user information that the portal stores but does not display publicly, such as usernames, passwords, account or credit card numbers, emails, phone numbers, etc.
With scraping, by contrast, it is not possible to obtain confidential data such as passwords, but it is possible to collect all the public information of the users. This information can also be very extensive and include name and surname, email, telephone number, links to social profiles, photographs and other personal data.
The case of Facebook: theft of data from more than 533 million users
In early April we learned of one of Facebook’s biggest personal data leaks. Specifically, more than 533 million users were affected, including 11 million Spaniards, whose information appeared in a database posted on a hacking forum. The information included full name, location, email address, phone number, Facebook ID, date of birth, and biographies.
Although at first it was said that Facebook had suffered a new security breach, the company denied it, pointing out that its systems had not been hacked and that the data in the leak had been obtained through scraping.
According to Mark Zuckerberg’s social network, the criminals collected the data before September 2019, when the company patched a bug that exposed users’ phone numbers and allowed them to be extracted from Facebook’s servers.
“We believe that the data in question was extracted from people’s Facebook profiles by malicious actors using our contact importer before September 2019,” the corporation writes. “This feature was designed to help people easily find their friends using their contact lists.”
You can find out whether you have been a victim of this massive leak on the Have I Been Pwned? website, by entering either your email or your phone number in international format. Facebook decided not to notify users whose data was exposed, arguing that it is not entirely clear who has been affected and that nothing can be done anymore, since the data has already been publicly exposed.
LinkedIn: personal data of 500 million users appear
But Facebook has not been the only social network affected by scraping recently. A few days after we learned about that leak, a user selling a database of more than 500 million LinkedIn users appeared on the Dark Web.
Cybernews, the specialized media that discovered the database, verified that it contained real information through a sample. It included LinkedIn ID, full name, email, phone number, gender, link to LinkedIn profile, links to other social media profiles, professional titles, and other career-related data. Passwords weren’t included this time either.
Cybernews experts cannot tell whether the information is current or comes from previous security breaches, although they suspect that it is another case of scraping.
ClubHouse: personal data of more than 1.3 million accounts
The third major platform affected by scraping in this short period of time has been ClubHouse. In mid-April, the news broke that the fashionable social network could have been hacked, since a file containing the personal data of 1.3 million accounts had been detected on a well-known hacker forum.
On this occasion, the database contains the user’s identification code, real name, profile photo, username, Twitter and Instagram usernames, the number of followers and of people followed, the date the account was created and the user who invited them to ClubHouse.
All this information is public and can be seen by anyone who visits your profile on the social network, which is why the company has been quick to deny that it has suffered a security breach. ClubHouse has explained that this data can be obtained through the app or the API, which means the information was collected through scraping techniques.
How to protect yourself from having your data collected
Although websites can take some measures to stop scrapers, such as adding entries to the robots.txt file, blocking the IP addresses of bots, adding a captcha or another manual verification system, or using the services of anti-scraping companies, the truth is that it is very difficult to prevent data extraction using scraping techniques.
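The robots.txt file is a good example of why these measures fall short: it only states which paths the site asks bots not to visit, and it is up to each bot to read it and obey. The following sketch uses the robots.txt parser from Python's standard library to show how a well-behaved bot would check the rules before requesting a page. The robots.txt contents, the bot name "example-bot" and the example.com URLs are hypothetical; a malicious scraper simply skips this check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of an imaginary site that asks every bot
# to stay away from user profile pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /profiles/
"""


def build_rules(robots_txt):
    """Parses robots.txt text into a rule set that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_rules(ROBOTS_TXT)

# A compliant bot asks before fetching each URL.
print(rules.can_fetch("example-bot", "https://example.com/profiles/123"))
print(rules.can_fetch("example-bot", "https://example.com/about"))
```

Under these assumed rules the first check is refused and the second is allowed, but nothing in the protocol enforces the answer, which is why sites combine it with IP blocking, captchas and similar barriers.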
So how can we protect ourselves so that our personal data does not appear in these types of databases? The quick answer is not to create profiles on services that show public information, but it is clear that it is an option that almost no one chooses today.
If you do not want to give up having profiles on social networks, it is best that you enter as little personal information as possible. Do not provide your phone number, or the email that you usually use in your daily communications, and try to have a private profile that cannot be consulted by everyone.
And what if your personal data has appeared in a leak? In the event that your contact information has fallen into the wrong hands, you may receive malicious emails, as well as fraudulent SMS or calls. Keep in mind that criminals sell the databases to the highest bidder, and that they are often used to carry out phishing campaigns, distribute malware or carry out all kinds of scams.
In addition, once cybercriminals have your username, your email and other personal data, even if the password is not among the leaked information, they can use credentials from past security breaches to try to find your password and get into your accounts.
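One way to reduce this risk is to check whether a password you use already appears in known breaches. The sketch below assumes the Pwned Passwords range service that belongs to Have I Been Pwned, which this article does not describe: only the first five characters of the password's SHA-1 hash are sent, and the service answers with the matching hash endings and how many times each one has been seen. The function names and the User-Agent value are hypothetical, and the network call is kept in its own function so the rest can be tried offline.

```python
import hashlib
import urllib.request

# Assumed endpoint of the Pwned Passwords range service.
RANGE_URL = "https://api.pwnedpasswords.com/range/"


def split_hash(password):
    """Returns the SHA-1 hash of the password as (first 5 chars, remaining 35)."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def count_in_response(suffix, body):
    """Looks for the hash ending in a response made of 'ENDING:COUNT' lines."""
    for line in body.splitlines():
        candidate, _, count = line.partition(":")
        if candidate.strip().upper() == suffix:
            return int(count)
    return 0  # the ending is not listed for this prefix


def breach_count(password):
    """Asks the service how many times the password appears in known breaches."""
    prefix, suffix = split_hash(password)
    request = urllib.request.Request(
        RANGE_URL + prefix,
        headers={"User-Agent": "scraping-article-example"},  # hypothetical name
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        body = response.read().decode("utf-8")
    return count_in_response(suffix, body)
```

Calling breach_count with one of your passwords requires an internet connection; any result above zero means that password has already circulated in a breach and should be changed wherever you use it.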
To avoid unpleasant surprises, follow safe internet practices and protect yourself.