Web Scraping for Cyber Security

Hello everyone,

Today we are going to learn about data scraping and how to automatically collect Cyber Threat Intelligence (CTI) feeds so we can programmatically extract and import IOCs into different SIEMs.

Do not worry, I will leave a link to my GitHub at the very end under "References & More Useful Information" so you can copy everything if you want.

**Disclaimer!** Remember that uncontrolled scraping can impact the website you are hitting. Be considerate with your scraping, read the terms and conditions to see if it's allowed, and try not to DDoS the server or website you are gathering information from.

-----------------------------------------------------------------------------------------------------------------------------

Executive Summary

Data scraping with the Beautiful Soup Python library to programmatically retrieve CTI feeds and ingest them into different SIEMs, for Threat Hunting, Detection Engineering, and Automation.
-----------------------------------------------------------------------------------------------------------------------------

What is Data Scraping... and why do we care?

"Data scraping involves pulling information out of a website and into a spreadsheet. To a dedicated data scraper, the method is an efficient way to grab a great deal of information for analysis, processing, or presentation."

So, it allows us to obtain almost any data we want from anywhere we want. But why do we want data? Well, you can use it to train Machine Learning models, for "AI", or to get a notification the moment the latest episode of the anime you follow is released. Anything you want!

Think of it as a way to programmatically retrieve information you think might be useful, either to build a dataset, or to do correlations. In our case, I will examine it from a Cyber perspective.

Automatically Collecting CTI Feeds

Where to look?

In this blog, I have included some links in the left panel, tagged as [Anime], [Sandbox], [Malware] and [CTI].

1. Links for you to use
  • [Anime] directs you to MyAnimeList, a very popular website that lists anime, in case you are more interested in knowing when the latest One Piece episode is released than in ingesting CTI feeds.
  • [Sandbox] leads to ANY.RUN (app.any.run), a popular public sandbox (please do not upload company-sensitive documents there).
  • [Malware] directs to vx-underground, a malware repository, in case you need any samples to analyze.
  • [CTI] contains a mix of different IOCs that some companies and researchers share for free with the community. They are updated frequently at their respective links, so for us it will be quite valuable to ingest them into our SIEMs, or simply to do some Threat Hunting around them.

Now that we have identified where the information we want lives ([CTI]), how do we retrieve it programmatically? The first step is to right-click on the website and select "View Source", or press Ctrl+U, to bring the page source up in your browser. Now you can click around to find where the information you want is; in our case we can simply Ctrl+F (find) our [CTI] tag.

2. Source Code Inspection

You can also do so by right-clicking the information you want, the [CTI] links, and selecting "Inspect Element"; this will also show you where it is located.

3. View Elements Inspection

And we can see that the information we want lives somewhere under "/html/body/aside/div[3]/div[1]/div/ul/li[2]/a", just following the HTML tags visible in the "Elements" tab of your dev console.

A quick reminder: on my website the links are directly in the source code, which is why we can find them using the first method. If this were a Single Page Application (SPA), where the content is injected into the Document Object Model (DOM) by JavaScript, we would need to interact with the page and locate the links as shown in the third picture.

How do we get them now?

Now we can finally introduce Beautiful Soup, the holy grail of Python scraping libraries; it parses HTML and XML documents. As I do not want to make this extremely dull to read, I will go straight to what you need; if you want more information on Beautiful Soup, I will add some references and links at the bottom.

4. Beautiful Soup

First we need to import Beautiful Soup from bs4 and instantiate a "soup 🥣" object.

**Note:** Beautiful Soup DOES NOT make web requests. You can handle those however you prefer, using Python's "requests" library or anything else of your choice. Or, if you do not want to make a web request at all, you can pass locally stored markup to the library and it will parse that too.
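To make that concrete, here is a minimal sketch of instantiating the soup object both ways; the URL and file name are placeholders, not the real blog address:

```python
import requests
from bs4 import BeautifulSoup

URL = "https://example-blog.com"  # hypothetical address, swap in the page you want to parse

# Option 1: let "requests" fetch the page, then hand the HTML to Beautiful Soup
response = requests.get(URL, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Option 2: no web request at all, parse HTML you already have on disk
with open("saved_page.html", encoding="utf-8") as f:
    local_soup = BeautifulSoup(f.read(), "html.parser")

print(soup.title)  # quick sanity check that parsing worked
```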

5. Code to extract links

Code Explanation:

Use Python's requests library to make a web request to my blog, create the "soup 🥣" object from the response, find the [CTI] link tags, and return all of the URLs.
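Since that code lives in a screenshot, here is a rough equivalent of the step; it assumes the [CTI] links are ordinary <a> tags whose visible text contains the "[CTI]" label, so adjust the filter to the real markup of whatever page you scrape:

```python
import requests
from bs4 import BeautifulSoup

BLOG_URL = "https://example-blog.com"  # placeholder, not the real blog address

def get_cti_links(url: str) -> list[str]:
    """Return the href of every link whose visible text carries the [CTI] tag."""
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        if "[CTI]" in anchor.get_text():  # keep only the links tagged [CTI]
            links.append(anchor["href"])
    return links

for link in get_cti_links(BLOG_URL):
    print(link)
```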

CTI Links:
- https://threatview.io/Downloads/URL-High-Confidence-Feed.txt
- https://raw.githubusercontent.com/stamparm/ipsum/master/levels/8.txt
- https://raw.githubusercontent.com/0xDanielLopez/TweetFeed/master/month.csv
- https://raw.githubusercontent.com/drb-ra/C2IntelFeeds/master/feeds/domainC2swithURLwithIP-filter-abused.csv
- https://bazaar.abuse.ch/export/txt/md5/recent

Now that we have all of our URLs in a variable, we can repeat this process and iterate through the new URLs to get all the IOCs in one place.

As we will now be iterating over the IOCs, let's do the example with the first feed.

6. Code to iterate over extracted URLs

We simply create some functions to regex out our IOCs and append them to a CSV at the end, and voilà, we get a CSV with what we expected.
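A sketch of what those functions might look like; the regexes are deliberately simplified (IPs, MD5 hashes and URLs only) and the helper names are mine, not the exact code from the screenshot:

```python
import csv
import re
import requests

# Simplified patterns; real feeds mix IPs, domains, hashes and URLs in different layouts
IOC_PATTERNS = {
    "ip": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "url": re.compile(r"https?://[^\s\"',]+"),
}

def extract_iocs(text: str) -> list[tuple[str, str]]:
    """Return (ioc_type, value) pairs found in a raw feed."""
    found = []
    for ioc_type, pattern in IOC_PATTERNS.items():
        for match in pattern.findall(text):
            found.append((ioc_type, match))
    return found

def feeds_to_csv(feed_urls: list[str], out_path: str = "iocs.csv") -> None:
    """Download each feed and append every IOC it contains to a CSV file."""
    with open(out_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for feed in feed_urls:
            raw = requests.get(feed, timeout=30).text
            for ioc_type, value in extract_iocs(raw):
                writer.writerow([feed, ioc_type, value])

# feeds_to_csv(cti_links)  # cti_links comes from the previous step
```

Each feed formats its indicators differently, so expect to tweak the patterns per source.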

7. Extraction Successful

How to Integrate This with my SIEM?

• For KQL-based SIEMs - Microsoft Defender for Endpoint, XDR, Sentinel, or whatever the name might be when you are reading this - here are some examples.
• For SPL-based and similar SIEMs - Splunk, CrowdStrike, Humio - either use the Security module or use inputlookup tables as the file to query, adapting your mapped fields - here are some examples.

Basically, this works in one of two ways: either you query the URL directly from the SIEM ("Advanced Hunting" or similar, if that is supported) and parse it in real time to threat hunt and detect suspicious behavior in your environment, or you scrape the data yourself. In the latter case, set up a cron job to run the script periodically and save the output to a file, then import that file into your SIEM and query it through a lookup against that file.
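For the cron-job route, a hypothetical wrapper could look like the sketch below; the module name, paths and schedule are illustrative, and it assumes the earlier helper functions were saved as cti_scraper.py:

```python
from pathlib import Path

# The functions below are the ones sketched earlier, assumed to be saved as cti_scraper.py
from cti_scraper import get_cti_links, feeds_to_csv

BLOG_URL = "https://example-blog.com"            # placeholder
LOOKUP_PATH = Path("/opt/cti/lookups/iocs.csv")  # wherever your SIEM expects the lookup file

# Example crontab entry (illustrative): refresh the lookup file every 6 hours
#   0 */6 * * * /usr/bin/python3 /opt/cti/refresh_iocs.py

def refresh_lookup() -> None:
    LOOKUP_PATH.parent.mkdir(parents=True, exist_ok=True)
    LOOKUP_PATH.unlink(missing_ok=True)          # start fresh so stale IOCs age out
    feeds_to_csv(get_cti_links(BLOG_URL), out_path=str(LOOKUP_PATH))

if __name__ == "__main__":
    refresh_lookup()
```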

For those environments using "Detection as Code", simply copying and adapting the code provided into your Jupyter Lab, or any other environment of your choice, will suffice. Data mangling your IOCs is where you will probably spend most of your time; parsing proper IOCs out of the different URLs is key for this to work.
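As a taste of that data mangling, here is a small sketch of normalising defanged indicators before they reach your SIEM; the refanging rules are illustrative, not exhaustive:

```python
def normalise_ioc(raw: str) -> str:
    """Refang a defanged indicator and strip stray whitespace (rules are illustrative)."""
    ioc = raw.strip().lower()
    ioc = ioc.replace("hxxp://", "http://").replace("hxxps://", "https://")  # refang scheme
    ioc = ioc.replace("[.]", ".").replace("(.)", ".")                        # refang dots
    ioc = ioc.replace("[:]", ":")
    return ioc

def dedupe(iocs: list[str]) -> list[str]:
    """Normalise every IOC, then drop duplicates while preserving order."""
    seen, clean = set(), []
    for raw in iocs:
        ioc = normalise_ioc(raw)
        if ioc and ioc not in seen:
            seen.add(ioc)
            clean.append(ioc)
    return clean

print(dedupe(["hxxps://evil[.]example/payload", "HXXPS://EVIL[.]EXAMPLE/payload"]))
# -> ['https://evil.example/payload']
```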

Data Quality and Final Remarks

As you might expect, the entire Threat Intelligence field is yet another business line. As such, all premium, top-of-the-line CTI feeds will probably require some payment. Still, this approach is as good as free gets: if you can find something online, you will be able to retrieve it and correlate it. But remember, the quality of your data will be the tricky part; you will likely need to spend some time fine-tuning your automations to meet your environment's needs. One CTI source may be good and/or relevant for finance but not for healthcare, and so on.

Do your research and experiment with different CTI sources; see what works and what brings value to your organization. One last example I would like to showcase is TweetFeed, a project by Daniel Lopez, which automatically scrapes IOCs shared by well-known cybersecurity researchers on X (Twitter), parses them, and appends them to GitHub.

You can probably see by now the potential this has, right? Just please be mindful of those frontline analysts getting flooded if one of your CTI feeds decides to include 8.8.8.8 (Google's DNS) as a malicious IP; this is, of course, based on a true story.

I am leaving you with a list of useful CTI resources below. Happy scraping, and please do let me know if you have a cool use case to share!

References & More Useful Information

• My GitHub, with the full code explained in this post - HERE
• Beautiful Soup in video format, full 1-hour course - HERE
• Beautiful Soup in video format, 7-minute summary - HERE
• Some more detailed blogs on scraping - HERE
• Notebook example extracting data from Steam - HERE
• List of CTI resources for you to scrape and enrich your SIEMs - HERE

Tags

#technical #python #webscraping #CTI #SIEM #Automation
