Hola! I'm Diego, a Security Engineer with some expertise in various fields including Digital Forensics and Incident Response (DFIR), Cloud, Machine Learning, and Artificial Intelligence. I'm passionate about integrating these areas to create valuable solutions. If you find my research and insights intriguing and applicable to your organization, I would love to hear your thoughts! Please feel free to leave a comment or reach out with any questions.
Web Scraping for Cyber Security
Hola everyone,
Today we are going to learn about data scraping and how to automatically collect Cyber Threat Intelligence (CTI) feeds, so we can programmatically extract IOCs and import them into different SIEMs.
Do not worry, I will leave a link to my GitHub at the very end under "References & More Useful Information" so you can copy everything if you want.
**Disclaimer!! Remember that uncontrolled scraping can have an impact on the website you are requesting data from. Be considerate with your scraping: read the terms and conditions to see if it's allowed, and make sure you do not overload (effectively DDoS) the server or website you are gathering information from.
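One way to put this disclaimer into practice is to check a site's robots.txt and pace your requests before scraping. The sketch below uses only Python's standard library. The user agent name `cti-scraper`, the example.com URLs, and the sample robots.txt content are all made up for illustration, and robots.txt does not replace reading the terms and conditions.

```python
from urllib import robotparser

# Hypothetical values, for illustration only.
USER_AGENT = "cti-scraper"
DEFAULT_DELAY_SECONDS = 1.0


def load_rules(robots_txt):
    """Parse the text of a robots.txt file that was already downloaded."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def is_allowed(rules, url, user_agent=USER_AGENT):
    """Return True if the parsed rules let this user agent fetch the URL."""
    return rules.can_fetch(user_agent, url)


def polite_delay(rules, user_agent=USER_AGENT):
    """Use the site's Crawl-delay if it declares one, else a default pause."""
    delay = rules.crawl_delay(user_agent)
    return float(delay) if delay is not None else DEFAULT_DELAY_SECONDS


# Invented robots.txt content standing in for a real site's file.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rules = load_rules(SAMPLE_ROBOTS)
print(is_allowed(rules, "https://example.com/feeds/iocs.txt"))  # True
print(is_allowed(rules, "https://example.com/private/report"))  # False
print(polite_delay(rules))                                      # 2.0
```

In a real scraper you would call `time.sleep(polite_delay(rules))` between requests and skip any URL for which `is_allowed` returns False.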
Data scraping using the Beautiful Soup Python library to programmatically retrieve CTI feeds and ingest them into different SIEMs, for Threat Hunting, Detection Engineering, and Automation.
"Data scraping involves pulling information out of a website and into a spreadsheet. To a dedicated data scraper, the method is an efficient way to grab a great deal of information for analysis, processing, or presentation."
Think of it as a way to programmatically retrieve information you think might be useful, either to build a dataset, or to do correlations. In our case, I will examine it from a Cyber perspective.
Automatically Collecting CTI Feeds
Where to look?
In this blog, I have included some links on your left panel, that you can see tagged as [Anime], [Sandbox], [Malware] and [CTI].
1. Links for you to use
[Anime] directs you to MyAnimeList, a very popular website that lists anime, in case you are more interested in knowing when the latest One Piece episode is released than in ingesting CTI feeds.
[Sandbox] leads to ANY.RUN (app.any.run), a popular public sandbox (please do not upload company-sensitive documents here).
[Malware] directs to VXUnderground, a malware repository, in case you need any samples to analyze.
[CTI] contains a mix of different IOCs that some companies and researchers share for free with the community. They are updated frequently at their respective links, so it will be quite valuable for us to ingest them into our SIEMs, or simply to do Threat Hunting around them.
Now that we have identified where the information we want is ([CTI]), how do we retrieve it programmatically? The first step is to right-click on the website and select "View Source" (or press Control+U) to open the page's source code in your browser. From there you can look around for the information you want; in our case we can simply press Control+F (find) and search for our [CTI] tag.
2. Source Code Inspection
You can also do so by simply hovering over the information you want, the [CTI] links, and clicking "Inspect Element"; this will also show you where it is located.
3. View Elements Inspection
And we can see that the information we want is located at "/html/body/aside/div[3]/div[1]/div/ul/li[2]/a", a path you get by simply following the HTML tags visible in the "Elements" tab of your dev console.
A quick reminder: on my website the links are directly in the source code, which is why we can find them using the first method. If this were a Single Page Application (SPA), where the content is modified in the Document Object Model (DOM) by some JavaScript, the links would not appear in the raw source; we would need to interact with the page and locate them using the method shown in the third picture.
How do we get them now?
Now we can finally introduce Beautiful Soup, the holy grail of scraping libraries for Python; it parses both HTML and XML documents. As I do not want to make this extremely dull for you to read, I will go straight to what you need. If you want more information on Beautiful Soup, I will add some references and links at the bottom.
4. Beautiful Soup
First we need to import Beautiful Soup from bs4 and instantiate a "soup 🥣" object.
**Note that Beautiful Soup DOES NOT make web requests. You can handle that however you prefer, using Python's "requests" library or anything else of your choice. Or, if you do not want to make a web request at all, you can pass a locally stored document (a file or a string) to the library and it will parse it too.
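To make that note concrete, here is a minimal sketch that parses a local HTML string with no web request involved. The markup is an invented, much shorter stand-in for the blog's sidebar, not its real source. The CSS selector is my own translation of the kind of path shown in the "Elements" tab: Beautiful Soup itself does not evaluate XPath, but it does accept CSS selectors, and a positional step like li[2] maps to li:nth-of-type(2).

```python
from bs4 import BeautifulSoup

# Invented markup that mimics a sidebar with tagged links.
HTML = """
<html><body>
  <aside>
    <div>
      <ul>
        <li><a href="https://example.com/sandbox">[Sandbox] Example sandbox</a></li>
        <li><a href="https://example.com/feed.txt">[CTI] Example feed</a></li>
      </ul>
    </div>
  </aside>
</body></html>
"""

# "html.parser" is Python's built-in parser, so nothing else needs installing.
soup = BeautifulSoup(HTML, "html.parser")

# Rough CSS equivalent of an XPath such as /html/body/aside/div/ul/li[2]/a
link = soup.select_one("aside > div > ul > li:nth-of-type(2) > a")

print(link.get_text(strip=True))  # [CTI] Example feed
print(link["href"])               # https://example.com/feed.txt
```

Positional selectors like this break as soon as the page layout changes, so matching on the link text (the [CTI] tag) is usually the more robust choice.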
5. Code to extract links
Code Explanation:
Use Python's requests library to make a web request to my blog. Create the "soup 🥣" object from the response. Find the [CTI] link tags and return all of their URLs.
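The steps just described can be sketched roughly as follows. This is not the exact code from the screenshot or from my GitHub, only a minimal illustration: the function names `fetch_html` and `extract_tagged_links` are made up for this example, and it assumes the tag ("[CTI]") appears in the visible text of each link.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10.0):
    """Download a page and return its HTML, failing loudly on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def extract_tagged_links(html, tag="[CTI]"):
    """Return the href of every link whose visible text contains the tag."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for anchor in soup.find_all("a", href=True):
        if tag in anchor.get_text():
            urls.append(anchor["href"])
    return urls


# Offline demonstration with invented markup (no request is made here).
SAMPLE_HTML = (
    '<a href="https://example.com/a.txt">[CTI] Feed A</a>'
    '<a href="https://example.com/anime">[Anime] Some list</a>'
    '<a href="https://example.com/b.txt">[CTI] Feed B</a>'
)
print(extract_tagged_links(SAMPLE_HTML))
```

Against a live page you would chain the two helpers, for example `extract_tagged_links(fetch_html(blog_url))`, where `blog_url` is whatever page holds your tagged links.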
Now that we have all of our URLs in a variable, we can repeat this process and iterate through our new URLs to get all IOCs in one place.
As we will now be iterating over the IOCs, let's do the example with the first URL.
6. Code to iterate over extracted URLs
We simply create some functions that use regular expressions to match our IOCs and append them to a CSV at the end, and voilà, we get a CSV with what we expected.
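As a rough idea of what such functions can look like, here is a sketch using only the standard library. The patterns cover just IPv4 addresses and MD5/SHA-256 hashes, the three-column CSV layout (type, value, source) is an assumption of mine, and the sample text is invented; the real feeds will need patterns tuned to their format.

```python
import csv
import io
import ipaddress
import re

IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
MD5_RE = re.compile(r"\b[a-fA-F0-9]{32}\b")
SHA256_RE = re.compile(r"\b[a-fA-F0-9]{64}\b")


def is_valid_ipv4(candidate):
    """Reject regex matches that are not real addresses, e.g. 999.1.1.1."""
    try:
        ipaddress.IPv4Address(candidate)
        return True
    except ValueError:
        return False


def extract_iocs(text, source):
    """Return (ioc_type, value, source) rows for every IOC found in text."""
    rows = []
    for ip in IPV4_RE.findall(text):
        if is_valid_ipv4(ip):
            rows.append(("ipv4", ip, source))
    for digest in SHA256_RE.findall(text):
        rows.append(("sha256", digest.lower(), source))
    for digest in MD5_RE.findall(text):
        rows.append(("md5", digest.lower(), source))
    return rows


def append_to_csv(rows, handle):
    """Write rows to an open file-like object in CSV format."""
    csv.writer(handle).writerows(rows)


# Invented feed content; 203.0.113.0/24 is a documentation-only range.
SAMPLE_FEED = "203.0.113.7 dropped d41d8cd98f00b204e9800998ecf8427e, 999.1.1.1 is junk"

rows = extract_iocs(SAMPLE_FEED, "example-feed")
buffer = io.StringIO()
append_to_csv(rows, buffer)
print(buffer.getvalue())
```

To append to a real file instead of an in-memory buffer, open it with `open(path, "a", newline="")` and pass that handle to `append_to_csv`.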
7. Extraction Successful
How to Integrate This with my SIEM?
For KQL-based SIEMs - Microsoft Defender for Endpoint, XDR, Sentinel, or whatever the name might be when you are reading this - here are some examples
For SPL and similar platforms (Splunk, CrowdStrike, Humio), either use the Security module or use inputlookup with your file as the lookup table to query, adapting your mapped fields - here are some examples
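For the KQL case, one hedged illustration is to let Python turn the scraped IPs into a query string you can paste into Advanced Hunting. The table and column names used here (`DeviceNetworkEvents`, `Timestamp`, `RemoteIP` and so on) come from the Microsoft Defender Advanced Hunting schema; other workspaces may use different ones (Sentinel tables typically use `TimeGenerated`), so treat them as assumptions to adapt. The helper name `build_kql_ip_hunt` is mine and not part of any SDK.

```python
import json


def build_kql_ip_hunt(ips, lookback="7d"):
    """Build a KQL query that looks for connections to the given IPs."""
    # json.dumps produces a double-quoted list literal, which KQL's
    # dynamic() accepts; sorting and de-duplicating keeps the query stable.
    ioc_list = json.dumps(sorted(set(ips)))
    return (
        f"let iocs = dynamic({ioc_list});\n"
        "DeviceNetworkEvents\n"
        f"| where Timestamp > ago({lookback})\n"
        "| where RemoteIP in (iocs)\n"
        "| project Timestamp, DeviceName, RemoteIP, RemotePort, "
        "InitiatingProcessFileName"
    )


# Documentation-range addresses used as stand-ins for scraped IOCs.
query = build_kql_ip_hunt(["203.0.113.7", "198.51.100.23", "203.0.113.7"])
print(query)
```

The same idea works for hashes or domains by swapping the table and the column being matched.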
Basically, this works in one of two ways. Either you query the URL directly from the SIEM ("Advanced Hunting" or similar, if that is supported) and parse it in real time to threat hunt and detect suspicious behavior in your environment, or you scrape the data yourself: set up a cron job to run the scraper periodically and save the output to a file, then import that file into your SIEM and query it as a lookup.
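For the "scrape it yourself" route, the part worth getting right is the file the SIEM reads. The sketch below merges freshly scraped IOCs into an existing lookup CSV and replaces the file atomically, so the SIEM never imports a half-written file. The column names `ioc_type` and `ioc_value`, the file name `cti_iocs.csv`, and the function name `refresh_lookup` are assumptions for illustration. A crontab entry such as `0 * * * * python3 refresh_iocs.py` (script name hypothetical) would then run it every hour.

```python
import csv
import os
import tempfile


def refresh_lookup(new_iocs, path):
    """Merge (type, value) pairs into the CSV at path; return the row count."""
    known = set()
    if os.path.exists(path):
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the header row
            known.update(tuple(row) for row in reader if len(row) == 2)
    known.update((ioc_type, value) for ioc_type, value in new_iocs)

    # Write to a temporary file in the same directory, then swap it in.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["ioc_type", "ioc_value"])
        writer.writerows(sorted(known))
    os.replace(tmp_path, path)
    return len(known)


# Demonstration in a throwaway directory, simulating two scheduled runs.
with tempfile.TemporaryDirectory() as workdir:
    lookup = os.path.join(workdir, "cti_iocs.csv")
    print(refresh_lookup([("ipv4", "203.0.113.7")], lookup))
    print(refresh_lookup([("ipv4", "203.0.113.7"), ("ipv4", "198.51.100.23")], lookup))
```

The second run reports two rows rather than three because the repeated IP is de-duplicated on merge.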
For those environments using "Detection as Code", simply copying and adapting the provided code in Jupyter Lab, or any other environment of your choice, will suffice. Data wrangling is where you will probably spend most of your time; parsing clean IOCs out of differently formatted sources is key for this to work.
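A typical example of that cleanup work is "refanging": many sources publish indicators in a defanged form (hxxp, [.]) so that nobody clicks them by accident, and the same IOC often shows up in several feeds. The sketch below undoes a few common defanging conventions and removes duplicates. The list of conventions is deliberately short and is my own selection, so expect to extend it for the feeds you actually use.

```python
import re

# A few common defanging conventions; real feeds may use others.
_REPLACEMENTS = [
    (re.compile(r"hxxp", re.IGNORECASE), "http"),
    (re.compile(r"\[\.\]|\(\.\)|\[dot\]", re.IGNORECASE), "."),
    (re.compile(r"\[:\]"), ":"),
]


def refang(value):
    """Turn a defanged indicator back into its usable form."""
    cleaned = value.strip()
    for pattern, replacement in _REPLACEMENTS:
        cleaned = pattern.sub(replacement, cleaned)
    return cleaned


def normalize(values):
    """Refang every value and drop blanks and duplicates, keeping order."""
    seen = set()
    result = []
    for raw in values:
        ioc = refang(raw)
        if ioc and ioc not in seen:
            seen.add(ioc)
            result.append(ioc)
    return result


# Invented indicators using reserved example domains and IP ranges.
raw_iocs = [
    "hxxps://evil[.]example/path",
    "https://evil.example/path",
    " 203.0.113[.]7 ",
    "",
]
print(normalize(raw_iocs))
# ['https://evil.example/path', '203.0.113.7']
```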
Data Quality and Final Remarks
As you might expect, Threat Intelligence is yet another business line. As such, premium, top-of-the-line CTI feeds will probably require some payment. The approach in this post is about as good as it gets for free: if something is published online, you will be able to get it and correlate it. But remember, data quality will be tricky. You will likely need to spend some time fine-tuning your automations to meet your environment's needs. A CTI source may be good and relevant for finance but not for healthcare, for example.
Do your research and experiment with different CTI sources to see what works and what brings value to your organization. One last example I would like to showcase is TweetFeed, a project by Daniel Lopez, which automatically scrapes IOCs from well-known cybersecurity researchers on X (Twitter), parses them, and appends them to GitHub.
You can probably see by now the potential this has, right? Just please be mindful of those frontline analysts getting flooded if any of your CTI feeds decides to include 8.8.8.8 (Google's DNS) as a malicious IP; this is, of course, based on a true story.
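A cheap safeguard against that scenario is to run every scraped IP through an allowlist before it reaches your detections. The sketch below is one minimal way to do it with the standard library. The allowlist contents (a few public DNS resolvers plus the RFC 1918 private ranges) are only a starting point I picked for illustration; build yours from what is known-good in your own environment.

```python
import ipaddress

# Illustrative allowlist: well-known public resolvers and private ranges.
ALLOWLIST = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "8.8.8.8/32",      # Google DNS
        "8.8.4.4/32",      # Google DNS
        "1.1.1.1/32",      # Cloudflare DNS
        "10.0.0.0/8",      # RFC 1918
        "172.16.0.0/12",   # RFC 1918
        "192.168.0.0/16",  # RFC 1918
    )
]


def is_allowlisted(candidate):
    """Return True if the IP falls inside any allowlisted network."""
    address = ipaddress.ip_address(candidate)
    return any(address in network for network in ALLOWLIST)


def filter_iocs(candidates):
    """Split candidates into (kept, dropped); invalid IPs are dropped too."""
    kept, dropped = [], []
    for candidate in candidates:
        try:
            allowlisted = is_allowlisted(candidate)
        except ValueError:
            dropped.append(candidate)  # not an IP address at all
            continue
        (dropped if allowlisted else kept).append(candidate)
    return kept, dropped


# 203.0.113.7 is a documentation-range stand-in for a real indicator.
kept, dropped = filter_iocs(["203.0.113.7", "8.8.8.8", "192.168.1.10", "not-an-ip"])
print(kept)     # ['203.0.113.7']
print(dropped)  # ['8.8.8.8', '192.168.1.10', 'not-an-ip']
```

Logging what was dropped, rather than silently discarding it, also tells you which feeds tend to publish noisy indicators.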
I am leaving you with a list of useful CTI resources below, happy scraping, and please do let me know if you have a cool use-case to share!
I have added a little challenge at the end of this post. You should see a button that you can click and get a secret CTI source, how would you retrieve that? Spoiler
References & More Useful Information
My GitHub, where the full code explained in this post can be found - HERE
Beautiful Soup in video format, full 1h course - HERE
Beautiful Soup in video format, 7 min video summarized - HERE