Your cue, graph: how we couldn't find a good network graph and built our own


Investigating cases related to phishing, botnets, fraudulent transactions, and criminal hacker groups, Group-IB experts have been using graph analysis for many years to identify various kinds of connections. Each type of case has its own datasets, its own algorithms for identifying connections, and interfaces tailored to specific tasks. All of these tools were developed internally at Group-IB and were available only to our employees.

Graph analysis of network infrastructure (the network graph) was the first internal tool that we built into all of the company's public products. Before creating our network graph, we analyzed many similar products on the market and did not find a single one that met our needs. In this article, we will talk about how we created the network graph, how we use it, and what difficulties we encountered.

Dmitry Volkov, CTO of Group-IB and Head of Cyber Intelligence

What can a Group-IB network graph do?

Investigations

Since Group-IB was founded in 2003, identifying, deanonymizing, and bringing cybercriminals to justice has been a top priority in our work. No investigation of a cyberattack is complete without an analysis of the attackers' network infrastructure. At the very beginning of our journey, finding the relationships that could help identify criminals was painstaking manual work: information about domain names, IP addresses, digital fingerprints of servers, and so on.

Most attackers try to act as anonymously as possible online. However, like all people, they make mistakes. The main task of such an analysis is to find "white" or "gray" historical projects of the attackers that intersect with the malicious infrastructure used in the incident under investigation. If "white projects" can be found, then identifying the attacker usually becomes a trivial task. With "gray" ones, the search takes more time and effort, since their owners try to anonymize or hide registration data, but the chances remain quite high. As a rule, at the beginning of their criminal careers, attackers pay less attention to their own security and make more mistakes, so the deeper we can dive into the history, the higher the chances of a successful investigation. That is why a network graph with a long history is an extremely important element of such an investigation. Simply put, the deeper a company's historical data, the better its graph. Roughly speaking, a 5-year history might help solve 1-2 out of 10 crimes, while a 15-year history gives a chance of solving all ten.

Phishing and fraud detection

Every time we receive a suspicious link to a phishing, fraudulent, or pirated resource, we automatically build a graph of related network resources and check all discovered hosts for similar content. This allows us to find both old phishing sites that were active but unknown, and completely new ones that have been prepared for future attacks but are not yet in use. An elementary example that comes up quite often: we find a phishing site on a server hosting only five sites. Checking each of them, we find phishing content on the other sites too, which means we can block five instead of one.
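As an illustration only (not the actual detection logic), pages from co-hosted domains could be compared against a known phishing page with a simple similarity ratio. The hostnames, HTML snippets, and threshold below are all made up:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] of how closely two HTML documents match."""
    return SequenceMatcher(None, a, b).ratio()

def find_similar_pages(known_phish_html: str, candidates: dict,
                       threshold: float = 0.8) -> list:
    """Return hostnames whose page content resembles the known phishing page."""
    return [host for host, html in candidates.items()
            if similarity(known_phish_html, html) >= threshold]

# Hypothetical example: three co-hosted sites, two reuse the phishing kit's template.
kit = "<html><body><form action='/steal'>Enter your card number</form></body></html>"
pages = {
    "shop-secure.example": "<html><body><form action='/steal'>Enter your card number</form></body></html>",
    "bank-login.example":  "<html><body><form action='/steal'>Enter your card PIN</form></body></html>",
    "blog.example":        "<html><body><h1>My travel notes</h1><p>Day one...</p></body></html>",
}
print(find_similar_pages(kit, pages, threshold=0.8))
```

In production you would compare normalized DOM structure or screenshots rather than raw HTML, but the principle of checking every co-hosted neighbor is the same.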

Search for backends

This process is necessary to establish where a malicious server is actually located.
99% of card shops, hacker forums, many phishing resources, and other malicious servers are hidden behind proxy servers, both their own and those of legitimate services such as Cloudflare. Knowing the real backend is very important for investigations: it reveals the hosting provider from which the server can be seized, and it makes it possible to build links to other malicious projects.

For example, you have a phishing site for collecting bank card data that resolves to IP address 11.11.11.11, and a card shop that resolves to IP address 22.22.22.22. Analysis may reveal that both the phishing site and the card shop share a common backend IP address, for example 33.33.33.33. This knowledge makes it possible to link the phishing attacks to a card shop that may be selling the stolen card data.
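Once the backends are known, the correlation itself is straightforward: group fronted resources by the backend IP they map to. A minimal sketch with hypothetical data:

```python
from collections import defaultdict

def shared_backends(front_to_backend: dict) -> dict:
    """Group fronted resources by the backend IP they are served from.

    A backend serving more than one resource hints at a common operator.
    """
    groups = defaultdict(list)
    for resource, backend_ip in front_to_backend.items():
        groups[backend_ip].append(resource)
    # Keep only backends shared by two or more resources.
    return {ip: res for ip, res in groups.items() if len(res) > 1}

# Hypothetical data: fronted resources mapped to their discovered backend IPs.
observed = {
    "phish-bank.example (11.11.11.11)": "33.33.33.33",
    "cardshop.example (22.22.22.22)":   "33.33.33.33",
    "unrelated.example (44.44.44.44)":  "55.55.55.55",
}
print(shared_backends(observed))
# → {'33.33.33.33': ['phish-bank.example (11.11.11.11)', 'cardshop.example (22.22.22.22)']}
```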

Event correlation

When you have two different triggers (say, on an IDS) involving different malware and different attack-control servers, you would normally treat them as two independent events. But if there is a good graph of connections between the malicious infrastructures, it becomes obvious that these are not separate attacks but stages of a single, more complex multi-stage attack. And if one of the events has already been attributed to some group of attackers, the second can be attributed to the same group. Of course, the attribution process is much more complex than this, so treat it as a simple example.

Enrichment of indicators

We will not dwell on this, since it is the most common scenario for using graphs in cybersecurity: you provide one indicator as input and get an array of related indicators as output.

Pattern detection

Pattern detection is essential for effective hunting. Graphs allow us not only to find related elements, but also to identify common properties characteristic of a particular group of hackers. Knowing these unique features makes it possible to recognize the attackers' infrastructure at the preparation stage, even without evidence confirming an attack, such as phishing emails or malware.

Why did we create our own network graph?

Again, we evaluated solutions from different vendors before concluding that we needed to develop our own tool, one that could do things no existing product offered. It took several years to build, during which we completely overhauled it more than once. Despite the long development period, we have still not found a single analogue that meets our requirements. With our own product, we were eventually able to solve almost all the problems we found in existing network graphs. Let's look at these issues in detail:


Problem: no single provider offers all the data collections: domains, passive DNS, passive SSL, DNS records, open ports, services running on ports, files interacting with domain names and IP addresses. Typically, vendors provide separate types of data, and to get the full picture you need to buy subscriptions from all of them. Even then, it is not always possible to get all the data: some passive SSL providers only supply data about certificates issued by trusted CAs, and their coverage of self-signed certificates is extremely poor. Others provide data on self-signed certificates, but collect them only from standard ports.
Solution: we collected all of the above datasets ourselves. For example, to collect data about SSL certificates, we wrote our own service that gathers them both from trusted CAs and by scanning the entire IPv4 space. Certificates were collected not only from IPs, but also from all domains and subdomains in our database: if you have the domain example.com and its subdomain www.example.com, and both resolve to IP 1.1.1.1, then requesting an SSL certificate from port 443 on the IP, the domain, and the subdomain can yield three different results. To collect data on open ports and running services, we had to build our own distributed scanning system, because other services often blacklisted the IP addresses of our scanning servers. Our scanning servers get blacklisted too, but our hit rate for the services we need is higher than that of providers who simply scan as many ports as possible and sell access to that data.

Problem: no access to the entire database of historical records. Every decent supplier has a good accumulated history, but for natural reasons we, as a client, could not get access to all of their historical data. That is, you can get the entire history for a single record, such as a domain or IP address, but you cannot see the history of everything, and without that you cannot see the full picture.
Solution: to collect as many historical domain records as possible, we bought various databases, parsed many open resources that held this history (fortunately, there were plenty of them), and negotiated with domain name registrars. All updates to our own collections are, of course, stored with a full history of changes.

Problem: all existing solutions require you to build the graph manually. Say you have bought subscriptions from every possible data provider (commonly called "enrichers"). When you need to build a graph, you manually issue the command to expand connections from the desired element, then select the relevant elements among those that appear and expand connections from them, and so on. The responsibility for how well the graph is built rests entirely on the person.
Solution: we made graph construction automatic. That is, when you need to build a graph, connections from the first element are expanded automatically, and then from all subsequent elements too. The specialist only indicates the depth to which the graph should be built. Automatic graph expansion itself is simple, but other vendors do not implement it because it yields a huge number of irrelevant results, a drawback we also had to address (see below).

Problem: a mass of irrelevant results is the problem of all graphs over network elements. For example, a "bad domain" (one that participated in an attack) is associated with a server that has had 500 other domains associated with it over the past 10 years. When manually adding elements or automatically building a graph, all 500 of those domains would also end up on the graph, although they are unrelated to the attack. Or say you are checking an IP indicator from a vendor's security report. Such reports are typically released with a significant delay and often cover a year or more. Most likely, by the time you read the report, the server with that IP address has already been rented out to other people with different connections, and building a graph will again produce irrelevant results.
Solution: we trained the system to identify irrelevant elements using the same logic our experts applied by hand. For example, you are checking the bad domain example.com, which now resolves to IP 11.11.11.11, and a month ago resolved to IP 22.22.22.22. Besides example.com, the domain example.ru is also associated with IP 11.11.11.11, while 25 thousand other domains are associated with IP 22.22.22.22. The system, like a person, understands that 11.11.11.11 is most likely a dedicated server, and since example.ru is similar in spelling to example.com, they are probably connected and should appear on the graph; but IP 22.22.22.22 belongs to shared hosting, so none of its domains need to be placed on the graph unless other links show that one of those 25 thousand domains is also relevant (for example, example.net). Before the system decides to break a link and leave some elements off the graph, it takes into account many properties of the elements, of the clusters those elements form, and the strength of the current links. For example, if the graph contains a small cluster (50 elements) that includes the bad domain and another large cluster (5 thousand elements), and the two clusters are connected by a link (edge) of very low strength (weight), that link will be broken and the elements of the large cluster removed. But if there are many links between the small and large clusters and their combined strength grows, the link will not break and the needed elements from both clusters will remain on the graph.
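This is not Group-IB's actual algorithm, but the shared-hosting part of the logic can be sketched roughly: treat IPs hosting many domains as shared hosting and keep only neighbors whose names resemble the seed domain. The thresholds and data below are illustrative:

```python
from difflib import SequenceMatcher

SHARED_HOSTING_THRESHOLD = 100   # assumed cutoff: more domains => likely shared hosting
NAME_SIMILARITY = 0.7            # assumed cutoff for "similar in spelling"

def domain_similar(a: str, b: str) -> bool:
    """Rough lexical similarity between two domain names."""
    return SequenceMatcher(None, a, b).ratio() >= NAME_SIMILARITY

def prune_neighbors(seed: str, ip_domains: dict) -> dict:
    """Keep all domains on dedicated-looking IPs; on shared hosting,
    keep only domains whose names resemble the seed."""
    kept = {}
    for ip, domains in ip_domains.items():
        if len(domains) < SHARED_HOSTING_THRESHOLD:
            kept[ip] = domains                      # likely a dedicated server
        else:
            kept[ip] = [d for d in domains if domain_similar(seed, d)]
    return kept

graph = {
    "11.11.11.11": ["example.com", "example.ru"],                             # dedicated
    "22.22.22.22": ["example.net"] + [f"site{i}.host" for i in range(200)],   # shared
}
print(prune_neighbors("example.com", graph))
```

The real system additionally weighs cluster sizes and edge strengths before breaking links, which a per-IP heuristic like this cannot capture.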

Problem: server and domain ownership intervals are not taken into account. Bad domains expire sooner or later and are bought again for malicious or legitimate purposes. Even on bulletproof hosting, servers are leased to different hackers, so it is critical to know the interval during which a particular domain or server was under one owner's control. We often encounter a situation where a server with IP 11.11.11.11 is currently used as a C&C for a banking bot, while two months ago it was controlling ransomware. Building a connection without accounting for ownership intervals makes it look as though the banking botnet's owners and the extortionists are connected, when in fact they are not. In our work, this kind of error is critical.
Solution: we taught the system to determine ownership intervals. For domains this is relatively easy: whois usually contains registration start and expiration dates, and with a complete history of whois changes it is easy to determine the intervals. When a domain's registration has not expired but control has been transferred to other owners, that can also be tracked. SSL certificates do not have this problem: a certificate is issued once and is neither renewed nor transferred. But with self-signed certificates you cannot trust the validity dates inside the certificate, because you can generate an SSL certificate today and set its validity start date to 2010. The hardest part is determining ownership intervals for servers, because only the hosting providers have the lease dates and terms. To determine a server's period of ownership, we began using the results of port scanning and fingerprints of the services running on those ports. From this information, we can say quite accurately when a server changed owners.
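Group-IB does not disclose the exact method, but one way to approximate owner changes from scan history is to compare consecutive service-fingerprint sets: a sharp drop in overlap suggests the server changed hands. A toy sketch with made-up fingerprints:

```python
def ownership_changes(scans: list, threshold: float = 0.3) -> list:
    """Given chronologically ordered (date, fingerprint_set) scans, return the
    dates where Jaccard similarity with the previous scan drops below the
    threshold, suggesting the server changed owners."""
    changes = []
    for (_, prev), (date, cur) in zip(scans, scans[1:]):
        union = prev | cur
        jaccard = len(prev & cur) / len(union) if union else 1.0
        if jaccard < threshold:
            changes.append(date)
    return changes

# Hypothetical scan history for one IP: port/banner fingerprints each month.
history = [
    ("2019-01", {"22/ssh:OpenSSH_7.4", "80/http:nginx", "443/https:nginx"}),
    ("2019-02", {"22/ssh:OpenSSH_7.4", "80/http:nginx", "443/https:nginx"}),
    ("2019-03", {"3389/rdp:Windows", "443/https:IIS"}),   # completely new stack
]
print(ownership_changes(history))   # → ['2019-03']
```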

Problem: too few connections. Today it is not even a problem to get a free list of domains whose whois contains a certain email address, or to find all domains associated with a certain IP address. But when it comes to hackers doing their best to be hard to track, additional tricks are needed to find new properties and build new connections.
Solution: we spent a lot of time researching how to extract data that is not available by ordinary means. We cannot describe here how it works, for obvious reasons, but under certain circumstances hackers make mistakes when registering domains or renting and configuring servers that let us discover email addresses, hacker aliases, and backend addresses. The more connections you can extract, the more accurate the graphs you can build.

How our graph works

To start using the network graph, enter a domain, IP address, email, or SSL certificate fingerprint in the search box. There are three parameters the analyst can control: time, step depth, and cleaning.


Time

Time is the date or interval during which the searched element was used for malicious purposes. If this parameter is not specified, the system determines the last ownership interval for the resource. For example, on July 11, ESET published a report about how Buhtrap used a 0-day exploit for cyberespionage. There are six indicators at the end of the report. One of them, secure-telemetry[.]net, was re-registered on July 16. So if you build a graph after July 16, you will get irrelevant results. But if you indicate that this domain was used before that date, the graph picks up 126 new domains and 69 IP addresses that are not listed in the ESET report:

  • ukrfreshnews[.]com
  • unian-search[.]com
  • vesti-world[.]info
  • runewsmeta[.]com
  • foxnewsmeta[.]biz
  • sobesednik-meta[.]info
  • rian-ua[.]net
  • and more

In addition to network indicators, we immediately find links to malicious files connected to this infrastructure, as well as tags telling us that Meterpreter and AZORult were used.

Best of all, this result is ready within a second, and you no longer need to spend days analyzing the data. Such an approach sometimes dramatically reduces investigation time, which is often critical.
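The underlying idea of the time parameter can be sketched as simple interval filtering of historical records. The dates and field names here are hypothetical:

```python
from datetime import date

def records_in_interval(records: list, start: date, end: date) -> list:
    """Keep passive-DNS records whose [first_seen, last_seen] span overlaps
    the interval during which the indicator is believed to have been malicious."""
    return [r for r in records if r["first_seen"] <= end and r["last_seen"] >= start]

# Hypothetical passive-DNS history for a domain that was later re-registered.
history = [
    {"ip": "5.5.5.5", "first_seen": date(2019, 3, 1), "last_seen": date(2019, 7, 15)},
    {"ip": "9.9.9.9", "first_seen": date(2019, 7, 16), "last_seen": date(2019, 9, 1)},  # new owner
]

# The report says the domain was malicious before 2019-07-16, so only the
# first resolution is relevant to the investigation:
print(records_in_interval(history, date(2019, 3, 1), date(2019, 7, 15)))
```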


The number of steps, or the depth of recursion with which the graph is built

By default, the depth is 3. This means that all directly related elements are found from the searched element; then new links to other elements are built from each new element; and from the new elements of each step, further new elements are found.
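As a sketch, depth-limited expansion is an ordinary breadth-first traversal (the real system also scores and prunes links; this toy version assumes a precomputed link table):

```python
from collections import deque

def expand(seed: str, neighbors: dict, depth: int = 3) -> set:
    """Breadth-first expansion of related indicators, up to `depth` steps from the seed."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue                      # do not expand past the requested depth
        for nxt in neighbors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return seen

# Toy link data: indicator -> directly related indicators.
links = {
    "bad.example":   ["11.11.11.11"],
    "11.11.11.11":   ["other.example"],
    "other.example": ["22.22.22.22"],
    "22.22.22.22":   ["far.example"],     # 4 steps away: excluded at depth 3
}
print(sorted(expand("bad.example", links, depth=3)))
```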

Let's take an example unrelated to APTs and 0-day exploits. An interesting cryptocurrency fraud case was recently described on Habr. The report mentions the domain themcx[.]co, used by scammers to host the website of the supposed Miner Coin Exchange, and phone-lookup[.]xyz, used to attract traffic.

It is clear from the description that the scheme requires a fairly large infrastructure to drive traffic to the fraudulent resources. We decided to look at this infrastructure by building a graph four steps deep. The result was a graph with 230 domains and 39 IP addresses. We then split the domains into two categories: those that look like cryptocurrency services and those designed to drive traffic through phone lookup services:

Related to cryptocurrencies:

  • coinkeeper[.]cc
  • mcxwallet[.]co
  • btcnoise[.]com
  • cryptominer[.]watch

Associated with phone lookup services:

  • caller-record[.]site
  • phone-records[.]space
  • fone-uncover[.]xyz
  • number-uncover[.]info

Cleaning

By default, the "graph cleaning" option is enabled, and all irrelevant elements are removed from the graph. Incidentally, it was used in all the previous examples. I foresee a natural question: how do you make sure nothing important gets deleted? My answer: analysts who like to build graphs by hand can disable automated cleaning and set the number of steps to 1. The analyst can then expand the graph from the elements they need and remove elements that are irrelevant to the task.

On the graph itself, the analyst has access to the whois and DNS change history, as well as open ports and the services running on them.


Financial phishing

We investigated the activity of one APT group that for several years carried out phishing attacks against the customers of various banks in different regions. A characteristic feature of this group was registering domains very similar to the names of real banks; most of the phishing sites shared the same design, the only differences being the bank names and logos.

Automated graph analysis helped us a lot here. Taking one of their domains, lloydsbnk-uk[.]com, we built a 3-step graph in a few seconds, revealing more than 250 malicious domains that this group has used since 2015 and continues to use. Some of these domains have since been bought by banks, but the historical records show that they were previously registered to the attackers.

For clarity, the figure shows a graph with a depth of 2 steps.

Notably, in 2019 the attackers somewhat changed their tactics and began registering not only bank-like domains for hosting web phishing, but also domains of various consulting companies for sending phishing emails, for example swift-department.com, saudconsultancy.com, and vbgrigoryanpartners.com.


The Cobalt gang

In December 2018, Cobalt, a hacker group specializing in targeted attacks on banks, sent out a phishing mailing on behalf of the National Bank of Kazakhstan.

The emails contained links to hXXps://nationalbank.bz/Doc/Prikaz.doc. The downloaded document contained a macro that launched PowerShell, which attempted to download and execute a file from hXXp://wateroilclub.com/file/dwm.exe into %Temp%\einmrmdmy.exe. The file %Temp%\einmrmdmy.exe, aka dwm.exe, is a CobInt stager configured to interact with the server hXXp://admvmsopp.com/rilruietguadvtoefmuy.

Imagine being unable to obtain these phishing emails or fully analyze the malicious files. The graph for the malicious domain nationalbank[.]bz immediately shows its connections with other malicious domains, attributes it to the group, and shows which files were used in the attack.

Let's take the IP address 46.173.219[.]152 from this graph, build a graph on it in one step, and turn off cleaning. It is associated with 40 domains, for example:

  • bl0ckchain[.]ug
  • paypal.co.uk.qlg6[.]pw
  • cryptoelips[.]com

Judging by the domain names, they appear to be used in fraudulent schemes, but the cleaning algorithm determined that they were not related to this attack and did not place them on the graph, which greatly simplifies analysis and attribution.

If you rebuild the graph from nationalbank[.]bz with the graph cleaning algorithm disabled, it will contain more than 500 elements, most of which have nothing to do with the Cobalt group or its attacks. An example of what such a graph looks like is shown below:


Conclusion

After several years of fine-tuning, testing in real investigations, threat research, and hunting for attackers, we managed not only to create a unique tool, but also to change the attitude of the experts within the company towards it. Initially, technical experts want complete control over the graph-building process. Convincing them that automatic graph construction could do this better than a person with years of experience was extremely difficult. It was settled by time and multiple manual checks of the graph's results. Now our experts not only trust the system, but also use its results in their daily work. The technology works inside each of our systems and allows us to better detect threats of any type. The interface for manual graph analysis is built into all Group-IB products and significantly expands our capabilities for hunting cybercrime, as the feedback from our clients' analysts confirms. We, in turn, continue to enrich the graph with data and work on new algorithms using artificial intelligence to make the network graph as accurate as possible.

Source: habr.com
