Submitted URL: https://t.co/hOt5mZyEqi
Effective URL: https://cloud.google.com/blog/topics/threat-intelligence/gemini-for-malware-analysis?utm_source=twitter&utm_medium=unpaid...
Submission: On May 22 via manual from KR — Scanned from DE

Threat Intelligence
FROM ASSISTANT TO ANALYST: THE POWER OF GEMINI 1.5 PRO FOR MALWARE ANALYSIS

April 29, 2024

BERNARDO QUINTERO




EXECUTIVE SUMMARY

 * A growing amount of malware has naturally increased workloads for defenders
   and particularly malware analysts, creating a need for improved automation
   and approaches to dealing with this classic threat.

 * With the recent rise in generative AI tools, we decided to put our own Gemini
   1.5 Pro to the test to see how it performed at analyzing malware. By
   providing code and using a simple prompt, we asked Gemini 1.5 Pro to
   determine if the file was malicious, and also to provide a list of activities
   and indicators of compromise.

 * We did this for multiple malware files, testing with both decompiled and
   disassembled code, and Gemini 1.5 Pro was notably accurate each time,
   generating summary reports in human-readable language. Gemini 1.5 Pro was
   even able to make an accurate determination of code that — at the time — was
   receiving zero detections on VirusTotal. 

 * In our testing with other similar gen AI tools, we were required to divide
   the code into chunks, which led to vague and non-specific outcomes, and
   affected the overall analysis. Gemini 1.5 Pro, however, processed the entire
   code in a single pass, and often in about 30 to 40 seconds.


INTRODUCTION

The explosive growth of malware continues to challenge traditional, manual
analysis methods, underscoring the urgent need for improved automation and
innovative approaches. Generative AI models have become invaluable in some
aspects of malware analysis, yet their effectiveness in handling large and
complex malware samples has been limited. The introduction of Gemini 1.5 Pro,
capable of processing up to 1 million tokens, marks a significant breakthrough.
This advancement not only empowers AI to function as a powerful assistant in
automating the malware analysis workflow but also significantly scales up the
automation of code analysis. By substantially increasing the processing
capacity, Gemini 1.5 Pro paves the way for a more adaptive and robust approach
to cybersecurity, helping analysts manage the asymmetric volume of threats more
effectively and efficiently.


TRADITIONAL TECHNIQUES FOR AUTOMATED MALWARE ANALYSIS

The foundation of automated malware analysis is built on a combination of static
and dynamic analysis techniques, both of which play crucial roles in dissecting
and understanding malware behavior. Static analysis involves examining the
malware without executing it, providing insights into its code structure and
unobfuscated logic. Dynamic analysis, on the other hand, involves observing the
execution of the malware in a controlled environment to monitor its behavior,
regardless of obfuscation. Together, these techniques are leveraged to gain a
comprehensive understanding of malware.
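
As a small illustration of the static side, the sketch below pulls printable
strings out of raw bytes without ever executing them. String extraction is
just one common static technique chosen here for illustration; it is not
described in this post, and the function name and sample bytes are
hypothetical.

```python
import re

def extract_strings(data: bytes, min_len: int = 4) -> list:
    """Return runs of printable ASCII characters at least min_len long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [match.decode("ascii") for match in re.findall(pattern, data)]

# Hypothetical bytes standing in for a binary. A real file would be read with
# open(path, "rb").read() and inspected, never run.
sample = (
    b"\x00\x01MZ\x90\x00tasksche.exe\x00\xff\xfeabc\x00"
    b"http://example.invalid/x\x00"
)

for text in extract_strings(sample):
    print(text)
```

Short runs such as "MZ" and "abc" fall below the minimum length and are
dropped, which keeps noise down at the cost of missing very short indicators.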

Parallel to these techniques, AI and machine learning (ML) have increasingly
been employed to classify and cluster malware based on behavioral patterns,
signatures, and anomalies. These methodologies have ranged from supervised
learning, where models are trained on labeled datasets, to unsupervised learning
for clustering, which identifies patterns without predefined labels to group
similar malware.

Despite technological advancements, the increasing complexity and volume of
malware present substantial challenges. While ML enhances the detection of
malware variants, it remains inadequate against completely new threats. This
detection gap allows advanced attacks to slip through cybersecurity defenses,
compromising system protection.


GENERATIVE AI AS MALWARE ANALYSIS ASSISTANT 

Code Insight, unveiled at the RSA Conference 2023, marked a significant step
forward in leveraging generative AI (gen AI) for malware analysis. This novel
feature of Google's VirusTotal platform specializes in analyzing code snippets
and generating reports in natural language, effectively emulating the approach
of a malware analyst. Initially supporting PowerShell scripts, Code Insight
later expanded to other scripting languages and file formats, including Batch,
Shell, VBScript, and Office documents.

By processing the code and generating summary reports, Code Insight assists
analysts in understanding the behavior of the code and identifying attack
techniques. This includes uncovering hidden functionalities, malicious intent,
and potential attack vectors that might be missed by traditional detection
methods.

However, due to the inherent constraints of large language models (LLMs) and
their limited token input capacity, the size of files that Code Insight could
handle was restricted. Although there have been continuous improvements to
increase the maximum file size limit and support more formats, analyzing
binaries and executables still poses a significant challenge. When these files
are disassembled or decompiled, their code size typically surpasses the
processing capabilities of the LLMs available at the time. Consequently, gen AI
models have functioned primarily as assistants to human analysts, enabling the
analysis of specific code fragments from binaries rather than processing the
entire code, which is often too voluminous for these models.


REVERSE ENGINEERING: THE HUMAN FACE OF MALWARE ANALYSIS

Reverse engineering is arguably the most advanced malware analysis technique
available to cybersecurity professionals. This process involves disassembling
the binaries of malicious software and carrying out a meticulous examination of
the code. Through reverse engineering, analysts can uncover the exact
functionality of malware and understand its execution flow. However, this method
is not without its challenges. It requires an immense amount of time, a deep
level of expertise, and an analytical mindset to interpret each instruction,
data structure, and function call to reconstruct the malware's logic and uncover
its secrets.

Furthermore, scaling reverse engineering efforts poses a significant challenge.
The scarcity of specialized talent in this field exacerbates the difficulty of
conducting these analyses at scale. Given the intricate and time-consuming
nature of reverse engineering, the cybersecurity community has long sought ways
to augment this process, making it more efficient and accessible.


GEMINI 1.5 PRO: SCALABLE REVERSE ENGINEERING FOR MALWARE ANALYSIS

The ability to process prompts of up to 1 million tokens enables a qualitative
leap in malware analysis, particularly in the realm of reverse engineering. This
advancement finally brings the power of gen AI to the analysis of binaries and
executables, a task previously reserved for highly skilled human analysts due to
its complexity.

How does Gemini 1.5 Pro achieve this?

 * Increased capacity: With its expanded token limit, Gemini 1.5 Pro can analyze
   some disassembled or decompiled executables in their entirety in a single
   pass, eliminating the need to break down code into smaller fragments. This is
   crucial because fragmenting code can lead to a loss of context and important
   correlations between different parts of the program. When analyzing only
   small snippets, it is difficult to understand the overall functionality and
   behavior of the malware, potentially missing key insights into its purpose
   and operation. By analyzing the entire code at once, Gemini 1.5 Pro gains a
   holistic understanding of the malware, allowing for more accurate and
   comprehensive analysis.

 * Code interpretation: Gemini 1.5 Pro can interpret the intent and purpose of
   the code, not just identify patterns or similarities. This is possible due to
   its training on a massive dataset of code, encompassing assembly language
   from various architectures, high-level languages like C, and pseudo-code
   produced by decompilers. This extensive knowledge base, combined with its
   understanding of operating systems, networking, and cybersecurity principles,
   allows Gemini 1.5 Pro to effectively emulate the reasoning and judgment of a
   malware analyst. As a result, it can predict the malware's actions and
   provide valuable insights even for never-seen-before threats. For more
   information on this, see the zero day case study section later in this post.

 * Detailed analysis: Gemini 1.5 Pro can generate summary reports in
   human-readable language, making the analysis process more accessible and
   efficient. This goes far beyond the simple verdicts typically provided by
   traditional machine learning algorithms for classification and clustering.
   Gemini 1.5 Pro's reports can include detailed information about the malware's
   functionality, behavior, and potential attack vectors, as well as indicators
   of compromise (IOCs) that can be used to feed other security systems and
   improve threat detection and prevention capabilities.

Let's explore a practical case study to examine how Gemini 1.5 Pro performs in
analyzing decompiled code with a representative malware sample. We processed two
WannaCry binaries automatically using the Hex-Rays decompiler, without adding
any annotations or additional context. This approach resulted in two C code
files, one 268 KB and the other 231 KB in size, which together amount to more
than 280,000 tokens for processing by the LLM.
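
As a rough sanity check on these figures, the sketch below derives an average
bytes-per-token ratio from the two file sizes and uses it to estimate whether
a file would fit in a 1 million token prompt. It assumes 1 KB means 1,024
bytes and that the ratio carries over to other decompiled code, which may not
hold. The helper names are hypothetical, and a real pipeline would count
tokens with the model's own tokenizer instead of estimating.

```python
CONTEXT_LIMIT_TOKENS = 1_000_000  # prompt capacity cited in the post

def bytes_per_token(total_bytes: int, total_tokens: int) -> float:
    """Average number of source bytes consumed by one token."""
    return total_bytes / total_tokens

def estimate_tokens(size_bytes: int, ratio: float) -> int:
    """Estimate a token count from a file size and a bytes-per-token ratio."""
    return round(size_bytes / ratio)

def fits_in_context(size_bytes: int, ratio: float,
                    limit: int = CONTEXT_LIMIT_TOKENS) -> bool:
    """True if the estimated token count stays within the prompt limit."""
    return estimate_tokens(size_bytes, ratio) <= limit

# Two decompiled WannaCry files: 268 KB and 231 KB, more than 280,000 tokens.
# Because the post says "more than", this ratio is an upper bound.
wannacry_bytes = (268 + 231) * 1024
ratio = bytes_per_token(wannacry_bytes, 280_000)

print(f"about {ratio:.2f} bytes per token")
print("fits in one pass:", fits_in_context(wannacry_bytes, ratio))
```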



In our testing with other similar gen AI tools, we had to divide the code into
chunks. This fragmentation often compromised the
comprehensiveness of the analysis, resulting in vague and non-specific outcomes.
These limitations highlight the challenges of using such tools with complex code
bases.
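
To make the loss of context concrete, here is a minimal sketch of naive
fixed-size chunking. The decompiled snippet, the function name sub_401000,
and the chunk size are all made up for illustration; the point is only that a
function's definition and its call site can land in different chunks, so a
model that sees one chunk at a time never sees both together.

```python
def chunk_text(text: str, chunk_size: int) -> list:
    """Split text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def chunks_mentioning(chunks: list, name: str) -> list:
    """Return the indexes of the chunks that contain the given identifier."""
    return [index for index, chunk in enumerate(chunks) if name in chunk]

# Hypothetical decompiler-style output: a definition, filler, then a call.
code = (
    "int sub_401000(char *host) { /* connects to port 445 */ }\n"
    + "// filler\n" * 200
    + "int main() { return sub_401000(target); }\n"
)

chunks = chunk_text(code, 500)
hits = chunks_mentioning(chunks, "sub_401000")
print(f"{len(chunks)} chunks; sub_401000 appears in chunks {hits}")
```

Smarter splitting on function boundaries avoids cutting identifiers in half,
but it still separates callers from callees, which is the limitation a single
pass over the whole file avoids.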

Gemini 1.5 Pro, however, marks a significant departure from these constraints.
It processes the entire decompiled code in a single pass, taking just 34 seconds
to deliver its analysis. The initial summary provided by Gemini 1.5 Pro is
notably accurate, showcasing its ability to handle large and complex datasets
seamlessly and effectively:

 * Issues a malicious verdict associated with ransomware

 * Identifies some files as IOCs (c.wnry and tasksche.exe)

 * Acknowledges the use of an algorithm to generate IP addresses and perform
   network scans to find targets on port 445/SMB to spread to other computers

 * Identifies URL/domain (WannaCry's "killswitch") and relevant registry key and
   mutex
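
Findings like the ones listed above are easier to feed into other security
systems once they are held in a structure. The sketch below is one
hypothetical way to do that: the AnalysisReport class and parse_verdict
helper are illustrative names, and the report text is a stand-in, since the
model's actual output format is not reproduced here. The only assumption
taken from the post is that the report states a verdict of Benign or
Malicious.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class AnalysisReport:
    """Hypothetical container for a verdict plus grouped IOCs."""
    verdict: str
    iocs: Dict[str, List[str]] = field(default_factory=dict)

def parse_verdict(report_text: str) -> str:
    """Return the first Benign/Malicious verdict found, else 'Unknown'."""
    match = re.search(r"\b(benign|malicious)\b", report_text, re.IGNORECASE)
    return match.group(1).capitalize() if match else "Unknown"

# Stand-in report line; the IOC values are the ones named in the post.
report = AnalysisReport(
    verdict=parse_verdict("Verdict: Malicious (ransomware)"),
    iocs={
        "files": ["c.wnry", "tasksche.exe"],
        "network": ["SMB scanning on port 445"],
    },
)

print(report.verdict, report.iocs["files"])
```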



While it might seem that Gemini 1.5 Pro's report of WannaCry is based on
pre-trained knowledge of this specific malware, this isn't the case. The
analysis comes from the model's ability to independently interpret the code.
This will become even clearer as we look at the upcoming examples where Gemini
1.5 Pro analyzes unfamiliar malware samples, demonstrating its wide-ranging
capabilities.


LLM ON CODE: DISASSEMBLED VS. DECOMPILED

In the previous example showcasing WannaCry analysis, there was a crucial step
before feeding the code to the LLM: decompilation. This process, which
transforms binary code into a higher-level representation like C, is fully
automated and mirrors the initial steps taken by malware analysts when manually
dissecting malicious software. But what is the difference between disassembled
and decompiled code, and how does it impact LLM analysis?

 * Disassembly: This process converts binary code into assembly language, a
   low-level representation specific to the processor architecture. While
   human-readable, assembly code is still quite complex and requires significant
   expertise to understand. It is also much longer and more repetitive than the
   original source code.

 * Decompilation: This process attempts to reconstruct the original source code
   from the binary. While not always perfect, decompilation can significantly
   improve readability and conciseness compared to disassembled code. It
   achieves this by identifying high-level constructs like functions, loops, and
   variables, making the code easier to understand for analysts.

Given these factors, when using LLMs for binary analysis, decompilation offers
several advantages in efficiency and scalability. The shorter and more
structured output from decompilation fits more readily within the processing
constraints of LLMs, allowing for a more efficient analysis of large or complex
binaries. In fact, the output from a decompiler is five to 10 times more concise
than that produced by a disassembler.

Disassembly is necessary to perform accurate decompilation and remains an
invaluable tool in certain scenarios where detailed, low-level analysis is
crucial. Given the structured and higher-level nature of decompiled output,
there are specific circumstances where disassembly provides insights that
decompilation cannot match.

Fortunately, Gemini 1.5 Pro demonstrates equal capability in processing both
high-level languages and assembly across various architectures. Thus, our
implementation for automating binary analysis can utilize both strategies or
adopt a hybrid approach, as suited to the specific circumstances of each case.
This flexibility allows us to tailor our analysis method to the nature of the
binary in question, optimizing for efficiency, depth of insight, and the
specific objectives of the analysis, whether that means dissecting the logic and
flow of the program or diving into the intricate details of its low-level
operations.
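
One way to picture the choice between strategies is as a token-budget
decision. The sketch below is a simplified, hypothetical policy and not the
implementation described in this post: it assumes disassembly is some
multiple of the decompiled size (the post gives five to 10 times), ignores
the tokens used by the prompt text and the response, and falls back to
decompiled code alone when both representations do not fit together.

```python
from enum import Enum

class Strategy(Enum):
    HYBRID = "decompiled and disassembled together"
    DECOMPILED_ONLY = "decompiled only"
    TOO_LARGE = "does not fit in a single pass"

def choose_strategy(decompiled_tokens: int, expansion_factor: float,
                    budget: int = 1_000_000) -> Strategy:
    """Pick a representation based on an assumed disassembly expansion."""
    disassembled_tokens = round(decompiled_tokens * expansion_factor)
    if decompiled_tokens + disassembled_tokens <= budget:
        return Strategy.HYBRID
    if decompiled_tokens <= budget:
        return Strategy.DECOMPILED_ONLY
    return Strategy.TOO_LARGE

# Hypothetical sizes at both ends of the five-to-10-times range.
for tokens, factor in [(100_000, 5), (100_000, 10), (1_200_000, 5)]:
    print(tokens, factor, choose_strategy(tokens, factor).value)
```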

Next, we'll examine a case where we directly employ disassembly for analysis.
This time, we're working with a more recent and unknown binary; in fact, the
executable submitted to VirusTotal is flagged as malicious by only four out of
the 70 VirusTotal anti-malware engines, and only in a generic sense, without
providing any details about the malware family that could offer further clues
about its behavior.




After automatic preprocessing with Hex-Rays/IDA Pro, the 306.50 KB executable
binary produces a 1.5 MB assembly file that Gemini 1.5 Pro can process in a
single pass within 46 seconds, thanks to its large token window in the prompt.
This capability allows for an analysis of the entire assembly output, offering
detailed insights into the binary's operations.



This case of the unknown binary showcases the remarkable capabilities of Gemini
1.5 Pro. Despite only four out of 70 anti-malware engines on VirusTotal flagging
the file as malicious—using only generic signatures—Gemini 1.5 Pro identified
the file as malicious, providing a detailed explanation for its verdict. The
file is likely a game cheat designed to inject a game hack dynamic-link library
(DLL) into the Grand Theft Auto video game process. The designation of
"malicious" may depend on perspective: deemed malicious by the game's developers
or their security team focused on anti-cheating measures, yet potentially
desirable for some players. Nevertheless, this automated first-pass analysis is
not only impressive but also illuminating regarding the nature and intent of the
binary.


UNVEILING THE UNKNOWN: A CASE STUDY IN ZERO-DAY DETECTION

The true test of any malware analysis tool lies in its ability to identify
never-before-seen threats that go undetected by traditional methods and to
proactively protect systems from zero-day attacks. Here, we examine a case
where an executable file is undetected by any anti-virus or sandbox on
VirusTotal.



The 833 KB file, medui.exe, was decompiled into 189,080 tokens and subsequently
processed by Gemini 1.5 Pro in a mere 27 seconds to produce a complete malware
analysis report in a single pass.




This analysis revealed suspicious functionalities, leading Gemini 1.5 Pro to
issue a malicious verdict. Based on its observations, it concluded that the
primary goal of this malware is to steal cryptocurrency by hijacking Bitcoin
transactions and evading detection through the disabling of security software.

This showcases Gemini's ability to go beyond simple pattern matching or ML
classification and leverage its deep understanding of code behavior to identify
malicious intent, even in previously unseen threats. This is a significant
advancement in the field of malware analysis, as it allows us to proactively
detect and respond to new and emerging threats that traditional methods might
miss.


FROM ASSISTANT TO ANALYST

Gemini 1.5 Pro unlocks impressive capabilities, enabling the analysis of large
volumes of decompiled and disassembled code. It has the potential to
significantly change our approach to fighting malware by enhancing efficiency,
accuracy, and our ability to scale in response to a growing number of threats.

However, it's important to remember that this is just the beginning. While
Gemini 1.5 Pro represents a significant leap forward, the field of gen AI is
still in its infancy. There are several challenges that need to be addressed to
achieve truly robust and reliable automated malware analysis:

 * Obfuscation and packing: Malware authors are constantly developing new
   techniques to obfuscate their code and evade detection. In response, there's
   a growing need to not only continuously improve gen AI models but also to
   enhance the preprocessing of binaries before analysis. Adopting dynamic
   approaches that utilize various preprocessing tools can more effectively
   unpack and deobfuscate malware. This preparatory step is crucial for enabling
   gen AI models to accurately analyze the underlying code, ensuring they keep
   pace with evolving obfuscation techniques and remain effective in detecting
   and understanding sophisticated malware threats.

 * Increasing binary size: The complexity of modern software is mirrored in the
   growing size of its binaries. This trend presents a significant challenge, as
   the majority of gen AI models are constrained by much lower token window
   limits. In contrast, Gemini 1.5 Pro stands out by supporting up to 1 million
   tokens—currently the highest known capacity in the field. Nevertheless, even
   with this remarkable capability, Gemini 1.5 Pro may encounter limitations
   when handling exceptionally large binaries. This underscores the ongoing need
   for advancements in AI technology to accommodate the analysis of increasingly
   large files, ensuring comprehensive and effective malware analysis as
   software complexity continues to escalate.

 * Evolving attack techniques: As attackers continuously innovate, crafting new
   methods to bypass security measures, the challenge for gen AI models extends
   beyond simple adaptability. These models must not only learn and recognize
   new threats but also evolve in conjunction with the efforts of researchers
   and developers. There's a need to devise new methods for automating the
   preprocessing of threat data, which would enrich the context provided to AI
   models. For instance, integrating additional data from static and dynamic
   analysis tools, such as sandbox reports, plus the decompiled and disassembled
   code, can significantly enhance the models' understanding and detection
   capabilities. 
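
The idea of enriching the model's context can be sketched as a small assembly
step that combines several analysis outputs into one labeled input. Everything
here is an assumption made for illustration: the section labels, the
character budget standing in for a token limit, and the rule that sections
are added in priority order and skipped whole when they do not fit.

```python
from typing import List, Tuple

def assemble_context(sections: List[Tuple[str, str]],
                     max_chars: int) -> Tuple[str, List[str]]:
    """Join labeled sections in priority order without exceeding max_chars.

    Returns the assembled text and the labels of any skipped sections.
    """
    parts, skipped, used = [], [], 0
    for label, body in sections:
        block = f"=== {label} ===\n{body}\n"
        if used + len(block) <= max_chars:
            parts.append(block)
            used += len(block)
        else:
            skipped.append(label)
    return "".join(parts), skipped

# Hypothetical inputs, highest priority first.
sections = [
    ("DECOMPILED CODE", "int main() {}"),
    ("SANDBOX REPORT", "creates mutex"),
    ("DISASSEMBLY", "nop\n" * 250),
]

context, skipped = assemble_context(sections, max_chars=200)
print(context)
print("skipped:", skipped)
```

Skipping a section whole, instead of truncating it, avoids handing the model a
fragment that ends mid-function.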

The journey towards scaling automated malware analysis is ongoing, but Gemini
1.5 Pro marks a significant milestone. Give Gemini 1.5 Pro a try; we look
forward to seeing the innovative ways the community leverages it to enhance
security operations.

At GSEC Malaga, we continue to research and develop ways to apply these models
effectively in AI, pushing the boundaries of what's possible in cybersecurity
and contributing to a safer digital future. 


MALWARE DETAILS

The following table contains details on the malware samples discussed in this
post.



| Filename | SHA-256 Hash | Size | First Seen | File Type |
| --- | --- | --- | --- | --- |
| lhdfrgui.exe (WannaCry dropper) | 24d004a104d4d54034dbcffc2a4b19a11f39008a575aa614ea04703480b1022c | 3.55 MB (3723264 bytes) | 2017-05-12 | Win32 EXE |
| tasksche.exe (WannaCry cryptor) | ed01ebfbc9eb5bbea545af4d01bf5f1071661840480439c6e5babe8e080e41aa | 3.35 MB (3514368 bytes) | 2017-05-12 | Win32 EXE |
| EXEC.exe | 1917ec456c371778a32bdd74e113b07f33208740327c3cfef268898cbe4efbfe | 306.50 KB (313856 bytes) | 2022-04-18 | Win32 EXE |
| medui.exe | 719b44d93ab39b4fe6113825349addfe5bd411b4d25081916561f9c403599e50 | 833.50 KB (853504 bytes) | 2024-03-27 | Win32 EXE |
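
Anyone reproducing the analysis would first want to confirm that a sample
matches one of the hashes listed above. The sketch below computes a SHA-256
digest in blocks using Python's standard hashlib module; the function names
are hypothetical, and the expected value is the medui.exe hash from this post
written as a single 64-character string.

```python
import hashlib

def sha256_of_file(path: str, block_size: int = 1 << 20) -> str:
    """Hash a file in blocks so large samples are not loaded all at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

def matches_known_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with an expected lowercase hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()

# medui.exe hash as listed in this post.
MEDUI_SHA256 = "719b44d93ab39b4fe6113825349addfe5bd411b4d25081916561f9c403599e50"

# A well-formed SHA-256 hex digest has exactly 64 hexadecimal characters.
print(len(MEDUI_SHA256) == 64 and all(c in "0123456789abcdef" for c in MEDUI_SHA256))
```

Hashing reads the file as data and does not execute it, but samples should
still be handled only in an isolated analysis environment.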


PROMPT

The following is the exact prompt used in all the examples covered in the post.
The only exception is the example where the word "disassembled" is used instead
of "decompiled" because, as explained, we're working with disassembled code
rather than decompiled code to show that Gemini 1.5 Pro can interpret both.



Act as a malware analyst by thoroughly examining this decompiled executable
code. Methodically break down each step, focusing keenly on understanding the
underlying logic and objective. Your task is to craft a detailed summary that
encapsulates the code's behavior, pinpointing any malicious functionality. Start
with a verdict (Benign or Malicious), then a list of activities including a list
of IOCs if any URLs, created files, registry entries, mutex, network activity,
etc.

+[attached decompiled.c.txt sample file]
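
The one-word swap between "decompiled" and "disassembled" described above can
be expressed as a small template. This is a minimal sketch with hypothetical
names (PROMPT_TEMPLATE, build_prompt); it only builds the prompt text, and
attaching the code file and sending the request to the model are left out
because the post does not show how that was done.

```python
PROMPT_TEMPLATE = (
    "Act as a malware analyst by thoroughly examining this {kind} executable "
    "code. Methodically break down each step, focusing keenly on "
    "understanding the underlying logic and objective. Your task is to craft "
    "a detailed summary that encapsulates the code's behavior, pinpointing "
    "any malicious functionality. Start with a verdict (Benign or Malicious), "
    "then a list of activities including a list of IOCs if any URLs, created "
    "files, registry entries, mutex, network activity, etc."
)

ALLOWED_KINDS = ("decompiled", "disassembled")

def build_prompt(kind: str = "decompiled") -> str:
    """Return the post's prompt with the code representation filled in."""
    if kind not in ALLOWED_KINDS:
        raise ValueError(f"kind must be one of {ALLOWED_KINDS}, got {kind!r}")
    return PROMPT_TEMPLATE.format(kind=kind)

print(build_prompt("disassembled")[:80])
```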
