What is entity resolution?
Entity resolution is the process of identifying and linking records that refer to the same real people, devices, and applications so that security teams can correlate information from various tools across any security event, regardless of log source. In line with MITRE, devices can be physical or virtual. Physical devices include servers and printers, whereas virtual devices include virtual machines (VMs) and Amazon Elastic Compute Cloud (EC2) instances. Entity resolution automates the manual effort required to maintain an up-to-date asset inventory across tools and sources that represent users and devices differently.
Why is entity resolution important for cybersecurity?
When data is incomplete, inconsistent, or inaccurately represented, businesses and agencies cannot make data-driven business decisions or improve operations. Entity resolution is a critical data integration and management practice for data engineers that brings consistency and accuracy to entity data used for analytics and reporting.
For cybersecurity, risk, and compliance teams, entity resolution can help improve analytics when monitoring security controls coverage and gaps, addressing cybersecurity and asset hygiene, identifying users responsible for patching vulnerabilities, and detecting threats and chaining alerts. Performing entity resolution before data lands in its final destination, such as a data lake, lets security teams run queries faster, for example during threat hunting. Since entity details can change over time, relying on a single identifier like an IP address or hostname is ineffective. In some cases, an analyst can spend an entire afternoon parsing logs and correlating entity information, only for that information to change the next day.
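To illustrate why a single identifier like an IP address is not enough on its own, here is a minimal Python sketch that resolves an IP address to a device only in combination with a timestamp. The lease history, device names, and the `device_for_ip` helper are all hypothetical and stand in for whatever address-assignment records an organization actually keeps.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical address-assignment history: (assigned_at, ip, device_id),
# assumed to be sorted by assigned_at.
LEASES = [
    (datetime(2024, 1, 1, 8, 0), "10.0.0.5", "laptop-alice"),
    (datetime(2024, 1, 2, 8, 0), "10.0.0.5", "laptop-bob"),
]

def device_for_ip(ip, when, leases=LEASES):
    """Return the device that most recently received `ip` at or before `when`."""
    history = [(start, dev) for start, lease_ip, dev in leases if lease_ip == ip]
    starts = [start for start, _ in history]
    # Index of the last assignment that started at or before `when`.
    idx = bisect_right(starts, when) - 1
    return history[idx][1] if idx >= 0 else None
```

The same IP address maps to a different device depending on when the log line was written, which is why the lookup takes a timestamp in addition to the address.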
Why is entity resolution challenging?
Entity resolution is challenging because data may sit in silos, datasets can have different formats and schemas, and the volume and velocity of the data can be difficult to manage. From a people perspective, manually correlating entity details is time-consuming.
When organizations perform entity resolution, they are typically analyzing network log data to create associations between temporary or ephemeral data points like:
- Usernames
- Email addresses
- Internet Protocol (IP) addresses
- Domain names
- Media access control (MAC) addresses
User and device access to networks is constantly changing. While a username or email address may be consistently associated with a user, the way users connect to and use the network makes the associated data points dynamic. For example, users may access networks:
- Remotely from home or during work travel
- Onsite at a branch location, like an organization’s satellite office building
- Using multiple devices such as a mobile device and a machine at a workstation
To collect this data, organizations use log records generated by IT and cybersecurity tools including but not limited to:
- Active Directory (AD), assigning new usernames and emails
- Domain Name System (DNS) logs, indicating transitions of IP addresses between domain names
- Firewall logs, providing insight into network activity and data transmission
- Endpoint detection and response (EDR) records, recording activities and threats across endpoints and workloads
Despite collecting these vast amounts of data, many organizations struggle to gain insights because temporary IDs may be:
- Considered new when they were not observed during previous periods of network activity
- Associated with a network entity not previously included
- Associated with a network entity not otherwise observed in log data
These ephemeral IDs create duplicative data that impacts the overall accuracy and integrity of the data and associated analytics, especially as organizations find themselves using time-consuming, manual correlation processes.
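As a rough sketch of how this manual correlation could be automated, the Python example below groups log records that share a relatively stable identifier, using a union-find structure. The record layout, the field names, and the choice of `link_fields` are assumptions made for illustration. IP addresses are deliberately left out of the linking fields because they are ephemeral and may be reused by different entities.

```python
from collections import defaultdict

def resolve_entities(records, link_fields=("username", "email", "mac")):
    """Group records that share any value in `link_fields`.

    Returns a sorted list of groups, each a list of record indexes.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent pointers to the group root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    first_seen = {}  # (field, normalized value) -> index of first record with it
    for idx, rec in enumerate(records):
        for field in link_fields:
            value = rec.get(field)
            if value is None:
                continue
            key = (field, value.lower())
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())

# Hypothetical records from four different log sources.
records = [
    {"source": "ad", "username": "asmith", "email": "A.Smith@example.com"},
    {"source": "edr", "username": "asmith", "mac": "AA:BB:CC:00:11:22"},
    {"source": "firewall", "mac": "aa:bb:cc:00:11:22", "ip": "10.0.0.5"},
    {"source": "dns", "username": "bjones", "ip": "10.0.0.5"},
]
entity_groups = resolve_entities(records)
```

Here the first three records collapse into one entity through a shared username and MAC address, while the fourth stays separate even though it shares an IP address with the third. Each resulting group could then be assigned a static ID.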
What are some typical use cases for entity resolution?
Identify network performance baseline behaviors
With a static ID, organizations can train machine-learning models to identify historical network activity that acts as a baseline of normal use. With baseline behavior based on a persistent static ID and correlated network log data, organizations can build reliable, accurate analytics to gain meaningful insights.
For example, using a time-series dataset that tracks network transmission data, organizations can create more reliable network maps by correlating:
- Local Area Network (LAN)
- Wireless networks
- Network devices, like routers, switches, or servers
By aggregating and correlating active and passive network monitoring technologies, organizations gain insights into normal activity across all devices, including Internet of Things (IoT) devices.
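The baseline idea can be sketched with a simple statistical check rather than a trained machine-learning model. The example below is an assumption-laden simplification: it keeps a history of a single metric (for instance, daily bytes transmitted) per static ID and flags an observation that falls far from the historical mean. The function name, the threshold, and the sample values are all hypothetical.

```python
from statistics import mean, stdev

def is_anomalous(history, observed, threshold=3.0):
    """Flag `observed` if it is more than `threshold` standard deviations
    from the mean of `history` (a simple z-score check)."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return observed != mu
    return abs(observed - mu) / sigma > threshold

# Hypothetical per-entity history keyed by static ID (values in MB per day).
baselines = {"device-001": [100, 110, 90, 105, 95]}

normal_day = is_anomalous(baselines["device-001"], 104)
unusual_day = is_anomalous(baselines["device-001"], 500)
```

Because the history is keyed by a static ID rather than an IP address or hostname, the baseline keeps accumulating for the same entity even when its ephemeral identifiers change.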
Suggest owners or responsible parties
As the organization collects more data about users and devices, entity resolution may be used to suggest whether entities should be merged or appear to be new entities that need monitoring. When paired with a security data fabric, resolved entities can be woven with additional context and logs like authentication logs to suggest potential process owners or responsible parties. The security and IT teams can determine whether to create a new static ID or merge the data into an existing one.
For example, with a static user ID and device ID, entity resolution should be able to use network data to correlate data points and suggest merging network entities to identify device owners for:
- Servers
- Laptops
- Tablets
- Smartphones
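One hedged way to sketch owner suggestion is to count which static user ID authenticates most often on each static device ID and propose that user as the likely owner when the share is high enough. The event format, the `suggest_owners` helper, and the 60% cutoff below are illustrative assumptions, and the output is a suggestion for security and IT teams to review rather than a decision.

```python
from collections import Counter, defaultdict

def suggest_owners(auth_events, min_share=0.6):
    """Suggest a likely owner per device from (device_id, user_id) events.

    A user is suggested only if they account for at least `min_share`
    of the authentication events seen on that device.
    """
    counts = defaultdict(Counter)
    for device_id, user_id in auth_events:
        counts[device_id][user_id] += 1

    suggestions = {}
    for device_id, users in counts.items():
        user_id, hits = users.most_common(1)[0]
        share = hits / sum(users.values())
        if share >= min_share:
            suggestions[device_id] = (user_id, round(share, 2))
    return suggestions

# Hypothetical authentication events already mapped to static IDs.
events = (
    [("dev-1", "u-1")] * 4
    + [("dev-1", "u-2")]
    + [("dev-2", "u-1"), ("dev-2", "u-2")]
)
owners = suggest_owners(events)
```

In this sample, `dev-1` gets a suggested owner because one user dominates its logins, while `dev-2` is left without a suggestion because no user reaches the cutoff.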
Build accurate, reliable analytics
With a static entity ID, organizations can build analytics models to identify anomalous user, device, or network behavior that suggests potentially malicious activity, including but not limited to:
- Insider threats: Users engaging in unusual activity on a network
- Malware infection: Malicious activity associated with a network and device
- Advanced persistent threat: Associating new device activity with the static ID in the data repository to create a risk attribute indicating malicious actors within the network
Trace security incident root cause
The static user and device IDs give security analysts more consistent, reliable ways to define people and devices when engaging in security incident investigations. Organizations can store their security telemetry in their chosen data repository while still retaining the raw data. Once security analysts identify anomalous and potentially malicious activity, they can use these static IDs to correlate devices and users, identifying the root cause faster.
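As a small illustration of that pivot, the sketch below pulls every event tied to a given static user ID or device ID out of a mixed set of logs and orders the result chronologically. The event fields (`ts`, `user_id`, `device_id`, `action`) and the `incident_timeline` helper are hypothetical, and timestamps are assumed to be ISO 8601 strings in one consistent format so that sorting them as text matches chronological order.

```python
def incident_timeline(events, user_id=None, device_id=None):
    """Return events matching the given static user or device ID, oldest first."""
    matches = [
        event for event in events
        if (user_id is not None and event.get("user_id") == user_id)
        or (device_id is not None and event.get("device_id") == device_id)
    ]
    # ISO 8601 strings in the same format sort chronologically as text.
    return sorted(matches, key=lambda event: event["ts"])

# Hypothetical events from different sources, already tagged with static IDs.
events = [
    {"ts": "2024-03-01T10:15:00Z", "user_id": "u-1", "device_id": "dev-1", "action": "file_download"},
    {"ts": "2024-03-01T09:00:00Z", "user_id": "u-1", "device_id": "dev-2", "action": "login"},
    {"ts": "2024-03-01T09:30:00Z", "user_id": "u-2", "device_id": "dev-1", "action": "process_start"},
    {"ts": "2024-03-01T11:00:00Z", "user_id": "u-3", "device_id": "dev-3", "action": "login"},
]
timeline = incident_timeline(events, user_id="u-1", device_id="dev-1")
actions = [event["action"] for event in timeline]
```

Querying by either ID returns activity from both the user and the device in one ordered view, which is the kind of correlation that would otherwise require matching ephemeral identifiers by hand.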