August 29, 2024

You've reduced data, so what's next?


Organizations often adopt data tiering to reduce the amount of data they send to analytics tools like Security Information and Event Management (SIEM) solutions. By diverting data to an object store or a data lake, they can lower costs by minimizing how much data their SIEM stores. But while this tactical objective is met, the process creates data silos: people can query each silo in isolation, yet they often fail to glean collective insights across them.

Think of the problem like a large building with cameras across its perimeter. The organization can monitor each camera's viewpoint, but no individual camera has the full picture, as any spy movie will tell you. Similarly, different tools see different parts of your security picture. Although SIEMs were originally intended to tie all security data together into a composite, cloud applications and other modern IT and cybersecurity tool stacks generate too much data for that to be cost-effective.

As organizations balance saving money with having an incomplete picture, a high-quality data fabric architecture can enable them to build more sustainable security data strategies.

From default to normalized

When you implement a data lake, the diverted data remains in its default format. When you try to paint a composite picture across these tools, you are limited to what each individual data set understands or sees, leaving you to pick individual answers out of siloed datasets.


Instead of asking a question once, you need to ask fragments of the question across different data sets. In some cases, you may have a difficult time ensuring that you have the complete answer.  
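As a toy illustration, consider a single question: "Which users were behind denied connections?" Against siloed stores, it becomes one query per silo plus manual stitching. The repositories, schemas, and field names below are invented for the example:

```python
# Invented data and schemas; each list stands in for a separate silo.
firewall_silo = [
    {"src_ip": "198.51.100.7", "action": "deny",  "ts": "2024-08-29T10:02:11Z"},
    {"src_ip": "203.0.113.9",  "action": "allow", "ts": "2024-08-29T10:03:40Z"},
]
identity_silo = [
    {"ip": "198.51.100.7", "user": "jdoe", "ts": "2024-08-29T10:02:09Z"},
]

# Fragment 1: which source IPs were denied?
denied_ips = {e["src_ip"] for e in firewall_silo if e["action"] == "deny"}

# Fragment 2, against a different store with a different schema:
# who was using those IPs?
users = {e["user"] for e in identity_silo if e["ip"] in denied_ips}

print(users)  # {'jdoe'} -- the analyst stitches the answer together by hand
```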

With a security data fabric, you can normalize the data before landing it in one or more repositories. DataBee® from Comcast Technology Solutions uses extract, transform, and load processes to automatically parse security data, then normalizes it according to our extended Open Cybersecurity Schema Framework (OCSF) so that you can correlate and understand what’s happening in the aggregate picture.
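To make the idea concrete, here is a minimal sketch of that parse-and-normalize step. The raw log line and parsing helper are invented, and the output fields loosely follow OCSF's Authentication class (class_uid 3002) in simplified form; this illustrates the pattern, not DataBee's actual pipeline:

```python
import json
from datetime import datetime, timezone

# An invented raw vendor log line for illustration.
RAW = "2024-08-29 10:02:11 sshd[812]: Failed password for jdoe from 198.51.100.7"

def to_ocsf(line: str) -> dict:
    """Parse a raw auth log line into a simplified OCSF-style event."""
    date, time_, rest = line.split(" ", 2)
    user = rest.split(" for ")[1].split(" from ")[0]
    src_ip = rest.rsplit(" from ", 1)[1]
    ts = datetime.fromisoformat(f"{date}T{time_}").replace(tzinfo=timezone.utc)
    return {
        "class_uid": 3002,               # OCSF Authentication class
        "activity_id": 1,                # Logon
        "status": "Failure",
        "time": int(ts.timestamp() * 1000),
        "user": {"name": user},
        "src_endpoint": {"ip": src_ip},
    }

print(json.dumps(to_ocsf(RAW), indent=2))
```

Once every source lands in a shared schema like this, one query can span what used to be several silos.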

By normalizing the data on its way to your data lake, you optimize compute and storage costs, eliminating some of the constraints that arise from other data federation approaches.

Considering your constraints

Federation reduces storage costs, but it introduces limitations that can present challenges for security teams.  

Latency 

When you move data from one location to another, you introduce time lags. Some providers restrict when, or how many times per day, you can transfer data. For example, if you want data delivered in a specific format, some repositories may only run that transfer once per day.

Meanwhile, if you want to stream the data into a different format for collection, the reformatting also adds lag. A transformation and storage process may take several minutes, which can impact key cybersecurity metrics like mean time to detect (MTTD) and mean time to respond (MTTR).

When you query security datasets to learn what happened over the last hour, a (near) real-time data source contributes to an accurate picture, while a delayed source may not yet have received data for the same period. As you correlate the data into a timeline, you may need multiple sources that all have different lag times: one mostly real-time, another sending data five minutes later. If you ask the question at the moment an event occurs, the system may not have information about it for another five minutes, creating a visibility gap.
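A back-of-the-envelope sketch of that gap: a correlated view is only complete up to the lag of the slowest source. The sources and lag values below are illustrative:

```python
from datetime import datetime, timedelta, timezone

# Illustrative per-source delivery lags (invented values).
source_lag = {
    "edr": timedelta(seconds=5),           # near real-time
    "dns": timedelta(minutes=5),           # batched every five minutes
    "cloud_audit": timedelta(minutes=15),  # exported on a schedule
}

now = datetime.now(timezone.utc)
worst_lag = max(source_lag.values())

# Any cross-source timeline is only trustworthy up to the slowest feed.
complete_as_of = now - worst_lag
print(f"Correlated view complete up to: {complete_as_of.isoformat()}")
print(f"Visibility gap when querying now: {worst_lag}")
```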

Such gaps can create blind spots as you scale your security analytics strategy. The enterprise security team may be asking hundreds of questions across the data system, and the time delay can create a large gap between what you can see and what happened.  

Correlation

Correlating activities from across your disparate IT and security tools is critical. Data gives you facts about an event, while correlation enables you to interpret what those facts mean. When you ask fragments of a question across data silos, you have no way to automate the generation of these insights.

For example, a security alert might list hundreds of failed login attempts over three minutes. While you have these facts, you still need to interpret whether they describe malicious actors using stolen credentials or a brute-force attack.
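A toy heuristic hints at how that interpretation might start; the thresholds, field names, and data are invented and far simpler than real detection logic:

```python
# Hundreds of failed logins in a three-minute window (synthetic data).
events = [
    {"user": f"user{i % 40}", "src_ip": "203.0.113.9", "result": "fail"}
    for i in range(300)
]

# Many distinct accounts hit from one source suggests brute force or
# password spraying; a handful suggests stolen-credential use.
targeted_accounts = {e["user"] for e in events}
if len(targeted_accounts) > 20:
    print("Many accounts, one source: looks like brute force / spraying")
else:
    print("Few accounts targeted: possible stolen-credential use")
```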

To improve detections and enable faster response times, you need to weave together information like:  

  • The IP address(es) involved over the time the event occurred 

  • The user associated with the device(s) 

  • The user’s geographic location 

  • The network access permissions for the user and device(s) 

You may be storing this data in different repositories without correlation capabilities. For example, you may have converged all DNS, DHCP, firewall, EDR, and proxy data in one repository while combining user access and application data in another. To get a complete picture of the event, you need to run at least two single-silo queries, and likely more, as the sketch below illustrates.
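Here is what that stitching looks like in application code. The repository contents, schemas, and join keys are invented for illustration:

```python
# Silo A: network data (DNS/DHCP/firewall/EDR/proxy), keyed by IP.
network_repo = {
    "198.51.100.7": {"device": "LAPTOP-42", "fw_action": "allow"},
}
# Silo B: user access and application data, keyed by device.
identity_repo = {
    "LAPTOP-42": {"user": "jdoe", "geo": "US-PA", "net_perms": "vpn-only"},
}

def investigate(src_ip: str) -> dict:
    """Assemble one event's context from two uncorrelated silos."""
    net = network_repo.get(src_ip, {})                # query 1: silo A
    ident = identity_repo.get(net.get("device"), {})  # query 2: silo B
    return {"src_ip": src_ip, **net, **ident}         # stitched by hand

print(investigate("198.51.100.7"))
```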

While you may have reduced data storage costs, you have also increased the duration and complexity of incident investigations, giving malicious actors more time in your systems and making them harder to locate and contain.

Weaving together federated data with DataBee

Weaving together data tells you what and when something happened, enabling insights into activity rather than just a list of records. With a fabric of data, you can interpret it to better understand your environment or gain insights about an incident. With DataBee, you can focus on protecting your business while achieving tactical and strategic objectives. 


At the tactical level, DataBee fits into your cost management strategies because it collects and processes your data in a streamlined, affordable way. It ingests security and IT logs and feeds, including non-traditional telemetry like organizational hierarchy data, from APIs, on-premises log forwarders, AWS S3 buckets, or Azure Blob Storage, then automatically parses and maps the data to the OCSF. You can use one or more repositories, aligning with cost management goals. Simultaneously, data users can access accurate, clean data through the platform to build reliable analytics without worrying about data gaps.

Strategically, DataBee is a security, risk, and compliance platform built for creating customized insights that empower everyone in the organization.

The platform enriches your dataset with business policy context and applies patent-pending entity resolution technology so you can gain insights based on a unified, time-series dataset. This transformation and enrichment process breaks down silos so you can efficiently and effectively correlate data to gain real-time insights, empowering operational managers, security analysts, risk management teams, and audit functions.  
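DataBee's entity resolution is patent-pending and proprietary, so the snippet below is only a toy illustration of the general concept, merging records that identify the same person differently across sources, with invented data and matching rules:

```python
# Conceptual sketch only; not DataBee's implementation.
records = [
    {"source": "hr",  "email": "jdoe@example.com", "name": "Jane Doe"},
    {"source": "edr", "username": "jdoe", "email": "jdoe@example.com"},
    {"source": "vpn", "username": "jdoe", "src_ip": "198.51.100.7"},
]

def same_entity(a: dict, b: dict) -> bool:
    """Two records match if they share a non-empty identifier."""
    return any(
        a.get(k) is not None and a.get(k) == b.get(k)
        for k in ("email", "username")
    )

entity: dict = {}
for rec in records:
    # Naive chaining: fold each matching record into one unified profile.
    if not entity or same_entity(rec, entity):
        entity.update({k: v for k, v in rec.items() if k != "source"})

print(entity)
# {'email': 'jdoe@example.com', 'name': 'Jane Doe',
#  'username': 'jdoe', 'src_ip': '198.51.100.7'}
```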
