August 28, 2024

The value of OCSF from the point of view of a data scientist

Data can come in all shapes and sizes. As the “data guy” here at DataBee® (and the “SIEM guy” in a past life), I’ve worked extensively with logs and data feeds in different formats, structures, and sizes, delivered using different methods and protocols. In my experience, when data is inconsistent and lacks interoperability, I spend most of my time trying to understand each product vendor’s schema and less time showing value or providing insights that could help other teams.

That’s why I’ve become involved in the Open Cybersecurity Schema Framework (OCSF) community. OCSF is an emerging but highly collaborative schema that aims to standardize security and security-related data to improve consistency, analysis, and collaboration.  In this blog, I will explain why I believe OCSF is the best choice for your data lake.

The problem of inconsistency

When consuming enterprise IT and cybersecurity data from disparate sources, most of the concepts are the same (like an IP address, a hostname, or a username), but each vendor may use a different schema (different property names) and sometimes a different way of representing the same data.

Example: How different vendors represent a username field 

Vendor               | Raw Schema Representation
Vendor A (Firewall)  | user.name
Vendor B (SIEM)      | username
Vendor C (Endpoint)  | usr_name
Vendor D (Cloud)     | identity.user

Even if the same property name is used, sometimes the range of values or classifications might vary. 

Example: How different vendors represent “Severity” with different value ranges 

Vendor               | Raw Schema Representation | Possible Values
Vendor A (Firewall)  | severity                  | low, medium, high
Vendor B (SIEM)      | severity                  | 1 (critical), 2 (high), 3 (medium), 4 (low)
Vendor C (Endpoint)  | severity                  | info, warning, critical
Vendor D (Cloud)     | severity                  | 0 (emergency) through 7 (debug)

In a non-standardized environment, these variations require custom mappings and transformations before consistent analysis can take place. That’s where data standards help: they govern how data is ingested, stored, and used, maintaining consistency and quality so that it can be used across different systems, applications, and teams.
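To make the cost of those custom mappings concrete, here is a minimal sketch in Python of the translation layer a team typically has to write and maintain by hand. The vendor field names and severity scales are the hypothetical ones from the tables above, not any specific product’s schema.

```python
# Hand-maintained, per-vendor translation tables. The field names and severity
# scales are hypothetical, taken from the example tables above.
FIELD_MAP = {
    "vendor_a": {"user.name": "username", "severity": "severity"},
    "vendor_b": {"username": "username", "severity": "severity"},
    "vendor_c": {"usr_name": "username", "severity": "severity"},
    "vendor_d": {"identity.user": "username", "severity": "severity"},
}

# Each vendor also uses its own severity scale, so the values need remapping
# onto one common range (here 1 = low ... 4 = critical).
SEVERITY_MAP = {
    "vendor_a": {"low": 1, "medium": 2, "high": 3},
    "vendor_b": {"1": 4, "2": 3, "3": 2, "4": 1},                 # raw: 1 = critical ... 4 = low
    "vendor_c": {"info": 1, "warning": 2, "critical": 4},
    "vendor_d": {str(n): max(1, 4 - n // 2) for n in range(8)},   # syslog-style 0-7, roughly bucketed
}

def normalize(vendor: str, raw_event: dict) -> dict:
    """Rename vendor-specific fields and rescale severity to the common range."""
    event = {}
    for raw_key, common_key in FIELD_MAP[vendor].items():
        if raw_key in raw_event:
            event[common_key] = raw_event[raw_key]
    if "severity" in event:
        event["severity"] = SEVERITY_MAP[vendor].get(str(event["severity"]), 0)
    return event

print(normalize("vendor_a", {"user.name": "alice", "severity": "high"}))  # {'username': 'alice', 'severity': 3}
print(normalize("vendor_b", {"username": "alice", "severity": 1}))        # {'username': 'alice', 'severity': 4}
```

Every new source means another entry in tables like these, and every vendor schema change means revisiting them.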

How can a standard help?

In the context of data modeling, a "standard" is a widely accepted set of rules or structures designed to ensure consistency across systems. The primary purpose of a standard is to achieve normalization: ensuring that data from disparate sources can be consistently analyzed within a unified platform like a security data lake or a security information and event management (SIEM) solution. From a cybersecurity standpoint, this becomes evident in at least a few common scenarios:

  1. Analytics: A standardized schema enables the creation of consistent rules, models, and dashboards, independent of the data source or vendor. For example, a rule to detect failed login attempts can be applied uniformly, regardless of whether the data originates from a firewall, endpoint security tool, or cloud application (see the sketch after this list). 

  2. Threat Hunting - Noise Reduction: With normalized fields, filtering out irrelevant data becomes more efficient. For instance, if every log uses a common field for user identity (like username), filtering across multiple log sources becomes much simpler. 

  3. Threat Hunting - Understanding the Data: Having a single schema instead of learning multiple vendor-specific schemas reduces cognitive load for analysts, allowing them to focus on analysis rather than data translation. 
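To illustrate the first point, here is a minimal sketch assuming events have already been normalized to a shared schema; the field names (class_name, status, actor_user) are illustrative placeholders rather than any specific standard’s attributes. The same failed-login rule then applies no matter which product produced the event.

```python
from collections import Counter

# Hypothetical events, already normalized to a common schema.
events = [
    {"source": "firewall",  "class_name": "Authentication", "status": "Failure", "actor_user": "alice"},
    {"source": "endpoint",  "class_name": "Authentication", "status": "Failure", "actor_user": "alice"},
    {"source": "cloud_app", "class_name": "Authentication", "status": "Success", "actor_user": "alice"},
    {"source": "cloud_app", "class_name": "Authentication", "status": "Failure", "actor_user": "bob"},
]

# One detection rule, independent of the producing vendor: count failed logins per user.
failed_logins = Counter(
    e["actor_user"]
    for e in events
    if e["class_name"] == "Authentication" and e["status"] == "Failure"
)

for user, count in failed_logins.most_common():
    print(f"{user}: {count} failed login(s) across all sources")
```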

For log data, several standards exist. Some popular ones are: Common Event Format (CEF), Log Event Extended Format (LEEF), Splunk's Common Information Model (CIM), and the Elastic Common Schema (ECS). Each has its strengths and limitations depending on the use case and platform.

Why existing schemas like CEF and LEEF fall short 

Common Event Format (CEF) and Log Event Extended Format (LEEF) are widely used schemas, but they are often too simplistic for modern data lake and analytics use cases. 

  • Limited Fields: CEF and LEEF offer a limited set of predefined fields, meaning most log data ends up in custom fields, which defeats the purpose of a standardized schema. 

  • Custom Fields Bloat: In practice, most data fields are defined as custom, leading to inconsistencies and a lack of clarity. This results in different interpretations of the same data types, complicating analytics. 

  • Overloaded Fields: Without sufficient granularity, crucial data gets overloaded into generic fields, making it hard to distinguish between different event types. 

Example: Overloading a single field like “message” to store multiple types of information (e.g., event description, error code) creates ambiguity and reduces the effectiveness of automated analysis. 
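As a minimal sketch of why this hurts, consider two invented producers that both cram a user and an error code into a single message field. Every consumer ends up guessing the layout with regular expressions, and those guesses break the moment a producer rewords its messages; dedicated fields for the user and the code would make this parsing step unnecessary.

```python
import re

# Two hypothetical producers overload the same generic "message" field.
logs = [
    {"message": "User alice failed to authenticate (error 4625)"},
    {"message": "ERR=4625 auth failure for account alice"},
]

# Each layout now needs its own brittle regular expression.
patterns = [
    re.compile(r"User (?P<user>\S+) failed to authenticate \(error (?P<code>\d+)\)"),
    re.compile(r"ERR=(?P<code>\d+) auth failure for account (?P<user>\S+)"),
]

for log in logs:
    for pattern in patterns:
        match = pattern.search(log["message"])
        if match:
            print(match.groupdict())   # e.g. {'user': 'alice', 'code': '4625'}
            break
```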

The limits of CIM and ECS: vendor-specific constraints 

Splunk CIM and Elastic ECS are sophisticated schemas that better address the needs of modern environments, but they are tightly coupled to their respective ecosystems. 

Proprietary Optimizations: 

  • CIM: Although widely used within Splunk, CIM is proprietary and lacks an open-source community for contributions to the schema itself. Its design focuses on Splunk’s use cases, which can be limiting in broader environments. 

  • ECS: While open-source, ECS remains heavily influenced by Elastic’s internal needs. For instance, ECS optimizes data types for Elastic’s indexing and querying, like the distinction between keyword and text fields. Such optimizations can be unnecessary or incompatible with non-Elastic platforms. 

Field Ambiguity: 

  • CIM uses fields like src and dest, which lack precision compared to more explicit options like source.ip or destination.port. This can lead to confusion and the need for additional context when performing cross-platform analysis.

Vendor-Centric Design: 

  • CIM: The field definitions and categories are tightly aligned with Splunk’s correlation searches, limiting its relevance outside Splunk environments. 

  • ECS: Data types like geo_point are unique to Elastic’s product features and capabilities, making the schema less suitable when integrating with other tools. 

How OCSF addresses these challenges

OCSF was developed by a consortium of industry leaders, including AWS, Splunk, and IBM, with the goal of creating a truly vendor-neutral and comprehensive schema. 

  • Vendor-Neutral and Tool-Agnostic: OCSF is designed to be applicable across all logs, not just security logs. This flexibility allows it to adapt to a wide variety of data sources while maintaining consistency. 

  • Open-Source with Broad Community Support: OCSF is openly governed and welcomes contributions from across the industry. Unlike ECS and CIM, OCSF’s direction is not controlled by a single vendor, ensuring it remains applicable to diverse environments. 

  • Specificity and Granularity: The schema’s granularity aids in filtering and prevents the overloading of concepts. For example, OCSF uses specific attribute paths like actor.user.name and dst_endpoint.port, providing clarity while avoiding ambiguous terms like src. 

  • Modularity and Extensibility: OCSF’s modular design allows for easy extensions, making it adaptable without compromising specificity. Organizations can extend the schema to suit their unique use cases while remaining compliant with the core model. 

In DataBee’s own implementation, we’ve extended OCSF to include custom fields specific to our environment, without sacrificing compatibility or requiring extensive custom mappings. For example, we added the assessment object, which can be used to describe data around third-party security assessments or internal audits. This kind of log data doesn’t come from your typical security products, but it is necessary for the kinds of use cases you can achieve with a data lake. 
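As a purely illustrative sketch of what such an extension can look like in practice, the record below shows a finding carrying a custom assessment object alongside the core schema’s existing concepts. The attribute names here are hypothetical placeholders, not our published extension.

```python
# Hypothetical example of an event extended with a custom "assessment" object;
# attribute names are illustrative and not DataBee's actual extension.
assessment_finding = {
    "class_name": "Compliance Finding",      # core-schema concepts are reused as-is
    "time": 1724803200000,
    "severity": "Medium",
    "assessment": {                          # the custom object added via an extension
        "name": "Q3 Third-Party Security Assessment",
        "assessor": "Example Audit Partner",
        "control": "Quarterly access reviews",
        "result": "Partially compliant",
    },
}
```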

Now that I’ve shared my experience with some of the industry’s most common schemas, here is a comparison matrix of OCSF against two of the leading alternatives. 

OCSF Schema Comparison Matrix 

Aspect                        | OCSF                                                                                   | Splunk CIM                                                                                | Elastic ECS
Openness                      | Open-source, community and multi-vendor-driven                                         | Proprietary, Splunk-driven                                                                | Open-source, but Elastic-driven
Community Engagement          | Broad, inclusive community, vendor-neutral                                             | Limited to Splunk community and apps                                                     | Strong Elastic community, centralized control
Flexibility of Contribution   | Contributable, modular, actively seeks community input                                 | No direct community contributions                                                        | Contributable, but Elastic makes final decisions
Adoption Rate                 | Early but growing rapidly across multiple vendors                                      | High within Splunk ecosystem                                                             | High within Elastic ecosystem
Vendor Ecosystem              | Broad support, designed for multi-vendor use                                           | Splunk-centric, limited outside of Splunk                                                | Elastic-centric, some third-party integrations
Granularity and Adaptability  | Structured and specific but modular; balances adaptability with detailed extensibility | Moderately structured with more generic fields; broad compatibility but less precision   | Highly granular and specific with tightly defined fields; limited flexibility outside Elastic environments
Best For                      | Flexible, vendor-neutral environments needing both detail and adaptability             | Broad compatibility in Splunk-centric environments                                       | Consistent, detailed analysis within Elastic environments

The impact of OCSF at DataBee

In working with OCSF, I have been particularly impressed with the combination of how detailed the schema is and how extensible it is. We can leverage its modular nature to apply it to a variety of use cases to fit our customers' needs, while re-using most of the schema and its concepts. OCSF’s ability to standardize and enrich data from multiple sources has streamlined our analytics, making it easier to track threats across different platforms and ultimately helping us deliver more value to our customers. This level of consistency and collaboration is something that no other schema has provided, and it’s why OCSF has been so impactful in my role as a data analyst. 

When we have ideas for the schema that might be useful to others, the OCSF community is receptive to contributions. The community is already brimming with top talent in the SIEM and security data field and is there to help guide us in our mapping and schema extension decisions. This community-driven approach means that I’m not working in isolation; I have access to a wealth of knowledge and support, and I can contribute back to a growing standard that is designed to evolve with the industry.

Within the DataBee product, OCSF enables us to build powerful correlation logic that we use to enrich the data we collect. For example, we can track the activities of a device regardless of whether the event came from a firewall or from an endpoint agent, because the hostname will always be device.name. 
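A minimal sketch of that idea, using hypothetical pre-normalized records rather than our actual pipeline: because both products put the hostname in the same place, a single grouping key reconstructs the device’s activity.

```python
from collections import defaultdict

# Hypothetical events from two different products, already normalized so that
# the hostname always lives at device.name.
events = [
    {"source": "firewall", "activity": "Connection Blocked", "device": {"name": "laptop-042"}},
    {"source": "endpoint", "activity": "Process Started",    "device": {"name": "laptop-042"}},
    {"source": "endpoint", "activity": "File Written",       "device": {"name": "server-007"}},
]

# One grouping key works for every source, so the per-device timeline comes for free.
timeline = defaultdict(list)
for event in events:
    timeline[event["device"]["name"]].append((event["source"], event["activity"]))

for device, activities in timeline.items():
    print(device, activities)
```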

Whenever our customers have questions about how our schema works, the self-documenting schema, including our own extensions, is always available at ocsf.databee.buzz. This helps as many users as possible gain security and compliance insights.

Conclusion

As organizations continue to rely on increasingly diverse and complex data sources, the need for a standardized schema becomes paramount. While CEF, LEEF, CIM, and ECS have served important roles, their limitations—whether in scope, flexibility, or vendor-neutrality—make them less ideal for a comprehensive data lake strategy. 

For me as a Principal Cybersecurity Data Analyst, OCSF has been transformative and represents the next evolution in standardization. With its vendor-agnostic, community-driven approach, OCSF offers the precision needed for detailed analysis while remaining flexible enough to accommodate the diverse and ever-evolving landscape of log data.
