Normalizing the Cowrie Feed

Last week we took a look at the Cowrie feed. It’s made of a set of events and fields that, combined, provide information useful for understanding patterns of cyberattack behavior, particularly brute-force attempts and port and IP address scans. As a heterogeneous feed, the list of fields varies with the event type, which sometimes makes querying and searching the data complicated. Since I chose Splunk Enterprise as the repository for my honeypot data, my Splunk searches on the out-of-the-box Cowrie feed could be rather convoluted.

The Challenge

To see the scope of the challenge mentioned above through a real-life example, let’s consider the following arbitrary –but quite common and reasonable– definition:

A probe is an unrequested interaction with our honeypot initiated by a third-party entity –typically, a scanner, crawler, or searchbot– that does not attempt to log in to the remote shell.

Given the definition above, let’s put ourselves in the shoes of an analyst who needs to create a report that includes the number of probes recorded by our Cowrie honeypot. Because the required information is spread across different events, each with a different set of fields, our Splunk search needs to resort to temporary fields, multiple stats passes, and other Splunk constructs that first extract, then combine the data.

Our Splunk search might look something like this:

index=cowrie (eventid=cowrie.session.connect OR eventid=cowrie.login.*)
| fields session, eventid
| eval tmp_field1=if(eventid="cowrie.session.connect", 1, -1)
| stats sum(tmp_field1) as tmp_field2 by session
| eval tmp_field3=if(tmp_field2=1, 1, 0)
| stats sum(tmp_field3) as count

Let’s unpack it:

  • The first line defines the index –a repository of Splunk data– that contains the Cowrie information (cowrie by default), and then selects a subset of the events, specifically the cowrie.session.connect, cowrie.login.success, and cowrie.login.failed events.
  • The second line keeps only the session and eventid fields and ignores all other fields in the events selected in the first line.
  • To the two fields selected above, the third line adds the temporary field tmp_field1 to distinguish between cowrie.session.connect events (value 1) and either cowrie.login.success or cowrie.login.failed (value -1) events.
  • The fourth line adds the values of tmp_field1 corresponding to the same session and saves the result in the new temporary field tmp_field2. To understand this, remember that the session field binds together otherwise separate events that are part of the same interaction with the honeypot. When attackers attempt to "guess" the credentials that would give them access to the honeypot’s remote shell, they often try multiple times, until they succeed, give up, or the connection times out. In a session with, for example, three login attempts, there would be one cowrie.session.connect event (there’s only one per session) and three cowrie.login.* events; the fourth line would therefore assign tmp_field2 the value -2 (1-1-1-1) for the session in question.
  • The fifth line defines yet another temporary field, tmp_field3, that’s 1 if tmp_field2 is 1 and 0 otherwise. In other words, tmp_field3 is 1 only for those sessions that don’t have any cowrie.login.* events.
  • The last line adds the values of tmp_field3, effectively counting the sessions that don’t have login attempts, which we defined earlier as probes.

Phew! Doable, but quite involved.
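To make the session-bucketing logic of that search easier to follow, here is a minimal Python sketch of the same computation (the event dicts are hypothetical stand-ins for Cowrie records, keeping only the two fields the search uses):

```python
from collections import defaultdict

# Hypothetical stand-ins for Cowrie events, reduced to the two fields
# the Splunk search keeps: session and eventid.
events = [
    {"session": "a1", "eventid": "cowrie.session.connect"},   # no logins: a probe
    {"session": "b2", "eventid": "cowrie.session.connect"},
    {"session": "b2", "eventid": "cowrie.login.failed"},
    {"session": "b2", "eventid": "cowrie.login.success"},
]

# Mirrors: eval tmp_field1=... | stats sum(tmp_field1) as tmp_field2 by session
score = defaultdict(int)
for e in events:
    score[e["session"]] += 1 if e["eventid"] == "cowrie.session.connect" else -1

# Mirrors: eval tmp_field3=... | stats sum(tmp_field3) as count
probes = sum(1 for s in score.values() if s == 1)
print(probes)  # 1: only session "a1" has no cowrie.login.* events
```

Session "b2" scores 1 - 1 - 1 = -1, so only the login-free session "a1" keeps the score of exactly 1 that marks a probe.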

The Solution: A Normalized Feed

Data normalization is the process of reorganizing, or ‘massaging’, the data so that it’s easier and faster to work with. Normalization involves reducing or eliminating data redundancy and ensuring that the data dependencies are implemented in a way that takes into account the constraints of the underlying database that holds the data. This allows the data to be queried and analyzed more easily.

When we explored the taxonomy of the default Cowrie feed, we saw that it consisted of 21 event types and 47 fields, of which 6 were common to all event types. When designing a normalized Cowrie feed, I used the following guiding principles:

  1. Move from an event-based approach to a session-based approach: reduce the number of event types (21) to just one that captures all the information of a session, defined as an (unwanted) interaction with our honeypot initiated by a third-party entity.
  2. Simplify the feed as much as possible without sacrificing data richness or performance: prioritize ease of use (readability, maintainability, scalability).
  3. Take advantage of common database constructs, such as support for multivalue fields.

Implementation Steps

The main change, stemming from the first guiding principle, was to merge the information of all events corresponding to the same session into the new normalized session record. In general, this was done by aggregating single-value fields from different event records into a single multivalue field:

  • Aggregate the values of the input field from cowrie.command.* events sharing a common session identifier into a new commands field; use semicolons for the string concatenation.
  • Create a new creds_login field by aggregating "is login" information from the cowrie.login.* events: true if the event is cowrie.login.success, false if the event is cowrie.login.failed; use the string "<sep>" for the string concatenation.
  • Aggregate the values of the password field from cowrie.login.* events sharing a common session identifier into a new creds_pwd field; use the string "<sep>" for the string concatenation. NOTE: I did not use the semicolon to concatenate strings because it’s a valid password character.
  • Aggregate the values of the username field from cowrie.login.* events sharing a common session identifier into a new creds_uname field; use the string "<sep>" for the string concatenation.
  • Aggregate the values of the url field from cowrie.session.file_download events and the filename field from cowrie.session.file_upload events sharing a common session identifier into a new malware_files field; use semicolons for the string concatenation.
  • Aggregate the values of the shasum field from cowrie.session.file_download and cowrie.session.file_upload events sharing a common session identifier into a new malware_hashes field; use semicolons for the string concatenation.
  • Aggregate the values of the dst_ip field from cowrie.direct-tcpip.request events sharing a common session identifier into a new tcpip_dstip field; use semicolons for the string concatenation.
  • Aggregate the values of the dst_port field from cowrie.direct-tcpip.request events sharing a common session identifier into a new tcpip_dstport field; use semicolons for the string concatenation.
  • Aggregate the values of the src_ip field from cowrie.direct-tcpip.request events sharing a common session identifier into a new tcpip_srcip field; use semicolons for the string concatenation.
  • Aggregate the values of the src_port field from cowrie.direct-tcpip.request events sharing a common session identifier into a new tcpip_srcport field; use semicolons for the string concatenation.
  • Aggregate the values of the ttylog field from cowrie.log.closed events sharing a common session identifier into a new ttylogs field; use semicolons for the string concatenation.
  • Aggregate the values of the sha256 field from cowrie.virustotal.scanfile events and the url field from cowrie.virustotal.scanurl events sharing a common session identifier into a new vt_files field; use semicolons for the string concatenation.
  • Aggregate the values of the positives field from cowrie.virustotal.scanfile and cowrie.virustotal.scanurl events sharing a common session identifier into a new vt_positives field; use semicolons for the string concatenation.
  • Aggregate the values of the is_new field from cowrie.virustotal.scanfile and cowrie.virustotal.scanurl events sharing a common session identifier into a new vt_new field; use semicolons for the string concatenation.
  • Aggregate the values of the total field from cowrie.virustotal.scanfile and cowrie.virustotal.scanurl events sharing a common session identifier into a new vt_scans field; use semicolons for the string concatenation.

Technically not part of the feed normalization, but I took this opportunity to do a bit of data augmentation: I added four new fields with country and autonomous system information for the source and destination IP addresses (dst_asn, dst_country, src_asn, and src_country). For this, I used the MaxMind GeoLite2 database.
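The per-session aggregation behind these actions can be sketched in a few lines of Python. This is an illustrative sketch, not the actual implementation: the event records and the normalize_logins helper are hypothetical, and only the creds_* fields are shown.

```python
from collections import defaultdict

SEP = "<sep>"  # separator chosen because ";" is a valid password character

# Illustrative raw Cowrie login events for one session.
events = [
    {"session": "c3", "eventid": "cowrie.login.failed",  "username": "root",  "password": "123"},
    {"session": "c3", "eventid": "cowrie.login.failed",  "username": "admin", "password": "123456"},
    {"session": "c3", "eventid": "cowrie.login.success", "username": "root",  "password": "admin"},
]

def normalize_logins(events):
    """Aggregate cowrie.login.* events into per-session creds_* multivalue fields."""
    sessions = defaultdict(lambda: {"creds_login": [], "creds_uname": [], "creds_pwd": []})
    for e in events:
        if e["eventid"].startswith("cowrie.login."):
            rec = sessions[e["session"]]
            rec["creds_login"].append("true" if e["eventid"].endswith("success") else "false")
            rec["creds_uname"].append(e["username"])
            rec["creds_pwd"].append(e["password"])
    # Join each list into a single <sep>-concatenated multivalue field.
    return {sid: {k: SEP.join(v) for k, v in rec.items()} for sid, rec in sessions.items()}

print(normalize_logins(events)["c3"]["creds_pwd"])  # 123<sep>123456<sep>admin
```

Because all three lists are appended to in the same pass over the events, the resulting creds_* fields stay aligned position by position.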

Critical to the success of the normalization scheme is the fact that values of related fields remain in sync. As an example, the creds_* fields could have the following values:

  • creds_login = false<sep>false<sep>true
  • creds_uname = root<sep>admin<sep>root
  • creds_pwd = 123<sep>123456<sep>admin

The above values correspond to three login attempts:

  • Failed login attempt with username=root and password=123
  • Failed login attempt with username=admin and password=123456
  • Successful login attempt with username=root and password=admin
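Because the three fields stay in sync, they can be zipped back into per-attempt records when needed. A minimal sketch, using the example values above:

```python
SEP = "<sep>"

creds_login = "false<sep>false<sep>true"
creds_uname = "root<sep>admin<sep>root"
creds_pwd   = "123<sep>123456<sep>admin"

# Positions align across the three fields, so zip recovers the attempts.
attempts = list(zip(creds_login.split(SEP),
                    creds_uname.split(SEP),
                    creds_pwd.split(SEP)))
for ok, user, pwd in attempts:
    status = "Successful" if ok == "true" else "Failed"
    print(f"{status} login attempt with username={user} and password={pwd}")
```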

I purposely ignored the message field because its information is redundant. For now, I’m also ignoring the data field from the cowrie.direct-tcpip.data event type, as the information it contains is not part of my current workflow.

The New Feed

The end result of the implementation outlined in the previous section is a normalized feed consisting of a single event type with 28 fields:

Field Type Description
commands Multivalue Semicolon-separated list of executed/attempted commands
creds_login Multivalue <sep>-separated list of "is login" information (true|false)
creds_pwd Multivalue <sep>-separated list of passwords
creds_uname Multivalue <sep>-separated list of usernames
dst_asn Single value New field with information provided by MaxMind
dst_country Single value New field with information provided by MaxMind
dst_ip Single value Same as in original feed
dst_port Single value Same as in original feed
duration Single value Same as in original feed
malware_files Multivalue Semicolon-separated list of malware filenames or URLs
malware_hashes Multivalue Semicolon-separated list of malware hashes
protocol Single value Same as in original feed
sensor Single value Same as in original feed
session Single value Same as in original feed
src_asn Single value New field with information provided by MaxMind
src_country Single value New field with information provided by MaxMind
src_ip Single value Same as in original feed
src_port Single value Same as in original feed
tcpip_dstip Multivalue Semicolon-separated list of TCP/IP destination IP addresses
tcpip_dstport Multivalue Semicolon-separated list of TCP/IP destination ports
tcpip_srcip Multivalue Semicolon-separated list of TCP/IP source IP addresses
tcpip_srcport Multivalue Semicolon-separated list of TCP/IP source ports
timestamp Single value Same as in original feed
ttylogs Multivalue Semicolon-separated list of TTY log files
vt_files Multivalue Semicolon-separated list of VirusTotal-scanned hashes or URLs
vt_new Multivalue Semicolon-separated list of VirusTotal "is new" flags
vt_positives Multivalue Semicolon-separated list of VirusTotal positives counts
vt_scans Multivalue Semicolon-separated list of VirusTotal total scan counts

Revisiting the Example

With our new normalized, session-based Cowrie feed uploaded to Splunk under a separate cowrie-n index, the search to count the number of probes would be significantly cleaner and simpler than the one conducted on the original feed:

index=cowrie-n creds_uname=""
| stats count

Basically, select and count those entries (sessions) that have an empty creds_uname field. Nice and easy.
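The same filter-and-count is trivial against the normalized records in any language. In Python terms (the session records below are illustrative):

```python
# Illustrative normalized session records, showing only the relevant field.
sessions = [
    {"session": "a1", "creds_uname": ""},                # no login attempts: a probe
    {"session": "b2", "creds_uname": "root<sep>admin"},  # brute-force attempts
]

# Count sessions whose creds_uname field is empty.
probe_count = sum(1 for s in sessions if s["creds_uname"] == "")
print(probe_count)  # 1
```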

Summary

We’ve seen how building a normalized, session-based feed from the original heterogeneous, event-based Cowrie feed can make the life of the cybersecurity analyst easier, as it results in data queries that are cleaner and easier to build, maintain, and document. I chose to keep both feeds in Splunk (heck, it’s just storage) as a safety measure. To keep things simple, I opted for a delayed, asynchronous solution: once a day, a utility runs automatically to normalize all the Cowrie events recorded in the previous 24 hours. The resulting 24-hour delay has not impacted my ability to generate reports and dashboards with reasonably up-to-date data.
