Cleaning Dirty Data - Splunking JAWS

I’ve been playing around with my shark data. I know this is not the sort of data you would normally ingest into Splunk, but when I’m testing different visualizations or new apps I like to use data that interests me. The downside is that the data is not always clean. Below I’m going to show you how to clean up mixed-case dirty data directly from search.

Dirty Data

In this case I have a field with mixed-case values, which returns three different versions of the same value: SCUBA diving, scuba diving and Scuba Diving.

Splunk search is not case sensitive, but if I want to return the results in a chart I want Scuba Diving to be one value, not three.

There are a couple of different ways you can do this. One is to configure props.conf; check out the Admin manual.
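As a rough sketch, a calculated field in props.conf can do the normalization at search time; the sourcetype name below is just a placeholder:

[shark_attacks]
EVAL-Activity = lower(Activity)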

But if you don’t have access to make changes there, you can do it straight in your search.

Problem – Mixed Case Data

Starting from the search, I find the results I want and return the fields in a table. I then want to make sure that the results are clean and that the values with mixed case are classed as one category.
source="attacks.csv"  Country=*  Year>1966   Activity=*scuba*diving | Top limit=20 Activity
Before: the case-sensitive results show three versions of the same value.

A quick eval function here will convert all of the values to the same case, so they will be classed together as one category.

We can use lower or upper, which are built into Splunk by default, or we can get a little tricky.
Lowercase
eval Activity=lower(Activity)
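Dropped into the full search from above, that looks like this:

source="attacks.csv" Country=* Year>1966 Activity=*scuba*diving | eval Activity=lower(Activity) | top limit=20 Activity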


Uppercase
eval Activity=upper(Activity)


Propercase

There is no default propercase function, so this is where we get a little tricky: the search splits Activity into individual words, uppercases the first letter of each word, and then stitches the words back together.

source="attacks.csv" Country=* Year>1966 Activity=*scuba*diving | eval Activity = lower(Activity) | makemv delim=" " Activity | mvexpand Activity | eval A = substr(Activity, 1, 1) | eval B = substr(Activity,2) | eval A = upper(A) | eval Activity = A.B | fields - A, B | mvcombine Activity | eval Activity = mvjoin(Activity, " ") | top limit=20 Activity


Before and after: the cleaned data now shows Scuba Diving as a single value.
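If you want to double-check that everything really has collapsed into one category, swapping top for a quick stats count makes an easy sanity check:

source="attacks.csv" Country=* Year>1966 Activity=*scuba*diving | eval Activity=lower(Activity) | stats count by Activity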
Whilst this sort of cleaning activity is more commonly needed for static or periodic data dumps, for example exports from databases, we also see many instances and use cases with machine and real-time data. A recent example was a requirement to clean up web logs generated from an online shop. Most of the cleaning related to user-created data such as search terms, as well as shop inventory items. I hope this has helped you find and clean up any dirty data issues you may have.

And as per usual, if you have this sort of work that you need help with, the App Assembly is very happy to offer consulting services for all your Splunk needs.
