I’ve always considered myself a “data guy”, but I’m feeling a bit conflicted about it. When I talk about data, I think of data analysis, charting, time series, forecasting, and mathematical models like simulations, linear regressions, and linear programming.
Though I’ve explored advanced techniques like Excel Solver, linear programming, discrete event simulation, neural networks, and statistical Python libraries like statsmodels for regression analysis and exponential smoothing, I’ve found limited practical application for these in my career. Often, my attempts to introduce these methods to colleagues were met with confusion, as they didn’t see the immediate value.
Most places I have worked at did not understand the difference between the average and the median. In general, going beyond basic statistics has not been fruitful in a work environment.
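To make the average versus median point concrete, here is a minimal sketch using Python's built-in statistics module. The salary figures are made up for illustration; the only point is that one extreme value drags the average far from what a typical value looks like, while the median barely moves.

```python
from statistics import mean, median

# Hypothetical salaries: five typical values and one extreme outlier.
salaries = [40_000, 42_000, 45_000, 47_000, 50_000, 400_000]

average = mean(salaries)   # pulled upward by the single outlier
middle = median(salaries)  # midpoint of the sorted values, barely affected

print(f"mean:   {average:,.0f}")
print(f"median: {middle:,.0f}")
```

With skewed data like this, reporting only the average can give a misleading picture of the typical case.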
In my experience, data analysis has been most useful in these scenarios:
- Preparing reports
- Performing an analysis
While complex mathematical models have their place, they were not commonly used in my everyday work environment. Task automation and decision automation, which often involve data manipulation, are more prevalent. This requires proficiency in reading data from various sources, manipulating it, and using it to automate decisions or tasks.
Data Wrangling
Data wrangling and data munging play a significant role here, ensuring data quality and consistency. Where I excel is not necessarily in analyzing the data itself, but rather in knowing how to work with the data. I understand the nuances between different data types and formats, particularly structured data. My strength lies in reading data from various sources, studying it, and troubleshooting issues with the code that interacts with it.
Common data formats that I have come across:
- .txt
- .json
- .xml
- .csv
- .html
- .xls/.xlsx (Microsoft Excel)
Languages I have used to work with data:
I needed to know the different types of data out there, whether numbers, strings, objects, floats, or dates. I also needed to know the common file formats, like XML, JSON, HTML, and CSV. I also came to realize that I would always need to look for data in a database. The big ones I have come across are Oracle, MS SQL, PostgreSQL, Redis, and MySQL. I had to be able to ingest that data via code and manipulate it in all sorts of ways, whether combining, merging, slicing, or doing calculations on it. I might also need to create new databases or new tables, or save data in different formats for different purposes.
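As a rough illustration of that kind of work, the sketch below reads rows from a CSV source, pulls a lookup table from a database, merges the two, and saves the result as JSON. Everything in it is an assumption made for the example: the table and column names are hypothetical, the CSV is an in-memory string rather than a real file, and SQLite stands in for whichever database is actually in use.

```python
import csv
import io
import json
import sqlite3

# Hypothetical CSV export of orders (in practice this would be a file on disk).
orders_csv = io.StringIO(
    "order_id,customer_id,amount\n"
    "1,10,25.50\n"
    "2,11,40.00\n"
    "3,10,14.50\n"
)
orders = list(csv.DictReader(orders_csv))

# Hypothetical customer table in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(10, "Ana"), (11, "Ben")])
names = dict(conn.execute("SELECT customer_id, name FROM customers"))
conn.close()

# Merge the two sources and total the amounts per customer.
# CSV values arrive as strings, so they are converted explicitly.
totals = {}
for row in orders:
    name = names[int(row["customer_id"])]
    totals[name] = totals.get(name, 0.0) + float(row["amount"])

# Save the result in a different format for another consumer.
report = json.dumps(totals, indent=2)
print(report)
```

The explicit int() and float() conversions are the part that tends to matter in practice, since a CSV file carries no type information of its own.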
Moving Data Around
Another aspect I haven’t mentioned is that sometimes I had to move data from one place to another. For example, on a website, I might have a form where people input data, and I needed to move that data to the server (the back end), do something with it, and then push it back out to the front end. I had to be really good at moving data around so that I could manipulate it with Python or some other back-end programming language. Then I needed to know how to push the data to the front end. At that point, not only did I have to understand the different data types, but I also had to understand how HTML, CSS, and JavaScript work together so I could display things like graphs and maps. For maps, I had to understand that I needed to convert zip codes into latitude and longitude coordinates. For JavaScript dashboards, I had to decide whether to push the data in CSV or JSON format. There are a lot of moving pieces.
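A small sketch of two of those pieces, under stated assumptions: the zip code lookup table below is hypothetical and its coordinates are rough, illustrative values (a real project would load a full zip code dataset or call a geocoding service), and the form submissions are made up. It shows the zip-to-coordinates step and then the same records serialized both ways, which is the CSV versus JSON decision in miniature.

```python
import csv
import io
import json

# Hypothetical lookup table; the coordinates are rough, illustrative values.
ZIP_TO_COORDS = {
    "10001": (40.75, -73.99),
    "90210": (34.09, -118.41),
}

# Hypothetical form submissions received by the back end.
submissions = [
    {"name": "Ana", "zip": "10001"},
    {"name": "Ben", "zip": "90210"},
    {"name": "Cy", "zip": "99999"},  # not in the lookup table
]

def add_coordinates(rows):
    """Attach lat/lon to each row, skipping zip codes that cannot be resolved."""
    located = []
    for row in rows:
        coords = ZIP_TO_COORDS.get(row["zip"])
        if coords is None:
            continue
        lat, lon = coords
        located.append({**row, "lat": lat, "lon": lon})
    return located

points = add_coordinates(submissions)

# JSON keeps types (numbers stay numbers) and nests easily.
as_json = json.dumps(points)

# CSV is flat and compact, but every value becomes text on the way out.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "zip", "lat", "lon"])
writer.writeheader()
writer.writerows(points)
as_csv = buffer.getvalue()

print(as_json)
print(as_csv)
```

Either format can feed a map or a chart. The trade-off is mostly whether the front-end code wants typed, nested objects (JSON) or a flat table it parses itself (CSV).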
Data Analysis Life Cycle
Despite this, I’ve found enjoyment and interest in data analysis. The process I typically went through was usually the same:
- Read the data
- Clean and Manipulate the data
- Analyze the data (data gaze, chart it)
- Write a document on my findings
- The deliverable was an analysis or a proposal document
All of the steps above were done manually at first. Yes, I may have pulled the data via SQL and, if I was allowed to install Python at work, maybe with pandas. After the data was gathered, I needed to analyze it and start writing my findings in a Microsoft Word document. Unless I needed to repeat the process over and over, automation was not a goal of mine.
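For the cases where the process did need repeating, the read, clean, and analyze steps can be sketched in a few lines. This is only an illustration with made-up data and hypothetical column names, written with the standard library so it runs even where pandas cannot be installed.

```python
import csv
import io
from statistics import mean, median

# Hypothetical raw export with the usual problems:
# stray whitespace, inconsistent casing, a blank value, and a non-numeric entry.
raw = io.StringIO(
    "region,sales\n"
    " East ,100\n"
    "West,250\n"
    "East,\n"
    "west,n/a\n"
    "East,140\n"
)

# 1. Read the data.
rows = list(csv.DictReader(raw))

# 2. Clean and manipulate: normalize labels, drop rows that are not numeric.
clean = []
for row in rows:
    region = row["region"].strip().title()
    try:
        sales = float(row["sales"])
    except ValueError:
        continue
    clean.append((region, sales))

# 3. Analyze: group by region and summarize.
by_region = {}
for region, sales in clean:
    by_region.setdefault(region, []).append(sales)

summary = {
    region: {"count": len(v), "mean": mean(v), "median": median(v)}
    for region, v in by_region.items()
}

# 4. The numbers below would feed the written findings.
for region, stats in summary.items():
    print(region, stats)
```

Silently dropping bad rows is a choice made here for brevity; in a real analysis it is usually worth counting and reporting what was discarded.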
Misc
Data Science plays a role in transforming data for use in mathematical models like regression and neural networks. It requires a lot of the data munging and wrangling skills already mentioned. The data needs to be gathered, cleaned, and then transformed in such a way that it can be pushed into a mathematical model such as a neural network. While I’ve explored tools like TensorFlow and scikit-learn, I found them interesting but not personally enjoyable. AI technologies have evolved rapidly, and I have not kept up with this topic over time.
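To give a sense of what "transformed" means here, the sketch below applies two common preparation steps by hand: min-max scaling for a numeric column and one-hot encoding for a categorical one. The records and the helper names min_max_scale and one_hot are hypothetical and written only for this example; libraries such as scikit-learn provide their own versions of both steps.

```python
# Hypothetical records: one numeric feature and one categorical feature.
records = [
    {"age": 20, "plan": "basic"},
    {"age": 40, "plan": "pro"},
    {"age": 30, "plan": "basic"},
]

def min_max_scale(values):
    """Rescale numbers to the 0..1 range; constant columns become all zeros."""
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        return [0.0 for _ in values]
    return [(v - low) / span for v in values]

def one_hot(values):
    """Turn category labels into 0/1 indicator columns, in sorted label order."""
    labels = sorted(set(values))
    return labels, [[1.0 if v == label else 0.0 for label in labels] for v in values]

ages = min_max_scale([r["age"] for r in records])
labels, plans = one_hot([r["plan"] for r in records])

# Each row is now a purely numeric feature vector: [age_scaled, plan_basic, plan_pro].
features = [[age] + flags for age, flags in zip(ages, plans)]
print(labels)
print(features)
```

The result is a list of numeric rows, which is the general shape most regression and neural network tools expect as input.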
Summary
While data is a very broad topic, I have come to see that understanding the basics matters most. Knowing the types of data you will find in the wild, knowing where the data gets stored, and knowing how to use code to move and manipulate data are the key skills I have used when working with data. I learned these skills by working with data over a long period of time.