Part 1
At work I had the task to reverse engineer some .csv files. I was given a folder with hundreds of .txt files and then asked to convert these into .csv files.
Below is an example of what these text files looked like:
As you can tell from the data above, some of it makes sense and some of it is a bit cryptic. The real data (I obviously cannot post it here) was much longer and even more confusing.
What made the task challenging was that I could not simply parse the string via a delimiter. These were fixed length data files and I needed to know the specifications of each column to properly parse it.
Side Note: I did try to get the information via meetings and SMEs, but that ended up fruitless.
Luckily the java code that parsed these text files told me how to parse it. Yay!
This is how the code looked like for one data element:
Now that I have all the data I need to parse the text files, lets write some Python code.
Everything in the Python code is pretty standard. We loop through all the files in a folder, open the text files one at a time, and we read the data in each file line by line. I did add a condition to skip a file that the line is not of length 80. I am expecting all of the lines to be exactly 80 characters long, so any files that do not meet this criteria will be ignored.
The slices function may be the only piece that at first looks a bit confusing. The (star)args parameter allows me to pass as many parameters to the slices function as I want. Remember that these numbers simply represent the length of the data element in the string. Let’s walk through the code a bit so you get the idea.
If we run the code above, what will we get?…
- ‘The ’ is of length 4
- ‘slices ’ is of length 7
- ‘function’ is of length 8
The code will get the substring from position 0 to position (0+4), which equals “The “. On the second iteration, it starts grabs the substring from position 4 to position (4+7), which equals “slices “. Get it? We are moving across the string a little at a time. You just have to make sure the numbers you are feeding the function are grabbing the correct substrings.
Another point to make is that the yield keyword will return a generator object. This is why I wrapped it in a list to actually see the results.
The code ends by saving my list into a csv file. Note that I used the ‘a’ append flag, so that on every iteration, a new line is added to the file.
That is pretty much it, hope you learned something new today.
Part 2
The next day I get to work and I feel good. I have my hundreds of converted CSV files and all is well…
I then get a message from a co-worker, “hey can you rename a couple column headers?“. Didn’t he realize I manually created the header for each of these files. I thought it was a one time task.
Here is the script I put together to update a few column headers for all the .csv files.
While I am working on the code and cleaning up the .csv files, I get another request. “Hey, can you just grab the first 4 characters of column Z?”
Here is the script I used to accomplish this task.
Lesson learned: I should have just created a script to port the data from .txt to .csv and at least give myself some room to also make changes at the row and column level if needed.