Connor Donegan

Prepping large files for import to PostgreSQL: simple command line tools

When importing a file into a relational database like PostgreSQL, some file manipulation is usually required first. Typical changes include selecting a subset of columns, filtering rows, pulling out the column names, and building small lookup tables of unique values.

Typically I will write an R script to do any of those tasks (and to import the data using the DBI package). But for very large files, R becomes less attractive, if not prohibitively memory-intensive. Of course, you could import the entire file into the database, perhaps into a staging area/schema, and then complete these tasks with Postgres commands (as suggested by Regina Obe and Leo Hsu in their nice book PostgreSQL: Up and Running), but I find that too clunky. Data munging is not what SQL is good for.

A better option is to learn a few command line tools and put them into a shell script. Although the command line is intimidating for many, it's very feasible to learn a few things. The following can be used on any operating system; you just need Bash ('Bourne Again SHell'). If you're stuck on a Windows system, you can get a nice Bash window by installing Git, which is good to have anyway. Once it's installed, Windows users can find the Bash window by typing 'bash' in the system search bar. In my experience with Windows, no further setup was necessary.

You can also enter the psql shell directly from Bash, which makes the entire process very streamlined.
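
For instance, assuming a local database named gwdb (a hypothetical name here), you can open the psql shell straight from Bash:

psql -d gwdb

Other connection details, such as -U for the username or -h for the host, go in the same call.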

If you'd like more of an introduction to the Bash shell and the command line, this page has some pretty friendly introductory material.

Say I have a big text file in my working directory:

ls
## gama.csv

The file contains groundwater testing results collected by the State of California. We'll use it to demonstrate some basic techniques for data munging from the command line.

The file has 715,158 lines, counting the header row. That is small enough to work with easily in R, but it will serve our purpose nonetheless.

wc -l gama.csv
## 715158 gama.csv

The general workflow will consist of applying some command to the file and then sending the command's output to a new file. To start, we'll print the column names and send them to a new file named 'header.txt'.

We can view just the first n lines of a file using head -n file.name, where n is the number of lines to show:

head -1 gama.csv
## _id,gm_county_name,gm_dataset_name,gm_well_category,gm_data_source,gm_well_id,gm_chemical_vvl,gm_chemical_name,gm_result_modifier,gm_result,gm_chemical_units,gm_reporting_limit,gm_samp_collection_date,gm_latitude,gm_longitude,gm_well_depth_ft,gm_top_depth_of_screen_ft,gm_bottom_depth_of_screen_ft,gm_cas_number,gm_altwell_id1,gm_altwell_id2,gm_altwell_id3,src_chemical,src_result_modifier,src_result,src_chemical_units,src_reporting_limit,src_samp_collection_date,src_samp_collection_time,src_analysis_date,src_analytical_method,src_lab_note,src_latitude,src_longitude,src_datum,src_well_depth_ft,src_top_depth_of_screen_ft,src_bottom_depth_of_screen_ft,src_well_category

To send the output of head into a new file we just append > newfile.txt:

head -1 gama.csv > header.txt

That will send the column names to a new file named 'header.txt'. Note that if 'header.txt' already exists, this command will erase its contents and replace them with the output of your command. (To append to an existing file without overwriting its contents, use >> instead of >.)
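
A quick sketch of the difference (the file name notes.txt is just for illustration):

echo "first" > notes.txt    # creates notes.txt, or overwrites it if it exists
echo "second" >> notes.txt  # appends a second line
cat notes.txt               # prints both lines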

Let's create a small file for demonstration purposes; it will contain the first 20 lines:

head -20 gama.csv > test.txt

Use cat to print the entire file contents.

cat test.txt

Our two most important operations are row selection and column selection. To select columns we will use cut, which needs three things: the delimiter (here, a comma), the columns to keep, and the file name. We'll print the first, third, sixth, and eighth columns:

cut -d ',' -f 1,3,6,8 test.txt
## _id,gm_dataset_name,gm_well_id,gm_chemical_name
## 1,GAMA_DOM,TUL917,Carbon Tetrachloride
## 2,GAMA_DOM,TUL972,Fecal Coliform (bacteria)
## 3,GAMA_DOM,MON173,Gross Alpha radioactivity
## 4,GAMA_DOM,TUL993,Sodium
## 5,GAMA_DOM,MON183,Styrene
## 6,GAMA_DOM,TUL1081,"1
## 7,GAMA_DOM,TUL984,Carbon Tetrachloride
## 8,GAMA_DOM,SD 814,Zinc
## 9,GAMA_DOM,MON156,"1
## 10,GAMA_DOM,TUL916,Potassium
## 11,GAMA_DOM,SD 874,Chloroform (THM)
## 12,GAMA_DOM,SD 891,"1
## 13,GAMA_DOM,MON121,delta H2/H1
## 14,GAMA_DOM,TUL995,Total Dissolved Solids
## 15,GAMA_DOM,SD 889,Nitrite as N
## 16,GAMA_DOM,SD 877,Sodium
## 17,GAMA_DOM,MON114,Uranium
## 18,GAMA_DOM,TUL1005,Copper
## 19,GAMA_DOM,TUL931,Hardness

You have to be careful to match your column names up with their positions (numbers), but otherwise there's nothing to this trick. Also beware that cut splits on every delimiter, including commas inside quoted fields; that's why a name like "1,2 Dichlorobenzene (1,2-DCB)" appears truncated to "1 in the output above. (If you want to change the column order you can use awk, which is slightly more involved; see the sketch below. I turned to cut after finding that awk was not able to read a file's delimiter properly, but I doubt that happens often.)
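
Here is a minimal awk sketch that selects and reorders columns, printing the sixth column before the first:

awk -F ',' 'BEGIN { OFS="," } { print $6, $1 }' test.txt

-F sets the input field separator and OFS the output field separator. Like cut, awk splits on every comma, quoted or not.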

To subset by row, we can use text matching or regular expressions with the grep command (for find-and-replace, you can start with sed). Type grep, then the text you're looking for, e.g., chloro, and then the file name. Searching for 'chloro' returns all the lines where that text fragment appears (in this case catching contaminants like 1,2 Dichlorobenzene).

grep chloro test.txt
## 9,MONTEREY,GAMA_DOM,Domestic,GAMA,MON156,DCBZ12,"1,2 Dichlorobenzene (1,2-DCB)",<,0.5,UG/L,0.5,2011-06-09T00:00:00,36.52389119,-121.7908844,NaN,NaN,NaN,NA,MON156,NA,NA,DCBZ12,<,0.5,UG/L,0.5,2011-06-09,1013,06/11/2011,E524.2,NA,36.52389119,-121.7908844,NAD83,NaN,NaN,NaN,Remediation/Groundwater Monitoring Well
## 12,SAN DIEGO,GAMA_DOM,Domestic,GAMA,SD 891,DCA12,"1,2 Dichloroethane (1,2 DCA)",<,0.5,UG/L,0.5,2008-06-19T00:00:00,32.70428713,-116.2936166,NaN,NaN,NaN,NA,SD 891,NA,NA,DCA12,<,0.5,UG/L,0.5,2008-06-19,0948,06/30/2008,E524.2,NA,32.70428713,-116.2936166,NAD83,NaN,NaN,NaN,Remediation/Groundwater Monitoring Well
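
Since sed came up: a minimal find-and-replace sketch, writing the result to a new file (replacing 'NaN' with an empty string here is just for illustration):

sed 's/NaN//g' test.txt > cleaned.txt

The s/old/new/g pattern substitutes every match on each line; drop the trailing g to replace only the first match per line.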

Building on this, let's say you want all perfluoroalkyl substances (PFAS) like PFOA, PFOS, etc. The following command will catch any row in gama.csv containing 'perfluoro', without case sensitivity.

grep -i perfluoro gama.csv

To get a glimpse of the results, I'll print just the first few using head. The commands get chained together with the pipe operator |:

grep -i perfluoro gama.csv | head -3
## 132307,TEHAMA,GAMA_USGS,Domestic,USGS,S9-REDBLS-RB02,8:2FTS,"8:2 Fluorotelomer sulfonic acid (1H, 1H, 2H, 2H-Perfluorodecanesulfonic acid) (8:2 FTS)",<,NaN,NG/L,7.8,2024-03-20T00:00:00,40.0201110839844,-122.40013885498,140,120,140,NA,400112122240001,NA,NA,"8:2 FTS (linear+branched), water, unfltd, recov",<,NaN,ng/l,7.8,2024-03-20,1300,NA,LM102,c :,40.0201110839844,-122.40013885498,NA,140,120,140,Domestic
## 132312,TEHAMA,GAMA_USGS,Domestic,USGS,S9-REDBLS-RED10,PFUNDCA,Perfluoroundecanoic acid (PFUnDA),<,NaN,NG/L,2,2024-03-19T00:00:00,40.2728881835938,-122.388725280762,453,413,453,NA,401622122231901,NA,NA,"PFUnDA (linear+branched), water, unfltd, recov",<,NaN,ng/l,2,2024-03-19,1230,NA,LM102,c :,40.2728881835938,-122.388725280762,NA,453,413,453,Domestic

Combine cut and grep to see just the contaminants:

cut -d ',' -f 8 gama.csv | 
	grep -i perfluoro

That will print a long list with many duplicates. To reduce it to unique values, we first put the lines in order using sort and then apply uniq. You have to sort first because uniq only collapses duplicates that are adjacent. (Note that this sort | uniq combination compares whole lines, so it works with any number of columns.)

cut -d ',' -f 8 gama.csv | 
	grep -i perfluoro |
	sort |
	uniq
## N-Ethyl perfluorooctane sulfonamidoacetic acid (NEtFOSAA)
## N-Methyl perfluorooctane sulfonamidoacetic acid (NMeFOSAA)
## Perfluorobutane sulfonic acid (PFBS)
## Perfluorobutyric acid (PFBA)
## Perfluorodecane sulfonic acid (PFDS)
## Perfluorodecanoic acid (PFDA)
## Perfluoroheptane sulfonic acid (PFHpS)
## Perfluoroheptanoic acid (PFHpA)
## Perfluorohexanesulfonic acid (PFHxS)
## Perfluorohexanoic acid (PFHxA)
## Perfluorononane sulfonic acid (PFNS)
## Perfluorononanoic acid (PFNA)
## Perfluorooctane sulfonamide (PFOSAm)
## Perfluorooctane sulfonate (PFOS)
## Perfluorooctanoic acid (PFOA)
## Perfluoropentane sulfonoic acid (PFPeS)
## Perfluoropentanoic acid (PFPeA)
## Perfluorotetradecanoic acid (PFTeDA)
## Perfluorotridecanoic acid (PFTrDA)
## Perfluoroundecanoic acid (PFUnDA)
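
As an aside, sort's -u flag does the sorting and the de-duplication in one step:

cut -d ',' -f 8 gama.csv | 
	grep -i perfluoro |
	sort -u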

Now we can create a lookup table with the contaminant names and their shorthands (which are in column 7):

cut -d ',' -f 7,8 gama.csv | 
	grep -i perfluoro |
	sort |
	uniq
## NETFOSAA,N-Ethyl perfluorooctane sulfonamidoacetic acid (NEtFOSAA)
## NMEFOSAA,N-Methyl perfluorooctane sulfonamidoacetic acid (NMeFOSAA)
## PFBSA,Perfluorobutane sulfonic acid (PFBS)
## PFBTA,Perfluorobutyric acid (PFBA)
## PFDSA,Perfluorodecane sulfonic acid (PFDS)
## PFHA,Perfluorohexanoic acid (PFHxA)
## PFHPA,Perfluoroheptanoic acid (PFHpA)
## PFHPSA,Perfluoroheptane sulfonic acid (PFHpS)
## PFHXSA,Perfluorohexanesulfonic acid (PFHxS)
## PFNA,Perfluorononanoic acid (PFNA)
## PFNDCA,Perfluorodecanoic acid (PFDA)
## PFNS,Perfluorononane sulfonic acid (PFNS)
## PFOA,Perfluorooctanoic acid (PFOA)
## PFOSA,Perfluorooctane sulfonamide (PFOSAm)
## PFOS,Perfluorooctane sulfonate (PFOS)
## PFPA,Perfluoropentanoic acid (PFPeA)
## PFPES,Perfluoropentane sulfonoic acid (PFPeS)
## PFTEDA,Perfluorotetradecanoic acid (PFTeDA)
## PFTRIDA,Perfluorotridecanoic acid (PFTrDA)
## PFUNDCA,Perfluoroundecanoic acid (PFUnDA)

Just save it to a text file; then you can use psql's nice \copy command to send it over to your database. (A line ending in a pipe | continues automatically; to break a line anywhere else in your Bash code, end it with a backslash \, as before the redirect below.)

cut -d ',' -f 7,8 gama.csv | 
	grep -i perfluoro |
	sort |
	uniq \
	> lookup.txt
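
From there, a hedged sketch of the import, assuming a database named gwdb and an existing two-column text table named pfas_lookup (both hypothetical names):

psql -d gwdb -c "\copy pfas_lookup FROM 'lookup.txt' WITH (FORMAT csv)"

Because \copy runs client-side, lookup.txt only needs to be readable by your shell, not by the database server.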

This is a small file, but the same workflow lets you handle very large files just as easily. And as the last example shows, you can combine Bash and psql commands in the same script.

Tags: #Linux #GIS #PostgreSQL