From JSON to CSV to SQL

The task

I recently had to insert JSON data sets into a PostgreSQL database as ordinary table rows, not as a JSON field type.

The JSON represents a network of nodes and links, stored as two arrays under the top-level keys nodes and links.

What I needed:

  1. split JSON into nodes and links separately
  2. convert to CSV
  3. add an ID field as first column and primary key
  4. number data rows with ID from 1 to n
  5. delete the ugly path “/mydemo/images/”, leaving only the filename
  6. create table in PostgreSQL DB and insert data

Create CSV from JSON

For this I used the brilliant Python-based csvkit package. It takes two steps:

The first step completes tasks 1 to 5:

    in2csv -k nodes myfile.json | nl -v0 -s, | perl -pe 's/^\s+//' | sed "s/^0,/id,/g;s#/visdemo/images/##g" > nodes.csv

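Term 1 (in2csv -k nodes): Export the subarray under the key nodes as CSV.
Term 2 (nl -v0 -s,): Number the lines starting with zero (uses the Unix tool nl), so the header gets 0 and the data rows 1 to n. The separator after the number is a comma.
Term 3 (perl -pe): Strip the leading spaces that nl puts in front of the number.
Term 4 (sed): Replace the “0,” at the start of the first line with the column name id, and strip the image path.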

The same for the links section of the JSON file:

    in2csv -k links graph3345.json | nl -v0 -s, | perl -pe 's/^\s+//' | sed "s/^0,/id,/g" > links.csv

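Term 1 (in2csv -k links): Export the subarray under the key links as CSV.
Term 2 (nl -v0 -s,): Number the lines starting with zero (uses the Unix tool nl). The separator after the number is a comma.
Term 3 (perl -pe): Strip the leading spaces that nl puts in front of the number.
Term 4 (sed): Replace the “0,” at the start of the first line with the column name id.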
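
Because the IDs come from line numbers rather than from the data, it can be worth checking that the first column really reads id, 1, 2, ..., n. Blank lines, for example, are not numbered by nl by default and would break the sequence. The awk sketch below is an addition for illustration, not part of the original workflow, and the function name check_ids is made up.

```shell
# check_ids FILE: succeed only if column 1 is "id" in the header line
# and then counts 1, 2, 3, ... without gaps in the data rows.
check_ids() {
    awk -F, '
        NR == 1      { if ($1 != "id") bad = 1; next }
        $1 != NR - 1 { bad = 1 }
        END          { exit (bad || NR < 1) }
    ' "$1"
}

# Usage, assuming links.csv was produced by the pipeline above:
# check_ids links.csv && echo "ids ok"
```

The check looks only at the first column, so commas inside quoted fields further to the right do not affect it.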

Insert CSV into database

Now we need another tool from the csvkit tool chain, csvsql, which reads the CSV, builds an SQL CREATE TABLE statement and inserts the data into the created table.

 

The parameters are:

  • --db <DBURL> The database URL to connect to, in SQLAlchemy format.
  • --table <TABLENAME> The name the table should get. If not supplied, the CSV filename without extension is used.
  • --insert Not only issue the CREATE TABLE DDL statement, but also insert the data sets into the freshly created table.
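
Put together, a call of csvkit's csvsql tool could look like the sketch below. The connection URL with its user, password, host and database name is a placeholder, not a value from the original setup, and load_csv and DRY_RUN are names invented for this example. Recent csvkit releases document the table option as --tables, so that spelling is used here. With DRY_RUN=1 the command is only printed, which makes it easy to inspect before anything touches the database.

```shell
# Placeholder SQLAlchemy URL; replace user, password, host and database.
DB_URL="${DB_URL:-postgresql://user:password@localhost/mydb}"

load_csv() {
    # $1 = table name, $2 = CSV file
    set -- csvsql --db "$DB_URL" --tables "$1" --insert "$2"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"   # only show the command
    else
        "$@"                 # actually run csvsql
    fi
}

DRY_RUN=1 load_csv nodes nodes.csv
DRY_RUN=1 load_csv links links.csv
```

Calling load_csv without DRY_RUN runs csvsql for real, which requires csvkit and a reachable database.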

This tool has (in my opinion) three small drawbacks:

  1. It doesn’t allow you to see the insert statements. The CREATE TABLE statement is printed to stdout if you don’t provide a DB URL, but there is no way to see the insert statements.
  2. It can’t handle already existing tables. There is no --add-drop option to issue a DROP TABLE statement before creating the new table.
  3. It’s not able to autocreate an ID column. This complicates the first step of creating a usable CSV file to some extent, as you can see above.

So the last action is to create the primary key on the id column in both tables:
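
In PostgreSQL this is one ALTER TABLE ... ADD PRIMARY KEY statement per table. A minimal sketch, assuming the tables are named nodes and links after the CSV files and that psql can reach the database. The function name primary_key_sql and the connection URL in the comment are placeholders.

```shell
# Emit one ALTER TABLE statement per table created from the CSV files.
primary_key_sql() {
    for table in nodes links; do
        printf 'ALTER TABLE %s ADD PRIMARY KEY (id);\n' "$table"
    done
}

primary_key_sql
# To apply the statements, pipe them into psql, for example:
# primary_key_sql | psql "postgresql://user:password@localhost/mydb"
```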

 

Even with these little drawbacks the csvkit tool chain has numerous applications. Give it a spin!