Downloading and storing GTFS Schedule data
This tutorial and its repository show how to build a NodeJS app that ingests GTFS data into a PostgreSQL database, in three main steps:
- download a GTFS ZIP file from the web
- unzip it to reveal a list of TXT files
- save each row of each file into the database.
Data ingestion is the process of obtaining and importing data for immediate use or storage in a database. – Source
Let's get started!
0. Prerequisites
First make sure you have the prerequisites listed here, then fork our gtfs-app-setup repository to save time on creating the development environment.
Next, delete the setup elements you don't actually need for this project, namely the frontend, the Express server, and the sample GTFS data.
- Delete the ./public/ and ./views/ frontend directories (with their contents).
- Delete the ./sample-data/ directory containing sample GTFS data.
- In ./docker-compose.yml, delete the volumes related to the deleted elements:
services:
node-app-service:
volumes:
- ./public/:/gtfs-app/public/ # delete this
- ./views/:/gtfs-app/views/ # and this
postgresql-database-service:
volumes:
- ./sample-data/:/docker-entrypoint-initdb.d/ # and also this
- Completely overwrite the ./src/index.ts file to simply log a message, for now, instead of creating an Express server:
console.log(`I'm an app!`);
- Move the ./src/getDatabasePool.ts file to a new path, ./src/database/getDatabasePool.ts, so it can live with the other database function you're about to create.
- Launch this simplified setup with docker-compose up --build. The "I'm an app!" message should appear in the terminal, and http://localhost:8001/ should give you access to pgAdmin.
If you already ran other projects using this setup, you might get a conflict error. Just clear the previous setup with docker stop $(docker ps -q), docker system prune -a, and docker volume rm $(docker volume ls -q), then try docker-compose up --build again.
The database is empty, and that's normal: the whole point of this project is to fill it up. Let's do that now!
1. File download
Downloading a file with NodeJS is a two-step process: first making an HTTPS request to the file's URL, then immediately writing the response to a local file. ~Do not pass GO, do not collect 200 €~ ahem I mean without intermediate processing of the response.
Both steps can be accomplished using modules already built into NodeJS, namely https and fs. The latter stands for "file system" and enables file handling, using streams among other things, which are perfect for this use case.
Streams are a way to handle reading/writing files, network communications, or any kind of end-to-end information exchange in an efficient way. What makes streams unique, is that instead of a program reading a file into memory all at once like in the traditional way, streams read chunks of data piece by piece, processing its content without keeping it all in memory. – Source
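To make that concrete, here's a tiny standalone sketch (nothing to add to the project, and log.txt is just a placeholder filename) contrasting the traditional all-at-once read with a streamed, chunk-by-chunk read:
import { createReadStream, readFileSync } from 'fs';
// Traditional way: the entire file is loaded into memory before you can use it.
const wholeFile = readFileSync('log.txt', 'utf-8');
console.log('Read the whole file at once:', wholeFile.length, 'characters');
// Stream way: the file arrives chunk by chunk and is never held in memory all at once.
const readStream = createReadStream('log.txt', 'utf-8');
readStream.on('data', (chunk) => console.log('Received a chunk of', chunk.length, 'characters'));
readStream.on('end', () => console.log('Finished streaming the file'));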
- Find the URL of the GTFS file you want to ingest, or keep the Swiss data used in this tutorial (from this permalink).
- Create the new file ./src/downloadGtfs.ts to download a GTFS ZIP file from a given URL, to a given filename:
import https from 'https';
import { WriteStream, createWriteStream } from 'fs';
export const downloadGtfs = (sourceUrl: string, targetFilename: string): void => {
const fileWriteStream: WriteStream = createWriteStream(`${targetFilename}.zip`);
https
.get(sourceUrl, (response) => {
response.pipe(fileWriteStream);
fileWriteStream.on('finish', () => {
console.log('GTFS file download completed:', fileWriteStream.path);
fileWriteStream.close();
});
})
.on('error', (error) => {
console.log(error);
});
};
- Overwrite ./src/index.ts to call the downloader on your chosen source URL and target filename:
import { downloadGtfs } from './downloadGtfs';
const app = async (): Promise<void> => {
const sourceUrl = 'https://opentransportdata.swiss/fr/dataset/timetable-2024-gtfs2020/permalink';
const zipFilename = 'gtfsSwitzerland';
downloadGtfs(sourceUrl, zipFilename);
};
app();
Thanks to our setup's use of Nodemon, the app automatically reloads on save, runs, and displays the confirmation message "GTFS file download completed: gtfsSwitzerland.zip" in the terminal. Great! But where is the downloaded file?
It's right there, at the root of the repository, inside the container. That's annoying. Let's make it accessible from outside the container, i.e. in VSCode's file explorer.
- Create a ./downloads/ directory at the root of the repository.
- Create a ./downloads/.gitkeep file to save the empty directory in the git repository.
- Add two lines to ./.gitignore to exclude the other contents of that directory from git:
downloads/*
!downloads/.gitkeep
- In ./docker-compose.yml, add a volume to bind the new directory inside and outside the container:
services:
node-app-service:
volumes:
- ./downloads:/gtfs-app/downloads/
Volumes are (...) mechanisms for persisting data generated by Docker containers. Docker volumes basically create a link between one of the local folders on the system and the folder on the docker container. These folders have data replication so that even if a container gets restarted/deleted at some point in time, you will still have access to the data that the container generated. – Source
- In ./src/downloadGtfs.ts, add the path to this new directory so that the downloaded files go there instead of the root of the repository:
const fileWriteStream: WriteStream = createWriteStream(`downloads/${targetFilename}.zip`);
Relaunch the containers with docker-compose up. This time, the downloaded file is visible in the file explorer. Open it: it should contain a bunch of TXT files.
Oh no! In our case it just contains "Redirecting....". Switzerland's GTFS permalink actually redirects to, rather than serve, the latest ZIP file.
- Edit ./src/downloadGtfs.ts to follow redirects using a recursive function, and wrap the whole thing in a Promise to ensure the app waits for each response:
import { WriteStream, createWriteStream } from 'fs';
import https from 'https';
export const downloadGtfs = async (sourceUrl: string, targetFilename: string): Promise<void> => {
const fileWriteStream: WriteStream = createWriteStream(`downloads/${targetFilename}.zip`);
return new Promise((resolve, reject) => download(sourceUrl, fileWriteStream, resolve, reject));
};
const download = (
sourceUrl: string,
fileWriteStream: WriteStream,
resolve: (value: void | PromiseLike<void>) => void,
reject: (reason?: any) => void
): void => {
https
.get(sourceUrl, (response) => {
const redirectUrl = response.headers?.location;
if (response.statusCode === 200) {
response.pipe(fileWriteStream);
fileWriteStream.on('finish', () => {
console.log('GTFS file download completed:', fileWriteStream.path);
fileWriteStream.close();
resolve();
});
} else if (response.statusCode === 302 && redirectUrl) {
console.log('Download redirected to:', redirectUrl);
return download(redirectUrl, fileWriteStream, resolve, reject);
} else {
reject(new Error(`Unsupported response code: ${response.statusCode}.`));
}
})
.on('error', (error) => {
console.log(error);
reject(error);
});
};
- Await the response in ./src/index.ts:
await downloadGtfs(sourceUrl, zipFilename);
Save your changes, then wait for the app to reload and output the "GTFS file download completed" message again. This might take a few minutes (it's a big file!). Done? Cool, this time the ZIP file contains the expected GTFS TXT files. Let's extract them.
2. File unzip
NodeJS doesn't have a built-in unzipper so you must use an extra package. The current consensus is that yauzl is the best tool for the job, so let's go with that.
- Install the package with npm install yauzl and npm install @types/yauzl --save-dev.
- Create a new file named ./src/unzipGtfs.ts based on yauzl's starter code. Notice how extracting files is very similar to downloading them: NodeJS reads from a source using a ReadStream, then saves the data in a target file using a WriteStream:
import yauzl from 'yauzl';
import { createWriteStream } from 'fs';
export const unzipGtfs = (sourceFilename: string): void => {
const sourceFile = `downloads/${sourceFilename}.zip`;
const destinationPath = `downloads/`;
yauzl.open(sourceFile, { lazyEntries: true }, (error, zipfile) => {
if (error) {
console.log(error);
throw error;
}
zipfile.readEntry();
zipfile.on('entry', (entry) => {
zipfile.openReadStream(entry, (error, readStream) => {
if (error) {
console.log(error);
throw error;
}
readStream.on('end', () => {
zipfile.readEntry();
});
readStream.pipe(createWriteStream(`${destinationPath}${entry.fileName}`));
});
});
zipfile.on('end', () => {
console.log('GTFS files unzipped:', destinationPath);
});
});
};
- Call this new function in ./src/index.ts:
// Add this import:
import { unzipGtfs } from './unzipGtfs';
// Add this line before the closing brace of the app function:
unzipGtfs(zipFilename);
Run docker-compose up --build again (as there are new packages). Now the TXT files are available in the ./downloads/ directory alongside the downloaded ZIP file. Nice, except they're incomplete. Also, it would be neater to extract them in their own sub-directory to enable importing multiple feeds at some point.
- Edit ./src/unzipGtfs.ts to extract the files into a new sub-directory of ./downloads/ named after the source ZIP file, creating it if it doesn't already exist:
// Add these imports:
import { existsSync, mkdirSync } from 'fs';
// Edit this line:
const destinationPath = `downloads/${sourceFilename}/`;
// Immediately under the edited line above, add these:
if (!existsSync(destinationPath)) {
mkdirSync(destinationPath);
}
- Wrap the whole thing in a promise like you did for ./src/downloadGtfs.ts, so NodeJS waits for every file to be fully extracted:
export const unzipGtfs = async (sourceFilename: string): Promise<void> => {
return new Promise((resolve, reject) => {
// ...
if (error) {
console.log(error);
// Use "reject" instead of "throw" in both "if (error) {}" conditions.
reject(error);
}
// ...
zipfile.on('end', () => {
console.log('GTFS files unzipped:', destinationPath);
// Add this line here:
resolve();
});
});
};
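Assembled, the whole ./src/unzipGtfs.ts could now look roughly like this (a sketch combining the previous steps; your version may differ in the details):
import yauzl from 'yauzl';
import { createWriteStream, existsSync, mkdirSync } from 'fs';
export const unzipGtfs = async (sourceFilename: string): Promise<void> => {
  const sourceFile = `downloads/${sourceFilename}.zip`;
  const destinationPath = `downloads/${sourceFilename}/`;
  // Create the per-feed sub-directory if it doesn't exist yet.
  if (!existsSync(destinationPath)) {
    mkdirSync(destinationPath);
  }
  return new Promise((resolve, reject) => {
    yauzl.open(sourceFile, { lazyEntries: true }, (error, zipfile) => {
      if (error) {
        console.log(error);
        reject(error);
        return;
      }
      zipfile.readEntry();
      zipfile.on('entry', (entry) => {
        zipfile.openReadStream(entry, (error, readStream) => {
          if (error) {
            console.log(error);
            reject(error);
            return;
          }
          // Move on to the next entry once this one is fully read.
          readStream.on('end', () => {
            zipfile.readEntry();
          });
          readStream.pipe(createWriteStream(`${destinationPath}${entry.fileName}`));
        });
      });
      zipfile.on('end', () => {
        console.log('GTFS files unzipped:', destinationPath);
        resolve();
      });
    });
  });
};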
- Finally, await the response in ./src/index.ts:
await unzipGtfs(zipFilename);
Save and wait for the app to run. Wonderful, the gtfsSwitzerland.zip file has been extracted to a new ./downloads/gtfsSwitzerland/ directory, and all of its TXT files are accessible and valid.
Now let's have a look at the database.
3. GTFS models and tables
A GTFS Schedule feed is basically a relational database in CSV format, where each TXT file corresponds to a table, each column to a table column, and each row to a record.
To easily manipulate tables, apps usually implement a Data Access Layer. In this case, a super basic set of objects will do the job just fine.
A Data Access Layer (DAL) is the communication bridge between your database and application, ensuring smooth data retrieval and management. – Source
- Create the new directory ./src/models/.
- In ./src/models/, create one TypeScript file per TXT file. Use singular names instead of the GTFS mishmash of singular and plural names, i.e. agency.ts, calendar.ts, route.ts, stop.ts, etc.
If you're not using the sample Swiss feed, you could have a slightly different list of TXT files. That's fine but make sure you create one model per TXT file, and that all properties are listed in the same order as the file's columns.
- In each file, create an object listing the model's properties/columns, and give them all a string type. For example, here's what src/models/agency.ts looks like:
export const agency = {
agency_id: { type: 'string' },
agency_name: { type: 'string' },
agency_url: { type: 'string' },
agency_timezone: { type: 'string' },
agency_lang: { type: 'string' },
agency_phone: { type: 'string' }
};
Note that agency_id is obviously supposed to be a unique identifier, hence a primary key in the database. And other properties in other models could be booleans, numbers, dates, etc. For now though, we're just making them all strings to match the source format from the TXT files.
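For comparison, here's a sketch of what ./src/models/stop.ts could look like, assuming your feed's stops.txt has these columns (check your own file's header row and keep the properties in the same order):
export const stop = {
  stop_id: { type: 'string' },
  stop_name: { type: 'string' },
  stop_lat: { type: 'string' },
  stop_lon: { type: 'string' },
  location_type: { type: 'string' },
  parent_station: { type: 'string' }
};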
- Create ./src/database/createTables.ts to generate a "CREATE TABLE" query for each model:
import { agency } from '../models/agency';
// + Import all the other models here.
export const createTables = (): void => {
const models: { [filename: string]: {} } = {
agency
// + Add all the other models here.
};
const queries = Object.keys(models).map((modelName) => {
const properties = Object.keys(models[modelName]).map((propertyName) => {
return `"${propertyName}" VARCHAR(255)`;
});
return `CREATE TABLE "${modelName}" (${properties.join(',\n')});`;
});
console.log(queries.join('\n'));
};
- Comment out everything in ./src/index.ts for now (you don't need to keep downloading and extracting files while testing the database), and only call the new createTables() function instead:
// Add this import:
import { createTables } from './database/createTables';
const app = async (): Promise<void> => {
// Only execute this line in app(), comment out the rest:
createTables();
};
Save. You can see the "CREATE TABLE" queries in the logs. They look right, so execute them!
- In ./src/database/createTables.ts, get the database pool, replace the console.log() with an actual query (which requires making the function async), and add a completion message:
// Add this import:
import { getDatabasePool } from './getDatabasePool';
// Make the function async and return a Promise:
export const createTables = async (): Promise<void> => {
// Add this line after the list of models:
const pool = getDatabasePool();
// Replace "console.log(queries.join('\n'));" with an actual query:
await pool.query(queries.join('\n'));
// Add this completion message at the end:
console.log('Database tables created.');
};
- Edit ./src/database/createTables.ts again to wrap the whole process in a try ... catch for good measure:
export const createTables = async (): Promise<void> => {
try {
// The existing logic goes here.
} catch (error) {
console.log(error);
throw error;
}
};
- In ./src/index.ts, await the now asynchronous function:
await createTables();
Save... You get the "Database tables created." message. Save again... Oh no, this time you get ERROR: relation "agency" already exists. Let's prevent this from happening.
- In ./src/database/createTables.ts, add a query to drop the previously created tables before trying to create them again:
return `
DROP TABLE IF EXISTS "${modelName}";
CREATE TABLE "${modelName}" (${properties.join(',\n')});
`;
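At this point, the whole ./src/database/createTables.ts could look roughly like this (a sketch assembling the previous steps, with only the agency model shown):
import { getDatabasePool } from './getDatabasePool';
import { agency } from '../models/agency';
// + Import all the other models here.
export const createTables = async (): Promise<void> => {
  try {
    const models: { [modelName: string]: {} } = {
      agency
      // + Add all the other models here.
    };
    const pool = getDatabasePool();
    // Build one DROP + CREATE statement per model, with every column as VARCHAR(255) for now.
    const queries = Object.keys(models).map((modelName) => {
      const properties = Object.keys(models[modelName]).map((propertyName) => {
        return `"${propertyName}" VARCHAR(255)`;
      });
      return `
        DROP TABLE IF EXISTS "${modelName}";
        CREATE TABLE "${modelName}" (${properties.join(',\n')});
      `;
    });
    await pool.query(queries.join('\n'));
    console.log('Database tables created.');
  } catch (error) {
    console.log(error);
    throw error;
  }
};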
Save; this time there's no error. And if you check pgAdmin on http://localhost:8001/ under Servers > GTFS > root > Schemas > public, you'll see that all the tables are there, though empty. So now all that's left to do is to read each TXT file and store its contents in the corresponding table. You're nearly done!
4. File to database
This is yet another two-step process using NodeJS' streams:
- First, reading the source TXT files with fs' read stream.
- Then, writing the data to the database using a CopyStreamQuery. This is a type of write stream that uses PostgreSQL's COPY feature, so the read stream can be piped directly into the database instead of generating and executing one or more INSERT INTO queries.
This process is made easy with pg-copy-streams, a package created by the author of node-postgres that enables exactly this.
Read: Import CSV File Into PostgreSQL Table, COPY and Understanding Streams in Node.js.
- Install the new dependency with npm install pg-copy-streams and npm install @types/pg-copy-streams --save-dev.
- Create a new file, ./src/importGtfs.ts, based on pg-copy-streams's starter code to import just one file (agency.txt) for now:
import { getDatabasePool } from './database/getDatabasePool';
import { ReadStream, createReadStream } from 'fs';
import { CopyStreamQuery, from } from 'pg-copy-streams';
import { pipeline } from 'node:stream/promises';
export const importGtfs = async (zipFilename: string): Promise<void> => {
const pool = getDatabasePool();
const client = await pool.connect();
const directory = `downloads/${zipFilename}`;
const filename = 'agency.txt';
const filepath = `${directory}/${filename}`;
const tableName = 'agency';
const fileReadStream: ReadStream = createReadStream(filepath);
const databaseWriteStream: CopyStreamQuery = client.query(
from(`
COPY "${tableName}"
FROM STDIN
DELIMITER ','
CSV HEADER
;`)
);
await pipeline(fileReadStream, databaseWriteStream);
fileReadStream.close();
console.log('Imported file:', filename);
};
- Call this function from the ./src/index.ts file, keeping all other instructions commented out for test speed:
// Add this import:
import { importGtfs } from './importGtfs';
const app = async () => {
// Have only these uncommented inside app():
const zipFilename = 'gtfsSwitzerland';
await importGtfs(zipFilename);
};
Run docker-compose up --build, wait to see "Imported file: agency.txt", then refresh pgAdmin, log in again, find the agency table, right-click it, select View/Edit Data > All Rows, and voilà: all the agency records are there. Cool, now onto the other files.
- Get all the filenames from the source directory, then loop over each file to import its data into the corresponding table (converting the mixed singular/plural filenames to the singular table names; the assembled loop is sketched below):
// Add this import:
import { readdirSync } from 'fs';
// Replace "const filename = 'agency.txt';" with this:
const filenames = readdirSync(directory);
// Wrap the rest of the process in this promise:
await Promise.all(
// Replace "const filename = 'agency.txt';" with a loop over each file:
filenames.map(async (filename) => {
// Replace "const tableName = 'agency';" with this:
const tableName = filename === 'agency.txt' ? 'agencies' : filename.replace(/s?\.txt$/, 's');
})
);
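Put together, and assuming the singular table names created in step 3, ./src/importGtfs.ts could now look roughly like this sketch:
import { getDatabasePool } from './database/getDatabasePool';
import { ReadStream, createReadStream, readdirSync } from 'fs';
import { CopyStreamQuery, from } from 'pg-copy-streams';
import { pipeline } from 'node:stream/promises';
export const importGtfs = async (zipFilename: string): Promise<void> => {
  const pool = getDatabasePool();
  const client = await pool.connect();
  const directory = `downloads/${zipFilename}`;
  const filenames = readdirSync(directory);
  await Promise.all(
    filenames.map(async (filename) => {
      const filepath = `${directory}/${filename}`;
      // "stops.txt" -> "stop", "calendar_dates.txt" -> "calendar_date", "agency.txt" -> "agency", etc.
      const tableName = filename.replace(/s?\.txt$/, '');
      const fileReadStream: ReadStream = createReadStream(filepath);
      const databaseWriteStream: CopyStreamQuery = client.query(
        from(`
          COPY "${tableName}"
          FROM STDIN
          DELIMITER ','
          CSV HEADER
        ;`)
      );
      // Stream the TXT file straight into its table.
      await pipeline(fileReadStream, databaseWriteStream);
      fileReadStream.close();
      console.log('Imported file:', filename);
    })
  );
};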
Save. This time all files get imported. Amazing, the app is pretty much done at this point! Try running it from start to finish.
- In ./src/index.ts, uncomment the previous steps and add a completion message at the end:
const app = async (): Promise<void> => {
// Uncomment all the steps:
const sourceUrl = 'https://opentransportdata.swiss/fr/dataset/timetable-2024-gtfs2020/permalink';
const zipFilename = 'gtfsSwitzerland';
await downloadGtfs(sourceUrl, zipFilename);
await unzipGtfs(zipFilename);
await createTables();
await importGtfs(zipFilename);
// Add this completion message:
console.log('GTFS successfully imported.');
};
- Delete the contents of the ./downloads/ directory.
- Rebuild the containers from scratch (for a complete test run) by running docker-compose stop && docker-compose rm -f && docker system prune -af --volumes && docker-compose up --build.
Wait for all the "Imported file: [filename]" messages to be logged, go to pgAdmin, reload, sign in, view the tables, and done!
The GTFS Schedule data has been successfully imported into the database.
Discounting the 4 minutes for the Docker build, the whole process took 4-5 minutes to run on my machine. Most of that time was spent downloading and extracting the 170 MB GTFS ZIP file. Not too bad for the ingestion of over 21 million records!
5. Clean up
Last thing: let's delete the files once they're unzipped and imported.
- Create the new file ./src/deleteFileOrDirectory.ts containing two functions:
import { unlink } from 'fs/promises';
import { rmdirSync } from 'fs';
export const deleteFile = async (filepath: string): Promise<void> => {
try {
await unlink(filepath);
console.log('Deleted file:', filepath);
} catch (error) {
console.log(error);
throw error;
}
};
export const deleteDirectory = (directoryPath: string): void => {
try {
rmdirSync(directoryPath);
console.log('Deleted directory:', directoryPath);
} catch (error) {
console.log(error);
throw error;
}
};
- In ./src/unzipGtfs.ts, call deleteFile() to remove the ZIP file once its contents have been extracted:
// Add this import:
import { deleteFile } from './deleteFileOrDirectory';
// Add this line under "console.log('GTFS files unzipped:', destinationPath);",
// and mark the zipfile "end" callback as async so the await is allowed:
await deleteFile(sourceFile);
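For reference, the zipfile "end" handler inside the Promise could now read like this:
zipfile.on('end', async () => {
  console.log('GTFS files unzipped:', destinationPath);
  // Remove the ZIP file now that everything has been extracted.
  await deleteFile(sourceFile);
  resolve();
});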
- In ./src/importGtfs.ts, call both deleteFile() and deleteDirectory() to remove the TXT files and their directory as soon as they're no longer useful:
// Add these imports:
import { deleteDirectory, deleteFile } from './deleteFileOrDirectory';
// Add this line under "console.log('Imported file:', filename);":
await deleteFile(filepath);
// Add this line right at the end, just before the function's closing brace:
deleteDirectory(directory);
Now the ./downloads/ directory will be empty at the start of each run.
Congratulations, you're done!
6. Bonus: database to file
It might come in handy to export the contents of the freshly filled database as SQL files at some point (e.g. to generate backups). Don't want/need to? Then feel free to skip this section!
- Create a ./sample-data/ directory at the root of the repository.
- Add it as a new volume in ./docker-compose.yml:
postgresql-database-service:
volumes:
- ./sample-data/:/sample-data/
- Rebuild the containers with docker-compose up --build, then wait for the script to run.
- In a new terminal, enter the database container with docker exec -it postgresql-database-container bash.
- Dump the contents of the database into the ./sample-data/ directory, as one SQL file per table, using the command for table in $(psql --tuples-only --command "SELECT table_name FROM information_schema.tables WHERE table_type='BASE TABLE' AND table_schema NOT IN ('pg_catalog', 'information_schema')"); do pg_dump --table $table > /sample-data/$table.sql; done;
You'll see the files appear one by one in VSCode's file explorer, inside the new directory (give the bigger ones a couple of minutes).
Note that GitHub only allows files up to 100 MB, so you might want to split up the bigger ones like here.
What's next?
To improve this project, you could:
- Replace the models' simple string definitions with GTFS-compliant formats and database constraints (e.g. primary/foreign keys, unique columns, indexes, etc.); see the sketch after this list for one possible direction.
- Update the database with fresh data rather than recreating it from scratch every time.
- Support the ingestion of multiple feeds in the same database.
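For instance, the first suggestion could start from a richer, purely hypothetical model shape like the one below; ./src/database/createTables.ts would then need to read these extra fields when generating its queries:
// A hypothetical richer agency model; the exact shape, types, and constraints are up to you.
export const agency = {
  agency_id: { type: 'VARCHAR(255)', primaryKey: true },
  agency_name: { type: 'TEXT', required: true },
  agency_url: { type: 'TEXT', required: true },
  agency_timezone: { type: 'VARCHAR(64)', required: true },
  agency_lang: { type: 'VARCHAR(16)' },
  agency_phone: { type: 'VARCHAR(64)' }
};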