diff options
Diffstat (limited to 'packages/pipeline/README.md')
-rw-r--r-- | packages/pipeline/README.md | 166 |
1 files changed, 166 insertions, 0 deletions
diff --git a/packages/pipeline/README.md b/packages/pipeline/README.md new file mode 100644 index 000000000..794488cac --- /dev/null +++ b/packages/pipeline/README.md @@ -0,0 +1,166 @@ +## @0xproject/pipeline + +This repository contains scripts used for scraping data from the Ethereum blockchain into SQL tables for analysis by the 0x team. + +## Contributing + +We strongly recommend that the community help us make improvements and determine the future direction of the protocol. To report bugs within this package, please create an issue in this repository. + +Please read our [contribution guidelines](../../CONTRIBUTING.md) before getting started. + +### Install dependencies: + +```bash +yarn install +``` + +### Build + +```bash +yarn build +``` + +### Clean + +```bash +yarn clean +``` + +### Lint + +```bash +yarn lint +``` + +### Migrations + +Create a new migration: `yarn migrate:create --name MigrationNameInCamelCase` +Run migrations: `yarn migrate:run` +Revert the most recent migration (CAUTION: may result in data loss!): `yarn migrate:revert` + +## Testing + +There are several test scripts in **package.json**. You can run all the tests +with `yarn test:all` or run certain tests seprately by following the +instructions below. Some tests may not work out of the box on certain platforms +or operating systems (see the "Database tests" section below). + +### Unit tests + +The unit tests can be run with `yarn test`. These tests don't depend on any +services or databases and will run in any environment that can run Node. + +### Database tests + +Database integration tests can be run with `yarn test:db`. These tests will +attempt to automatically spin up a Postgres database via Docker. If this doesn't +work you have two other options: + +1. Set the `DOCKER_SOCKET` environment variable to a valid socket path to use + for communicating with Docker. +2. Start Postgres manually and set the `ZEROEX_DATA_PIPELINE_TEST_DB_URL` + environment variable. If this is set, the tests will use your existing + Postgres database instead of trying to create one with Docker. + +## Running locally + +`pipeline` requires access to a PostgreSQL database. The easiest way to start +Postgres is via Docker. Depending on your platform, you may need to prepend +`sudo` to the following command: + +``` +docker run --rm -d -p 5432:5432 --name pipeline_postgres postgres:11-alpine +``` + +This will start a Postgres server with the default username and database name +(`postgres` and `postgres`). You should set the environment variable as follows: + +``` +export ZEROEX_DATA_PIPELINE_DB_URL=postgresql://postgres@localhost/postgres +``` + +First thing you will need to do is run the migrations: + +``` +yarn migrate:run +``` + +Now you can run scripts locally: + +``` +node packages/pipeline/lib/src/scripts/pull_radar_relay_orders.js +``` + +To stop the Postgres server (you may need to add `sudo`): + +``` +docker stop pipeline_postgres +``` + +This will remove all data from the database. + +If you prefer, you can also install Postgres with e.g., +[Homebrew](https://wiki.postgresql.org/wiki/Homebrew) or +[Postgress.app](https://postgresapp.com/). Keep in mind that you will need to +set the`ZEROEX_DATA_PIPELINE_DB_URL` environment variable to a valid +[PostgreSQL connection url](https://stackoverflow.com/questions/3582552/postgresql-connection-url) + +## Directory structure + +``` +. +├── lib: Code generated by the TypeScript compiler. Don't edit this directly. +├── migrations: Code for creating and updating database schemas. +├── node_modules: +├── src: All TypeScript source code. +│ ├── data_sources: Code responsible for getting raw data, typically from a third-party source. +│ ├── entities: TypeORM entities which closely mirror our database schemas. Some other ORMs call these "models". +│ ├── parsers: Code for converting raw data into entities. +│ ├── scripts: Executable scripts which put all the pieces together. +│ └── utils: Various utils used across packages/files. +├── test: All tests go here and are organized in the same way as the folder/file that they test. +``` + +## Adding new data to the pipeline + +1. Create an entity in the _entities_ directory. Entities directly mirror our + database schemas. We follow the practice of having "dumb" entities, so + entity classes should typically not have any methods. +2. Create a migration using the `yarn migrate:create` command. Create/update + tables as needed. Remember to fill in both the `up` and `down` methods. Try + to avoid data loss as much as possible in your migrations. +3. Add basic tests for your entity and migrations to the **test/entities/** + directory. +4. Create a class or function in the **data_sources/** directory for getting + raw data. This code should abstract away pagination and rate-limiting as + much as possible. +5. Create a class or function in the **parsers/** directory for converting the + raw data into an entity. Also add tests in the **tests/** directory to test + the parser. +6. Create an executable script in the **scripts/** directory for putting + everything together. Your script can accept environment variables for things + like API keys. It should pull the data, parse it, and save it to the + database. Scripts should be idempotent and atomic (when possible). What this + means is that your script may be responsible for determining _which_ data + needs to be updated. For example, you may need to query the database to find + the most recent block number that we have already pulled, then pull new data + starting from that block number. +7. Run the migrations and then run your new script locally and verify it works + as expected. + +#### Additional guidelines and tips: + +* Table names should be plural and separated by underscores (e.g., + `exchange_fill_events`). +* Any table which contains data which comes directly from a third-party source + should be namespaced in the `raw` PostgreSQL schema. +* Column names in the database should be separated by underscores (e.g., + `maker_asset_type`). +* Field names in entity classes (like any other fields in TypeScript) should + be camel-cased (e.g., `makerAssetType`). +* All timestamps should be stored as milliseconds since the Unix Epoch. +* Use the `BigNumber` type for TypeScript code which deals with 256-bit + numbers from smart contracts or for any case where we are dealing with large + floating point numbers. +* [TypeORM documentation](http://typeorm.io/#/) is pretty robust and can be a + helpful resource. |