## @0xproject/pipeline

This repository contains scripts used for scraping data from the Ethereum blockchain into SQL tables for analysis by the 0x team.

## Contributing

We strongly recommend that the community help us make improvements and determine the future direction of the protocol. To report bugs within this package, please create an issue in this repository.

Please read our [contribution guidelines](../../CONTRIBUTING.md) before getting started.

### Install dependencies:

```bash
yarn install
```

### Build

```bash
yarn build
```

### Clean

```bash
yarn clean
```

### Lint

```bash
yarn lint
```

### Migrations

*   Create a new migration: `yarn migrate:create --name MigrationNameInCamelCase`
*   Run migrations: `yarn migrate:run`
*   Revert the most recent migration (CAUTION: may result in data loss!): `yarn migrate:revert`
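A migration created this way is a TypeORM migration class with paired `up` and `down` methods. The following sketch uses local stand-ins for TypeORM's `MigrationInterface` and `QueryRunner` types so it is self-contained (in the real package you would import them from `typeorm`); the table name and SQL are hypothetical:

```typescript
// Local stand-ins for TypeORM's types so this sketch is self-contained.
interface QueryRunner {
    query(sql: string): Promise<void>;
}
interface MigrationInterface {
    up(queryRunner: QueryRunner): Promise<void>;
    down(queryRunner: QueryRunner): Promise<void>;
}

// Hypothetical migration: the table and columns are illustrative only.
export class CreateExampleEvents1546000000000 implements MigrationInterface {
    public async up(queryRunner: QueryRunner): Promise<void> {
        await queryRunner.query(
            `CREATE TABLE raw.example_events (event_id BIGINT PRIMARY KEY, block_number BIGINT NOT NULL)`,
        );
    }
    // down should exactly undo up so that `yarn migrate:revert` works.
    public async down(queryRunner: QueryRunner): Promise<void> {
        await queryRunner.query(`DROP TABLE raw.example_events`);
    }
}
```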

## Connecting to PostgreSQL

Across the pipeline package, any code which accesses the database uses the
environment variable `ZEROEX_DATA_PIPELINE_DB_URL` which should be a properly
formatted
[PostgreSQL connection url](https://stackoverflow.com/questions/3582552/postgresql-connection-url).
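Code that needs the database can read this variable directly and fail fast if it is missing. A minimal sketch (the helper name `getDbUrl` and its validation logic are illustrative, not part of the package):

```typescript
// Sketch: read ZEROEX_DATA_PIPELINE_DB_URL and fail fast if it is missing
// or does not look like a PostgreSQL connection url.
function getDbUrl(env: Record<string, string | undefined> = process.env): string {
    const dbUrl = env.ZEROEX_DATA_PIPELINE_DB_URL;
    if (dbUrl === undefined) {
        throw new Error('ZEROEX_DATA_PIPELINE_DB_URL environment variable must be set');
    }
    if (!dbUrl.startsWith('postgresql://') && !dbUrl.startsWith('postgres://')) {
        throw new Error(`Not a PostgreSQL connection url: ${dbUrl}`);
    }
    return dbUrl;
}
```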

## Test environment

The easiest way to start Postgres is via Docker. Depending on your
platform, you may need to prepend `sudo` to the following command:

```bash
docker run --rm -d -p 5432:5432 --name pipeline_postgres postgres:11-alpine
```

This will start a Postgres server with the default username and database name.
You should set the environment variable as follows:

```bash
export ZEROEX_DATA_PIPELINE_DB_URL=postgresql://postgres@localhost/postgres
```

The first thing you will need to do is run the migrations:

```bash
yarn migrate:run
```

Now you can run scripts locally:

```bash
node packages/pipeline/lib/src/scripts/pull_radar_relay_orders.js
```

To stop the Postgres server (you may need to prepend `sudo`):

```bash
docker stop pipeline_postgres
```

Because the container was started with `--rm`, stopping it will remove all data from the database.

If you prefer, you can also install Postgres directly with, e.g.,
[Homebrew](https://wiki.postgresql.org/wiki/Homebrew) or
[Postgres.app](https://postgresapp.com/). As long as you set the
`ZEROEX_DATA_PIPELINE_DB_URL` environment variable appropriately, any Postgres
server will work.

## Directory structure

```
.
├── lib: Code generated by the TypeScript compiler. Don't edit this directly.
├── migrations: Code for creating and updating database schemas.
├── node_modules: Installed dependencies. Don't edit this directly.
├── src: All TypeScript source code.
│   ├── data_sources: Code responsible for getting raw data, typically from a third-party source.
│   ├── entities: TypeORM entities which closely mirror our database schemas. Some other ORMs call these "models".
│   ├── parsers: Code for converting raw data into entities.
│   ├── scripts: Executable scripts which put all the pieces together.
│   └── utils: Various utils used across packages/files.
└── test: All tests go here and are organized in the same way as the folder/file that they test.
```

## Adding new data to the pipeline

1.  Create an entity in the _entities_ directory. Entities directly mirror our
    database schemas. We follow the practice of having "dumb" entities, so
    entity classes should typically not have any methods.
2.  Create a migration using the `yarn migrate:create` command. Create/update
    tables as needed. Remember to fill in both the `up` and `down` methods. Try
    to avoid data loss as much as possible in your migrations.
3.  Create a class or function in the _data_sources_ directory for getting raw
    data. This code should abstract away pagination and rate-limiting as much as
    possible.
4.  Create a class or function in the _parsers_ directory for converting the raw
    data into an entity. Also add tests in the _test_ directory to test the
    parser.
5.  Create an executable script in the _scripts_ directory for putting
    everything together. Your script can accept environment variables for things
    like API keys. It should pull the data, parse it, and save it to the
    database. Scripts should be idempotent and atomic (when possible). What this
    means is that your script may be responsible for determining **which** data
    needs to be updated. For example, you may need to query the database to find
    the most recent block number that we have already pulled, then pull new data
    starting from that block number.
6.  Run the migrations and then run your new script locally and verify it works
    as expected.
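Steps 3 and 4 above can be sketched as a pure parsing function. The raw field names and entity shape below are hypothetical (not the real radar-relay schema), but they show the pattern: snake_case strings from an API become a camelCase, entity-shaped object with timestamps in milliseconds since the Unix Epoch.

```typescript
// Hypothetical raw record as returned by a relayer API; names are
// illustrative only. Large integers typically arrive as strings.
interface RawOrder {
    order_hash: string;
    maker_token_amount: string;
    created_at: string; // ISO 8601 timestamp
}

// Entity-shaped plain object: camelCase fields, timestamp in ms since epoch.
interface SraOrderEntity {
    orderHash: string;
    makerTokenAmount: string;
    createdAt: number;
}

// Parsers should be pure functions like this so they are trivial to test.
function parseOrder(raw: RawOrder): SraOrderEntity {
    return {
        orderHash: raw.order_hash,
        makerTokenAmount: raw.maker_token_amount,
        createdAt: Date.parse(raw.created_at), // milliseconds since Unix Epoch
    };
}
```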

#### Additional guidelines and tips:

*   Table names should be plural and separated by underscores (e.g.,
    `exchange_fill_events`).
*   Any table which contains data which comes directly from a third-party source
    should be namespaced in the `raw` PostgreSQL schema.
*   Column names in the database should be separated by underscores (e.g.,
    `maker_asset_type`).
*   Field names in entity classes (like any other fields in TypeScript) should
    be camel-cased (e.g., `makerAssetType`).
*   All timestamps should be stored as milliseconds since the Unix Epoch.
*   Use the `BigNumber` type for TypeScript code which deals with 256-bit
    numbers from smart contracts or for any case where we are dealing with large
    floating point numbers.
*   [TypeORM documentation](http://typeorm.io/#/) is pretty robust and can be a
    helpful resource.
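As a quick illustration of the naming conventions above, camelCase entity fields map mechanically to underscore-separated column names. The helper below is purely illustrative (in the actual entities, column names are declared explicitly):

```typescript
// Illustrative only: shows how camelCase entity fields (e.g. makerAssetType)
// correspond to underscore-separated column names (maker_asset_type).
function toColumnName(fieldName: string): string {
    return fieldName.replace(/([A-Z])/g, upper => `_${upper.toLowerCase()}`);
}
```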