Fundamentals¶
Configuration¶
The primeight Python library is a unified interface to Cassandra.
Each Cassandra table is defined by a YAML configuration file.
This YAML file has a specific structure that is enforced by the parser.
If anything is not as expected, the parser raises a SyntaxError
and explains why.
Yaml structure¶
Required Fields¶
The YAML file has five required fields: version, keyspace, name, columns, and query.
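The required-field check described above can be sketched in a few lines of Python. This is only an illustration, not primeight's actual parser: the function name `check_required_fields` is hypothetical, and the configuration is assumed to be already loaded into a dict.

```python
REQUIRED_FIELDS = ("version", "keyspace", "name", "columns", "query")


def check_required_fields(config: dict) -> None:
    """Raise SyntaxError naming every required top-level field that is missing.

    Hypothetical helper for illustration; not part of the primeight API.
    """
    missing = [field for field in REQUIRED_FIELDS if field not in config]
    if missing:
        raise SyntaxError("missing required field(s): " + ", ".join(missing))


# A configuration already loaded into a dict (for example by a YAML loader).
config = {
    "version": "0.1",
    "name": "devices",
    "keyspace": "meight",
    "columns": {"device_id": {"type": "text"}},
    "query": {"base": {"required": {"id": "device_id"}}},
}
check_required_fields(config)  # all five fields present, so nothing is raised
```

Collecting every missing field before raising gives the user one complete message instead of one error per run.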
Version¶
The version field represents the version of the template.
This field is used to keep track of the updates done to a table.
Keyspace¶
The keyspace field specifies the keyspace a table belongs to.
Since there are cases where the data may be associated with a dynamic keyspace, the keyspace name can be overridden for each action (e.g. while querying or inserting data).
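As a rough illustration of the override behaviour, the sketch below picks the keyspace for a single action. The function `resolve_keyspace` and its `override` parameter are assumptions made for this example and are not taken from the primeight API.

```python
from typing import Optional


def resolve_keyspace(config: dict, override: Optional[str] = None) -> str:
    """Return the per-action keyspace if one is given, else the configured one.

    Hypothetical helper for illustration; not part of the primeight API.
    """
    return override if override is not None else config["keyspace"]


table_config = {"keyspace": "meight"}
default_keyspace = resolve_keyspace(table_config)
tenant_keyspace = resolve_keyspace(table_config, override="tenant_a")
```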
Name¶
The name field is the name of the table.
Columns¶
The columns field is where the table columns are defined.
It holds the column definitions, each an object with a required type field.
The object can be complemented with an alias name, a description,
and min and max values.
Columns can use any of the
Cassandra Data Types,
such as int, float, and text, or h3hex,
a custom type used to partition the data geo-spatially.
Cassandra Collections
data types are also supported.
For example, you can define a set of text attributes with set<text>, or
a more complex example may be to define a list of GPS points with
list<tuple<float,float,float>>.
If you require the definition of minimum (min) and maximum (max) values,
both values must be numbers (i.e. int or float).
Finally, since we also allow segmentation by keyspace,
a table can define its keyspace in the YAML configuration using the keyspace field,
or on a per-operation basis.
The Cassandra table columns may also be complemented
with generated_columns (see Generated Columns).
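The column rules above (a required type, and numeric min and max values) could be checked along these lines. This is a minimal sketch under stated assumptions: `check_column` is a hypothetical name, and raising SyntaxError simply mirrors the parser behaviour described earlier.

```python
from numbers import Real


def check_column(name: str, spec: dict) -> None:
    """Check one column definition: type is required, min/max must be numeric.

    Hypothetical helper for illustration; not part of the primeight API.
    """
    if "type" not in spec:
        raise SyntaxError(f"column '{name}' has no type")
    for bound in ("min", "max"):
        value = spec.get(bound)
        # bool is a subclass of int in Python, so reject it explicitly.
        if value is not None and (isinstance(value, bool) or not isinstance(value, Real)):
            raise SyntaxError(f"column '{name}': {bound} must be an int or float")


check_column("speed", {"type": "float", "min": 0, "max": 250.5})  # valid
check_column("tags", {"type": "set<text>"})  # valid, no bounds given
```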
Query¶
The query field is used to optimize queries.
Here you can define the queries performed on the Cassandra table,
and internally the system will optimize for speed.
The base query is always required; ideally, it should be the most frequently used query.
It reflects the Cassandra table, while the remaining queries are
Materialized Views derived from this table.
Every query has three types of columns:
the required, the optional, and all the remaining columns.
The required columns are the columns that should be specified each time you do a query.
They can be segmented by time, space, and/or id.
When using time and space as required columns, you need to use generated columns.
This enforces a consistent search pattern.
On the other hand, when using an id you can use any column.
The optional field is used to enumerate columns that may be used for
filtering with methods like between, or higher_then, or any of the available methods.
Be aware that the order in the optional field is crucial,
as you cannot restrict a column without also defining the attributes that precede it.
Additionally, you can specify an order to sort how the data is saved.
There are two options, asc for ascending order, and desc for descending order.
Note: You can only specify the order for required or optional attributes.
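One way to read the ordering rule for the optional field is as a prefix check: the columns being filtered must form a leading run of the optional list. The sketch below illustrates that reading only; `check_optional_filters` and the column names are hypothetical and not part of primeight.

```python
def check_optional_filters(optional: list, filtered: set) -> None:
    """Ensure the filtered columns form a leading run of the optional list.

    Hypothetical helper for illustration; not part of the primeight API.
    """
    unknown = filtered - set(optional)
    if unknown:
        raise ValueError(f"not optional columns: {sorted(unknown)}")
    gap_seen = False
    for column in optional:
        if column not in filtered:
            gap_seen = True  # every later column must now be unfiltered too
        elif gap_seen:
            raise ValueError(f"'{column}' is filtered but a preceding column is not")


optional_columns = ["tsin", "speed", "heading"]
check_optional_filters(optional_columns, {"tsin"})           # accepted
check_optional_filters(optional_columns, {"tsin", "speed"})  # accepted
# check_optional_filters(optional_columns, {"speed"}) would raise ValueError,
# because "tsin" precedes "speed" and is not restricted.
```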
Optional Fields¶
Generated Columns¶
Generated columns are columns that are created automatically using
other attributes through the predefined generators.
They are defined using the generated_columns field,
which takes a mapping of generator to input attribute(s).
When specifying more than one attribute, separate the attributes using a comma.
For example:
...
generated_columns:
  month: tsin
  h3: lat,lon
...
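To make the comma-separated convention concrete, the sketch below splits each generated_columns entry into its generator name and list of input attributes. The helper `parse_generated_columns` is hypothetical and only illustrates the format; it is not primeight's own parsing code.

```python
def parse_generated_columns(generated_columns: dict) -> dict:
    """Map each generator name to the list of input attributes it reads.

    Hypothetical helper for illustration; not part of the primeight API.
    """
    return {
        generator: [attr.strip() for attr in attributes.split(",")]
        for generator, attributes in generated_columns.items()
    }


parsed = parse_generated_columns({"month": "tsin", "h3": "lat,lon"})
# parsed == {"month": ["tsin"], "h3": ["lat", "lon"]}
```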
Generators¶
- day: receives a timestamp attribute and produces a timestamp for the same day at midnight
- week: receives a timestamp attribute and produces a timestamp for the first day of the week at midnight
- month: receives a timestamp attribute and produces a timestamp for the first day of the month at midnight
- year: receives a timestamp attribute and produces a timestamp for the first day of the year at midnight
- hX: receives latitude and longitude coordinates to produce an h3 hexadecimal identifier of level X. Available levels span from 3 to 12.
- hX_begin: receives latitude and longitude coordinates to produce an h3 hexadecimal identifier of level X. Available levels span from 3 to 12. This identifier is intended to specify the beginning of something, e.g. a trip.
- hX_end: receives latitude and longitude coordinates to produce an h3 hexadecimal identifier of level X. Available levels span from 3 to 12. This identifier is intended to specify the end of something, e.g. a trip.
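The time generators listed above amount to truncating a timestamp to the start of a period. The following sketch shows that idea with the standard library; it is not primeight's implementation, the function names simply mirror the generator names, and treating Monday as the first day of the week is an assumption.

```python
from datetime import datetime, timedelta


def day(ts: datetime) -> datetime:
    """Midnight of the same day."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def week(ts: datetime) -> datetime:
    """Midnight of the first day of the week (assumed here to be Monday)."""
    return day(ts) - timedelta(days=ts.weekday())


def month(ts: datetime) -> datetime:
    """Midnight of the first day of the month."""
    return day(ts).replace(day=1)


def year(ts: datetime) -> datetime:
    """Midnight of the first day of the year."""
    return day(ts).replace(month=1, day=1)


ts = datetime(2021, 3, 18, 14, 30, 5)  # a Thursday
buckets = {name: fn(ts) for name, fn in
           [("day", day), ("week", week), ("month", month), ("year", year)]}
```

Because every timestamp within the same period maps to the same value, the generated column can serve as a stable bucket for time-based partitioning.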
Examples¶
The simplest example is:
version: '0.1'
name: 'devices'
keyspace: 'meight'
columns:
  device_id:
    type: text
query:
  base:
    required:
      id: device_id
Partitions¶
One of Cassandra's strongest points is how data is arranged and stored, which makes it quickly accessible. This mechanism revolves around partitions. primeight recognises three major types of partition: time, space, and ids. This makes the structuring of the tables consistent and fast to access.
Each table has its partitions defined in the query required attribute.