Skip to main content

Queries

In Shapelets, data is stored in DataFrame-like data structures of type dataset. These are tabular structures with an index, rows and columns, in which data is stored.

In order to access data stored in Shapelets datasets, you can execute SQL queries directly or, alternatively, you can write queries in python using a simple syntax based on list comprenhensions.

If you know how to operate with lists, you know how to operate Shapelets datasets and you don’t need to learn anything. Learn more about list comprehensions here.

For this section, we’ll start by loading a well-known dataset, the iris dataset. We will do this using the load_test_data() function.

import shapelets as sh

session = sh.sandbox()
data = session.load_test_data()
data.head()
note

Remember to create a Shapelets session first to work with the Smart Storage System.

Shapelets Syntax

Shapelets has a very pythonic API, so if you know how to code in Python you will move very fast with Shapelets.

You can use the map() function in order to write list comprehension queries to interact with the data.

session.map(x for x in data)

However, if you are more familiar with SQL or feel more comfortable using SQL syntax, you can use the execute_sql() function which will help you access any file and write SQL queries.

r = session.execute_sql(f'SELECT * FROM {data.alias} LIMIT 10')

The object returned by execute_sql is a Shapelets dataset object.

Selecting Data

Shapelets provides a function called map() in which we can write list comprehension loops that will be used to query the dataset. For example, to get all the rows, you can use:

session.map(x for x in data)

The map() function returns a DataSet. This code will return all rows with all columns. This functions retrieves the same result if you execute a SELECT * FROM TABLE query in a classical SQL Database.

If you want to select single columns, you can do:

 # Select Sepal_Width and Petal_Width columns from dataset.

session.map((x.Sepal_Width, x.Petal_Width) for x in data)

What about filters? If you want select data using a filter, you only need an if statement in your list comprehension. Here is an example:

# Select rows that Petal_Length > 1.4

session.map(x for x in data if x.Petal_Length > 1.4)

If we want to sort the results of a query, we can use the sort_by() function of a dataset and specify by which column(s) we want to sort. This functions receive two parameters, the columns you want to sort the dataset, and a boolean parameter to indicate whether you want an ascending or descending order. By default, this method sort the results in ascending ordering.

  • If you want an ascending order, pass a string with the column name to the function.
data.sort_by('Sepal_Length')
  • If you want a descending order, include False in sort_by()
data.sort_by('Sepal_Length', False)
  • If you want to combine, or sort by multiple columns, just pass lists with the values to the function.
data.sort_by(['Sepal_Length', 'Petal_Length'], [False, True])
note

You can visualize the results using the functions explained in Contents of a DataSet

Aggregate Data

In order to compute some statistics for our data grouping by some columns, in a SQL-based syntax you would typically use the GROUP BY statement.

In Shapelets, the syntax for querying data are list comprehensions, so you only need to call agregate functions in the list comprehension to get a group by clause. For example, if you want to get the average of a column grouping by another column that has string values, you can do it this way:

from shapelets.functions import avg

session.map((x.Class, avg(x.Sepal_Length)) for x in data)

You can see the complete list of aggregation functions in the API Reference.

Manipulate Data

Add a column

In Shapelets you can add/append a new column to the DataSet using the add_column() function.

data.add_column('new_feature', lambda row: row.Sepal_Length*row.Sepal_Width)

In this example we create a new column as the result of multiplying two existing columns.

Rename a column

You can rename a column using rename_columns()

data.rename_columns({'Sepal_Length': 'Lenght_sepal'})

Remove a column

You can remove a column using drop_columns()

data.drop_columns(['Class'])

Getting a subset of columns

You can select a subset of columns using the select_columns() function. Here’s an example:

data.select_columns(['Sepal_Length', 'Petal_Length'])

Also, you can pass to the function indexes of columns like:

data.select_columns([0, 2])

In addition, you can do it using the map() function selecting the column that you want

session.map(x.Sepal_Length, x.Petal_Length for x in data)
note

You can visualize the results using functions viewed in Contents of a DataSet

Join Data

Shapelets Join

Shapelets provides a way for easily combining different datasets. Let’s see this with examples! First, we need a new dataset to join with our dataset.

import pandas as pd
df_uses = pd.DataFrame(
{
'Class': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
'common_uses': ['Unknown', 'Stimulating', 'Creating ointments']}
)

In general, there are different ways of joining tables such as: left join, right join, inner join, outer join or full join. Let’s see these with some examples.

If you want to join current dataset a new dataset, you can do for example a full join like this

session.map(
(x.Class, y.common_uses)
for x in data
for y in uses_dataset
if (x.Class == y.Class)
)

If you want another kind of join, for example a left join you should use the left() function with the dataset that you want to merge. Here is an example:

from shapelets.functions import left

session.map(
(x.Class, y.common_uses)
for x in data
for y in left(uses_dataset)
if (x.Class == y.Class)
)