Query Amazon S3 Like a Relational Database to Provide Fresh Skill Content

Michael Griffiths Oct 23, 2019

Having fresh and customized content goes a long way in keeping customers engaged with your skill. This means you will need to not only include data in your skill, but regularly update it as well. There are a number of ways to include data in your skill, from embedding it directly in the code, to calling an external service, to storing it in a database. The best option depends on the use case and the data. However, for use cases like a lookup table or a single-table query, Amazon S3 offers an inexpensive and simple option called S3 Select. This service treats a file as a relational database table against which read-only queries can retrieve data.

How Amazon S3 Select Works

S3 Select allows you to treat individual files (or objects) stored in an S3 bucket as relational database tables. You can issue SQL queries like “SELECT column1 FROM S3Object WHERE column2 > 0” against a single file at a time to retrieve data from it. By using this feature, your skill can query data without having to embed information in your code, set up or access an external service, or manage a database. For example, if your skill needs information about only New York City from a list of national events, or needs to look up translations of descriptions based on the user’s locale, it can query for just the needed data from files in S3.

Using S3 Select also allows you to easily update the content. Simply upload a new version of the file to your S3 bucket. Data files can be in CSV, JSON, or Apache Parquet format.

Note: If you have non-technical staff updating your content, consider using CSV files since they can be edited by popular spreadsheet programs.
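For reference, a minimal lookup file for the examples in this post might look like the following (the file name, columns, and rows are illustrative; the header row is what lets the queries below reference columns by name):

zip,timezone
98101,America/Los_Angeles
10001,America/New_York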

Additionally, with S3 Select, no provisioning is required. Upload the file and you’re ready to go. When there’s a surge in your skill usage, you do not need to make any adjustments. Each request counts just like any other GET request to S3, so if you are eligible for the AWS Free Tier, these requests will count toward your usage. Another advantage of this approach is that your Lambda function doesn’t need a higher memory setting, since it processes only the selected data rather than loading the entire file into memory.

Next, we’ll walk through the code to use S3 Select for both Python and Node.js skills. In these examples, we are selecting a time zone for a given zip code from a CSV formatted file. This example would be useful if you wanted to look up the time in another city.

Note: The Alexa Settings API is a great way to easily get the time zone for the device communicating with your skill.
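If you want to take that route instead, here is a minimal sketch using the ASK SDK for Python. It assumes your skill builder is configured with an API client (for example, CustomSkillBuilder(api_client=DefaultApiClient())) so that the service client factory is available in your handler:

# Sketch: fetch the device's time zone from the Alexa Settings API.
# Assumes handler_input comes from a skill configured with an API client.
device_id = handler_input.request_envelope.context.system.device.device_id
ups_service = handler_input.service_client_factory.get_ups_service()
time_zone = ups_service.get_system_time_zone(device_id)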

How to Add S3 Select to Your Python Skill

To add S3 Select to your Python skill, you first need to ensure the AWS SDK for Python (boto3) is imported. The package is automatically included in the AWS-provided Python Lambda runtimes, so you won’t need to add it to your requirements file. Next, create an S3 client object:

import boto3
s3 = boto3.client('s3')

At the point in your code where you want to select data, add the following block, modifying the Bucket, Key, Expression, and other attributes to match your situation.

try:
    response = s3.select_object_content(
        Bucket='bucket-name',
        Key='datafiles/zipcode_lookup.csv',
        Expression="SELECT timezone FROM S3Object s WHERE s.zip = '98101' LIMIT 1",
        ExpressionType='SQL',
        InputSerialization={
            'CSV': {
                'FileHeaderInfo': 'USE'
            },
            'CompressionType': 'NONE'
        },
        OutputSerialization={
            'CSV': {}
        },
        RequestProgress={
            'Enabled': False
        }
    )

    data = ""
    event_stream = response['Payload']
    end_event_received = False
    for event in event_stream:
        # Capture the data from each Records event; the payload arrives
        # as bytes, so decode it before appending to the result string.
        if 'Records' in event:
            data += event['Records']['Payload'].decode('utf-8')
        elif 'Progress' in event:
            print(event['Progress']['Details'])
        elif 'End' in event:
            print('Result is complete')
            end_event_received = True
    if not end_event_received:
        raise RuntimeError("End event not received, request incomplete.")
except Exception as e:
    print(e)
    raise e

From the code, you can see that S3 Select returns an event stream. The “for” loop then processes each event, looking for the Records events. That’s it! Your skill now has the ability to read from a CSV file like it’s a database.
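Because the output serialization is CSV, the data variable holds raw result text, including a trailing newline. A small follow-on step (assuming the sample file shown earlier) extracts the bare value:

# Strip the trailing newline from the CSV result; an empty string
# means no row matched the zip code.
timezone = data.strip()
print(timezone)  # e.g. America/Los_Angeles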

How to Add S3 Select to Your Node.js Skill

To add S3 Select to your Node.js skill, you first need to require the AWS SDK. It is automatically included in the Node.js Lambda runtimes, so you don’t need to add it to your package.json. Next, create an S3 client object.

Note: The example below is set up to work with an Alexa-hosted skill, but it can also work in any Lambda function.

const aws = require('aws-sdk');
const s3 = new aws.S3();

Add the following block to set up querying for the data, modifying the bucketName, keyName, query, and other attributes to match your situation.

const lookupTimezone = (zipCode) => {
    return new Promise((resolve, reject) => {
        try {
            const bucketName = process.env.S3_PERSISTENCE_BUCKET;
            const keyName = 'zipcode.csv';
            const query = `SELECT timezone FROM S3Object s WHERE s.zip = '${zipCode}' LIMIT 1`;
            let returnVal = 0;

            const params = {
                Bucket: bucketName,
                Key: keyName,
                ExpressionType: 'SQL',
                Expression: query,
                InputSerialization: {
                    CSV: {
                        FileHeaderInfo: 'USE',
                    },
                    CompressionType: 'NONE',
                },
                OutputSerialization: {
                    CSV: {},
                },
            };

            console.log('start select');
            s3.selectObjectContent(params, (err, data) => {
                if (err) {
                    // Stop here; data is undefined when the call fails.
                    reject(err);
                    return;
                }

                const eventStream = data.Payload;

                eventStream.on('data', (event) => {
                    if (event.Records) {
                        // The Records payload is a Buffer containing the CSV result
                        returnVal = event.Records.Payload.toString();
                    } else if (event.Stats) {
                        // console.log(`Processed ${event.Stats.Details.BytesProcessed} bytes`);
                    } else if (event.End) {
                        // console.log('SelectObjectContent completed');
                    }
                });

                // Handle errors encountered while streaming the results
                eventStream.on('error', (streamErr) => {
                    reject(streamErr);
                });
                eventStream.on('end', () => {
                    // Finished receiving events from S3
                    console.log(`returning: ${returnVal}`);
                    resolve(returnVal);
                });
            });
        } catch (e) {
            console.log(e);
            reject(e);
        }
    });
};

From the code, you can see that S3 Select returns an event stream. Given the asynchronous nature of Node.js, the call is wrapped in a promise so all the events can be processed before the requesting code resumes. (If you are new to promises, check out this blog post about requesting data from an external API.) The handler for the stream’s data event watches for the Records events that carry the query results. At the point in your code where you want to query the data, call the function from within an async function using this code:

const timezone = await lookupTimezone('98101');

That’s it! Your skill now has the ability to read from a CSV file like it is a database.

Data Access Patterns

The S3 Select feature is best suited for read-heavy scenarios. If your use case includes data specific to a user, check out persistent attributes. Persistent attributes store data in S3 or Amazon DynamoDB using the customer’s userId as the primary key. If your use case includes occasional writes to the S3 file, and a slight delay in making those updates is acceptable, consider queueing the updates in an Amazon SQS queue and having a separate Lambda function apply them. However, if you need to routinely both read and write data as part of your skill, the eventual consistency model of S3 for updates to an existing object might not be the best fit. You may be better served by Amazon DynamoDB or Amazon Aurora Serverless instead.
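As a point of comparison, here is a minimal sketch of reading and writing persistent attributes with the ASK SDK for Python. It assumes your skill builder is configured with a persistence adapter, such as the S3 adapter used by Alexa-hosted skills; the attribute name and value are illustrative:

# Sketch: store per-customer data; the SDK keys it by the customer's userId.
attrs = handler_input.attributes_manager.persistent_attributes
attrs['last_zip_code'] = '98101'  # illustrative attribute
handler_input.attributes_manager.persistent_attributes = attrs
handler_input.attributes_manager.save_persistent_attributes()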

Conclusion

Using Amazon S3 like a relational database provides an easy way to include fresh and dynamic content in your skill. Content can be updated by non-technical staff, and easily searched and retrieved by your skill code, so you can provide an engaging experience for your customers. We’re excited to see how you improve your customer experience with these tips!

Resources

Amazon S3 Select Documentation

Blog: Making HTTP Requests to Get Data from an External API using the ASK SDK for Node.js V2

Boto3 SDK Documentation for S3

Node.js SDK Documentation for S3