Monthly Archives: September 2019

How does S3 generate the URL with putObject method?

Recently, I noticed a question on a forum about the AWS SDK S3Client class.

The person was using the putObject method of S3Client to upload a file to an Amazon S3 bucket.

After that, he needed to figure out the URL that could be used to access that file. He had worked out that a file uploaded as cat.gif could be accessed at the URL “https://s3.eu-west-3.amazonaws.com/aws.mybucket.es/mysite/httpdocs/cat.gif”.

The problem was that when he uploaded a file whose name included special characters, such as an accented o – “ó” – he couldn’t figure out a consistent way to construct the URL. A character with an accent got URL encoded, but the parenthesis character in a file name did not!
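
To see why guessing is fragile, here is a small JavaScript sketch. It is not the SDK's own algorithm, just an illustration that a standard encoder such as encodeURIComponent percent-encodes an accented character but leaves parentheses alone. The file name is made up.

```javascript
// Illustration only: this is NOT how the AWS SDK builds object URLs.
// It shows that a common, standards-based encoder treats "ó" and "(" differently.
const fileName = "canción (1).gif"; // hypothetical file name

// encodeURIComponent escapes non-ASCII characters and spaces as UTF-8 percent
// sequences, but leaves the characters - _ . ! ~ * ' ( ) untouched.
const encoded = encodeURIComponent(fileName);

console.log(encoded); // canci%C3%B3n%20(1).gif
```

Other encoders make different choices about which characters to escape, which is exactly why reverse-engineering the scheme is a losing game.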

He was trying to figure out the implementation details for the putObject method, and couldn’t find any documentation about it.

The answer to his question was that he was asking the wrong question! There’s a software principle that you should “write code to the interface, not to the implementation”.

As consumers of the S3Client API, we should not be trying to figure out the URL to an uploaded file. Rather, we should be asking the interface for the URL. If AWS revealed the details of their URL construction scheme, it would be very painful if they ever decided to change it, both for them and for users of S3. Further, programmers everywhere would be forced to implement the algorithm that AWS declared for URL construction in all the different languages that are supported by the AWS SDK. That’s a lot of duplicated effort.

Fortunately, AWS gives us an interface that can be used to obtain the URL after a file is uploaded. The result of S3Client->putObject contains an ObjectURL property. We can use that to get the URL, which we can record however we want for later use. Here’s an example:

...
$result = $s3->putObject(...);
$url = $result['ObjectURL'];
...

The full source code for this example of using the S3Client putObject method is on GitHub.

So you see that there’s no need to figure out how AWS implements the URL for our file. AWS gives us the URL immediately when our file is uploaded.

Got comments? Send me an email at fullstackdev@fullstackoasis.com. If you found this interesting, you can hit the subscribe button above. I post new content about once a week.

How to use Amazon AWS Translate with PHP 7.0

Amazon AWS Translate is a pretty cool translation service. You can get started free of charge. Let’s give it a try. This demo assumes you’ve got an AWS account (if not, first go get that). I’m using PHP 7.0 on an Ubuntu 16.04 box.

First, create a new IAM (Identity and Access Management) group. Let’s call it TranslateGroup. Give it TranslateReadOnly permissions. Don’t know how to do this? Sign into your AWS console, and search for “IAM”. That will take you to the right place for dealing with IAM.

Add a new user to this group. Let’s call this user TranslateUser. Give it programmatic access only.

When you see your Access key ID and secret, copy them into your AWS credentials file (in Linux, this is located under ~/.aws/credentials). Set the header for the profile to be [TranslateUser].

Now that you’ve created a user, make sure you’ve installed the AWS PHP SDK. I did this in my demo directory, just by downloading the SDK and unzipping it. The contents of my directory are pretty simple:

~/TranslateDemo$ ls -lairt
total 164
18226436 drwxr-xr-x   3 fullstackdev fullstackdev     4096 Jul 11 15:06 Psr
18226304 drwxr-xr-x   2 fullstackdev fullstackdev     4096 Jul 11 15:06 JmesPath
18226324 drwxr-xr-x   7 fullstackdev fullstackdev     4096 Jul 11 15:06 GuzzleHttp
18226301 -rw-r--r--   1 fullstackdev fullstackdev   129259 Jul 11 15:06 aws-autoloader.php
18226446 drwxr-xr-x 197 fullstackdev fullstackdev    12288 Jul 11 15:06 Aws
 6961244 -rw-rw-r--   1 fullstackdev fullstackdev      958 Sep 16 20:32 test_translate.php
...

It’s quick and easy to code up the rest. Here’s some demo code (test_translate.php):

<?php
require './aws-autoloader.php';

use Aws\Translate\TranslateClient;
use Aws\Exception\AwsException;

$client = new TranslateClient([
    'profile' => 'TranslateUser',
    'region' => 'us-west-2',
    'version' => 'latest'
]);

// Translate from English (en) to Spanish (es).
$currentLanguage = 'en';
$targetLanguage = 'es';
$textToTranslate = "Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.";

echo "Calling translateText function on '".$textToTranslate."'\n";

try {
    $result = $client->translateText([
        'SourceLanguageCode' => $currentLanguage,
        'TargetLanguageCode' => $targetLanguage,
        'Text' => $textToTranslate,
    ]);
    echo $result['TranslatedText']."\n";
} catch(AwsException $e) {
    // output error message if fails
    echo "Failed: ".$e->getMessage()."\n";
}

Run this from the command line: php test_translate.php. The output is:

Calling translateText function on 'Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.'
Llámame Ishmael. Hace algunos años, no importa cuánto tiempo precisamente— teniendo poco o ningún dinero en mi bolso, y nada particular que me interesara en la costa, pensé que navegaría un poco y vería la parte acuosa del mundo.

Pretty easy, right? If you found this interesting, hit the subscribe button above. Got comments? Send me an email at fullstackdev@fullstackoasis.com. I post new content just about every week.

How to use the Amazon AWS SDK for Textract with PHP 7.0

The Amazon AWS Textract API lets you do OCR (optical character recognition) on digital files. It’s actually pretty easy to use, although there’s some prep work.

This post has instructions for using the Textract API with their PHP SDK. I’m using PHP version 7.0 on Ubuntu 16.04. This demo works as of September 2019.

Step 1: Create the project

Create a folder for your project, for example:

mkdir ~/TextractDemo ; cd ~/TextractDemo

Instructions for getting started with the SDK for PHP are here. First, download the .zip file as described on that page. Then, extract the zip file to the root of your project. That adds a lot of files and folders to the project root. For example, the “Aws” folder is added. This is what you should see when listing the contents of this directory:

~/TextractDemo$ ls -lairt
total 676
  396747 -rw-r--r--   1 fullstackdev fullstackdev  10129 Sep 12 14:11 README.md
  531373 drwxr-xr-x   3 fullstackdev fullstackdev   4096 Sep 12 14:11 Psr
  396739 -rw-r--r--   1 fullstackdev fullstackdev   2881 Sep 12 14:11 NOTICE.md
  399132 -rw-r--r--   1 fullstackdev fullstackdev   9202 Sep 12 14:11 LICENSE.md
  926072 drwxr-xr-x   2 fullstackdev fullstackdev   4096 Sep 12 14:11 JmesPath
  396755 drwxr-xr-x   7 fullstackdev fullstackdev   4096 Sep 12 14:11 GuzzleHttp
  399129 -rw-r--r--   1 fullstackdev fullstackdev 478403 Sep 12 14:11 CHANGELOG.md
  396748 -rw-r--r--   1 fullstackdev fullstackdev 132879 Sep 12 14:11 aws-autoloader.php
  531270 drwxr-xr-x 203 fullstackdev fullstackdev  12288 Sep 12 14:11 Aws
  396729 drwxr-xr-x   6 fullstackdev fullstackdev   4096 Sep 15 09:48 .
13500418 drwxr-xr-x  46 fullstackdev fullstackdev  20480 Sep 15 09:49 ..

Step 2: Create an IAM User

In order to use the Textract API, you need an Amazon AWS account. So if you don’t have that already, go follow the instructions to do that now.

Assuming you’ve got an AWS account, you next need to create an IAM (Identity and Access Management) user. If you are signed in to your AWS console, just search for “Identity and Access Management”, and it takes you to the right place to create an IAM user. There’s an area called “Create individual IAM users”. Go there, click the “Manage Users” button, click the “Add User” button, choose a name like TextractUser, and give this user programmatic access only. Once you’ve chosen the name, go to the next step, where you can add the user to a specific group. Create a group with the AmazonTextractFullAccess policy attached. Name it something like TextractFullAccessGroup, and save that. Add the user you just created to this group. The next step lets you add tags to the user, but you can leave that blank.

In the Review (last) step, you are given the user’s access key ID and secret key (which is hidden – you will have to reveal it to copy it). Save these in a secure place! As the documentation says, “This is the last time these credentials will be available to download. However, you can create new credentials at any time.” (So if you lose them somehow, you can always generate a new set.)

The credentials that you just created may be saved in the file ~/.aws/credentials on Linux systems. Here’s a quick rundown about that file.

If this file already exists, you can add to it. Here’s the documentation for adding lines to an AWS credentials file. On that page, it gives you an example credentials file with this content:

[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

[user1]
aws_access_key_id=AKIAI44QH8DHBEXAMPLE
aws_secret_access_key=je7MtGbClwBF/2Zp9Utk/h3yCo8nvbEXAMPLEKEY


Instead of user1, add the line [TextractUser] (or whatever user name you used in the “creating user” step above). Copy and paste your access key id and secret key as shown.

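The credentials file is normally created by running aws configure after installing the AWS CLI. If you do not already have a credentials file, you can install the CLI and run that command, or simply create ~/.aws/credentials by hand in a text editor. Then you can add users to the file.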

Now we’re ready to use Textract. Let’s try to detect text in a sample “document” – the image file shown below. If you are following along, you can right click and save this image, or you can try it on one of your own image files (i.e. one that contains text!).

Test file for Textract

Step 3: Call Textract using the SDK

You can have Textract analyze images that are in an S3 bucket. However, for demo purposes, that is overkill! It is simpler and quicker to read in an image file as bytes, and send that to Textract for analysis. That’s what we will do.

The source code only needs to do three things. First, it needs to create a Textract client. Second, it needs to read in the image file as bytes. Third, the client needs to call the Textract API. Here’s the demo code:

<?php
/*
 * To run this project, make sure that the AWS PHP SDK has been unzipped in the current directory.
 * 
 * Caution: this is not production quality code. There are no tests, and there is no error handling.
 */
require './aws-autoloader.php';

use Aws\Credentials\CredentialProvider;
use Aws\Textract\TextractClient;

// Alternative: read the [TextractUser] profile from your ~/.aws/credentials file
// instead of hard-coding keys. (CredentialProvider::env() would read the
// AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables instead.)
/*
$provider = CredentialProvider::ini('TextractUser');
$client = new TextractClient([
    'region' => 'us-west-2',
    'version' => '2018-06-27',
    'credentials' => $provider
]);
*/
$client = new TextractClient([
    'region' => 'us-west-2',
    'version' => '2018-06-27',
    'credentials' => [
        'key'    => 'AKIAI44QH8DHBEXAMPLE',
        'secret' => 'je7MtGbClwBF/2Zp9Utk/h3yCo8nvbEXAMPLEKEY'
    ]
]);

// The file in this project.
$filename = "aws_cli_text_document.jpg";
$file = fopen($filename, "rb");
$contents = fread($file, filesize($filename));
fclose($file);
$options = [
    'Document' => [
        'Bytes' => $contents
    ],
    'FeatureTypes' => ['FORMS'], // REQUIRED
];
$result = $client->analyzeDocument($options);
// If debugging:
// echo print_r($result, true);
$blocks = $result['Blocks'];
// Loop through all the blocks:
foreach ($blocks as $key => $value) {
	if (isset($value['BlockType']) && $value['BlockType']) {
		$blockType = $value['BlockType'];
		if (isset($value['Text']) && $value['Text']) {
			$text = $value['Text'];
			if ($blockType == 'WORD') {
				echo "Word: ". print_r($text, true) . "\n";
			} else if ($blockType == 'LINE') {
				echo "Line: ". print_r($text, true) . "\n";
			}
		}
	}
}
?>

You’ll need to edit this source code to use your own AWS credentials. Once you do that, you should be able to run the code and view the output, as shown here:

php textract_demo.php 
Line: The AWS CLI is updated frequently with support for new services and commands.
Word: The
Word: AWS
Word: CLI
...
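
The loop in the demo boils down to filtering the Blocks array by BlockType. As a rough illustration, here is the same filter sketched in JavaScript against a hand-written sample. The sample only mimics the two fields the demo reads (BlockType and Text); a real Textract response carries many more fields per block, and the helper name textOfType is my own invention.

```javascript
// Hand-written sample shaped like the part of the response the demo uses.
// The text values are taken from the output shown above; everything else is omitted.
const sampleResult = {
	Blocks: [
		{ BlockType: "PAGE" }, // no Text property, so it is skipped
		{ BlockType: "LINE", Text: "The AWS CLI is updated frequently with support for new services and commands." },
		{ BlockType: "WORD", Text: "The" },
		{ BlockType: "WORD", Text: "AWS" },
		{ BlockType: "WORD", Text: "CLI" }
	]
};

// Hypothetical helper: return the Text of every block of the requested type.
function textOfType(result, blockType) {
	return (result.Blocks || [])
		.filter((block) => block.BlockType === blockType && block.Text)
		.map((block) => block.Text);
}

console.log(textOfType(sampleResult, "LINE"));
console.log(textOfType(sampleResult, "WORD").join(" ")); // The AWS CLI
```

If you only care about whole lines of text, filtering on LINE and ignoring WORD blocks avoids seeing every word twice.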

That’s it! If you found this interesting, hit the subscribe button above. Got questions or comments? Send me an email at fullstackdev@fullstackoasis.com. I post new content just about every week.


How to call the reddit REST API using Node.js – Part IV

This is the last of a 4-part series that describes how to call and use the reddit REST API using Node.js.

In Part I, I talked about using curl to get your access token, which gets you permission to use the reddit REST API.

In Part II, I used that access token to call reddit’s search API. But I was still using curl to do this. I have a very large string output from the API, and I don’t know about you, but I’m not keen on using Linux command line tools to process strings. Since I like JavaScript, I decided to move to Node.js for completing my work.

In Part III, I built a Node.js script which gets my access token from reddit. That script does pretty much what I’d been doing in part I, but it does it using Node.js.

Today, I’m going to complete the work by adding a reddit search API call to my Node.js script, and then using JavaScript’s handy string processing functionality to display the information that interests me.

Recall that I’m pretending to be responsible for Starbucks public relations, and I want to find out what’s being said about Starbucks at reddit, in case I need to do damage control!!

Here’s my new Node.js method which uses reddit’s search API to look for new entries which mention “Starbucks”:

const searchReddit = function (d) {
	const options = {
		hostname: "oauth.reddit.com",
		port: 443,
		path: "/r/all/search?q=Starbucks&sort=new",
		method: "GET",
		headers: {
			"Authorization": "Bearer " + d.access_token,
			"User-Agent": "fullStackOasis NewPostsScraper"
		}
	}

	const req = https.request(options, (res) => {
		// console.log(`statusCode: ${res.statusCode}`)
		let chunks = [];
		res.on('data', (d) => {
			// d is a Buffer object.
			chunks.push(d);
		}).on('end', () => {
			let result = Buffer.concat(chunks);
			let tmpResult = result.toString();
			try {
				let parsedObj = JSON.parse(tmpResult);
				// Print the string if you want to debug or prettify.
				// console.log(tmpResult);
				processSelfText(parsedObj);
			} catch (err) {
				// Note: process.stderr.write() only accepts a string or Buffer. Passing
				// the Error object itself throws "TypeError: Invalid data, chunk must
				// be a string or buffer, not object", so print err.stack instead.
				// Similarly, Buffer.concat() needs actual Buffers: pushing d.toString()
				// onto chunks throws 'TypeError: "list" argument must be an Array of
				// Buffer or Uint8Array instances'.
				console.error("There was an error!");
				console.error(err.stack);
			}
		});
	})

	req.on('error', (error) => {
		// Print the stack as a string; process.stderr.write() rejects Error objects.
		console.error(error.stack);
	});

	req.end();	
};

As before, you do not have to understand this code in detail to see what’s going on. The input to my searchReddit function has the access_token which I’d previously obtained in Part III. This new code uses that access token to call the reddit search API, doing a search for “Starbucks”.
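
The search term is hard-coded in the path above. If I wanted to search for something else, one way to do it (a sketch, not part of the original script; buildSearchPath is a name I made up) would be to build the path from a parameter and let encodeURIComponent take care of spaces and symbols:

```javascript
// Hypothetical helper: builds the value for options.path so the search term
// is no longer hard-coded. The sort order defaults to "new", as in the script.
function buildSearchPath(term, sort) {
	// encodeURIComponent makes terms with spaces or symbols safe in a query string.
	return "/r/all/search?q=" + encodeURIComponent(term) +
		"&sort=" + encodeURIComponent(sort || "new");
}

console.log(buildSearchPath("Starbucks"));           // /r/all/search?q=Starbucks&sort=new
console.log(buildSearchPath("pumpkin spice latte")); // /r/all/search?q=pumpkin%20spice%20latte&sort=new
```

The result would then be assigned to the path property of the options object.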

Buried in that code above is a call to a function processSelfText. I need that because it’s not helpful to have a giant wall of text displayed to me! I need to process this blob of data, and have the script display only the interesting parts.

My function processSelfText grabs the blob of JSON which was returned from reddit’s search API, and loops through it for all the individual reddit threads. It prints out a substring of the thread that contains the mention of “Starbucks”, and also prints out the reddit URL in case I want to read the whole thread. I can quickly skim through the results to see if the thread looks potentially harmful to Starbucks. If it does, then I can go to reddit to respond.

Here’s the string processing code:

const processSelfText = function (obj) {
	if (obj.data && obj.data.children && obj.data.children.length) {
		obj.data.children.forEach(function (item, n) {
			// data is an Object. It may have selftext property
			if (item.data) {
				console.log("Item #" + n);
				if (!item.data.selftext) {
					console.log("Only found a url, no text:");
					console.log(item.data.url);
				} else {
					console.log("Found url and text:");
					console.log(item.data.url);
					showSurroundingText(item.data.selftext);
				}
			}
		});
	}
}

/**
 * Print up to maxchars characters before and after the start of the first
 * mention of "starbucks" (case-insensitive) in the input string.
 * @param {string} str
 */
const showSurroundingText = function (str) {
	let maxchars = 150;
	// Collapse runs of carriage returns and newlines into single spaces first,
	// so that the index found below refers to the cleaned-up string.
	// See https://davidwalsh.name/remove-multiple-new-lines
	str = str.replace(/[\r]+/g, " ");
	str = str.replace(/[\n]+/g, " ");
	// Have to do a lowercase search.
	let found = str.toLowerCase().indexOf("starbucks");
	if (found > -1) {
		// If the first argument is negative, it's okay: substring treats it as 0.
		// If the second argument is past the end, it's treated as str.length.
		let substring = str.substring(found - maxchars, found + maxchars);
		console.log("..." + substring + "...");
	}
};
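
showSurroundingText leans on the way substring handles out-of-range arguments, so it's worth a quick check. Here's a tiny sketch with a made-up string that is much shorter than 150 characters:

```javascript
const text = "I like Starbucks"; // made-up sample, 16 characters long
const found = text.toLowerCase().indexOf("starbucks");

// A negative start is treated as 0, and an end past the string is treated as
// text.length, so asking for 150 characters either side just returns everything.
const snippet = text.substring(found - 150, found + 150);

console.log(found);   // 7
console.log(snippet); // I like Starbucks
```

So there's no need to clamp the indexes by hand before calling substring.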

I run the script from the command line, like this: node reputation-checker.js. I am using Node.js version 8.3, and Ubuntu 16.04, but I think this script will work for most other operating systems and platforms. This is what the output of my script looks like:

Item #0
Found url and text:
https://www.reddit.com/r/aznidentity/comments/cybh97/we_are_not_honorary_white_people_and_do_not/
...tention span which many folks nowadays unfortunately do not.  I’ll order my morning coffee at an Asian-owned establishment rather than an overpriced Starbucks where it ends up tasting burnt anyway.  I could not care less for Mexican food or a Westerners version of Chinese food. Authentic Asian cuisi...
Item #1
Found url and text:
https://www.reddit.com/r/exmormon/comments/cybd66/mom_why_dont_you_get_a_blessing_me_because_they/
... of my skin alternates between burning and itching. Sometimes, I get both at once. Which is what happened yesterday. While I was waiting in line at Starbucks, talking to my mom on the phone.  My mom knows I left the church. And she knew I was at Starbucks. But none of that should matter. She’s livin...
Item #2
...

I could do more to refine this code – normally, I’d refactor it, and write some tests, and do more error handling, maybe automate it to send me an email periodically… but this is good enough for demonstration purposes. Here’s a link to the entire script, if you want to download it and mess around with it.

I hope you enjoyed this tutorial! Please feel free to use the “subscribe” form if you’d like to keep posted on updates to the “Full Stack Oasis” blog. I only post about once a week.

How to call the reddit REST API using Node.js – Part III

In Part I and Part II, I called the reddit REST API using the curl command line tool.

Now, I’m going to create a Node.js script that does this. Why create a script, when I can already do what I want with curl?

(1) I can more easily reuse the code that I’ve written. In Part II, I mentioned performing a search for the word “Starbucks” at reddit. If I wanted to do a different type of search, I could alter my script to search for something else instead.

(2) I can more easily execute the code that I’ve written. For example, I could run the script on a daily or hourly basis using a cron job, without having to do anything manually. “Set it and forget it” is awesome!

(3) I can more easily use the code that I’ve written. For example, I can bring in Node.js libraries to process the output of my script, and easily get from it what interests me.

So, now I’m going to move from curl to Node.js to do what I want, repeatedly, in an automated way.

Recall that I am pretending to be in charge of reputation management for Starbucks. I want to get recent comments that come up on reddit that mention Starbucks so I can quickly look for problems (or, hopefully, compliments!), and respond.

Below, I’ve shown just the Node.js code which can be used to retrieve my access token. Notice how much more complicated this is than the single curl command that I used previously! By the way, there are definitely simpler ways to do this using Node.js. I’m writing this example using just the built-in libraries that come with Node.js, which makes things a bit more complicated than they have to be.

const https = require('https');

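// The form body carries your reddit account's user name and password.
let postData = "grant_type=password&username=my_user_name&password=MyExcellentPassword";
// These two are the reddit app's ID and secret, sent as HTTP Basic auth credentials.
let username = "my_reddit_id";
let password = "my_reddit_secret";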

/**
 * A method to get an access token to call reddit Search API
 */
const getAccessToken = function () {
	const options = {
		hostname: "www.reddit.com",
		port: 443,
		path: '/api/v1/access_token',
		method: 'POST',
		headers: {
			"Content-Type": "application/x-www-form-urlencoded",
			"Content-Length": Buffer.byteLength(postData),
			"Authorization": "Basic " + Buffer.from(username + ":" + password).toString("base64"),
			"User-Agent": "my test bot 1.0"
		}
	}

	const req = https.request(options, (res) => {
		if (res.statusCode === 200) {
			let chunks = [];
			res.on('data', (d) => {
				/*
				* the output data has the format
				* {"access_token": "271295382352-tV_vIeKVRgq7Juh3iYHmW4oyT64",
				* "token_type": "bearer", "expires_in": 3600, "scope": "*"}
				* But d is a Buffer object, and has to be translated into an
			    * object at the end.
				*/
				chunks.push(d);
			}).on('end', () => {
				let result = Buffer.concat(chunks);
				let tmpResult = result.toString();
				try {
					let parsedObj = JSON.parse(tmpResult); // TODO do something with this Object, which contains my access token

				} catch (err) {
					// Print the stack as a string; process.stderr.write() rejects Error objects.
					console.error(err.stack);
				}
			});
		} else {
			console.log("Received a statusCode " + res.statusCode);
		}
	});

	req.on('error', (error) => {
		console.error(error.stack);
	});

	req.write(postData);
	req.end();
};

getAccessToken();
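
Two of the header values in the options object deserve a closer look. The sketch below uses the same placeholder credentials as the script (the variable names clientId and clientSecret are mine) to show what the Basic Authorization value actually contains, and why Content-Length is a count of bytes rather than characters.

```javascript
// Placeholder credentials, same values as in the script above.
const clientId = "my_reddit_id";
const clientSecret = "my_reddit_secret";

// HTTP Basic auth is "Basic " followed by base64("id:secret").
const authHeader = "Basic " + Buffer.from(clientId + ":" + clientSecret).toString("base64");

// Decoding the base64 part gives back the original pair, so this header is
// only encoded, not encrypted; it relies on https for secrecy.
const decoded = Buffer.from(authHeader.slice("Basic ".length), "base64").toString();
console.log(decoded); // my_reddit_id:my_reddit_secret

// Content-Length counts bytes. For ASCII text that equals the string length,
// but a non-ASCII character takes more than one byte in UTF-8.
console.log("password=abc".length, Buffer.byteLength("password=abc"));   // 12 12
console.log("password=café".length, Buffer.byteLength("password=café")); // 13 14
```

For the ASCII-only postData in this script the two counts agree, but a password with an accented character would make them differ.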

You don’t have to understand all this code in detail. If you skim it, you should get an idea of what it’s doing. You are making an https request to reddit’s API. When a web browser makes an https request, there’s a lot of “stuff” going on under the hood. We have to code up some of that stuff here; that’s what the options object is for. Once the request is made to reddit, the information is returned in “chunks” of binary data over the network. Node.js waits for all of that data to arrive. When it’s finished, we use the built-in Buffer.concat method to concatenate all the binary data into one Buffer object, turn it into a JSON string, and parse that into an Object. The Object contains our access_token property. In my next post, we’ll use that to access the reddit API to search for recent posts about “Starbucks”.
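
Why collect the chunks and concatenate them as Buffers, instead of turning each chunk into a string as it arrives? Here's a small sketch that simulates a response split into two chunks. The token value and the extra "note" field are made up; I added the "é" because it takes two bytes in UTF-8, which lets the split land in the middle of a character.

```javascript
// Simulate a JSON response that arrives from the network in two chunks.
const whole = Buffer.from('{"access_token":"made-up-token","note":"café"}');

// Split in the middle of the two-byte "é": the last 3 bytes are the second
// byte of "é", the closing quote, and the closing brace.
const splitAt = whole.length - 3;
const chunks = [whole.slice(0, splitAt), whole.slice(splitAt)];

// Converting each chunk to a string separately corrupts the split character...
const naive = chunks.map((c) => c.toString()).join("");

// ...whereas concatenating the Buffers first keeps the bytes intact.
const joined = Buffer.concat(chunks).toString();

console.log(naive === whole.toString());      // false
console.log(joined === whole.toString());     // true
console.log(JSON.parse(joined).access_token); // made-up-token
```

The token response shown in the code comments is plain ASCII, so it would survive either approach, but search results full of user-written text would not.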