Chapter 17 of the textbook (“Hi Ho, Hi Ho – Data Mining We Go”) introduces association rules mining, a flexible and easy-to-understand form of unsupervised machine learning. Association rules mining is sometimes called market basket analysis, and the chapter works through an example that focuses on groceries. Yet association rules mining can be applied to many types of data: essentially any data set where you have a list of “containers” and each container holds a list of items. Association rules mining looks for commonalities across these containers, that is, which combinations of items frequently occur together.

If you think about it for a minute, you might see that this idea applies to documents (e.g., emails or web pages) and the words that appear in them. Each document can be thought of as a container of words, and within each container certain combinations of words may appear together. In a previous chapter, we explored this idea by creating a “term-document” matrix. In this exercise, we are going to apply association rules mining to a term-document matrix.
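
To make the "documents as containers of words" idea concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the four toy documents, the 0.5 support threshold, and the helper names `support` and `confidence` are all hypothetical, and this is not the workflow the textbook chapter uses. It computes the two standard measures behind an association rule: support (how often a word combination appears across documents) and confidence (how often the right-hand word appears given the left-hand word).

```python
from itertools import combinations

# Hypothetical toy corpus: each document is a "container" of distinct words.
documents = [
    {"data", "mining", "r"},
    {"data", "mining", "text"},
    {"data", "analysis", "r"},
    {"text", "mining", "data"},
]


def support(itemset, containers):
    """Fraction of containers that hold every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= c for c in containers) / len(containers)


def confidence(lhs, rhs, containers):
    """Among containers holding lhs, the fraction that also hold rhs."""
    return support(set(lhs) | set(rhs), containers) / support(lhs, containers)


# Keep every word pair that appears in at least half of the documents
# (the 0.5 threshold is an arbitrary choice for this illustration).
vocabulary = sorted(set().union(*documents))
frequent_pairs = {
    pair: support(pair, documents)
    for pair in combinations(vocabulary, 2)
    if support(pair, documents) >= 0.5
}

for pair, s in frequent_pairs.items():
    print(pair, "support =", s)

# Confidence of the rule {mining} => {data}
print(confidence({"mining"}, {"data"}, documents))
```

A real rule miner avoids enumerating every pair by pruning infrequent combinations early, but the quantities it reports are the same ones computed here.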

You can create and/or find all of the code you need to accomplish these steps:

LS0tCnRpdGxlOiAiTGFiIDk6IFVzaW5nIGV4cGxvcmF0b3J5IGFuYWx5c2lzIGFuZCBhcnVsZXMiCmF1dGhvcjogCi0gW1lPVVIgTkFNRV0KLSBbWU9VUiBQQVJUTkVSUyBOQU1FXQpkYXRlOiAiYHIgU3lzLnRpbWUoKWAiCm91dHB1dDogaHRtbF9ub3RlYm9vawotLS0KQ2hhcHRlciAxNyBvZiB0aGUgdGV4dGJvb2sgKOKAnEhpIEhvLCBIaSBIbyDigJMgRGF0YSBNaW5pbmcgV2UgR2/igJ0pIGludHJvZHVjZXMgYXNzb2NpYXRpb24gcnVsZXMgbWluaW5nIOKAkyBhIHZlcnkgZmxleGlibGUgYW5kIGVhc3kgdG8gdW5kZXJzdGFuZCBmb3JtIG9mIHVuc3VwZXJ2aXNlZCBtYWNoaW5lIGxlYXJuaW5nLiBBc3NvY2lhdGlvbiBydWxlcyBtaW5pbmcgaXMgYWxzbyBzb21ldGltZXMgY2FsbGVkIG1hcmtldCBiYXNrZXQgYW5hbHlzaXMsIGFuZCBpbmRlZWQgdGhlIGNoYXB0ZXIgd29ya3MgYW4gZXhhbXBsZSB0aGF0IGZvY3VzZXMgb24gZ3JvY2VyaWVzLiBZZXQgYXNzb2NpYXRpb24gcnVsZXMgbWluaW5nIGNhbiBiZSBhcHBsaWVkIHRvIGEgdmFyaWV0eSBvZiB0eXBlcyBvZiBkYXRhLCBiYXNpY2FsbHkgYW55IGRhdGEgc2V0IHdoZXJlIHlvdSBoYXZlIGEgbGlzdCBvZiDigJxjb250YWluZXJz4oCdIGFuZCBlYWNoIGNvbnRhaW5lciBoYXMgYSBsaXN0IG9mIHN0dWZmIGluc2lkZSBpdC4gQXNzb2NpYXRpb24gcnVsZXMgbWluaW5nIGxvb2tzIGZvciBjb21tb25hbGl0aWVzIGFjcm9zcyB0aGVzZSBjb250YWluZXJzIOKAkyB3aGF0IGFyZSB0aGUgY29tYmluYXRpb25zIG9mIGl0ZW1zIHRoYXQgZnJlcXVlbnRseSBvY2N1ciB0b2dldGhlci4gCgpJZiB5b3UgdGhpbmsgYWJvdXQgaXQgZm9yIGEgbWludXRlLCB5b3UgbWlnaHQgc2VlIHRoYXQgdGhpcyBpZGVhIGFwcGxpZXMgdG8gZG9jdW1lbnRzIChlLmcuLCBlbWFpbHMgb3Igd2ViIHBhZ2VzKSBhbmQgdGhlIHdvcmRzIHRoYXQgYXBwZWFyIGluIHRoZW0uIEVhY2ggZG9jdW1lbnQgY2FuIGJlIHRob3VnaHQgb2YgYXMgYSBjb250YWluZXIgb2Ygd29yZHMsIGFuZCB3aXRoaW4gZWFjaCBjb250YWluZXIgY2VydGFpbiBjb21iaW5hdGlvbnMgb2Ygd29yZHMgbWF5IGFwcGVhciB0b2dldGhlci4gSW4gYSBwcmV2aW91cyBjaGFwdGVyLCB3ZSBleHBsb3JlZCB0aGlzIGlkZWEgYnkgY3JlYXRpbmcgYSDigJx0ZXJtZG9jdW1lbnTigJ0gbWF0cml4LiBJbiB0aGlzIGV4ZXJjaXNlLCB3ZSBhcmUgZ29pbmcgdG8gYXBwbHkgYXNzb2NpYXRpb24gcnVsZXMgbWluaW5nIHRvIGEgdGVybS1kb2N1bWVudCBtYXRyaXguIAoKWW91IGNhbiBjcmVhdGUgYW5kL29yIGZpbmQgYWxsIG9mIHRoZSBjb2RlIHlvdSBuZWVkIHRvIGFjY29tcGxpc2ggdGhlc2Ugc3RlcHM6IAoKKiBUaGVyZSBpcyBhIG5pY2UsIG1hbmFnZWFibGUgdGVybS1kb2N1bWVudCBtYXRyaXggdGhhdCBZYW5jaGFuZyBaaGFvIGhhcyBjcmVhdGVkIGJhc2VkIG9uIGEgc2V0IG9mIHR3ZWV0cyBhYm91dCBkYXRhIG1pbmluZyB0aGF0IGhlIGV4dHJh
Y3RlZDogaHR0cDovL3d3dy5yZGF0YW1pbmluZy5jb20vZGF0YS90ZXJtRG9jTWF0cml4LnJkYXRhIAoKICAgICsgSWYgZm9yIHNvbWUgcmVhc29uIGhhdCBsaW5rIGRvZXNu4oCZdCB3b3JrLCB5b3Ugc2hvdWxkIHZpc2l0IHRoaXMgcGFnZSBhbmQgZG93bmxvYWQgdGhlIGZpbGUgZW50aXRsZWQg4oCcdGVybURvY01hdHJpeC5yZGF0YeKAnSBvbnRvIHlvdXIgY29tcHV0ZXI6IGh0dHA6Ly93d3cucmRhdGFtaW5pbmcuY29tL2RhdGEgCiAgICAKICAgICsgQXMgeW91IGhhdmUgZ3Vlc3NlZCBmcm9tIHRoZSBmaWxlIGV4dGVuc2lvbiwgdGhpcyBpcyBhIGRhdGFzZXQgdGhhdCBpcyBhbHJlYWR5IHByZXBhcmVkIGZvciBvcGVuaW5nIGluIFIuIFJ1biBSLVN0dWRpbyBvbiB5b3VyIGxhcHRvcCBhbmQgdXNlIHRoZSBPcGVuIEZpbGUgY29tbWFuZCB0byBsb2FkIHRoaXMgZmlsZS4gQWZ0ZXIgYW5zd2VyaW5nIHRoZSBjb25maXJtYXRpb24gbWVzc2FnZSBhZmZpcm1hdGl2ZWx5LCB5b3Ugd2lsbCBmaW5kIHRoYXQgYSBkYXRhIG9iamVjdCBjYWxsZWQg4oCcdGVybURvY01hdHJpeOKAnSBhcHBlYXJzIGluIHlvdXIgZW52aXJvbm1lbnQgd2luZG93LiBJbnNwZWN0IHRoaXMgZGF0YSBvYmplY3QuIAoKKiBGb3IgYXNzb2NpYXRpb24gcnVsZXMgbWluaW5nIChhbmQgbW9yZSBzcGVjaWZpY2FsbHkgdGhlIGFwcmlvcmkoKSBjb21tYW5kKSB0byB3b3JrIHByb3Blcmx5LCB5b3Ugd2FudCBpdGVtcyBpbiBjb250YWluZXJzL2Jhc2tldHMgdG8gYmUgeW91ciBjb2x1bW5zIGFuZCB0aGUgcm93cyB0byBiZSB5b3VyIGNvbnRhaW5lcnMuIFlvdSB3aWxsIGZpbmQgdGhhdCB0aGlzIGRhdGEgb2JqZWN0IGlzIHNldCB1cCB0aGUgb3Bwb3NpdGUgd2F5OiBJdCBoYXMgdGhlIHRlcm1zIGFzIHJvd3MgYW5kIHRoZSBkb2N1bWVudHMgYXMgY29sdW1ucy4gWW91IHdpbGwgbmVlZCB0byB0cmFuc3Bvc2UgdGhlIGRhdGEgc2V0LCBhbmQgZm9ydHVuYXRlbHkgUiBoYXMgYSBjb21tYW5kIHRvIGRvIHRoYXQgdmVyeSBlYXNpbHkuIERvIHNvbWUgcmVzZWFyY2ggdG8gZmluZCB0aGF0IGNvbW1hbmQgYW5kIGhvdyB0byB1c2UgaXQuIFRoZW4gdHJhbnNwb3NlIHlvdXIgZGF0YSBzZXQgYW5kIHBsYWNlIGl0IGluIGEgbmV3IGRhdGEgb2JqZWN0LiAKCiogTmV4dCwgYXBwbHkgYWxsIG9mIHRoZSB0ZWNobmlxdWVzIHlvdSBsZWFybmluZyBmcm9tIGNoYXB0ZXIgMTcuIFRoaXMgbWVhbnMgdGhhdCB5b3Ugd2lsbCBoYXZlIHRvIGxvYWQgdGhlIGFydWxlcyBwYWNrYWdlLCBydW4gYXByaW9yaSgpLCBzZXQgdGhlIHBhcmFtZXRlcnMgY29ycmVjdGx5LCBpbnNwZWN0IHRoZSByZXN1bHRzLCB2aXN1YWxpemUgdGhlIHJlc3VsdHMgdXNpbmcgdGhlIGFydWxlc1ZpeiBwYWNrYWdlLCBhbmQgbWFrZSBzZW5zZSBvdXQgb2Ygd2hhdCB5b3UgZmluZC4gWW91IHNob3VsZCBzZXQgeW91ciBwYXJhbWV0ZXJzIHNvIHlvdSBnZW5lcmF0ZSBhdCBsZWFzdCAyMCBydWxlcy4gCgoqIEF0IHRo
ZSBlbmQgb2YgeW91ciBjb2RlIGZpbGUgZm9yIHRoaXMgZXhlcmNpc2UsIHdyaXRlIGEgZmV3IHNlbnRlbmNlcyBpbnRlcnByZXRpbmcgdGhlIHJlc3VsdHMgb2YgdGhpcyBhbmFseXNpcyBhbmQgZGVzY3JpYmluZyBob3cgdGhpcyB0ZWNobmlxdWUgbWlnaHQgYmUgdmFsdWFibGUgaW4gbWFraW5nIHNlbnNlIG91dCBvZiBsYXJnZSBzZXRzIG9mIGRvY3VtZW50cyAoZS5nLiwgZW1haWxzKS4gCgoKCgo=