Phishing Detection Using Content-Based Image Classification

Name: Phishing Detection Using Content-Based Image Classification
Author: Shekhar Khandelwal, Rik Das

130 pages
English

About This Book

Phishing Detection Using Content-Based Image Classification is an invaluable resource for any deep learning and cybersecurity professional and scholar trying to solve various cybersecurity tasks using new age technologies like Deep Learning and Computer Vision. With various rule-based phishing detection techniques at play which can be bypassed by phishers, this book provides a step-by-step approach to solve this problem using Computer Vision and Deep Learning techniques with significant accuracy.

The book offers comprehensive coverage of the most essential topics, including:

Programmatically reading and manipulating image data
Extracting relevant features from images
Building statistical models using image features
Using state-of-the-art Deep Learning models for feature extraction
Building a robust phishing detection tool even with limited data
Dimensionality reduction techniques
Class imbalance treatment
Feature Fusion techniques
Building performance metrics for multi-class classification tasks

Another unique aspect of this book is that it comes with a completely reproducible code base developed by the author and shared via Python notebooks for quick launch and running. The notebooks can be leveraged to further enhance the provided models using new advancements in computer vision and more advanced algorithms.

Information

Publisher

Chapman and Hall/CRC

Year

2022

ISBN

9781000597691

Topic

Computer Science

Subtopic

Computer Engineering

1 Phishing and Cybersecurity

DOI: 10.1201/9781003217381-1

Phishing is a cybercrime intended to lure innocent web users to a counterfeit website that is visually similar to its legitimate counterpart. Initially, users are redirected to phishing websites through various social and technical routing techniques. Unaware of the illegitimacy of the website, the users may then provide personal information such as a user ID, password, credit card details or bank account details, to name a few. The phishers use such information to steal money from banks, damage a brand’s image or even commit graver crimes like identity theft. Although many phishing detection and prevention techniques are available in the existing literature, the advent of smart machine learning and deep learning methods has widened their scope in the cybersecurity world.

Structure

In this chapter, we will cover the following topics:

Basics of phishing in cybersecurity
Phishing detection techniques
- List (whitelist/blacklist)-based
- Heuristics (predefined rules)-based
- Visual similarity-based
Race between phishers and anti-phishers
Computer vision-based phishing detection approach

Objective

After studying this chapter, you should know what phishing is and how it affects everyone who has a web footprint. You will learn about the various ways phishers attack web users and how users can protect themselves from phishing attacks. You will also know the various phishing detection mechanisms that play a vital role in protecting web users from phishing attacks.

Basics of Phishing in Cybersecurity

Phishing is a term derived from fishing, by replacing “f” with “ph”, but contextually they mean the same (Phishing Definition & Meaning | What Is Phishing?, n.d.). Just as fish get trapped in fishing nets, so too are innocent web users being trapped by phishing websites. Phishing websites are counterfeit websites that are visually similar to their legitimate counterparts. Web users are redirected to phishing websites by various means. Figure 1.1 depicts the various techniques employed by phishers to circulate spam messages that contain links to phishing websites.

**FIGURE 1.1** Types of phishing attacks. The figure shows two categories of phishing attacks: social engineering and technical subterfuge. Social engineering attacks occur through emails, instant messages, VoIP, relay chat and blogs. Technical subterfuge covers malware, web trojans, keyloggers and DNS poisoning.

Jain and Gupta (2017) stated that spreading infected links is the starting point of any phishing attack. Once users have received the infected links in their inbox through any of the phishing attack mechanisms shown in Figure 1.1, whether they click on those links or not depends on the users’ awareness. Hence, at the outset, user awareness is the most important, yet most ignored, anti-phishing mechanism.

However, because awareness alone cannot be relied upon, anti-phishers have explored many technical anti-phishing mechanisms designed to protect even novice and technically inexperienced users.

Phishing Detection Techniques

Phishing detection mechanisms are broadly categorized into four groups, as depicted in Figure 1.2 (Khonji et al., 2013):

**FIGURE 1.2** Phishing detection mechanisms. The figure shows two major categories: software-based and user education-based. The software-based category has the sub-categories list-based, heuristic-based, visual similarity-based and AI/ML-based. Visual similarity is further divided into sub-categories, among which pixel-based features are split into comparison-based and ML-based.

List (whitelist/blacklist)-based
Heuristics (pre-defined rules)-based
Visual similarity-based
AI/ML-based

List (Whitelist/Blacklist)-Based

In a list-based anti-phishing mechanism, a whitelist and a blacklist of URLs are created and compared against a suspicious website URL to conclude whether the website under scrutiny is a phishing website or a legitimate one (Jain & Gupta, 2016; Prakash et al., 2010).

There are various limitations with the list-based approach, namely:

It is dependent on a third-party service provider that captures and maintains such lists, like the Google Safe Browsing API (Google Safe Browsing | Google Developers, n.d.).
Adding a newly deployed phishing website to the white/blacklist is a process that takes time. First such a website has to be identified, and then it has to be listed. Since the average lifetime of a phishing website is only 24–32 hours, a new site can carry out its attack before it is ever listed. This exposure to zero-day phishing attacks is a serious limitation (Zero-Day (Computing) – Wikipedia, n.d.).
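The list lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hostnames, the `WHITELIST` and `BLACKLIST` sets and the helper names are hypothetical, and a real deployment would query a maintained third-party list rather than hard-coded values.

```python
from urllib.parse import urlsplit

# Hypothetical example lists (assumption): a real system would pull these
# from a maintained third-party feed instead of hard-coding them.
WHITELIST = {"example-bank.com", "example.org"}
BLACKLIST = {"examp1e-bank.com"}


def hostname_of(url):
    """Return the lower-cased hostname of a URL, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def check_url(url):
    """Classify a URL as 'phishing', 'legitimate' or 'unknown' by list lookup."""
    host = hostname_of(url)
    if host in BLACKLIST:
        return "phishing"
    if host in WHITELIST:
        return "legitimate"
    # Neither list knows this host yet: this is the zero-day gap.
    return "unknown"
```

The `"unknown"` branch makes the second limitation visible: a freshly deployed phishing site is on neither list, so the lookup alone cannot flag it.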

Heuristics (Pre-Defined Rules)-Based

In heuristic-based approaches, various website features like image, text, URL and DNS records are extracted and used to build a rule-based engine or a machine learning-based classifier to classify a given website as phishing or legitimate. Although heuristic-based approaches are among the more effective anti-phishing mechanisms, some of their drawbacks have been pointed out by Varshney et al. (2016):

The time and computational resources required for training are too high.
Heuristic-based applications cannot be used as a browser plugin.
The approach becomes ineffective once scammers discover the key features and use that knowledge to bypass the rules.
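To make the idea of a rule-based engine concrete, the following is a minimal sketch that scores a URL against a handful of hand-written rules. The rules, weights and threshold are illustrative assumptions chosen for this example, not values taken from the book or from Varshney et al. (2016), and the sketch looks only at the URL, whereas the approaches above also use image, text and DNS features.

```python
import re
from urllib.parse import urlsplit

# Illustrative rules: (name, weight, predicate on (url, host)).
# All rules and weights are assumptions made for this sketch.
RULES = [
    ("host_is_ip", 2,
     lambda url, host: re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host) is not None),
    ("has_at_sign", 2, lambda url, host: "@" in url),
    ("many_subdomains", 1, lambda url, host: host.count(".") >= 4),
    ("long_url", 1, lambda url, host: len(url) > 75),
    ("no_https", 1, lambda url, host: not url.lower().startswith("https://")),
]
THRESHOLD = 3  # hypothetical cut-off: total weight at or above this is flagged


def score_url(url):
    """Return (total weight, names of the rules that fired) for a URL."""
    host = urlsplit(url).hostname or ""
    fired = [(name, weight) for name, weight, rule in RULES if rule(url, host)]
    return sum(weight for _, weight in fired), [name for name, _ in fired]


def classify(url):
    """Label a URL 'phishing' if its rule score reaches the threshold."""
    score, _ = score_url(url)
    return "phishing" if score >= THRESHOLD else "legitimate"
```

A URL such as `http://192.168.10.5/secure/login` fires the `host_is_ip` and `no_https` rules and is flagged, whereas a phishing page served over HTTPS from an ordinary-looking domain fires none of them. That is the third drawback in miniature: once the rules are known, they can be sidestepped.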

Visual Similarity-Based

Visual similarity-based techniques are very useful in detecting phishing since phishing websites look similar to their legitimate counterparts. These techniques use visual features like text content, text format, DOM (Document Object Model) features, CSS features, website images, etc., to detect phishing. Here, DOM-, CSS-, HTML tags- and pixel-based features are compared to their legitimate counterparts in order to make a decision.

Within pixel-based techniques, there are two broad categories through which phishing detection is achieved. One approach is through comparison of visual signatures of suspicious website images with the stored visual signatures of legitimate websites. For example, hand-crafted image features like SIFT (Scale Invariant Feature Transform) (Lowe, 2004), SURF (Speeded Up Robust Features) (Bay et al., 2006), HOG (Histogram of Oriented Gradients) (Li et al., 2016), LBP (Local Binary Patterns) (Nhat & Hoang, 2019), DAISY (Tola et al., 2010) and MPEG7 (Rayar, 2017) are extracted from the legitimate websites and stored in a local datastore, which is used as a baseline for comparing similar features from the websites under scrutiny. Based on the comparison result, the website under scrutiny is classified as phishing or legitimate. The other approach is machine learning- or deep learning classifier-based, where image features of phishing and legitimate webpages are extracted and used to build a classifier for phishing detection.
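The comparison-based approach can be illustrated with a deliberately simple stand-in. The sketch below uses a normalized intensity histogram as the visual signature instead of SIFT, SURF, HOG or the other descriptors named above, and compares signatures by histogram intersection. The toy images, brand names, hostnames and the 0.9 threshold are all assumptions made for this example.

```python
def histogram_signature(image, bins=8):
    """Normalized intensity histogram of a non-empty grayscale image.

    `image` is a list of rows, each a list of integers in 0..255.
    """
    counts = [0] * bins
    total = 0
    for row in image:
        for pixel in row:
            counts[min(pixel * bins // 256, bins - 1)] += 1
            total += 1
    return [count / total for count in counts]


def similarity(sig_a, sig_b):
    """Histogram intersection: 1.0 for identical signatures, 0.0 for disjoint."""
    return sum(min(a, b) for a, b in zip(sig_a, sig_b))


def best_match(signature, baseline):
    """Return (brand, similarity) for the closest stored legitimate signature."""
    scored = ((brand, similarity(signature, sig)) for brand, sig in baseline.items())
    return max(scored, key=lambda pair: pair[1])


def looks_like_phishing(image, host, baseline, brand_hosts, threshold=0.9):
    """Flag a page that looks like a known brand but is served from another host."""
    brand, score = best_match(histogram_signature(image), baseline)
    return score >= threshold and host != brand_hosts[brand]


# Toy 4x4 "screenshots" (hypothetical data standing in for real page images).
bank_page = [[10, 10, 200, 200]] * 4
shop_page = [[120, 120, 120, 250]] * 4
suspicious_page = [[12, 9, 205, 198]] * 4

# Baseline datastore of legitimate signatures and their known hosts.
baseline = {
    "example-bank": histogram_signature(bank_page),
    "example-shop": histogram_signature(shop_page),
}
brand_hosts = {"example-bank": "example-bank.test", "example-shop": "example-shop.test"}
```

Calling `looks_like_phishing(suspicious_page, "examp1e-bank.test", baseline, brand_hosts)` flags the page because it matches the stored bank signature while coming from a different host. The sketch also shows why this approach needs a datastore of baseline signatures for every legitimate site it is meant to protect.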

Race between Phishers and Anti-Phishers

Phishers are continually upgrading their skills and devising new and innovative ways to bypass all the security layers and deceive innocent users. For example, many heuristic-based approaches validate if the website under suspicion is SSL-enabled or not, to determine whether it is a legitimate website or a phishing website. However, nowadays, the number of phishing websites hosted on HTTPS is also increasing significantly.

Similarly, for other significant predictors of a phishing website, phishers may find ways to bypass all the rules employed to detect phishing, which is evident from the upward trend of phishing attacks attempted in recent years. Hence, if phishers find a way to bypass list-based, heuristic-based and hybrid anti-phishing detection mechanisms, to redirect the users to the phishing website, then the image processing-based anti-phishing techniques play a vital role in providing the final security layer to the web users. In this book, we propose numerous machine learning and deep learning methods that manually extract features using computer vision techniques for phishing detection. However, there are two major limitations of these methods. First, these methods utilize a comparison-based technique that requires creating a large datastore of baseline values of legitimate websites. Second, these methods rely on manual hand-crafted feature extraction techniques.

2 Image Processing-Based Phishing Detection Techniques

DOI: 10.1201/9781003217381-2

Studies and statistics suggest that phishing is still a pressing issue in the world of cybercrime. Despite all the research, innovations and developments made in phishing detection mechanisms, revenue losses through phishing attacks are humongous, and therefore there is a pressing need to continue research on various aspects of phishing, as anti-phishers are in an arms race with phishers. And in order to win this race, anti-phishers need to think outside the box and close all the doors before phishers enter users’ premises for theft.

Assume that phishers manage to get a user past all the list-based, heuristic-based and user awareness-based defenses, so that the user finally lands on the phishing website. At this stage, analyzing the website image with image processing techniques to classify the website as legitimate or phishing can be considered a last resort for warning users of a phishing attack.

Additionally, list-based approaches cannot protect users from zero-day attacks, and heuristics-based approaches are only go...