Text Moderation using Cognitive Services

The Cognitive Services Content Moderator is a useful tool for finding explicit or offensive content in images, videos and text. It also supports custom image and text lists to block or allow matching content, and it can serve as a feedback tool to assist human moderators. In this tutorial we are going to use the API to find objectionable content in text. To use the Content Moderator with images, check out this tutorial.

The service response includes the following information:

  • Profanity: term-based matching with built-in list of profane terms in various languages
  • Classification: machine-assisted classification into three categories
  • Personally Identifiable Information (PII)
  • Auto-corrected text
  • Original text
  • Language

Prerequisites

  1. To run the sample code you must have an edition of Visual Studio installed.
  2. You will need the Microsoft.Azure.CognitiveServices.ContentModerator NuGet package.
  3. You will need the Microsoft.Rest.ClientRuntime NuGet package.
  4. You will need the Newtonsoft.Json NuGet package.
  5. You will need an Azure Cognitive Services key. Follow this tutorial to get one. If you don’t have an Azure account, you can use the free trial to get a subscription key.

Create the Project

To create the application, follow the steps below:

  1. Create a .NET Core Console Application in Visual Studio 2017
  2. Add the Microsoft.Azure.CognitiveServices.ContentModerator NuGet package by using the NuGet Package Manager Console.
    Install-Package Microsoft.Azure.CognitiveServices.ContentModerator
  3. Add the Microsoft.Rest.ClientRuntime NuGet package by using the NuGet Package Manager Console.
    Install-Package Microsoft.Rest.ClientRuntime
  4. Add the Newtonsoft.Json NuGet package by using the NuGet Package Manager Console.
    Install-Package Newtonsoft.Json
  5. Create a new class and name it Clients. Add the following code:
    using System;
    using System.Collections.Generic;
    using System.Text;
    using Microsoft.Azure.CognitiveServices.ContentModerator;
    using Newtonsoft.Json;
    using System.IO;
    using System.Threading;
    
    namespace ContentModerator
    {
    
        public static class Clients
        {
            // The region/location for your Content Moderator account
            private static readonly string AzureRegion = "westeurope";
    
            // The base URL fragment for Content Moderator calls.
            private static readonly string AzureBaseURL =
                $"https://{AzureRegion}.api.cognitive.microsoft.com";
    
            // Your Content Moderator subscription key.
            private static readonly string CMSubscriptionKey = "ENTER KEY HERE";
    
            // Returns a new Content Moderator client for your subscription.
            public static ContentModeratorClient NewClient()
            {
                // Create and initialize an instance of the Content Moderator API wrapper.
                ContentModeratorClient client = new ContentModeratorClient(new ApiKeyServiceClientCredentials(CMSubscriptionKey));
    
                client.Endpoint = AzureBaseURL;
                return client;
            }
        }
    }
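  6. In the code above, replace the Azure region and the Content Moderator key with the values from the Azure Portal.
  7. Open the Program.cs file and add the following code inside Main. The file also needs using directives for System.IO, Newtonsoft.Json and Microsoft.Azure.CognitiveServices.ContentModerator.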
    // Load the input text.
    string text = "Damn it. Lzay dog is cute. Is this a grabage or crap email abcdef@test.com, phone: 6657789887, IP: 255.255.255.255, 1 Microsoft Way, Redmond, WA 98052. These are all UK phone numbers, the last two being Microsoft UK support numbers: +44 870 608 4000 or 0344 800 2400 or 0800 820 3300. Also, 999-99-9999 looks like a social security number (SSN).";

    Console.WriteLine("Screening {0}", text);

    text = text.Replace(System.Environment.NewLine, " ");
    byte[] byteArray = System.Text.Encoding.UTF8.GetBytes(text);
    MemoryStream stream = new MemoryStream(byteArray);
    // Create a Content Moderator client and evaluate the text.
    using (var client = Clients.NewClient())
    {
        // Screen the input text: check for profanity,
        // autocorrect text, check for personally identifying
        // information (PII), and classify the text into three categories
        var screenResult =
            client.TextModeration.ScreenText("text/plain", stream, "eng", true, true, null, true);
        Console.WriteLine(
            JsonConvert.SerializeObject(screenResult, Formatting.Indented));
    }
    Console.ReadLine();
  8. That’s it. Populate the text variable with your text and run the program to get the results.
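
Step 6 pastes the subscription key straight into the source file, which makes it easy to commit by accident. As a minimal sketch, the key and region could instead be read from environment variables. The class name ModeratorSettings and the variable names CONTENT_MODERATOR_KEY and CONTENT_MODERATOR_REGION are hypothetical choices for this example, not something the SDK defines.

```csharp
using System;

namespace ContentModerator
{
    // Hypothetical helper: reads the Content Moderator settings from
    // environment variables instead of hard-coded constants.
    public static class ModeratorSettings
    {
        // Assumed variable names; pick whatever fits your environment.
        public const string KeyVariable = "CONTENT_MODERATOR_KEY";
        public const string RegionVariable = "CONTENT_MODERATOR_REGION";

        // Returns the subscription key, or throws if it has not been set.
        public static string GetSubscriptionKey()
        {
            string key = Environment.GetEnvironmentVariable(KeyVariable);
            if (string.IsNullOrWhiteSpace(key))
            {
                throw new InvalidOperationException(
                    "Set the " + KeyVariable + " environment variable first.");
            }
            return key.Trim();
        }

        // Builds the base URL in the same shape as Clients.AzureBaseURL,
        // falling back to a default region when none is configured.
        public static string GetBaseUrl(string defaultRegion = "westeurope")
        {
            string region = Environment.GetEnvironmentVariable(RegionVariable);
            if (string.IsNullOrWhiteSpace(region))
            {
                region = defaultRegion;
            }
            return $"https://{region.Trim()}.api.cognitive.microsoft.com";
        }
    }
}
```

With this in place, Clients.NewClient could pass ModeratorSettings.GetSubscriptionKey() to ApiKeyServiceClientCredentials and assign ModeratorSettings.GetBaseUrl() to client.Endpoint.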

You can find the complete source code on GitHub in this repository.

Result Analysis

The result is returned as JSON. Each section is explained below.

Profanity
If the API detects any profane terms in any of the supported languages, those terms are included in the response. The response also contains their location (Index) in the original text. The ListId in the following sample JSON refers to terms found in custom term lists if available.

"Terms": [{
      "Index": 0,
      "OriginalIndex": 0,
      "ListId": 0,
      "Term": "damn"
    },
    {
      "Index": 48,
      "OriginalIndex": 48,
      "ListId": 0,
      "Term": "crap"
    }
  ]
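
Because each term comes with its position, the matches can be used to mask the text before it is displayed. The following is a rough sketch of that idea, not part of the SDK: ProfanityMasker is a hypothetical helper that takes the index and term values exactly as they appear in the JSON above.

```csharp
using System;
using System.Collections.Generic;

namespace ContentModerator
{
    // Hypothetical helper: overwrites every reported term with a mask character.
    public static class ProfanityMasker
    {
        public static string Mask(
            string text,
            IEnumerable<(int Index, string Term)> terms,
            char maskChar = '*')
        {
            if (text == null) throw new ArgumentNullException(nameof(text));
            if (terms == null) throw new ArgumentNullException(nameof(terms));

            char[] chars = text.ToCharArray();
            foreach (var (index, term) in terms)
            {
                // Skip entries that cannot be mapped onto the text.
                if (string.IsNullOrEmpty(term) || index < 0) continue;

                // Clamp the end so a bad index never runs past the text.
                int end = Math.Min(index + term.Length, chars.Length);
                for (int i = index; i < end; i++)
                {
                    chars[i] = maskChar;
                }
            }
            return new string(chars);
        }
    }
}
```

Calling Mask with the sample sentence and the pairs (0, "damn") and (48, "crap") replaces both words with four asterisks each. Masking in place keeps the text length unchanged, so the remaining indices in the response stay valid.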

Classification
Content Moderator’s machine-assisted text classification feature supports English only and helps detect potentially undesired content. The flagged content may be assessed as inappropriate depending on context. The feature conveys the likelihood of each category and may recommend a human review. It uses a trained model to identify possibly abusive, derogatory or discriminatory language, including slang, abbreviated words, and offensive or intentionally misspelled words.

Category1 refers to potential presence of language that may be considered sexually explicit or adult in certain situations.
Category2 refers to potential presence of language that may be considered sexually suggestive or mature in certain situations.
Category3 refers to potential presence of language that may be considered offensive in certain situations.
Score is between 0 and 1. The higher the score, the more strongly the model predicts that the category may apply.
ReviewRecommended is either true or false depending on the internal score thresholds.

"Classification": {
    "Category1": {
      "Score": 2.8099420887883753E-06
    },
    "Category2": {
      "Score": 0.098273187875747681
    },
    "Category3": {
      "Score": 0.98799997568130493
    },
    "ReviewRecommended": true
  },
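
If you want your own cut-off in addition to ReviewRecommended, the three scores can be compared against a threshold of your choosing. The sketch below assumes a threshold of 0.5 purely for illustration; the service's internal thresholds are not documented here, and ClassificationReport is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

namespace ContentModerator
{
    // Hypothetical helper: lists the categories whose score reaches a threshold.
    public static class ClassificationReport
    {
        // Descriptions follow the category explanations given above.
        private static readonly string[] Labels =
        {
            "sexually explicit or adult",
            "sexually suggestive or mature",
            "offensive"
        };

        // scores must hold the Category1, Category2 and Category3 scores, in order.
        // The 0.5 default is an arbitrary example value, not the service's threshold.
        public static IList<string> Describe(IReadOnlyList<double> scores, double threshold = 0.5)
        {
            if (scores == null) throw new ArgumentNullException(nameof(scores));
            if (scores.Count != Labels.Length)
            {
                throw new ArgumentException("Expected exactly three scores.", nameof(scores));
            }

            var lines = new List<string>();
            for (int i = 0; i < scores.Count; i++)
            {
                if (scores[i] >= threshold)
                {
                    lines.Add(string.Format(
                        CultureInfo.InvariantCulture,
                        "Category{0} ({1}): {2:F3}", i + 1, Labels[i], scores[i]));
                }
            }
            return lines;
        }
    }
}
```

For the sample scores above, only Category3 reaches 0.5, so Describe returns the single line "Category3 (offensive): 0.988".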

Personally Identifiable Information (PII)
The PII feature detects the potential presence of this information:

  • Email address
  • US Mailing address
  • IP address
  • US Phone number
  • UK Phone number
  • Social Security Number (SSN)

"PII": {
   "Email": [{
     "Detected": "abcdef@test.com",
     "SubType": "Regular",
     "Text": "abcdef@test.com",
     "Index": 59
   }],
   "SSN": [{
     "Text": "665778988",
     "Index": 83
   }],
   "IPA": [{
     "SubType": "IPV4",
     "Text": "255.255.255.255",
     "Index": 99
   }],
   "Phone": [{
       "CountryCode": "US",
       "Text": "6657789887",
       "Index": 83
     },
     {
       "CountryCode": "US",
       "Text": "870 608 4000",
       "Index": 237
     },
     {
       "CountryCode": "UK",
       "Text": "+44 870 608 4000",
       "Index": 233
     },
     {
       "CountryCode": "UK",
       "Text": "0344 800 2400",
       "Index": 253
     },
     {
       "CountryCode": "UK",
       "Text": "0800 820 3300",
       "Index": 270
     }
   ],
   "Address": [{
     "Text": "1 Microsoft Way, Redmond, WA 98052",
     "Index": 116
   }]
 },
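
Findings can overlap: in the sample above, the SSN match at index 83 lies inside the phone number at the same index, and the US phone match at index 237 lies inside the UK number that starts at 233. The following sketch shows one possible way to redact the text with a label per finding while handling such overlaps. PiiRedactor and the tuple shape are hypothetical; the Kind value is simply whatever label you want to print, for example the JSON property name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace ContentModerator
{
    // Hypothetical helper: replaces each PII finding with a "[Kind]" label.
    public static class PiiRedactor
    {
        public static string Redact(
            string text,
            IEnumerable<(string Kind, int Index, string Text)> findings)
        {
            if (text == null) throw new ArgumentNullException(nameof(text));
            if (findings == null) throw new ArgumentNullException(nameof(findings));

            // Process left to right; at the same start position the longest match wins.
            var ordered = findings
                .Where(f => !string.IsNullOrEmpty(f.Text)
                            && f.Index >= 0
                            && f.Index + f.Text.Length <= text.Length)
                .OrderBy(f => f.Index)
                .ThenByDescending(f => f.Text.Length);

            var result = new StringBuilder();
            int position = 0; // first character not yet copied or redacted
            foreach (var finding in ordered)
            {
                int end = finding.Index + finding.Text.Length;
                if (finding.Index < position)
                {
                    // Overlaps a span that is already redacted: no second label,
                    // but extend the redacted range so no tail leaks through.
                    position = Math.Max(position, end);
                    continue;
                }
                result.Append(text, position, finding.Index - position);
                result.Append('[').Append(finding.Kind).Append(']');
                position = end;
            }
            result.Append(text, position, text.Length - position);
            return result.ToString();
        }
    }
}
```

Labels change the length of the text, so the indices in the response no longer apply to the redacted string. Redact once, from the original text, rather than in several passes.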

Auto-correction

The API auto-corrects spelling mistakes, as the AutoCorrectedText field of the response shows:

  • grabage to garbage
  • Lzay to Lazy

"OriginalText": "Damn it. Lzay dog is cute. Is this a grabage or crap email abcdef@test.com, phone: 6657789887, IP: 255.255.255.255, 1 Microsoft Way, Redmond, WA 98052. These are all UK phone numbers, the last two being Microsoft UK support numbers: +44 870 608 4000 or 0344 800 2400 or 0800 820 3300. Also, 999-99-9999 looks like a social security number (SSN).",
  "NormalizedText": "Damn it. Lazy dog is cute. Is this a garbage or crap email abide@ test. com, phone: 6657789887, IP: 255. 255. 255. 255, 1 Microsoft Way, Redmond, WA 98052. These are all UK phone numbers, the last two being Microsoft UK support numbers: +44 870 608 4000 or 0344 800 2400 or 0800 820 3300. Also, 999- 99- 9999 looks like a social security number ( SSN) .",
  "AutoCorrectedText": "Damn it. Lazy dog is cute. Is this a garbage or crap email abide@ test. com, phone: 6657789887, IP: 255. 255. 255. 255, 1 Microsoft Way, Redmond, WA 98052. These are all UK phone numbers, the last two being Microsoft UK support numbers: +44 870 608 4000 or 0344 800 2400 or 0800 820 3300. Also, 999- 99- 9999 looks like a social security number ( SSN) ."

The complete JSON result for the above text is shown below:

{
  "OriginalText": "Damn it. Lzay dog is cute. Is this a grabage or crap email abcdef@test.com, phone: 6657789887, IP: 255.255.255.255, 1 Microsoft Way, Redmond, WA 98052. These are all UK phone numbers, the last two being Microsoft UK support numbers: +44 870 608 4000 or 0344 800 2400 or 0800 820 3300. Also, 999-99-9999 looks like a social security number (SSN).",
  "NormalizedText": "Damn it. Lazy dog is cute. Is this a garbage or crap email abide@ test. com, phone: 6657789887, IP: 255. 255. 255. 255, 1 Microsoft Way, Redmond, WA 98052. These are all UK phone numbers, the last two being Microsoft UK support numbers: +44 870 608 4000 or 0344 800 2400 or 0800 820 3300. Also, 999- 99- 9999 looks like a social security number ( SSN) .",
  "AutoCorrectedText": "Damn it. Lazy dog is cute. Is this a garbage or crap email abide@ test. com, phone: 6657789887, IP: 255. 255. 255. 255, 1 Microsoft Way, Redmond, WA 98052. These are all UK phone numbers, the last two being Microsoft UK support numbers: +44 870 608 4000 or 0344 800 2400 or 0800 820 3300. Also, 999- 99- 9999 looks like a social security number ( SSN) .",
  "Misrepresentation": null,
  "Classification": {
    "Category1": {
      "Score": 2.8099420887883753E-06
    },
    "Category2": {
      "Score": 0.098273187875747681
    },
    "Category3": {
      "Score": 0.98799997568130493
    },
    "ReviewRecommended": true
  },
  "Status": {
    "Code": 3000,
    "Description": "OK",
    "Exception": null
  },
  "PII": {
    "Email": [{
      "Detected": "abcdef@test.com",
      "SubType": "Regular",
      "Text": "abcdef@test.com",
      "Index": 59
    }],
    "SSN": [{
      "Text": "665778988",
      "Index": 83
    }],
    "IPA": [{
      "SubType": "IPV4",
      "Text": "255.255.255.255",
      "Index": 99
    }],
    "Phone": [{
        "CountryCode": "US",
        "Text": "6657789887",
        "Index": 83
      },
      {
        "CountryCode": "US",
        "Text": "870 608 4000",
        "Index": 237
      },
      {
        "CountryCode": "UK",
        "Text": "+44 870 608 4000",
        "Index": 233
      },
      {
        "CountryCode": "UK",
        "Text": "0344 800 2400",
        "Index": 253
      },
      {
        "CountryCode": "UK",
        "Text": "0800 820 3300",
        "Index": 270
      }
    ],
    "Address": [{
      "Text": "1 Microsoft Way, Redmond, WA 98052",
      "Index": 116
    }]
  },
  "Language": "eng",
  "Terms": [{
      "Index": 0,
      "OriginalIndex": 0,
      "ListId": 0,
      "Term": "damn"
    },
    {
      "Index": 48,
      "OriginalIndex": 48,
      "ListId": 0,
      "Term": "crap"
    }
  ],
  "TrackingId": "e8e458c4-8aa6-4849-aeb5-178c8afd7692"
}

 
