Commonest[] function doesn't actually show commonest elementsWhat stopwords list is the Wolfram language...

Is this Paypal Github SDK reference really a dangerous site?

Who has more? Ireland or Iceland?

I am the light that shines in the dark

“I had a flat in the centre of town, but I didn’t like living there, so …”

Why is there an extra space when I type "ls" on the Desktop?

Paper published similar to PhD thesis

Why would /etc/passwd be used every time someone executes `ls -l` command?

How to educate team mate to take screenshots for bugs with out unwanted stuff

Increase the space between numerator and denominator

Averaging over columns while ignoring zero entries

Instances in Shas where Tosfos quotes the Rambam?

Why do we say 'Pairwise Disjoint', rather than 'Disjoint'?

Why do we call complex numbers “numbers” but we don’t consider 2-vectors numbers?

I am the person who abides by rules but breaks the rules . Who am I

I've given my players a lot of magic items. Is it reasonable for me to give them harder encounters?

What can I do if someone tampers with my SSH public key?

Why aren't there more Gauls like Obelix?

SHA 256 Algorithm

Can inspiration allow the Rogue to make a Sneak Attack?

How do you make a gun that shoots melee weapons and/or swords?

Short story about an infectious indestructible metal bar?

Sort Array By Month & Year | JavaScript

How to make sure I'm assertive enough in contact with subordinates?

Boss Telling direct supervisor I snitched

Commonest[] function doesn't actually show commonest elements

What stopwords list is the Wolfram language using?Issues with a Counter that is tallying term appearancesHow does `LongestCommonSubsequence` work?Sorting an Array with words in different languagestext analysis: split document in seperated linesGraph showing valid English words obtained by insertion of single charactersAnalyzing frequency of individual letters in a large body of textHow to translate/convert UTF-16 code to its corresponding word/characters by Mathematica?Helping Mandy become a better spellerMost common prefixes in an English corpus (Hamlet)Optimisation of a loop

I'm using Commonest function on a list with 500000 words to get 10 most frequent elements. Then by using WordCloud I found out that the most frequent word is actually "far", and then checked it by StringCount. So the thing I would like to know is why results from WordCloud and Commonest are so different, and how to make Commonest work properly?

File with words I used, also here

(Sorry for all mistakes, English is only my 3rd language..)

session

edited 17 hours ago

J. M. is computer-less♦

97.3k10303463

asked yesterday

Aplefull

453

$begingroup$
It's because "far" is a part of many of the words, e.g., "afar", "farsighted", "airfare", etc.
$endgroup$
– Carl Woll
yesterday

$begingroup$
I removed the bugs tag for now, in accordance with the tag description.
$endgroup$
– Szabolcs
yesterday

$begingroup$
@CarlWoll Commonest definitely works correctly (can be verified with Counts/Tally). As for the WordCloud behaviour, I'm very sceptical about this, I would not consider it correct ... bug?
$endgroup$
– Szabolcs
yesterday

$begingroup$
@Szabolcs The bug is within DeleteStopwords, if it is a bug - I show it at the bottom of my answer.
$endgroup$
– Carl Lange
yesterday

$begingroup$
It seems to me like DeleteStopwords, which WordCloud uses by default, is deleting a lot of words that it ought not be deleting, such as "custom-made", "runner-up", "interest" and so on (Complement[TextWords[txt], TextWords[DeleteStopwords[txt]]]). Perhaps it would be good to report this to WRI and see if they consider it a bug. OP, I would rename this question something like "WordCloud processes input text badly".
$endgroup$
– Carl Lange
yesterday

|
show 1 more comment

File with words I used, also here

(Sorry for all mistakes, English is only my 3rd language..)

session

edited 17 hours ago

J. M. is computer-less♦

97.3k10303463

asked yesterday

Aplefull

453

$begingroup$
It's because "far" is a part of many of the words, e.g., "afar", "farsighted", "airfare", etc.
$endgroup$
– Carl Woll
yesterday

$begingroup$
I removed the bugs tag for now, in accordance with the tag description.
$endgroup$
– Szabolcs
yesterday

$begingroup$
@CarlWoll Commonest definitely works correctly (can be verified with Counts/Tally). As for the WordCloud behaviour, I'm very sceptical about this, I would not consider it correct ... bug?
$endgroup$
– Szabolcs
yesterday

$begingroup$
@Szabolcs The bug is within DeleteStopwords, if it is a bug - I show it at the bottom of my answer.
$endgroup$
– Carl Lange
yesterday

$begingroup$
It seems to me like DeleteStopwords, which WordCloud uses by default, is deleting a lot of words that it ought not be deleting, such as "custom-made", "runner-up", "interest" and so on (Complement[TextWords[txt], TextWords[DeleteStopwords[txt]]]). Perhaps it would be good to report this to WRI and see if they consider it a bug. OP, I would rename this question something like "WordCloud processes input text badly".
$endgroup$
– Carl Lange
yesterday

|
show 1 more comment

File with words I used, also here

(Sorry for all mistakes, English is only my 3rd language..)

session

edited 17 hours ago

J. M. is computer-less♦

97.3k10303463

asked yesterday

Aplefull

453

File with words I used, also here

(Sorry for all mistakes, English is only my 3rd language..)

session

list-manipulation string-manipulation

edited 17 hours ago

J. M. is computer-less♦

97.3k10303463

asked yesterday

Aplefull

453

edited 17 hours ago

J. M. is computer-less♦

97.3k10303463

asked yesterday

Aplefull

453

edited 17 hours ago

J. M. is computer-less♦

97.3k10303463

edited 17 hours ago

J. M. is computer-less♦

97.3k10303463

edited 17 hours ago

J. M. is computer-less♦

97.3k10303463

asked yesterday

Aplefull

453

asked yesterday

Aplefull

453

asked yesterday

Aplefull

453

$begingroup$
It's because "far" is a part of many of the words, e.g., "afar", "farsighted", "airfare", etc.
$endgroup$
– Carl Woll
yesterday

$begingroup$
I removed the bugs tag for now, in accordance with the tag description.
$endgroup$
– Szabolcs
yesterday

$begingroup$
@CarlWoll Commonest definitely works correctly (can be verified with Counts/Tally). As for the WordCloud behaviour, I'm very sceptical about this, I would not consider it correct ... bug?
$endgroup$
– Szabolcs
yesterday

$begingroup$
@Szabolcs The bug is within DeleteStopwords, if it is a bug - I show it at the bottom of my answer.
$endgroup$
– Carl Lange
yesterday

$begingroup$
It seems to me like DeleteStopwords, which WordCloud uses by default, is deleting a lot of words that it ought not be deleting, such as "custom-made", "runner-up", "interest" and so on (Complement[TextWords[txt], TextWords[DeleteStopwords[txt]]]). Perhaps it would be good to report this to WRI and see if they consider it a bug. OP, I would rename this question something like "WordCloud processes input text badly".
$endgroup$
– Carl Lange
yesterday

|
show 1 more comment

$begingroup$
It's because "far" is a part of many of the words, e.g., "afar", "farsighted", "airfare", etc.
$endgroup$
– Carl Woll
yesterday

$begingroup$
I removed the bugs tag for now, in accordance with the tag description.
$endgroup$
– Szabolcs
yesterday

$begingroup$
@CarlWoll Commonest definitely works correctly (can be verified with Counts/Tally). As for the WordCloud behaviour, I'm very sceptical about this, I would not consider it correct ... bug?
$endgroup$
– Szabolcs
yesterday

$begingroup$
@Szabolcs The bug is within DeleteStopwords, if it is a bug - I show it at the bottom of my answer.
$endgroup$
– Carl Lange
yesterday

$begingroup$
It seems to me like DeleteStopwords, which WordCloud uses by default, is deleting a lot of words that it ought not be deleting, such as "custom-made", "runner-up", "interest" and so on (Complement[TextWords[txt], TextWords[DeleteStopwords[txt]]]). Perhaps it would be good to report this to WRI and see if they consider it a bug. OP, I would rename this question something like "WordCloud processes input text badly".
$endgroup$
– Carl Lange
yesterday

It's because "far" is a part of many of the words, e.g., "afar", "farsighted", "airfare", etc.

– Carl Woll
yesterday

I removed the bugs tag for now, in accordance with the tag description.

– Szabolcs
yesterday

@CarlWoll Commonest definitely works correctly (can be verified with Counts/Tally). As for the WordCloud behaviour, I'm very sceptical about this, I would not consider it correct ... bug?

– Szabolcs
yesterday

@Szabolcs The bug is within DeleteStopwords, if it is a bug - I show it at the bottom of my answer.

– Carl Lange
yesterday

It seems to me like DeleteStopwords, which WordCloud uses by default, is deleting a lot of words that it ought not be deleting, such as "custom-made", "runner-up", "interest" and so on (Complement[TextWords[txt], TextWords[DeleteStopwords[txt]]]). Perhaps it would be good to report this to WRI and see if they consider it a bug. OP, I would rename this question something like "WordCloud processes input text badly".

– Carl Lange
yesterday

|
show 1 more comment

2 Answers
2

active

oldest

votes

You should be using TextWords to segment your data into words. Things like StringCount[data, "far"] will also count "fart".

Commonest[TextWords[txt], 10]

{"affirm", "calligrapher", "squander", "validly", "autoimmune", "equation", "nematode", "veronica", "crispness", "ashen"}

WordCloud[TextWords[txt]]

enter image description here

You can use Counts to get the counts of each word as well:

TakeLargest[Counts[TextWords[txt]], 20]

<|"affirm" -> 29, "equation" -> 28, "veronica" -> 28, "ashen" -> 28, "crispness" -> 28, "knacker" -> 27, "validly" -> 27, "squander" -> 27, "nematode" -> 27, "autoimmune" -> 27, "calligrapher" -> 27, "pus" -> 26, "sledding" -> 26, "tablecloth" -> 26, "inclusive" -> 26, "variegated" -> 26, "gastrointestinal" -> 26, "undercoat" -> 26, "washout" -> 26, "reconnoitering" -> 26|>

It seems to me that the issue with WordCloud is actually an issue within DeleteStopwords, which WordCloud is using internally when the input is a string.

enter image description here

You can prevent WordCloud from using DeleteStopwords by passing PreprocessingRules -> None:

enter image description here

It seems to me that DeleteStopwords is deleting many words that perhaps it shouldn't be:

Complement[TextWords[txt], TextWords[DeleteStopwords[txt]]]

{"a", "about", "above", "across", "add-on", "after", "again", 

"against", "all", "almost", "alone", "along", "already", "also", 

"although", "always", "among", "an", "and", "another", "any", 

"anyone", "anything", "anywhere", "are", "around", "as", "at", 

"back", "back-to-back", "be", "because", "become", "before", 

"behind", "being", "below", "between", "born-again", "both", 

"built-in", "but", "by", "can-do", "custom-made", "do", "done", 

"down", "during", "each", "either", "enough", "even", "ever", 

"every", "everyone", "everything", "everywhere", "far-off", 

"far-out", "few", "find", "first", "for", "four", "from", "full", 

"further", "get", "give", "go", "have-not", "he", "head-on", "her", 

"here", "hers", "herself", "him", "himself", "his", "how", "however", 

"if", "in", "interest", "into", "it", "its", "itself", "keep", 

"laid-back", "last", "least", "less", "ma'am", "made", "man-made", 

"many", "may", "me", "might", "more", "most", "mostly", "much", 

"must", "my", "myself", "never", "next", "nobody", "nor", "no-show", 

"not", "nothing", "now", "nowhere", "of", "off", "often", "on", 

"once", "one", "only", "other", "our", "ours", "ourselves", "out", 

"over", "own", "part", "per", "perhaps", "put", "rather", 

"runner-up", "same", "seem", "seeming", "see-through", 

"self-interest", "self-made", "several", "she", "show", "side", 

"since", "sit-in", "so", "some", "someone", "something", "somewhere", 

"still", "such", "take", "than", "that", "the", "their", "theirs", 

"them", "themselves", "then", "there", "therefore", "these", "they", 

"this", "those", "though", "three", "through", "thus", "to", 

"together", "too", "toward", "two", "under", "until", "up", "upon", 

"us", "very", "we", "well", "well-to-do", "what", "when", "where", 

"where's", "whether", "which", "while", "who", "whole", "whom", 

"whose", "why", "will", "with", "within", "without", "would-be", 

"write-off", "yet", "you", "your", "yours", "yourself"}

I agree with some of those stopwords, but not really any of them that contain the - character. This is perhaps where the issue lies.

What appears to be happening is that DeleteStopwords is deleting part of some words, and what's left over is counted. We can see the outcome:

Counts[TextWords[txt]]["far"]

19

Counts[TextWords[DeleteStopwords[txt]]]["far"]

39

We can see that this behaviour is weird by comparing the following:

Select[TextWords[txt], StringStartsQ["far"]] // Counts // ReverseSort

<|"farinaceous" -> 19, "far" -> 19, "fare" -> 19, "faro" -> 18, "farther" -> 17, "farmer" -> 17, "farcical" -> 17, "farthing" -> 17, "faraway" -> 16, "farmstead" -> 16, "farrier" -> 15, "farthermost" -> 14, "far-off" -> 14, "farming" -> 14, "farrago" -> 13, "farm" -> 13, "farcically" -> 13, "farrowing" -> 12, "farce" -> 11, "farsighted" -> 11, "farmland" -> 10, "farsightedness" -> 10, "farmhouse" -> 9, "farseeing" -> 9, "farad" -> 8, "farina" -> 8, "farthest" -> 8, "farmhand" -> 7, "farewell" -> 7, "farrow" -> 6, "farmyard" -> 6, "far-out" -> 6|>

Select[TextWords[DeleteStopwords@txt], StringStartsQ["far"]] // Counts // ReverseSort

<|"far" -> 39, "farinaceous" -> 19, "fare" -> 19, "faro" -> 18, "farther" -> 17, "farmer" -> 17, "farcical" -> 17, "farthing" -> 17, "faraway" -> 16, "farmstead" -> 16, "farrier" -> 15, "farthermost" -> 14, "farming" -> 14, "farrago" -> 13, "farm" -> 13, "farcically" -> 13, "farrowing" -> 12, "farce" -> 11, "farsighted" -> 11, "farmland" -> 10, "farsightedness" -> 10, "farmhouse" -> 9, "farseeing" -> 9, "farad" -> 8, "farina" -> 8, "farthest" -> 8, "farmhand" -> 7, "farewell" -> 7, "farrow" -> 6, "farmyard" -> 6|>

Here we can see that DeleteStopwords is replacing "far-out" and "far-off" with "far-", which is segmented to "far" by TextWords, which completely throws off WordCloud's counting mechanism in this case.

edited yesterday

answered yesterday

Carl Lange

4,2281735

$begingroup$
What you observed about DeleteStopwords[] is the same issue I touched on here.
$endgroup$
– J. M. is computer-less♦
15 hours ago

add a comment |

As already noted by Carl, you have blamed the wrong function. Had you imported the text file in the proper format, you would have gotten the expected results:

words = Import["https://pastebin.com/raw/Z0hd3huU", "List"];



AllTrue[words, StringQ]

   True



Take[words, 10]

   {"inconsiderate", "weighting", "unneeded", "issuing", "intemperately", "perverse",

    "disgruntled", "ninja", "artificially", "seduce"}

Note that this import format already yielded a list of strings as opposed to a single string.

Commonest[words, 10]

   {"affirm", "calligrapher", "squander", "validly", "autoimmune", "equation", "nematode",

    "veronica", "crispness", "ashen"}



TakeLargest[Counts[words], 10]

   <|"affirm" -> 29, "equation" -> 28, "ashen" -> 28, "crispness" -> 28, "veronica" -> 28,

     "squander" -> 27, "validly" -> 27, "calligrapher" -> 27, "autoimmune" -> 27,

     "nematode" -> 27|>



WordCloud[words]

word cloud

answered 16 hours ago

J. M. is computer-less♦

97.3k10303463

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\$","\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "387"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmathematica.stackexchange.com%2fquestions%2f192806%2fcommonest-function-doesnt-actually-show-commonest-elements%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

You should be using TextWords to segment your data into words. Things like StringCount[data, "far"] will also count "fart".

Commonest[TextWords[txt], 10]

{"affirm", "calligrapher", "squander", "validly", "autoimmune", "equation", "nematode", "veronica", "crispness", "ashen"}

WordCloud[TextWords[txt]]

enter image description here

You can use Counts to get the counts of each word as well:

TakeLargest[Counts[TextWords[txt]], 20]

<|"affirm" -> 29, "equation" -> 28, "veronica" -> 28, "ashen" -> 28, "crispness" -> 28, "knacker" -> 27, "validly" -> 27, "squander" -> 27, "nematode" -> 27, "autoimmune" -> 27, "calligrapher" -> 27, "pus" -> 26, "sledding" -> 26, "tablecloth" -> 26, "inclusive" -> 26, "variegated" -> 26, "gastrointestinal" -> 26, "undercoat" -> 26, "washout" -> 26, "reconnoitering" -> 26|>

It seems to me that the issue with WordCloud is actually an issue within DeleteStopwords, which WordCloud is using internally when the input is a string.

enter image description here

You can prevent WordCloud from using DeleteStopwords by passing PreprocessingRules -> None:

enter image description here

It seems to me that DeleteStopwords is deleting many words that perhaps it shouldn't be:

Complement[TextWords[txt], TextWords[DeleteStopwords[txt]]]

{"a", "about", "above", "across", "add-on", "after", "again", 

"against", "all", "almost", "alone", "along", "already", "also", 

"although", "always", "among", "an", "and", "another", "any", 

"anyone", "anything", "anywhere", "are", "around", "as", "at", 

"back", "back-to-back", "be", "because", "become", "before", 

"behind", "being", "below", "between", "born-again", "both", 

"built-in", "but", "by", "can-do", "custom-made", "do", "done", 

"down", "during", "each", "either", "enough", "even", "ever", 

"every", "everyone", "everything", "everywhere", "far-off", 

"far-out", "few", "find", "first", "for", "four", "from", "full", 

"further", "get", "give", "go", "have-not", "he", "head-on", "her", 

"here", "hers", "herself", "him", "himself", "his", "how", "however", 

"if", "in", "interest", "into", "it", "its", "itself", "keep", 

"laid-back", "last", "least", "less", "ma'am", "made", "man-made", 

"many", "may", "me", "might", "more", "most", "mostly", "much", 

"must", "my", "myself", "never", "next", "nobody", "nor", "no-show", 

"not", "nothing", "now", "nowhere", "of", "off", "often", "on", 

"once", "one", "only", "other", "our", "ours", "ourselves", "out", 

"over", "own", "part", "per", "perhaps", "put", "rather", 

"runner-up", "same", "seem", "seeming", "see-through", 

"self-interest", "self-made", "several", "she", "show", "side", 

"since", "sit-in", "so", "some", "someone", "something", "somewhere", 

"still", "such", "take", "than", "that", "the", "their", "theirs", 

"them", "themselves", "then", "there", "therefore", "these", "they", 

"this", "those", "though", "three", "through", "thus", "to", 

"together", "too", "toward", "two", "under", "until", "up", "upon", 

"us", "very", "we", "well", "well-to-do", "what", "when", "where", 

"where's", "whether", "which", "while", "who", "whole", "whom", 

"whose", "why", "will", "with", "within", "without", "would-be", 

"write-off", "yet", "you", "your", "yours", "yourself"}

I agree with some of those stopwords, but not really any of them that contain the - character. This is perhaps where the issue lies.

What appears to be happening is that DeleteStopwords is deleting part of some words, and what's left over is counted. We can see the outcome:

Counts[TextWords[txt]]["far"]

19

Counts[TextWords[DeleteStopwords[txt]]]["far"]

39

We can see that this behaviour is weird by comparing the following:

Select[TextWords[txt], StringStartsQ["far"]] // Counts // ReverseSort

<|"farinaceous" -> 19, "far" -> 19, "fare" -> 19, "faro" -> 18, "farther" -> 17, "farmer" -> 17, "farcical" -> 17, "farthing" -> 17, "faraway" -> 16, "farmstead" -> 16, "farrier" -> 15, "farthermost" -> 14, "far-off" -> 14, "farming" -> 14, "farrago" -> 13, "farm" -> 13, "farcically" -> 13, "farrowing" -> 12, "farce" -> 11, "farsighted" -> 11, "farmland" -> 10, "farsightedness" -> 10, "farmhouse" -> 9, "farseeing" -> 9, "farad" -> 8, "farina" -> 8, "farthest" -> 8, "farmhand" -> 7, "farewell" -> 7, "farrow" -> 6, "farmyard" -> 6, "far-out" -> 6|>

Select[TextWords[DeleteStopwords@txt], StringStartsQ["far"]] // Counts // ReverseSort

<|"far" -> 39, "farinaceous" -> 19, "fare" -> 19, "faro" -> 18, "farther" -> 17, "farmer" -> 17, "farcical" -> 17, "farthing" -> 17, "faraway" -> 16, "farmstead" -> 16, "farrier" -> 15, "farthermost" -> 14, "farming" -> 14, "farrago" -> 13, "farm" -> 13, "farcically" -> 13, "farrowing" -> 12, "farce" -> 11, "farsighted" -> 11, "farmland" -> 10, "farsightedness" -> 10, "farmhouse" -> 9, "farseeing" -> 9, "farad" -> 8, "farina" -> 8, "farthest" -> 8, "farmhand" -> 7, "farewell" -> 7, "farrow" -> 6, "farmyard" -> 6|>

edited yesterday

answered yesterday

Carl Lange

4,2281735

$begingroup$
What you observed about DeleteStopwords[] is the same issue I touched on here.
$endgroup$
– J. M. is computer-less♦
15 hours ago

add a comment |

You should be using TextWords to segment your data into words. Things like StringCount[data, "far"] will also count "fart".

Commonest[TextWords[txt], 10]

{"affirm", "calligrapher", "squander", "validly", "autoimmune", "equation", "nematode", "veronica", "crispness", "ashen"}

WordCloud[TextWords[txt]]

enter image description here

You can use Counts to get the counts of each word as well:

TakeLargest[Counts[TextWords[txt]], 20]

<|"affirm" -> 29, "equation" -> 28, "veronica" -> 28, "ashen" -> 28, "crispness" -> 28, "knacker" -> 27, "validly" -> 27, "squander" -> 27, "nematode" -> 27, "autoimmune" -> 27, "calligrapher" -> 27, "pus" -> 26, "sledding" -> 26, "tablecloth" -> 26, "inclusive" -> 26, "variegated" -> 26, "gastrointestinal" -> 26, "undercoat" -> 26, "washout" -> 26, "reconnoitering" -> 26|>

It seems to me that the issue with WordCloud is actually an issue within DeleteStopwords, which WordCloud is using internally when the input is a string.

enter image description here

You can prevent WordCloud from using DeleteStopwords by passing PreprocessingRules -> None:

enter image description here

It seems to me that DeleteStopwords is deleting many words that perhaps it shouldn't be:

Complement[TextWords[txt], TextWords[DeleteStopwords[txt]]]

{"a", "about", "above", "across", "add-on", "after", "again", 

"against", "all", "almost", "alone", "along", "already", "also", 

"although", "always", "among", "an", "and", "another", "any", 

"anyone", "anything", "anywhere", "are", "around", "as", "at", 

"back", "back-to-back", "be", "because", "become", "before", 

"behind", "being", "below", "between", "born-again", "both", 

"built-in", "but", "by", "can-do", "custom-made", "do", "done", 

"down", "during", "each", "either", "enough", "even", "ever", 

"every", "everyone", "everything", "everywhere", "far-off", 

"far-out", "few", "find", "first", "for", "four", "from", "full", 

"further", "get", "give", "go", "have-not", "he", "head-on", "her", 

"here", "hers", "herself", "him", "himself", "his", "how", "however", 

"if", "in", "interest", "into", "it", "its", "itself", "keep", 

"laid-back", "last", "least", "less", "ma'am", "made", "man-made", 

"many", "may", "me", "might", "more", "most", "mostly", "much", 

"must", "my", "myself", "never", "next", "nobody", "nor", "no-show", 

"not", "nothing", "now", "nowhere", "of", "off", "often", "on", 

"once", "one", "only", "other", "our", "ours", "ourselves", "out", 

"over", "own", "part", "per", "perhaps", "put", "rather", 

"runner-up", "same", "seem", "seeming", "see-through", 

"self-interest", "self-made", "several", "she", "show", "side", 

"since", "sit-in", "so", "some", "someone", "something", "somewhere", 

"still", "such", "take", "than", "that", "the", "their", "theirs", 

"them", "themselves", "then", "there", "therefore", "these", "they", 

"this", "those", "though", "three", "through", "thus", "to", 

"together", "too", "toward", "two", "under", "until", "up", "upon", 

"us", "very", "we", "well", "well-to-do", "what", "when", "where", 

"where's", "whether", "which", "while", "who", "whole", "whom", 

"whose", "why", "will", "with", "within", "without", "would-be", 

"write-off", "yet", "you", "your", "yours", "yourself"}

I agree with some of those stopwords, but not really any of them that contain the - character. This is perhaps where the issue lies.

What appears to be happening is that DeleteStopwords is deleting part of some words, and what's left over is counted. We can see the outcome:

Counts[TextWords[txt]]["far"]

19

Counts[TextWords[DeleteStopwords[txt]]]["far"]

39

We can see that this behaviour is weird by comparing the following:

Select[TextWords[txt], StringStartsQ["far"]] // Counts // ReverseSort

<|"farinaceous" -> 19, "far" -> 19, "fare" -> 19, "faro" -> 18, "farther" -> 17, "farmer" -> 17, "farcical" -> 17, "farthing" -> 17, "faraway" -> 16, "farmstead" -> 16, "farrier" -> 15, "farthermost" -> 14, "far-off" -> 14, "farming" -> 14, "farrago" -> 13, "farm" -> 13, "farcically" -> 13, "farrowing" -> 12, "farce" -> 11, "farsighted" -> 11, "farmland" -> 10, "farsightedness" -> 10, "farmhouse" -> 9, "farseeing" -> 9, "farad" -> 8, "farina" -> 8, "farthest" -> 8, "farmhand" -> 7, "farewell" -> 7, "farrow" -> 6, "farmyard" -> 6, "far-out" -> 6|>

Select[TextWords[DeleteStopwords@txt], StringStartsQ["far"]] // Counts // ReverseSort

<|"far" -> 39, "farinaceous" -> 19, "fare" -> 19, "faro" -> 18, "farther" -> 17, "farmer" -> 17, "farcical" -> 17, "farthing" -> 17, "faraway" -> 16, "farmstead" -> 16, "farrier" -> 15, "farthermost" -> 14, "farming" -> 14, "farrago" -> 13, "farm" -> 13, "farcically" -> 13, "farrowing" -> 12, "farce" -> 11, "farsighted" -> 11, "farmland" -> 10, "farsightedness" -> 10, "farmhouse" -> 9, "farseeing" -> 9, "farad" -> 8, "farina" -> 8, "farthest" -> 8, "farmhand" -> 7, "farewell" -> 7, "farrow" -> 6, "farmyard" -> 6|>

edited yesterday

answered yesterday

Carl Lange

4,2281735

$begingroup$
What you observed about DeleteStopwords[] is the same issue I touched on here.
$endgroup$
– J. M. is computer-less♦
15 hours ago

add a comment |

You should be using TextWords to segment your data into words. Things like StringCount[data, "far"] will also count "fart".

Commonest[TextWords[txt], 10]

{"affirm", "calligrapher", "squander", "validly", "autoimmune", "equation", "nematode", "veronica", "crispness", "ashen"}

WordCloud[TextWords[txt]]

enter image description here

You can use Counts to get the counts of each word as well:

TakeLargest[Counts[TextWords[txt]], 20]

<|"affirm" -> 29, "equation" -> 28, "veronica" -> 28, "ashen" -> 28, "crispness" -> 28, "knacker" -> 27, "validly" -> 27, "squander" -> 27, "nematode" -> 27, "autoimmune" -> 27, "calligrapher" -> 27, "pus" -> 26, "sledding" -> 26, "tablecloth" -> 26, "inclusive" -> 26, "variegated" -> 26, "gastrointestinal" -> 26, "undercoat" -> 26, "washout" -> 26, "reconnoitering" -> 26|>

It seems to me that the issue with WordCloud is actually an issue within DeleteStopwords, which WordCloud is using internally when the input is a string.

enter image description here

You can prevent WordCloud from using DeleteStopwords by passing PreprocessingRules -> None:

enter image description here

It seems to me that DeleteStopwords is deleting many words that perhaps it shouldn't be:

Complement[TextWords[txt], TextWords[DeleteStopwords[txt]]]

{"a", "about", "above", "across", "add-on", "after", "again", 

"against", "all", "almost", "alone", "along", "already", "also", 

"although", "always", "among", "an", "and", "another", "any", 

"anyone", "anything", "anywhere", "are", "around", "as", "at", 

"back", "back-to-back", "be", "because", "become", "before", 

"behind", "being", "below", "between", "born-again", "both", 

"built-in", "but", "by", "can-do", "custom-made", "do", "done", 

"down", "during", "each", "either", "enough", "even", "ever", 

"every", "everyone", "everything", "everywhere", "far-off", 

"far-out", "few", "find", "first", "for", "four", "from", "full", 

"further", "get", "give", "go", "have-not", "he", "head-on", "her", 

"here", "hers", "herself", "him", "himself", "his", "how", "however", 

"if", "in", "interest", "into", "it", "its", "itself", "keep", 

"laid-back", "last", "least", "less", "ma'am", "made", "man-made", 

"many", "may", "me", "might", "more", "most", "mostly", "much", 

"must", "my", "myself", "never", "next", "nobody", "nor", "no-show", 

"not", "nothing", "now", "nowhere", "of", "off", "often", "on", 

"once", "one", "only", "other", "our", "ours", "ourselves", "out", 

"over", "own", "part", "per", "perhaps", "put", "rather", 

"runner-up", "same", "seem", "seeming", "see-through", 

"self-interest", "self-made", "several", "she", "show", "side", 

"since", "sit-in", "so", "some", "someone", "something", "somewhere", 

"still", "such", "take", "than", "that", "the", "their", "theirs", 

"them", "themselves", "then", "there", "therefore", "these", "they", 

"this", "those", "though", "three", "through", "thus", "to", 

"together", "too", "toward", "two", "under", "until", "up", "upon", 

"us", "very", "we", "well", "well-to-do", "what", "when", "where", 

"where's", "whether", "which", "while", "who", "whole", "whom", 

"whose", "why", "will", "with", "within", "without", "would-be", 

"write-off", "yet", "you", "your", "yours", "yourself"}

I agree with some of those stopwords, but not really any of them that contain the - character. This is perhaps where the issue lies.

What appears to be happening is that DeleteStopwords is deleting part of some words, and what's left over is counted. We can see the outcome:

Counts[TextWords[txt]]["far"]

19

Counts[TextWords[DeleteStopwords[txt]]]["far"]

39

We can see that this behaviour is weird by comparing the following:

Select[TextWords[txt], StringStartsQ["far"]] // Counts // ReverseSort

<|"farinaceous" -> 19, "far" -> 19, "fare" -> 19, "faro" -> 18, "farther" -> 17, "farmer" -> 17, "farcical" -> 17, "farthing" -> 17, "faraway" -> 16, "farmstead" -> 16, "farrier" -> 15, "farthermost" -> 14, "far-off" -> 14, "farming" -> 14, "farrago" -> 13, "farm" -> 13, "farcically" -> 13, "farrowing" -> 12, "farce" -> 11, "farsighted" -> 11, "farmland" -> 10, "farsightedness" -> 10, "farmhouse" -> 9, "farseeing" -> 9, "farad" -> 8, "farina" -> 8, "farthest" -> 8, "farmhand" -> 7, "farewell" -> 7, "farrow" -> 6, "farmyard" -> 6, "far-out" -> 6|>

Select[TextWords[DeleteStopwords@txt], StringStartsQ["far"]] // Counts // ReverseSort

<|"far" -> 39, "farinaceous" -> 19, "fare" -> 19, "faro" -> 18, "farther" -> 17, "farmer" -> 17, "farcical" -> 17, "farthing" -> 17, "faraway" -> 16, "farmstead" -> 16, "farrier" -> 15, "farthermost" -> 14, "farming" -> 14, "farrago" -> 13, "farm" -> 13, "farcically" -> 13, "farrowing" -> 12, "farce" -> 11, "farsighted" -> 11, "farmland" -> 10, "farsightedness" -> 10, "farmhouse" -> 9, "farseeing" -> 9, "farad" -> 8, "farina" -> 8, "farthest" -> 8, "farmhand" -> 7, "farewell" -> 7, "farrow" -> 6, "farmyard" -> 6|>

edited yesterday

answered yesterday

Carl Lange

4,2281735

You should be using TextWords to segment your data into words. Things like StringCount[data, "far"] will also count "fart".

Commonest[TextWords[txt], 10]

{"affirm", "calligrapher", "squander", "validly", "autoimmune", "equation", "nematode", "veronica", "crispness", "ashen"}

WordCloud[TextWords[txt]]

enter image description here

You can use Counts to get the counts of each word as well:

TakeLargest[Counts[TextWords[txt]], 20]

<|"affirm" -> 29, "equation" -> 28, "veronica" -> 28, "ashen" -> 28, "crispness" -> 28, "knacker" -> 27, "validly" -> 27, "squander" -> 27, "nematode" -> 27, "autoimmune" -> 27, "calligrapher" -> 27, "pus" -> 26, "sledding" -> 26, "tablecloth" -> 26, "inclusive" -> 26, "variegated" -> 26, "gastrointestinal" -> 26, "undercoat" -> 26, "washout" -> 26, "reconnoitering" -> 26|>

It seems to me that the issue with WordCloud is actually an issue within DeleteStopwords, which WordCloud is using internally when the input is a string.

enter image description here

You can prevent WordCloud from using DeleteStopwords by passing PreprocessingRules -> None:

enter image description here

It seems to me that DeleteStopwords is deleting many words that perhaps it shouldn't be:

Complement[TextWords[txt], TextWords[DeleteStopwords[txt]]]

{"a", "about", "above", "across", "add-on", "after", "again", 

"against", "all", "almost", "alone", "along", "already", "also", 

"although", "always", "among", "an", "and", "another", "any", 

"anyone", "anything", "anywhere", "are", "around", "as", "at", 

"back", "back-to-back", "be", "because", "become", "before", 

"behind", "being", "below", "between", "born-again", "both", 

"built-in", "but", "by", "can-do", "custom-made", "do", "done", 

"down", "during", "each", "either", "enough", "even", "ever", 

"every", "everyone", "everything", "everywhere", "far-off", 

"far-out", "few", "find", "first", "for", "four", "from", "full", 

"further", "get", "give", "go", "have-not", "he", "head-on", "her", 

"here", "hers", "herself", "him", "himself", "his", "how", "however", 

"if", "in", "interest", "into", "it", "its", "itself", "keep", 

"laid-back", "last", "least", "less", "ma'am", "made", "man-made", 

"many", "may", "me", "might", "more", "most", "mostly", "much", 

"must", "my", "myself", "never", "next", "nobody", "nor", "no-show", 

"not", "nothing", "now", "nowhere", "of", "off", "often", "on", 

"once", "one", "only", "other", "our", "ours", "ourselves", "out", 

"over", "own", "part", "per", "perhaps", "put", "rather", 

"runner-up", "same", "seem", "seeming", "see-through", 

"self-interest", "self-made", "several", "she", "show", "side", 

"since", "sit-in", "so", "some", "someone", "something", "somewhere", 

"still", "such", "take", "than", "that", "the", "their", "theirs", 

"them", "themselves", "then", "there", "therefore", "these", "they", 

"this", "those", "though", "three", "through", "thus", "to", 

"together", "too", "toward", "two", "under", "until", "up", "upon", 

"us", "very", "we", "well", "well-to-do", "what", "when", "where", 

"where's", "whether", "which", "while", "who", "whole", "whom", 

"whose", "why", "will", "with", "within", "without", "would-be", 

"write-off", "yet", "you", "your", "yours", "yourself"}

I agree with some of those stopwords, but not really any of them that contain the - character. This is perhaps where the issue lies.

What appears to be happening is that DeleteStopwords is deleting part of some words, and what's left over is counted. We can see the outcome:

Counts[TextWords[txt]]["far"]

19

Counts[TextWords[DeleteStopwords[txt]]]["far"]

39

We can see that this behaviour is weird by comparing the following:

Select[TextWords[txt], StringStartsQ["far"]] // Counts // ReverseSort

<|"farinaceous" -> 19, "far" -> 19, "fare" -> 19, "faro" -> 18, "farther" -> 17, "farmer" -> 17, "farcical" -> 17, "farthing" -> 17, "faraway" -> 16, "farmstead" -> 16, "farrier" -> 15, "farthermost" -> 14, "far-off" -> 14, "farming" -> 14, "farrago" -> 13, "farm" -> 13, "farcically" -> 13, "farrowing" -> 12, "farce" -> 11, "farsighted" -> 11, "farmland" -> 10, "farsightedness" -> 10, "farmhouse" -> 9, "farseeing" -> 9, "farad" -> 8, "farina" -> 8, "farthest" -> 8, "farmhand" -> 7, "farewell" -> 7, "farrow" -> 6, "farmyard" -> 6, "far-out" -> 6|>

Select[TextWords[DeleteStopwords@txt], StringStartsQ["far"]] // Counts // ReverseSort

<|"far" -> 39, "farinaceous" -> 19, "fare" -> 19, "faro" -> 18, "farther" -> 17, "farmer" -> 17, "farcical" -> 17, "farthing" -> 17, "faraway" -> 16, "farmstead" -> 16, "farrier" -> 15, "farthermost" -> 14, "farming" -> 14, "farrago" -> 13, "farm" -> 13, "farcically" -> 13, "farrowing" -> 12, "farce" -> 11, "farsighted" -> 11, "farmland" -> 10, "farsightedness" -> 10, "farmhouse" -> 9, "farseeing" -> 9, "farad" -> 8, "farina" -> 8, "farthest" -> 8, "farmhand" -> 7, "farewell" -> 7, "farrow" -> 6, "farmyard" -> 6|>

edited yesterday

answered yesterday

Carl Lange

4,2281735

edited yesterday

answered yesterday

Carl Lange

4,2281735

answered yesterday

Carl Lange

4,2281735

answered yesterday

Carl Lange

4,2281735

$begingroup$
What you observed about DeleteStopwords[] is the same issue I touched on here.
$endgroup$
– J. M. is computer-less♦
15 hours ago

add a comment |

$begingroup$
What you observed about DeleteStopwords[] is the same issue I touched on here.
$endgroup$
– J. M. is computer-less♦
15 hours ago

What you observed about DeleteStopwords[] is the same issue I touched on here.

– J. M. is computer-less♦
15 hours ago

add a comment |

As already noted by Carl, you have blamed the wrong function. Had you imported the text file in the proper format, you would have gotten the expected results:

words = Import["https://pastebin.com/raw/Z0hd3huU", "List"];



AllTrue[words, StringQ]

   True



Take[words, 10]

   {"inconsiderate", "weighting", "unneeded", "issuing", "intemperately", "perverse",

    "disgruntled", "ninja", "artificially", "seduce"}

Note that this import format already yielded a list of strings as opposed to a single string.

Commonest[words, 10]

   {"affirm", "calligrapher", "squander", "validly", "autoimmune", "equation", "nematode",

    "veronica", "crispness", "ashen"}



TakeLargest[Counts[words], 10]

   <|"affirm" -> 29, "equation" -> 28, "ashen" -> 28, "crispness" -> 28, "veronica" -> 28,

     "squander" -> 27, "validly" -> 27, "calligrapher" -> 27, "autoimmune" -> 27,

     "nematode" -> 27|>



WordCloud[words]

word cloud

answered 16 hours ago

J. M. is computer-less♦

97.3k10303463

add a comment |

As already noted by Carl, you have blamed the wrong function. Had you imported the text file in the proper format, you would have gotten the expected results:

words = Import["https://pastebin.com/raw/Z0hd3huU", "List"];



AllTrue[words, StringQ]

   True



Take[words, 10]

   {"inconsiderate", "weighting", "unneeded", "issuing", "intemperately", "perverse",

    "disgruntled", "ninja", "artificially", "seduce"}

Note that this import format already yielded a list of strings as opposed to a single string.

Commonest[words, 10]

   {"affirm", "calligrapher", "squander", "validly", "autoimmune", "equation", "nematode",

    "veronica", "crispness", "ashen"}



TakeLargest[Counts[words], 10]

   <|"affirm" -> 29, "equation" -> 28, "ashen" -> 28, "crispness" -> 28, "veronica" -> 28,

     "squander" -> 27, "validly" -> 27, "calligrapher" -> 27, "autoimmune" -> 27,

     "nematode" -> 27|>



WordCloud[words]

word cloud

answered 16 hours ago

J. M. is computer-less♦

97.3k10303463

add a comment |

As already noted by Carl, you have blamed the wrong function. Had you imported the text file in the proper format, you would have gotten the expected results:

words = Import["https://pastebin.com/raw/Z0hd3huU", "List"];



AllTrue[words, StringQ]

   True



Take[words, 10]

   {"inconsiderate", "weighting", "unneeded", "issuing", "intemperately", "perverse",

    "disgruntled", "ninja", "artificially", "seduce"}

Note that this import format already yielded a list of strings as opposed to a single string.

Commonest[words, 10]

   {"affirm", "calligrapher", "squander", "validly", "autoimmune", "equation", "nematode",

    "veronica", "crispness", "ashen"}



TakeLargest[Counts[words], 10]

   <|"affirm" -> 29, "equation" -> 28, "ashen" -> 28, "crispness" -> 28, "veronica" -> 28,

     "squander" -> 27, "validly" -> 27, "calligrapher" -> 27, "autoimmune" -> 27,

     "nematode" -> 27|>



WordCloud[words]

word cloud

answered 16 hours ago

J. M. is computer-less♦

97.3k10303463

As already noted by Carl, you have blamed the wrong function. Had you imported the text file in the proper format, you would have gotten the expected results:

words = Import["https://pastebin.com/raw/Z0hd3huU", "List"];



AllTrue[words, StringQ]

   True



Take[words, 10]

   {"inconsiderate", "weighting", "unneeded", "issuing", "intemperately", "perverse",

    "disgruntled", "ninja", "artificially", "seduce"}

Note that this import format already yielded a list of strings as opposed to a single string.

Commonest[words, 10]

   {"affirm", "calligrapher", "squander", "validly", "autoimmune", "equation", "nematode",

    "veronica", "crispness", "ashen"}



TakeLargest[Counts[words], 10]

   <|"affirm" -> 29, "equation" -> 28, "ashen" -> 28, "crispness" -> 28, "veronica" -> 28,

     "squander" -> 27, "validly" -> 27, "calligrapher" -> 27, "autoimmune" -> 27,

     "nematode" -> 27|>



WordCloud[words]

word cloud

answered 16 hours ago

J. M. is computer-less♦

97.3k10303463

answered 16 hours ago

J. M. is computer-less♦

97.3k10303463

answered 16 hours ago

J. M. is computer-less♦

97.3k10303463

answered 16 hours ago

J. M. is computer-less♦

97.3k10303463

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Mathematica Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

ywiuUBDx,yHO4C4lk3,rO4,X2bNi2a4WQxmBv1iGtZo

搜尋此網誌

Dtjytyk