Statistics
eBook - ePub

Statistics

Practical Concept of Statistics for Data Scientists

John Slavio

Partager le livre
  1. English
  2. ePUB (adapté aux mobiles)
  3. Disponible sur iOS et Android
eBook - ePub

Statistics

Practical Concept of Statistics for Data Scientists

John Slavio

DĂ©tails du livre
Aperçu du livre
Table des matiĂšres
Citations

À propos de ce livre

Statistics are not a tool but rather a set of techniques that you have access to that will help you analyze a set of data that you either generate, receive, or give. Statistics are absolutely vital for those attempting to study Big Data because it allows the scientists studying the data to make sense of the information when the information is on such a large and global scale. Unlike local neighborhood statistics or marketing statistics, big data encompasses a huge range of information and often this big data will be populated by thousands if not millions of data points. Statistics help you break down these data points so that you can reasonably understand them and work with the data that comes into you.

Here's What's Included In This Book

Basics of StatisticsExploratory Data AnalysisDifferent Sampling MethodsDifferent Types of Structured DataRun Charts and Statistical Process ControlVariation AnalysisPractical Application of Statistics

Foire aux questions

Comment puis-je résilier mon abonnement ?
Il vous suffit de vous rendre dans la section compte dans paramĂštres et de cliquer sur « RĂ©silier l’abonnement ». C’est aussi simple que cela ! Une fois que vous aurez rĂ©siliĂ© votre abonnement, il restera actif pour le reste de la pĂ©riode pour laquelle vous avez payĂ©. DĂ©couvrez-en plus ici.
Puis-je / comment puis-je télécharger des livres ?
Pour le moment, tous nos livres en format ePub adaptĂ©s aux mobiles peuvent ĂȘtre tĂ©lĂ©chargĂ©s via l’application. La plupart de nos PDF sont Ă©galement disponibles en tĂ©lĂ©chargement et les autres seront tĂ©lĂ©chargeables trĂšs prochainement. DĂ©couvrez-en plus ici.
Quelle est la différence entre les formules tarifaires ?
Les deux abonnements vous donnent un accĂšs complet Ă  la bibliothĂšque et Ă  toutes les fonctionnalitĂ©s de Perlego. Les seules diffĂ©rences sont les tarifs ainsi que la pĂ©riode d’abonnement : avec l’abonnement annuel, vous Ă©conomiserez environ 30 % par rapport Ă  12 mois d’abonnement mensuel.
Qu’est-ce que Perlego ?
Nous sommes un service d’abonnement Ă  des ouvrages universitaires en ligne, oĂč vous pouvez accĂ©der Ă  toute une bibliothĂšque pour un prix infĂ©rieur Ă  celui d’un seul livre par mois. Avec plus d’un million de livres sur plus de 1 000 sujets, nous avons ce qu’il vous faut ! DĂ©couvrez-en plus ici.
Prenez-vous en charge la synthÚse vocale ?
Recherchez le symbole Écouter sur votre prochain livre pour voir si vous pouvez l’écouter. L’outil Écouter lit le texte Ă  haute voix pour vous, en surlignant le passage qui est en cours de lecture. Vous pouvez le mettre sur pause, l’accĂ©lĂ©rer ou le ralentir. DĂ©couvrez-en plus ici.
Est-ce que Statistics est un PDF/ePUB en ligne ?
Oui, vous pouvez accĂ©der Ă  Statistics par John Slavio en format PDF et/ou ePUB ainsi qu’à d’autres livres populaires dans Economics et Statistics for Business & Economics. Nous disposons de plus d’un million d’ouvrages Ă  dĂ©couvrir dans notre catalogue.

Informations

Éditeur
Abhishek Kumar
ISBN
9781979455473

Basics of Statistics

Finding a Range and Usage
Before you begin messing around with your data, you need a way to organize it and the most common method of organizing a dataset is to turn that dataset into a range of numbers. But what does a range mean? A range represents the smallest value on the left side of the dataset with the representation of the highest values on the right side of the dataset or the top side of the dataset. So, for a group of numbers [0,10,5,6,7] the range would be [0-10].
Once you have placed your numbers from least to greatest, in a basic sense, you are now able to look at the data points and categorize them by using qualitative measures or by using quantitative measures. If one is looking at a dataset where the numbers correlate to qualities in an item or a thing, then a qualitative approach should be taken but if the numbers in your dataset are representative of quantities then a quantitative approach should be adopted.
Median and Interquartile Range
Median and Interquartile range are very simplistic topics with very complex words attached to it and all it involves is finding the middles of the middle. Let’s say that we have a range of numbers from 1 to 20 where we are dealing with the even numbers, as well as numbers 1 and 11. This means that we would have 1, 2, 4, 6, 8, 10, 11, 12, 14, 16, 18, 20. The first step in the process is to find the median number, and so since we have twelve numbers that means that we need to find the middle two numbers (6th and 7th number) and find the common number between them. The common number between 10 and 11 is 10.5. Now that we have found our median, we need to find the middle median of the left and right sides. The middle median of the left side is 4 (3rd number) while that of the right side is 16 (9th number). Next, we need to find the difference between these two numbers. This would mean that we subtract 4 from 16 and this gives us 12 and so our interquartile range is 12. As I said, this is an easy thing to figure out, on a small scale, but is attached to a somewhat complex sounding name. The reasoning behind the name is that you are providing a range between two different percentiles in your dataset, specifically the 75-percentile marker and the 25-percentile marker.
The Five Number Summary
Most of the data that is worked on relies on the five number summaries and though this elusive name suggests that there are five numbers that summarizes the entirety of your data, this is the truth. The five-number summary is made up of the smallest quantitative measurement that you can possibly have in your data set, the first 25% of the dataset otherwise known as the first quartile, the median of a dataset, the 75% of the dataset otherwise known as the third quartile, the maximum possible quantitative value and the inside of a dataset. As you already saw with interquartile range, we utilized the median to find the first quartile and the third quartile and found our interquartile range by subtracting the first quartile from the third quartile. These five numbers will represent almost all your data all the time if you are ever taking a quantitative approach towards your data.
Differentiating Different Sampling Methods
Probability Sampling
In order to go through this topic, we need to go over a few variables and functions that you will need to keep in mind as we move forward. In Probability Sampling, you have a quantitative number representing the sampling frame and the sample pertaining to the amount of cases in each one. This is represented by N for frames and n for samples. The functions are NCn, which just represents the number of times that n to N form combinations or subsets and f, which is the fraction of samples found in a sample frame a.k.a. n/N. Now that we got that out of the way, there are five methods that fall underneath Probability Sampling, but what is Probability Sampling? Simply put, it is a series of methods (from which you choose one of them) used to randomly select a sample.
Simple Random Sampling
Alright, so let’s get started with the simplest of them all. The objective of this sampling method is to select the sample size out of the sample frame where each sample has an equivalent claim to be used. The procedure to do such a thing is usually just a simple randomizer made in a program, such as:
image
As you can see from the code, we chose to print out a random number from 1-100 ten times. In the Simple Random Sampling, this would be explained as using a sample of 10 from a sample frame of 100. We would then utilize these randomly chosen 10 things inside of the sample set to determine something. Our NCn in this case would be 100C10 and our f would be 10% as 100/10 is 10. Simple, right?
Stratified Random Sampling
Stratified Random Sampling is similar, but it takes it a little bit further. Let’s say that we have a sample frame of 100 black haired individuals and we initially chose a sample set of 10. We would then divide the 100 individuals into 10 sample groups where each group shared one common aspect that was different from the other groups. We would then take 1 from each group, just as random as before, in order to get our sample group of 10. The whole point of this is to divide the sample frame into groups that don’t overlap each other so this is more of a categorical methodology since things like red and black don’t naturally overlap.
Systematic Random Sampling
Systematic Random Sampling is a little bit different from the previous two because it is a bit more complicated. First, you number all the given elements in your range in an ordinal fashion. You need to choose your sample set from the sample frame. Now here is where it differs; your interval size will be f. You will now need to select a random number from 1 to your output off. You will now go to each sample set and take the random number for that sample set. So, going back to the example, we would take the 6th person from each group.
Cluster (Area) Random Sampling
Cluster Sampling was primarily invented to handle situations where your data points are stretched way outside of a normally controllable environment, such as travelling to different counties in a state to visit inspection sites for your sample set. It’s pretty straightforward:
  1. Divide up your sample frame into bordered off sections (clusters)
  2. Use a randomizer technique to select your sections
  3. The selected sections now become your samples where you will use them to do all of your needed measurements
Multi-Stage Sampling
The above are the most common forms of sampling and they are taught to beginners in this field, but the truth is that statistical analysts almost never use something so simple. Imagine if you have a sample from of one billion individuals and you were to select a sample set of 10% from that, would you be able to perform the measurements in your lifetime? Needless to say, we must combine these methods and create our own method to randomly select our samples in order to represent a wide array of things we wish to study. This combination of sampling methods is known as Multi-Stage Sampling.
Non-probability Sampling
Non-probability sampling means that you did not randomize your choice. Often, many statistical analysts will avoid this unless they have a specific set that they want to investigate. This set may be something they want to investigate because of density issues, limitations on methodologies, or simply lack of the considerable amount of time it might take to complete randomized sampling. In this case, we would choose a non-probable sample in order save on something but we risk not truly representing the sample set we’re considering.
Convenience Sampling
Convenience sampling is simply choosing whatever is convenient for you to choose at the time and using that as the sample in your sample frame...

Table des matiĂšres