genuine office 2010 Statistics and Business Data - Free Advertising Forums | Free Advertising Board | Post Free Ads Forum | Free Advertising Forums Directory | Best Free Advertising Methods

**xingqiuda762** · 03-13-2011, 06:57 PM

Today's author, Gabhan Berry, a Program Manager on the Excel team. A copy of the spreadsheet discussed in this post can be found as an attachment at the bottom of this post.
When faced with a table of data, what kind of techniques do we employ to extract information from it? Maybe the first thing we do is create summaries. We might total up the columns or we might chart the data. Perhaps,microsoft office 2010 Professional Plus upgrade key, we might even create a PivotTable and drill down into subtotals. all, summarizing data is useful. It's easy to understand and easy to do. that we're running a store and we have a small dataset of the sales we made in the first quarter. Using Excel we can easily create summary reports telling us how many products we sold to different types of people; how many to men and how many to women; how many to people over 40 and how many to people under 40. totals are useful. We can compare them and search for interesting differences. We might notice that we sold more products to men than we did to women or that managers bought more of our products than other professions. we get more information than this? buy an unusually high number of products compared to other professions? Is this number so high that it might indicate a bias of some kind? to these types of questions enable us to make better business decisions. Perhaps if we know that our products are significantly more attractive to managers we might change our advertising policies to target managers more than other professions. think that we need data warehouses and data mining tools to answer these types of questions. But that's not necessarily true. Data warehousing and data mining are excellent tools, but so is Excel and Excel contains some sophisticated statistical functions which we can use to help us answer these types of questions. Basics
First of all, let's take a look at our data set. contains customer information which we have gathered over the last quarter of trading. Each time we made a sale we recorded certain information about the customer. For instance, we recorded their Marital Status, their Gender, their Education level and their Occupation (amongst other fields). We have 1,000 records in our data set. say that we are interested in analyzing the Occupation field. It would be useful to know whether or not certain professions tend to buy more of our products than other professions. This would be a key piece of business intelligence for us. do we decide this? all,microsoft office 2010 pro plus 32bit key, let's build some summaries. Let's count how many sales we made to each profession and each profession's percentage of the overall sales and display these numbers in a table. We'll also insert a pie chart to give some simple, visual indication of the proportions sold to each profession. done, let's now ask ourselves the question again: Do certain professions buy more of our products than other professions? see that Professional and Skilled Manual occurred more frequently in the data than the other professions and that Manual occurred the least. But is this enough information to enable us to generalize and state that Professional and Skilled Manual buy more of our products in general than the other professions? really. before we can state anything about these totals we need to have some benchmark to compare them to. We need to have some way of calculating expected numbers for each profession. If a profession had more sales than the expected number then that takes us one step closer to being able to state that there is an unexpected difference between the amounts of products sold to each profession. the expected values we have to think a little about probability. sale, what is the probability of the occupation being Professional? There are 5 occupations and each sale can only be assigned one occupation so the probability of it being Professional is 1 in 5 i.e., 20%. In fact the probability for each occupation is 20%, assuming there is no bias in the data. any set of unbiased data, we would expect 20% to be Professional, 20% Clerical, 20% Management,cheap office 2010 product key, 20% Manual and 20% Skilled Manual. 1000 records, we would expect each occupation to have 200 sales. sense tells us that we're very unlikely to get exactly 200 sales for each profession. So what we really mean is that we expect to get approximately 200 sales for each profession. the actual data, we see that Clerical appears in 177 of the 1000 records. That's a difference of -23 from the expected value. Is this close enough to 200 to be considered approximately 200? has 255 records; a difference of +55. Is this difference big enough to be significantly different? data there are 255 Skilled Manual records. What is the probability of this? That is, what is the probability of there being 255 Skilled Manual records out of 1000 when each record could have any one of five occupations? we expect 200 Skilled Manual records it seems logical to also expect that the probability of having 200 records is higher than the probability of having any other number. We'd also expect that the probability of getting 205 records would be higher than the probability of getting 220 (because 205 is closer to 200 than 220 is). we move away from 200 we would expect the probabilities to decrease and decrease towards 0. would expect that there would come a point when the probabilities are so small it would be extremely unlikely for these values to occur in unbiased data. Therefore, if these values did occur then our data has a high likelihood of being biased. do we calculate these probabilities? them using something called the binomial distribution. Binomial Distribution
Data is expected to have a binomial distribution if: are a finite number of records; The value of one record is not influenced by the value of any other record; Each record has a value of either true or false; The probability of a record having a value of true is the same for each record;
At first, it doesn't seem like our data has these properties. But it does - we just need to look at it slightly differently. in the Occupation field and we know that there are five occupations. But instead of considering all five occupations at once, let's consider just one occupation at a time. Let's start with Skilled Manual. a finite number of records (1000); the occupation of one record is not influenced by any other record; the probability that the record has a value of Skilled Manual is always the same (20%); and each record has an occupation of either Skilled Manual or not Skilled Manual. Skilled Manual data exhibits the required properties and therefore we expect that the Skilled Manual data is binomially distributed. same principles we can see that the same is true for the remaining four occupations. Instead of considering all five occupations at once, we consider them one at a time. Each occupation should be binomially distributed. If we find an occupation that isn't, then there is a likelihood of bias in our data. a function called BINOMDIST which calculates the probability of getting X true values out of a fixed number of records where the probability of each record being true has a certain, fixed value. In other words, BINOMDIST calculates the binomial probability. we can calculate the probability of getting 150 true records out of 1000 where each record has a 20% probability of being true by doing: 0.2, FALSE) how the probabilities are distributed (i.e., where the high values are and where the low values are) let's calculate the BINOMDIST for each value between 1 and 1000. In other words, we will calculate the probability of getting 1 true record out of 1000, 2 true records, 3 true records etc. We'll then plot these values on a chart. following chart. the highest value (the peak of the graph) is at 200? Also notice that as we move away from 200 the probabilities decrease towards 0?
at the chart we can state that, if our data is binomially distributed, we are extremely unlikely to have 150 or fewer records for any occupation. Therefore, if we indeed have fewer than 150 records for any occupation it implies that our data is unlikely to be binomially distributed. careful use of unlikely in the previous sentence. When dealing with probabilities we can rarely be 100% certain of anything. Instead we state facts as having a certain likelihood of being true. This is called a confidence level. level is the final tool we need in order to conclude whether our data is biased towards certain occupations or whether the differences between occupations occur purely by chance. draw conclusions about our data we need to decide what confidence level we are happy with. A 95% confidence level means that there is a 5% chance that we will draw the wrong conclusion; a 99% confidence level means that there is a 1% chance we will get it wrong. do we need to be? level is something that we need to decide for ourselves. Typically it is 95% or 99%. For the purposes of this article, let's stick with 95%. we have our confidence level we can calculate what's called the critical values of the binomial distribution. calculate two critical values. These are: bound: the smallest number of true records for a binomially distributed variable (at a specific confidence level); Upper Bound: the largest number of true records for a binomially distributed variable (at a specific confidence level);
We calculate the critical values using the CRITBINOM function. the lower bound we do: 0.025) a value of 176. the upper bound we do: 1-0.025) a value of 225. the critical values depend on the total number of records in our data (1000), the probability each record has of being true (20% or 0.2) and the confidence level we want to work with (0.5/2 = 0.025). If these values change, the critical values change. So,microsoft office Professional 2007 serial key, for example,genuine office 2010, if we had 2000 records instead of 1000, the critical values would be different. that we specify the confidence level as an error level. That is, we don't supply a 0.95 confidence level we supply a 0.05 error level. we are performing what's called a two-tailed test which means that we are looking for differences that are significantly lower or significantly higher than the expected value. Therefore we have to halve our error level from 0.05 to 0.025. we have the bounding values. These enable us to determine whether or not our data is binomially distributed. each occupation to occur 200 times out of 1000. These critical values tell us how close to 200 the actual numbers of occurrences have to be in order for them to be considered close enough. the number of records for a given occupation is between 176 and 225, that occupation has a 95% likelihood of being binomially distributed. Otherwise, there is 95% likelihood that the occupation is not binomially distributed. discover that our data is not binomially distributed then something other than chance is determining the data. Report
Armed with this knowledge, we are ready to build our report for the Occupation field. Let's restate the question we're asking ourselves: Do certain professions buy more of our products than other professions? report:
Count column simply uses COUNTIF to count the number of occurrences of each occupation. We have added some conditionally formatted data bars to offer some visual indication of the distributions. column also uses conditional formatting. An up arrow means that that the occupation is more frequent than expected; a horizontal arrow means that the occupation occurs the expected number of times (i.e., is binomially distributed) and a down arrow means the occupation occurs less frequent than expected. in the Score column is as follows (use the color highlights to interpret the cell references): the formula compares the count to the lower and higher bounds and returns -1 (less than lower bound), 0 (between lower bound and higher bound) or 1 (greater than higher bound). in the Lower Bound and Higher Bound cells use CRITBINOM as before. function in the Lower Bound cell: Results: What Does this Really Mean?
Let's look again at the table of results: this analysis, only Clerical is binomially distributed. All the other occupations are not. So, what conclusions can we draw from this? What does this analysis really tell us? Skilled Manual as an example. tells us we can be 95% certain that the Skilled Manual occupation occurs significantly more frequently in the data than we would expect it to. really mean that Skilled Manual workers prefer our products? numbers alone do not tell us that. All the numbers tell us is that we can be 95% sure that an unexpectedly high number of Skilled Manual workers bought our products. But there may be various reasons for this. Maybe our shop is next to a construction site or maybe it is next to a factory. real-world factors are not conveyed in the numbers and instead must come from ourselves, as information workers. We must take the numbers and apply our business knowledge to them. don't contain magical answers, but they do provide us with some interesting properties of our data that we should seek to explain and understand. use the numbers as a tool to help us make intelligent business decisions, but the numbers alone don't make the decisions for us. <div