statlib-20050214 pollen (public)

2010-11-06 10:00 by mldata | Version 1
Summary

(No information yet)

License
unknown (from Weka repository)
Dependencies
Tags
arff slurped Weka
Attribute Types
Integer,Floating Point
Download
# Instances: 3848 / # Attributes: 6
HDF5 (178.6 KB) XML CSV ARFF LibSVM Matlab Octave

Original Data Format
arff
Name
pollen
Version mldata
0
Comment

This dataset is synthetic. It was generated by David Coleman at RCA Laboratories in Princeton, N.J. For convenience, we will refer to it as the POLLEN DATA. The first three variables are the lengths of geometric features observed on sampled pollen grains in the x, y, and z dimensions: a "ridge" along x, a "nub" in the y direction, and a "crack" along the z dimension. The fourth variable is pollen grain weight, and the fifth is density.

There are 3848 observations, in random order (for people whose software packages cannot handle this much data, it is recommended that the data be sampled). The dataset is broken up into eight pieces, POLLEN1.DAT - POLLEN8.DAT, each with 481 observations.
We will call the variables:

  1. RIDGE

  2. NUB

  3. CRACK

  4. WEIGHT

  5. DENSITY

  6. OBSERVATION NUMBER (for convenience)
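
The sampling recommended above, and the arithmetic behind the eight files (8 x 481 = 3848), can be sketched as follows. This is only an illustration using the Python standard library: the helper names `sample_rows` and `split_into_pieces` are hypothetical, the rows are placeholders rather than real pollen values, and it is an assumption that the eight pieces are consecutive runs of observations.

```python
import random

N_OBSERVATIONS = 3848                      # total stated in the description
N_PIECES = 8                               # POLLEN1.DAT .. POLLEN8.DAT
PIECE_SIZE = N_OBSERVATIONS // N_PIECES    # 481 observations per piece


def sample_rows(rows, k, seed=0):
    """Return k rows drawn without replacement, kept in their original order."""
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(rows)), k))
    return [rows[i] for i in keep]


def split_into_pieces(rows, n_pieces=N_PIECES):
    """Split rows into n_pieces consecutive, equally sized chunks."""
    size, remainder = divmod(len(rows), n_pieces)
    if remainder:
        raise ValueError("rows do not divide evenly into pieces")
    return [rows[i * size:(i + 1) * size] for i in range(n_pieces)]


# Placeholder rows: only the observation number is filled in here.
rows = [(obs,) for obs in range(1, N_OBSERVATIONS + 1)]
subset = sample_rows(rows, 500)            # a 500-row subsample
pieces = split_into_pieces(rows)           # eight chunks of 481 rows
```

Because the observations are already in random order, taking any one of the eight pieces is itself a reasonable subsample.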

The data analyst is advised that there is more than one "feature" to these data. Each feature can be observed through various graphical techniques, but analytic methods, as well, can help "crack" the dataset.

Additional Info:

I no longer have the description handed out during the JSM, but can tell you how I generated the data, in Minitab.

  1. Part A was generated: 5000 (I think) 5-variable, uncorrelated, i.i.d. Gaussian observations.

  2. To get part B, I duplicated part A, then reversed the sign on the observations for 3 of the 5 variables.

  3. Part B was appended to Part A.

  4. The order of the observations was randomized.

  5. While waiting for my tardy car-pool companion, I took a piece of graph paper, and figured out a dot-matrix representation of the word, "EUREKA." I then added these observations to the "center" of the dataset.

  6. The data were scaled, by variable (something like 1,3,5,7,11).

  7. The data were rotated, then translated.

  8. A few points in space within the datacloud were chosen as ellipsoid centers, then for each center, all observations within a (scaled and rotated) radius were identified, and eliminated - to form ellipsoidal voids.

  9. The variables were given entirely fictitious names.
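
The steps above can be sketched roughly in Python. This is not the original Minitab procedure, only an illustration under stated assumptions: which three variables change sign, the rotation (a single plane rotation here), the translation vector, and the void centers and radius are all invented for the example, and the function name `generate` is hypothetical. Step 5 is left out because the dot-matrix layout of "EUREKA" is not given. The voids are cut as spheres before scaling and rotating, which makes them ellipsoids in the final coordinates; the author describes cutting them after those steps.

```python
import math
import random

NAMES = ["RIDGE", "NUB", "CRACK", "WEIGHT", "DENSITY"]


def generate(n=5000, seed=0,
             flipped=(0, 1, 2),                          # assumption: the 3 sign-reversed variables
             scales=(1, 3, 5, 7, 11),                    # "something like" these, per the author
             angle=math.pi / 6,                          # assumption: one plane rotation
             shift=(1.0, -2.0, 0.5, 3.0, -1.0),          # assumption: translation vector
             void_centers=((1.0, 0.0, 0.0, 0.0, 0.0),),  # assumption: void centers
             void_radius=0.5):                           # assumption: void radius
    rng = random.Random(seed)

    # Steps 1-3: Gaussian part A, its partly sign-reversed copy B, appended.
    part_a = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(n)]
    part_b = [[-v if j in flipped else v for j, v in enumerate(row)]
              for row in part_a]
    points = part_a + part_b

    # Step 4: randomize the order of the observations.
    rng.shuffle(points)

    # Step 5 (the embedded "EUREKA" points) is omitted in this sketch.

    # Step 8, applied early: spherical voids in unscaled coordinates
    # become scaled and rotated ellipsoids after steps 6 and 7.
    points = [p for p in points
              if all(math.dist(p, c) > void_radius for c in void_centers)]

    # Step 6: scale each variable.
    points = [[v * s for v, s in zip(p, scales)] for p in points]

    # Step 7: rotate in the plane of the first two variables, then translate.
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    out = []
    for p in points:
        x, y = p[0], p[1]
        q = [cos_a * x - sin_a * y, sin_a * x + cos_a * y] + p[2:]
        out.append([v + t for v, t in zip(q, shift)])

    # Step 9: attach the fictitious names, plus an observation number.
    return [dict(zip(NAMES, row), OBSERVATION_NUMBER=i)
            for i, row in enumerate(out, start=1)]


rows = generate(n=200, seed=1)
```

Calling `generate(n=200, void_centers=())` skips the void step and returns all 400 points, which makes the mirrored A/B structure easier to inspect.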

FYI, only the folks at Bell Labs, Murray Hill, found everything, including the voids.

Hope this is helpful!

References:

Becker, R.A., Denby, L., McGill, R., and Wilks, A. (1986). Datacryptanalysis: A Case Study. Proceedings of the Section on Statistical Graphics, 92-97.

Slomka, M. (1986). The Analysis of a Synthetic Data Set. Proceedings of the Section on Statistical Graphics, 113-116.

Information about the dataset
CLASSTYPE: numeric
CLASSINDEX: none specific

Names
RIDGE,NUB,CRACK,WEIGHT,DENSITY,OBSERVATION_NUMBER
Types
  1. numeric
  2. numeric
  3. numeric
  4. numeric
  5. numeric
  6. numeric
Data (first 10 data points)
    RIDGE     NUB       CRACK     WEIGHT    DENSITY   OBSERVATION_NUMBER
    -2.34...  3.6314    5.0289    10.8...   -1.38...  1.0
    -1.152    1.4805    3.2375    -0.59...  2.1235    2.0
    -2.52...  -6.86...  -2.80...  8.4631    -3.41...  3.0
    5.7523    -6.50...  -5.151    4.348     -10.3...  4.0
    8.7494    -3.89...  -1.38...  -14.8...  -2.41...  5.0
    10.4...   -3.16...  12.7...   -14.8...  -6.49...  6.0
    -3.60...  4.6081    6.554     5.9773    4.0404    7.0
    -5.63...  -0.81...  -3.812    1.1674    7.0468    8.0
    9.5434    4.0865    2.7542    -18.9...  -0.06...  9.0
    -9.02...  2.9723    3.6759    13.882    4.2106    10.0
    ...       ...       ...       ...       ...       ...
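
Since the original format is ARFF and all six attributes are numeric, a file like this one can be read with a few lines of standard-library Python. The sketch below is a minimal illustration, not a full ARFF reader: `parse_numeric_arff` is a hypothetical helper, it does not handle quoted names, sparse rows, or missing values (`?`), and the two data rows in `SAMPLE` are made-up placeholders, not real pollen values.

```python
def parse_numeric_arff(text):
    """Parse ARFF text whose attributes are all numeric.

    Returns (names, rows), where rows is a list of lists of floats.
    """
    names, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # blank line or ARFF comment
            continue
        lower = line.lower()
        if in_data:
            rows.append([float(v) for v in line.split(",")])
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            if kind.strip().lower() not in ("numeric", "real", "integer"):
                raise ValueError(f"non-numeric attribute: {name}")
            names.append(name)
        elif lower.startswith("@data"):
            in_data = True
    return names, rows


# Illustrative input only: the attribute names come from the listing above,
# but the data rows are invented placeholders.
SAMPLE = """\
% hypothetical excerpt in ARFF syntax
@relation pollen
@attribute RIDGE numeric
@attribute NUB numeric
@attribute CRACK numeric
@attribute WEIGHT numeric
@attribute DENSITY numeric
@attribute OBSERVATION_NUMBER numeric
@data
0.5,-1.5,2.0,3.25,-0.75,1
1.0,2.0,-3.0,4.0,5.0,2
"""

names, rows = parse_numeric_arff(SAMPLE)
```

For real work, an established ARFF reader is likely a better choice than this hand-rolled parser.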
Description

A gzip'ed tar containing StatLib datasets (statlib-20050214.tar.gz, 12,785,582 Bytes)

URLs
(No information yet)
Publications
    Data Source
    http://lib.stat.cmu.edu/datasets/
    Measurement Details
    Usage Scenario