datahelper: data crunching
Name: Robert
Tuesday, January 30, 2007

The data processing function, a really simple example



So let's show a wicked simple example of how you would use the Vilno programming language:

Here's a sample program:

directoryref a="/home/tom/mydata" ;

printoptions a/printout1 ;

convertfileformat asciitobinary("/home/tom/asciidata.txt"->a/datafile1)
delimiter=',' varnames(name age weight) datatypes(str int float)
strlengths(15) ;


inlist a/datafile1 ;
if (name=="Sam") name="Samuel" ;
sendoff(a/datafile2) name age weight ;
turnoff ;


convertfileformat binarytoascii(a/datafile2->"/home/tom/newasciidata.txt")
delimiter='|' varnames(name age weight) ;

print(a/datafile1) "Here is the input dataset" ;
print(a/datafile2) "Here is the new output dataset, created by the data processing function" ;



OK, first off, the core of the Vilno programming language is the data processing function (DPF for short). See the paragraph of code that begins with "inlist" and ends with "turnoff"? That's a data processing function. This one is as simple as you can get, because it does a very slight and easy data modification: changing the spelling of "Sam" to "Samuel" in the NAME column. The data processing function reads in input datasets, crunches data, and writes out output datasets. The range of data transformations that the data processing function can do is huge; I deliberately made a wicked simple data processing function here.
Also, this program has only one data processing function, again to keep it simple. And the data processing function here reads one and only one input dataset, and writes only one output dataset, again to keep it simple.
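For readers who don't know Vilno, here is a rough Python sketch of what this particular data processing function does: read rows, respell "Sam" as "Samuel", write rows. It is only an illustration. The function name rename_sam and the sample rows are made up, and Vilno itself works on its own binary datasets rather than Python lists.

```python
# Hypothetical Python stand-in for the data processing function above.
# The rows below are invented for illustration only.

def rename_sam(rows):
    """Return new rows with the name "Sam" respelled as "Samuel"."""
    output = []
    for row in rows:
        new_row = dict(row)              # copy, so the input dataset is untouched
        if new_row["name"] == "Sam":     # same test as: if (name=="Sam")
            new_row["name"] = "Samuel"
        output.append(new_row)
    return output


# Stand-in for a/datafile1 (made-up values).
datafile1 = [
    {"name": "Sam", "age": 34, "weight": 180.5},
    {"name": "Ann", "age": 51, "weight": 132.0},
]

# Stand-in for sendoff(a/datafile2) name age weight.
datafile2 = rename_sam(datafile1)

for row in datafile2:
    print(row["name"], row["age"], row["weight"])
```

One input dataset goes in, one output dataset comes out, and the input is left alone, which is the same shape as the inlist ... sendoff ... turnoff paragraph.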

The data processing function is where you get the real work done, the data transformation.

"turnoff" really means "end of paragraph" or "end of this data processing function". The spelling was a bad choice.

If you are reading and writing datasets that are in the binary format native to the Vilno software product, there is no need for the convertfileformat statements that are shown above. The first convertfileformat (just before the DPF) creates a binary dataset from the ascii data file. This binary dataset is a/datafile1, which, via the directoryref statement, is actually /home/tom/mydata/datafile1.dat. Then the data processing function reads in a/datafile1, does some calculations, and writes out a/datafile2 (which is actually /home/tom/mydata/datafile2.dat). Then the second convertfileformat statement (just after the DPF) converts a/datafile2 to a new ascii data file.

So the convertfileformat statement imports data from ascii data files (typically with a comma or vertical bar as a delimiter), and exports data out to ascii data files. The current version of Vilno does not yet directly read or write specialized formats (such as Oracle, MySQL, SAS, SPSS, etc.). But of course, you have the option of exporting/importing ascii data files from such products.
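To make the import/export round trip concrete, here is a small Python sketch using the standard csv module: read comma-delimited text with declared variable names and datatypes, then write the same rows back out with a vertical bar as the delimiter. The helper names read_delimited and write_delimited and the sample data are my own assumptions for illustration; this is not how Vilno stores its binary datasets.

```python
import csv
import io


def read_delimited(text, delimiter, varnames, datatypes):
    """Parse delimited text into a list of typed rows (dicts)."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not fields:                   # skip blank lines
            continue
        rows.append({name: convert(value)
                     for name, convert, value in zip(varnames, datatypes, fields)})
    return rows


def write_delimited(rows, delimiter, varnames):
    """Write rows back out as delimited text, one line per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    for row in rows:
        writer.writerow([row[name] for name in varnames])
    return buffer.getvalue()


# Made-up contents standing in for the comma-delimited input file.
ascii_in = "Sam,34,180.5\nAnn,51,132.0\n"
varnames = ["name", "age", "weight"]

rows = read_delimited(ascii_in, ",", varnames, [str, int, float])
ascii_out = write_delimited(rows, "|", varnames)
print(ascii_out, end="")
```

The list of dicts in the middle plays the role of the binary dataset: typed data that the transformation step can work on without caring about delimiters.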

The print statements create data listings, printed to /home/tom/mydata/printout1.prt (because of the printoptions statement). These are data listings of the input dataset (a/datafile1) and the output dataset (a/datafile2). Printout1.prt is an ascii file, but it is not an ascii data file: it has page breaks, and the columns are aligned, with a title at the top of a page. It's a printout for the human eye to look at, not a dataset that a later program can read.
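The difference between a data file and a listing can be sketched in Python as well. The function below lays rows out in aligned columns, repeats a title at the top of every page, and separates pages with a form feed character. The name make_listing, the page size, and the column width are assumptions chosen for the example; the real .prt layout may differ.

```python
def make_listing(rows, varnames, title, rows_per_page=2, width=10):
    """Build a human-readable listing: title, aligned columns, page breaks."""
    pages = []
    for start in range(0, len(rows), rows_per_page):
        # Each page starts with the title and a header line of column names.
        lines = [title, "".join(name.ljust(width) for name in varnames)]
        for row in rows[start:start + rows_per_page]:
            lines.append("".join(str(row[name]).ljust(width) for name in varnames))
        pages.append("\n".join(lines))
    return "\f".join(pages)              # form feed between pages


# Made-up rows, for illustration only.
sample_rows = [
    {"name": "Samuel", "age": 34, "weight": 180.5},
    {"name": "Ann", "age": 51, "weight": 132.0},
    {"name": "Lee", "age": 47, "weight": 160.0},
]
listing = make_listing(sample_rows, ["name", "age", "weight"],
                       "Here is the input dataset")
print(listing.replace("\f", "\n----- page break -----\n"))
```

A later program would have a hard time parsing this output back into data, which is exactly why the listing and the ascii data file are kept as separate things.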

If you are tempted to play around with this thing, and you have a Linux computer, and you haven't updated your Linux distribution for eighteen months, by all means, go to the www.my.opera.com/datahelper site, go to the August 31 blog article, and there you will find a tarball-file to download, called vilnoAUG2006package.tgz . I'm sorry, but the GCC 3.x and GCC 4.x toolchains appear to be binary incompatible, from testing last summer. I will upload a tarball compatible with the newer GCC at a later date.





Here's an interesting idea: suppose one added functionality to import/export data from a variety of different sources (not just ascii data files). And, if you like a user interface, suppose one put a graphical user interface on top of the programming language. The GUI gives ease of use to beginners for very simple data situations, but the programming language gives the power and flexibility to deal with unexpected and messy data situations. That's why a GUI layer on top of a programming language layer is really the best of both worlds, kind of like SPSS. So suppose you added that sort of stuff? What do you get?

ETL software. (Extract, Transform, Load).

The data processing function does the "T" part of the "ETL" , (Transformation).

The convertfileformat statements do the "E" and "L" part, but only with ascii data files.

So the current version of Vilno is very strong on Transformation, but pretty simplistic on Extract and Load.


Vilno has transformation features ( data preparation ) but not yet statistical features (ANOVA, regression).

Well, if you have a Linux computer, you can use Vilno for the preparation of (often messy) data, and use R for statistics after the data is ready. (R uses the S programming language, which is not used that much for data preparation).




Monday, January 29, 2007

Maybe Some Extremely Simple Examples ....


I ran across a discussion at the comp.lang.lisp Usenet group, about comparing Python and Lisp.
It has 1070 responses, which is just amazing. (The thread is "merits of Lisp vs Python", Dec 8 2006. You can use groups.google.com to read Usenet, which is older than the internet we are used to and dates from before web browsers.)

Here's a response from Carl Banks, one of the Lisp programmers joining in:

Mark Tarver wrote:
> This confirms my suspicion
> that Lisp is losing out to newbies because of its
> lack of standard support for the things many people want to do.

Whoa there, that's a pretty big logical jump there, don't you think?
Consumer choice can never be boiled down to one thing; there are so
many factors. No one knows the whole answer. I certainly don't. (If
I did, I'd be courteously turning down the Nobel Prize for Economics on
account of being so rich I really didn't need the extra pocket change.)

I have no doubt that what you say is a contributing factor, but if I
had to guess the main reason why Lisp is losing out to newbies, I'd say
it's first impressions. When newbies see Python they say, "Ok, I can
kind of follow that, it doesn't look too hard to learn." When they see
Lisp they say, "WTF IS THAT???"

It's kind of sad, in a way, that a superficiality would be so crucial.
(Not that I think outward appearance is all superficial--I think humans
have evolved and/or learned to regard as beautiful that which minimizes
effort--but it's not the whole story and not basis for a whole
judgment.)

Carl Banks


***********************************************

OK, so that maybe shows me something I should have noticed earlier. The code examples that I've shown so far are where the data processing function is used to solve a pretty tough data crunching problem, and many data transforms are all thrown into one paragraph of code. The code looks complicated because the problem I chose to solve is complicated. The vast majority of people who see it just don't understand the data crunching problem I chose.

What I really need to do is post some super-simple examples, where you read in the input data, make a very easy modification to the data, write out the output data ....
When a data processing function is used to do a very simple data modification, the paragraph of code should be very small and clear.

Also, I'll need to blog about the directoryref, print, asciitobinary, and binarytoascii statements, all of which are much simpler things than the data processing function....

The directoryref statement is wicked simple, at the top of the program:

directoryref a="/home/tom/mydata" ;

So that when you do :
inlist a/datafile1 ;
You mean : read in the input dataset "/home/tom/mydata/datafile1"

When you do :
sendoff(a/newdatafile) ;
You mean : write out the current dataset to "/home/tom/mydata/newdatafile" , writing a new output dataset.
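Here is a tiny Python sketch of that alias lookup, just to show there is no magic in it. The dictionary directory_refs and the function resolve are hypothetical names of mine, and the sketch ignores any file extension that Vilno may append to the dataset name.

```python
import posixpath

# Stand-in for: directoryref a="/home/tom/mydata" ;
directory_refs = {"a": "/home/tom/mydata"}


def resolve(dataset_ref, refs=directory_refs):
    """Expand an alias/name reference such as "a/datafile1" into a full path."""
    alias, name = dataset_ref.split("/", 1)
    return posixpath.join(refs[alias], name)


print(resolve("a/datafile1"))
print(resolve("a/newdatafile"))
```

The alias is declared once at the top of the program, so moving the data to another directory means changing one line.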

The print statement just does a page-by-page printout (as an ascii file) of a dataset.

The asciitobinary statement creates a binary dataset from an ascii data file (typically comma delimited).

The binarytoascii statement creates an ascii data file from a binary dataset.

These "helper" statements are needed because the data processing function does not (yet) directly read or write ascii data, and the data processing function does not produce print-outs. The data processing function reads in input datasets (that are in the binary data format native to this application), transforms the data (that's where the power of the product is), and writes out the output datasets (also in binary data format).

All of these "helper" statements are wicked easy to explain.

Truth is Vilno is a very easy language to learn. Well, since SPSS, SAS ( SAS datastep ), and Vilno are all in the same language family, they all have a fairly similar learning curve. (You can also compare all three to SQL SELECT, also very easy to learn).

One guy said that Python is both easier to learn and better than Vilno for this sort of data crunching. This is absolute rubbish. It's true that Python is easier to learn than C++ ( I've used both ). But specialized programming languages have an easier learning curve than all-purpose languages: that would imply that all three languages ( SAS, SPSS, Vilno) are easier to learn than Python. That indeed is true. For the same reason SQL is easier to learn than Python.

Python is powerful and all-purpose, so there is a great deal to learn.
People think Python is super-easy to learn because Python is cool and van Rossum is well-respected.

That's an emotional reaction. All purpose programming languages that are powerful are not super-easy to learn.







Tuesday, January 09, 2007

LISP Is For Losers .....

.... It must be, right? If we use the groupthink (herd mentality) reasoning method that most of us use for most of our lives, then languages such as Visual Basic, C, C++, and Java, which have huge market share, must be the best programming languages out there. Since LISP has such a tiny market share, it must be a lousy programming language.
So C++ and Java are for winners, for those who want a good paying job.
LISP, clearly, is for losers.

Just make the same choices that the majority of people make, follow the crowd.
Don't be intellectually curious.
For the job market, this does have benefits.

But groupthink, which is the way we all behave, myself included, can be very dangerous.
Remember the NASDAQ bubble, an entire country of people, all looking stupid at the same time?

The older I get, the more convinced I become that herd-mentality reasoning methods dominate the human mind (myself included). When an entire population chooses mass stupidity, groupthink is why. It's a human weakness. But, in fact, choosing to walk down a path that no one else wishes to tread can carry a heavy price. Since no one travels that path, everyone thinks there's a good reason not to tread it. So they expect anyone walking that path to find only failure.

I could go into the damage that groupthink has caused in fields as diverse as theoretical physics and the development of programming languages. The damage done in the field of economics is particularly huge. But I'm going off topic.

LISP is for losers.
Only a handful of programmers learn it, let alone use it.
People who actually make money coding in LISP, are there such people?

***************************

And yet.

LISP refuses to go away. LISP was invented in 1958. And people still use LISP. All fifteen of them.

LISP is the first of a family of languages - the functional programming languages ( ANSI Common LISP, Scheme ( another LISP dialect ), ML, Haskell). Haskell is notable because the current Perl 6 interpreter is implemented in Haskell, and some believe that rewriting the interpreter in C would be far more difficult.

Hey, Larry Wall, what do you think? I would appreciate your feedback.

So how do LISP and Haskell compare?

Can an expert LISP programmer solve practical problems in one tenth the time a Java programmer requires? Or is learning LISP a total waste of time? I don't know. Can functional programming languages be used to create an explosion of productivity in the domain of statistical programming, data preparation, and data crunching? Or is that just a pipe-dream?

**********************

A LISP program is a list. A data structure is a list. The output of a LISP program is also a list. So a LISP program can produce LISP programs. Should we care? I don't know.

But I am going to learn LISP.

At this very early stage, I can see some difficulties that can make LISP code harder to maintain.

First, all those parentheses, scattered all over the place.

But there's more:
A programmer must be careful about which lists in his program are transformed ( or evaluated ) and which lists are left untouched. That's a code maintenance cost.
If you write a list in your program, it is typically, by default, evaluated. So any list is automatically transformed into another list ( or a number or string ).
So (+ 3 5) becomes 8.
If you want the list of items to remain untouched, you need to do:
( quote (+ 3 5) )
or
'(+ 3 5)

The quote keyword (with extra parentheses to boot) can quickly make your code very verbose. So the quote punctuation character, ' , is used. But then the quote punctuation character can take on a role even worse than the dreaded semicolon of legend. You have to keep track of all the quotes in your code. A single extra or missing quote character can completely mess up the entire program.
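To see the evaluate-versus-quote distinction in action without a LISP system at hand, here is a toy evaluator in Python. It is a deliberately tiny sketch of my own, not a real LISP: lists stand in for LISP lists, and only "+" and "quote" are understood.

```python
def evaluate(expr):
    """Evaluate a toy LISP-like expression written as nested Python lists."""
    if not isinstance(expr, list):
        return expr                      # numbers (and other atoms) are returned as-is
    head, *args = expr
    if head == "quote":
        return args[0]                   # hand the list back untouched
    if head == "+":
        return sum(evaluate(arg) for arg in args)
    raise ValueError(f"unknown operator: {head!r}")


evaluated = evaluate(["+", 3, 5])               # like (+ 3 5)
quoted = evaluate(["quote", ["+", 3, 5]])       # like (quote (+ 3 5)) or '(+ 3 5)

print(evaluated)
print(quoted)
```

Drop the "quote" and the list becomes a number; add one where it doesn't belong and code that was supposed to run turns into inert data. That is the bookkeeping cost described above.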

Do functional programming languages provide huge gains in productivity relative to more conventional programming languages, and if so, for what problem domains? Or are such claims just rubbish?





Friday, November 24, 2006

Example of Vilno, with ages and weights

inlist b/data1 b/data2 ;
mergeby patid ;
addgridvars str: agegrp 1 ;
sendoff(b/showdat1) patid age weight agegrp ;
if (age<50) agegrp = "A" ;
else agegrp = "B" ;
sendoff(b/showdat2) patid age weight agegrp ;
select pat_wgt = avg(weight) by agegrp patid ;
sendoff(b/showdat3) agegrp patid pat_wgt ;
if (patid==4) pat_wgt=2.2*pat_wgt ;
sendoff(b/showdat4) agegrp patid pat_wgt;
select avg_weight = avg(pat_wgt) by agegrp ;
sendoff(b/resultdata) agegrp avg_weight;
turnoff ;

So that's an example of a data processing function in Vilno,
that "cleans" and modifies the data, merges two data sources ,
and calculates average weight for two age groups.

The intermediate calculations are output to datasets, which can then be
printed. If that is not needed, all but the last SENDOFF statement can be
left out.

One or more patients accidentally had two or more weights measured (this wasn't
supposed to happen). Well, OK, for those patients, create a single row with the average of the
available weight values. That's what the first SELECT statement is for.

As you might guess, patient #4 had weight measured in kilograms (and everyone else, in
pounds). So convert patient #4's weight to pounds.

The last SELECT statement creates a dataset with two rows, one for the age group 0-49, and one
for the age group 50-100. The average weight for each age group is calculated.
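For comparison, here are the same steps sketched in plain Python. The two input datasets are invented for illustration, and I am assuming the merge keeps one row per weight measurement, matched to the patient's age by patid. Treat this as a sketch of the logic, not a statement of how Vilno executes it.

```python
from collections import defaultdict
from statistics import mean

# Made-up stand-ins for b/data1 (ages) and b/data2 (weights).
data1 = [{"patid": 1, "age": 42}, {"patid": 2, "age": 63},
         {"patid": 3, "age": 47}, {"patid": 4, "age": 70}]
data2 = [{"patid": 1, "weight": 150.0}, {"patid": 1, "weight": 154.0},
         {"patid": 2, "weight": 180.0}, {"patid": 3, "weight": 200.0},
         {"patid": 4, "weight": 80.0}]          # patient 4 is in kilograms

# mergeby patid, then assign the age group.
ages = {row["patid"]: row["age"] for row in data1}
merged = [{"patid": row["patid"], "age": ages[row["patid"]],
           "weight": row["weight"],
           "agegrp": "A" if ages[row["patid"]] < 50 else "B"}
          for row in data2 if row["patid"] in ages]

# First select: one average weight per patient (handles duplicate weights).
weights = defaultdict(list)
for row in merged:
    weights[(row["agegrp"], row["patid"])].append(row["weight"])

# Convert patient 4 from kilograms to pounds.
pat_wgt = {key: mean(values) * (2.2 if key[1] == 4 else 1.0)
           for key, values in weights.items()}

# Last select: average of the per-patient weights within each age group.
groups = defaultdict(list)
for (agegrp, patid), weight in pat_wgt.items():
    groups[agegrp].append(weight)
avg_weight = {agegrp: mean(values) for agegrp, values in groups.items()}

for agegrp in sorted(avg_weight):
    print(agegrp, round(avg_weight[agegrp], 1))
```

Averaging per patient first matters: patient 1 has two measurements, and without that step he would count twice in the group average.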


Sunday, November 19, 2006

Change Baseline, 6 lines in Vilno, 23 lines in SAS


CHANGE FROM (MESSY) BASELINE
6 LINES IN VILNO
VS
23 LINES IN SAS

Calculate change from baseline , for visits 1, 2, 3 , etc, where the
baseline values are messy (multiple dates, missing values, even 2 or 3
baseline values on the same date for a few patients)

Discard baseline dates that only have missing values.
Use the value from the most recent baseline date, or the average value
if there is more than one row for the most recent baseline date.

The gridfunc transform here uses the composite where clause, which is not in
version 0.85 but will be in a later version (perhaps a month of coding work). If
later versions of Vilno allow implicit variable declaration when parsing the classical
transform, then the addgridvars statement becomes unnecessary (making it 5 lines instead
of 6). (However, you can still use the screen, recode, and addgridvars statements simply as
a matter of style, to keep track of all the variables you expect to be there.) I do not include
"turnoff;", which marks the end of one data processing function and the beginning of a later paragraph of
code, because it's a simple matter to upgrade that out in the next version.

6 lines in Vilno (counting the gridfunc statement as 2 lines)

inlist labdata ;
addgridvars float: change ;
gridfunc baseval=avg(value) by labtest patid
where (visit==-1 and value is not null) and highest date ;
change = value - baseval ;
sendoff(labdata2) labtest patid visit date value change baseval ;


----------------------------------------


23 lines in SAS:

proc sort data=labdata ;
by labtest patid visit date ;

data base1 ;
set labdata ;
where visit=-1 and value ne . ;

data bestdate1 ;
set base1 ;
by labtest patid ;
if last.patid ;
rename date=recentdate ;
keep labtest patid recentdate ;

data base2 ;
merge base1 bestdate1 ;
by labtest patid ;
if date=recentdate ;

proc means data=base2 ;
by labtest patid ;
var value ;
output out=base3 mean=meanbase;

data labdata2 ;
merge labdata base3 ;
by labtest patid ;
change = value - meanbase ;
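For readers who know neither Vilno nor SAS, here is the same change-from-baseline rule written out as a Python sketch. The function name, the sample rows, and the use of ISO date strings are my own assumptions. Rows with no usable baseline get a change of None, which stands in for a missing value.

```python
from collections import defaultdict
from statistics import mean


def add_change_from_baseline(rows):
    """Attach baseval and change to every row, per (labtest, patid)."""
    # Keep only baseline rows (visit -1) that actually have a value.
    baseline = defaultdict(list)
    for row in rows:
        if row["visit"] == -1 and row["value"] is not None:
            baseline[(row["labtest"], row["patid"])].append(row)

    # Use the most recent baseline date; average if that date has several rows.
    baseval = {}
    for key, candidates in baseline.items():
        latest = max(row["date"] for row in candidates)
        baseval[key] = mean(row["value"] for row in candidates
                            if row["date"] == latest)

    output = []
    for row in rows:
        base = baseval.get((row["labtest"], row["patid"]))
        if base is None or row["value"] is None:
            change = None
        else:
            change = row["value"] - base
        output.append({**row, "baseval": base, "change": change})
    return output


# Made-up messy lab data: a missing baseline value and two values on one date.
labdata = [
    {"labtest": "ALT", "patid": 1, "visit": -1, "date": "2006-01-03", "value": 30.0},
    {"labtest": "ALT", "patid": 1, "visit": -1, "date": "2006-01-05", "value": None},
    {"labtest": "ALT", "patid": 1, "visit": -1, "date": "2006-01-04", "value": 40.0},
    {"labtest": "ALT", "patid": 1, "visit": -1, "date": "2006-01-04", "value": 44.0},
    {"labtest": "ALT", "patid": 1, "visit": 1, "date": "2006-02-01", "value": 50.0},
]
labdata2 = add_change_from_baseline(labdata)
for row in labdata2:
    print(row["visit"], row["date"], row["value"], row["baseval"], row["change"])
```

The January 5 row is discarded because its only value is missing, so the most recent usable baseline date is January 4, and its two values are averaged.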





