
Statistical Reporting Service, and Economic Research Service of the Department of Agriculture—account for about 60 percent of a total Federal statistical budget which is currently on the order of $125 million a year. A decade ago, the largest four agencies accounted for 71 percent of a much smaller budget. By 1970 the total statistical budget of the Federal Government will probably exceed $200 million and, unless we do something about it, decentralization will increase further. Yet it has already been clear for some time that the Federal statistical system is too decentralized to function effectively and efficiently.

The committee proposed a National Data Center as a way to deal with the problem of effective use of available information. This proposal came under the scrutiny of the Subcommittee on the Invasion of Privacy of the Government Operations Committee of the House in a series of hearings in the summer of 1966. The hearings in turn generated a great deal of press comment. These hearings and the press comment together raised the question as to whether the proposed Center was a threat to personal privacy, and might even lead to a greatly increased intrusion of government into the life of the individual. There is no question that a large-scale centralized data system which had no inhibitions on the information which it collected and no restraint on what it made public or how it made information available to other parts of the Government might indeed constitute a serious threat to privacy and liberty.

The crucial questions, of course, are what information would be put into the data center, and how access to it would be controlled. In the words of the task force report, the "Center would assemble in a single facility all large-scale systematic bodies of demographic, economic and social data generated by the present data-collection or administrative processes of the Federal Government . . . integrate the data to the maximum feasible extent, and in such a way as to preserve as much as possible of the original information content of the whole body of records, and provide ready access to the information, within the laws governing disclosure, to all users in the Government, and where appropriate to qualified users outside the Government on suitably compensatory terms." (Report, pp. 17-18.)

The phrase "large-scale systematic bodies of demographic, economic, and social data" translates, in more concrete terms, into the existing bodies of data collected by Census, the Bureau of Labor Statistics, the Department of Agriculture, the National Center for Health Statistics, the Office of Education, and so on. It also includes the large bodies of data generated as a byproduct of the administration of the Federal income tax and social security systems. It does not include police dossiers from the FBI, personnel records of the Civil Service Commission or the individual Government agencies or personnel records of the armed services, and other dossier information, none of which fits what is meant by the phrase "large-scale, systematic bodies of social, economic, and demographic data."

For the data center to achieve its intended purposes, the material in it must identify individual respondents in some way: by social security number, for individuals, or an analogous code number, now used within the Census for business enterprises, called the Alpha number. Without such identification, the center cannot meet its prime purpose of integrating the data collected by various agencies into a single consistent body. Whether these social security or Alpha numbers need in turn be keyed to a list of respondents which identifies them by name and address within the data center itself, or whether that need be done only within the actual data collecting agencies, is a technical detail. That it must be done someplace is perfectly clear, as it now is within the several agencies that collect the information.

On the other hand, it is not, in general, necessary that the files in the data center contain a complete replica of every file on every respondent who has provided information to any of the original collectors. In many cases—for example, the social security files—a properly designed sample would serve the same purposes more economically. To this extent, then, the data center will not contain a file on every individual, every household, every business, et cetera, but a mixture of a collection of samples—some of them relatively large—and complete files of some groups of reporting units which are particularly interesting and important from an analytical point of view. But here again, the significance of the difference between reproducing for the data center a complete file which already exists in some other agency, and reproducing only a sample therefrom, can easily be overemphasized.

The content of information now in the inventory of government agencies is controlled ultimately by the Congress, operating through the appropriations process; and more immediately by the separate bureaucratic hierarchies of each data collecting agency, subject to the overall review of the Director of the Budget. He has a specific statutory responsibility for reviewing all governmental questionnaires directed to the public, with a view to eliminating duplication and keeping the total burden on respondents at a reasonable level. If this process seems to be working ineffectively, in the sense of ignoring persistent complaints, then the Appropriations Subcommittees that deal with the budget requests of each data-collecting agency are readily able to exercise a further control. In practice, the existence of this restraint operates to reinforce powerfully the caution of the collecting agencies in expanding their requests.

A new data center would operate within the same framework of controls. Indeed, the Congress, in authorizing its creation, should define the kind of information which it would assemble, and could follow the line of demarcation of large-scale systematic demographic, economic, and social statistics suggested above. The inclusion of dossier information could be specifically prohibited. A clear distinction between "a dossier" and "a statistical data file" on an individual can be made in principle; namely, for a dossier, the specific identity of the individual is central to its purpose, while for a file of data it is merely a technical convenience for assembling in the same file the connected set of characteristics which are the object of information. The purpose of the one is the assembly of information about specific people; the purpose of the other, the assembly of statistical frequency distributions of the many characteristics which groups of individuals (or households, business enterprises, or other reporting units) share. In practice, of course, this distinction is not self-applying, and administrators and bureaucrats, checked and overseen by politicians, have to apply it. But so is it ever.

The present law and practice governing the use of census data offer a model which could well be applied to the new data center. The law provides that information contained in an individual census return may not be disclosed either to the general public or to other agencies of the Government, nor may such information be used for law-enforcement, regulatory, or tax-collection activity in respect to any individual respondent. This statutory restriction has been effectively enforced, and the Census Bureau has maintained for years the confidence of respondents in its will and ability to protect the information they give to it. The same statutory restraint could and should be extended to the data center, and the same results could be expected of it. The data center would supply to all users, inside and outside the Government, frequency distributions, summaries, analyses, but never data on individuals or other single reporting units. The technology of machine storage and processing would make it possible for these outputs to be tailored closely to the needs of individual users without great expense and without disclosure of individual data. This is just what is not possible under our decentralized system.

In my statement I talk about the question of cracking the system, penetrating it, and I think I will skip that, Senator, if I may, and say it is a technical question, and the technicians have told me that it can be handled.

Senator LONG. Doctor, we are somewhat concerned about the technical questions involved in this, and I would like to have your comment on that.

Dr. KAYSEN. I will be glad to comment. Perhaps, instead of reading the paragraph, I will answer a question.

Bearing all this in mind, I conclude that the risky potentials which might be inherent in a data center are so unlikely to materialize if faced beforehand, in the design and administration of the center, that they are outweighed, on balance, by the real improvement in understanding of our economic and social processes this enterprise would make possible, with all the concomitant gains in intelligent and effective public policy that such understanding could lead to. Thank you.

(The prepared statement of Dr. Kaysen follows:)

PREPARED STATEMENT OF DR. CARL KAYSEN, BEFORE THE SUBCOMMITTEE ON ADMINISTRATIVE PRACTICE AND PROCEDURE, OF THE U.S. SENATE COMMITTEE ON THE JUDICIARY, MARCH 14, 1967

My name is Carl Kaysen; I live at 97 Olden Lane, Princeton, New Jersey. I am Director of the Institute for Advanced Study in Princeton. By profession I am an economist, and it is in this capacity that I undertook the responsibility of being Chairman of the Task Force on Storage Of and Access To Government Statistics, which reported to the Director of the Budget. At the time I did so, last year, I was Littauer Professor of Political Economy and Associate Dean of the Graduate School of Public Administration at Harvard University.

The purpose of the Task Force was to examine a problem in government organization and operation which the members of the Committee thought was of importance to the government and to the public, looking at the problem from a perspective which most of us shared as users of government statistics. As economists we are aware that both the intellectual development of economics and its practical success have depended greatly on the large body of quantitative information on the whole range of economic activity that is publicly available in modern, democratic states. Much of this material is the by-product of regulatory, administrative, and revenue-raising activities of government, and its public availability reflects the democratic ethos. In the United States there is a central core of demographic, economic, and social information that is collected, organized and published by the Census Bureau in response to both governmental and public demands for information, rather than simply as the reflex of other governmental activities. Over time, and especially in the last three or four decades, there has been a continuing improvement in the coverage, consistency, and quality of these data that has in great part resulted from the continuing efforts of social scientists and statisticians both within and without the government. Without these improvements in the stock of basic quantitative information, our recent success in the application of sophisticated economic analyses to problems of public policy would have been impossible. We were moved by professional concern for the quality and usability of the enormous body of government data to take on what we thought to be a necessary, important, and totally unglamorous task.

The central problem which the Task Force addressed was the consequences of the trend toward increasing decentralization in the Federal Statistical System at a time when the demand for more and more detailed quantitative information was growing rapidly. Currently, twenty-one agencies of government have significant statistical programs. The largest four of these—the Census, the Bureau of Labor Statistics, the Statistical Reporting Service, and Economic Research Service of the Department of Agriculture—account for about 60% of a total Federal statistical budget of nearly $125 million. A decade ago, the largest four agencies accounted for 71% of a much smaller budget. By 1970, the total statistical budget of the Federal Government will probably exceed $200 million and, in the absence of deliberate countervailing effort, decentralization will have further increased. Yet, it had already been clear for some time that the Federal statistical system was too decentralized to function effectively and efficiently.

The Committee proposed a National Data Center as a way to deal with the problem of effective use of available information. This proposal came under the scrutiny of the Subcommittee on the Invasion of Privacy of the Government Operations Committee of the House in a series of hearings in the summer of 1966. The hearings in turn generated a great deal of press comment. These hearings and the press comment together raised the question as to whether the proposed Center was a threat to personal privacy, and might even lead to a greatly increased intrusion of government into the life of the individual. There is no question that a large-scale centralized data system which had no inhibitions on the information which it collected and no restraint on what it made public or how it made information available to other parts of the government might indeed constitute a serious threat to privacy and liberty.

The crucial questions, of course, are what information would be put into the data center, and how access to it would be controlled. In the words of the Task Force Report, the "Center would assemble in a single facility all large-scale systematic bodies of demographic, economic and social data generated by the present data-collection or administrative processes of the Federal Government, . . . integrate the data to the maximum feasible extent, and in such a way as to preserve as much as possible of the original information content of the whole body of records, and provide ready access to the information, within the laws governing disclosure, to all users in the Government, and where appropriate to qualified users outside the Government on suitably compensatory terms." (Report, pp. 17-18)

The phrase "large-scale systematic bodies of demographic, economic and social data" translates, in more concrete terms, into the existing bodies of data collected by Census, the Bureau of Labor Statistics, the Department of Agriculture, the National Center for Health Statistics, the Office of Education, and so on. It also includes the large bodies of data generated as a by-product of the administration of the Federal income tax and Social Security systems. It does not include police dossiers from the FBI, personnel records of the Civil Service Commission or the individual government agencies, or personnel records of the armed services, and other dossier information, none of which fits what is meant by the phrase "large-scale, systematic bodies of social, economic, and demographic data."

For the data center to achieve its intended purposes, the material in it must identify individual respondents in some way: by Social Security number, for individuals, or an analogous code number, now used within the Census for business enterprises, called the Alpha number. Without such identification, the Center cannot meet its prime purpose of integrating the data collected by various agencies into a single consistent body. Whether these Social Security or Alpha numbers need in turn to be keyed to a list of respondents which identifies them by name and address within the data center itself, or whether that need be done only within the actual data collecting agencies, is a technical detail. That it must be done someplace is perfectly clear, as it now is within the several agencies that collect the information.

On the other hand, it is not, in general, necessary that the files in the data center contain a complete replica of every file on every respondent who has provided information to any of the original collectors. In many cases—e.g., the Social Security files—a properly designed sample would serve the same purposes more economically. To this extent, then, the data center will not contain a file on every individual, every household, every business, etc., but a mixture of a collection of samples—some of them relatively large—and complete files of some groups of reporting units which are particularly interesting and important from an analytical point of view. But here again, the significance of the difference between reproducing for the data center a complete file which already exists in some other agency, and reproducing only a sample therefrom, can easily be overemphasized.

The content of information now in the inventory of government agencies is controlled ultimately by the Congress, operating through the appropriations process; and more immediately by the separate bureaucratic hierarchies of each data collecting agency, subject to the overall review of the Director of the Budget. He has a specific statutory responsibility for reviewing all governmental questionnaires directed to the public, with a view to eliminating duplication and keeping the total burden on respondents at a reasonable level. If this process seems to be working ineffectively, in the sense of ignoring persistent complaints, then the Appropriations Subcommittees that deal with the budget requests of each data-collecting agency are readily able to exercise a further control. In practice, the existence of this restraint operates to reinforce powerfully the caution of the collecting agencies in expanding their requests.

A new data center would operate within the same framework of controls. Indeed, the Congress, in authorizing its creation, should define the kind of information which it would assemble, and could follow the line of demarcation of large-scale systematic demographic, economic, and social statistics suggested above. The inclusion of dossier information could be specifically prohibited. A clear distinction between "a dossier" and "a statistical data file" on an individual can be made in principle; namely, for a dossier, the specific identity of the individual is central to its purpose, while for a file of data it is merely a technical convenience for assembling in the same file the connected set of characteristics which are the object of information. The purpose of the one is the assembly of information about specific people; the purpose of the other, the assembly of statistical frequency distributions of the many characteristics which groups of individuals (or households, business enterprises, or other reporting units) share. In practice, of course, this distinction is not self-applying, and administrators and bureaucrats, checked and overseen by politicians, have to apply it. But so is it ever.

The present law and practice governing the use of Census data offer a model which could well be applied to the new data center. The law provides that information contained in an individual Census return may not be disclosed either to the general public or to other agencies of the government, nor may such information be used for law-enforcement, regulatory, or tax-collection activity in respect to any individual respondent. This statutory restriction has been effectively enforced, and the Census Bureau has maintained for years the confidence of respondents in its will and ability to protect the information they give to it. The same statutory restraint could and should be extended to the data center, and the same results could be expected of it. The data center would supply to all users, inside and outside the government, frequency distributions, summaries, analyses, but never data on individuals or other single reporting units. The technology of machine storage and processing would make it possible for these outputs to be tailored closely to the needs of individual users without great expense and without disclosure of individual data. This is just what is not possible under our present system.

It has been argued that the great richness of the data files in the proposed center would in itself create a problem. The temptation to "crack" the files by illegitimate means would be strong. So would the temptation to abuse the information by those government servants whose positions give them legitimate access to the data. It is clearly the case that centralized storage in machine
