Good morning everyone !
I have a problem of calculating the decile by group.
Here is the datset that I use:
I want to collapse the education level and other demographics, by whether be promoted or not.
but now the data stucture is like below so I need your kindly help with how to achieve the goal :
id | time | edu | occupation |
1 | 1 | 3 | 1 |
1 | 2 | 3 | 1 |
1 | 3 | 3 | 2 |
1 | 4 | 3 | 2 |
2 | 1 | 5 | 2 |
2 | 2 | 5 | 2 |
2 | 3 | 5 | 4 |
3 | 1 | 4 | 2 |
3 | 2 | 4 | 2 |
3 | 3 | 4 | 2 |
for individual 1 and 2, because they are both promoted (id1: from occupation 1 to 2; id2: from occupation 2 to 4) during the sample period, so they are catogorized as "promoted" group, and 3 is not promoted during this time, so it is catogorized as "non-promoted" group.
Now I'd like to know the mean education level for both groups. So the mean edu is (3*4+5*3)/7 for "promoted" and (2*3)/3 for "non-promoted" group.
It would be very nice if you could tell me how to write the code !!
Thanks ^_^
I will make the assumption that "promoted" correspond in your data to going from a lower numbered occupation to a higher numbered one. At least that much is true for the examples you describe. If it is not true generally, you need to explain in detail how to know whether an occupation change is a promotion or not.
The following code, I believe, does what you want:
* Example generated by -dataex-. For more info, type help dataex clear input byte(id time edu occupation) 1 1 3 1 1 2 3 1 1 3 3 2 1 4 3 2 2 1 5 2 2 2 5 2 2 3 5 4 3 1 4 2 3 2 4 2 3 3 4 2 end by id (occupation), sort: gen byte promoted = occupation[_N] > occupation[1] by promoted, sort: egen mean_group_education = mean(edu) isid id time, sort
Note: the reults do not agree with your calculation for the non-promoted group in the example. The "non-promoted group" here consists only of id 3, and id3 has three observations, each with edu = 4, so the mean education for the group is (3*4)/4, not (3*2)/2. Did you make a mistake, or do I misunderstand the problem?
In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
I will make the assumption that "promoted" correspond in your data to going from a lower numbered occupation to a higher numbered one. At least that much is true for the examples you describe. If it is not true generally, you need to explain in detail how to know whether an occupation change is a promotion or not.
The following code, I believe, does what you want:
* Example generated by -dataex-. For more info, type help dataex clear input byte(id time edu occupation) 1 1 3 1 1 2 3 1 1 3 3 2 1 4 3 2 2 1 5 2 2 2 5 2 2 3 5 4 3 1 4 2 3 2 4 2 3 3 4 2 end by id (occupation), sort: gen byte promoted = occupation[_N] > occupation[1] by promoted, sort: egen mean_group_education = mean(edu) isid id time, sort
Note: the reults do not agree with your calculation for the non-promoted group in the example. The "non-promoted group" here consists only of id 3, and id3 has three observations, each with edu = 4, so the mean education for the group is (3*4)/4, not (3*2)/2. Did you make a mistake, or do I misunderstand the problem?
In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Thanks a lot, that really helps!!
Actually I made a mistake , it's (3*4) not (3*2).
Here I've got a further problem :
I would like to compare different kinds of promotion, like from 1 to 2 , from 1 to 3 ,from 2 to 3 from 2 to 4 ,etc. So that I could compare different pairs' education level like :
for those once in occupation2 ,non-promoted VS promoted
for those once in occupation2 , non-promoted VS promoted to 3 VS promoted to 4.
* Example generated by -dataex-. For more info, type help dataex clear input float(id time edu occupation) 1 1 3 1 1 2 3 1 1 3 3 2 1 4 3 2 2 1 5 2 2 2 5 2 2 3 5 4 3 2 4 2 3 3 4 2 3 4 4 2 4 1 2 2 4 2 2 2 4 3 2 3 4 4 2 3 5 1 3 1 5 2 3 1 6 3 2 1 6 4 2 1 6 5 2 3 end
Would you like to tell me how should I write the code?
Thanks again!!
* Example generated by -dataex-. For more info, type help dataex clear input float(id time edu occupation) 1 1 3 1 1 2 3 1 1 3 3 2 1 4 3 2 2 1 5 2 2 2 5 2 2 3 5 4 3 2 4 2 3 3 4 2 3 4 4 2 4 1 2 2 4 2 2 2 4 3 2 3 4 4 2 3 5 1 3 1 5 2 3 1 6 3 2 1 6 4 2 1 6 5 2 3 end by id (occupation), sort: gen first_occ = occupation[1] by id (occupation): gen last_occ = occupation[_N] by first_occ last_occ, sort: egen pair_mean_education = mean(edu) isid id time, sort
Note: in your example data, every id has either just one occupation, or just two. In real data, I can imagine that there are people with 3 or more different occupations represented. This code groups those people by their lowest level occupation and their highest. For that matter, if there is somebody in the data who goes from a higher level occupation to a lower one (demoted), that person is classified the same as a person who was promoted from the lower one to the higher one. And there might be somebody who is initially promoted but then demoted back, or other complicated situations may arise. All of these situations are handled by this code as if they were a single promotion from the lowest level occupation to the highest level occupation. If that is not how you want to handle these situations, post back with an explanation of how you do.
Last edited by Clyde Schechter; 19 Dec 2021, 20:59 .* Example generated by -dataex-. For more info, type help dataex clear input float(id time edu occupation) 1 1 3 1 1 2 3 1 1 3 3 2 1 4 3 2 2 1 5 2 2 2 5 2 2 3 5 4 3 2 4 2 3 3 4 2 3 4 4 2 4 1 2 2 4 2 2 2 4 3 2 3 4 4 2 3 5 1 3 1 5 2 3 1 6 3 2 1 6 4 2 1 6 5 2 3 end by id (occupation), sort: gen first_occ = occupation[1] by id (occupation): gen last_occ = occupation[_N] by first_occ last_occ, sort: egen pair_mean_education = mean(edu) isid id time, sort
Note: in your example data, every id has either just one occupation, or just two. In real data, I can imagine that there are people with 3 or more different occupations represented. This code groups those people by their lowest level occupation and their highest. For that matter, if there is somebody in the data who goes from a higher level occupation to a lower one (demoted), that person is classified the same as a person who was promoted from the lower one to the higher one. And there might be somebody who is initially promoted but then demoted back, or other complicated situations may arise. All of these situations are handled by this code as if they were a single promotion from the lowest level occupation to the highest level occupation. If that is not how you want to handle these situations, post back with an explanation of how you do.
Thanks, Clyde! It's really helpful !!
Yes, just as you said, there are more complicated situations, like for one person, he may have 3 or more occupations . And demotion may also occur.
So I want to specify these different situations, like for id7, I would like to get the mean education of him when promoted from 1 to 3 (together with id 6 because they can be seen as the same promotion group: from 1 to 3 ), so to compare with non-promoted 1. And get the mean education of him when promoted from 3 to 4 , so to compare with non-promoted 3.
* Example generated by -dataex-. For more info, type help dataex clear input float(id time edu occupation) 1 1 3 1 1 2 3 1 1 3 3 2 1 4 3 2 2 1 5 2 2 2 5 2 2 3 5 4 3 2 4 2 3 3 4 2 3 4 4 2 4 1 2 2 4 2 2 2 4 3 2 3 4 4 2 3 5 1 3 1 5 2 3 1 6 3 2 1 6 4 2 1 6 5 2 3 7 1 4 1 7 2 4 1 7 3 4 3 7 4 4 3 7 5 4 3 7 6 4 4 7 7 4 4 7 8 4 3 7 9 4 3 end
So this is actually pretty complicated.
by id (time), sort: gen spell = sum(occupation != occupation[_n-1]) by id spell (time), sort: egen nobs_now = count(time) by id spell (time): egen total_ed_now = total(edu) // ADJOIN STATISTICS FROM NEXT OCCUPATION TO PRESENT ONE rangestat (first) next_occupation = occupation /// (count) nobs_next = time (sum) total_ed_next = edu, by(id) interval(spell 1 1) // CREATE A SEPARATE FRAME FOR AGGREGATING OVER PAIRS OF OCCUPATION & NEXT OCCUPATION frame put id *occupation spell nobs_* total_ed_*, into(totals) frame change totals duplicates drop keep if next_occupation > occupation & !missing(next_occupation) // PROMOTIONS ONLY // AGGREGATE UP TO THE OCCUPATION: NEXT OCCUPATION PAIR LEVEL collapse (sum) nobs_* total_ed_*, by(occupation next_occupation) gen wanted = (total_ed_now + total_ed_next)/(nobs_now + nobs_next) keep occupation next_occupation wanted // RESULTS ARE HERE // ATTACH THE RESULTS TO THE ORIGINAL DATA SET frame change default frlink m:1 occupation next_occupation, frame(totals) frget wanted, from(totals)
If you just want a data set showing all combinations of an occupation and following occupation that constitute a promotion, along with the mean education (during both occupations) of those people, you will find it in frame totals at the end of the code. The mean education is in the variable called wanted. (But the code switches back to frame default, so to see it you have to change to frame totals again.) What you will find in frame default is the original data set, to which several variables have been added, including the variable wanted, which has been adjoined only to the observations of the starting job of any promotion.
Note: this code calls on the program -rangestat-, written by Robert Picard, Nick Cox, and Roberto Ferrer, available from SSC. It is possible to revise the code to use only native Stata commands, but I think the code is pretty opaque as it is, and using -rangestat- instead of a series of other commands makes it a bit less so.